A Clinical Terminology Annotation Dashboard
(Supports LOINC®, SNOMED CT, ICD-10-CM, OMOP v5)
AnnoDash is a deployable clinical terminology annotation dashboard developed primarily in Python using Plotly Dash. It allows users to annotate medical concepts on a straightforward interface supported by visualizations of associated patient data and natural language processing.
The dashboard seeks to provide a flexible and customizable solution for clinical annotation. Recent large language models (LLMs) are supported to aid the annotation process. Additional extensions, such as machine learning-powered plugins and search algorithms, can be easily added by technical experts.
A demo with `chartevents` & `d_items` from the MIMIC-IV v2.2 `icu` module is available under releases.
Previously featured on Plotly & Dash 500!
The top left section of the dashboard features a dropdown to keep track of target concepts the user wishes to annotate. The target vocabulary is also selected in a dropdown in this section. The top right module contains the data visualization component. The bottom half of the dashboard includes modules dedicated to querying and displaying candidate ontology codes.
The dashboard is supported by visualization of relevant patient data. For any given target concept, patient observations are queried from the source data. The *Distribution Overview* tab contains a distribution summarizing all patient observations. *Sample Records* selects the top 5 patients (as ranked by most observations) and displays their records over a 96-hour window. Both numerical and text data are supported. The format of the source data is detailed below in Usage.
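As a rough illustration of the *Sample Records* selection described above, picking the top-n patients and trimming their records to a fixed window could look like the following stdlib sketch. The function name and the choice to anchor each window at a patient's first observation are assumptions, not AnnoDash's actual implementation:

```python
from collections import Counter
from datetime import datetime, timedelta

def top_patient_windows(rows, n=5, hours=96):
    """Pick the n patients with the most observations and trim each
    patient's records to a window starting at their first charttime.
    `rows` are dicts with subject_id/charttime keys, matching the
    source-data format described under Usage."""
    counts = Counter(r["subject_id"] for r in rows)
    top = [sid for sid, _ in counts.most_common(n)]
    out = {}
    for sid in top:
        recs = sorted((r for r in rows if r["subject_id"] == sid),
                      key=lambda r: r["charttime"])  # ISO strings sort correctly
        start = datetime.fromisoformat(recs[0]["charttime"])
        end = start + timedelta(hours=hours)
        out[sid] = [r for r in recs
                    if datetime.fromisoformat(r["charttime"]) <= end]
    return out
```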
The user annotates target concepts by first selecting the to-be-annotated item in the first dropdown. The following dropdown allows users to select the target ontology. Several default vocabularies are available, but users are free to modify the dashboard for additional ontology support via scripts detailed in Other Relevant Files. Code suggestions are then generated in the bottom table. Users select their target annotation by clicking, and the appropriate data is saved to `.json` files after submission.
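The README does not specify the schema of the saved `.json` files; the sketch below shows what one saved annotation record might look like. The field names are illustrative assumptions, not AnnoDash's actual output format:

```python
import json
from datetime import datetime, timezone

def save_annotation(itemid, label, code, ontology, path):
    """Write one annotation to a JSON file. Field names here are
    illustrative assumptions, not AnnoDash's actual schema."""
    record = {
        "itemid": itemid,            # source concept id
        "label": label,              # source concept label
        "annotated_code": code,      # selected ontology code
        "ontology": ontology,        # e.g. "LOINC", "SNOMED CT"
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return record
```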
The dashboard automatically generates ontology code suggestions based on the target concept. A string search supported by PyLucene and the Porter stemming algorithm sorts results by relevance, as indicated by the colour of the circle icons. Several other string search methods are available, such as full-text search using SQLite3's FTS5 or ElasticSearch, vector search using TF-IDF, and similarity scoring using Jaro-Winkler/fuzzy partial ratios. The NLM UMLS API is also available for the SNOMED CT ontology.
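To illustrate the similarity-scoring idea, here is a minimal stdlib sketch using `difflib.SequenceMatcher` as a stand-in for the fuzzy-ratio scoring mentioned above; the actual dashboard uses PyLucene and the other backends listed:

```python
from difflib import SequenceMatcher

def rank_candidates(query, candidates):
    """Rank candidate ontology labels by string similarity to the query.
    SequenceMatcher.ratio() is only a stdlib stand-in for the
    Jaro-Winkler / fuzzy partial-ratio scoring the dashboard supports."""
    scored = [(SequenceMatcher(None, query.lower(), c.lower()).ratio(), c)
              for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)]
```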
After searching, the dashboard is able to re-rank ontology codes using LLMs. Currently, OpenAI's GPT-3.5 API and CohereAI's re-ranking API endpoint are supported by default. LLM re-ranking is disabled by default; if enabled, API keys are required and associated usage costs apply.
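The disabled-by-default behaviour could be wired up roughly as below: re-ranking only activates when an API key is present in the environment, otherwise the search order passes through unchanged. The environment-variable name is an assumption, and the actual LLM call is deliberately left out rather than inventing an API shape:

```python
import os

def get_reranker():
    """Return a re-ranking callable only when an API key is configured;
    otherwise fall back to a no-op passthrough (re-ranking disabled by
    default, as in the dashboard). The env-var name is an assumption."""
    if not os.environ.get("OPENAI_API_KEY"):
        return lambda query, candidates: candidates  # passthrough
    # With a key present, a real implementation would call the chosen
    # LLM re-ranking endpoint here; omitted to avoid inventing an API.
    raise NotImplementedError("wire up the LLM endpoint of your choice")
```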
Below are steps to download, install, and run the dashboard locally. Leave all configuration fields unchanged to run the demo using MIMIC-IV data.
The dashboard requires the following major Python packages to run:
All other packages are listed in `requirements.txt`.
Additionally, the latest version of the dashboard requires PyLucene 8 for its primary ontology code searching algorithm. Please follow setup instructions available here.
- A `.csv` file containing all patient observations/data (missingness allowed, except for the `itemid` column):

  ```
  itemid,subject_id,charttime,value,valueuom
  52038,123,2150-01-01 10:00:00,5,mEq/L
  52038,123,2150-01-01 11:00:00,6,ug/mL
  ...
  ```
- A `.csv` file containing all concepts to be annotated in id-label pairs, `{id: label}`:

  ```
  itemid,label
  52038,Base Excess
  52041,pH
  ...
  ```
- The `config.yaml`:
  - Define results directory (default: `/results-json/demo`)
  - Define location of the source data `.csv` (default: `/demo-data/CHARTEVENTS.csv`)
  - Define location of the concepts `.csv` (default: `/demo-data/demo_chartevents_user_1.csv`)
  - Define location of ontology SQLite3 databases (default: `/ontology`)
  - Define string search algorithm (default: `pylucene`)
  - Define ranking algorithm (default: `None`)
  - Define dashboard aesthetics for graphs (defaults are shown in the configuration file)
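Loading and sanity-checking the two CSV files in the format described above can be done with the standard library alone. This is a sketch under the stated format assumptions, not code from the dashboard itself:

```python
import csv

def load_source_data(observations_path, concepts_path):
    """Load the observations and concepts CSVs described above.
    Only `itemid` is required in the observations file; all other
    fields may be missing."""
    with open(observations_path, newline="") as f:
        observations = list(csv.DictReader(f))
    if any(not row.get("itemid") for row in observations):
        raise ValueError("every observation row needs an itemid")
    with open(concepts_path, newline="") as f:
        concepts = {row["itemid"]: row["label"] for row in csv.DictReader(f)}
    return observations, concepts
```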
To utilize ElasticSearch as the string search algorithm, run a local ElasticSearch cluster via Docker and specify `elastic` in the appropriate configuration field:

```
docker run --rm -p 9200:9200 -p 9300:9300 -e "xpack.security.enabled=false" -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:8.7.0
```
If desired, please define your API keys (OpenAI, CohereAI, NLM UMLS) as environment variables prior to running the dashboard. This can be done explicitly by editing the Docker Compose file below.
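As a sketch, passing the keys through a Compose file could look like the following fragment. The service name, image tag, and environment-variable names are assumptions, not AnnoDash's actual Compose configuration; keep real keys out of version control:

```yaml
# docker-compose.yml fragment (illustrative only)
services:
  annodash:
    image: annodash
    ports:
      - "8080:8080"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - COHERE_API_KEY=${COHERE_API_KEY}
      - UMLS_API_KEY=${UMLS_API_KEY}
```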
1. Clone repository:

   ```
   git clone https://github.com/justin13601/AnnoDash.git
   ```

2. Install packages needed to generate a configuration file:

   ```
   pip install PyYAML ml_collections
   ```

3. Edit `/src/generate_config.py` with desired directories and configurations and run:

   ```
   python3 generate_config.py
   ```

   This creates the `config.yaml` required by the dashboard.

4. Build the dashboard image:

   ```
   docker build -t annodash .
   ```

5. Retrieve the Docker image ID and run the Docker container. Get the `<IMAGE ID>`:

   ```
   docker images
   ```

   Copy the appropriate `<IMAGE ID>` and start the container:

   ```
   docker run --publish 8080:8080 <IMAGE ID>
   ```
1. Clone repository:

   ```
   git clone https://github.com/justin13601/AnnoDash.git
   ```

2. Install requirements:

   ```
   pip install -r requirements.txt
   ```

3. Install PyLucene and associated Java libraries:

   ```
   # use shell scripts to install jcc and pylucene
   ```

4. Edit `/src/generate_config.py` with desired directories and configurations and run:

   ```
   python3 generate_config.py
   ```

   This creates the `config.yaml` required by the dashboard.

5. Run dashboard:

   ```
   python3 main.py
   ```
Install/run the dashboard and visit http://127.0.0.1:8080/ or http://localhost:8080/.
`/src/generate_config.py` is used to generate the `config.yaml` file.
`/src/generate_ontology_database.py` uses SQLite3 to generate the `.db` database files used to store the ontology vocabulary. This is needed when defining custom vocabularies outside the default list of available ones. External packages are required to execute this script. In particular, PyMedTermino is needed to generate SNOMED CT's database file. Please see installation instructions here.
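For a sense of what a custom ontology database involves, the sketch below builds a minimal SQLite table of (code, label) pairs. The table and column names are illustrative assumptions, not the exact schema `generate_ontology_database.py` produces:

```python
import sqlite3

def build_ontology_db(path, concepts):
    """Create a minimal SQLite ontology database from (code, label)
    pairs. Schema is an illustrative assumption, not AnnoDash's."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS ontology "
                 "(code TEXT PRIMARY KEY, label TEXT)")
    conn.executemany("INSERT OR REPLACE INTO ontology VALUES (?, ?)",
                     concepts)
    conn.commit()
    return conn
```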
`/src/generate_pylucene_index.py` is used to generate the index used by PyLucene for ontology querying. This is needed when defining custom vocabularies outside the default list of available ones.
`/src/generate_elastic_index.py` is used to generate the index used by ElasticSearch for ontology querying. This is needed when defining custom vocabularies outside the default list of available ones. This can be run only after a local ElasticSearch cluster is created via Docker.
`/src/search.py` includes classes for ontology searching.
`/src/rank.py` includes classes for ontology ranking.
Demo data and respective licenses are included in the `demo-data` folder.
- The MIMIC-IV Clinical Database demo is available on PhysioNet (Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV Clinical Database Demo (version 2.2). PhysioNet. https://doi.org/10.13026/dp1f-ex47).
- LOINC® Ontology Codes are available at https://loinc.org.
- SNOMED CT Ontology Codes are available at https://www.nlm.nih.gov/healthit/snomedct/index.html.
- ICD-10-CM Codes are available at https://www.cms.gov/medicare/icd-10/2022-icd-10-cm.
- OMOP v5 Codes are available at https://athena.ohdsi.org/search-terms/start.
Distributed under the MIT License.
- Alistair Johnson, DPhil | The Hospital for Sick Children | Scientist
- Mjaye Mazwi, MBChB, MD | The Hospital for Sick Children | Staff Physician
- Danny Eytan, MD, PhD | The Hospital for Sick Children | Staff Physician
- Oshri Zaulan, MD | The Hospital for Sick Children | Staff Intensivist
- Azadeh Assadi, MN | The Hospital for Sick Children | Pediatric Nurse Practitioner