SQL for electronic medical record data
https://github.com/sepmein/emr_sql
- Host: GitHub
- URL: https://github.com/sepmein/emr_sql
- Owner: sepmein
- License: apache-2.0
- Created: 2025-04-14T06:56:46.000Z (6 months ago)
- Default Branch: master
- Last Pushed: 2025-04-27T04:35:54.000Z (5 months ago)
- Last Synced: 2025-04-27T05:26:59.434Z (5 months ago)
- Size: 10.7 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
# emr_sql
SQL for electronic medical record data

## Problem Setting
### Information Extraction from Electronic Health Records (2019–2024)

Electronic health records (EHRs) contain vast amounts of unstructured clinical text. Automated **information extraction (IE)** (e.g. named entity recognition of clinical concepts, relation extraction between entities, section classification) and annotation tools are needed to convert this text into structured data. Over the last five years, the field has seen a clear shift from rule-based and feature-engineered methods toward deep learning, especially transformer-based models. Table 1 summarizes representative tasks, datasets, and top performances. We discuss each class of methods below, along with datasets, benchmarks, metrics, and emerging trends.
| **Task (Dataset)** | **Top Method(s)** | **Best F1 (approx.)** | **Notes / Reference** |
|---|---|---|---|
| Clinical concept NER (i2b2 2010) | Transformer (BERT, ClinicalBERT) | ~90.3% ([[1902.08691] Enhancing Clinical Concept Extraction with Contextual Embeddings](https://ar5iv.org/pdf/1902.08691#:~:text=performances%20across%20all%20concept%20extraction,for%20in%20traditional%20word%20representations)) | SOTA achieved by clinical-domain BERT ([[1902.08691] Enhancing Clinical Concept Extraction with Contextual Embeddings](https://ar5iv.org/pdf/1902.08691#:~:text=performances%20across%20all%20concept%20extraction,for%20in%20traditional%20word%20representations)) |
| Medication (drug name) NER (n2c2 2018) | BiLSTM-CRF with embeddings | ~95–97% ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=Except%20for%20route%20and%20ADE%2C,1)) | High performance on drug names/frequency/route ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=Except%20for%20route%20and%20ADE%2C,1)) |
| Adverse drug event NER (MADE 2018) | BiLSTM-CRF | ~53–64% ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=recognition%20of%20ADEs%20seems%20to,constrained%20type%20of%20natural%20language)) | Remains challenging (best ≈64% F1) ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=recognition%20of%20ADEs%20seems%20to,constrained%20type%20of%20natural%20language)) |
| Med–attribute relation extraction (n2c2) | CNN+RNN + rules (joint model) | ~98% (freq/route/dose), ~85% (ADE) ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=who%20achieved%20top%20F%20_,percentage%20points)) | Hybrid NN+rule approach yields SOTA ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=The%20top%20performers%20for%20the,based%29%20BiLSTM%2C%20with%20preferences)) |

**Table 1:** *Selected IE tasks, top-performing methods, and example F1 scores from recent studies on clinical text (F1 = harmonic mean of precision and recall). References cite reported results on public benchmarks.*
#### Traditional and Rule-Based NLP Methods
Early EHR IE systems relied on handcrafted rules, lexicons, and syntactic patterns. Tools like **cTAKES**, **MetaMap**, or UMLS-based annotators use dictionaries, regular expressions, and grammar rules to identify medical terms (diseases, drugs, labs, etc.) ([Clinical named-entity recognition: A short comparison - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC9028678/#:~:text=clinical%20data,version%20of%20an%20annotated%20dataset)) ([A Hybrid Approach to Extracting Disorder Mentions from Clinical Notes - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC4525272/#:~:text=a%20hybrid%20approach%20to%20disorder,participating%20systems%20in%20the%20Challenge)). Such rule-based pipelines can achieve high precision for well-defined patterns (e.g. medication dosages), but often miss variations and suffer low recall. They require extensive domain knowledge engineering and do not generalize easily to new tasks. Even in recent clinical NLP, rule-based modules remain in use (for negation detection, section headings, or as post-processing steps) ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=The%20top%20performers%20for%20the,based%29%20BiLSTM%2C%20with%20preferences)) ([Development and Application of Natural Language Processing on Unstructured Data in Hypertension: A Scoping Review | medRxiv](https://www.medrxiv.org/content/10.1101/2024.02.27.24303468v2.full-text#:~:text=number%20of%20studies%20%28N%3D6%2C%2013.3,studies%20trained%20machine%2Fdeep%20learning%20models)). For example, in medication relation extraction the top systems used rule post-processing on neural outputs ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=The%20top%20performers%20for%20the,based%29%20BiLSTM%2C%20with%20preferences)). A recent scoping review observed that roughly 40–45% of hypertension-related NLP studies still employ rule-based components ([Development and Application of Natural Language Processing on Unstructured Data in Hypertension: A Scoping Review | medRxiv](https://www.medrxiv.org/content/10.1101/2024.02.27.24303468v2.full-text#:~:text=number%20of%20studies%20%28N%3D6%2C%2013.3,studies%20trained%20machine%2Fdeep%20learning%20models)).
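To make this concrete, here is a minimal sketch (not from any cited system) of the dictionary-plus-regex style of extraction for medication mentions and dosages; the tiny lexicon and the pairing heuristic are illustrative assumptions:

```python
import re

# Tiny illustrative drug lexicon; real systems use UMLS/RxNorm dictionaries.
DRUG_LEXICON = {"metformin", "lisinopril", "warfarin", "aspirin"}

# Dosage pattern: number + unit (e.g. "500 mg", "81mg", "2.5 mg").
DOSAGE_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(mg|mcg|g|units?)\b", re.IGNORECASE)

def extract_medications(note: str):
    """Return (drug, dosage) pairs found by dictionary + regex matching."""
    results = []
    for sentence in note.split("."):
        tokens = [t.strip(",;") for t in sentence.lower().split()]
        drugs = [t for t in tokens if t in DRUG_LEXICON]
        dosages = DOSAGE_RE.findall(sentence)
        # Naive pairing: attach the first dosage found in the sentence to each
        # drug; real pipelines use proximity or dependency rules instead.
        for drug in drugs:
            dose = " ".join(dosages[0]) if dosages else None
            results.append((drug, dose))
    return results

print(extract_medications("Started metformin 500 mg twice daily. Continue aspirin 81 mg."))
# [('metformin', '500 mg'), ('aspirin', '81 mg')]
```

Patterns like this yield high precision, but recall drops on misspellings, abbreviations, and out-of-lexicon drugs, which is exactly the limitation noted above.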
#### Machine Learning (Feature-Based)

Traditional ML approaches extract features (n-grams, dictionary matches, syntax) and use classifiers like **Conditional Random Fields (CRFs)** or SVMs for sequence labeling. CRF models were a standard for clinical NER: they model the label sequence jointly and can enforce tag consistency. Shallow ML methods were widely applied in i2b2/n2c2 challenges through 2010–2015. For instance, CRFs achieved respectable performance on disease and medication NER (often F1 in the 70–85% range on small corpora). However, studies note that “shallow classifiers and rule-based techniques” are still commonly used but inherently limit the ability to capture complex context ([A Survey of Deep Learning for Electronic Health Records](https://www.mdpi.com/2076-3417/12/22/11709#:~:text=NER.%20Sheikhalishahi%20et%20al.%20,45%5D%20looked%20into%20how)). These methods require careful feature engineering (e.g. lexicon lookups, context windows) and often plateau once features are optimized.

Machine-learned models (especially CRFs) still play a role on smaller datasets. In some benchmarks, well-tuned CRF or SVM models remain competitive: e.g. for medication–attribute relations on the MADE 2018 corpus, non-deep classifiers slightly outperformed neural models on several sub-tasks ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=For%20the%20MADE%201,where%20the%20DL%20approach%20ranked)). But overall, across EHR tasks, deep learning methods have increasingly taken over performance leadership ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=Results%20%3A%20In%20the%20past,overcome%20by%20adaptive%20learning%20strategies)) ([Named Entity Recognition in Medical Domain: A systematic Literature Review | Kusuma | JOIV : International Journal on Informatics Visualization](https://joiv.org/index.php/joiv/article/view/3111#:~:text=Springer,performance%20of%20these%20models%20mostly)). Modern IE pipelines rarely use pure CRF/SVM systems as state-of-the-art, except in very low-data settings.
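As an illustration of this kind of feature engineering, here is a minimal CRF sequence-labeling sketch using the widely used sklearn-crfsuite package; the feature set and toy training data are illustrative assumptions, not taken from any cited system:

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def word2features(sent, i):
    """Hand-crafted features for token i: surface form, shape, context window."""
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "prev.lower": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
        # Lexicon lookup feature, as mentioned in the text above.
        "in_drug_lexicon": word.lower() in {"metformin", "aspirin"},
    }

# Toy training data: one BIO-tagged sentence.
sent = ["Start", "metformin", "500", "mg"]
X_train = [[word2features(sent, i) for i in range(len(sent))]]
y_train = [["O", "B-Drug", "B-Dose", "I-Dose"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))  # [['O', 'B-Drug', 'B-Dose', 'I-Dose']]
```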
#### Deep Learning Approaches

Deep neural networks have dominated recent EHR IE. Early deep models included **Recurrent Neural Networks (RNNs)** such as the bidirectional LSTM (BiLSTM), often combined with a CRF output layer. BiLSTM-CRFs were (and still are) popular for NER: they learn contextual embeddings of words and capture label dependencies. **Convolutional Neural Networks (CNNs)** have been applied to sequence tasks by sliding filters over embeddings (useful for capturing local features) and also to relation classification. These networks typically use pre-trained word embeddings (Word2Vec, GloVe) or medical-domain embeddings (trained on MIMIC-III) as input ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=In%20terms%20of%20DL%20methodology%2C,III%2C%20might%20be%20advantageous)).

With deep learning, performance jumped: for example, disease NER near 90% F1 on i2b2 2010 has been achieved with BiLSTM-based models ([[1902.08691] Enhancing Clinical Concept Extraction with Contextual Embeddings](https://ar5iv.org/pdf/1902.08691#:~:text=performances%20across%20all%20concept%20extraction,for%20in%20traditional%20word%20representations)). In medication NER, BiLSTM-CRF models routinely reached F1 ≈ 95–97% on drug names, dosages, and routes ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=Except%20for%20route%20and%20ADE%2C,1)). Similarly, relation extraction systems based on CNN-RNN architectures achieved F1 ≈ 98% on simple relations like frequency or dosage ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=who%20achieved%20top%20F%20_,percentage%20points)). However, these networks still require large amounts of annotated data and task-specific tuning.
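A minimal PyTorch sketch of the BiLSTM tagger architecture described above; the CRF output layer (often added via a package such as pytorch-crf) is omitted for brevity, and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """BiLSTM sequence labeler: embeddings -> BiLSTM -> per-token tag scores."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # could load MIMIC-trained vectors
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden_dim, num_tags)   # 2x for forward+backward states

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq_len, emb_dim)
        h, _ = self.lstm(x)         # (batch, seq_len, 2*hidden_dim)
        return self.fc(h)           # per-token emission scores over the tag set

model = BiLSTMTagger(vocab_size=5000, num_tags=7)  # e.g. BIO tags for 3 entity types + O
scores = model(torch.randint(0, 5000, (2, 20)))    # batch of 2 sentences, 20 tokens each
print(scores.shape)  # torch.Size([2, 20, 7])
```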
##### Transformer-Based and Contextual Models

Since 2019, **transformer models** (BERT and its variants) have set new state-of-the-art results across virtually all EHR IE tasks. Transformers use self-attention to encode context and are typically fine-tuned for a target task. Domain-specific pretrained models (e.g. **BioBERT**, **ClinicalBERT**, **BlueBERT**, **PubMedBERT**) are trained on biomedical/EHR text and significantly improve clinical concept recognition ([[1902.08691] Enhancing Clinical Concept Extraction with Contextual Embeddings](https://ar5iv.org/pdf/1902.08691#:~:text=performances%20across%20all%20concept%20extraction,for%20in%20traditional%20word%20representations)) ([A Survey of Deep Learning for Electronic Health Records](https://www.mdpi.com/2076-3417/12/22/11709#:~:text=At%20present%2C%20the%20best%20performance,level%20convolutional)). For example, ClinicalBERT (pretrained on MIMIC notes) achieved ~90.3% F1 on the i2b2 2010 concept extraction task ([[1902.08691] Enhancing Clinical Concept Extraction with Contextual Embeddings](https://ar5iv.org/pdf/1902.08691#:~:text=performances%20across%20all%20concept%20extraction,for%20in%20traditional%20word%20representations)), exceeding earlier methods.

Transformers also dominate relation extraction and document classification. Researchers typically frame IE tasks as sequence labeling (BIO tagging) for NER or as token-pair classification for relations ([A large language model for electronic health records | npj Digital Medicine](https://www.nature.com/articles/s41746-022-00742-2#:~:text=clinical%20concepts%20on%20the%20three,and%20achieved%20the%20best%20F1)) ([Named Entity Recognition in Electronic Health Records: A Methodological Review - PubMed](https://pubmed.ncbi.nlm.nih.gov/37964451/#:~:text=between%20them,within%20a%20specific%20clinical%20domain)). Notably, a 2024 EHR-NER review found that BERT-style models (with “BIO” tagging) are now *“the most frequently reported”* approach ([Named Entity Recognition in Electronic Health Records: A Methodological Review - PubMed](https://pubmed.ncbi.nlm.nih.gov/37964451/#:~:text=between%20them,within%20a%20specific%20clinical%20domain)). Hybrid training schemes (multi-task or transfer learning across entities) have also been explored to boost performance when data is scarce ([A Survey of Deep Learning for Electronic Health Records](https://www.mdpi.com/2076-3417/12/22/11709#:~:text=based%20on%20general%20neural%20network,performing)).
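A sketch of the now-standard fine-tuning setup, assuming the Hugging Face transformers library and the publicly available Bio_ClinicalBERT checkpoint; the BIO label set is an illustrative scheme for i2b2-2010-style classes, not taken from any cited paper:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative BIO label set for clinical concept NER (i2b2-2010-style classes).
labels = ["O", "B-Problem", "I-Problem", "B-Test", "I-Test", "B-Treatment", "I-Treatment"]

# Bio_ClinicalBERT is one publicly available clinical checkpoint; any BERT variant works.
name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=len(labels))

enc = tokenizer("Patient denies chest pain .", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, num_subword_tokens, num_labels)
pred = [labels[i] for i in logits.argmax(-1)[0].tolist()]
print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()), pred)))
# The classification head is freshly initialized here; predictions are only
# meaningful after fine-tuning on BIO-annotated clinical notes.
```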
In summary, the “paradigm shift” from feature-based ML to deep learning is now nearly complete in clinical NLP ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=Results%20%3A%20In%20the%20past,overcome%20by%20adaptive%20learning%20strategies)) ([Named Entity Recognition in Medical Domain: A systematic Literature Review | Kusuma | JOIV : International Journal on Informatics Visualization](https://joiv.org/index.php/joiv/article/view/3111#:~:text=Springer,performance%20of%20these%20models%20mostly)). Studies consistently report that deep models outperform classical ML by large margins on modern benchmarks, especially when ample data or pretrained models are available ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=Results%20%3A%20In%20the%20past,overcome%20by%20adaptive%20learning%20strategies)) ([[1902.08691] Enhancing Clinical Concept Extraction with Contextual Embeddings](https://ar5iv.org/pdf/1902.08691#:~:text=performances%20across%20all%20concept%20extraction,for%20in%20traditional%20word%20representations)).
#### Hybrid and Knowledge-Infused Models

Recent methods often combine neural models with rules or external knowledge. For instance, top systems for medication–attribute relation extraction used a **CNN+RNN architecture with rule-based postprocessing** ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=The%20top%20performers%20for%20the,based%29%20BiLSTM%2C%20with%20preferences)). This hybrid approach leverages the flexibility of neural nets while enforcing medical constraints via rules. Another trend is incorporating **domain ontologies or gazetteers** into neural pipelines. For example, Nie *et al.*’s KA-NER model augmented BERT with UMLS concept embeddings, achieving an average F1 ≈ 84.8% across multiple NER corpora ([A Survey of Deep Learning for Electronic Health Records](https://www.mdpi.com/2076-3417/12/22/11709#:~:text=,this%20model%20on%20multiple%20datasets)). Knowledge-guided embeddings can help disambiguate terms and improve generalization.

In general, hybrid systems aim to capture the benefits of both worlds: data-driven learning plus curated domain signals. This includes simple pipelines where dictionary matches seed candidate terms for a classifier, or neural models augmented with gazetteer features. While pure deep models dominate performance, well-engineered hybrids remain competitive on tasks with limited data ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=For%20the%20MADE%201,where%20the%20DL%20approach%20ranked)).
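A minimal sketch of the rule-postprocessing pattern, assuming a hypothetical relation format (head, type, tail, confidence) produced by an upstream neural extractor; the rules themselves are illustrative:

```python
import re

def rule_postprocess(relations):
    """Filter neural relation predictions with simple medical constraints.

    Each relation is (head_text, relation_type, tail_text, confidence),
    a hypothetical output format for an upstream neural extractor.
    """
    kept = []
    for head, rel_type, tail, conf in relations:
        # Rule 1: a Dosage argument must actually contain a number + unit.
        if rel_type == "Drug-Dosage" and not re.search(r"\d+(\.\d+)?\s*(mg|mcg|g)", tail, re.I):
            continue
        # Rule 2: discard low-confidence ADE links that the network over-predicts.
        if rel_type == "Drug-ADE" and conf < 0.7:
            continue
        kept.append((head, rel_type, tail, conf))
    return kept

preds = [("metformin", "Drug-Dosage", "500 mg", 0.95),
         ("metformin", "Drug-Dosage", "twice daily", 0.62),  # actually a frequency
         ("warfarin", "Drug-ADE", "bleeding", 0.55)]
print(rule_postprocess(preds))  # keeps only the valid dosage relation
```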
#### Large Language Models (LLMs) and Pretrained Transformers

The latest frontier is using very large pretrained models for IE. Beyond BERT-scale models (100M–1B parameters), **medical LLMs** and general LLMs have been tested on EHR data. For example, GatorTron (8.9B parameters, trained on >90B words of clinical text) yielded modest gains on multiple clinical NLP tasks (NER, relation extraction, inference) compared to smaller models ([A large language model for electronic health records | npj Digital Medicine](https://www.nature.com/articles/s41746-022-00742-2#:~:text=NLP%20tasks,com%2Forgs%2Fnvidia%2Fteams%2Fclar)). While GatorTron improved accuracy on inference tasks by ~9–10% ([A large language model for electronic health records | npj Digital Medicine](https://www.nature.com/articles/s41746-022-00742-2#:~:text=NLP%20tasks,com%2Forgs%2Fnvidia%2Fteams%2Fclar)), it also showed that classic extraction tasks are approaching saturation: large models gave only “moderate improvements for easier tasks” like concept recognition ([A large language model for electronic health records | npj Digital Medicine](https://www.nature.com/articles/s41746-022-00742-2#:~:text=answering%2C%20but%20moderate%20improvements%20for,saturation%20of%20simpler%20benchmarks%20with)).

**ChatGPT and GPT-like models** have recently been explored for EHR IE. A 2024 study used ChatGPT (GPT-3.5, text-davinci) to extract pathology report fields. Without any task-specific training, ChatGPT achieved ~89% accuracy on lung cancer classification and ~99% on osteosarcoma report tasks, outperforming traditional NLP baselines ([A critical assessment of using ChatGPT for extracting structured data from clinical notes | npj Digital Medicine](https://www.nature.com/articles/s41746-024-01079-8#:~:text=demonstrated%20the%20ability%20to%20extract,notes%20for%20structured%20information%20extraction)). This demonstrates LLMs’ potential for zero-shot information extraction from clinical text. However, these models depend heavily on prompt design and may err on specialized terminology ([A critical assessment of using ChatGPT for extracting structured data from clinical notes | npj Digital Medicine](https://www.nature.com/articles/s41746-024-01079-8#:~:text=traditional%20NLP%20methods,human%20annotation%20and%20model%20training)). (For a clinical audience: ChatGPT acts as a “black-box” NER/IE tool driven by natural-language prompts, but reliability and hallucination remain concerns.)

In summary, transformer-based pretrained models (from BioBERT to GPT-4) are becoming mainstream. The trend is toward *pretrain then fine-tune* or *prompt and infer*. Domain-specific pretraining (e.g. on MIMIC or PubMed) generally helps, but even large general LLMs can perform well with proper prompting. The field is actively assessing how best to integrate LLMs into clinical workflows (weaker labeling requirements versus risk of error).
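A sketch of the prompt-and-parse workflow, with the LLM call abstracted behind a caller-supplied function (`call_llm` is a hypothetical stand-in for any API client or local model); the field schema is illustrative:

```python
import json

PROMPT_TEMPLATE = """Extract the following fields from the pathology report below.
Return ONLY a JSON object with keys: "diagnosis", "tumor_site", "stage".
Use null for any field not stated in the report.

Report:
{report}
"""

def extract_fields(report: str, call_llm) -> dict:
    """Zero-shot IE via prompting. `call_llm` is any text-in/text-out LLM client."""
    raw = call_llm(PROMPT_TEMPLATE.format(report=report))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # LLMs do not always honor the requested output format; callers must
        # handle malformed responses (a practical reliability concern noted above).
        return {"error": "unparseable LLM output", "raw": raw}

# Example with a stubbed model (a real client would wrap an API or local model):
fake_llm = lambda prompt: '{"diagnosis": "adenocarcinoma", "tumor_site": "lung", "stage": "IIIA"}'
print(extract_fields("...", fake_llm))
```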
#### Datasets and Benchmarks
Several public datasets and shared tasks drive progress in EHR IE:

- **i2b2/VA and n2c2 Challenges (2006–2020):** These workshops provided annotated clinical notes for various IE tasks. Notably, i2b2/VA (2009–2012) included concept extraction (disease, drug, etc.) and relations ([[1902.08691] Enhancing Clinical Concept Extraction with Contextual Embeddings](https://ar5iv.org/pdf/1902.08691#:~:text=performances%20across%20all%20concept%20extraction,for%20in%20traditional%20word%20representations)). n2c2 (formerly i2b2) hosted tasks like medication–attribute extraction (2018) and a variety of phenotyping tasks. For example, the n2c2 2018 medication challenge (track 2) produced ~1,100 de-identified oncology notes labeled with drugs, dosages, and ADE relations (over 83k entity mentions and 59k relations) ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=comparison%29%2C%20and%2083%2C869%20concept%20annotations,DDI%20corpus%20%2076)) ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=Except%20for%20route%20and%20ADE%2C,1)).
- **MADE 2018:** The “Medication, Indication, and Adverse Drug Events” challenge released ~1,092 cancer patient notes with annotations for medications and ADEs ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=annotations%20%28see%20Section%203,structured)). This corpus is used to evaluate drug and ADE extraction.
- **DDI (2013):** The Drug-Drug Interaction Extraction corpus (from DrugBank and MEDLINE) contains drug mentions and annotated interactions. Although drawn from the biomedical literature rather than EHRs, it is often referenced for medication IE.
- **ShARe/CLEF (2013) and SemEval (2014–2015):** Shared tasks on clinical concept recognition and normalization from MIMIC notes (ShARe) and i2b2 notes (SemEval) have provided benchmarks for disease/symptom/negation annotation. As one survey notes, standard corpora include i2b2, ShARe, and SemEval datasets ([[1902.08691] Enhancing Clinical Concept Extraction with Contextual Embeddings](https://ar5iv.org/pdf/1902.08691#:~:text=et%C2%A0al,ML%29%20models)).
- **MIMIC-III/IV:** This open ICU dataset contains millions of clinical notes (discharge summaries, radiology reports, etc.) plus structured data. MIMIC is not fully annotated for NLP, but it is widely used for unsupervised pretraining (the basis for ClinicalBERT, MIMIC-trained Word2Vec, etc.), and smaller annotation efforts (e.g. de-identification). Many recent models are pretrained on MIMIC before fine-tuning on IE tasks.
- **BLUE Benchmark:** The Biomedical Language Understanding Evaluation (BLUE) is a suite combining several tasks (NER, RE) across 10 biomedical corpora (including BC5CDR, DDI, i2b2) ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=give%20valuable%20comparative%20information%20of,modeling%20techniques%2C%20and%20evaluation%20metrics)). Though many BLUE tasks are PubMed-based, it includes i2b2-derived medical benchmarks.
- **Clinical Dialogs and Other Corpora:** In specialized domains, annotated data is scarce. Some efforts (e.g. small studies on Spanish EHRs or domain-specific corpora) exist, but most high-performance results come from the above English resources.

These datasets are evaluated with standard metrics: **precision, recall, and F1-score** (F1 = harmonic mean of precision and recall) are ubiquitous ([Approach to machine learning for extraction of real-world data variables from electronic health records - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC10541019/#:~:text=Performance%20metric%20definitions%20%E2%80%A2%20Sensitivity,extracted%20as)) ([Approach to machine learning for extraction of real-world data variables from electronic health records - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC10541019/#:~:text=%E2%80%A2%20F1%20Score%3A%20Computed%20as,balance%20of%20sensitivity%20and%20PPV)). For binary extraction tasks, accuracy or AUC may be reported, but F1 is primary. Table 1 and the text above cite example F1 results on public benchmarks.
#### Evaluation Metrics

IE systems are typically evaluated on held-out annotated test sets. Common metrics are **precision** (positive predictive value), **recall** (sensitivity), and their **F1-score** (harmonic mean) ([Approach to machine learning for extraction of real-world data variables from electronic health records - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC10541019/#:~:text=Performance%20metric%20definitions%20%E2%80%A2%20Sensitivity,extracted%20as)) ([Approach to machine learning for extraction of real-world data variables from electronic health records - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC10541019/#:~:text=%E2%80%A2%20F1%20Score%3A%20Computed%20as,balance%20of%20sensitivity%20and%20PPV)). For a class, e.g. “Disease”: *precision* = (# correctly extracted Disease mentions) / (# all predicted Disease mentions); *recall* = (# correctly extracted) / (# all gold Disease mentions). A high F1 indicates a good balance of precision and recall. Some tasks (e.g. section classification) use overall accuracy or AUC, but for NER/RE, F1 is standard. The cited studies use these metrics consistently ([Approach to machine learning for extraction of real-world data variables from electronic health records - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC10541019/#:~:text=Performance%20metric%20definitions%20%E2%80%A2%20Sensitivity,extracted%20as)) ([Approach to machine learning for extraction of real-world data variables from electronic health records - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC10541019/#:~:text=%E2%80%A2%20F1%20Score%3A%20Computed%20as,balance%20of%20sensitivity%20and%20PPV)). In medical NLP, even F1 scores in the 0.70–0.85 range can be valuable given the complexity of the language and the safety requirements of the domain.
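These definitions translate directly into code; a minimal sketch of entity-level (exact span match) scoring over (start, end, type) tuples:

```python
def ner_scores(gold: set, pred: set):
    """Entity-level precision/recall/F1 over (start, end, type) span tuples."""
    tp = len(gold & pred)  # exact-match true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 2, "Disease"), (5, 6, "Drug"), (9, 11, "Disease")}
pred = {(0, 2, "Disease"), (5, 6, "Drug"), (7, 8, "Drug")}
print(ner_scores(gold, pred))  # (0.667, 0.667, 0.667): 2 of 3 predictions correct
```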
#### Performance Trends and Comparisons

The broad trend is clear: **deep learning > classical ML > rule-based** on most benchmarks. Udo Hahn *et al.* summarize the “overwhelming experimental evidence… DL-based approaches outperform non-DL ones by often large margins” ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=Results%20%3A%20In%20the%20past,overcome%20by%20adaptive%20learning%20strategies)). Systematic reviews confirm the shift: recent surveys report a transition from conventional ML to deep learning (especially transformers) ([Named Entity Recognition in Medical Domain: A systematic Literature Review | Kusuma | JOIV : International Journal on Informatics Visualization](https://joiv.org/index.php/joiv/article/view/3111#:~:text=Springer,availability%20to%20build%20accurate%20models)), and contextual models achieve new SOTA on all tasks ([[1902.08691] Enhancing Clinical Concept Extraction with Contextual Embeddings](https://ar5iv.org/pdf/1902.08691#:~:text=performances%20across%20all%20concept%20extraction,for%20in%20traditional%20word%20representations)).

At a finer level, performance has **plateaued** on simpler extraction tasks. For example, drug name recognition on n2c2 clinical notes now reaches >95% F1 ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=Except%20for%20route%20and%20ADE%2C,1)), comparable to human inter-annotator agreement in some cases. Even a bespoke 8.9B-parameter model (GatorTron) gave only modest gains on concept extraction, with the authors noting that “simpler benchmarks [are] saturated” ([A large language model for electronic health records | npj Digital Medicine](https://www.nature.com/articles/s41746-022-00742-2#:~:text=answering%2C%20but%20moderate%20improvements%20for,saturation%20of%20simpler%20benchmarks%20with)). In contrast, harder tasks (ADE detection, nuanced relations) score much lower: the best F1 for adverse-drug-event mentions is ≈53–64% ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=whereas%20for%20ADE%20it%20is,constrained%20type%20of%20natural%20language)), and causal or temporal relations remain in the 0.70–0.80 range even in the best systems. Hybrid approaches can help narrow these gaps (e.g. CNN+RNN+rules reached up to 98% F1 on some medication relations ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=who%20achieved%20top%20F%20_,percentage%20points))), but challenges remain.

**Datasets and annotation** are still bottlenecks. Many papers caution that small corpora limit deep models ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=limited%20corpora%20create%20intrinsic%20problems,overcome%20by%20adaptive%20learning%20strategies)). Reviews emphasize the need for more annotated clinical data and for standardized evaluation (e.g. BLUE) ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=give%20valuable%20comparative%20information%20of,modeling%20techniques%2C%20and%20evaluation%20metrics)). As one scoping review notes, creating new domain-specific corpora is essential for progress ([Named Entity Recognition in Electronic Health Records: A Methodological Review - PubMed](https://pubmed.ncbi.nlm.nih.gov/37964451/#:~:text=Conclusions%3A%20%20EHRs%20play%20a,of%20NER%20and%20RE%20models)).

In summary, modern clinical IE systems almost universally use neural models with learned embeddings. Rule-based components and feature-based ML survive mainly in specialized or resource-limited settings. Evaluation focuses on precision/recall/F1, often on i2b2/n2c2-derived benchmarks. The overall trajectory is toward larger, pretrained models (BERT and beyond) and away from hand-tuned rules.
**Key comparisons:** Deep BiLSTM-CRF models once set the bar for EHR NER; today transformer encoders do better. For example, before BERT the top i2b2 2010 F1 was in the mid-80s, whereas a clinical BERT achieved 90.3% ([[1902.08691] Enhancing Clinical Concept Extraction with Contextual Embeddings](https://ar5iv.org/pdf/1902.08691#:~:text=performances%20across%20all%20concept%20extraction,for%20in%20traditional%20word%20representations)). Similarly, on medication extraction the best BiLSTM-CRF gave ~95% F1 on straightforward fields ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=Except%20for%20route%20and%20ADE%2C,1)); transformers have since standardized the methodology and typically push such scores slightly higher. Hybrid models (NN+rules) are used mostly to eke out final improvements on specific relation tasks ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=The%20top%20performers%20for%20the,based%29%20BiLSTM%2C%20with%20preferences)).

**Conclusion:** In 2020–2024, information extraction from EHRs has matured into a deep-learning-dominated field. Rule-based and classical ML methods largely serve as baselines or adjuncts. The strongest systems today use pretrained transformers (often fine-tuned BERT variants) and even large LLMs to label entities and relations in clinical notes. Benchmarks (i2b2/n2c2, MADE, etc.) provide the testbeds, and evaluation relies on precision/recall/F1. While results are high on many subtasks, specialized extraction (e.g. ADEs, causal relations) remains challenging. Ongoing work focuses on expanding datasets, leveraging medical knowledge, and responsibly applying LLMs to improve performance and reliability in clinical NLP ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=Results%20%3A%20In%20the%20past,overcome%20by%20adaptive%20learning%20strategies)) ([A critical assessment of using ChatGPT for extracting structured data from clinical notes | npj Digital Medicine](https://www.nature.com/articles/s41746-024-01079-8#:~:text=demonstrated%20the%20ability%20to%20extract,notes%20for%20structured%20information%20extraction)).

**Sources:** Recent surveys and challenge reports ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=Results%20%3A%20In%20the%20past,overcome%20by%20adaptive%20learning%20strategies)) ([A Survey of Deep Learning for Electronic Health Records](https://www.mdpi.com/2076-3417/12/22/11709#:~:text=NER.%20Sheikhalishahi%20et%20al.%20,45%5D%20looked%20into%20how)) ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=Except%20for%20route%20and%20ADE%2C,1)) ([Medical Information Extraction in the Age of Deep Learning - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442512/#:~:text=who%20achieved%20top%20F%20_,percentage%20points)) ([[1902.08691] Enhancing Clinical Concept Extraction with Contextual Embeddings](https://ar5iv.org/pdf/1902.08691#:~:text=performances%20across%20all%20concept%20extraction,for%20in%20traditional%20word%20representations)) ([A critical assessment of using ChatGPT for extracting structured data from clinical notes | npj Digital Medicine](https://www.nature.com/articles/s41746-024-01079-8#:~:text=demonstrated%20the%20ability%20to%20extract,notes%20for%20structured%20information%20extraction)) provide empirical results and comparisons for clinical IE methods. Performance metrics and definitions follow standard ML conventions ([Approach to machine learning for extraction of real-world data variables from electronic health records - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC10541019/#:~:text=Performance%20metric%20definitions%20%E2%80%A2%20Sensitivity,extracted%20as)) ([Approach to machine learning for extraction of real-world data variables from electronic health records - PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC10541019/#:~:text=%E2%80%A2%20F1%20Score%3A%20Computed%20as,balance%20of%20sensitivity%20and%20PPV)).