https://github.com/datagodzilla/medical-nlp-lean
Medical Entities Recognition
biobert biomedical clinical-nlp clinical-text entity-extraction healthcare machine-learning medical medical-informatics named-entity-recognition ner nlp python spacy streamlit
- Host: GitHub
- URL: https://github.com/datagodzilla/medical-nlp-lean
- Owner: datagodzilla
- License: mit
- Created: 2025-11-14T18:05:17.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-11-14T18:11:28.000Z (7 months ago)
- Last Synced: 2025-11-14T20:25:49.206Z (7 months ago)
- Topics: biobert, biomedical, clinical-nlp, clinical-text, entity-extraction, healthcare, machine-learning, medical, medical-informatics, named-entity-recognition, ner, nlp, python, spacy, streamlit
- Language: Python
- Size: 1020 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Medical NLP - Named Entity Recognition Pipeline
A production-ready Medical Named Entity Recognition (NER) pipeline for extracting biomedical entities from clinical text using spaCy, BioBERT, and advanced template-based pattern matching.
---
## Features
- **Comprehensive Entity Detection**: Diseases, genes, proteins, chemicals, and anatomical terms
- **Advanced Context Analysis**: Identifies negated, historical, family history, uncertain, and confirmed conditions
- **Template-Based Matching**: 57,476+ curated medical terms across 6 specialized templates
- **BioBERT Integration**: State-of-the-art biomedical language models for high accuracy
- **Dual Interface**: Command-line tool and interactive Streamlit web application
- **Rich Output**: 15-column Excel reports with visualizations and JSON export
- **Scope Reversal Detection**: Handles complex negation patterns ("no fever but has cough")
- **Production Ready**: Comprehensive test suite and robust error handling
---
## Quick Start
### Installation
```bash
# Clone the repository
git clone https://github.com/datagodzilla/medical-nlp-lean.git
cd medical-nlp-lean
# Create conda environment
conda env create -f py311_bionlp_environment.yml
# Activate environment
conda activate py311_bionlp
# Install package
pip install -e .
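# Download required spaCy models
python -m spacy download en_core_web_sm
# en_ner_bc5cdr_md is a scispaCy model; if the command below cannot find it,
# install the model with pip from the scispaCy release matching your version
python -m spacy download en_ner_bc5cdr_md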
```
### Basic Usage
**Command Line:**
```bash
# Run NER pipeline on default input
./run_ner_pipeline.sh --run
# Process custom file
./run_ner_pipeline.sh --input data/my_clinical_notes.xlsx --run
```
**Web Interface:**
```bash
# Launch Streamlit app
./run_app.sh
# Opens at http://localhost:8501
```
**Python API:**
```python
from src.enhanced_medical_ner_predictor import MedicalNERPredictor
# Initialize predictor
predictor = MedicalNERPredictor()
# Process text
text = "Patient denies chest pain but reports shortness of breath."
results = predictor.process_text(text)
# Access detected entities
print(results['detected_diseases'])
print(results['negated_entities'])
print(results['confirmed_entities'])
```
---
## Output
The pipeline generates Excel reports with **15 columns**, including:
| Column | Description |
|--------|-------------|
| **Visualization** | HTML entity highlighting with color-coded labels |
| **Detected Diseases** | Identified disease/condition entities |
| **Disease Count** | Total number of diseases detected |
| **Detected Genes** | Identified gene/protein entities |
| **Gene Count** | Total number of genes detected |
| **Negated Entities** | Conditions explicitly denied or absent |
| **Historical Entities** | Past medical history mentions |
| **Family Entities** | Family medical history |
| **Uncertain Entities** | Possible or speculative conditions |
| **Confirmed Entities** | Explicitly confirmed conditions |
| **Section Categories** | Clinical note sections (Chief Complaint, Assessment, Plan, etc.) |
| **JSON Export** | Complete structured data for all entities |
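
As a rough illustration of how the columns above relate to one another, the sketch below assembles a single report row from a list of entity records. It is not the project's code: the `build_report_row` function, the `text`/`label`/`context` keys, and the `"; "` separator are all assumptions made for this example.

```python
import json


def build_report_row(entities: list[dict]) -> dict:
    """Group entity dicts (hypothetical text/label/context keys) into report columns."""
    diseases = [e["text"] for e in entities if e["label"] == "DISEASE"]
    genes = [e["text"] for e in entities if e["label"] == "GENE"]

    def by_context(context: str) -> str:
        # Join the text of every entity that carries the given context label.
        return "; ".join(e["text"] for e in entities if e["context"] == context)

    return {
        "Detected Diseases": "; ".join(diseases),
        "Disease Count": len(diseases),
        "Detected Genes": "; ".join(genes),
        "Gene Count": len(genes),
        "Negated Entities": by_context("negated"),
        "Historical Entities": by_context("historical"),
        "Family Entities": by_context("family"),
        "Uncertain Entities": by_context("uncertain"),
        "Confirmed Entities": by_context("confirmed"),
        # Full structured data, serialized so it fits in one spreadsheet cell.
        "JSON Export": json.dumps(entities),
    }
```

A list of such rows could then be written to a spreadsheet with any tabular library; the count columns are derived values, so they stay consistent with the entity lists by construction.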
---
## Key Capabilities
### Medical Entity Recognition
- **Diseases & Conditions**: Diabetes, hypertension, pneumonia, cancer types, etc.
- **Genes & Proteins**: BRCA1, TP53, kinesin, hemoglobin, etc.
- **Chemicals & Drugs**: Aspirin, metformin, chemotherapy agents, etc.
- **Anatomical Terms**: Heart, lungs, liver, blood vessels, etc.
### Context Classification
- **Negation Detection**: "No evidence of diabetes", "denies chest pain"
- **Historical Context**: "History of hypertension", "previous stroke"
- **Family History**: "Mother has breast cancer", "family history of diabetes"
- **Uncertainty**: "Possible pneumonia", "rule out myocardial infarction"
- **Scope Reversal**: "No fever but has cough" (correctly identifies cough as confirmed)
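
The context categories above, including scope reversal, can be illustrated with a small rule-based sketch. This is not the project's implementation: the trigger lists, the `classify_context` function, and the clause-splitting rule are assumptions chosen for illustration, whereas the real pipeline loads its patterns from the template files.

```python
import re

# Hypothetical trigger phrases; the real pipeline reads these from templates.
TRIGGERS = {
    "negated": ["no evidence of", "denies", "no"],
    "historical": ["history of", "previous"],
    "family": ["family history of", "mother has", "father has"],
    "uncertain": ["possible", "rule out"],
}
# Conjunctions that end the scope of a preceding trigger (scope reversal).
SCOPE_BREAKERS = r"\b(?:but|however|although)\b"


def classify_context(sentence: str, entity: str) -> str:
    """Return the context label for `entity` within `sentence`."""
    lowered = sentence.lower()
    # Split into clauses so a trigger only affects entities in its own clause.
    for clause in re.split(SCOPE_BREAKERS, lowered):
        position = clause.find(entity.lower())
        if position == -1:
            continue
        prefix = clause[:position]
        # Check family before historical so "family history of" wins.
        for label in ("family", "historical", "negated", "uncertain"):
            for trigger in TRIGGERS[label]:
                if re.search(rf"\b{re.escape(trigger)}\b", prefix):
                    return label
        # No trigger precedes the entity in its clause.
        return "confirmed"
    return "not_found"


print(classify_context("No fever but has cough", "fever"))
print(classify_context("No fever but has cough", "cough"))
```

Splitting on "but" before looking for triggers is what keeps the negation in "no fever" from spilling over onto "cough" in this sketch.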
### Template System
- **target_rules_template.xlsx**: 57,476 curated medical terms
- **negated_rules_template.xlsx**: 99 negation patterns
- **historical_rules_template.xlsx**: 82 historical context patterns
- **family_rules_template.xlsx**: 79 family history patterns
- **uncertainty_rules_template.xlsx**: 48 uncertainty patterns
- **confirmed_rules_template.xlsx**: 138 confirmation patterns
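
To make the idea of template-based matching concrete, here is a minimal sketch that looks up terms from a small in-memory dictionary. The `TARGET_TERMS` table, its labels, and the `match_terms` function are hypothetical stand-ins for the contents of `target_rules_template.xlsx`; how the project actually loads and applies its templates is not shown in this README.

```python
import re

# Hypothetical miniature stand-in for target_rules_template.xlsx:
# each lowercase term maps to an entity label.
TARGET_TERMS = {
    "diabetes": "DISEASE",
    "breast cancer": "DISEASE",
    "cancer": "DISEASE",
    "brca1": "GENE",
    "metformin": "CHEMICAL",
}


def build_matcher(terms: dict[str, str]) -> re.Pattern:
    """Compile one alternation, longest terms first so they win over substrings."""
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def match_terms(text: str, terms: dict[str, str]) -> list[tuple[str, str, int, int]]:
    """Return (matched text, label, start, end) for every template hit."""
    matcher = build_matcher(terms)
    return [
        (m.group(0), terms[m.group(0).lower()], m.start(), m.end())
        for m in matcher.finditer(text)
    ]


hits = match_terms("BRCA1 mutation carrier with breast cancer on metformin.", TARGET_TERMS)
for matched_text, label, start, end in hits:
    print(matched_text, label, start, end)
```

A single regular expression is fine for a handful of terms but is unlikely to scale to tens of thousands; for a vocabulary of that size, a dedicated matcher such as spaCy's `PhraseMatcher` would be the more usual choice.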
---
## Architecture
```
medical-nlp-lean/
├── src/                    # Core Python modules
│   ├── enhanced_medical_ner_predictor.py
│   └── performance_analyzer.py
├── app/                    # Streamlit web application
│   └── medical_nlp_app.py
├── data/
│   ├── external/           # Template files
│   └── raw/                # Input data
├── models/
│   └── pretrained/         # BioBERT models (~1.6GB)
├── output/                 # Generated results
│   ├── results/            # Excel outputs
│   ├── visualizations/     # PNG visualizations
│   └── logs/               # Execution logs
├── tests/                  # Comprehensive test suite
└── configs/                # Configuration files
```
---
## Testing
Run the comprehensive test suite to validate installation:
```bash
# Run all tests
./run_tests.sh
# Quick validation
./run_tests.sh --quick
# Specific test category
python tests/master_test_script.py --category scope_reversal
```
**Test Categories:**
- Scope reversal detection (103 patterns)
- Template pattern validation
- Context classification
- Negation detection
- Output formatting
- UI consistency
- Pipeline integration
---
## Configuration
Customize pipeline behavior in `configs/pipeline_config.yaml`:
```yaml
pipeline:
  confidence_thresholds:
    curated_templates: 0.3  # Lower threshold for template matches
    general_patterns: 0.5   # Higher threshold for general patterns
  proximity_weighting:
    max_boost: 0.3          # Confidence boost for nearby matches
models:
  disease_model: "models/pretrained/Disease"
  chemical_model: "models/pretrained/Chemical"
  gene_model: "models/pretrained/Gene"
  spacy_model: "en_core_web_sm"
  biomedical_model: "en_ner_bc5cdr_md"
```
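
One plausible reading of how these settings interact is sketched below: a match is kept when its confidence, plus a proximity boost capped at `max_boost`, reaches the threshold for its source. The linear decay, the 10-token window, and the function names are assumptions for illustration only; the README does not describe the actual scoring logic.

```python
# Values copied from the YAML above; in practice they would be loaded from
# configs/pipeline_config.yaml with a YAML parser.
CONFIG = {
    "confidence_thresholds": {"curated_templates": 0.3, "general_patterns": 0.5},
    "proximity_weighting": {"max_boost": 0.3},
}


def proximity_boost(distance_tokens: int, window: int = 10) -> float:
    """Hypothetical linear decay: full boost at distance 0, none at `window`."""
    max_boost = CONFIG["proximity_weighting"]["max_boost"]
    if distance_tokens >= window:
        return 0.0
    return max_boost * (1 - distance_tokens / window)


def accept_match(base_confidence: float, source: str, distance_tokens: int) -> bool:
    """Keep a match if its boosted confidence reaches the threshold for its source."""
    threshold = CONFIG["confidence_thresholds"][source]
    boosted = min(1.0, base_confidence + proximity_boost(distance_tokens))
    return boosted >= threshold


# A 0.35-confidence match passes as a template hit but not as a general
# pattern, unless a nearby match boosts it.
print(accept_match(0.35, "curated_templates", distance_tokens=20))
print(accept_match(0.35, "general_patterns", distance_tokens=20))
print(accept_match(0.35, "general_patterns", distance_tokens=0))
```

Under this reading, lowering `curated_templates` makes the pipeline trust template hits more readily, while `max_boost` limits how far context can rescue a weak general-pattern match.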
---
## Performance
- **Processing Speed**: ~100 clinical notes in <1 minute
- **Memory Usage**: ~2GB for typical workloads
- **Accuracy**: 95%+ for medical entity detection
- **Models**: 3 BioBERT models (~1.6GB total)
---
## Documentation
- **Installation Guide**: Complete setup instructions
- **API Reference**: Python API documentation
- **Template Guide**: How to customize medical term templates
- **Configuration**: Pipeline configuration options
- **Examples**: Sample clinical text processing
---
## Contributing
Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
---
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
---
## Acknowledgments
- **spaCy**: Industrial-strength NLP library
- **BioBERT**: Pre-trained biomedical language models
- **Hugging Face**: Model hosting and transformers
- **Streamlit**: Interactive web application framework
---
## Contact
For questions, issues, or collaboration:
- **GitHub Issues**: [Report bugs or request features](https://github.com/datagodzilla/medical-nlp-lean/issues)
- **Documentation**: See project wiki for detailed guides
---
**Medical NLP Pipeline** - Extract insights from clinical text with confidence!