https://github.com/datagodzilla/medical-nlp-lean

Medical Entities Recognition
https://github.com/datagodzilla/medical-nlp-lean

biobert biomedical clinical-nlp clinical-text entity-extraction healthcare machine-learning medical medical-informatics named-entity-recognition ner nlp python spacy streamlit

Last synced: about 2 months ago
JSON representation

Medical Entities Recognition

Host: GitHub
URL: https://github.com/datagodzilla/medical-nlp-lean
Owner: datagodzilla
License: mit
Created: 2025-11-14T18:05:17.000Z (7 months ago)
Default Branch: main
Last Pushed: 2025-11-14T18:11:28.000Z (7 months ago)
Last Synced: 2025-11-14T20:25:49.206Z (7 months ago)
Topics: biobert, biomedical, clinical-nlp, clinical-text, entity-extraction, healthcare, machine-learning, medical, medical-informatics, named-entity-recognition, ner, nlp, python, spacy, streamlit
Language: Python
Size: 1020 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# Medical NLP - Named Entity Recognition Pipeline

[![Python](https://img.shields.io/badge/Python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Status](https://img.shields.io/badge/Status-Production%20Ready-brightgreen.svg)]()

A production-ready Medical Named Entity Recognition (NER) pipeline for extracting biomedical entities from clinical text using spaCy, BioBERT, and advanced template-based pattern matching.

---

## 🌟 Features

- **Comprehensive Entity Detection**: Diseases, genes, proteins, chemicals, and anatomical terms
- **Advanced Context Analysis**: Identifies negated, historical, family history, uncertain, and confirmed conditions
- **Template-Based Matching**: 57,476+ curated medical terms across 6 specialized templates
- **BioBERT Integration**: State-of-the-art biomedical language models for high accuracy
- **Dual Interface**: Command-line tool and interactive Streamlit web application
- **Rich Output**: 15-column Excel reports with visualizations and JSON export
- **Scope Reversal Detection**: Handles complex negation patterns ("no fever but has cough")
- **Production Ready**: Comprehensive test suite and robust error handling

---

## 🚀 Quick Start

### Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/medical-nlp-lean.git
cd medical-nlp-lean

# Create conda environment
conda env create -f py311_bionlp_environment.yml

# Activate environment
conda activate py311_bionlp

# Install package
pip install -e .

# Download required spaCy models
python -m spacy download en_core_web_sm
python -m spacy download en_ner_bc5cdr_md
```

### Basic Usage

**Command Line:**
```bash
# Run NER pipeline on default input
./run_ner_pipeline.sh --run

# Process custom file
./run_ner_pipeline.sh --input data/my_clinical_notes.xlsx --run
```

**Web Interface:**
```bash
# Launch Streamlit app
./run_app.sh

# Opens at http://localhost:8501
```

**Python API:**
```python
from src.enhanced_medical_ner_predictor import MedicalNERPredictor

# Initialize predictor
predictor = MedicalNERPredictor()

# Process text
text = "Patient denies chest pain but reports shortness of breath."
results = predictor.process_text(text)

# Access detected entities
print(results['detected_diseases'])
print(results['negated_entities'])
print(results['confirmed_entities'])
```

---

## 📊 Output

The pipeline generates comprehensive Excel reports with **15 columns**:

| Column | Description |
|--------|-------------|
| **Visualization** | HTML entity highlighting with color-coded labels |
| **Detected Diseases** | Identified disease/condition entities |
| **Disease Count** | Total number of diseases detected |
| **Detected Genes** | Identified gene/protein entities |
| **Gene Count** | Total number of genes detected |
| **Negated Entities** | Conditions explicitly denied or absent |
| **Historical Entities** | Past medical history mentions |
| **Family Entities** | Family medical history |
| **Uncertain Entities** | Possible or speculative conditions |
| **Confirmed Entities** | Explicitly confirmed conditions |
| **Section Categories** | Clinical note sections (Chief Complaint, Assessment, Plan, etc.) |
| **JSON Export** | Complete structured data for all entities |

---

## 🎯 Key Capabilities

### Medical Entity Recognition
- **Diseases & Conditions**: Diabetes, hypertension, pneumonia, cancer types, etc.
- **Genes & Proteins**: BRCA1, TP53, kinesin, hemoglobin, etc.
- **Chemicals & Drugs**: Aspirin, metformin, chemotherapy agents, etc.
- **Anatomical Terms**: Heart, lungs, liver, blood vessels, etc.

### Context Classification
- **Negation Detection**: "No evidence of diabetes", "denies chest pain"
- **Historical Context**: "History of hypertension", "previous stroke"
- **Family History**: "Mother has breast cancer", "family history of diabetes"
- **Uncertainty**: "Possible pneumonia", "rule out myocardial infarction"
- **Scope Reversal**: "No fever but has cough" (correctly identifies cough as confirmed)

### Template System
- **target_rules_template.xlsx**: 57,476 curated medical terms
- **negated_rules_template.xlsx**: 99 negation patterns
- **historical_rules_template.xlsx**: 82 historical context patterns
- **family_rules_template.xlsx**: 79 family history patterns
- **uncertainty_rules_template.xlsx**: 48 uncertainty patterns
- **confirmed_rules_template.xlsx**: 138 confirmation patterns

---

## 🏗️ Architecture

```
medical-nlp-lean/
├── src/ # Core Python modules
│ ├── enhanced_medical_ner_predictor.py
│ └── performance_analyzer.py
├── app/ # Streamlit web application
│ └── medical_nlp_app.py
├── data/
│ ├── external/ # Template files
│ └── raw/ # Input data
├── models/
│ └── pretrained/ # BioBERT models (~1.6GB)
├── output/ # Generated results
│ ├── results/ # Excel outputs
│ ├── visualizations/ # PNG visualizations
│ └── logs/ # Execution logs
├── tests/ # Comprehensive test suite
└── configs/ # Configuration files
```

---

## 🧪 Testing

Run the comprehensive test suite to validate installation:

```bash
# Run all tests
./run_tests.sh

# Quick validation
./run_tests.sh --quick

# Specific test category
python tests/master_test_script.py --category scope_reversal
```

**Test Categories:**
- Scope reversal detection (103 patterns)
- Template pattern validation
- Context classification
- Negation detection
- Output formatting
- UI consistency
- Pipeline integration

---

## ⚙️ Configuration

Customize pipeline behavior in `configs/pipeline_config.yaml`:

```yaml
pipeline:
confidence_thresholds:
curated_templates: 0.3 # Lower threshold for template matches
general_patterns: 0.5 # Higher threshold for general patterns
proximity_weighting:
max_boost: 0.3 # Confidence boost for nearby matches

models:
disease_model: "models/pretrained/Disease"
chemical_model: "models/pretrained/Chemical"
gene_model: "models/pretrained/Gene"
spacy_model: "en_core_web_sm"
biomedical_model: "en_ner_bc5cdr_md"
```

---

## 📈 Performance

- **Processing Speed**: ~100 clinical notes in <1 minute
- **Memory Usage**: ~2GB for typical workloads
- **Accuracy**: 95%+ for medical entity detection
- **Models**: 3 BioBERT models (~1.6GB total)

---

## 📚 Documentation

- **Installation Guide**: Complete setup instructions
- **API Reference**: Python API documentation
- **Template Guide**: How to customize medical term templates
- **Configuration**: Pipeline configuration options
- **Examples**: Sample clinical text processing

---

## 🤝 Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

---

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## 🙏 Acknowledgments

- **spaCy**: Industrial-strength NLP library
- **BioBERT**: Pre-trained biomedical language models
- **Hugging Face**: Model hosting and transformers
- **Streamlit**: Interactive web application framework

---

## 📧 Contact

For questions, issues, or collaboration:

- **GitHub Issues**: [Report bugs or request features](https://github.com/yourusername/medical-nlp-lean/issues)
- **Documentation**: See project wiki for detailed guides

---

**Medical NLP Pipeline** - Extract insights from clinical text with confidence! 🧬

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/datagodzilla/medical-nlp-lean

Awesome Lists containing this project

README