https://github.com/datagodzilla/medical-nlp-lean

# Medical NLP - Named Entity Recognition Pipeline

[![Python](https://img.shields.io/badge/Python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Status](https://img.shields.io/badge/Status-Production%20Ready-brightgreen.svg)]()

A production-ready Medical Named Entity Recognition (NER) pipeline for extracting biomedical entities from clinical text using spaCy, BioBERT, and advanced template-based pattern matching.

---

## 🌟 Features

- **Comprehensive Entity Detection**: Diseases, genes, proteins, chemicals, and anatomical terms
- **Advanced Context Analysis**: Identifies negated, historical, family history, uncertain, and confirmed conditions
- **Template-Based Matching**: 57,476 curated medical terms plus context patterns, spread across 6 specialized templates
- **BioBERT Integration**: Three pretrained BioBERT models (disease, chemical, gene) for biomedical entity detection
- **Dual Interface**: Command-line tool and interactive Streamlit web application
- **Rich Output**: 15-column Excel reports with visualizations and JSON export
- **Scope Reversal Detection**: Handles complex negation patterns ("no fever but has cough")
- **Production Ready**: Comprehensive test suite and robust error handling

---

## 🚀 Quick Start

### Installation

```bash
# Clone the repository
git clone https://github.com/datagodzilla/medical-nlp-lean.git
cd medical-nlp-lean

# Create conda environment
conda env create -f py311_bionlp_environment.yml

# Activate environment
conda activate py311_bionlp

# Install package
pip install -e .

# Download required spaCy models
python -m spacy download en_core_web_sm
python -m spacy download en_ner_bc5cdr_md
```

### Basic Usage

**Command Line:**
```bash
# Run NER pipeline on default input
./run_ner_pipeline.sh --run

# Process custom file
./run_ner_pipeline.sh --input data/my_clinical_notes.xlsx --run
```

**Web Interface:**
```bash
# Launch Streamlit app
./run_app.sh

# Opens at http://localhost:8501
```

**Python API:**
```python
from src.enhanced_medical_ner_predictor import MedicalNERPredictor

# Initialize predictor
predictor = MedicalNERPredictor()

# Process text
text = "Patient denies chest pain but reports shortness of breath."
results = predictor.process_text(text)

# Access detected entities
print(results['detected_diseases'])
print(results['negated_entities'])
print(results['confirmed_entities'])
```

---

## 📊 Output

The pipeline generates Excel reports with **15 columns**, including:

| Column | Description |
|--------|-------------|
| **Visualization** | HTML entity highlighting with color-coded labels |
| **Detected Diseases** | Identified disease/condition entities |
| **Disease Count** | Total number of diseases detected |
| **Detected Genes** | Identified gene/protein entities |
| **Gene Count** | Total number of genes detected |
| **Negated Entities** | Conditions explicitly denied or absent |
| **Historical Entities** | Past medical history mentions |
| **Family Entities** | Family medical history |
| **Uncertain Entities** | Possible or speculative conditions |
| **Confirmed Entities** | Explicitly confirmed conditions |
| **Section Categories** | Clinical note sections (Chief Complaint, Assessment, Plan, etc.) |
| **JSON Export** | Complete structured data for all entities |

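As a rough illustration of how one report row could be assembled, the sketch below fills a few of the columns from the table above. The `build_report_row` function, the `"; "` separator, and the JSON layout are assumptions made for this example, not the pipeline's actual code.

```python
import json


def build_report_row(diseases, genes, negated, confirmed):
    """Assemble one report row keyed by column names from the table above."""
    entities = {
        "Detected Diseases": diseases,
        "Detected Genes": genes,
        "Negated Entities": negated,
        "Confirmed Entities": confirmed,
    }
    # Flatten each entity list into one cell (the separator is an assumption).
    row = {name: "; ".join(values) for name, values in entities.items()}
    row["Disease Count"] = len(diseases)
    row["Gene Count"] = len(genes)
    # Keep the structured form alongside the flattened cells.
    row["JSON Export"] = json.dumps(entities, sort_keys=True)
    return row


row = build_report_row(
    diseases=["pneumonia", "diabetes"],
    genes=["BRCA1"],
    negated=["diabetes"],
    confirmed=["pneumonia"],
)
print(row["Detected Diseases"], "|", row["Disease Count"])
```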
---

## 🎯 Key Capabilities

### Medical Entity Recognition
- **Diseases & Conditions**: Diabetes, hypertension, pneumonia, cancer types, etc.
- **Genes & Proteins**: BRCA1, TP53, kinesin, hemoglobin, etc.
- **Chemicals & Drugs**: Aspirin, metformin, chemotherapy agents, etc.
- **Anatomical Terms**: Heart, lungs, liver, blood vessels, etc.

### Context Classification
- **Negation Detection**: "No evidence of diabetes", "denies chest pain"
- **Historical Context**: "History of hypertension", "previous stroke"
- **Family History**: "Mother has breast cancer", "family history of diabetes"
- **Uncertainty**: "Possible pneumonia", "rule out myocardial infarction"
- **Scope Reversal**: "No fever but has cough" (correctly identifies cough as confirmed)

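The scope-reversal idea can be sketched with a simple clause split: a negation trigger only affects entities in its own clause, and a word such as "but" ends that scope. This is a minimal illustration under stated assumptions, not the pipeline's implementation. The trigger lists and the `classify_entities` function are hypothetical stand-ins for the rule templates.

```python
import re

# Hypothetical trigger lists; the real patterns live in the rule templates.
NEGATION_TRIGGERS = ("no evidence of", "no", "denies", "without")
REVERSAL_WORDS = ("but", "however", "although")

_SPLIT_RE = re.compile(r"\b(?:%s)\b" % "|".join(REVERSAL_WORDS))
_TRIGGER_RE = re.compile(
    r"\b(?:%s)\b" % "|".join(re.escape(t) for t in NEGATION_TRIGGERS)
)


def classify_entities(sentence, entities):
    """Label each entity 'negated' or 'confirmed' within its own clause."""
    labels = {}
    # Splitting at reversal words stops a negation from leaking past them.
    for clause in _SPLIT_RE.split(sentence.lower()):
        trigger = _TRIGGER_RE.search(clause)
        for entity in entities:
            pos = clause.find(entity.lower())
            if pos == -1:
                continue
            # Only a trigger that precedes the entity negates it.
            negated = trigger is not None and trigger.start() < pos
            labels[entity] = "negated" if negated else "confirmed"
    return labels


print(classify_entities("No fever but has cough", ["fever", "cough"]))
# {'fever': 'negated', 'cough': 'confirmed'}
```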
### Template System
- **target_rules_template.xlsx**: 57,476 curated medical terms
- **negated_rules_template.xlsx**: 99 negation patterns
- **historical_rules_template.xlsx**: 82 historical context patterns
- **family_rules_template.xlsx**: 79 family history patterns
- **uncertainty_rules_template.xlsx**: 48 uncertainty patterns
- **confirmed_rules_template.xlsx**: 138 confirmation patterns

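To show how a curated term list might drive matching, here is a small sketch that compiles terms into one case-insensitive regex, longest term first, so "breast cancer" wins over "cancer". The `TARGET_TERMS` list is a hypothetical stand-in for rows read from `target_rules_template.xlsx`, whose actual column layout is not documented here. With tens of thousands of terms, a dedicated matcher such as spaCy's `PhraseMatcher` would likely be a better fit than a single regex.

```python
import re

# Hypothetical stand-in for rows loaded from target_rules_template.xlsx.
TARGET_TERMS = [
    ("breast cancer", "DISEASE"),
    ("cancer", "DISEASE"),
    ("BRCA1", "GENE"),
    ("metformin", "CHEMICAL"),
]


def build_matcher(terms):
    """Compile one case-insensitive regex, longest terms first."""
    ordered = sorted(terms, key=lambda item: len(item[0]), reverse=True)
    labels = {term.lower(): label for term, label in ordered}
    pattern = "|".join(re.escape(term) for term, _ in ordered)
    return re.compile(r"\b(?:%s)\b" % pattern, re.IGNORECASE), labels


def match_terms(text, terms=TARGET_TERMS):
    """Return (matched text, label, start offset) for each hit."""
    regex, labels = build_matcher(terms)
    return [
        (m.group(0), labels[m.group(0).lower()], m.start())
        for m in regex.finditer(text)
    ]


hits = match_terms("Mother has breast cancer; BRCA1 testing was ordered.")
print(hits)
```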
---

## 🏗️ Architecture

```
medical-nlp-lean/
├── src/                    # Core Python modules
│   ├── enhanced_medical_ner_predictor.py
│   └── performance_analyzer.py
├── app/                    # Streamlit web application
│   └── medical_nlp_app.py
├── data/
│   ├── external/           # Template files
│   └── raw/                # Input data
├── models/
│   └── pretrained/         # BioBERT models (~1.6GB)
├── output/                 # Generated results
│   ├── results/            # Excel outputs
│   ├── visualizations/     # PNG visualizations
│   └── logs/               # Execution logs
├── tests/                  # Comprehensive test suite
└── configs/                # Configuration files
```

---

## 🧪 Testing

Run the test suite to validate the installation:

```bash
# Run all tests
./run_tests.sh

# Quick validation
./run_tests.sh --quick

# Specific test category
python tests/master_test_script.py --category scope_reversal
```

**Test Categories:**
- Scope reversal detection (103 patterns)
- Template pattern validation
- Context classification
- Negation detection
- Output formatting
- UI consistency
- Pipeline integration

---

## ⚙️ Configuration

Customize pipeline behavior in `configs/pipeline_config.yaml`:

```yaml
pipeline:
  confidence_thresholds:
    curated_templates: 0.3  # Lower threshold for template matches
    general_patterns: 0.5   # Higher threshold for general patterns
  proximity_weighting:
    max_boost: 0.3          # Confidence boost for nearby matches

models:
  disease_model: "models/pretrained/Disease"
  chemical_model: "models/pretrained/Chemical"
  gene_model: "models/pretrained/Gene"
  spacy_model: "en_core_web_sm"
  biomedical_model: "en_ner_bc5cdr_md"
```
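One way these settings could interact is sketched below. It is only an illustration: the linear decay rule, the `window` parameter, and the `boosted_confidence` and `accept` functions are assumptions made for this example, since the pipeline's actual scoring logic is not described here.

```python
# Values copied from the config above; the scoring rule itself is assumed.
CONFIG = {
    "confidence_thresholds": {"curated_templates": 0.3, "general_patterns": 0.5},
    "proximity_weighting": {"max_boost": 0.3},
}


def boosted_confidence(base, distance, window=10, config=CONFIG):
    """Add a boost that decays linearly with token distance (assumed rule)."""
    max_boost = config["proximity_weighting"]["max_boost"]
    closeness = max(0.0, 1.0 - distance / window)
    return min(1.0, base + max_boost * closeness)


def accept(base, source, distance, config=CONFIG):
    """Keep a match if its boosted confidence reaches the source threshold."""
    threshold = config["confidence_thresholds"][source]
    return boosted_confidence(base, distance, config=config) >= threshold


# A weak match passes the lower template threshold without any boost...
print(accept(0.35, "curated_templates", distance=10))
# ...but a general pattern needs a nearby match to clear its higher threshold.
print(accept(0.35, "general_patterns", distance=10))
print(accept(0.35, "general_patterns", distance=0))
```

Under this assumed rule, lowering `curated_templates` admits more template matches, while raising `max_boost` lets nearby context rescue more borderline general-pattern matches.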

---

## 📈 Performance

- **Processing Speed**: ~100 clinical notes in <1 minute
- **Memory Usage**: ~2GB for typical workloads
- **Accuracy**: 95%+ for medical entity detection
- **Models**: 3 BioBERT models (~1.6GB total)

---

## 📚 Documentation

- **Installation Guide**: Complete setup instructions
- **API Reference**: Python API documentation
- **Template Guide**: How to customize medical term templates
- **Configuration**: Pipeline configuration options
- **Examples**: Sample clinical text processing

---

## 🤝 Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

---

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## 🙏 Acknowledgments

- **spaCy**: Industrial-strength NLP library
- **BioBERT**: Pre-trained biomedical language models
- **Hugging Face**: Model hosting and transformers
- **Streamlit**: Interactive web application framework

---

## 📧 Contact

For questions, issues, or collaboration:

- **GitHub Issues**: [Report bugs or request features](https://github.com/datagodzilla/medical-nlp-lean/issues)
- **Documentation**: See project wiki for detailed guides

---

**Medical NLP Pipeline** - Extract insights from clinical text with confidence! 🧬