{"id":33282537,"url":"https://github.com/datagodzilla/medical-nlp-lean","last_synced_at":"2026-04-29T21:06:39.728Z","repository":{"id":324267653,"uuid":"1096635938","full_name":"datagodzilla/medical-nlp-lean","owner":"datagodzilla","description":"Medical Entities Recognition","archived":false,"fork":false,"pushed_at":"2025-11-14T18:11:28.000Z","size":1042,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-14T20:25:49.206Z","etag":null,"topics":["biobert","biomedical","clinical-nlp","clinical-text","entity-extraction","healthcare","machine-learning","medical","medical-informatics","named-entity-recognition","ner","nlp","python","spacy","streamlit"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datagodzilla.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-14T18:05:17.000Z","updated_at":"2025-11-14T18:11:12.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/datagodzilla/medical-nlp-lean","commit_stats":null,"previous_names":["datagodzilla/medical-nlp-lean"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/datagodzilla/medical-nlp-lean","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datagodzilla%2Fmedical-nlp-lean","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datagodz
illa%2Fmedical-nlp-lean/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datagodzilla%2Fmedical-nlp-lean/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datagodzilla%2Fmedical-nlp-lean/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/datagodzilla","download_url":"https://codeload.github.com/datagodzilla/medical-nlp-lean/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datagodzilla%2Fmedical-nlp-lean/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32443620,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-29T20:22:27.477Z","status":"ssl_error","status_checked_at":"2026-04-29T20:22:26.507Z","response_time":110,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["biobert","biomedical","clinical-nlp","clinical-text","entity-extraction","healthcare","machine-learning","medical","medical-informatics","named-entity-recognition","ner","nlp","python","spacy","streamlit"],"created_at":"2025-11-17T14:00:39.832Z","updated_at":"2026-04-29T21:06:39.705Z","avatar_url":"https://github.com/datagodzilla.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Medical NLP - Named Entity Recognition 
Pipeline\n\n[![Python](https://img.shields.io/badge/Python-3.11+-blue.svg)](https://www.python.org/downloads/)\n[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)\n[![Status](https://img.shields.io/badge/Status-Production%20Ready-brightgreen.svg)]()\n\nA production-ready Medical Named Entity Recognition (NER) pipeline for extracting biomedical entities from clinical text using spaCy, BioBERT, and advanced template-based pattern matching.\n\n---\n\n## 🌟 Features\n\n- **Comprehensive Entity Detection**: Diseases, genes, proteins, chemicals, and anatomical terms\n- **Advanced Context Analysis**: Identifies negated, historical, family history, uncertain, and confirmed conditions\n- **Template-Based Matching**: 57,476+ curated medical terms across 6 specialized templates\n- **BioBERT Integration**: State-of-the-art biomedical language models for high accuracy\n- **Dual Interface**: Command-line tool and interactive Streamlit web application\n- **Rich Output**: 15-column Excel reports with visualizations and JSON export\n- **Scope Reversal Detection**: Handles complex negation patterns (\"no fever but has cough\")\n- **Production Ready**: Comprehensive test suite and robust error handling\n\n---\n\n## 🚀 Quick Start\n\n### Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/yourusername/medical-nlp-lean.git\ncd medical-nlp-lean\n\n# Create conda environment\nconda env create -f py311_bionlp_environment.yml\n\n# Activate environment\nconda activate py311_bionlp\n\n# Install package\npip install -e .\n\n# Download required spaCy models\npython -m spacy download en_core_web_sm\npython -m spacy download en_ner_bc5cdr_md\n```\n\n### Basic Usage\n\n**Command Line:**\n```bash\n# Run NER pipeline on default input\n./run_ner_pipeline.sh --run\n\n# Process custom file\n./run_ner_pipeline.sh --input data/my_clinical_notes.xlsx --run\n```\n\n**Web Interface:**\n```bash\n# Launch Streamlit app\n./run_app.sh\n\n# Opens at 
http://localhost:8501\n```\n\n**Python API:**\n```python\nfrom src.enhanced_medical_ner_predictor import MedicalNERPredictor\n\n# Initialize predictor\npredictor = MedicalNERPredictor()\n\n# Process text\ntext = \"Patient denies chest pain but reports shortness of breath.\"\nresults = predictor.process_text(text)\n\n# Access detected entities\nprint(results['detected_diseases'])\nprint(results['negated_entities'])\nprint(results['confirmed_entities'])\n```\n\n---\n\n## 📊 Output\n\nThe pipeline generates comprehensive Excel reports with **15 columns**:\n\n| Column | Description |\n|--------|-------------|\n| **Visualization** | HTML entity highlighting with color-coded labels |\n| **Detected Diseases** | Identified disease/condition entities |\n| **Disease Count** | Total number of diseases detected |\n| **Detected Genes** | Identified gene/protein entities |\n| **Gene Count** | Total number of genes detected |\n| **Negated Entities** | Conditions explicitly denied or absent |\n| **Historical Entities** | Past medical history mentions |\n| **Family Entities** | Family medical history |\n| **Uncertain Entities** | Possible or speculative conditions |\n| **Confirmed Entities** | Explicitly confirmed conditions |\n| **Section Categories** | Clinical note sections (Chief Complaint, Assessment, Plan, etc.) 
|\n| **JSON Export** | Complete structured data for all entities |\n\n---\n\n## 🎯 Key Capabilities\n\n### Medical Entity Recognition\n- **Diseases \u0026 Conditions**: Diabetes, hypertension, pneumonia, cancer types, etc.\n- **Genes \u0026 Proteins**: BRCA1, TP53, kinesin, hemoglobin, etc.\n- **Chemicals \u0026 Drugs**: Aspirin, metformin, chemotherapy agents, etc.\n- **Anatomical Terms**: Heart, lungs, liver, blood vessels, etc.\n\n### Context Classification\n- **Negation Detection**: \"No evidence of diabetes\", \"denies chest pain\"\n- **Historical Context**: \"History of hypertension\", \"previous stroke\"\n- **Family History**: \"Mother has breast cancer\", \"family history of diabetes\"\n- **Uncertainty**: \"Possible pneumonia\", \"rule out myocardial infarction\"\n- **Scope Reversal**: \"No fever but has cough\" (correctly identifies cough as confirmed)\n\n### Template System\n- **target_rules_template.xlsx**: 57,476 curated medical terms\n- **negated_rules_template.xlsx**: 99 negation patterns\n- **historical_rules_template.xlsx**: 82 historical context patterns\n- **family_rules_template.xlsx**: 79 family history patterns\n- **uncertainty_rules_template.xlsx**: 48 uncertainty patterns\n- **confirmed_rules_template.xlsx**: 138 confirmation patterns\n\n---\n\n## 🏗️ Architecture\n\n```\nmedical-nlp-lean/\n├── src/                        # Core Python modules\n│   ├── enhanced_medical_ner_predictor.py\n│   └── performance_analyzer.py\n├── app/                        # Streamlit web application\n│   └── medical_nlp_app.py\n├── data/\n│   ├── external/              # Template files\n│   └── raw/                   # Input data\n├── models/\n│   └── pretrained/            # BioBERT models (~1.6GB)\n├── output/                    # Generated results\n│   ├── results/              # Excel outputs\n│   ├── visualizations/       # PNG visualizations\n│   └── logs/                 # Execution logs\n├── tests/                     # Comprehensive test suite\n└── configs/ 
                  # Configuration files\n```\n\n---\n\n## 🧪 Testing\n\nRun the comprehensive test suite to validate installation:\n\n```bash\n# Run all tests\n./run_tests.sh\n\n# Quick validation\n./run_tests.sh --quick\n\n# Specific test category\npython tests/master_test_script.py --category scope_reversal\n```\n\n**Test Categories:**\n- Scope reversal detection (103 patterns)\n- Template pattern validation\n- Context classification\n- Negation detection\n- Output formatting\n- UI consistency\n- Pipeline integration\n\n---\n\n## ⚙️ Configuration\n\nCustomize pipeline behavior in `configs/pipeline_config.yaml`:\n\n```yaml\npipeline:\n  confidence_thresholds:\n    curated_templates: 0.3    # Lower threshold for template matches\n    general_patterns: 0.5     # Higher threshold for general patterns\n  proximity_weighting:\n    max_boost: 0.3           # Confidence boost for nearby matches\n\nmodels:\n  disease_model: \"models/pretrained/Disease\"\n  chemical_model: \"models/pretrained/Chemical\"\n  gene_model: \"models/pretrained/Gene\"\n  spacy_model: \"en_core_web_sm\"\n  biomedical_model: \"en_ner_bc5cdr_md\"\n```\n\n---\n\n## 📈 Performance\n\n- **Processing Speed**: ~100 clinical notes in \u003c1 minute\n- **Memory Usage**: ~2GB for typical workloads\n- **Accuracy**: 95%+ for medical entity detection\n- **Models**: 3 BioBERT models (~1.6GB total)\n\n---\n\n## 📚 Documentation\n\n- **Installation Guide**: Complete setup instructions\n- **API Reference**: Python API documentation\n- **Template Guide**: How to customize medical term templates\n- **Configuration**: Pipeline configuration options\n- **Examples**: Sample clinical text processing\n\n---\n\n## 🤝 Contributing\n\nContributions are welcome! Please:\n\n1. Fork the repository\n2. Create a feature branch (`git checkout -b feature/amazing-feature`)\n3. Commit your changes (`git commit -m 'Add amazing feature'`)\n4. Push to the branch (`git push origin feature/amazing-feature`)\n5. 
Open a Pull Request\n\n---\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n---\n\n## 🙏 Acknowledgments\n\n- **spaCy**: Industrial-strength NLP library\n- **BioBERT**: Pre-trained biomedical language models\n- **Hugging Face**: Model hosting and transformers\n- **Streamlit**: Interactive web application framework\n\n---\n\n## 📧 Contact\n\nFor questions, issues, or collaboration:\n\n- **GitHub Issues**: [Report bugs or request features](https://github.com/datagodzilla/medical-nlp-lean/issues)\n- **Documentation**: See project wiki for detailed guides\n\n---\n\n**Medical NLP Pipeline** - Extract insights from clinical text with confidence! 🧬\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatagodzilla%2Fmedical-nlp-lean","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatagodzilla%2Fmedical-nlp-lean","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatagodzilla%2Fmedical-nlp-lean/lists"}