Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/madhanmohanreddy2301/meditrack
This project focuses on leveraging Natural Language Processing (NLP) techniques to identify and extract entities from healthcare data, such as diseases and treatments. It employs Conditional Random Fields (CRF) for entity recognition, achieving high accuracy in detecting relevant medical terms and relationships.
crf-model ml natural-language-processing nlp nltk
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/madhanmohanreddy2301/meditrack
- Owner: MadhanMohanReddy2301
- License: apache-2.0
- Created: 2024-12-05T12:01:33.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2024-12-05T13:29:29.000Z (about 1 month ago)
- Last Synced: 2024-12-05T14:29:41.602Z (about 1 month ago)
- Topics: crf-model, ml, natural-language-processing, nlp, nltk
- Language: Jupyter Notebook
- Homepage:
- Size: 130 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# MediTrack
## Identifying Entities in Healthcare Data
This project demonstrates the application of Natural Language Processing (NLP) techniques for identifying and extracting entities in healthcare data. Using a labeled corpus, it explores tasks such as tokenization, data preprocessing, exploratory data analysis (EDA), feature extraction, and model building with Conditional Random Fields (CRF). The goal is to identify diseases (D) and treatments (T) effectively.
## Project Workflow
### 1. Workspace Setup
- Import necessary libraries and install required packages to enable NLP and data processing.

### 2. Data Preprocessing
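
A minimal sketch of the preprocessing step, not the notebook's actual function. It assumes the corpus stores one token per line with a blank line between sentences, and that labels are stored the same way; `build_sentences` and `check_alignment` are hypothetical names, and the toy input is made up for illustration.

```python
from typing import Iterable, List


def build_sentences(lines: Iterable[str]) -> List[str]:
    """Join one-token-per-line input into space-separated sentences.

    Assumption: a blank line marks the end of a sentence.
    """
    sentences, current = [], []
    for raw in lines:
        token = raw.strip()
        if token:
            current.append(token)
        elif current:
            sentences.append(" ".join(current))
            current = []
    if current:  # flush the last sentence if there is no trailing blank line
        sentences.append(" ".join(current))
    return sentences


def check_alignment(sentences: List[str], labels: List[str]) -> List[int]:
    """Return indices of sentences whose token count differs from their label count."""
    if len(sentences) != len(labels):
        raise ValueError("sentence and label counts differ")
    return [
        i for i, (sent, labs) in enumerate(zip(sentences, labels))
        if len(sent.split()) != len(labs.split())
    ]


# Toy input, made up for illustration only.
token_lines = ["Radiotherapy", "for", "retinoblastoma", "", "No", "treatment", "given", ""]
label_lines = ["T", "O", "D", "", "O", "O", "O", ""]

sentences = build_sentences(token_lines)
labels = build_sentences(label_lines)
mismatches = check_alignment(sentences, labels)
```

An empty `mismatches` list means every sentence has exactly one label per token, which the later CRF steps rely on.
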
- **Sentence Construction:** Transform individual words into structured sentences using a custom function.
- **Label Analysis:** Analyze and ensure consistency between sentences and their associated labels.

### 3. Exploratory Data Analysis (EDA)
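
As a rough illustration of the label-distribution check, the sketch below counts each label's share of all tokens. It assumes labels are stored as space-separated strings, one string per sentence; `label_distribution` is a hypothetical helper and the toy labels are made up.

```python
from collections import Counter
from typing import Dict, List


def label_distribution(label_sentences: List[str]) -> Dict[str, float]:
    """Return each label's share of all tokens, most frequent label first."""
    counts = Counter(tag for sent in label_sentences for tag in sent.split())
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.most_common()}


# Toy labels, made up for illustration only.
toy_labels = ["O O D D O", "T O O O O"]
shares = label_distribution(toy_labels)
```

On real clinical text the `O` share is typically far larger than in this toy sample, which is why a support-aware metric matters later on.
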
- **Label Distribution:** Observe the quantiles and distribution of labels (D, T, O).
- **Insights:** Highlight the prevalence of "Others" (O) labels in the dataset.

### 4. Concept Identification
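
The frequency analysis in this step can be sketched with the standard library alone. The example assumes tokens have already been tagged by some PoS tagger that emits `NOUN` and `PROPN` tags; the tagger itself is not shown, lower-casing before counting is an assumption, and `top_concepts` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, List, Tuple

TaggedToken = Tuple[str, str]  # (token, part-of-speech tag)


def top_concepts(tagged: Iterable[TaggedToken], n: int = 25) -> List[Tuple[str, int]]:
    """Count lower-cased tokens tagged NOUN or PROPN and return the n most common."""
    counts = Counter(
        token.lower() for token, pos in tagged if pos in {"NOUN", "PROPN"}
    )
    return counts.most_common(n)


# Hand-tagged toy input; in practice the tags would come from a PoS tagger.
toy_tagged = [
    ("Patients", "NOUN"), ("received", "VERB"), ("radiotherapy", "NOUN"),
    ("for", "ADP"), ("retinoblastoma", "NOUN"), ("after", "ADP"),
    ("radiotherapy", "NOUN"),
]
top = top_concepts(toy_tagged, n=2)
```
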
- Extract tokens with **NOUN** or **PROPN** PoS tags and analyze their frequency.
- Display the top 25 most common tokens, emphasizing their relevance in healthcare contexts.

### 5. Feature Engineering
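
CRF toolkits such as `sklearn-crfsuite` take one feature dictionary per token. The sketch below shows what such features might look like; the specific feature names (`word.suffix3`, `prev.lower`, and so on) are illustrative assumptions, not the feature set used in the notebook.

```python
from typing import Dict, List, Union

Features = Dict[str, Union[str, bool, float]]


def word2features(tokens: List[str], i: int) -> Features:
    """Describe token i by its own shape and by its immediate neighbours."""
    word = tokens[i]
    features: Features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.suffix3": word[-3:].lower(),
        "word.istitle": word.istitle(),   # capitalization cues
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
    }
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        features["next.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features


def sent2features(sentence: str) -> List[Features]:
    """Build the feature dictionaries for every token of a sentence."""
    tokens = sentence.split()
    return [word2features(tokens, i) for i in range(len(tokens))]


feats = sent2features("Radiotherapy treats retinoblastoma")
```
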
- **CRF Features:** Define and compute features for CRF, including word context, capitalization, and more.

### 6. Model Building
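
A hedged sketch of fitting the model with `sklearn-crfsuite`. The hyperparameters in `CRF_PARAMS` are commonly used starting values, not the ones from the notebook, and `train_crf` is a hypothetical wrapper. The library is imported inside the function so the snippet loads even where the package is not installed.

```python
from typing import Any, Dict, List, Sequence

# Illustrative starting values, not the notebook's actual settings.
CRF_PARAMS: Dict[str, Any] = {
    "algorithm": "lbfgs",
    "c1": 0.1,                      # L1 regularization strength
    "c2": 0.1,                      # L2 regularization strength
    "max_iterations": 100,
    "all_possible_transitions": True,
}


def train_crf(X_train: Sequence[List[dict]], y_train: Sequence[List[str]]):
    """Fit a linear-chain CRF on per-token feature dicts and label sequences."""
    import sklearn_crfsuite  # lazy import: requires the sklearn-crfsuite package

    crf = sklearn_crfsuite.CRF(**CRF_PARAMS)
    crf.fit(X_train, y_train)
    return crf
```

Each element of `X_train` is the list of feature dictionaries for one sentence, and the matching element of `y_train` is that sentence's list of labels. After fitting, `crf.predict(X_test)` returns one label list per sentence.
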
- Build a **Conditional Random Field (CRF)** model for entity recognition.
- Train the model on processed data and evaluate it using performance metrics.

### 7. Evaluation
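
To make the metric concrete, here is a dependency-free sketch of a token-level, support-weighted F1 score. Whether the reported score uses weighted averaging is an assumption; `weighted_f1` is a hypothetical helper and the toy labels are made up.

```python
from collections import Counter
from itertools import chain
from typing import List, Sequence


def weighted_f1(y_true: Sequence[List[str]], y_pred: Sequence[List[str]]) -> float:
    """Per-label token F1, averaged with weights equal to each label's support."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    if not true_flat or len(true_flat) != len(pred_flat):
        raise ValueError("inputs must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_flat, pred_flat):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p where it does not belong
            fn[t] += 1  # missed an occurrence of true label t

    total = 0.0
    for label, support in Counter(true_flat).items():
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        total += f1 * support
    return total / len(true_flat)


# Toy labels, made up for illustration only: one D token is mislabelled as O.
y_true = [["O", "D", "D"], ["T", "O"]]
y_pred = [["O", "D", "O"], ["T", "O"]]
score = weighted_f1(y_true, y_pred)
```
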
- Predict token labels in the test dataset.
- Calculate the **F1 Score** to measure model performance (achieved score: **0.9086**).

### 8. Named Entity Recognition (NER) for Healthcare
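
The custom NER logic is not spelled out in this README. One plausible sketch, assuming a treatment is linked to every disease mentioned in the same sentence, merges consecutive `D` and `T` tokens into phrases and builds a disease-to-treatments lookup. The function names are hypothetical, and the example sentence and its labels are hand-written for illustration, not model output.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def extract_phrases(tokens: List[str], labels: List[str], target: str) -> List[str]:
    """Merge runs of consecutive tokens carrying the target label into phrases."""
    phrases, current = [], []
    for token, label in zip(tokens, labels):
        if label == target:
            current.append(token)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


def disease_treatment_map(
    tagged_sentences: Iterable[Tuple[List[str], List[str]]]
) -> Dict[str, Set[str]]:
    """Map each disease phrase to the treatment phrases found in the same sentence."""
    mapping: Dict[str, Set[str]] = defaultdict(set)
    for tokens, labels in tagged_sentences:
        treatments = extract_phrases(tokens, labels, "T")
        for disease in extract_phrases(tokens, labels, "D"):
            mapping[disease.lower()].update(t.lower() for t in treatments)
    return dict(mapping)


# Hand-labelled toy sentence, for illustration only.
tokens = "Hereditary retinoblastoma is treated with radiotherapy".split()
labels = ["D", "D", "O", "O", "O", "T"]
lookup = disease_treatment_map([(tokens, labels)])
```

Querying `lookup.get("hereditary retinoblastoma", set())` then returns the treatments co-mentioned with that disease.
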
- Identify diseases and treatments using custom NER logic.
- Example: Predict treatments for specific diseases such as *"hereditary retinoblastoma"*.

## Key Results
- **F1 Score:** Achieved a score of 0.9086 for entity recognition.
- Identified treatment for *"hereditary retinoblastoma"* as *"radiotherapy"*.

## Prerequisites
- Python 3.7+
- Libraries: `pandas`, `numpy`, `nltk`, `sklearn-crfsuite`, `matplotlib`
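
A small, optional check that the environment meets these prerequisites. It is only a sketch: `missing_modules` is a hypothetical helper, and the list uses import names, which differ from package names in one case (`sklearn-crfsuite` is imported as `sklearn_crfsuite`).

```python
import sys
from importlib.util import find_spec
from typing import Iterable, List

# Import names for the libraries listed above.
REQUIRED = ["pandas", "numpy", "nltk", "sklearn_crfsuite", "matplotlib"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


python_ok = sys.version_info >= (3, 7)
missing = missing_modules(REQUIRED)
```

If `missing` is non-empty, installing the corresponding packages with `pip` should resolve it.
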