https://github.com/jcaperella29/clinical-text-mining_r_script
A lightweight R script for text mining and harmonizing medical phenotype data. Cleans, standardizes, and maps diagnoses to ICD-10 codes, with clinical annotations for enhanced data usability.
biomedical-data clinical-informatics data-cleaning data-harmonization database-integration icd-10 machine-learning medical-data nlp-machine-learning one-hot-encoding phenotype r text-mining
Last synced: 9 months ago
- Host: GitHub
- URL: https://github.com/jcaperella29/clinical-text-mining_r_script
- Owner: jcaperella29
- Created: 2025-02-16T19:33:01.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-16T19:59:03.000Z (over 1 year ago)
- Last Synced: 2025-07-20T02:14:56.967Z (11 months ago)
- Topics: biomedical-data, clinical-informatics, data-cleaning, data-harmonization, database-integration, icd-10, machine-learning, medical-data, nlp-machine-learning, one-hot-encoding, phenotype, r, text-mining
- Language: R
- Homepage:
- Size: 8.79 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# clinical-text-mining_R_SCRIPT
# Medical Phenotype Extraction from Doctor's Notes
## Overview
This R script extracts **structured phenotype data** from **unstructured doctor's notes**.
It cleans, standardizes, maps diagnoses to **ICD-10 codes**, applies **one-hot encoding**,
and exports a **ready-to-use phenotype matrix** for **machine learning & statistical analysis**.
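
As a rough illustration of the extraction step, the sketch below pulls a few fields out of free-text notes with `stringr`. The note layout (`S001: 56 y/o, weight 81 kg, BP 140/90. Dx: ... Rx: ...`), the `parse_notes` helper, and the regular expressions are all assumptions made for this example; the input format that `generate_phenotype_matrix.R` expects is not documented here.

```r
library(stringr)

# Hypothetical notes; the real input layout may differ.
notes <- c(
  "S001: 56 y/o, weight 81 kg, BP 140/90. Dx: Hypertension. Rx: Lisinopril",
  "S002: 43 y/o, weight 70 kg, BP 118/76. Dx: Asthma. Rx: Albuterol",
  "S003: 61 y/o, BP 150/95. Dx: Hypertension. Rx: Lisinopril"  # weight missing
)

# Illustrative helper (not part of the repository): one row per note.
# str_match() returns NA for a field that is absent, so missing values
# propagate as NA instead of raising an error.
parse_notes <- function(notes) {
  bp <- str_match(notes, "BP\\s*(\\d{2,3})/(\\d{2,3})")
  data.frame(
    sample_id  = str_match(notes, "^(S\\d+)")[, 2],
    age        = as.integer(str_match(notes, "(\\d+)\\s*y/o")[, 2]),
    weight_kg  = as.numeric(str_match(notes, "weight\\s*(\\d+(?:\\.\\d+)?)\\s*kg")[, 2]),
    systolic   = as.integer(bp[, 2]),
    diastolic  = as.integer(bp[, 3]),
    diagnosis  = str_match(notes, "Dx:\\s*([^.]+)")[, 2],
    medication = str_match(notes, "Rx:\\s*(.+)$")[, 2],
    stringsAsFactors = FALSE
  )
}

parsed <- parse_notes(notes)
```

Splitting blood pressure into separate `systolic` and `diastolic` columns mirrors the column names in the output matrix.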
### Features
- Parses **doctor's notes** into structured data using **regex & NLP**
- Handles **missing values** and normalizes **blood pressure, weight, and age**
- **Maps diagnoses to ICD-10 codes** for standardization
- **One-hot encodes categorical data** (diagnoses and medications) for ML
- Saves **phenotype_matrix.csv** for **database integration & research**
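
The ICD-10 mapping and one-hot encoding features could look roughly like the sketch below. The lookup table is a small illustrative sample (the three codes are common ICD-10-CM entries chosen for the example), not the mapping the script actually ships with, and the `phenotypes`, `icd10_lookup`, `mapped`, and `one_hot` names are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical parsed data: one diagnosis per sample.
phenotypes <- data.frame(
  sample_id = c("S001", "S002", "S003"),
  diagnosis = c("Hypertension", "Diabetes", "Asthma"),
  stringsAsFactors = FALSE
)

# Illustrative lookup only; a real mapping needs clinical review.
icd10_lookup <- data.frame(
  diagnosis = c("Hypertension", "Diabetes", "Asthma"),
  icd10     = c("I10", "E11.9", "J45.909"),
  stringsAsFactors = FALSE
)

# Attach the ICD-10 code; diagnoses without a match get NA.
mapped <- left_join(phenotypes, icd10_lookup, by = "diagnosis")

# One-hot encode: one 0/1 column per diagnosis, prefixed "diagnosis_".
one_hot <- mapped %>%
  mutate(present = 1L) %>%
  pivot_wider(
    id_cols      = sample_id,
    names_from   = diagnosis,
    values_from  = present,
    names_prefix = "diagnosis_",
    values_fill  = list(present = 0L)
  )
```

The same `pivot_wider()` pattern would apply to medications with a `med_` prefix, which is how the column names in the output matrix appear to be built.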
---
## Installation & Dependencies
```r
install.packages(c("dplyr", "tidyr", "stringr"))
```
## Usage
1. Prepare your raw doctor's notes in a structured text file.
2. Run the script to extract structured data:

   ```r
   source("generate_phenotype_matrix.R")
   ```

3. Upload the resulting `phenotype_matrix.csv` to your lab's database.

## Output Example
| sample_id | age | weight_kg | systolic | diastolic | diagnosis_Hypertension | diagnosis_Diabetes | diagnosis_Asthma | diagnosis_Cardiovascular_Disease | med_Lisinopril | med_Metformin | med_Albuterol | med_Atorvastatin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S001 | 56 | 81 | 140 | 90 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |

## Database Integration
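To load the matrix into a SQLite database, run:

```r
library(DBI)  # needs the DBI and RSQLite packages installed
# Open (or create) the SQLite database file
con <- dbConnect(RSQLite::SQLite(), dbname = "lab_database.sqlite")
# Write the exported matrix to a table, replacing any existing copy
dbWriteTable(con, "phenotype_data", read.csv("phenotype_matrix.csv"), overwrite = TRUE)
dbDisconnect(con)
```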
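
Once the table is loaded, it can be queried back with plain SQL. The sketch below is a minimal, self-contained check that uses an in-memory SQLite database and a one-row stand-in for the matrix, so it does not touch `lab_database.sqlite`; the `phenotype` and `hypertensive` names and the query itself are assumptions for illustration.

```r
library(DBI)

# In-memory database so the example leaves no file behind.
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Stand-in for a few columns of phenotype_matrix.csv.
phenotype <- data.frame(
  sample_id = "S001",
  age = 56L,
  systolic = 140L,
  diastolic = 90L,
  diagnosis_Hypertension = 1L,
  stringsAsFactors = FALSE
)
dbWriteTable(con, "phenotype_data", phenotype, overwrite = TRUE)

# Select the samples flagged with the hypertension indicator column.
hypertensive <- dbGetQuery(
  con,
  "SELECT sample_id, systolic, diastolic
     FROM phenotype_data
    WHERE diagnosis_Hypertension = 1"
)

dbDisconnect(con)
```

Against the real database, the same `dbGetQuery()` call would run on a connection opened with `dbname = "lab_database.sqlite"`.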