https://github.com/jcaperella29/clinical-text-mining_r_script

A lightweight R script for text mining and harmonizing medical phenotype data. Cleans, standardizes, and maps diagnoses to ICD-10 codes, with clinical annotations for enhanced data usability.
biomedical-data clinical-informatics data-cleaning data-harmonization database-integration icd-10 machine-learning medical-data nlp-machine-learning one-hot-encoding phenotype r text-mining


# clinical-text-mining_R_SCRIPT

# ๐Ÿฅ Medical Phenotype Extraction from Doctor's Notes ๐Ÿฉบ

## ๐Ÿ“œ Overview
This R script extracts **structured phenotype data** from **unstructured doctor's notes**.
It cleans, standardizes, maps diagnoses to **ICD-10 codes**, applies **one-hot encoding**,
and exports a **ready-to-use phenotype matrix** for **machine learning & statistical analysis**.
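
The extraction step described above might look roughly like the following. This is a minimal sketch, not the script's actual code: the note layout, the `extract_first` helper, and the regular expressions are all assumptions made for illustration, and base R is used so the example runs without extra packages.

```r
# Hypothetical note format; the real input layout is not documented here.
notes <- c(
  S001 = "Patient is a 56 year old male. Weight: 81 kg. BP: 140/90. Diagnosis: Hypertension. Medication: Lisinopril.",
  S002 = "Patient is a 43 year old female. Weight: 68 kg. Diagnosis: Asthma. Medication: Albuterol."
)

# Return the first capture group of `pattern` for each note, or NA if absent.
extract_first <- function(text, pattern) {
  matches <- regmatches(text, regexec(pattern, text, perl = TRUE))
  vapply(matches,
         function(m) if (length(m) >= 2) m[2] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

# One row per note; fields missing from a note become NA.
phenotypes <- data.frame(
  sample_id  = names(notes),
  age        = as.integer(extract_first(notes, "(\\d+)\\s*year")),
  weight_kg  = as.numeric(extract_first(notes, "Weight:\\s*([0-9.]+)\\s*kg")),
  systolic   = as.integer(extract_first(notes, "BP:\\s*(\\d+)/\\d+")),
  diastolic  = as.integer(extract_first(notes, "BP:\\s*\\d+/(\\d+)")),
  diagnosis  = extract_first(notes, "Diagnosis:\\s*([A-Za-z ]+)"),
  medication = extract_first(notes, "Medication:\\s*([A-Za-z ]+)"),
  stringsAsFactors = FALSE
)

print(phenotypes)
```

In this sketch the second note has no blood pressure reading, so its `systolic` and `diastolic` values come out as `NA` rather than causing an error.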

### 🔬 Features
- ✅ Parses **doctor's notes** into structured data using **regex & NLP**
- ✅ Handles **missing values & normalizes blood pressure, weight, age**
- ✅ **Maps diagnoses to ICD-10 codes** for standardization
- ✅ **One-hot encodes categorical data** (diagnosis & meds) for ML
- ✅ Saves **phenotype_matrix.csv** for **database integration & research**
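
To make the ICD-10 mapping and one-hot encoding features concrete, here is a small sketch in base R. The lookup table, the `one_hot` helper, and the sample rows are hypothetical; the script's own mapping table is not shown in this README, and mapping "Diabetes" to the type 2 code `E11.9` is an assumption.

```r
# Illustrative input; in practice this would come from the parsing step.
phenotypes <- data.frame(
  sample_id  = c("S001", "S002", "S003"),
  diagnosis  = c("Hypertension", "Diabetes", "Asthma"),
  medication = c("Lisinopril", "Metformin", "Albuterol"),
  stringsAsFactors = FALSE
)

# Hypothetical diagnosis -> ICD-10 lookup (assumed, not the script's table).
icd10_lookup <- c(Hypertension = "I10", Diabetes = "E11.9", Asthma = "J45.909")

# Diagnoses without an entry in the lookup map to NA.
phenotypes$icd10_code <- unname(icd10_lookup[phenotypes$diagnosis])

# Build 0/1 indicator columns named "<prefix>_<category>".
one_hot <- function(values, prefix) {
  categories <- sort(unique(values[!is.na(values)]))
  encoded <- outer(values, categories, "==")  # logical matrix: rows x categories
  encoded[is.na(encoded)] <- FALSE            # a missing value yields all zeros
  storage.mode(encoded) <- "integer"
  colnames(encoded) <- paste0(prefix, "_", gsub(" ", "_", categories))
  as.data.frame(encoded)
}

phenotype_matrix <- cbind(
  phenotypes["sample_id"],
  one_hot(phenotypes$diagnosis, "diagnosis"),
  one_hot(phenotypes$medication, "med")
)

print(phenotype_matrix)
```

The column naming (`diagnosis_Hypertension`, `med_Lisinopril`, and so on) follows the pattern in the README's output example. Writing the result with `write.csv(phenotype_matrix, "phenotype_matrix.csv", row.names = FALSE)` would produce a file in the same shape.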

---

## โš™๏ธ Installation & Dependencies
```r
install.packages(c("dplyr", "tidyr", "stringr"))

๐Ÿš€ Usage
Prepare your raw doctorโ€™s notes in a structured text file.
Run the script to extract structured data:
r
Copy
Edit
source("generate_phenotype_matrix.R")
Upload the phenotype_matrix.csv to your labโ€™s database.

## 📂 Output Example
| sample_id | age | weight_kg | systolic | diastolic | diagnosis_Hypertension | diagnosis_Diabetes | diagnosis_Asthma | diagnosis_Cardiovascular_Disease | med_Lisinopril | med_Metformin | med_Albuterol | med_Atorvastatin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S001 | 56 | 81 | 140 | 90 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
๐Ÿฅ Database Integration
If using SQL, run:

library(DBI)
con <- dbConnect(RSQLite::SQLite(), dbname = "lab_database.sqlite")
dbWriteTable(con, "phenotype_data", read.csv("phenotype_matrix.csv"), overwrite = TRUE)
dbDisconnect(con)
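
Once the table is written, it can be queried like any other SQL table. The sketch below is an assumption-laden illustration rather than part of the script: it uses an in-memory SQLite database and a one-row stand-in for `phenotype_matrix.csv` so that it runs on its own, and the query is only an example.

```r
library(DBI)

# In-memory database so the example leaves no files behind (assumption:
# a real workflow would connect to "lab_database.sqlite" instead).
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Stand-in for the contents of phenotype_matrix.csv.
phenotype_matrix <- data.frame(
  sample_id = "S001",
  age = 56L,
  systolic = 140L,
  diastolic = 90L,
  diagnosis_Hypertension = 1L,
  stringsAsFactors = FALSE
)
dbWriteTable(con, "phenotype_data", phenotype_matrix, overwrite = TRUE)

# Example query: blood pressure readings for samples flagged as hypertensive.
hypertensive <- dbGetQuery(
  con,
  "SELECT sample_id, systolic, diastolic
     FROM phenotype_data
    WHERE diagnosis_Hypertension = 1"
)

dbDisconnect(con)
print(hypertensive)
```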