https://github.com/paulinagonzalezc/multiple-imputation
Probabilistically imputing missing data to enable further statistical analysis.
https://github.com/paulinagonzalezc/multiple-imputation
Last synced: 2 months ago
JSON representation
Probabilistically imputing missing data to enable further statistical analysis.
- Host: GitHub
- URL: https://github.com/paulinagonzalezc/multiple-imputation
- Owner: paulinagonzalezc
- Created: 2023-10-26T20:19:52.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-04-20T01:12:23.000Z (about 1 year ago)
- Last Synced: 2025-04-11T02:18:36.034Z (2 months ago)
- Language: Python
- Homepage:
- Size: 23.4 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
Data Wrangling
# Multiple Imputation for Medical Data
## Overview
This project implements a simplified version of multiple imputation to handle incomplete medical data from multiple sources. The focus is on probabilistically imputing missing data to enable further statistical analysis.
## Project Description
I tackled the challenge of disparate and incomplete data in the Georgia Coverdell Acute Stroke Registry (GCASR) by imputing missing values using SQL and Python/pandas. The project facilitated the first step in multiple imputation, preparing the data for subsequent statistical methods like linear regression.## Features
* SQL scripts to impute missing medical data across ten hospital tables.
* Python/pandas functions to mirror SQL data manipulation on dataframes.
* Linear regression application to estimate missing computed tomography times based on existing cholesterol levels.
### Imputation Strategies
* Age: Missing ages were filled with the median age from the respective hospital's data.
* Cholesterol Level: Missing values were replaced by the average cholesterol level for matching ages or the smallest value within a similar age bracket.
* Computed Tomography Time: Imputed using a one-dimensional linear regression trained on non-missing cholesterol levels.## Tools Used
* MySQL for relational database management.
* Python with Pandas for dataframe manipulation.## Results
The data was successfully wrangled into a format suitable for machine learning and statistical analysis, with missing values imputed as per the specifications.