https://github.com/paulinagonzalezc/multiple-imputation

Probabilistically imputing missing data to enable further statistical analysis.

# Multiple Imputation for Medical Data

## Overview
This project implements a simplified version of multiple imputation to handle incomplete medical data from multiple sources. The focus is on probabilistically imputing missing data to enable further statistical analysis.

## Project Description
I tackled the challenge of disparate and incomplete data in the Georgia Coverdell Acute Stroke Registry (GCASR) by imputing missing values using SQL and Python/pandas. The project covers the first step of multiple imputation: filling in the missing values so that the data is ready for subsequent statistical methods such as linear regression.

## Features
* SQL scripts to impute missing medical data across ten hospital tables.
* Python/pandas functions that mirror the SQL imputation logic on DataFrames.
* A linear regression that estimates missing computed tomography times from existing cholesterol levels.
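
The regression feature in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names `cholesterol` and `ct_time`, the helper `impute_ct_time`, and the sample values are all hypothetical, and the fit uses NumPy's `polyfit` as one convenient way to get a least-squares line.

```python
import numpy as np
import pandas as pd


def impute_ct_time(df: pd.DataFrame,
                   x_col: str = "cholesterol",  # hypothetical column name
                   y_col: str = "ct_time") -> pd.DataFrame:  # hypothetical column name
    """Fill missing y_col values from a least-squares line fit on x_col."""
    out = df.copy()

    # Fit only on rows where both the predictor and the target are observed.
    observed = out[x_col].notna() & out[y_col].notna()
    slope, intercept = np.polyfit(
        out.loc[observed, x_col].to_numpy(dtype=float),
        out.loc[observed, y_col].to_numpy(dtype=float),
        deg=1,
    )

    # Predict only where the target is missing and the predictor is available.
    to_fill = out[y_col].isna() & out[x_col].notna()
    out.loc[to_fill, y_col] = slope * out.loc[to_fill, x_col] + intercept
    return out


# Made-up sample data, for illustration only.
patients = pd.DataFrame({
    "cholesterol": [150.0, 200.0, 250.0, 300.0],
    "ct_time": [30.0, 40.0, 50.0, None],
})
filled = impute_ct_time(patients)
```

Rows that lack a cholesterol value are left untouched here, so in practice cholesterol would need to be imputed before this step.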

### Imputation Strategies
* Age: Missing ages were filled with the median age from the respective hospital's data.
* Cholesterol Level: Missing values were replaced with the average cholesterol level of patients of the same age, or with the smallest value within a similar age bracket.
* Computed Tomography Time: Imputed using a simple (single-predictor) linear regression on cholesterol level, fit on the non-missing values.
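
The age and cholesterol strategies above might look something like this in pandas. It is a sketch under several assumptions that do not come from the project itself: the hospital tables are taken to be combined into one DataFrame with a `hospital_id` column, the column names and sample values are invented, a "similar age bracket" is read as a 10-year band, and the bracket minimum is used only as a fallback when no same-age average exists.

```python
import pandas as pd


def impute_age(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing ages with the median age of the same hospital."""
    out = df.copy()
    hospital_median = out.groupby("hospital_id")["age"].transform("median")
    out["age"] = out["age"].fillna(hospital_median)
    return out


def impute_cholesterol(df: pd.DataFrame, bracket_width: int = 10) -> pd.DataFrame:
    """Fill missing cholesterol; expects ages to be filled already."""
    out = df.copy()
    observed = df["cholesterol"]

    # First choice: mean observed cholesterol among patients of the same age.
    same_age_mean = observed.groupby(out["age"]).transform("mean")

    # Fallback (assumed reading): smallest observed value in the same
    # age bracket, with brackets of `bracket_width` years.
    bracket = out["age"] // bracket_width
    bracket_min = observed.groupby(bracket).transform("min")

    out["cholesterol"] = observed.fillna(same_age_mean).fillna(bracket_min)
    return out


# Made-up sample data, for illustration only.
records = pd.DataFrame({
    "hospital_id": ["A", "A", "A", "B", "B", "B"],
    "age": [60, 64, None, 70, 70, 70],
    "cholesterol": [180.0, 200.0, None, 220.0, 240.0, None],
})
result = impute_cholesterol(impute_age(records))
```

Ages are filled first because both cholesterol rules group by age, so a row with a missing age would otherwise have no group to draw from.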

## Tools Used
* MySQL for relational database management.
* Python with pandas for dataframe manipulation.

## Results
The data was successfully wrangled into a format suitable for machine learning and statistical analysis, with missing values imputed as per the specifications.