https://github.com/paulinagonzalezc/multiple-imputation

Probabilistically imputing missing data to enable further statistical analysis.
https://github.com/paulinagonzalezc/multiple-imputation

Last synced: 2 months ago
JSON representation

Probabilistically imputing missing data to enable further statistical analysis.

Host: GitHub
URL: https://github.com/paulinagonzalezc/multiple-imputation
Owner: paulinagonzalezc
Created: 2023-10-26T20:19:52.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-04-20T01:12:23.000Z (about 1 year ago)
Last Synced: 2025-04-11T02:18:36.034Z (2 months ago)
Language: Python
Homepage:
Size: 23.4 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

Data Wrangling

# Multiple Imputation for Medical Data

## Overview
This project implements a simplified version of multiple imputation to handle incomplete medical data from multiple sources. The focus is on probabilistically imputing missing data to enable further statistical analysis.

## Project Description
I tackled the challenge of disparate and incomplete data in the Georgia Coverdell Acute Stroke Registry (GCASR) by imputing missing values using SQL and Python/pandas. The project facilitated the first step in multiple imputation, preparing the data for subsequent statistical methods like linear regression.

## Features
* SQL scripts to impute missing medical data across ten hospital tables.
* Python/pandas functions to mirror SQL data manipulation on dataframes.
* Linear regression application to estimate missing computed tomography times based on existing cholesterol levels.

### Imputation Strategies
* Age: Missing ages were filled with the median age from the respective hospital's data.
* Cholesterol Level: Missing values were replaced by the average cholesterol level for matching ages or the smallest value within a similar age bracket.
* Computed Tomography Time: Imputed using a one-dimensional linear regression trained on non-missing cholesterol levels.

## Tools Used
* MySQL for relational database management.
* Python with Pandas for dataframe manipulation.

## Results
The data was successfully wrangled into a format suitable for machine learning and statistical analysis, with missing values imputed as per the specifications.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/paulinagonzalezc/multiple-imputation

Awesome Lists containing this project

README