https://github.com/babak2/synthea-data-analysis


# Synthea Data Analysis

This repository contains a series of Python scripts and Jupyter notebooks for cleaning, processing, and analysing synthetic healthcare data generated by the Synthea simulator, with a focus on hypertension analysis. The project includes data cleaning, data validation, and statistical analysis related to blood pressure, BMI, and hypertension prevalence.

## Repository Structure

The project is organised as follows:

```
├── README.md                          # Project overview, setup, and usage
├── synthea_data-analysis.ipynb        # Integrated notebook
├── requirements.txt                   # Python dependencies
├── .gitignore                         # Ignores data dumps, etc.
├── data/
│   ├── original/                      # Raw Synthea data (input data)
│   └── processed/                     # Cleaned outputs from scripts
├── docs/
│   └── data_dictionary.md             # Data dictionary for reference
└── archive/                           # Archived scripts and notebooks
    ├── scripts/                       # Python scripts
    │   ├── 01_patient_cleaning.py
    │   ├── 02_conditions_cleaning.py
    │   ├── 03_observations_cleaning.py
    │   ├── 04_medications_cleaning.py
    │   ├── 05_encounters_cleaning.py
    │   ├── 06_data_desc.py
    │   ├── 07_hypertension_bp_bmi_analysis.py
    │   ├── 08_compare_bp_bmi_hypertensive_vs_non.py
    │   └── 09_hypertension_prevalence.py
    └── notebooks/                     # Jupyter notebooks
```

## Project Overview

This repository focuses on cleaning and analysing the synthetic healthcare data produced by the [Synthea](https://github.com/synthetichealth/synthea) simulator. The analysis primarily examines hypertension-related data, including blood pressure and BMI metrics.

### Analysis Pipeline

1. **Data Cleaning:**
The raw Synthea data is cleaned in a series of scripts, starting with patient data and continuing through conditions, observations, medications, and encounters.

2. **Data Analysis:**
Once the data is cleaned, the project performs statistical analysis on key indicators like hypertension prevalence, blood pressure (BP), and BMI across different patient populations.

3. **Reporting & Visualisation:**
The final results are summarised in reports, including figures and tables generated during analysis.
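
As a rough illustration of the analysis step, the sketch below flags hypertensive patients and compares mean systolic blood pressure between groups. It is not the code from the archived scripts. The tiny in-memory tables, the column names (`Id`, `PATIENT`, `DESCRIPTION`, `VALUE`, modelled on Synthea's CSV export), and the helper `flag_hypertension` are all assumptions made for the example, and matching on the word "hypertension" is a deliberately crude stand-in for a proper code-based definition.

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned tables; column names are assumed
# to follow Synthea's CSV export and may differ from the processed files.
patients = pd.DataFrame({"Id": ["p1", "p2", "p3", "p4"]})
conditions = pd.DataFrame({
    "PATIENT": ["p1", "p3", "p3"],
    "DESCRIPTION": ["Hypertension", "Essential hypertension (disorder)", "Prediabetes"],
})
observations = pd.DataFrame({
    "PATIENT": ["p1", "p2", "p3", "p4"],
    "DESCRIPTION": ["Systolic Blood Pressure"] * 4,
    "VALUE": [150.0, 118.0, 142.0, 120.0],
})


def flag_hypertension(patients, conditions):
    """Return a copy of `patients` with a boolean `hypertensive` column."""
    # Crude text match; a real analysis would likely filter on condition codes.
    is_htn = conditions["DESCRIPTION"].str.contains("hypertension", case=False, na=False)
    htn_ids = set(conditions.loc[is_htn, "PATIENT"])
    return patients.assign(hypertensive=patients["Id"].isin(htn_ids))


flagged = flag_hypertension(patients, conditions)
prevalence = flagged["hypertensive"].mean()  # share of patients with a diagnosis

# Mean systolic BP for hypertensive vs. non-hypertensive patients.
systolic = observations[observations["DESCRIPTION"] == "Systolic Blood Pressure"]
merged = systolic.merge(flagged, left_on="PATIENT", right_on="Id")
merged["group"] = merged["hypertensive"].map(
    {True: "hypertensive", False: "non-hypertensive"}
)
mean_bp = merged.groupby("group")["VALUE"].mean()

print(f"Hypertension prevalence: {prevalence:.0%}")
print(mean_bp)
```

The same pattern (flag patients, join to observations, group and aggregate) would extend to BMI by filtering on a different observation description.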

## Install

To get started, you can set up the environment using `pip`. First, clone the repository:

```bash
git clone https://github.com/babak2/synthea-data-analysis.git
cd synthea-data-analysis
```

Then, install the required dependencies:

```bash
pip install -r requirements.txt
```

## Required Libraries

The project requires the following key Python libraries:

- **pandas**: For data manipulation and cleaning

- **numpy**: For numerical operations

- **matplotlib** and **seaborn**: For data visualisation

- **jupytext**: For pairing and converting between Jupyter notebooks and plain Python scripts

For a full list of dependencies, see the `requirements.txt` file.

## Running the Scripts

The Python scripts in `archive/scripts/` are numbered in their intended execution order. Each one can be run on its own, but they are designed to be executed in sequence. Here's how you can run them:

1. Run individual Python scripts:
Run any script with Python from the repository root:

```bash
python archive/scripts/01_patient_cleaning.py
python archive/scripts/02_conditions_cleaning.py
# ... and so on for each script
```

2. Execute the integrated Jupyter notebook:
The final analysis is contained in the `synthea_data-analysis.ipynb` notebook. Open it with the command below, then run all cells to execute the entire analysis in one go:

```bash
jupyter notebook synthea_data-analysis.ipynb
```
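
If you prefer not to launch the numbered scripts one by one, a small runner along the following lines could chain them. This is only a sketch: the repository does not ship such a runner, the helper names `ordered_scripts` and `run_pipeline` are hypothetical, and it assumes the two-digit filename prefix fully determines the execution order.

```python
import subprocess
import sys
from pathlib import Path


def ordered_scripts(script_dir):
    """Return scripts named like NN_name.py, sorted by their numeric prefix."""
    return sorted(Path(script_dir).glob("[0-9][0-9]_*.py"))


def run_pipeline(script_dir="archive/scripts"):
    """Run each script with the current interpreter, stopping at the first failure."""
    for script in ordered_scripts(script_dir):
        print(f"Running {script.name}")
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run([sys.executable, str(script)], check=True)


if __name__ == "__main__":
    run_pipeline()
```

Run from the repository root, this executes `01_patient_cleaning.py` through `09_hypertension_prevalence.py` in order and aborts as soon as one of them fails, so later steps never run on incomplete outputs.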

## Data

The raw Synthea data files can be placed in the `data/original/` directory. After running the cleaning scripts, the processed data will be saved in the `data/processed/` directory. Here's an example of the data structure:

```
data/
├── original/                # Raw data
│   ├── patients.csv.gz
│   ├── conditions.csv.gz
│   ├── observations.csv.gz
│   └── ...
└── processed/               # Cleaned data
    ├── clean_patients.csv
    ├── clean_conditions.csv
    ├── clean_observations.csv
    └── ...
```

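
To make the mapping from `original/` to `processed/` concrete, here is a minimal sketch of one cleaning step with pandas. The function `clean_table` and its clean-up rules (dropping exact duplicate rows and entirely empty columns) are placeholders chosen for illustration, not the logic of the archived cleaning scripts. Only the directory names and the `<name>.csv.gz` to `clean_<name>.csv` naming pattern come from the layout above.

```python
from pathlib import Path

import pandas as pd

RAW_DIR = Path("data/original")
PROCESSED_DIR = Path("data/processed")


def clean_table(name, raw_dir=RAW_DIR, processed_dir=PROCESSED_DIR):
    """Read <name>.csv.gz, apply generic clean-up, and write clean_<name>.csv."""
    # pandas infers gzip compression from the .gz suffix.
    df = pd.read_csv(Path(raw_dir) / f"{name}.csv.gz")

    # Placeholder cleaning: drop exact duplicate rows and all-empty columns.
    df = df.drop_duplicates().dropna(axis="columns", how="all")

    Path(processed_dir).mkdir(parents=True, exist_ok=True)
    out_path = Path(processed_dir) / f"clean_{name}.csv"
    df.to_csv(out_path, index=False)
    return out_path
```

For example, `clean_table("patients")` would read `data/original/patients.csv.gz` and write `data/processed/clean_patients.csv`.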
## Contributing

If you'd like to improve the analysis, suggest new features, or fix bugs, feel free to fork the repository and create a pull request.

### How to Contribute

- Fork the repository.

- Create a feature branch (`git checkout -b feature-branch`).

- Commit your changes (`git commit -am 'Add new feature'`).

- Push to the branch (`git push origin feature-branch`).

- Create a new Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Author

Babak Mahdavi Ardestani

babak.m.ardestani@gmail.com