https://github.com/babak2/synthea-data-analysis
Synthea Data Analysis
- Host: GitHub
- URL: https://github.com/babak2/synthea-data-analysis
- Owner: babak2
- License: mit
- Created: 2025-04-25T07:33:05.000Z (9 months ago)
- Default Branch: master
- Last Pushed: 2025-05-22T13:42:36.000Z (9 months ago)
- Last Synced: 2025-05-22T15:03:43.480Z (9 months ago)
- Topics: data-analysis, data-visualization, jupyter-notebook, jupytext, matplotlib, numpy, pandas, python3, seaborn, synthea
- Language: Jupyter Notebook
- Homepage:
- Size: 17 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Synthea Data Analysis
This repository contains a series of Python scripts and Jupyter notebooks for cleaning, processing, and analysing synthetic healthcare data generated by the Synthea simulator, with a focus on hypertension analysis. The project includes data cleaning, data validation, and statistical analysis related to blood pressure, BMI, and hypertension prevalence.
## Repository Structure
The project is organised as follows:
```
├── README.md # Project overview, setup, & usage
├── synthea_data-analysis.ipynb # Integrated notebook
├── requirements.txt # Python dependencies
├── .gitignore # Ignoring data dumps, etc.
├── data/
│ ├── original/ # Raw Synthea data (input data)
│ └── processed/ # Cleaned outputs from scripts
├── docs/
│ └── data_dictionary.md # Data dictionary for reference
├── archive/ # Archived scripts and notebooks
│ ├── scripts/ # Python scripts
│ │ ├── 01_patient_cleaning.py
│ │ ├── 02_conditions_cleaning.py
│ │ ├── 03_observations_cleaning.py
│ │ ├── 04_medications_cleaning.py
│ │ ├── 05_encounters_cleaning.py
│ │ ├── 06_data_desc.py
│ │ ├── 07_hypertension_bp_bmi_analysis.py
│ │ ├── 08_compare_bp_bmi_hypertensive_vs_non.py
│ │ └── 09_hypertension_prevalence.py
│ └── notebooks/ # Jupyter notebooks
```
## Project Overview
This repository focuses on cleaning and analysing the synthetic healthcare data produced by the [Synthea](https://github.com/synthetichealth/synthea) simulator. The analysis primarily examines hypertension-related data, including blood pressure and BMI metrics.
### Analysis Pipeline
1. **Data Cleaning:**
The raw Synthea data is cleaned in a series of scripts, starting with patient data and continuing through conditions, observations, medications, and encounters.
2. **Data Analysis:**
Once the data is cleaned, the project performs statistical analysis on key indicators like hypertension prevalence, blood pressure (BP), and BMI across different patient populations.
3. **Reporting & Visualisation:**
The final results are summarised in reports, including figures and tables generated during analysis.
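As an illustration of the analysis step, hypertension prevalence can be computed as the share of patients with at least one hypertension-related condition. The sketch below is not the repository's code: `hypertension_prevalence` is a hypothetical helper, the column names (`Id`, `PATIENT`, `DESCRIPTION`) are assumed to follow Synthea's CSV export, and matching on the word "hypertension" in the description is a simplifying assumption.

```python
import pandas as pd


def hypertension_prevalence(patients: pd.DataFrame, conditions: pd.DataFrame) -> float:
    """Return the fraction of patients with at least one hypertension condition.

    Assumes `patients` has an `Id` column and `conditions` has `PATIENT`
    and `DESCRIPTION` columns (assumed Synthea CSV layout).
    """
    # Flag condition rows whose description mentions hypertension (assumption).
    is_htn = conditions["DESCRIPTION"].str.contains("hypertension", case=False, na=False)
    # Collect the distinct patients behind those rows.
    htn_patients = set(conditions.loc[is_htn, "PATIENT"])
    # The mean of a boolean mask is the proportion of True values.
    return float(patients["Id"].isin(htn_patients).mean())


# Tiny made-up example: two of the four patients have a hypertension condition.
patients = pd.DataFrame({"Id": ["p1", "p2", "p3", "p4"]})
conditions = pd.DataFrame(
    {
        "PATIENT": ["p1", "p1", "p2", "p3"],
        "DESCRIPTION": [
            "Hypertension",
            "Prediabetes",
            "Viral sinusitis (disorder)",
            "Essential hypertension (disorder)",
        ],
    }
)

prevalence = hypertension_prevalence(patients, conditions)
print(f"Hypertension prevalence: {prevalence:.0%}")
```

Counting distinct patients rather than condition rows keeps a patient with several hypertension entries from being counted more than once.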
## Install
To get started, you can set up the environment using `pip`. First, clone the repository:
```bash
git clone https://github.com/babak2/synthea-data-analysis.git
cd synthea-data-analysis
```
Then, install the required dependencies:
```bash
pip install -r requirements.txt
```
## Required Libraries
The project requires the following key Python libraries:
- **pandas**: For data manipulation and cleaning
- **numpy**: For numerical operations
- **matplotlib** and **seaborn**: For data visualisation
- **jupytext**: To work seamlessly with Jupyter notebooks and scripts
For a full list of dependencies, check out the `requirements.txt` file.
## Running the Scripts
The repository's Python scripts are numbered and designed to be run in sequence, but each one can also be invoked on its own.
1. Run individual Python scripts:
Run any script with Python from the repository root:
```bash
python archive/scripts/01_patient_cleaning.py
python archive/scripts/02_conditions_cleaning.py
```
... and so on for each script
2. Open the integrated Jupyter notebook:
The final analysis is contained in the `synthea_data-analysis.ipynb` notebook. Open it and run all cells to execute the entire analysis in one go:
```bash
jupyter notebook synthea_data-analysis.ipynb
```
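To run the numbered scripts back to back, a small runner along these lines could be used. This is a sketch rather than something the repository ships: `ordered_scripts` and `run_pipeline` are hypothetical helpers, and it assumes the scripts live in `archive/scripts/` and keep the two-digit prefix naming shown in the repository structure.

```python
import subprocess
import sys
from pathlib import Path


def ordered_scripts(script_dir: Path) -> list[Path]:
    """Return scripts named like 01_*.py, 02_*.py, ... sorted by filename."""
    return sorted(script_dir.glob("[0-9][0-9]_*.py"))


def run_pipeline(script_dir: Path) -> None:
    """Run each numbered script in order, stopping at the first failure."""
    for script in ordered_scripts(script_dir):
        print(f"Running {script.name}")
        # check=True raises CalledProcessError if a script exits with an error,
        # so later steps never run on incomplete outputs.
        subprocess.run([sys.executable, str(script)], check=True)


if __name__ == "__main__":
    # Assumed location of the scripts, relative to the repository root.
    run_pipeline(Path("archive/scripts"))
```

Using `sys.executable` runs every script with the same interpreter (and therefore the same installed dependencies) as the runner itself.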
## Data
Place the raw Synthea data files in the `data/original/` directory. The cleaning scripts save their output to the `data/processed/` directory. Here's an example of the resulting data structure:
```
data/
├── original/ # Raw data
│ ├── patients.csv.gz
│ ├── conditions.csv.gz
│ ├── observations.csv.gz
│ └── ...
└── processed/ # Cleaned data
    ├── clean_patients.csv
    ├── clean_conditions.csv
    ├── clean_observations.csv
    └── ...
```
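The round trip from a gzipped raw file to a cleaned CSV might look like the sketch below. It is illustrative only and does not reproduce the repository's cleaning logic: `clean_patients` is a hypothetical helper, the `Id` and `BIRTHDATE` columns are assumed from Synthea's CSV export, and the cleaning rules (drop unparseable birth dates, drop duplicate patient IDs) are placeholders. The demo writes to a temporary directory with made-up rows so it runs without any real data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def clean_patients(src: Path, dst: Path) -> pd.DataFrame:
    """Read a gzipped patients file, apply example cleaning rules, write a CSV."""
    # pandas infers gzip compression from the .gz suffix.
    df = pd.read_csv(src)
    # Unparseable dates become NaT so they can be dropped (example rule).
    df["BIRTHDATE"] = pd.to_datetime(df["BIRTHDATE"], errors="coerce")
    df = df.dropna(subset=["BIRTHDATE"]).drop_duplicates(subset="Id")
    dst.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(dst, index=False)
    return df


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data"
    src = root / "original" / "patients.csv.gz"
    dst = root / "processed" / "clean_patients.csv"
    src.parent.mkdir(parents=True)

    # Made-up raw rows: one duplicate ID and one invalid birth date.
    raw = pd.DataFrame(
        {
            "Id": ["p1", "p1", "p2", "p3"],
            "BIRTHDATE": ["1980-01-01", "1980-01-01", "not a date", "1975-06-30"],
        }
    )
    raw.to_csv(src, index=False)  # compression is inferred from the .gz suffix

    cleaned = clean_patients(src, dst)
    written = pd.read_csv(dst)
    print(f"{len(raw)} raw rows -> {len(cleaned)} cleaned rows")
```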
## Contributing
If you'd like to improve the analysis, suggest new features, or fix bugs, feel free to fork the repository and create a pull request.
### How to Contribute
1. Fork the repository.
2. Create a feature branch (`git checkout -b feature-branch`).
3. Commit your changes (`git commit -am 'Add new feature'`).
4. Push to the branch (`git push origin feature-branch`).
5. Create a new Pull Request.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Author
Babak Mahdavi Ardestani
babak.m.ardestani@gmail.com