{"id":30560990,"url":"https://github.com/babak2/synthea-data-analysis","last_synced_at":"2026-04-11T11:02:40.020Z","repository":{"id":289821637,"uuid":"972507252","full_name":"babak2/synthea-data-analysis","owner":"babak2","description":"Synthea Data Analysis","archived":false,"fork":false,"pushed_at":"2025-05-22T13:42:36.000Z","size":17860,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-08-28T16:13:22.055Z","etag":null,"topics":["data-analysis","data-visualization","jupyter-notebook","jupytext","matplotlib","numpy","pandas","python3","seaborn","synthea"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/babak2.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-25T07:33:05.000Z","updated_at":"2025-06-08T22:46:21.000Z","dependencies_parsed_at":null,"dependency_job_id":"965de89e-32f3-4721-b58b-ae5591f7fef8","html_url":"https://github.com/babak2/synthea-data-analysis","commit_stats":null,"previous_names":["babak2/synthea_data-analysis"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/babak2/synthea-data-analysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/babak2%2Fsynthea-data-analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/babak2%2Fsynthea-data-analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/babak2%2Fsynthea-data-analysis/releases","
manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/babak2%2Fsynthea-data-analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/babak2","download_url":"https://codeload.github.com/babak2/synthea-data-analysis/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/babak2%2Fsynthea-data-analysis/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31677819,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-11T08:18:19.405Z","status":"ssl_error","status_checked_at":"2026-04-11T08:17:08.892Z","response_time":54,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-visualization","jupyter-notebook","jupytext","matplotlib","numpy","pandas","python3","seaborn","synthea"],"created_at":"2025-08-28T11:05:12.460Z","updated_at":"2026-04-11T11:02:40.005Z","avatar_url":"https://github.com/babak2.png","language":"Jupyter Notebook","readme":"# Synthea Data Analysis\n\nThis repository contains a series of Python scripts and Jupyter notebooks for cleaning, processing, and analysing synthetic healthcare data generated by the Synthea simulator, with a focus on hypertension analysis. 
The project includes data cleaning, data validation, and statistical analysis related to blood pressure, BMI, and hypertension prevalence.\n\n## Repository Structure\n\nThe project is organised as follows:\n\n```\n├── README.md                         # Project overview, setup, \u0026 usage\n├── synthea_data-analysis.ipynb       # Integrated notebook\n├── requirements.txt                  # Python dependencies\n├── .gitignore                        # Ignoring data dumps, etc.\n├── data/\n│   ├── original/                     # Raw Synthea data (input data)\n│   └── processed/                    # Cleaned outputs from scripts\n├── docs/\n│   └── data_dictionary.md            # Data dictionary for reference\n├── archive/                          # Archived scripts and notebooks\n│   ├── scripts/                      # Python scripts\n│   │   ├── 01_patient_cleaning.py\n│   │   ├── 02_conditions_cleaning.py\n│   │   ├── 03_observations_cleaning.py\n│   │   ├── 04_medications_cleaning.py\n│   │   ├── 05_encounters_cleaning.py\n│   │   ├── 06_data_desc.py\n│   │   ├── 07_hypertension_bp_bmi_analysis.py\n│   │   ├── 08_compare_bp_bmi_hypertensive_vs_non.py\n│   │   └── 09_hypertension_prevalence.py\n│   └── notebooks/                    # Jupyter notebooks\n\n```\n## Project Overview\n\nThis repository focuses on cleaning and analysing the synthetic healthcare data produced by the [Synthea](https://github.com/synthetichealth/synthea) simulator. The analysis primarily examines hypertension-related data, including blood pressure and BMI metrics.\n\n### Analysis Pipeline\n\n1. **Data Cleaning:**  \n   The raw Synthea data is cleaned in a series of scripts, starting with patient data and continuing through conditions, observations, medications, and encounters.\n\n2. 
**Data Analysis:**  \n   Once the data is cleaned, the project performs statistical analysis on key indicators such as hypertension prevalence, blood pressure (BP), and BMI across different patient populations.\n\n3. **Reporting \u0026 Visualisation:**  \n   The final results are summarised in reports, including figures and tables generated during analysis.\n\n## Install\n\nTo get started, set up the environment using `pip`. First, clone the repository:\n\n```bash\ngit clone https://github.com/babak2/synthea-data-analysis.git\ncd synthea-data-analysis\n```\n\n\nThen, install the required dependencies:\n\n```pip install -r requirements.txt```\n\n\n## Required Libraries\n\nThe project requires the following key Python libraries:\n\n- **pandas**: For data manipulation and cleaning\n\n- **numpy**: For numerical operations\n\n- **matplotlib** and **seaborn**: For data visualisation\n\n- **jupytext**: For converting between Jupyter notebooks and Python scripts\n\nFor the full list of dependencies, see the `requirements.txt` file.\n\n\n\n## Running the Scripts\n\nThe repository contains Python scripts that can be executed independently or together in sequence. Here's how you can run them:\n\n1. Run individual Python scripts:\n    The scripts are numbered in their intended execution order. You can run any script individually using Python:\n\n    ```python archive/scripts/01_patient_cleaning.py```\n    ```python archive/scripts/02_conditions_cleaning.py```\n    ... and so on for each script\n\n\n\n\n2. Run the integrated Jupyter notebook:\n    The full analysis is contained in the `synthea_data-analysis.ipynb` notebook. Open it in Jupyter, then run all cells to execute the entire analysis in one go:\n\n\n      ``` jupyter notebook synthea_data-analysis.ipynb ```\n\n\n\n## Data\n\nPlace the raw Synthea data files in the `data/original/` directory. After you run the cleaning scripts, the processed data is saved in the `data/processed/` directory. 
Here's an example of the data structure:\n\n\n```\ndata/\n├── original/   # Raw data\n│   ├── patients.csv.gz\n│   ├── conditions.csv.gz\n│   ├── observations.csv.gz\n│   └── ...\n└── processed/  # Cleaned data\n    ├── clean_patients.csv\n    ├── clean_conditions.csv\n    ├── clean_observations.csv\n    └── ...\n```\n## Contributing\n\n\nIf you'd like to improve the analysis, suggest new features, or fix bugs, feel free to fork the repository and create a pull request.\n\n### How to Contribute\n\n- Fork the repository.\n\n- Create a feature branch (`git checkout -b feature-branch`).\n\n- Commit your changes (`git commit -am 'Add new feature'`).\n\n- Push to the branch (`git push origin feature-branch`).\n\n- Create a new Pull Request.\n\n\n## License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n\n\n## Author\n\nBabak Mahdavi Ardestani\n\nbabak.m.ardestani@gmail.com\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbabak2%2Fsynthea-data-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbabak2%2Fsynthea-data-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbabak2%2Fsynthea-data-analysis/lists"}