{"id":28449126,"url":"https://github.com/sodascience/synthetic_youth_pilot","last_synced_at":"2025-10-14T18:06:31.615Z","repository":{"id":291947292,"uuid":"979302208","full_name":"sodascience/synthetic_youth_pilot","owner":"sodascience","description":"Synthetic data pilot for YOUth study questionnaires, using metasyn","archived":false,"fork":false,"pushed_at":"2025-07-08T09:34:40.000Z","size":86,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-08-28T05:15:50.543Z","etag":null,"topics":["questionnaire-survey","synthetic-data","youth-data"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sodascience.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-07T09:56:56.000Z","updated_at":"2025-07-08T09:34:43.000Z","dependencies_parsed_at":"2025-05-07T11:34:45.532Z","dependency_job_id":null,"html_url":"https://github.com/sodascience/synthetic_youth_pilot","commit_stats":null,"previous_names":["sodascience/synthetic_youth_pilot"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sodascience/synthetic_youth_pilot","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodascience%2Fsynthetic_youth_pilot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodascience%2Fsynthetic_youth_pilot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodascience%2Fsynthetic_youth_pilot/releases","manifests_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodascience%2Fsynthetic_youth_pilot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sodascience","download_url":"https://codeload.github.com/sodascience/synthetic_youth_pilot/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodascience%2Fsynthetic_youth_pilot/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279020322,"owners_count":26086864,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-14T02:00:06.444Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["questionnaire-survey","synthetic-data","youth-data"],"created_at":"2025-06-06T14:06:44.728Z","updated_at":"2025-10-14T18:06:31.602Z","avatar_url":"https://github.com/sodascience.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# YOUth pilot privacy-friendly synthetic data\n![Python](https://img.shields.io/badge/Python-3776AB?logo=python\u0026logoColor=fff)\n![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json) \n\nThis repository implements a pilot for creating privacy-friendly questionnaire datasets from the YOUth cohort. 
It is built on [metasyn](https://github.com/sodascience/metasyn) with the [disclosure control plugin](https://github.com/sodascience/metasyn-disclosure-control).\n\n## Installation\n\nTo install the dependencies of this project, follow these steps:\n\n1. We use [uv](https://docs.astral.sh/uv) to manage dependencies and environments. Install it first.\n2. Clone this repository.\n3. Instantiate the environment by running `uv sync` from this folder.\n\n## Synthesizing data\n1. Obtain the following datasets from the YOUth study and put them in the `raw_data` folder: `CECPAQ_2.csv`, `M_DEMOGRAFY_1.csv`, `P_DEMOGRAFY_1.csv`, `P_LIFSTYLE_1_MED_STOREY.csv`, `P_LIFSTYLE_1_MEDICATIONY.csv`, `P_LIFSTYLE_1.csv`, `Q_1.csv`\n2. Obtain the following metadata files and put them in the `raw_data/metadata` folder: `YOUth_baby_en_kind-metadata.csv`, `YOUth_baby_en_kind-valuelabels.csv`.\n3. Create the synthetic data by running `uv run synthesize.py`\n\nNow, the folders `output/csv` and `output/gmf` should be populated with synthetic data and metadata, respectively:\n\n```\n📁 synthetic_youth_pilot/\n├── 📖 README.md\n├── 📄 test_analysis.py\n├── 📄 synthesize.py\n├── pyproject.toml\n├── uv.lock\n├── 📁 raw_data/\n│   ├── 📜 CECPAQ_2.csv\n│   ├── 📜 M_DEMOGRAFY_1.csv\n│   ├── 📜 P_DEMOGRAFY_1.csv\n│   ├── 📜 P_LIFSTYLE_1.csv\n│   ├── 📜 P_LIFSTYLE_1_MEDICATIONY.csv\n│   ├── 📜 P_LIFSTYLE_1_MED_STOREY.csv\n│   ├── 📜 Q_1.csv\n│   └── 📁 metadata/\n│       ├── 📜 YOUth_baby_en_kind-metadata.csv\n│       └── 📜 YOUth_baby_en_kind-valuelabels.csv\n└── 📁 output/\n    ├── 📁 csv/\n    │   ├── 📜 CECPAQ_2.csv\n    │   ├── 📜 M_DEMOGRAFY_1.csv\n    │   ├── 📜 P_DEMOGRAFY_1.csv\n    │   ├── 📜 P_LIFSTYLE_1.csv\n    │   ├── 📜 P_LIFSTYLE_1_MEDICATIONY.csv\n    │   ├── 📜 P_LIFSTYLE_1_MED_STOREY.csv\n    │   └── 📜 Q_1.csv\n    └── 📁 gmf/\n        ├── 📜 CECPAQ_2.json\n        ├── 📜 M_DEMOGRAFY_1.json\n        ├── 📜 P_DEMOGRAFY_1.json\n        ├── 📜 P_LIFSTYLE_1.json\n        ├── 📜 
P_LIFSTYLE_1_MEDICATIONY.json\n        ├── 📜 P_LIFSTYLE_1_MED_STOREY.json\n        └── 📜 Q_1.json\n\n5 directories, 28 files\n📖README 📜Data 📄Code 📁Folder\n```\n_(Made with [`scitree`](https://github.com/J535D165/scitree))_\n\n## Test analysis\n\nThis repo includes a test analysis on both the real and synthetic data to display medication use by age bracket. You can find the analysis in the file [test_analysis.py](./test_analysis.py). To run this analysis, run `uv run test_analysis.py`. It will show something like the following:\n\n```\nParacetamol use in real data:\n\nAge: 10 - 19 | ____ ███████████████████\nAge: 20 - 29 | ____ ████████████████\nAge: 30 - 39 | ____ ███████████████\nAge: 40 - 49 | ____ ██████████████\nAge: 50 - 59 | ____ █████████\n\n\nParacetamol use in synthetic data:\n\nAge: 10 - 19 | 0.81 ████████████████████\nAge: 20 - 29 | 0.81 ████████████████████\nAge: 30 - 39 | 0.78 ███████████████████\nAge: 40 - 49 | 0.74 ██████████████████\nAge: 50 - 59 | 0.81 ████████████████████\n```\n_(numbers redacted \u0026 bars fuzzed in real data analysis for privacy)_\n\nThree things are noteworthy here:\n1. The analysis code is __exactly__ the same between the synthetic and real analyses\n2. The ranges of the individual variables (age and paracetamol use) are similar\n3. The relation between age and paracetamol use is removed from the synthetic data\n\n## Contact\n\nThis is a project by the [ODISSEI Social Data Science team](https://odissei-soda.nl/). Do you have questions, suggestions, or remarks on the technical implementation? Create an issue in the issue tracker or feel free to contact [Erik-Jan van Kesteren](https://github.com/vankesteren). 
\n\n\u003cimg src=\"https://odissei-soda.nl/images/logos/soda_logo.svg\" alt=\"SoDa logo\" width=\"250px\"/\u003e ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsodascience%2Fsynthetic_youth_pilot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsodascience%2Fsynthetic_youth_pilot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsodascience%2Fsynthetic_youth_pilot/lists"}