https://github.com/gretelai/synthetic-data-genomics
Proof of concept code from Gretel.ai and Illumina using generative neural networks to create synthetic versions of mouse genotype and phenotype data.
- Host: GitHub
- URL: https://github.com/gretelai/synthetic-data-genomics
- Owner: gretelai
- Created: 2021-09-20T22:41:30.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2022-01-19T21:20:17.000Z (over 3 years ago)
- Last Synced: 2025-04-04T15:40:16.846Z (7 months ago)
- Topics: generative-model, genomics, privacy-enhancing-technologies, synthetic-data
- Language: Jupyter Notebook
- Homepage: https://cdn.gretel.ai/case_studies/gretel_illumina_case_study.pdf
- Size: 7.04 MB
- Stars: 37
- Watchers: 28
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Synthetic Data Genomics
The code in this repository uses Gretel.ai's synthetic data APIs to create synthetic (artificial) versions of real-world mouse genotype and connected phenotype datasets. We then measure the accuracy of the synthetic data by replicating the results of a Genome Wide Association Study (GWAS) on the real-world genotypes and phenotypes of 1,220 mice from this paper: https://doi.org/10.1038/ng.3609.

View the full case study here: https://cdn.gretel.ai/case_studies/gretel_illumina_case_study.pdf
## Installation
Requirements:
* Conda package manager
* Ubuntu 18.04 recommended
* NVIDIA T4 or faster GPU
* Gretel.ai API key (https://console.gretel.cloud)

With the Conda package manager installed, create and activate an environment, then install Jupyter:
```
conda create --name genomics python=3.9
conda activate genomics
conda install jupyter
```
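Before opening the notebooks, it can help to confirm that the requirements listed above are in place. The following is a minimal sketch using only the Python standard library, not part of the repository. The function name `preflight` is hypothetical, and the environment variable name `GRETEL_API_KEY` is an assumption about how the API key is supplied; adjust it to match your setup. Finding `nvidia-smi` only suggests that an NVIDIA driver is installed; it does not confirm the GPU model.

```python
import os
import shutil
import sys


def preflight(env=None, which=shutil.which):
    """Return a dict mapping each requirement name to True/False.

    `env` and `which` are injectable so the check can be exercised
    without touching the real environment.
    """
    env = os.environ if env is None else env
    return {
        # The README creates the environment with python=3.9.
        "python 3.9": sys.version_info[:2] == (3, 9),
        "conda": which("conda") is not None,
        "jupyter": which("jupyter") is not None,
        # Presence of the tool only hints at an NVIDIA driver, not a T4.
        "nvidia-smi": which("nvidia-smi") is not None,
        # Assumed variable name for the Gretel.ai API key.
        "GRETEL_API_KEY": bool(env.get("GRETEL_API_KEY")),
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```

Run it inside the activated `genomics` environment; any line marked `MISSING` points to a requirement that may still need attention.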
## Recreate the original paper experiments
Follow the steps in `EXPERIMENTS.md` to download the experiment datasets and recreate the results from the paper using real-world data.

## Synthesize genome and phenome data, run experiments
Next, create synthetic versions of the mouse phenome and genome datasets from the original experiments.
1. `synthetics/01_create_phenome_training_data.ipynb` creates the phenome training set and filters out irrelevant fields.
2. `synthetics/02_create_synthetic_mouse_phenomes.ipynb` trains a synthetic model on the mouse phenome set.
3. `synthetics/03_build_genome_training_set.ipynb` creates a genome dataset based on abBMD SNPs.
4. `synthetics/04_create_synthetic_mouse_genomes.ipynb` trains a synthetic model on the mouse genome set, runs a GWAS analysis, and compares the output to the original results.
5. `research_paper_code/notebooks/map_synth.ipynb` runs GWAS on your final genomic results.

## Additional resources
* `research_paper_code/notebooks/05_compare_associations.ipynb` computes precision, recall, and F1 scores for the final synthetic data.
* `synthetics/Optional_tune_synthetic_training_params` optionally uses Optuna to tune synthetic training parameters.
* `research_paper_code/notebooks/Manhattan plot.ipynb` generates Manhattan plots of the GWAS p-values for both the original and synthetic genome/phenome data.
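To illustrate the kind of comparison the precision, recall, and F1 notebook performs, here is a minimal sketch that treats the SNPs passing a significance threshold in the real-data GWAS as ground truth and scores the synthetic-data GWAS against them. This is an assumption about the general approach, not the notebook's actual code: the function name `association_scores`, the default threshold of `5e-8`, and the SNP identifiers are all hypothetical.

```python
def association_scores(real_pvalues, synthetic_pvalues, threshold=5e-8):
    """Score synthetic GWAS hits against real GWAS hits.

    Both arguments map a SNP identifier to its p-value. A SNP counts as a
    "hit" when its p-value is below `threshold` (a hypothetical default).
    Returns (precision, recall, f1).
    """
    real_hits = {snp for snp, p in real_pvalues.items() if p < threshold}
    synth_hits = {snp for snp, p in synthetic_pvalues.items() if p < threshold}

    # True positives: SNPs significant in both the real and synthetic results.
    true_pos = len(real_hits & synth_hits)

    precision = true_pos / len(synth_hits) if synth_hits else 0.0
    recall = true_pos / len(real_hits) if real_hits else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical p-values for four SNPs.
real = {"snp1": 1e-9, "snp2": 1e-10, "snp3": 0.2, "snp4": 3e-8}
synthetic = {"snp1": 2e-9, "snp2": 0.01, "snp3": 1e-9, "snp4": 1e-8}

precision, recall, f1 = association_scores(real, synthetic)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy example, two of the three synthetic hits also appear among the three real hits, so precision and recall are both 2/3. The appropriate significance threshold depends on the study design, so pass the one used in your analysis rather than relying on the default.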