{"id":20074751,"url":"https://github.com/greenelab/simulate-expression-compendia","last_synced_at":"2025-05-05T21:32:27.989Z","repository":{"id":53076980,"uuid":"189881087","full_name":"greenelab/simulate-expression-compendia","owner":"greenelab","description":"Evaluating the effect of technical sources of variability in large-scale gene expression compendia.","archived":false,"fork":false,"pushed_at":"2021-04-07T20:27:31.000Z","size":889355,"stargazers_count":6,"open_issues_count":1,"forks_count":4,"subscribers_count":5,"default_branch":"master","last_synced_at":"2023-08-21T19:10:51.115Z","etag":null,"topics":["analysis","dataset","methodology","software","supplement","tool"],"latest_commit_sha":null,"homepage":"https://doi.org/10.1101/2020.05.03.066597","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/greenelab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-06-02T18:28:59.000Z","updated_at":"2023-08-21T19:10:51.116Z","dependencies_parsed_at":"2022-09-10T04:23:06.344Z","dependency_job_id":null,"html_url":"https://github.com/greenelab/simulate-expression-compendia","commit_stats":null,"previous_names":[],"tags_count":0,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greenelab%2Fsimulate-expression-compendia","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greenelab%2Fsimulate-expression-compendia/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greenelab%2Fsimulate-expression-compendia/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/greenelab%2Fsimulate-expression-compendia/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/greenelab","download_url":"https://codeload.github.com/greenelab/simulate-expression-compendia/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224470624,"owners_count":17316705,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analysis","dataset","methodology","software","supplement","tool"],"created_at":"2024-11-13T14:54:09.699Z","updated_at":"2024-11-13T14:54:12.831Z","avatar_url":"https://github.com/greenelab.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Correcting for experiment-specific variability in expression compendia can remove underlying signals\n\n**Alexandra J Lee, YoSon Park, Georgia Doing, Deborah A Hogan and Casey S Greene**\n\n**University of Pennsylvania, Dartmouth College**\n\n[![PDF Manuscript](https://img.shields.io/badge/manuscript-PDF-blue.svg)](https://academic.oup.com/gigascience/article/9/11/giaa117/5952607)\n\nThis repository stores data and analysis modules to simulate compendia of gene expression data and measure the effect of technical sources of variation on our ability to extract an underlying biological signal.  \n\n**Motivation:** In the last two decades, scientists working in different labs have assayed gene expression from millions of samples. These experiments can be combined into a compendium and used to extract novel biological patterns. 
However, combining different experiments introduces technical variance, which could distort biological patterns and lead to misinterpretation. As the scale and prevalence of these compendia increase, it becomes crucial to evaluate how integrating multiple experiments affects our ability to detect biological patterns.\n\n**Objective:** To determine the extent to which underlying biological structures are masked by technical variation via simulation of a multi-experiment compendium.\n\n**Method:** We used a generative multi-layer neural network to simulate a compendium of P. aeruginosa gene expression experiments. We performed a pairwise comparison of the simulated compendium versus the simulated compendium containing varying numbers of sources of technical variation.\n\n**Results:** We found that it was difficult to detect the original biological structure of interest in a compendium containing some sources of technical variation unless we applied batch correction. Interestingly, as the number of sources of variation increased, it became easier to detect the original biological structure without correction. Furthermore, when we applied batch correction, it reduced our power to detect the biological structure of interest.\n\n**Conclusion:** When combining some sources of technical variation, it is best to perform batch correction. However, as the number of sources increases, batch correction becomes unnecessary and indeed harms our ability to extract biological patterns.\n\n**Citation:**\nFor more details about the analysis, see our paper published in GigaScience. 
The paper should be cited as:\n\u003e Alexandra J Lee, YoSon Park, Georgia Doing, Deborah A Hogan, Casey S Greene, Correcting for experiment-specific variability in expression compendia can remove underlying signals, GigaScience, Volume 9, Issue 11, November 2020, giaa117, https://doi.org/10.1093/gigascience/giaa117\n\n## Analysis Modules\n\nThere are 2 analyses using the *P. aeruginosa* dataset in the `Pseudomonas` directory and 2 analyses using the recount2 dataset in the `Human` directory:\n\n| Name | Description |\n| :--- | :---------- |\n| [Pseudomonas_sample_lvl_sim](Pseudomonas/Pseudomonas_sample_lvl_sim.ipynb) | Analysis notebook applying sample-level gene expression simulation to *P. aeruginosa* data|\n| [Pseudomonas_experiment_lvl_sim](Pseudomonas/Pseudomonas_experiment_lvl_sim.ipynb) | Analysis notebook applying experiment-level gene expression simulation to *P. aeruginosa* data|\n| [Human_sample_lvl_sim](Human/Human_sample_lvl_sim.ipynb) | Analysis notebook applying sample-level gene expression simulation to human (recount2) data|\n| [Human_experiment_lvl_sim](Human/Human_experiment_lvl_sim.ipynb) | Analysis notebook applying experiment-level gene expression simulation to human (recount2) data|\n\n\n## Usage\n\n**How to run notebooks from simulate-expression-compendia**\n\n*Operating Systems:* macOS, Linux\n\nTo run the analysis notebooks in this repository, perform the following steps:\n\nFirst you need to set up your local repository: \n1. Download and install [GitHub's large file tracker](https://git-lfs.github.com/).\n2. Install [miniconda](https://docs.conda.io/en/latest/miniconda.html).\n3. Clone the `simulate-expression-compendia` repository by running the following command in the terminal:\n```\ngit clone https://github.com/greenelab/simulate-expression-compendia.git\n```\nNote: Git automatically detects the LFS-tracked files and clones them via HTTP.\n4. 
Navigate into the cloned repo by running the following command in the terminal:\n```\ncd simulate-expression-compendia\n```\n5. Set up the conda environment by running the following commands in the terminal:\n```bash\n# conda version 4.6.12\nconda env create -f environment.yml\n\nconda activate simulate_expression_compendia\n\npip install -e .\n```\n6. Navigate to either the `Pseudomonas` or `Human` directory and run the notebooks.\n\n\n**How to analyze your own data**\n\nTo run this simulation on your own gene expression data, perform the following steps:\n\nFirst you need to set up your local repository and environment: \n1. Download and install [GitHub's large file tracker](https://git-lfs.github.com/).\n2. Install [miniconda](https://docs.conda.io/en/latest/miniconda.html).\n3. Clone the `simulate-expression-compendia` repository by running the following command in the terminal:\n```\ngit clone https://github.com/greenelab/simulate-expression-compendia.git\n```\nNote: Git automatically detects the LFS-tracked files and clones them via HTTP.\n4. Navigate into the cloned repo by running the following command in the terminal:\n```\ncd simulate-expression-compendia\n```\n5. Set up the conda environment by running the following commands in the terminal:\n```bash\n# conda version 4.6.12\nconda env create -f environment.yml\n\nconda activate simulate_expression_compendia\n\npip install -e .\n```\n6. Create a new analysis folder in the main directory. This is equivalent to the `Pseudomonas` directory.\n7. Copy `Pseudomonas_sample_lvl_sim.ipynb` or `Pseudomonas_experiment_lvl_sim.ipynb` into your analysis folder, depending on whether you would like to use the sample-level (see [simulate_by_random_sampling()](https://github.com/greenelab/ponyo/blob/master/ponyo/simulate_expression_data.py)) or experiment-level (see [simulate_by_latent_transformation()](https://github.com/greenelab/ponyo/blob/master/ponyo/simulate_expression_data.py)) simulation approach.\n8. 
Within your analysis folder, create a `data/` directory with `input/` and `metadata/` subdirectories.\n\nNext, modify the code for your analysis:\n1. Create a configuration file in `configs/` using the parameters outlined below.\n2. Update the analysis notebooks to use your config file (see below) and input file.\n3. Add your gene expression data file to the `data/input/` directory. Your data is expected to be stored as a tab-delimited dataset with samples as rows and genes as columns. Your input data is also expected to be 0-1 normalized per gene. If your data needs to be normalized or transposed, there are functions to do this in [ponyo/utils](https://github.com/greenelab/ponyo/blob/master/ponyo/utils.py).\n4. Add your metadata file to the `data/metadata/` directory. Your metadata is expected to be stored as a tab-delimited file with sample ids matching the gene expression dataset as one column and experiment ids as another.\n5. Run the notebooks.\n\n**Additional customization**\n\nFurther customization can be accomplished by doing the following:\n\n1. The `apply_correction_io` function in the `generate_data_parallel.py` file can be modified to use a different correction method.\n2. If your data needs additional pre-processing steps, these can be added as modules in the `pipeline.py` file and called in the analysis notebook.\n\n**Configuration file**\n\nThe table below lists the parameters required to run the analysis in this repository.\n\nNote: Some of these parameters are required by the imported [ponyo](https://github.com/greenelab/ponyo) modules. \n\n| Name | Description |\n| :--- | :---------- |\n| local_dir| str: Parent directory on local machine to store intermediate results.|\n| scaler_transform_file| str: File name to store mapping from normalized to raw gene expression range. This is an intermediate file that gets generated. 
This file is generated in the `normalize_expression_data()` function from this [ponyo script](https://github.com/greenelab/ponyo/blob/master/ponyo/train_vae_modules.py).|\n| dataset_name| str: Name for analysis directory. Either \"Human\" or \"Pseudomonas\". If you created a new analysis directory, this is the name of the directory created in step 6 above.|\n| simulation_type | str: \"sample_lvl_sim\" (simulation based on randomly sampling the latent space) or \"experiment_lvl_sim\" (simulation based on shifting in the latent space).|\n| NN_architecture | str: Name of neural network architecture to use. Format 'NN_\u003cintermediate layer\u003e_\u003clatent layer\u003e'.|\n| learning_rate| float: Step size used for gradient descent. In other words, how quickly the method learns.|\n| batch_size | int: Training is performed in batches, so this determines the number of samples to consider at a given time.|\n| epochs | int: Number of times to train over the entire input dataset.|\n| kappa | float: How fast to linearly ramp up KL loss.|\n| intermediate_dim| int: Size of the hidden layer.|\n| latent_dim | int: Size of the bottleneck layer.|\n| epsilon_std | float: Standard deviation of Normal distribution to sample latent space.|\n| validation_frac | float: Fraction of input samples to use for validation during VAE training.|\n| num_simulated_samples | int: Simulate a compendium with this number of samples. Used if simulation_type == \"sample_lvl_sim\"|\n| num_simulated_experiments| int: Simulate a compendium with this number of experiments. Used if simulation_type == \"experiment_lvl_sim\"|\n| lst_num_experiments | list: List of different numbers of experiments to add to simulated compendium. These are the numbers of sources of technical variation that are added to the simulated compendium.|\n| lst_num_partitions | list: List of different numbers of partitions to add to simulated compendium.  
These are the numbers of sources of technical variation that are added to the simulated compendium.|\n| use_pca | bool: True to represent expression data by its top PCs before calculating SVCCA similarity.|\n| num_PCs | int: Number of top PCs to use to represent expression data. Used if use_pca == True.|\n| correction_method | str: Noise correction method to use. Either \"limma\" or \"combat\".|\n| metadata_colname | str: Column header that contains sample id that maps expression data and metadata.|\n| iterations | int: Number of simulations to run.|\n| num_cores | int: Number of processing cores to use.|\n\n## Acknowledgements\nWe would like to thank YoSon Park, David Nicholson, Ben Heil and Ariel Hippen-Anderson for insightful discussions and code review.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgreenelab%2Fsimulate-expression-compendia","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgreenelab%2Fsimulate-expression-compendia","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgreenelab%2Fsimulate-expression-compendia/lists"}