{"id":16320084,"url":"https://github.com/helenalc/simulation-comparison","last_synced_at":"2025-09-21T19:31:41.429Z","repository":{"id":82290203,"uuid":"344418980","full_name":"HelenaLC/simulation-comparison","owner":"HelenaLC","description":"Snakemake workflow to benchmark scRNA-seq data simulators","archived":false,"fork":false,"pushed_at":"2022-08-10T10:06:52.000Z","size":517,"stargazers_count":13,"open_issues_count":1,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-04T10:01:57.016Z","etag":null,"topics":["benchmark","scrna-seq","simulation","snakemake"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HelenaLC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-03-04T09:28:12.000Z","updated_at":"2024-10-26T11:29:04.000Z","dependencies_parsed_at":null,"dependency_job_id":"372bbb67-8444-406b-a6f4-bba3b1ee4564","html_url":"https://github.com/HelenaLC/simulation-comparison","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/HelenaLC/simulation-comparison","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HelenaLC%2Fsimulation-comparison","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HelenaLC%2Fsimulation-comparison/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HelenaLC%2Fsimulation-comparison/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HelenaLC%2Fsimul
ation-comparison/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HelenaLC","download_url":"https://codeload.github.com/HelenaLC/simulation-comparison/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HelenaLC%2Fsimulation-comparison/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":276295004,"owners_count":25617998,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-21T02:00:07.055Z","response_time":72,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","scrna-seq","simulation","snakemake"],"created_at":"2024-10-10T22:29:03.418Z","updated_at":"2025-09-21T19:31:41.028Z","avatar_url":"https://github.com/HelenaLC.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# `Snakemake` workflow to benchmark \u003cbr\u003e scRNA-seq data simulators\n\n- [Setup](#setup)\n  - [Dependencies](#dependencies)\n  - [Structure](#structure)\n- [Workflow](#workflow)\n  - [Preprocessing](#preprocessing)\n  - [Simulation](#simulation)\n  - [Summaries](#summaries)\n  - [Statistics](#statistics)\n  - [Downstream](#downstream)\n    - [Clustering](#clustering)\n    - [Integration](#integration)\n  - [Visualization](#visualization)\n- [Customization](#customization)\n  - [Datasets](#datasets)\n  - [Methods](#methods)\n\n*** \n\n# Setup\n\n## 
Dependencies\n\nThe current code was implemented using R v4.1.0, Bioconductor v3.13, Snakemake v5.5.0, and Python v3.6.8. All R dependencies (from GitHub, CRAN and Bioconductor) are listed under *code/10-session_info.R* and may be installed using the command contained therein.\n\n## Structure\n\n* `config.yaml` specifies the R library and version to use\n* `code` contains all R scripts used in the *Snakemake* workflow\n* `data` contains raw, filtered and simulated scRNA-seq datasets,  \nas well as simulation parameter estimates\n* `meta` contains two *.json* files that specify simulation method (`methods.json`) and reference subset (`subsets.json`) configurations\n* `outs` contains all results from computations (typically `data.frame`s) as *.rds* files\n* `figs` contains all visual outputs as *.pdf* files, and corresponding `ggplot` objects as *.rds* files (for subsequent arrangement into 'super'-figures)\n\nSimulation methods are tagged with *one or more* of the following labels, according to which scenario(s) they can accommodate: \n\n* `n` for none: no clusters or batches\n* `b` for batch: multiple batches, no clusters\n* `k` for cluster: multiple clusters, no batches\n\nSimilarly, we tag subsets (see below) with *exactly one* of these labels. This allows each method to be run only on the subsets it is capable of simulating.\n\n***\n\n# Workflow\n\n![Schematic of the computational workflow used to benchmark scRNA-seq simulators. (1) Methods are grouped according to which level of complexity they can accommodate: type *n* ('singular'), *b* (batches), *k* (clusters). (2) Raw datasets are retrieved reproducibly from a public source, filtered, and subset into various datasets that serve as reference for (3) parameter estimation and simulation. (4) Various gene-level, cell-level and global summaries are computed from reference and simulated data, and (5) compared in a one- and two-dimensional setting using two statistics each. 
(6) Integration and clustering methods are applied to type *b* and *k* references and simulations, respectively, and relative performances are compared between reference-simulation and simulation-simulation pairs.](schematic.png)\n\n## Preprocessing\n\n**1. Data retrieval**\n\nEach `code/00-get_data-\u003cdataset_id\u003e.R` script retrieves a publicly available scRNA-seq dataset, from which a *SingleCellExperiment* is constructed and written to `data/00-raw/\u003cdataset_id\u003e.rds`.\n\n**2. Filtering**\n\n`code/01-fil_data.R` is applied to each raw dataset to \n\n  * remove batches, clusters, or batch-cluster instances with fewer than 50 cells (depending on the dataset's complexity)\n  * keep genes with a count of at least 1 in at least 10 cells, and remove cells with fewer than 100 detected genes \n  \nFiltered data are written to `data/01-fil/\u003cdataset_id\u003e.rds`.\n\n**3. Subsetting**\n\nBecause different methods can accommodate only some features (e.g. multiple batches or clusters, both or neither), `code/02-sub_data.R` creates specific subsets in `data/02-sub/\u003cdataset_id\u003e.\u003csubset_id\u003e.rds`. We term these *ref(erence)set*s (i.e. `\u003cdataset_id\u003e.\u003csubset_id\u003e = \u003crefset_id\u003e`), as they serve as the input reference data for simulation.\n\n## Simulation\n\n**1. Parameter estimation**\n\nSimulation parameters are estimated with `code/03-est_pars.R`, which in turn sources a `code/03-est_pars-\u003cmethod_id\u003e.R` script that executes a method's parameter estimation function(s). In cases where no separate estimation takes place, this returns `NULL`. Parameter estimates for each combination of `\u003crefset_id\u003e.\u003cmethod_id\u003e = \u003csimset_id\u003e` are written to `data/04-est/\u003csimset_id\u003e.rds`.\n\n**2. Data simulation**\n\nData are simulated with `code/04-sim_data.R`, which in turn sources a `code/04-sim_data-\u003cmethod_id\u003e.R` script that executes a method's simulation function. 
Simulations for each combination of `\u003crefset_id\u003e` and `\u003cmethod_id\u003e` are written to `data/05-sim/\u003crefset_id\u003e,\u003cmethod_id\u003e.rds`.\n\n## Summaries\n\nVarious quality control (QC) summaries are computed with `code/05-calc_qc.R`, which in turn sources a set of `code/05-calc_qc-\u003cmetric_id\u003e.R` scripts. QC results for reference and simulated data are written to `outs/qc_ref-\u003crefset_id\u003e,\u003cmetric_id\u003e.rds` and `outs/qc_sim-\u003csimset_id\u003e,\u003cmetric_id\u003e.rds`, respectively. Currently, we consider:\n\n**1. Gene-level**\n\n* `frq`: detection frequency (i.e., fraction of cells with non-zero counts)\n* `avg/var`: average/variance of logCPM\n* `cv`: coefficient of variation\n* `cor`: gene-to-gene correlation\n\n**2. Cell-level**\n\n* `frq`: detection frequency (i.e., fraction of genes with non-zero counts)\n* `lls`: log-transformed library size (total counts)\n* `cor`: cell-to-cell correlation\n* `pcd`: cell-to-cell distance (in PCA space)\n* `knn`: number of KNN occurrences\n* `ldf`: local density factor\n\n**3. Global**\n\n* `sw`: silhouette width (using batch/cluster labels as classes)\n* `cms`: cell-specific mixing score (using batch/cluster labels as batches)\n* `pve`: percent variance explained (of gene expression = logCPM, by batch/cluster)\n\nNotably, we compute each summary for different groupings of cells (depending on the dataset's complexity): \n\n1. globally, i.e. across all cells\n2. at the batch level, i.e. for each batch\n3. at the cluster level, i.e. for each cluster\n\nGlobal summaries are computed at the batch/cluster level only, as they require a grouping variable. \n\n## Statistics\n\nWe compare summaries between reference and simulated data in both one- (`code/06-stat_1d.R`) and two-dimensional settings (`code/06-stat_2d.R`). For the latter, every combination of gene- and cell-level metrics is considered, excluding correlations and global summaries. 
Furthermore, metrics are evaluated for each cell grouping, i.e. we perform a test globally, for each batch, and for each cluster (again, depending on the dataset's complexity). Test results are written to `outs/stat_1d,\u003crefset_id\u003e,\u003cmetric_id\u003e,\u003cstat1d_id\u003e.rds` for 1D, and `outs/stat_2d,\u003crefset_id\u003e,\u003cmetric1_id\u003e,\u003cmetric2_id\u003e,\u003cstat2d_id\u003e.rds` for 2D tests.\n\n**1. One-dimensional**\n\n* Kolmogorov-Smirnov (KS) test\n* Wasserstein metric\n\n**2. Two-dimensional**\n\n* two-dimensional KS test\n* Earth Mover's Distance (EMD)\n\n## Downstream\n\n### Integration\n\nEach `05-calc_batch-x.R` script wraps an integration method that is applied in `05-calc_batch.R` to the set of type *b* subsets. The output corrected assay data or integrated cell embeddings (depending on the method) are written to `outs/batch_ref/sim-\u003cref/simset_id\u003e,\u003cbatch_method\u003e.rds` for every reference and simulation, respectively. Results are evaluated by `06-eval_batch.R`, which computes the following set of metrics:\n\n- cell-specific mixing score (CMS)\n- difference in local density factor ($\\Delta$LDF) \n- batch correction score (BCS)\n\n### Clustering\n\nEach `05-calc_clust-x.R` script wraps a clustering method that is applied in `05-calc_clust.R` to the set of type *k* subsets. The output cluster assignments are written to `outs/clust_ref/sim-\u003cref/simset_id\u003e,\u003cclust_method\u003e.rds` for every reference and simulation, respectively. Results are evaluated by `06-eval_clust.R`, which computes the following set of metrics:\n\n- precision (P) and recall (R)\n- F1 score (harmonic mean of P and R)\n\n## Visualization\n\nFinally, results are collected across `refset_id`s and `method_id`s (jointly or separated by type), and visualized in various ways using a set of `07-plot_x.R` scripts. 
Output figures are written to `plts` as *.pdf* files, along with the corresponding `ggplot` objects as *.rds* files. Lastly, `08-fig_x.R` scripts are used to combine various `ggplot`s into figures that are saved to `figs` as *.pdf* files.\n\n***\n\n# Customization\n\n## Datasets\n\nIn principle, any dataset for which a `code/00-get_data-\u003cdataset_id\u003e.R` script exists will be accessible to the workflow. However, data will only be retrieved if the dataset appears in `meta/subsets.json`. Hence:\n\n### Removing\n\nTo exclude a dataset from the workflow, either i) (re)move the corresponding `code/00-get_data-\u003cdataset_id\u003e.R` script; or, ii) remove or comment out any associated `meta/subsets.json` entries.\n\n### Adding\n\nSimilarly, a new dataset can be added by supplying an adequate `code/00-get_data-\u003cdataset_id\u003e.R` script, and adding an entry to the `meta/subsets.json` configuration that specifies the subset ID, the number of genes/cells to sample (`NULL` for all), which batch(es)/cluster(s) to retain, as well as the resulting subset's type (one of n,b,k,g).\n\n## Methods\n\nThe *Snakemake* workflow will automatically include any simulation method for which both a `code/03-est_pars-\u003cmethod_id\u003e.R` and a `code/04-sim_data-\u003cmethod_id\u003e.R` script exist. In addition, `meta/methods.json` determines on which type(s) of dataset(s) each method should be run. Thus:\n\n### Removing\n\nTo exclude a method from the workflow, either i) set `\"\u003cmethod_id\u003e\": \"x\"` in the `meta/methods.json` file (or anything other than n,b,k,g); or, ii) (re)move the parameter estimation and/or simulation script from the `code` directory.\n\n### Adding\n\nAnalogous to the above, adding a method to the benchmark requires i) adding a `code/03-est_pars-\u003cmethod_id\u003e.R` and `code/04-sim_data-\u003cmethod_id\u003e.R` script; and, ii) adding an entry for the `method_id` to the `meta/methods.json` file. 
Importantly, the R script for parameter estimation should handle batches (`colData` column `batch`), clusters (`colData` column `cluster`), both, or neither, and the method's type(s) should be specified accordingly (`n` for neither, `b/k` for batches/clusters, `g` for groups), e.g. `\"\u003cmethod_id\u003e\": [\"n\", \"k\"]` for a method that supports 'singular' datasets, as well as ones with multiple clusters.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhelenalc%2Fsimulation-comparison","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhelenalc%2Fsimulation-comparison","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhelenalc%2Fsimulation-comparison/lists"}