{"id":13773973,"url":"https://github.com/favilaco/deconv_benchmark","last_synced_at":"2025-05-11T06:31:51.082Z","repository":{"id":50333254,"uuid":"233047193","full_name":"favilaco/deconv_benchmark","owner":"favilaco","description":null,"archived":false,"fork":false,"pushed_at":"2020-11-21T10:01:30.000Z","size":509,"stargazers_count":51,"open_issues_count":1,"forks_count":20,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-02-15T07:32:10.721Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/favilaco.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-01-10T12:57:26.000Z","updated_at":"2024-01-22T21:07:24.000Z","dependencies_parsed_at":"2022-09-21T02:24:02.657Z","dependency_job_id":null,"html_url":"https://github.com/favilaco/deconv_benchmark","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/favilaco%2Fdeconv_benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/favilaco%2Fdeconv_benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/favilaco%2Fdeconv_benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/favilaco%2Fdeconv_benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/favilaco","download_url":"https://codeload.github.com/favilaco/deconv_benchmark/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories
_count":253528362,"owners_count":21922623,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T17:01:22.505Z","updated_at":"2025-05-11T06:31:50.768Z","avatar_url":"https://github.com/favilaco.png","language":"R","readme":"Source code (R statistical programming language, v3.6) to reproduce the results described in the article:\n\n\u003e *Avila Cobos F, Alquicira-Hernandez J, Powell JE, Mestdagh P and De Preter K.* **Benchmarking of cell type deconvolution pipelines for transcriptomics data.** *(Nature Communications; https://doi.org/10.1038/s41467-020-19015-1)*\n\nDATASETS\n========\nHere we provide an **example folder** (named \"example\"; see *\"Folder requirements \u0026 running the deconvolution\"*) that can be directly used. 
It contains an artificial single-cell RNA-seq dataset made of 5 artificial cell types, with 200 cells per cell type and 80 genes.\n\nThe **other five external datasets** (together with the necessary metadata) can be downloaded from their respective sources:\n\n* Baron: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84133 (specifically, GSM2230757 to GSM2230760 for human pancreatic islets)\n* GSE81547: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81547\n* E-MTAB-5061: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-5061/\n* PBMCs: https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/fresh_68k_pbmc_donor_a \n* kidney.HCL: https://figshare.com/articles/HCL_DGE_Data/7235471\n\nRegarding E-MTAB-5061: cells labelled \"not_applicable\", \"unclassified\" or \"co-expression_cell\" were excluded, and only cells from six healthy (non-diabetic) patients were kept.\n\nOn a fresh Linux (Debian) installation, the following line is needed first:\n`sudo apt-get install curl libcurl4-openssl-dev libssl-dev zlib1g-dev r-base-dev libxml2-dev`\n\n\nR 3.6.0: REQUIRED PACKAGES AND PACKAGE DEPENDENCIES:\n===================================================\nRun the following code before any deconvolution (requires **R \u003e= 3.6.0**):\n```\npackages \u003c- c(\"devtools\", \"BiocManager\",\"data.table\",\"ggplot2\",\"tidyverse\",\n\t\t\t  \"Matrix\",\"matrixStats\",\n\t\t\t  \"gtools\",\n\t\t\t  \"foreach\",\"doMC\",\"doSNOW\", #for parallelism\n\t\t\t  \"Seurat\",\"sctransform\", #sc-specific normalization\n\t\t\t  \"nnls\",\"FARDEEP\",\"MASS\",\"glmnet\",\"ComICS\",\"dtangle\") #bulk deconvolution methods\n\nfor (i in packages){ install.packages(i, character.only = TRUE)}\n\n# Installation using BiocManager:\n# Some packages that didn't work with install.packages (e.g. 
may not be present in a CRAN repository chosen by the user)\npackages3 = c('limma','edgeR','DESeq2','pcaMethods','BiocParallel','preprocessCore','scater','SingleCellExperiment','Linnorm','DeconRNASeq','multtest','GSEABase','annotate','genefilter','preprocessCore','graph','MAST','Biobase') #last two are required by DWLS and MuSiC, respectively.\nfor (i in packages3){ BiocManager::install(i, character.only = TRUE)}\n\n# Dependencies for CellMix: 'NMF', 'csSAM', 'GSEABase', 'annotate', 'genefilter', 'preprocessCore', 'limSolve', 'corpcor', 'graph', 'BiocInstaller'\npackages2 = c('NMF','csSAM','limSolve','corpcor')\nfor (i in packages2){ install.packages(i, character.only = TRUE)}\n\n# Special instructions for CellMix and DSA\ninstall.packages(\"BiocInstaller\", repos=\"http://bioconductor.org/packages/3.7/bioc/\")\nsystem('wget http://web.cbio.uct.ac.za/~renaud/CRAN/src/contrib/CellMix_1.6.2.tar.gz')\nsystem(\"R CMD INSTALL CellMix_1.6.2.tar.gz\")\nsystem('wget https://github.com/zhandong/DSA/raw/master/Package/version_1.0/DSA_1.0.tar.gz')\nsystem(\"R CMD INSTALL DSA_1.0.tar.gz\")\n\n# Following packages come from Github\ndevtools::install_github(\"GfellerLab/EPIC\", build_vignettes=TRUE) #requires knitr\ndevtools::install_github(\"xuranw/MuSiC\") \ndevtools::install_bitbucket(\"yuanlab/dwls\", ref=\"default\")\ndevtools::install_github(\"meichendong/SCDC\")\ndevtools::install_github(\"rosedu1/deconvSeq\")\ndevtools::install_github(\"cozygene/bisque\")\ndevtools::install_github(\"dviraran/SingleR@v1.0\")\n```\n\nUsers interested in the **generation of pseudo-bulk mixtures from scRNA-seq data** can use the *\"Generator\"* function that is located inside **helper_functions.R**\n\n\nReferences to other methods included in our benchmark:\n======================================================\nWhile our work has a **BSD (3-clause)** license, you **may need** to obtain a license to use the individual normalization/deconvolution methods (e.g. CIBERSORT. 
The source code for CIBERSORT must be requested from its authors at https://cibersort.stanford.edu).\n\n| method | ref |\n|--------|----------|\n| OLS | Chambers, J., Hastie, T. \u0026 Pregibon, D. Statistical Models in S. in Compstat (eds. Momirović, K. \u0026 Mildner, V.) 317–321 (Physica-Verlag HD, 1990). doi:10.1007/978-3-642-50096-1_48 |\n| nnls | Mullen, K. M. \u0026 van Stokkum, I. H. M. nnls: The Lawson-Hanson algorithm for non-negative least squares (NNLS). R package version 1.4. https://CRAN.R-project.org/package=nnls |\n| FARDEEP | Hao, Y., Yan, M., Lei, Y. L. \u0026 Xie, Y. Fast and Robust Deconvolution of Tumor Infiltrating Lymphocyte from Expression Profiles using Least Trimmed Squares. bioRxiv 358366 (2018) doi:10.1101/358366 |\n| MASS: Robust linear regression (RLR) | Ripley, B. et al. MASS: Support Functions and Datasets for Venables and Ripley’s MASS. (2002) |\n| DeconRNASeq | Gong, T. \u0026 Szustakowski, J. D. DeconRNASeq: a statistical framework for deconvolution of heterogeneous tissue samples based on mRNA-Seq data. Bioinforma. Oxf. Engl. 29, 1083–1085 (2013) |\n| CellMix: DSA, ssKL, ssFrobenius  | Gaujoux, R. \u0026 Seoighe, C. CellMix: a comprehensive toolbox for gene expression deconvolution. Bioinformatics 29, 2211–2212 (2013) |\n| DCQ | Altboum, Z. et al. Digital cell quantification identifies global immune cell dynamics during influenza infection. Mol. Syst. Biol. 10, 720 (2014) |\n| glmnet: lasso, ridge, elastic net | Friedman, J., Hastie, T. \u0026 Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 33, 1–22 (2010) |\n| EPIC | Racle, J., Jonge, K. de, Baumgaertner, P., Speiser, D. E. \u0026 Gfeller, D. Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data. eLife 6, e26476 (2017) |\n| dtangle | Hunt, G. J., Freytag, S., Bahlo, M. \u0026 Gagnon-Bartsch, J. A. dtangle: accurate and robust cell type deconvolution. 
Bioinformatics 35, 2093–2099 (2019) |\n| CIBERSORT | Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457 (2015) |\n|--------|----------|\n| BisqueRNA | Jew, B. et al. Accurate estimation of cell composition in bulk expression through robust integration of single-cell information. bioRxiv 669911 (2019) doi:10.1101/669911 |\n| deconvSeq | Du, R., Carey, V. \u0026 Weiss, S. T. deconvSeq: deconvolution of cell mixture distribution in sequencing data. Bioinformatics doi:10.1093/bioinformatics/btz444 |\n| DWLS | Tsoucas, D. et al. Accurate estimation of cell-type composition from gene expression data. Nat. Commun. 10, 1–9 (2019) |\n| MuSiC | Wang, X., Park, J., Susztak, K., Zhang, N. R. \u0026 Li, M. Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nat. Commun. 10, 380 (2019) |\n| SCDC | Dong, M. et al. SCDC: Bulk Gene Expression Deconvolution by Multiple Single-Cell RNA Sequencing References. Briefings in Bioinformatics (2020), bbz166, https://doi.org/10.1093/bib/bbz166 |\n|--------|----------|\n| SCTransform / regularized negative binomial regression (RNBR) | Hafemeister, C. \u0026 Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biology (2019) doi:10.1186/s13059-019-1874-1 |\n| Linnorm | Yip, S. H., Wang, P., Kocher, J.-P. A., Sham, P. C. \u0026 Wang, J. Linnorm: improved statistical analysis for single cell RNA-seq expression data. Nucleic Acids Res. 45, e179–e179 (2017) |\n| scran | L. Lun, A. T., Bach, K. \u0026 Marioni, J. C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016) |\n| scater | McCarthy, D. J., Campbell, K. R., Lun, A. T. L. \u0026 Wills, Q. F. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. 
Bioinformatics 33, 1179–1186 (2017) |\n| Quantile normalization (QN) | Bolstad, B. M., Irizarry, R. A., Åstrand, M. \u0026 Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193 (2003) |\n| Upper quartile (UQ) | Bullard, J. H., Purdom, E., Hansen, K. D. \u0026 Dudoit, S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11, 94 (2010) |\n| Trimmed mean of M-values (TMM) | Robinson, M. D. \u0026 Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010) |\n| Transcripts per million (TPM) | Li, B., Ruotti, V., Stewart, R. M., Thomson, J. A. \u0026 Dewey, C. N. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26, 493–500 (2010) |\n| LogNormalize | LogNormalize function (part of \"Seurat\"). R Documentation. https://www.rdocumentation.org/packages/Seurat/versions/3.1.1/topics/LogNormalize ; Butler, A., Hoffman, P., Smibert, P. et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 36, 411–420 (2018) doi:10.1038/nbt.4096 |\n| Variance stabilization transformation (VST) \u0026 Median of ratios | Anders, S. \u0026 Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010) |\n\n\nFOLDER REQUIREMENTS \u0026 RUNNING THE DECONVOLUTION\n===============================================\n\na) Folder structure:\n```\n.\n├── example\n│   ├── example.rds\n│   └── example_phenoData.txt\n├── baron\n│   ├── sc_baron.rds\n│   └── baron_phenoData.txt\n├── GSE81547\n│   ├── sc_GSE81547.rds\n│   └── GSE81547_phenoData.txt\n...\n\n├── helper_functions.R\n├── Master_deconvolution.R\n└── CIBERSORT.R\n```\n\nb) The metadata must contain at least the following (tab-separated) columns: \"cellID\", \"cellType\", \"sampleID\". 
Optionally, other columns may be present (e.g. \"gender\",\"disease\").\n\n```\n# For the baron dataset, it should look like:\n\n\t\t     cellID  cellType sampleID\nhuman1_lib3.final_cell_0178     delta   human1\nhuman1_lib2.final_cell_0498     delta   human1\n...\n```\n\nc) Each single-cell RNA-seq input (\"sc_input\") dataset is an integer matrix with gene names as rows and cellIDs as columns.\n\n\nd) Make the following choices:\n```\n\ti) a specific dataset (from \"example\",\"baron\",\"GSE81547\",\"E-MTAB-5061\",\"PBMCs\")\n\tii) data transformation (from \"none\",\"log\",\"sqrt\",\"vst\"); with \"none\" meaning linear scale\n\tiii) type of deconvolution method (from \"bulk\",\"sc\")\n\t\tiii.1) For \"bulk\" methods:\n\t\t\tiii.1.1) choose normalization method among: \"column\",\"row\",\"mean\",\"column_z-score\",\"global_z-score\",\"column_min-max\",\"global_min-max\",\"LogNormalize\",\"QN\",\"TMM\",\"UQ\", \"median_ratios\", \"TPM\"\n\t\t\tiii.1.2) choose marker selection strategy among: \"all\", \"pos_fc\", \"top_50p_logFC\", \"bottom_50p_logFC\", \"top_50p_AveExpr\", \"bottom_50p_AveExpr\", \"top_n2\", \"random5\" (see main manuscript for more details).\n\t\t\tiii.1.3) choose deconvolution method among: \"CIBERSORT\",\"DeconRNASeq\",\"OLS\",\"nnls\",\"FARDEEP\",\"RLR\",\"DCQ\",\"elastic_net\",\"lasso\",\"ridge\",\"EPIC\",\"DSA\",\"ssKL\",\"ssFrobenius\",\"dtangle\".\n\n\t\tiii.2) For \"sc\" methods:\n\t\t\tiii.2.1) choose normalization method for both the reference matrix (scC) and the pseudo-bulk matrix (scT) among: \"column\",\"row\",\"mean\",\"column_z-score\",\"global_z-score\",\"column_min-max\",\"global_min-max\",\"LogNormalize\",\"QN\",\"TMM\",\"UQ\", \"median_ratios\", \"TPM\", \"SCTransform\",\"scran\",\"scater\",\"Linnorm\" (last 4 are single-cell-specific)\n\t\t\tiii.2.2) 
choose deconvolution method among: \"MuSiC\",\"BisqueRNA\",\"DWLS\",\"deconvSeq\",\"SCDC\"\n\n\tiv) Number of cells to be used to make the pseudo-bulk mixtures (multiple of 100)\n\tv) Cell type to be removed from the reference matrix (\"none\" for the full matrix; this is dataset dependent: e.g. \"alpha\" from baron dataset)\n\tvi) Number of available cores (by default 1; can be increased if more resources are available)\n```\n\nR example calls\n===============\n\nFor bulk:\n---------\n\n```\n# With the example we provided with this repository + no cell type removed:\nRscript Master_deconvolution.R example none bulk TMM all nnls 100 none 1\n\t#Expected output:\n\t#        RMSE   Pearson\n\t#1     0.0351    0.9866\n\n\n# With the example we provided with this repository + \"cell_type_1\" removed:\nRscript Master_deconvolution.R example none bulk TMM all nnls 100 cell_type_1 1\n\t#Expected output:\n\t#       RMSE   Pearson\n\t#1    0.1038    0.9379\n\n\n# With baron (or GSE81547, E-MTAB-5061, PBMCs) + no cell type removed:\nRscript Master_deconvolution.R baron none bulk TMM all nnls 100 none 1\n\t#Expected output:\n\t#       RMSE   Pearson\n\t#1    0.0724    0.8961\n\n\n# With baron + delta cells removed:\nRscript Master_deconvolution.R baron none bulk TMM all nnls 100 delta 1\n\t#Expected output:\n\t#        RMSE   Pearson\n\t#1     0.0887    0.8197\n```\n\n\nFor single-cell:\n----------------\n\n```\n# With the example we provided with this repository + no cell type removed:\nRscript Master_deconvolution.R example none sc TMM TMM MuSiC 100 none 1\n\t#Expected output:\n\t#        RMSE   Pearson\n\t#1     0.0351    0.9866\n\n\n# With the example we provided with this repository + \"cell_type_1\" removed:\nRscript Master_deconvolution.R example none sc TMM TMM MuSiC 100 cell_type_1 1\n\t#Expected output:\n\t#       RMSE   Pearson\n\t#1    0.1044    0.9376\n\n\n# With baron (or GSE81547, E-MTAB-5061, PBMCs) + no cell type removed:\nRscript Master_deconvolution.R baron none 
sc TMM TMM MuSiC 100 none 1\n\t#Expected output:\n\t#        RMSE   Pearson\n\t#1     0.0488     0.953\n\n\n# With baron + delta cells removed:\nRscript Master_deconvolution.R baron none sc TMM TMM MuSiC 100 delta 1\n\t#Expected output:\n\t#        RMSE   Pearson\n\t#1      0.073    0.8799\n```\n\nsessionInfo() files Linux \u0026 macOS\n----------------------------------\nPlease see \"sessionInfo_Linux.txt\" and \"sessionInfo_macOS.txt\" in this repository.\n","funding_links":[],"categories":["RNA-seq"],"sub_categories":["Cell-Type Deconvolution"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffavilaco%2Fdeconv_benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffavilaco%2Fdeconv_benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffavilaco%2Fdeconv_benchmark/lists"}