{"id":24527244,"url":"https://github.com/biobakery/sparsedossa2","last_synced_at":"2026-01-03T05:03:31.590Z","repository":{"id":104688567,"uuid":"219829612","full_name":"biobakery/SparseDOSSA2","owner":"biobakery","description":null,"archived":false,"fork":false,"pushed_at":"2024-12-02T01:34:24.000Z","size":23361,"stargazers_count":13,"open_issues_count":5,"forks_count":2,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-15T16:31:18.702Z","etag":null,"topics":["biobakery","bioconductor","public","tools"],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/biobakery.png","metadata":{"files":{"readme":".github/README.md","changelog":"NEWS","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-11-05T19:05:37.000Z","updated_at":"2024-12-05T23:31:04.000Z","dependencies_parsed_at":"2024-03-27T00:31:11.826Z","dependency_job_id":"79b7bea7-b40e-4426-914f-34ee14138b7e","html_url":"https://github.com/biobakery/SparseDOSSA2","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/biobakery/SparseDOSSA2","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/biobakery%2FSparseDOSSA2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/biobakery%2FSparseDOSSA2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/biobakery%2FSparseDOSSA2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/biobakery%2FSparseDOSSA2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/owners/biobakery","download_url":"https://codeload.github.com/biobakery/SparseDOSSA2/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/biobakery%2FSparseDOSSA2/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28183989,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2026-01-03T02:00:06.471Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["biobakery","bioconductor","public","tools"],"created_at":"2025-01-22T06:17:43.308Z","updated_at":"2026-01-03T05:03:31.561Z","avatar_url":"https://github.com/biobakery.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# \"Simulating realistic microbial observations with SparseDOSSA2\" \n\nAuthor Name: \"Siyuan Ma\"  \nAffiliation: Harvard T.H. Chan School of Public Health.  \nBroad Institute email: siyuan.ma@pennmedicine.upenn.edu \n\nTutorial: https://github.com/biobakery/biobakery/wiki/SparseDOSSA2\n\n# Introduction\nSparseDOSSA2 is an R package for fitting to and simulating realistic microbial abundance observations. It provides functionalities for: a) generation of realistic synthetic microbial observations, b) spiking-in of associations with metadata variables, e.g. for 
benchmarking or power analysis purposes, and c) fitting the SparseDOSSA2 model to real-world microbial abundance observations, which can then be used for a). This vignette is intended to provide working examples for these functionalities.\n\n```\nlibrary(SparseDOSSA2)\n# tidyverse packages for utilities\nlibrary(magrittr)\nlibrary(dplyr)\nlibrary(ggplot2)\n```\n\n# Installation\nSparseDOSSA2 is a Bioconductor package and can be installed via the following command.\n```\n# if (!requireNamespace(\"BiocManager\", quietly = TRUE))\n#     install.packages(\"BiocManager\")\n# BiocManager::install(\"SparseDOSSA2\")\n```\n# Simulating realistic microbial observations with SparseDOSSA2\nThe most important functionality of SparseDOSSA2 is the simulation of realistic synthetic microbial observations. To this end, SparseDOSSA2 provides three pre-trained templates, \"Stool\", \"Vaginal\", and \"IBD\", targeting continuous, discrete, and diseased population structures.\n```\nStool_simulation \u003c- SparseDOSSA2(template = \"Stool\", \n                                 n_sample = 100, \n                                 n_feature = 100,\n                                 verbose = TRUE)\nVaginal_simulation \u003c- SparseDOSSA2(template = \"Vaginal\", \n                                   n_sample = 100, \n                                   n_feature = 100,\n                                   verbose = TRUE)\n```\n\n# Fitting to microbiome datasets with SparseDOSSA2\nSparseDOSSA2 provides two functions, fit_SparseDOSSA2 and fitCV_SparseDOSSA2, to fit the SparseDOSSA2 model to microbial count or relative abundance observations. As input, these functions require a feature-by-sample table of microbial abundance observations. 
We provide with SparseDOSSA2 a minimal example of such a dataset: a five-by-five subset of the HMP1-II stool study.\n```\ndata(\"Stool_subset\", package = \"SparseDOSSA2\")\n# columns are samples.\nStool_subset[1:2, 1, drop = FALSE]\n```\n\n## Fitting SparseDOSSA2 model with fit_SparseDOSSA2\nfit_SparseDOSSA2 fits the SparseDOSSA2 model to estimate the model parameters: per-feature prevalence, mean and standard deviation of non-zero abundances, and feature-feature correlations. It also estimates the joint distribution of these parameters and (if the input is count data) a read count distribution.\n```\nfitted \u003c- fit_SparseDOSSA2(data = Stool_subset,\n                           control = list(verbose = TRUE))\n# fitted mean log non-zero abundance values of the first two features\nfitted$EM_fit$fit$mu[1:2]\n```\n\n## Fitting SparseDOSSA2 model with fitCV_SparseDOSSA2\nThe user can additionally achieve optimal model fitting via fitCV_SparseDOSSA2. They can either provide a vector of tuning parameter values (lambdas) to control sparsity in the estimation of the correlation matrix parameter, or let a grid be selected automatically. fitCV_SparseDOSSA2 uses cross validation to select an \"optimal\" model fit across these tuning parameters via average testing log-likelihood. This is a computationally intensive procedure, best suited for users who want an accurate fit to the input dataset in order to simulate new microbial observations on the same features as the input (i.e. 
not new features).\n```\nset.seed(1)\nfitted_CV \u003c- fitCV_SparseDOSSA2(data = Stool_subset,\n                                lambdas = c(0.1, 1),\n                                K = 2,\n                                control = list(verbose = TRUE))\n# the average log likelihood of different tuning parameters\napply(fitted_CV$EM_fit$logLik_CV, 2, mean)\n# The second lambda (1) had better performance in terms of log likelihood,\n# and will be selected as the default fit.\n```\n\n# Parallelization controls with future\nSparseDOSSA2 internally uses the future package to allow for parallel computation. The user can thus specify parallelization through future's interface. See the reference manual for future for more details. This is particularly useful when fitting SparseDOSSA2 in a high-performance computing environment.\n```\n## regular fitting \n# system.time(fitted_regular \u003c- \n#               fit_SparseDOSSA2(data = Stool_subset,\n#                                control = list(verbose = FALSE)))\n## parallel fitting with future:\n# future::plan(future::multisession())\n# system.time(fitted_parallel \u003c- \n#               fit_SparseDOSSA2(data = Stool_subset,\n#                                control = list(verbose = FALSE)))\n\n## For CV fitting, there are three components that can be parallelized, in order:\n## different cross validation folds, different tuning parameter lambdas, \n## and different samples. 
It is usually most efficient to parallelize at the\n## sample level:\n# system.time(fitted_regular_CV \u003c-\n#               fitCV_SparseDOSSA2(data = Stool_subset,\n#                                  lambdas = c(0.1, 1),\n#                                  K = 2,\n#                                  control = list(verbose = TRUE)))\n# future::plan(future::sequential(), future::sequential(), future::multisession())\n# system.time(fitted_parallel_CV \u003c-\n#               fitCV_SparseDOSSA2(data = Stool_subset,\n#                                  lambdas = c(0.1, 1),\n#                                  K = 2,\n#                                  control = list(verbose = TRUE)))\n```\n\n# Sessioninfo\n```\nsessionInfo()\n\nR version 3.6.2 (2019-12-12)\nPlatform: x86_64-apple-darwin15.6.0 (64-bit)\nRunning under: macOS Mojave 10.14.6\n\nMatrix products: default\nBLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib\nLAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib\n\nlocale:\n[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8\n\nattached base packages:\n[1] stats     graphics  grDevices utils     datasets  methods   base     \n\nother attached packages:\n [1] SparseDOSSA2_0.99.0 Rmpfr_0.8-2         gmp_0.6-1           igraph_1.2.6       \n [5] truncnorm_1.0-8     magrittr_2.0.1      future.apply_1.7.0  future_1.21.0      \n [9] huge_1.3.4.1        mvtnorm_1.1-1       ks_1.11.7           BiocCheck_1.22.0   \n\nloaded via a namespace (and not attached):\n [1] Rcpp_1.0.5          compiler_3.6.2      BiocManager_1.30.10 bitops_1.0-6       \n [5] tools_3.6.2         digest_0.6.27       mclust_5.4.7        jsonlite_1.7.2     \n [9] lattice_0.20-41     pkgconfig_2.0.3     Matrix_1.2-18       graph_1.64.0       \n[13] curl_4.3            parallel_3.6.2      xfun_0.20           stringr_1.4.0      \n[17] httr_1.4.2          knitr_1.30          globals_0.14.0     
 stats4_3.6.2       \n[21] grid_3.6.2          getopt_1.20.3       optparse_1.6.6      Biobase_2.46.0     \n[25] listenv_0.8.0       R6_2.5.0            parallelly_1.23.0   XML_3.99-0.3       \n[29] RBGL_1.62.1         codetools_0.2-18    biocViews_1.54.0    BiocGenerics_0.32.0\n[33] MASS_7.3-53         stringdist_0.9.6.3  RUnit_0.4.32        KernSmooth_2.23-18 \n[37] stringi_1.5.3       RCurl_1.98-1.2     \n```\n\n# Contributions\nThanks go to these wonderful people:\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbiobakery%2Fsparsedossa2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbiobakery%2Fsparsedossa2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbiobakery%2Fsparsedossa2/lists"}