{"id":32611471,"url":"https://github.com/ctlab/samovar","last_synced_at":"2025-10-30T13:59:10.235Z","repository":{"id":216739066,"uuid":"741952211","full_name":"ctlab/samovar","owner":"ctlab","description":"Algorithms package for generating model metagenomes with specified properties","archived":false,"fork":false,"pushed_at":"2025-06-07T13:12:21.000Z","size":36283,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-06-07T14:20:55.885Z","etag":null,"topics":["artificial-data","metagenomics"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ctlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-01-11T13:02:31.000Z","updated_at":"2025-06-07T13:12:25.000Z","dependencies_parsed_at":"2024-01-19T04:28:35.939Z","dependency_job_id":"14b6e5dc-ad7e-4ce2-853c-5b46a5e94086","html_url":"https://github.com/ctlab/samovar","commit_stats":null,"previous_names":["dsmutin/samovar"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ctlab/samovar","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ctlab%2Fsamovar","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ctlab%2Fsamovar/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ctlab%2Fsamovar/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ctlab%2Fsamovar/manifests","owner_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/owners/ctlab","download_url":"https://codeload.github.com/ctlab/samovar/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ctlab%2Fsamovar/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281818072,"owners_count":26566859,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-30T02:00:06.501Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-data","metagenomics"],"created_at":"2025-10-30T13:58:12.185Z","updated_at":"2025-10-30T13:59:10.226Z","avatar_url":"https://github.com/ctlab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SamovaR \u003ca href=\"\"\u003e\u003cimg src=\"data/img/logos/logo_stable.png\" align=\"right\" width=\"150\" \u003e\u003c/a\u003e \n### Automated re-profiling \u0026 benchmarking of metagenomic tools based on artificial data generation\n\n\n[![R package](https://github.com/ctlab/samovar/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/ctlab/samovar/actions/workflows/R-CMD-check.yaml)\n[![python package](https://github.com/ctlab/samovar/actions/workflows/python-package.yml/badge.svg)](https://github.com/ctlab/samovar/actions/workflows/python-package.yaml)\n\nThere is a fundamental problem in modern ***metagenomics***: there are huge differences between methodological 
approaches that strongly influence the results, while remaining outside the attention of researchers. \n\nThe use of gold-standard practices and open code, while allowing data to be analyzed reproducibly, locks scientists into a single, far-from-perfect approach with its own biases.\n\nTherefore, we propose an approach that uses de novo generation of artificial metagenomes: `SamovaR`.\n\n## Installation\n\n### Quick Installation\n\n\u003cb\u003e\u003cfont color=\"red\"\u003eWarning:\u003c/font\u003e\u003c/b\u003e beta\n\nUse the installation script:\n\n```bash\ngit clone https://github.com/ctlab/samovar\ncd samovar\nchmod +x install.sh\n./install.sh\n```\n\n***Attention**: the script automatically detects custom R library paths from `.Renviron` (R_LIBS) or `.Rprofile` (libPaths())*\n\n### Manual Installation\n\nInstall the **R** package:\n\n```r\ndevtools::install_github(\"https://github.com/ctlab/samovar/\")\n```\n\n***Attention:*** *check that samovar can be loaded with* ```Rscript -e 'library(samovar)'```, *especially if several R versions are installed*\n\nInstall the **python** package:\n\n```bash\ngit clone https://github.com/ctlab/samovar\ncd samovar\npip install -e .\n```\n***Attention:*** *most samovar usage requires a properly configured file in build/config.json*\n\n## Usage\n### Cross-validation and re-profiling\n\nExample usage:\n```bash\n# Generate reads for benchmarking (skip for real data)\nsamovar generate \\\n    --genome_dir $SAMOVAR/data/test_genomes/meta \\\n    --host_genome $SAMOVAR/data/test_genomes/host/9606.fna \\\n    --output_dir samovar\n\n# Generate pipeline (for example, kraken2 + kaiju)\n## specify --input_dir for real data\nsamovar preprocess \\\n    --output_dir samovar \\\n    --kraken2-test \"kraken2 $DB_KRAKEN2\" \\\n    --kaiju-test \"kaiju $DB_KAIJU\"\n\n# Run the pipeline(s)\nsamovar exec --output-dir samovar\n```\n\nResults and flexibility of the tool can be improved by specifying config files. 
Please follow the wiki, or see {samovar_function} -h\n\nManual example:\n```bash\ncd samovar\nbash workflow/pipeline.sh\n```\n\n```mermaid\n%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '16px', 'fontFamily': 'arial', 'primaryColor': '#fff', 'primaryTextColor': '#000', 'primaryBorderColor': '#000', 'lineColor': '#000', 'secondaryColor': '#fff', 'tertiaryColor': '#fff'}}}%%\ngraph TD\n    subgraph Input\n        subgraph Metagenomes\n            A1[FastQ files]\n            A2([InSilicoSeq config])\n        end\n        A3([SAMOVAR config])\n    end\n\n    subgraph Processing\n        Metagenomes --\u003e C[Initial annotation]\n        A3 --\u003e C\n        A3 --\u003e F\n        A3 --\u003e E[Metagenome generation]\n        C --\u003e E\n        E --\u003e F[Re-annotation]\n    end\n    \n\n    subgraph Results\n        F --\u003e G1[Annotators scores]\n        F --\u003e ML\n        subgraph Re-profiling\n            C --\u003e R\n            ML --\u003e R[Corrected results]\n        end\n        C --\u003e C1[Cross-validation]\n    end\n\n    style Input fill:#90ee9020,stroke:#333,stroke-width:2px\n    style Metagenomes fill:#b2ee9020,stroke:#333,stroke-width:2px\n    style Processing fill:#ee90bf20,stroke:#333,stroke-width:2px\n    style Results fill:#90d8ee20,stroke:#333,stroke-width:2px\n    style Re-profiling fill:#90a4ee20,stroke:#333,stroke-width:2px\n```\n\n### Artificial metagenome generation\nBasic usage is described in the \u003ca href=\"./vignettes\"\u003e**vignettes**\u003c/a\u003e and the \u003ca href=\"https://github.com/ctlab/samovar/wiki\"\u003e**wiki**\u003c/a\u003e\n\nYou can also try the generator with the \u003ca href=\"https://dsmutin.shinyapps.io/samovaR/\"\u003e**web** Shiny app\u003c/a\u003e\n\n\n#### R generation\n\n\u003ca href=\"https://github.com/ctlab/samovar/blob/main/samovaR.pdf\"\u003eSee description\u003c/a\u003e or \u003ca href=\"vignettes/samovar-basic.Rmd\"\u003esource\u003c/a\u003e a vignette\n\n``` r\nlibrary(samovaR)\n\n# download 
data\nteatree \u003c- GMrepo_type2data(number_to_process = 2000)\n\n# filter\ntealeaves \u003c- teatree %\u003e%\n  teatree_trim(treshhold_species = 3, treshhold_samples = 3, treshhold_amount = 10^(-3))\n\n# normalizing\nteabag \u003c- tealeaves %\u003e%\n  tealeaves_pack()\n\n# clustering\nconcotion \u003c- teabag %\u003e%\n  teabag_brew(min_cluster_size = 4, max_cluster_size = 6)\n\n# building samovar\nsamovar \u003c- concotion %\u003e%\n  concotion_pour()\n\n# generating new data\nnew_data \u003c- samovar %\u003e%\n  samovar_boil(n = 100)\n```\n\n\u003ca href=\"https://github.com/ctlab/samovar/blob/main/samovaR_man.pdf\"\u003eDocumentation\u003c/a\u003e for the **R package**\n\n#### Pipeline\n\n\u003cimg src=\"data/img/additional/algo.png\" width = 50%\u003e\n\n## Components\n\n- **R** package `samova.R` for the artificial abundance table generation\n- Pipeline for the automated benchmarking and re-profiling\n\n## Project Structure\n\n```mermaid\n%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '16px', 'fontFamily': 'arial', 'primaryColor': '#fff', 'primaryTextColor': '#000', 'primaryBorderColor': '#000', 'lineColor': '#000', 'secondaryColor': '#fff', 'tertiaryColor': '#fff'}}}%%\ngraph LR\n    A[SamovaR] --\u003e G1[Abundance table generation]\n    G1 --\u003e B[R Package]\n    A --\u003e G2[Automated re-profiling]\n    G2 --\u003e C[snakemake + Python Pipeline]\n    G1 --\u003e G[Shiny App]\n\n    B --\u003e B1[R/]\n    B --\u003e B2[man/]\n    B --\u003e B3[vignettes/]\n\n    C --\u003e C1[workflow/]\n    C --\u003e C2[src/]\n\n    G --\u003e H[shiny/]\n```\n\n\n## References\n- Chechenina A., Vaulin N., Ivanov A., Ulyantsev V. Development of in-silico models of metagenomic communities with given properties and a pipeline for their generation. 
Bioinformatics Institute 2022/23 URL: https://elibrary.ru/item.asp?id=60029330\n\n\n## Dependencies\n\n```mermaid\n%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '16px', 'fontFamily': 'arial', 'primaryColor': '#fff', 'primaryTextColor': '#000', 'primaryBorderColor': '#000', 'lineColor': '#000', 'secondaryColor': '#fff', 'tertiaryColor': '#fff'}}}%%\ngraph LR\n    subgraph \"R Package Dependencies\"\n        subgraph \"Main\"\n            direction LR\n            tidyverse\n            scclust\n            Matrix\n            methods\n        end\n        \n        subgraph \"Visualization\"\n            direction LR\n            ggplot\n            plotly\n            ggnewscale\n        end\n        \n        subgraph \"API\"\n            direction LR\n            httr\n            jsonlite\n            xml2\n        end\n    end\n    \n    subgraph \"Automated Benchmarking\"\n        subgraph \"Major\"\n            direction LR\n            samova.R\n            R::yaml\n            SnakeMake\n            InSilicoSeq\n        end\n        \n        subgraph \"Python packages\"\n            direction LR\n            numpy\n            pandas\n            requests\n            ete3\n            scikit-learn\n        end\n    end\n    \n    linkStyle default stroke:#000\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fctlab%2Fsamovar","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fctlab%2Fsamovar","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fctlab%2Fsamovar/lists"}