{"id":23903319,"url":"https://github.com/aryarm/varca","last_synced_at":"2025-04-11T00:23:35.839Z","repository":{"id":37906297,"uuid":"197674437","full_name":"aryarm/varCA","owner":"aryarm","description":"Use an ensemble of variant callers to call variants from ATAC-seq data","archived":false,"fork":false,"pushed_at":"2025-03-12T17:09:52.000Z","size":357,"stargazers_count":23,"open_issues_count":21,"forks_count":7,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-24T21:12:31.967Z","etag":null,"topics":["atac-seq-data","machine-learning","random-forest","snakemake","variant-calling"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aryarm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null}},"created_at":"2019-07-19T00:29:55.000Z","updated_at":"2025-03-12T17:09:56.000Z","dependencies_parsed_at":"2022-07-09T17:17:58.761Z","dependency_job_id":null,"html_url":"https://github.com/aryarm/varCA","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aryarm%2FvarCA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aryarm%2FvarCA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aryarm%2FvarCA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aryarm%2FvarCA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aryarm","download_url":"https://codeload.github.com/aryarm/varCA/tar.gz/refs/heads/master","host":{"name":"GitHu
b","url":"https://github.com","kind":"github","repositories_count":248319169,"owners_count":21083782,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["atac-seq-data","machine-learning","random-forest","snakemake","variant-calling"],"created_at":"2025-01-04T22:53:55.099Z","updated_at":"2025-04-11T00:23:35.825Z","avatar_url":"https://github.com/aryarm.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Snakemake](https://img.shields.io/badge/snakemake-5.18.0-brightgreen.svg?style=flat-square)](https://snakemake.readthedocs.io/)\n[![License](https://img.shields.io/apm/l/vim-mode.svg)](LICENSE)\n\n# varCA\nA pipeline for running an ensemble of variant callers to predict variants from ATAC-seq reads.\n\nThe entire pipeline is made up of two smaller subworkflows. The `prepare` subworkflow calls each variant caller and prepares the resulting data for use by the `classify` subworkflow, which uses an ensemble classifier to predict the existence of variants at each site.\n\n\u003e [!NOTE]  \n\u003e VarCA does not output genotypes (GT fields) because of the possibility of inaccuracy in the presence of allele-specific open chromatin. Please refer to https://github.com/aryarm/varCA/issues/43#issuecomment-1088028758\n\n### [Code Ocean](https://codeocean.com/capsule/6980349/tree/v1)\nUsing [our Code Ocean compute capsule](https://codeocean.com/capsule/6980349/tree/v1), you can execute [VarCA v0.2.1](https://github.com/aryarm/varCA/releases/tag/v0.2.1) on example data without downloading or setting up the project. 
To interpret the output of VarCA, see the output sections of the [`prepare` subworkflow](rules#output) and the [`classify` subworkflow](rules#output-1) in the [rules README](rules/README.md).\n\n# download\nExecute the following command or download the [latest release](https://github.com/aryarm/varCA/releases/latest) manually.\n```\ngit clone https://github.com/aryarm/varCA.git\n```\nAlso consider downloading the [example data](https://github.com/aryarm/varCA/releases/latest/download/data.tar.gz).\n```\ncd varCA\nwget -O- -q https://github.com/aryarm/varCA/releases/latest/download/data.tar.gz | tar xvzf -\n```\n\n# setup\nThe pipeline is written as a Snakefile which can be executed via [Snakemake](https://snakemake.readthedocs.io). We recommend installing version 5.18.0:\n```\nconda create -n snakemake -c bioconda -c conda-forge --no-channel-priority 'snakemake==5.18.0'\n```\nWe highly recommend you install [Snakemake via conda](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda) like this so that you can use the `--use-conda` flag when calling `snakemake` to let it [automatically handle all dependencies](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management) of the pipeline. Otherwise, you must manually install the dependencies listed in the [env files](envs).\n\n# execution\n1. Activate snakemake via `conda`:\n    ```\n    conda activate snakemake\n    ```\n2. Execute the pipeline on the example data\n\n    Locally:\n    ```\n    ./run.bash \u0026\n    ```\n    __or__ on an SGE cluster:\n    ```\n    ./run.bash --sge-cluster \u0026\n    ```\n#### Output\nVarCA will place all of its output in a new directory (`out/`, by default). Log files describing the progress of the pipeline will also be created there: the `log` file contains a basic description of the progress of each step, while the `qlog` file is more detailed and will contain any errors or warnings. 
You can read more about the pipeline's output in the [rules README](rules/README.md).\n\n#### Executing the pipeline on your own data\nYou must modify [the config.yaml file](configs#configyaml) to specify paths to your data. The config file is currently configured to run the pipeline on the example data provided.\n\n#### Executing each portion of the pipeline separately\nThe pipeline is made up of [two subworkflows](rules). These are usually executed together automatically by the master pipeline, but they can also be executed on their own for more advanced usage. See the [rules README](rules/README.md) for execution instructions and a description of the outputs. You will need to execute the subworkflows separately [if you ever want to create your own trained models](rules#training-and-testing-varca).\n\n#### Reproducing our results\nWe provide the example data so that you may quickly (in ~1 hr, excluding dependency installation) verify that the pipeline can be executed on your machine. This process does not reproduce our results. Those with more time can follow [these steps](rules#testing-your-model--reproducing-our-results) to create all of the plots and tables in our paper.\n\n### If this is your first time using Snakemake\nWe recommend that you run `snakemake --help` to learn about Snakemake's options. For example, to check that the pipeline will be executed correctly before you run it, you can call Snakemake with the `-n -p -r` flags to perform a dry run that prints each job's shell commands and the reason it would run. This is also a good way to familiarize yourself with the steps of the pipeline and their inputs and outputs (the latter of which are inputs to the first rule in each workflow, i.e., the `all` rule).\n\nNote that Snakemake will not recreate output that it has already generated unless you request it. If a job fails or is interrupted, subsequent executions of Snakemake will just pick up where it left off. 
This can also apply to files that *you* create and provide in place of the files it would have generated.\n\nBy default, the pipeline will automatically delete some files it deems unnecessary (e.g., unsorted copies of a BAM file). You can opt to keep these files instead by providing the `--notemp` flag to Snakemake when executing the pipeline.\n\n# files and directories\n\n### [Snakefile](Snakefile)\nA [Snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline for calling variants from a set of ATAC-seq reads. This pipeline automatically executes two subworkflows:\n\n1. the [`prepare` subworkflow](rules/prepare.smk), which prepares the reads for classification and\n2. the [`classify` subworkflow](rules/classify.smk), which creates a VCF containing predicted variants\n\n### [rules/](rules)\nSnakemake rules for the `prepare` and `classify` subworkflows. You can either execute these subworkflows from the [master Snakefile](#snakefile) or individually as their own Snakefiles. See the [rules README](rules/README.md) for more information.\n\n### [configs/](configs)\nConfig files that define options and input for the pipeline and the `prepare` and `classify` subworkflows. If you want to predict variants from your own ATAC-seq data, you should start by filling out [the config file for the pipeline](/configs#configyaml).\n\n### [callers/](callers)\nScripts for executing each of the variant callers used by the `prepare` subworkflow. Small pipelines can be written for each caller by using a special naming convention. See the [caller README](callers/README.md) for more information.\n\n### [breakCA/](breakCA)\nScripts for calculating posterior probabilities for the existence of an insertion or deletion, which can be used as features for the classifier. These scripts are adapted from [@Arkosen](https://github.com/Arkosen)'s [BreakCA code](https://www.biorxiv.org/content/10.1101/605642v1.abstract).\n\n### [scripts/](scripts)\nVarious scripts used by the pipeline. 
See the [script README](scripts/README.md) for more information.\n\n### [run.bash](run.bash)\nAn example bash script for executing the pipeline using `snakemake` and `conda`. Any arguments to this script are passed directly to `snakemake`.\n\n# citation\nThere is an option to _\"Cite this repository\"_ on the right sidebar of [the repository homepage](https://github.com/aryarm/varCA).\n\u003e Massarat, A. R., Sen, A., Jaureguy, J., Tyndale, S. T., Fu, Y., Erikson, G., \u0026 McVicker, G. (2021). Discovering single nucleotide variants and indels from bulk and single-cell ATAC-seq. Nucleic Acids Research, gkab621. https://doi.org/10.1093/nar/gkab621\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faryarm%2Fvarca","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faryarm%2Fvarca","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faryarm%2Fvarca/lists"}