https://github.com/aryarm/varca
Use an ensemble of variant callers to call variants from ATAC-seq data
https://github.com/aryarm/varca
atac-seq-data machine-learning random-forest snakemake variant-calling
Last synced: about 1 year ago
JSON representation
Use an ensemble of variant callers to call variants from ATAC-seq data
- Host: GitHub
- URL: https://github.com/aryarm/varca
- Owner: aryarm
- License: mit
- Created: 2019-07-19T00:29:55.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2025-03-12T17:09:52.000Z (about 1 year ago)
- Last Synced: 2025-03-24T21:12:31.967Z (about 1 year ago)
- Topics: atac-seq-data, machine-learning, random-forest, snakemake, variant-calling
- Language: Python
- Homepage:
- Size: 349 KB
- Stars: 23
- Watchers: 2
- Forks: 7
- Open Issues: 21
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Citation: CITATION.cff
Awesome Lists containing this project
README
[](https://snakemake.readthedocs.io/)
[](LICENSE)
# varCA
A pipeline for running an ensemble of variant callers to predict variants from ATAC-seq reads.
The entire pipeline is made up of two smaller subworkflows. The `prepare` subworkflow calls each variant caller and prepares the resulting data for use by the `classify` subworkflow, which uses an ensemble classifier to predict the existence of variants at each site.
> [!NOTE]
> VarCA does not output genotypes (GT fields) because of the possibility of inaccuracy in the presence of allele-specific open chromatin. Please refer to https://github.com/aryarm/varCA/issues/43#issuecomment-1088028758
### [Code Ocean](https://codeocean.com/capsule/6980349/tree/v1)
Using [our Code Ocean compute capsule](https://codeocean.com/capsule/6980349/tree/v1), you can execute [VarCA v0.2.1](https://github.com/aryarm/varCA/releases/tag/v0.2.1) on example data without downloading or setting up the project. To interpret the output of VarCA, see the output sections of the [`prepare` subworkflow](rules#output) and the [`classify` subworkflow](rules#output-1) in the [rules README](rules/README.md).
# download
Execute the following command or download the [latest release](https://github.com/aryarm/varCA/releases/latest) manually.
```
git clone https://github.com/aryarm/varCA.git
```
Also consider downloading the [example data](https://github.com/aryarm/varCA/releases/latest/download/data.tar.gz).
```
cd varCA
wget -O- -q https://github.com/aryarm/varCA/releases/latest/download/data.tar.gz | tar xvzf -
```
# setup
The pipeline is written as a Snakefile which can be executed via [Snakemake](https://snakemake.readthedocs.io). We recommend installing version 5.18.0:
```
conda create -n snakemake -c bioconda -c conda-forge --no-channel-priority 'snakemake==5.18.0'
```
We highly recommend you install [Snakemake via conda](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda) like this so that you can use the `--use-conda` flag when calling `snakemake` to let it [automatically handle all dependencies](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management) of the pipeline. Otherwise, you must manually install the dependencies listed in the [env files](envs).
# execution
1. Activate snakemake via `conda`:
```
conda activate snakemake
```
2. Execute the pipeline on the example data
Locally:
```
./run.bash &
```
__or__ on an SGE cluster:
```
./run.bash --sge-cluster &
```
#### Output
VarCA will place all of its output in a new directory (`out/`, by default). Log files describing the progress of the pipeline will also be created there: the `log` file contains a basic description of the progress of each step, while the `qlog` file is more detailed and will contain any errors or warnings. You can read more about the pipeline's output in the [rules README](rules/README.md).
#### Executing the pipeline on your own data
You must modify [the config.yaml file](configs#configyaml) to specify paths to your data. The config file is currently configured to run the pipeline on the example data provided.
#### Executing each portion of the pipeline separately
The pipeline is made up of [two subworkflows](rules). These are usually executed together automatically by the master pipeline, but they can also be executed on their own for more advanced usage. See the [rules README](rules/README.md) for execution instructions and a description of the outputs. You will need to execute the subworkflows separately [if you ever want to create your own trained models](rules#training-and-testing-varca).
#### Reproducing our results
We provide the example data so that you may quickly (in ~1 hr, excluding dependency installation) verify that the pipeline can be executed on your machine. This process does not reproduce our results. Those with more time can follow [these steps](rules#testing-your-model--reproducing-our-results) to create all of the plots and tables in our paper.
### If this is your first time using Snakemake
We recommend that you run `snakemake --help` to learn about Snakemake's options. For example, to check that the pipeline will be executed correctly before you run it, you can call Snakemake with the `-n -p -r` flags. This is also a good way to familiarize yourself with the steps of the pipeline and their inputs and outputs (the latter of which are inputs to the first rule in each workflow -- ie the `all` rule).
Note that Snakemake will not recreate output that it has already generated, unless you request it. If a job fails or is interrupted, subsequent executions of Snakemake will just pick up where it left off. This can also apply to files that *you* create and provide in place of the files it would have generated.
By default, the pipeline will automatically delete some files it deems unnecessary (ex: unsorted copies of a BAM). You can opt to keep these files instead by providing the `--notemp` flag to Snakemake when executing the pipeline.
# files and directories
### [Snakefile](Snakefile)
A [Snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline for calling variants from a set of ATAC-seq reads. This pipeline automatically executes two subworkflows:
1. the [`prepare` subworkflow](rules/prepare.smk), which prepares the reads for classification and
2. the [`classify` subworkflow](rules/classify.smk), which creates a VCF containing predicted variants
### [rules/](rules)
Snakemake rules for the `prepare` and `classify` subworkflows. You can either execute these subworkflows from the [master Snakefile](#snakefile) or individually as their own Snakefiles. See the [rules README](rules/README.md) for more information.
### [configs/](configs)
Config files that define options and input for the pipeline and the `prepare` and `classify` subworkflows. If you want to predict variants from your own ATAC-seq data, you should start by filling out [the config file for the pipeline](/configs#configyaml).
### [callers/](callers)
Scripts for executing each of the variant callers which are used by the `prepare` subworkflow. Small pipelines can be written for each caller by using a special naming convention. See the [caller README](callers/README.md) for more information.
### [breakCA/](breakCA)
Scripts for calculating posterior probabilities for the existence of an insertion or deletion, which can be used as features for the classifier. These scripts are an adaptation from [@Arkosen](https://github.com/Arkosen)'s [BreakCA code](https://www.biorxiv.org/content/10.1101/605642v1.abstract).
### [scripts/](scripts)
Various scripts used by the pipeline. See the [script README](scripts/README.md) for more information.
### [run.bash](run.bash)
An example bash script for executing the pipeline using `snakemake` and `conda`. Any arguments to this script are passed directly to `snakemake`.
# citation
There is an option to _"Cite this repository"_ on the right sidebar of [the repository homepage](https://github.com/aryarm/varCA).
> Massarat, A. R., Sen, A., Jaureguy, J., Tyndale, S. T., Fu, Y., Erikson, G., & McVicker, G. (2021). Discovering single nucleotide variants and indels from bulk and single-cell ATAC-seq. Nucleic Acids Research, gkab621. https://doi.org/10.1093/nar/gkab621