https://github.com/andersen-lab/cryptic-variants
- Host: GitHub
- URL: https://github.com/andersen-lab/cryptic-variants
- Owner: andersen-lab
- Created: 2023-05-24T21:39:57.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-11-06T16:30:42.000Z (over 2 years ago)
- Last Synced: 2025-02-01T06:25:12.258Z (over 1 year ago)
- Language: Nextflow
- Size: 49.8 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
# Cryptic Variant Detection
[Nextflow](http://nextflow.io)
SARS-CoV-2 wastewater cryptic variant detection pipeline.
## Installation
---
### Install via Git
```
git clone https://github.com/andersen-lab/cryptic-variants.git
cd cryptic-variants
```
## Environment setup
---
#### Via conda:
```
conda env create -f environment.yml
conda activate cryptic-variants
```
#### Via mamba:
```
mamba create -n cryptic-variants
mamba env update -n cryptic-variants --file environment.yml
mamba activate cryptic-variants
```
## Usage
---
### GISAID Authentication
To access data from GISAID, you must have a GISAID account. To authenticate your credentials, run the following command:
```
python -c "from outbreak_data import authenticate_user; authenticate_user.authenticate_new_user()"
```
This will provide a link prompting you to enter your GISAID username and password. Once completed, your token will be saved and you will not need to authenticate again.
### Running the pipeline
```
nextflow run main.nf -entry [from_fastq|from_bam] --input_dir <input_dir> --output_dir <output_dir>
```
Set `-entry` to `from_fastq` if you are providing paired FASTQ files, or to `from_bam` if you are providing aligned BAM files. The output directory will contain a `cryptic_variants` directory with potential cryptic variants and a `covariants` directory with the output of `freyja covariants` for the provided samples (see [freyja](https://github.com/andersen-lab/Freyja)).
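When launching the pipeline from a script, the command above can be assembled programmatically. The following is a minimal sketch, not part of the pipeline: the helper names `build_command` and `run_pipeline` and the paths `data/fastq` and `results` are hypothetical, and only flags documented in this README are used.

```python
import shlex
import subprocess

VALID_ENTRIES = ("from_fastq", "from_bam")


def build_command(entry, input_dir, output_dir, **params):
    """Assemble the `nextflow run` argument list (hypothetical helper)."""
    if entry not in VALID_ENTRIES:
        raise ValueError(f"-entry must be one of {VALID_ENTRIES}, got {entry!r}")
    cmd = [
        "nextflow", "run", "main.nf",
        "-entry", entry,
        "--input_dir", str(input_dir),
        "--output_dir", str(output_dir),
    ]
    # Optional pipeline parameters are appended as `--name value` pairs.
    for name, value in params.items():
        text = str(value).lower() if isinstance(value, bool) else str(value)
        cmd += [f"--{name}", text]
    return cmd


def run_pipeline(cmd):
    """Launch the pipeline; requires Nextflow on PATH and the repo as cwd."""
    subprocess.run(cmd, check=True)


# Example with placeholder paths; printing only, nothing is executed.
cmd = build_command("from_fastq", "data/fastq", "results", min_WW_count=30)
print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when directory names contain spaces.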
### Optional parameters
```
--ref
    Reference genome to use for alignment and covariant detection
    (default: data/NC_045512.2_Hu-1.fasta)
--gff_file
    GFF file containing gene annotations
    (default: data/NC_045512.2_Hu-1.gff)
--primer_bed
    BED file containing primer locations for primer trimming
    (default: data/nCoV-2019_v3.primer.bed)
--skip_trimming
    If true, skip trimming of primer sequences from reads
    (default: false)
--min_site
    Minimum genomic site to consider for cryptic variant detection
    (default: 22556) (RBD start)
--max_site
    Maximum genomic site to consider for cryptic variant detection
    (default: 23156) (RBD end)
--min_WW_count
    Minimum number of wastewater hits to consider a cluster of
    variants in a given sample
    (default: 30)
--max_clinical_count
    Maximum number of clinical hits for a variant to be considered
    cryptic
    (default: 5)
--location_id
    Location ID to query from GISAID
    (default: 'global')
```
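To illustrate how the site window and count thresholds above might combine, here is a minimal sketch. It is not the pipeline's implementation: the `Cluster` type, the `is_candidate` function, the inclusive bounds, and the requirement that every mutation in a cluster fall inside the window are all assumptions made for illustration, and the example positions are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    """Hypothetical record for one mutation cluster in one sample."""
    sites: List[int]      # genomic positions of the mutations (illustrative)
    ww_count: int         # wastewater hits in the sample
    clinical_count: int   # clinical hits for the queried location


def is_candidate(cluster, min_site=22556, max_site=23156,
                 min_ww_count=30, max_clinical_count=5):
    """Apply the documented defaults as simple inclusive thresholds (assumed)."""
    in_window = all(min_site <= s <= max_site for s in cluster.sites)
    enough_wastewater = cluster.ww_count >= min_ww_count
    rare_in_clinic = cluster.clinical_count <= max_clinical_count
    return in_window and enough_wastewater and rare_in_clinic


# Invented example values, chosen only to exercise each threshold.
print(is_candidate(Cluster([22800, 22900], ww_count=42, clinical_count=0)))   # True
print(is_candidate(Cluster([22800, 22900], ww_count=12, clinical_count=0)))   # False
print(is_candidate(Cluster([22800, 22900], ww_count=42, clinical_count=9)))   # False
print(is_candidate(Cluster([21000, 22900], ww_count=42, clinical_count=0)))   # False
```

Under these assumptions, raising `--min_WW_count` trades sensitivity for fewer low-support clusters, while lowering `--max_clinical_count` restricts output to clusters that are rarer in clinical data.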
### Output
The pipeline produces two output directories: `cryptic_variants` and `covariants`. The `cryptic_variants` directory contains one `{sample}.covariants.cryptic.tsv` file for each sample in the input directory. Each file has the following columns:
- `Covariants`: Mutation cluster detected
- `WW_Count`: Number of wastewater hits for the mutation cluster in this sample
- `Clinical_Count`: Number of clinical hits for the mutation cluster
- `Lineages`: Lineages associated with the mutation cluster (if any)
```
Covariants WW_Count Clinical_Count Lineages
['S:G416E', 'S:K417N', 'S:N440K', 'S:L452Q'] 12 0 NA
['S:K417N', 'S:S438P', 'S:N440K', 'S:L452Q'] 14 0 NA
['S:K417N', 'S:Y421H', 'S:N440K', 'S:L452Q'] 11 0 NA
['S:G416G', 'S:K417N', 'S:N440K', 'S:L452Q'] 11 2 ['ba.2.12.1']
['S:K417N', 'S:N440K', 'S:G446G', 'S:L452Q'] 14 2 ['bg.5']
['S:K417N', 'S:N440K', 'S:L452Q', 'S:F456F'] 15 2 ['ba.2.12.1']
```
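A file like the sample above can be loaded for downstream filtering. The sketch below assumes the file is tab-separated, that the `Covariants` and `Lineages` columns hold Python-style list literals as shown, and that a missing lineage is written as `NA`; the function name `read_cryptic_tsv` is hypothetical.

```python
import ast
import csv
import io


def read_cryptic_tsv(handle):
    """Parse a `{sample}.covariants.cryptic.tsv` stream into a list of dicts."""
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        lineages = rec["Lineages"]
        rows.append({
            # List-like columns are stored as text, so decode them safely.
            "covariants": ast.literal_eval(rec["Covariants"]),
            "ww_count": int(rec["WW_Count"]),
            "clinical_count": int(rec["Clinical_Count"]),
            "lineages": [] if lineages == "NA" else ast.literal_eval(lineages),
        })
    return rows


# Two rows copied from the sample output, joined with tabs (assumed delimiter).
sample = (
    "Covariants\tWW_Count\tClinical_Count\tLineages\n"
    "['S:G416E', 'S:K417N', 'S:N440K', 'S:L452Q']\t12\t0\tNA\n"
    "['S:G416G', 'S:K417N', 'S:N440K', 'S:L452Q']\t11\t2\t['ba.2.12.1']\n"
)
rows = read_cryptic_tsv(io.StringIO(sample))

# Keep only clusters with no clinical hits at all.
unseen = [r for r in rows if r["clinical_count"] == 0]
print(len(rows), len(unseen))
```

To read a real file, pass `open(path, newline="")` in place of the `io.StringIO` object.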
The `covariants` directory contains the raw output from [freyja covariants](https://github.com/andersen-lab/Freyja).