https://github.com/tobiasrausch/covid19
SARS-CoV-2 analysis pipeline for short-read, paired-end Illumina sequencing
consensus covid19 covid19-analysis sars-cov-2 variant-calling whole-genome-sequencing
- Host: GitHub
- URL: https://github.com/tobiasrausch/covid19
- Owner: tobiasrausch
- License: bsd-3-clause
- Created: 2021-01-24T20:08:07.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2024-12-23T08:37:17.000Z (10 months ago)
- Last Synced: 2025-03-27T07:21:18.602Z (7 months ago)
- Topics: consensus, covid19, covid19-analysis, sars-cov-2, variant-calling, whole-genome-sequencing
- Language: Python
- Homepage:
- Size: 324 MB
- Stars: 6
- Watchers: 3
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# SARS-CoV-2 data analysis
SARS-CoV-2 analysis pipeline for short-read, paired-end sequencing.
## Installation
The repository includes a [Makefile](https://github.com/tobiasrausch/covid19/blob/main/Makefile) that installs all dependencies using bioconda.
`git clone --recursive https://github.com/tobiasrausch/covid19.git`
`cd covid19`
`make all`
## Preparing the reference databases and indexes
A script downloads and indexes the SARS-CoV-2 and GRCh38 reference sequences.
`cd ref/ && ./prepareREF.sh`
A second script prepares the kraken2 human database, which is used to filter host reads.
`cd kraken2/ && ./prepareDB.sh`
## Running the data analysis pipeline
The run script performs adapter trimming, host read removal, alignment, variant calling and annotation, consensus calling, and some quality control. Its last parameter, `unique_sample_id`, names the output directory, which is created in the current working directory.
`./src/run.sh `
## Output
The main output files are:
* The adapter-trimmed and host-filtered FASTQ files: `ls .filtered.R_[12].fq.gz`
* The alignment to SARS-CoV-2: `ls .srt.bam`
* The consensus sequence: `ls .cons.fa`
* The annotated variants: `ls .variants.tsv`
* The assigned lineage: `ls .lineage.csv`
* The summary QC report: `ls .qc.summary`
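As a quick sanity check after a run, the list above can be turned into a small completeness test. This is a minimal sketch, not part of the repository: the function name `missing_outputs` and the `prefix` argument (the directory plus sample-specific file stem that precedes each suffix) are assumptions for illustration; only the suffixes come from the list above.

```python
from pathlib import Path

# Suffixes taken from the output list above; R_[12] expands to R_1 and R_2.
EXPECTED_SUFFIXES = [
    ".filtered.R_1.fq.gz",
    ".filtered.R_2.fq.gz",
    ".srt.bam",
    ".cons.fa",
    ".variants.tsv",
    ".lineage.csv",
    ".qc.summary",
]


def missing_outputs(prefix: str) -> list[str]:
    """Return the expected output paths that do not exist for `prefix`.

    `prefix` is hypothetical: whatever path precedes the suffixes
    for one sample in your run.
    """
    return [prefix + s for s in EXPECTED_SUFFIXES if not Path(prefix + s).is_file()]
```

An empty return value means every main output file listed above is present for that sample.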
## Aggregating results
The above pipeline generates a report for every sample, so it can be trivially parallelized at the sample level. You can then aggregate all the QC information and the lineage and clade assignments using:
`./src/aggregate.sh outtable */*.qc.summary`
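The sample-level parallelization mentioned above could be sketched as follows. This is one possible approach under stated assumptions, not something the repository ships: `run_samples` is a hypothetical helper, and the caller supplies the full command line for each sample, so no particular `run.sh` argument order is assumed here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_samples(commands: dict[str, list[str]], max_workers: int = 4) -> dict[str, int]:
    """Run one command per sample concurrently; return exit codes by sample id.

    `commands` maps a sample id to the argument list to execute for it.
    """
    def run_one(item: tuple[str, list[str]]) -> tuple[str, int]:
        sample_id, argv = item
        # Samples are independent, so a failure is recorded rather than raised.
        return sample_id, subprocess.run(argv).returncode

    # Threads suffice here: each worker only waits on an external process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, commands.items()))
```

Each value in `commands` would be the `./src/run.sh` invocation for one sample, ending with that sample's `unique_sample_id`. Samples with a non-zero exit code can be rerun before aggregating.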
## Estimating cross-contamination
You can estimate cross-contamination based on the allelic frequencies of variant calls using:
`./src/crosscontam.sh contam */*.bcf`
This works best on samples with good-quality consensus sequences, for example those marked `RKI pass` in the QC summary:
`./src/crosscontam.sh contam $(grep "RKI pass" */*.qc.summary | sed 's/\.qc\.summary.*$/.bcf/' | tr '\n' ' ')`
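The same selection can be expressed in Python if the shell pipeline is hard to read. This is a minimal sketch: `passing_bcfs` is a hypothetical helper name, and it assumes only what the command above implies, namely that passing samples have the text `RKI pass` somewhere in their `.qc.summary` file and a `.bcf` file with the same stem next to it.

```python
from pathlib import Path


def passing_bcfs(root: str = ".", marker: str = "RKI pass") -> list[str]:
    """List BCF paths of samples whose QC summary contains `marker`."""
    suffix = ".qc.summary"
    bcfs = []
    for summary in sorted(Path(root).glob("*/*" + suffix)):
        if marker in summary.read_text():
            # Swap the .qc.summary suffix for .bcf, as the sed expression does.
            bcfs.append(str(summary)[: -len(suffix)] + ".bcf")
    return bcfs
```

Unlike the `grep` pipeline, this emits each sample at most once even if the marker appears on several lines. The returned paths can be passed as the trailing arguments to `./src/crosscontam.sh contam`.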
## Example
The repository contains an example script using a [COG-UK](https://www.cogconsortium.uk/) data set.
`cd example/ && ./expl.sh`
## Citation
Evolution of SARS-CoV-2 in the Rhine-Neckar/Heidelberg Region 01/2021 - 07/2023. Infect Genet Evol. 2024 Feb 23:119:105577. [DOI: 10.1016/j.meegid.2024.105577](https://doi.org/10.1016/j.meegid.2024.105577)
## Credits
Many thanks to the open-science approach of [COG-UK](https://www.cogconsortium.uk/); their data sets in [ENA](https://www.ebi.ac.uk/ena/browser/home) were very useful for developing the code. The workflow uses many tools distributed via [bioconda](https://bioconda.github.io/); please see the [Makefile](https://github.com/tobiasrausch/covid19/blob/main/Makefile) for all the dependencies. And of course, thanks to all the developers.