https://github.com/metagentools/VStrains
VStrains is a de novo approach for reconstructing strains from viral quasispecies.
- Host: GitHub
- URL: https://github.com/metagentools/VStrains
- Owner: metagentools
- License: mit
- Created: 2022-10-15T02:17:41.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2024-11-24T01:10:55.000Z (6 months ago)
- Last Synced: 2024-11-24T01:29:17.206Z (6 months ago)
- Topics: short-reads, viral-metagenomics
- Language: Python
- Homepage:
- Size: 20.4 MB
- Stars: 23
- Watchers: 2
- Forks: 6
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-virome - VStrains - Viral strain reconstruction from metagenomic data. [Python] (Other Tools / Viral Strain Reconstruction)
# VStrains: De Novo Reconstruction of Viral Strains via Iterative Path Extraction From Assembly Graphs

Manual
===========

Table of Contents
-----------------

1. [About VStrains](#sec1)
2. [Updates](#sec2)
3. [Installation](#sec3)
   3.1. [Option 1. Quick Install](#sec3.1)
   3.2. [Option 2. Manual Install](#sec3.2)
   3.3. [Download & Install VStrains](#sec3.3)
4. [Running VStrains](#sec4)
   4.1. [Quick Usage](#sec4.1)
   4.2. [Support SPAdes](#sec4.2)
   4.3. [Output](#sec4.3)
5. [Stand-alone binaries](#sec5)
6. [Experiment](#sec6)
7. [Citation](#sec7)
8. [Feedback and bug reports](#sec8)

# About VStrains

VStrains is a de novo approach for reconstructing strains from viral quasispecies.
# Updates

## VStrains 1.1.0 Release (03 Feb 2023)
* Replace the PE link inference module `VStrains_Alignment.py` with `VStrains_PE_Inference.py`. `VStrains_PE_Inference.py` implements a hash-table approach that provides efficient perfect-match lookup. The new module yields consistent evaluation results and substantially decreases runtime and memory usage compared with the previous alignment approach.

# Installation

VStrains requires a 64-bit Linux system or macOS and Python (supported versions are Python 3: 3.2 and higher).
## Option 1. Quick Install (**recommended**)

Install [(mini)conda](https://conda.io/miniconda.html) as a lightweight package management tool. Run the following commands to initialize and set up the conda environment for VStrains:
```bash
# add channels
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

# create conda environment
conda create --name VStrains-env

# activate conda environment
conda activate VStrains-env

conda install -c bioconda -c conda-forge python=3 graph-tool minimap2 numpy gfapy matplotlib
```

## Option 2. Manual Install

Manually install dependencies:
- [minimap2](https://github.com/lh3/minimap2)

And python modules:
- [graph-tool](https://graph-tool.skewed.de)
- [numpy](https://numpy.org)
- [gfapy](https://github.com/ggonnella/gfapy)
- [matplotlib](https://matplotlib.org)
## Download & Install VStrains

After successfully setting up the environment and dependencies, clone VStrains into your desired location.
```bash
git clone https://github.com/metagentools/VStrains.git
```

Install VStrains via `pip`:
```bash
cd VStrains; pip install .
```

Run the following command to ensure VStrains is correctly set up and installed.
```bash
vstrains -h
```

# Running VStrains

VStrains supports assembly results from [SPAdes](https://github.com/ablab/spades) (including metaSPAdes and metaviralSPAdes) and may support other graph-based assemblers in the future.

## Quick Usage

```
usage: VStrains [-h] -a {spades} -g GFA_FILE [-p PATH_FILE] [-o OUTPUT_DIR] -fwd FWD -rve RVE

Construct full-length viral strains under de novo approach from contigs and assembly graph, currently supports SPAdes

optional arguments:
  -h, --help            show this help message and exit
  -a {spades}, --assembler {spades}
                        name of the assembler used. [spades]
  -g GFA_FILE, --graph GFA_FILE
                        path to the assembly graph, (.gfa format)
  -p PATH_FILE, --path PATH_FILE
                        contig file from SPAdes (.paths format), only required for SPAdes. e.g., contigs.paths
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        path to the output directory [default: acc/]
  -fwd FWD, --fwd_file FWD
                        paired-end sequencing reads, forward strand (.fastq format)
  -rve RVE, --rve_file RVE
                        paired-end sequencing reads, reverse strand (.fastq format)
```

VStrains takes as input an assembly graph in Graphical Fragment Assembly (GFA) format and associated contig information, together with the raw reads in paired-end format (e.g., forward.fastq, reverse.fastq).
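
To make the graph input more concrete, the following is a minimal sketch of how the segment (`S`) and link (`L`) records of a GFA 1.0 file could be read in plain Python. It is an illustration only and not how VStrains loads the graph (gfapy is listed among its dependencies); the function name `parse_gfa1` and the toy graph are assumptions made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Oriented = Tuple[str, str]  # (segment name, orientation "+" or "-")


def parse_gfa1(lines: Iterable[str]) -> Tuple[Dict[str, str], Dict[Oriented, List[Oriented]]]:
    """Collect segments and links from GFA 1.0 text (hypothetical helper).

    S lines: S <name> <sequence> [optional tags]
    L lines: L <from> <from_orient> <to> <to_orient> <overlap>
    Other record types (H, P, ...) are skipped in this sketch.
    """
    segments: Dict[str, str] = {}
    links: Dict[Oriented, List[Oriented]] = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments[fields[1]] = fields[2]
        elif fields[0] == "L" and len(fields) >= 6:
            src, src_orient, dst, dst_orient = fields[1:5]
            links[(src, src_orient)].append((dst, dst_orient))
    return segments, dict(links)


# Toy graph invented for illustration: two segments joined by one link.
example = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT\tKC:i:40",
    "S\t2\tGTTA\tKC:i:20",
    "L\t1\t+\t2\t+\t2M",
]
segments, links = parse_gfa1(example)
```

For a real file, the same function can be applied to an open file handle, e.g. `parse_gfa1(open("assembly_graph_after_simplification.gfa"))`.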

## Support SPAdes

When running SPAdes, we recommend using the `--careful` option for more accurate assembly results. For consistency, do not modify any contig or node names in the SPAdes assembly results. Please refer to [SPAdes](https://github.com/ablab/spades) for further guidance. Example usage:
```bash
# SPAdes assembler example, paired-end reads
python spades.py -1 forward.fastq -2 reverse.fastq --careful -t 16 -o output_dir
```

Both the assembly graph (`assembly_graph_after_simplification.gfa`) and the contig information (`contigs.paths`) can be found in the output directory after running the SPAdes assembler. Use them together with the raw reads as inputs for VStrains, and set the `-a` flag to `spades`. Example usage:
```bash
vstrains -a spades -g assembly_graph_after_simplification.gfa -p contigs.paths -o output_dir -fwd forward.fastq -rve reverse.fastq
```

## Output

VStrains stores all output files in the output directory (`-o OUTPUT_DIR`), which is set by the user. The paths below are relative to that directory.
* `/aln/` directory contains paired-end (PE) linkage information, which is stored in `pe_info` and `st_info`.
* `/gfa/` directory contains iteratively simplified assembly graphs, where `graph_L0.gfa` contains the assembly graph produced by SPAdes after Strandedness Canonization, `split_graph_final.gfa` contains the assembly graph after Graph Disentanglement, and `graph_S_final.gfa` contains the assembly graph after Contig-based Path Extraction; the rest are intermediate results. All the assembly graphs are in [GFA 1.0 format](https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md).
* `/paf/` and `/tmp/` are temporary directories and can be ignored.
* `/strain.fasta` contains the resulting strains in FASTA format; the header for each strain has the form `NODE___`, which is compatible with the SPAdes contigs format.
* `/strain.paths` contains the paths in the assembly graph (input `GFA_FILE`) corresponding to the strains in `strain.fasta`, for use with [Bandage](https://github.com/rrwick/Bandage) in further downstream analysis.
* `/vstrains.log` contains the VStrains log.

# Stand-alone binaries

`evals/quast_evaluation.py` is a wrapper script for strain-level experimental result analysis using [MetaQUAST](https://github.com/ablab/quast).
```
usage: quast_evaluation.py [-h] -quast QUAST [-cs FILES [FILES ...]] [-d IDIR] -ref REF_FILE -o OUTPUT_DIR

Use MetaQUAST to evaluate assembly result

options:
  -h, --help            show this help message and exit
  -quast QUAST, --path_to_quast QUAST
                        path to MetaQuast python script, version >= 5.2.0
  -cs FILES [FILES ...], --contig_files FILES [FILES ...]
                        contig files from different tools, separated by space
  -d IDIR, --contig_dir IDIR
                        contig files from different tools, stored in the directory, .fasta format
  -ref REF_FILE, --ref_file REF_FILE
                        ref file (single)
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        output directory
```

# Experiment

VStrains is evaluated on both simulated and real datasets under default settings. The sources of the datasets are listed below:
1. Simulated datasets, available at [savage-benchmark](https://bitbucket.org/jbaaijens/savage-benchmarks/src/master/) (no preprocessing is required)
   - 6 Poliovirus (20,000x)
   - 10 HCV (20,000x)
   - 15 ZIKV (20,000x)
2. Real datasets (please refer to the [Supplementary Material](https://www.biorxiv.org/content/10.1101/2022.10.21.513181v3.supplementary-material) for preprocessing the real datasets)
   - 5 HIV labmix (20,000x) [SRR961514](https://www.ncbi.nlm.nih.gov/sra/?term=SRR961514); reference genome sequences are available at [5 HIV References](https://github.com/cbg-ethz/5-virus-mix/blob/master/data/REF.fasta)
   - 2 SARS-CoV-2 (4,000x) [SRR18009684](https://www.ncbi.nlm.nih.gov/sra/?term=SRR18009684), [SRR18009686](https://www.ncbi.nlm.nih.gov/sra/?term=SRR18009686); pre-processed reads and individually assembled ground-truth reference sequences can be found at [2 SARS-CoV-2 Dataset](https://github.com/RunpengLuo/sarscov2-4000x)
# Citation
VStrains has been accepted at [RECOMB 2023](http://recomb2023.bilkent.edu.tr/program.html), and the manuscript is publicly available [here](https://link.springer.com/chapter/10.1007/978-3-031-29119-7_1).

If you use VStrains in your work, please cite the following publication.
Runpeng Luo and Yu Lin, VStrains: De Novo Reconstruction of Viral Strains via Iterative Path Extraction From Assembly Graphs

# Feedback and bug reports

Thanks for using VStrains. If you experience any bugs during execution, please re-run the program with the additional `-d` flag and provide the `vstrains.log` together with your use case via `Issues`.