https://github.com/OpenGene/fastv

An ultra-fast tool for identification of SARS-CoV-2 and other microbes from sequencing data. This tool can be used to detect viral infectious diseases, like COVID-19.
2019-ncov bioinformatics coronavirus covid covid-19 hcov meta-genomics microbial-sequences mngs ngs sars-cov-2 sequencing viral viral-infectious-diseases virus visualization

Host: GitHub
URL: https://github.com/OpenGene/fastv
Owner: OpenGene
License: mit
Created: 2020-03-26T03:33:51.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2023-10-27T06:16:38.000Z (9 months ago)
Last Synced: 2023-10-27T07:27:49.779Z (9 months ago)
Topics: 2019-ncov, bioinformatics, coronavirus, covid, covid-19, hcov, meta-genomics, microbial-sequences, mngs, ngs, sars-cov-2, sequencing, viral, viral-infectious-diseases, virus, visualization
Language: C++
Homepage:
Size: 1.02 MB
Stars: 105
Watchers: 11
Forks: 24
Open Issues: 13
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# fastv

fastv is an ultra-fast tool for identifying SARS-CoV-2 and other microbes from sequencing data. It detects microbial sequences in FASTQ data, generates JSON reports, and visualizes the results in HTML reports. It can be used to detect viral infectious diseases such as COVID-19, and it supports both short reads (Illumina, BGI, etc.) and long reads (ONT, PacBio, etc.).

* [Examples](#take-a-quick-glance-of-the-informative-report)

* [How it works?](#how-it-works)

* [Understand the input](#understand-the-input)

* [Understand the output](#understand-the-output)

* [Download or build fastv](#get-fastv)

* [Screenshot](#screenshot)

* [Options](#options)

* [Citation](#citation)

* [Tutorials](#tutorials)

* [---- mNGS data analysis](#analyze-metagenomics-sequencing-mngs-data)

* [---- SARS-CoV-2 identification](#identify-sars-cov-2)

* [---- influenza A virus subtyping](#influenza-a-virus-subtyping)

# take a quick glance of the informative report

* Sample HTML report (Illumina): http://opengene.org/fastv/fastv.html

* Sample HTML report (ONT): http://opengene.org/fastv/ont.html

* Sample JSON report: http://opengene.org/fastv/fastv.json

# quick example for SARS-CoV-2 identification

* download FASTQ file for testing: http://opengene.org/fastv/testdata.fq.gz

* get fastv and use the following command for testing:

```shell

# make sure that SARS-CoV-2.kmer.fa and SARS-CoV-2.genomes.fa are in the ./data folder

./fastv -i testdata.fq.gz

```

# how it works?

`fastv` accepts FASTQ files as input, and then:

1. performs data QC and quality filtering as `fastp` does (cut adapters, remove low quality reads, correct wrong bases).

2. scans the clean data to collect the sequences that contain any unique k-mer or can be mapped to any of the reference microbial genomes.

3. computes statistics, visualizes the results in HTML format, and outputs them in JSON format.

4. outputs the on-target sequencing reads so that they can be analyzed by downstream tools.

# understand the input

`fastv` accepts the following files as input:

1. `FASTQ` file (required) to be scanned, can be single-end (`-i`) or paired-end (`-i` and `-I`), can be short reads (Illumina, MGI, etc.) or long reads (PacBio, ONT, etc.)

2. `genomes` file (optional): a FASTA file containing one or many reference genomes of the target microorganism (`-g`).

3. `k-mer` file (optional): a FASTA file containing the unique k-mers of the target microbial genomes (`-k`).

4. `k-mer collection` file (optional): a FASTA file containing the unique k-mers of many microorganisms (`-c`). See an example: http://opengene.org/kmer_collection.fasta

If none of the `k-mer`, `k-mer collection`, or `genomes` files is specified, fastv tries to load the SARS-CoV-2 genomes and k-mer files from the `data` folder to detect SARS-CoV-2 sequences.
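
As a minimal sketch of passing these inputs explicitly, the commands below run fastv on paired-end data with a `k-mer` file and a `genomes` file. The FASTQ file names are hypothetical placeholders, and the guard around the call is only there so that the snippet does nothing when the binary or the inputs are missing.

```shell
# Hypothetical paired-end input names; replace them with your own files.
R1=sample_R1.fq.gz
R2=sample_R2.fq.gz
# k-mer and genomes files for SARS-CoV-2, expected in the ./data folder.
KMER=data/SARS-CoV-2.kmer.fa
GENOMES=data/SARS-CoV-2.genomes.fa

# Run only when the binary and both FASTQ files are present.
if [ -x ./fastv ] && [ -f "$R1" ] && [ -f "$R2" ]; then
  ./fastv -i "$R1" -I "$R2" -k "$KMER" -g "$GENOMES"
else
  echo "fastv or input files not found; nothing was run" >&2
fi
```

For single-end data, drop `-I` and pass only `-i`.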

## get the pre-built k-mer file, genomes file or k-mer collection file for viruses

* You can download `k-mer` files and `genomes` files of viruses from http://opengene.org/uniquekmer/viral/index.html. These were generated by extracting unique k-mers for all genomes in a large FASTA file (http://opengene.org/viral.genomic.fasta), which contains the complete NCBI RefSeq release of viral sequences available at https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/. K-mers that can be mapped to the human reference genome (GRCh38) with `edit_distance <= 3` have already been filtered out.

* You can download the `k-mer collection` file for viral genomes from: http://opengene.org/viral.kc.fasta.gz

## get the pre-built k-mer file, genomes file or k-mer collection file for viruses and human microorganisms

* You can download `k-mer` files and `genomes` files of viruses and common human microorganisms from http://opengene.org/uniquekmer/microbial/index.html. These were generated by extracting unique k-mers for all genomes in a large FASTA file (http://opengene.org/microbial.genomic.fasta), which contains genomes for the viruses above and common human microorganisms. K-mers that can be mapped to the human reference genome (GRCh38) with `edit_distance <= 3` have already been filtered out.

* You can download the `k-mer collection` file for viral and microbial genomes from: http://opengene.org/microbial.kc.fasta.gz

* You can get the `k-mer` file and `genomes` file for SARS-CoV-2 with `git clone https://github.com/OpenGene/fastv.git`. If you don't use git, you can also download these two files from http://opengene.org/fastv/SARS-CoV-2.kmer.fa and http://opengene.org/fastv/SARS-CoV-2.genomes.fa

If you want to generate your own unique k-mer files and k-mer collection files, please use UniqueKMER: https://github.com/OpenGene/UniqueKMER
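
The following sketch shows one way a downloaded `k-mer collection` file might be used with `-c`. The input name `sample.fq.gz` is a placeholder, and decompressing the collection before use is a cautious assumption, in case your build does not read gzipped FASTA directly.

```shell
# URL of the pre-built viral k-mer collection listed above.
KC_URL=http://opengene.org/viral.kc.fasta.gz
# Local name for the decompressed collection (assumed, pick any name).
KC=viral.kc.fasta
# Hypothetical input FASTQ; replace it with your own file.
IN=sample.fq.gz

# Do nothing unless the binary and the input are present.
if [ -x ./fastv ] && [ -f "$IN" ]; then
  # Download and decompress the collection only once.
  if [ ! -f "$KC" ]; then
    wget -O "$KC.gz" "$KC_URL"
    gunzip -c "$KC.gz" > "$KC"
  fi
  ./fastv -i "$IN" -c "$KC"
else
  echo "fastv or $IN not found; nothing was run" >&2
fi
```

The microbial collection (http://opengene.org/microbial.kc.fasta.gz) can be substituted by changing `KC_URL` and `KC`.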

# understand the output

fastv outputs reports in HTML and JSON formats.

* Sample HTML report (Illumina): http://opengene.org/fastv/fastv.html

* Sample JSON report: http://opengene.org/fastv/fastv.json

* If the `k-mer` file is specified, there will be a `POSITIVE` or `NEGATIVE` result, which is determined by comparing the mean depth of the k-mer keys to the threshold (`--positive_threshold`).

Besides the HTML/JSON reports, fastv can also output the sequencing reads that contain any unique k-mer or can be mapped to any of the target reference genomes. The output data:

* is in FASTQ format

* is clean data after quality filtering

* is written to the file names given by `-o` for SE data, or `-o` and `-O` for PE data.
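
A minimal sketch of collecting all outputs in one run, assuming single-end input and the default SARS-CoV-2 files in the `data` folder. The output and report file names are hypothetical, and `-p 0.1` simply restates the documented default of `--positive_threshold`.

```shell
# Test data from the quick example.
IN=testdata.fq.gz
# Hypothetical names for the on-target reads and the two reports.
OUT=ontarget.fq.gz
JSON=sample.json
HTML=sample.html
# Threshold used for the POSITIVE/NEGATIVE call (0.1 is the default).
THRESHOLD=0.1

# Run only when the binary and the input are present.
if [ -x ./fastv ] && [ -f "$IN" ]; then
  ./fastv -i "$IN" -o "$OUT" -j "$JSON" -h "$HTML" -p "$THRESHOLD"
  echo "on-target reads: $OUT, reports: $JSON and $HTML"
else
  echo "fastv or $IN not found; nothing was run" >&2
fi
```

For paired-end data, add `-I` for the second input and `-O` for the second output file.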

# get fastv

## download binary 

This binary is only for Linux systems: http://opengene.org/fastv/fastv

```shell

# this binary was compiled on CentOS, and tested on CentOS/Ubuntu

wget http://opengene.org/fastv/fastv

chmod a+x ./fastv

```

## or compile from source

```shell

# step 1: get the code

git clone https://github.com/OpenGene/fastv.git

# step 2: build

cd fastv

make

# step 3 (optional): install it system-wide; this may require sudo permission

make install

```

## screenshot

![image](http://www.opengene.org/fastv/fastv-0801.jpg)   

# options

Key options:

```
  -i, --in1                                      read1 input file name (string [=])
  -I, --in2                                      read2 input file name (string [=])
  -o, --out1                                     file name to store read1 with on-target sequences (string [=])
  -O, --out2                                     file name to store read2 with on-target sequences (string [=])
  -c, --kmer_collection                          the unique k-mer collection file in fasta format, see an example: http://opengene.org/kmer_collection.fasta (string [=])
  -k, --kmer                                     the unique k-mer file of the detection target in fasta format. data/SARS-CoV-2.kmer.fa will be used if none of k-mer/Genomes/k-mer_Collection file is specified (string [=])
  -g, --genomes                                  the genomes file of the detection target in fasta format. data/SARS-CoV-2.genomes.fa will be used if none of k-mer/Genomes/k-mer_Collection file is specified (string [=])
  -p, --positive_threshold                       the data is considered as POSITIVE, when its mean coverage of unique kmer >= positive_threshold (0.001 ~ 100). 0.1 by default. (float [=0.1])
  -d, --depth_threshold                          For coverage calculation. A region is considered covered when its mean depth >= depth_threshold (0.001 ~ 1000). 1.0 by default. (float [=1])
  -E, --ed_threshold                             If the edit distance of a sequence and a genome region is <=ed_threshold, then consider it a match (0 ~ 50). 8 by default. (int [=8])
      --long_read_threshold                      A read will be considered as long read if its length >= long_read_threshold (100 ~ 10000). 200 by default. (int [=200])
      --read_segment_len                         A long read will be splitted to read segments, with each <= read_segment_len (50 ~ 5000, should be < long_read_threshold). 100 by default. (int [=100])
      --bin_size                                 For coverage calculation. The genome is splitted to many bins, with each bin has a length of bin_size (1 ~ 100000), default 0 means adaptive. (int [=0])
      --kc_coverage_threshold                    For each genome in the k-mer collection FASTA, report it when its coverage > kc_coverage_threshold. Default is 0.01. (double [=0.01])
      --kc_high_confidence_coverage_threshold    For each genome in the k-mer collection FASTA, report it as high confidence when its coverage > kc_high_confidence_coverage_threshold. Default is 0.9. (double [=0.9])
      --kc_high_confidence_median_hit_threshold  For each genome in the k-mer collection FASTA, report it as high confidence when its median hits > kc_high_confidence_median_hit_threshold. Default is 5. (int [=5])
  -j, --json                                     the json format report file name (string [=fastv.json])
  -h, --html                                     the html format report file name (string [=fastv.html])
  -R, --report_title                             should be quoted with ' or ", default is "fastv report" (string [=fastv report])
  -w, --thread                                   worker thread number, default is 4 (int [=4])
```

Other I/O options:

```
  -6, --phred64           indicate the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)
  -z, --compression       compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 4. (int [=4])
      --stdin             input from STDIN. If the STDIN is interleaved paired-end FASTQ, please also add --interleaved_in.
      --stdout            stream passing-filters reads to STDOUT. This option will result in interleaved FASTQ output for paired-end output. Disabled by default.
      --interleaved_in    indicate that <in1> is an interleaved FASTQ which contains both read1 and read2. Disabled by default.
      --reads_to_process  specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])
      --dont_overwrite    don't overwrite existing files. Overwritting is allowed by default.
  -V, --verbose           output verbose log information (i.e. when every 1M reads are processed).
```

QC and quality pruning options (inherited from fastp: https://github.com/OpenGene/fastp):

```
  -A, --disable_adapter_trimming   adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled
  -a, --adapter_sequence           the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped. (string [=auto])
      --adapter_sequence_r2        the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as <adapter_sequence> (string [=auto])
      --adapter_fasta              specify a FASTA file to trim both read1 and read2 (if PE) by all the sequences in this FASTA file (string [=])
      --detect_adapter_for_pe      by default, the auto-detection for adapter is for SE data input only, turn on this option to enable it for PE data.
  -f, --trim_front1                trimming how many bases in front for read1, default is 0 (int [=0])
  -t, --trim_tail1                 trimming how many bases in tail for read1, default is 0 (int [=0])
  -b, --max_len1                   if read1 is longer than max_len1, then trim read1 at its tail to make it as long as max_len1. Default 0 means no limitation (int [=0])
  -F, --trim_front2                trimming how many bases in front for read2. If it's not specified, it will follow read1's settings (int [=0])
  -T, --trim_tail2                 trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings (int [=0])
  -B, --max_len2                   if read2 is longer than max_len2, then trim read2 at its tail to make it as long as max_len2. Default 0 means no limitation. If it's not specified, it will follow read1's settings (int [=0])
      --poly_g_min_len             the minimum length to detect polyG in the read tail. 10 by default. (int [=10])
  -G, --disable_trim_poly_g        disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
  -x, --trim_poly_x                enable polyX trimming in 3' ends.
      --poly_x_min_len             the minimum length to detect polyX in the read tail. 10 by default. (int [=10])
  -5, --cut_front                  move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise.
  -3, --cut_tail                   move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise.
  -r, --cut_right                  move a sliding window from front to tail, if meet one window with mean quality < threshold, drop the bases in the window and the right part, and then stop.
  -W, --cut_window_size            the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4 (int [=4])
  -M, --cut_mean_quality           the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1~36 default: 20 (Q20) (int [=20])
      --cut_front_window_size      the window size option of cut_front, default to cut_window_size if not specified (int [=4])
      --cut_front_mean_quality     the mean quality requirement option for cut_front, default to cut_mean_quality if not specified (int [=20])
      --cut_tail_window_size       the window size option of cut_tail, default to cut_window_size if not specified (int [=4])
      --cut_tail_mean_quality      the mean quality requirement option for cut_tail, default to cut_mean_quality if not specified (int [=20])
      --cut_right_window_size      the window size option of cut_right, default to cut_window_size if not specified (int [=4])
      --cut_right_mean_quality     the mean quality requirement option for cut_right, default to cut_mean_quality if not specified (int [=20])
  -Q, --disable_quality_filtering  quality filtering is enabled by default. If this option is specified, quality filtering is disabled
  -q, --qualified_quality_phred    the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15])
  -u, --unqualified_percent_limit  how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])
  -n, --n_base_limit               if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])
  -e, --average_qual               if one read's average quality score