Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/OpenGene/AfterQC

Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data
https://github.com/OpenGene/AfterQC

adapter-trimming bioinformatics error fastq filtering ngs overlap qc quality-control sequencing trimming

Last synced: about 2 months ago
JSON representation

Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data

Awesome Lists containing this project

README

        

[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square)](http://bioconda.github.io/recipes/afterqc/README.html)
# AfterQC
Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data
`AfterQC` can simply go through all fastq files in a folder and then output three folders: good, bad and QC folders, which contains good reads, bad reads and the QC results of each fastq file/pair.
Currently it supports processing data from HiSeq 2000/2500/3000/4000, Nextseq 500/550, MiniSeq...and other [Illumina 1.8 or newer formats](http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm)

The author has reimplemented this tool in C++ with multithreading support to make it much faster. The new tool is called `fastp` and can be found at: https://github.com/OpenGene/fastp . If you prefer a C++ based tool, please use `fastp` instead.

# An Example of Report
The report of AfterQC is a single HTML page with figures contained in. See an example: [http://opengene.org/AfterQC/report.html](http://opengene.org/AfterQC/report.html)

# Features:
`AfterQC` does following tasks automatically:
* Filters reads with too low quality, too short length or too many N
* Filters reads with abnormal PolyA/PolyT/PolyC/PolyG sequences
* Does per-base quality control and plots the figures
* Trims reads at front and tail, according to QC results
* For pair-end sequencing data, `AfterQC` automatically corrects low quality wrong bases in overlapped area of read1/read2
* Detects and eliminates bubble artifact caused by sequencer due to fluid dynamics issues
* Single molecule barcode sequencing support: if all reads have a single molecule barcode (see duplex sequencing), `AfterQC` shifts the barcodes from the reads to the fastq query names
* Support both single-end sequencing and pair-end sequencing data
* Automatic adapter cutting for pair-end sequencing data
* Sequencing error estimation, and error distribution profiling

# Get AfterQC
* with bioconda `conda install afterqc`
* latest: `git clone https://github.com/OpenGene/AfterQC.git` or download [https://github.com/OpenGene/AfterQC/archive/master.zip](https://github.com/OpenGene/AfterQC/archive/master.zip)
* stable: [Releases](https://github.com/OpenGene/AfterQC/releases)

# PyPy suggestion:
`AfterQC` is compitable with `PyPy`. Using `PyPy` to run `AfterQC` is strongly suggested since it can make `AfterQC` 3X faster than native Python (CPython).  To run with `pypy`, just replace `python` with `pypy` in the commands.

# Simple usage:
* Prepare your fastq files in a folder
* For single-end sequencing, the filenames in the folder should be `*R1*`, otherwise you should specify `--read1_flag`
* For pair-end sequencing, the filenames in the folder should be `*R1*` and `*R2*`, otherwise you should specify `--read1_flag` and `--read2_flag`
```shell
cd /path/to/fastq/folder
python path/to/AfterQC/after.py
```
* three folders will be automatically generated, a folder `good` stores the good reads, a folder `bad` stores the bad reads and a folder `QC` stores the report of quality control
* `AfterQC` will print some statistical information after it is done, such how many good reads, how many bad reads, and how many reads are corrected.
* if you want to run `AfterQC` only with a single file/pair:
```shell
# with a single file
python after.py -1 R1.fq

# with a single pair
python after.py -1 R1.fq -2 R2.fq
```

# Quality Control only
If you only want to get quality control statistics, run:
```shell
python after.py --qc_only
```

# Gzip output
* If the input FastQ files are gzipped, then the output will be also gzipped.  
* If the input FastQ files are not gzipped, you can enable `--gzip` or `-z` option to force gzip compression.
* Use `--compression` to change the compression level (0~9), default is 2. The better the compression, the lower the speed.

# Full options:
***Common options***
```shell
--version show program's version number and exit
-h, --help show this help message and exit
```
***File (name) options***
```

-1 READ1_FILE, --read1_file=READ1_FILE
file name of read1, required. If input_dir is
specified, then this arg is ignored.
-2 READ2_FILE, --read2_file=READ2_FILE
file name of read2, if paired. If input_dir is
specified, then this arg is ignored.
-7 INDEX1_FILE, --index1_file=INDEX1_FILE
file name of 7' index. If input_dir is specified, then
this arg is ignored.
-5 INDEX2_FILE, --index2_file=INDEX2_FILE
file name of 5' index. If input_dir is specified, then
this arg is ignored.
-d INPUT_DIR, --input_dir=INPUT_DIR
the input dir to process automatically. If read1_file
are input_dir are not specified, then current dir (.)
is specified to input_dir
-g GOOD_OUTPUT_FOLDER, --good_output_folder=GOOD_OUTPUT_FOLDER
the folder to store good reads, by default it is the
same folder contains read1
-b BAD_OUTPUT_FOLDER, --bad_output_folder=BAD_OUTPUT_FOLDER
the folder to store bad reads, by default it is same
as good_output_folder
--read1_flag=READ1_FLAG
specify the name flag of read1, default is R1, which
means a file with name *R1* is read1 file
--read2_flag=READ2_FLAG
specify the name flag of read2, default is R2, which
means a file with name *R2* is read2 file
--index1_flag=INDEX1_FLAG
specify the name flag of index1, default is I1,
which means a file with name *I1* is index2 file
--index2_flag=INDEX2_FLAG
specify the name flag of index2, default is I2,
which means a file with name *I2* is index2 file
```
***Filter options***
```
-f TRIM_FRONT, --trim_front=TRIM_FRONT
number of bases to be trimmed in the head of read. -1
means auto detect
-t TRIM_TAIL, --trim_tail=TRIM_TAIL
number of bases to be trimmed in the tail of read. -1
means auto detect
--trim_pair_same=TRIM_PAIR_SAME
use same trimming configuration for read1 and read2 to
keep their sequence length identical, default is true
lots of dedup algorithms require this feature
-q QUALIFIED_QUALITY_PHRED, --qualified_quality_phred=QUALIFIED_QUALITY_PHRED
the quality value that a base is qualifyed. Default 20
means base quality >=Q20 is qualified.
-u UNQUALIFIED_BASE_LIMIT, --unqualified_base_limit=UNQUALIFIED_BASE_LIMIT
if exists more than unqualified_base_limit bases that
quality is lower than qualified quality, then this
read/pair is bad. Default 0 means do not filter reads
by low quality base count
-p POLY_SIZE_LIMIT, --poly_size_limit=POLY_SIZE_LIMIT
if exists one polyX(polyG means GGGGGGGGG...), and its
length is >= poly_size_limit, then this read/pair is
bad. Default is 35
-a ALLOW_MISMATCH_IN_POLY, --allow_mismatch_in_poly=ALLOW_MISMATCH_IN_POLY
the count of allowed mismatches when evaluating
poly_X. Default 5 means disallow any mismatches
-n N_BASE_LIMIT, --n_base_limit=N_BASE_LIMIT
if exists more than maxn bases have N, then this
read/pair is bad. Default is 5
-s SEQ_LEN_REQ, --seq_len_req=SEQ_LEN_REQ
if the trimmed read is shorter than seq_len_req, then
this read/pair is bad. Default is 35
```
***Debubble options (not suggested for regular tasks)***  
If you want to eliminate bubble artifact, turn debubble option on (this is slow, usually you don't need to do this):
```
--debubble enable debubble algorithm to remove the
reads in the bubbles. Default is False
--debubble_dir=DEBUBBLE_DIR
specify the folder to store output of debubble
algorithm, default is debubble
--draw=DRAW specify whether draw the pictures or not, when use
debubble or QC. Default is on
```
***Barcoded sequencing options***
```
--barcode=BARCODE specify whether deal with barcode sequencing files, default is on
--barcode_length=BARCODE_LENGTH
specify the designed length of barcode
--barcode_flag=BARCODE_FLAG
specify the name flag of a barcoded file, default is
barcode, which means a file with name *barcode* is a
barcoded file
--barcode=BARCODE specify whether deal with barcode sequencing files,
default is on, which means all files with barcode_flag
in filename will be treated as barcode sequencing
files
```
***QC options***
```shell
--qc_only enable this option, only QC result will be output, this
can be much faster
--qc_sample=QC_SAMPLE
sample up to qc_sample when do QC, default is 1000,000
--qc_kmer=QC_KMER specify the kmer length for KMER statistics for QC,
default is 8
```

# Understand the report
* `AfterQC` will generate a QC folder, which contains lots of figures.
* For pair-end sequencing data, both read1 and read2 figures will be in the same folder with the folder name of read1's filename. `R1` means `read1`, `R2` means `read2`.
* For single-end sequencing data, it will still have `R1`.
* `prefilter` means `before filtering`, `postfilter` means `after filtering`
* For pair-end sequencing data, `After` will do an `overlap analysis`. read1 and read2 will be overlapped when `read1_length + read2_length > DNA_template_length`.

# Cite AfterQC
Shifu Chen, Tanxiao Huang, Yanqing Zhou, Yue Han, Mingyan Xu and Jia Gu. AfterQC: automatic filtering, trimming, error removing and quality control for fastq data. BMC Bioinformatics 2017 18(Suppl 3):80 https://doi.org/10.1186/s12859-017-1469-3