{"id":13703737,"url":"https://github.com/OpenGene/AfterQC","last_synced_at":"2025-05-05T07:31:37.903Z","repository":{"id":35901747,"uuid":"40188505","full_name":"OpenGene/AfterQC","owner":"OpenGene","description":"Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data","archived":false,"fork":false,"pushed_at":"2020-05-14T07:15:54.000Z","size":1369,"stargazers_count":203,"open_issues_count":26,"forks_count":50,"subscribers_count":20,"default_branch":"master","last_synced_at":"2024-08-03T21:04:06.197Z","etag":null,"topics":["adapter-trimming","bioinformatics","error","fastq","filtering","ngs","overlap","qc","quality-control","sequencing","trimming"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenGene.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-08-04T14:08:33.000Z","updated_at":"2024-06-20T15:43:51.000Z","dependencies_parsed_at":"2022-08-22T18:10:14.257Z","dependency_job_id":null,"html_url":"https://github.com/OpenGene/AfterQC","commit_stats":null,"previous_names":[],"tags_count":17,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2FAfterQC","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2FAfterQC/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2FAfterQC/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2FAfterQC/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenGene","download_url":"https://codeload.
github.com/OpenGene/AfterQC/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224431268,"owners_count":17310078,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adapter-trimming","bioinformatics","error","fastq","filtering","ngs","overlap","qc","quality-control","sequencing","trimming"],"created_at":"2024-08-02T21:00:59.482Z","updated_at":"2024-11-13T10:31:05.600Z","avatar_url":"https://github.com/OpenGene.png","language":"Python","readme":"[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square)](http://bioconda.github.io/recipes/afterqc/README.html)\n# AfterQC\nAutomatic Filtering, Trimming, Error Removing and Quality Control for fastq data   \n`AfterQC` can simply go through all fastq files in a folder and then output three folders: \u003cb\u003egood\u003c/b\u003e, \u003cb\u003ebad\u003c/b\u003e and \u003cb\u003eQC\u003c/b\u003e folders, which contains good reads, bad reads and the QC results of each fastq file/pair.   \nCurrently it supports processing data from HiSeq 2000/2500/3000/4000, Nextseq 500/550, MiniSeq...and other [Illumina 1.8 or newer formats](http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm)   \n\nThe author has reimplemented this tool in C++ with multithreading support to make it much faster. The new tool is called `fastp` and can be found at: https://github.com/OpenGene/fastp . If you prefer a C++ based tool, please use `fastp` instead.   
\n\n# An Example of Report\nThe report of AfterQC is a single HTML page with the figures embedded. See an example: [http://opengene.org/AfterQC/report.html](http://opengene.org/AfterQC/report.html)\n\n# Features:\n`AfterQC` performs the following tasks automatically:  \n* Filters reads with too low quality, too short length or too many N\n* Filters reads with abnormal PolyA/PolyT/PolyC/PolyG sequences\n* Does per-base quality control and plots the figures\n* Trims reads at front and tail, according to QC results\n* For pair-end sequencing data, `AfterQC` automatically corrects low quality wrong bases in the overlapped area of read1/read2\n* Detects and eliminates bubble artifacts caused by the sequencer due to fluid dynamics issues\n* Single molecule barcode sequencing support: if all reads have a single molecule barcode (see duplex sequencing), `AfterQC` shifts the barcodes from the reads to the fastq query names\n* Supports both single-end sequencing and pair-end sequencing data\n* Automatic adapter cutting for pair-end sequencing data\n* Sequencing error estimation, and error distribution profiling\n\n# Get AfterQC\n* with bioconda `conda install afterqc`\n* latest: `git clone https://github.com/OpenGene/AfterQC.git` or download [https://github.com/OpenGene/AfterQC/archive/master.zip](https://github.com/OpenGene/AfterQC/archive/master.zip)\n* stable: [Releases](https://github.com/OpenGene/AfterQC/releases)\n\n# PyPy suggestion:\n`AfterQC` is compatible with `PyPy`. Using `PyPy` to run `AfterQC` is strongly suggested since it can make `AfterQC` 3X faster than native Python (CPython).  
To run with `pypy`, just replace `python` with `pypy` in the commands.\n\n# Simple usage:\n* Prepare your fastq files in a folder\n* For single-end sequencing, the filenames in the folder should be `*R1*`, otherwise you should specify `--read1_flag`\n* For pair-end sequencing, the filenames in the folder should be `*R1*` and `*R2*`, otherwise you should specify `--read1_flag` and `--read2_flag`\n```shell\ncd /path/to/fastq/folder\npython path/to/AfterQC/after.py\n```\n* Three folders will be generated automatically: `good` stores the good reads, `bad` stores the bad reads, and `QC` stores the quality control report\n* `AfterQC` will print some statistical information after it is done, such as how many reads are good, how many are bad, and how many were corrected.\n* If you want to run `AfterQC` only with a single file/pair:\n```shell\n# with a single file\npython after.py -1 R1.fq\n\n# with a single pair\npython after.py -1 R1.fq -2 R2.fq\n```\n\n# Quality Control only\nIf you only want to get quality control statistics, run:  \n```shell\npython after.py --qc_only\n```\n\n# Gzip output\n* If the input FastQ files are gzipped, then the output will also be gzipped.   \n* If the input FastQ files are not gzipped, you can enable the `--gzip` or `-z` option to force gzip compression.\n* Use `--compression` to change the compression level (0~9); the default is 2. Higher levels compress better but run slower.\n\n# Full options:\n***Common options***\n```shell\n  --version             show program's version number and exit\n  -h, --help            show this help message and exit\n```\n***File (name) options***\n```\n\n  -1 READ1_FILE, --read1_file=READ1_FILE\n                        file name of read1, required. If input_dir is\n                        specified, then this arg is ignored.\n  -2 READ2_FILE, --read2_file=READ2_FILE\n                        file name of read2, if paired. 
If input_dir is\n                        specified, then this arg is ignored.\n  -7 INDEX1_FILE, --index1_file=INDEX1_FILE\n                        file name of 7' index. If input_dir is specified, then\n                        this arg is ignored.\n  -5 INDEX2_FILE, --index2_file=INDEX2_FILE\n                        file name of 5' index. If input_dir is specified, then\n                        this arg is ignored.\n  -d INPUT_DIR, --input_dir=INPUT_DIR\n                        the input dir to process automatically. If read1_file\n                        and input_dir are not specified, then current dir (.)\n                        is used as input_dir\n  -g GOOD_OUTPUT_FOLDER, --good_output_folder=GOOD_OUTPUT_FOLDER\n                        the folder to store good reads, by default it is the\n                        same folder that contains read1\n  -b BAD_OUTPUT_FOLDER, --bad_output_folder=BAD_OUTPUT_FOLDER\n                        the folder to store bad reads, by default it is the same\n                        as good_output_folder\n  --read1_flag=READ1_FLAG\n                        specify the name flag of read1, default is R1, which\n                        means a file with name *R1* is read1 file\n  --read2_flag=READ2_FLAG\n                        specify the name flag of read2, default is R2, which\n                        means a file with name *R2* is read2 file\n  --index1_flag=INDEX1_FLAG\n                        specify the name flag of index1, default is I1,\n                        which means a file with name *I1* is index1 file\n  --index2_flag=INDEX2_FLAG\n                        specify the name flag of index2, default is I2,\n                        which means a file with name *I2* is index2 file\n```\n***Filter options***\n```\n  -f TRIM_FRONT, --trim_front=TRIM_FRONT\n                        number of bases to be trimmed in the head of read. 
-1\n                        means auto detect\n  -t TRIM_TAIL, --trim_tail=TRIM_TAIL\n                        number of bases to be trimmed in the tail of read. -1\n                        means auto detect\n  --trim_pair_same=TRIM_PAIR_SAME\n                        use same trimming configuration for read1 and read2 to\n                        keep their sequence length identical, default is true\n                        lots of dedup algorithms require this feature\n  -q QUALIFIED_QUALITY_PHRED, --qualified_quality_phred=QUALIFIED_QUALITY_PHRED\n                        the quality value that a base is qualified. Default 20\n                        means base quality \u003e=Q20 is qualified.\n  -u UNQUALIFIED_BASE_LIMIT, --unqualified_base_limit=UNQUALIFIED_BASE_LIMIT\n                        if exists more than unqualified_base_limit bases that\n                        quality is lower than qualified quality, then this\n                        read/pair is bad. Default 0 means do not filter reads\n                        by low quality base count\n  -p POLY_SIZE_LIMIT, --poly_size_limit=POLY_SIZE_LIMIT\n                        if exists one polyX(polyG means GGGGGGGGG...), and its\n                        length is \u003e= poly_size_limit, then this read/pair is\n                        bad. Default is 35\n  -a ALLOW_MISMATCH_IN_POLY, --allow_mismatch_in_poly=ALLOW_MISMATCH_IN_POLY\n                        the count of allowed mismatches when evaluating\n                        poly_X. Default 5 means disallow any mismatches\n  -n N_BASE_LIMIT, --n_base_limit=N_BASE_LIMIT\n                        if exists more than maxn bases have N, then this\n                        read/pair is bad. Default is 5\n  -s SEQ_LEN_REQ, --seq_len_req=SEQ_LEN_REQ\n                        if the trimmed read is shorter than seq_len_req, then\n                        this read/pair is bad. 
Default is 35\n```\n***Debubble options (not suggested for regular tasks)***    \nIf you want to eliminate bubble artifacts, turn the debubble option on (this is slow and usually unnecessary): \n```\n  --debubble            enable debubble algorithm to remove the\n                        reads in the bubbles. Default is False\n  --debubble_dir=DEBUBBLE_DIR\n                        specify the folder to store output of debubble\n                        algorithm, default is debubble\n  --draw=DRAW           specify whether draw the pictures or not, when use\n                        debubble or QC. Default is on\n```\n***Barcoded sequencing options***\n```\n  --barcode_length=BARCODE_LENGTH\n                        specify the designed length of barcode\n  --barcode_flag=BARCODE_FLAG\n                        specify the name flag of a barcoded file, default is\n                        barcode, which means a file with name *barcode* is a\n                        barcoded file\n  --barcode=BARCODE     specify whether deal with barcode sequencing files,\n                        default is on, which means all files with barcode_flag\n                        in filename will be treated as barcode sequencing\n                        files\n```\n***QC options***\n```shell\n  --qc_only             enable this option, only QC result will be output, this\n                        can be much faster\n  --qc_sample=QC_SAMPLE\n                        sample up to qc_sample when do QC, default is 1,000,000\n  --qc_kmer=QC_KMER     specify the kmer length for KMER statistics for QC,\n                        default is 8\n```\n                        \n# Understand the report\n* `AfterQC` will generate a QC folder, which contains lots of figures. \n* For pair-end sequencing data, both read1 and read2 figures will be in the same folder, which is named after read1's filename. 
`R1` means `read1`, `R2` means `read2`.\n* For single-end sequencing data, it will still have `R1`.\n* `prefilter` means `before filtering`, `postfilter` means `after filtering`\n* For pair-end sequencing data, `AfterQC` will do an `overlap analysis`. read1 and read2 overlap when `read1_length + read2_length \u003e DNA_template_length`. \n\n# Cite AfterQC\nShifu Chen, Tanxiao Huang, Yanqing Zhou, Yue Han, Mingyan Xu and Jia Gu.  AfterQC: automatic filtering, trimming, error removing and quality control for fastq data.  BMC Bioinformatics 2017 18(Suppl 3):80 https://doi.org/10.1186/s12859-017-1469-3\n","funding_links":[],"categories":["Next Generation Sequencing","Ranked by starred repositories"],"sub_categories":["Sequence Processing"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenGene%2FAfterQC","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FOpenGene%2FAfterQC","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenGene%2FAfterQC/lists"}