{"id":25501937,"url":"https://github.com/opengene/fastplong","last_synced_at":"2025-04-06T19:11:01.513Z","repository":{"id":258102791,"uuid":"847533499","full_name":"OpenGene/fastplong","owner":"OpenGene","description":"Ultra-fast preprocessing and quality control for long-read sequencing data","archived":false,"fork":false,"pushed_at":"2025-02-11T06:47:51.000Z","size":230,"stargazers_count":115,"open_issues_count":7,"forks_count":5,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-03-30T18:08:10.401Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenGene.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-26T03:39:43.000Z","updated_at":"2025-03-25T02:12:13.000Z","dependencies_parsed_at":"2024-10-17T17:26:01.795Z","dependency_job_id":"fa64bf3a-84b0-47b9-9f62-f3918b94f983","html_url":"https://github.com/OpenGene/fastplong","commit_stats":null,"previous_names":["opengene/fastplong"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2Ffastplong","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2Ffastplong/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2Ffastplong/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2Ffastplong/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenGene","download_u
rl":"https://codeload.github.com/OpenGene/fastplong/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247535516,"owners_count":20954576,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-19T04:59:34.950Z","updated_at":"2025-04-06T19:11:01.465Z","avatar_url":"https://github.com/OpenGene.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![install with conda](\nhttps://anaconda.org/bioconda/fastplong/badges/version.svg)](https://anaconda.org/bioconda/fastplong)\n# fastplong\nUltrafast preprocessing and quality control for long reads (Nanopore, PacBio, Cyclone, etc.).   
\nIf you're searching for tools to preprocess short reads (Illumina, MGI, etc.), please use [fastp](https://github.com/OpenGene/fastp)  \n\n- [simple usage](#simple-usage)\n- [examples of report](#examples-of-report)\n- [get fastplong](#get-fastplong)\n  - [install with Bioconda](#install-with-bioconda)\n  - [download the latest prebuilt binary for Linux users](#download-the-latest-prebuilt-binary-for-linux-users)\n  - [or compile from source](#or-compile-from-source)\n- [input and output](#input-and-output)\n  - [output to STDOUT](#output-to-stdout)\n  - [input from STDIN](#input-from-stdin)\n  - [store the reads that fail the filters](#store-the-reads-that-fail-the-filters)\n  - [process only part of the data](#process-only-part-of-the-data)\n  - [do not overwrite exiting files](#do-not-overwrite-exiting-files)\n  - [split the output to multiple files for parallel processing](#split-the-output-to-multiple-files-for-parallel-processing)\n- [filtering](#filtering)\n  - [quality filter](#quality-filter)\n  - [length filter](#length-filter)\n  - [low complexity filter](#low-complexity-filter)\n  - [Other filter](#other-filter)\n- [adapters](#adapters)\n- [per read cutting by quality score](#per-read-cutting-by-quality-score)\n- [global trimming](#global-trimming)\n- [output splitting](#output-splitting)\n  - [splitting by limiting file number](#splitting-by-limiting-file-number)\n  - [splitting by limiting the lines of each file](#splitting-by-limiting-the-lines-of-each-file)\n- [all options](#all-options)\n\n# simple usage\n```\nfastplong -i in.fq -o out.fq\n```\nBoth input and output can be gzip compressed. By default, the HTML report is saved to `fastplong.html` (can be specified with `-h` option), and the JSON report is saved to `fastplong.json` (can be specified with `-j` option). 
\n\n# examples of report\n`fastplong` creates reports in both HTML and JSON format.\n* HTML report: http://opengene.org/fastplong/fastplong.html\n* JSON report: http://opengene.org/fastplong/fastplong.json\n\n# get fastplong\n## install with Bioconda\n[![install with conda](\nhttps://anaconda.org/bioconda/fastplong/badges/version.svg)](https://anaconda.org/bioconda/fastplong)\n```shell\nconda install -c bioconda fastplong\n```\n## download the latest prebuilt binary for Linux users\nThis binary was compiled on CentOS, and tested on CentOS/Ubuntu\n```shell\n# download the latest build\nwget http://opengene.org/fastplong/fastplong\nchmod a+x ./fastplong\n\n# or download specified version, i.e. fastplong v0.2.2\nwget http://opengene.org/fastplong/fastplong.0.2.2\nmv fastplong.0.2.2 fastplong\nchmod a+x ./fastplong\n```\n## or compile from source\n`fastplong` depends on `libdeflate` and `isa-l` for fast decompression and compression of zipped data, and depends on `libhwy` for SIMD acceleration. It's recommended to install all of them via Anaconda:\n```\nconda install conda-forge::libdeflate\nconda install conda-forge::isa-l\nconda install conda-forge::libhwy\n```\nYou can also try to install them with other package management systems like `apt/yum` on Linux, or `brew` on MacOS. 
Otherwise you can compile them from source (https://github.com/intel/isa-l, https://github.com/ebiggers/libdeflate, and https://github.com/google/highway)\n\n### download and build fastplong\n```shell\n# get source (you can also use a browser to download from the main branch or from releases)\ngit clone https://github.com/OpenGene/fastplong.git\n\n# build\ncd fastplong\nmake -j\n\n# test\nmake test\n\n# install\nsudo make install\n```\n\n# input and output\nSpecify input by `-i` or `--in`, and specify output by `-o` or `--out`.\n* if you don't specify the output file name, no output file will be written, but QC will still be performed on the data both before and after filtering.\n* the output will be gzip-compressed if its file name ends with `.gz`\n## output to STDOUT\n`fastplong` supports streaming the passing-filter reads to STDOUT, so that they can be piped to other compressors like `bzip2`, or to aligners like `minimap2` or `bowtie2`.\n* specify `--stdout` to stream the output to STDOUT\n## input from STDIN\n* specify `--stdin` if you want to read the input from STDIN.\n## store the reads that fail the filters\n* give `--failed_out` to specify the file name to store the failed reads.\n* if a read fails and is written to `--failed_out`, its `failure reason` will be appended to its read name, for example `failed_quality_filter`, `failed_too_short`, etc.\n## process only part of the data\nIf you don't want to process all the data, you can specify `--reads_to_process` to limit the number of reads to be processed. This is useful if you want a fast preview of the data quality, or want to create a subset of the filtered data.\n## do not overwrite exiting files\nYou can enable the option `--dont_overwrite` to prevent existing files from being overwritten by `fastplong`. 
In this case, `fastplong` will report an error and quit if any of the output files (reads, JSON report, HTML report) already exists.\n## split the output to multiple files for parallel processing\nSee [output splitting](#output-splitting)\n\n# filtering\nMultiple filters have been implemented.\n## quality filter\nQuality filtering is enabled by default, but you can disable it by `-Q` or `--disable_quality_filtering`. Currently it supports filtering by limiting the number of N bases (`-n, --n_base_limit`) and the percentage of unqualified bases.  \n\nTo filter reads by their percentage of unqualified bases, two options should be provided:\n* `-q, --qualified_quality_phred`       the quality value at which a base counts as qualified. Default 15 means phred quality \u003e=Q15 is qualified.\n* `-u, --unqualified_percent_limit`    what percentage of bases is allowed to be unqualified (0~100). Default 40 means 40%\n\nYou can also filter reads by their average quality score\n* `-m, --mean_qual`   if a read's average quality score \u003cmean_qual, then this read is discarded. Default 0 means no requirement (int [=0])\n\n## length filter\nLength filtering is enabled by default, but you can disable it by `-L` or `--disable_length_filtering`. The minimum length requirement is specified with `-l` or `--length_required`.\n\nYou can specify `--length_limit` to discard the reads longer than `length_limit`. The default value 0 means no limitation.\n\n## Other filter\nNew filters are being implemented. If you have a new idea or new request, please file an issue.\n\n# adapters\n`fastplong` trims adapters at both the read start and the read end. 
Adapter trimming is enabled by default, but you can disable it by `-A` or `--disable_adapter_trimming`.\n\n```\nfastplong -i in.fq -o out.fq -s AAGGATTCATTCCCACGGTAACAC -e GTGTTACCGTGGGAATGAATCCTT\n```\n* If the adapter sequences are known, it's recommended to specify `-s, --start_adapter` for read start adapter sequence, and `-e, --end_adapter` for read end adapter sequence as well.\n\n* If `--end_adapter` is not specified but `--start_adapter` is specified, then fastplong will use the reverse complement sequence of `start_adapter` to be `end_adapter`.\n\n* You can also specify `-a, --adapter_fasta` to give a FASTA file to tell `fastplong` to trim multiple adapters in this FASTA file. Here is a sample of such adapter FASTA file:\n```\n\u003eAdapter 1\nAGATCGGAAGAGCACACGTCTGAACTCCAGTCA\n\u003eAdapter 2\nAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT\n\u003epolyA\nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\n```\n\n* The adapter sequence in the FASTA file should be at least 6bp long, otherwise it will be skipped. And you can give whatever you want to trim, rather than regular sequencing adapters (i.e. polyA).\n\n* If all these adapter options (`start_adapter`, `end_adapter` and `adapter_fasta`) are not specified, `fastplong` will try to detect the read start and read end adapters automatically. The detected adapter sequences may be a bit shorter or longer than the real ones. And there is a certain probability of misidentification, especially when most reads don't have adapters (it won't cause too bad result in this case).\n\n* fastplong calculates edit distance when detecting adapters. You can specify the `-d, --distance_threshold` to adjust the mismatch tolerance of adapter comparing. The default value is 0.25, which means allowing 25% mismatch ratio (i.e. allow 10 distance for 40bp adapter). 
Consider increasing this value when the data is noisy (high error rate), and decreasing it when the data is of high quality (low error rate).\n\n* to make the trimming cleaner, fastplong will trim a few more bases adjacent to the adapters. This can be adjusted with `--trimming_extension`, which has a default value of 10.\n\n# per read cutting by quality score\n`fastplong` supports per read sliding window cutting by evaluating the mean quality scores in the sliding window. `fastplong` supports 2 different operations, and you can enable one or both:\n* `-5, --cut_front`             move a sliding window from front (5') to tail, drop the bases in the window if its mean quality is below cut_mean_quality, stop otherwise. Default is disabled. The leading N bases are also trimmed. Use `cut_front_window_size` to set the window size, and `cut_front_mean_quality` to set the mean quality threshold. If the window size is 1, this is similar to the Trimmomatic `LEADING` method.\n* `-3, --cut_tail`              move a sliding window from tail (3') to front, drop the bases in the window if its mean quality is below cut_mean_quality, stop otherwise. Default is disabled. The trailing N bases are also trimmed. Use `cut_tail_window_size` to set the window size, and `cut_tail_mean_quality` to set the mean quality threshold. If the window size is 1, this is similar to the Trimmomatic `TRAILING` method.\n\n\nIf you don't set the window size and mean quality threshold for each of these functions individually, `fastplong` will use the values from `-W, --cut_window_size` and `-M, --cut_mean_quality`.\n\n# global trimming\n`fastplong` supports global trimming, which trims a fixed number of bases from the front or the tail of every read. 
This function is useful since sometimes you want to drop some cycles of a sequencing run.\n\nFor example, the last cycle usually has low quality, and it can be dropped with the `-t 1` or `--trim_tail=1` option.\n\n* The front/tail trimming settings are given with `-f, --trim_front` and `-t, --trim_tail`.\n\n\n# output splitting\nFor parallel processing of FASTQ files (e.g. alignment in parallel), `fastplong` supports splitting the output into multiple files. The splitting can work in two different modes: `by limiting file number` or `by limiting lines of each file`. These two modes cannot be enabled together.   \n\nThe file names of these split files will have a sequential number prefix added to the original file name specified by `--out`, and the width of the prefix is controlled by the `--split_prefix_digits` option. For example, with `--split_prefix_digits=4`, `--out=out.fq`, `--split=3`, the output files will be `0001.out.fq`,`0002.out.fq`,`0003.out.fq`\n\n## splitting by limiting file number\nUse `--split` to specify how many files you want to have. `fastplong` estimates the read number of a FASTQ by reading its first ~1M reads. This estimate is not exact, so the file sizes of the last several files can be a little different (a bit bigger or smaller). For best performance, it is suggested to specify the file number to be a multiple of the thread number.\n\n## splitting by limiting the lines of each file\nSpecify `--split_by_lines` to limit the lines of each file. The last files may have smaller sizes since usually the input file cannot be perfectly divided. 
The actual file lines may be a little greater than the value specified by `--split_by_lines` since `fastplong` reads and writes data by blocks (a block = 1000 reads).\n\n\n# all options\n```shell\nusage: fastplong -i \u003cin\u003e -o \u003cout\u003e [options...]\nfastplong: ultra-fast FASTQ preprocessing and quality control for long reads\nversion 0.0.1\nusage: ./fastplong [options] ... \noptions:\n  -i, --in                           read input file name (string [=])\n  -o, --out                          read output file name (string [=])\n      --failed_out                   specify the file to store reads that cannot pass the filters. (string [=])\n  -z, --compression                  compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 4. (int [=4])\n      --stdin                        input from STDIN.\n      --stdout                       stream passing-filters reads to STDOUT. This option will result in interleaved FASTQ output for paired-end output. Disabled by default.\n      --reads_to_process             specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])\n      --dont_overwrite               don't overwrite existing files. Overwritting is allowed by default.\n  -V, --verbose                      output verbose log information (i.e. when every 1M reads are processed).\n  -A, --disable_adapter_trimming     adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled\n  -s, --start_adapter                the adapter sequence at read start (5'). (string [=auto])\n  -e, --end_adapter                  the adapter sequence at read end (3'). 
(string [=auto])\n  -a, --adapter_fasta                specify a FASTA file to trim both read by all the sequences in this FASTA file (string [=])\n  -d, --distance_threshold           threshold of sequence-adapter-distance/adapter-length (0.0 ~ 1.0), greater value means more adapters detected (double [=0.25])\n      --trimming_extension           when an adapter is detected, extend the trimming to make cleaner trimming, default 10 means trimming 10 bases more (int [=10])\n  -f, --trim_front                   trimming how many bases in front for read, default is 0 (int [=0])\n  -t, --trim_tail                    trimming how many bases in tail for read, default is 0 (int [=0])\n  -x, --trim_poly_x                  enable polyX trimming in 3' ends.\n      --poly_x_min_len               the minimum length to detect polyX in the read tail. 10 by default. (int [=10])\n  -5, --cut_front                    move a sliding window from front (5') to tail, drop the bases in the window if its mean quality \u003c threshold, stop otherwise.\n  -3, --cut_tail                     move a sliding window from tail (3') to front, drop the bases in the window if its mean quality \u003c threshold, stop otherwise.\n  -W, --cut_window_size              the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4 (int [=4])\n  -M, --cut_mean_quality             the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. 
Range: 1~36 default: 20 (Q20) (int [=20])\n      --cut_front_window_size        the window size option of cut_front, default to cut_window_size if not specified (int [=4])\n      --cut_front_mean_quality       the mean quality requirement option for cut_front, default to cut_mean_quality if not specified (int [=20])\n      --cut_tail_window_size         the window size option of cut_tail, default to cut_window_size if not specified (int [=4])\n      --cut_tail_mean_quality        the mean quality requirement option for cut_tail, default to cut_mean_quality if not specified (int [=20])\n  -Q, --disable_quality_filtering    quality filtering is enabled by default. If this option is specified, quality filtering is disabled\n  -q, --qualified_quality_phred      the quality value that a base is qualified. Default 15 means phred quality \u003e=Q15 is qualified. (int [=15])\n  -u, --unqualified_percent_limit    how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])\n  -n, --n_base_limit                 if one read's number of N base is \u003en_base_limit, then this read is discarded. Default is 5 (int [=5])\n  -m, --mean_qual                    if one read's mean_qual quality score \u003cmean_qual, then this read is discarded. Default 0 means no requirement (int [=0])\n  -L, --disable_length_filtering     length filtering is enabled by default. If this option is specified, length filtering is disabled\n  -l, --length_required              reads shorter than length_required will be discarded, default is 15. (int [=15])\n      --length_limit                 reads longer than length_limit will be discarded, default 0 means no limitation. (int [=0])\n  -y, --low_complexity_filter        enable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1]).\n  -Y, --complexity_threshold         the threshold for low complexity filter (0~100). 
Default is 30, which means 30% complexity is required. (int [=30])\n  -j, --json                         the json format report file name (string [=fastplong.json])\n  -h, --html                         the html format report file name (string [=fastplong.html])\n  -R, --report_title                 should be quoted with ' or \", default is \"fastplong report\" (string [=fastplong report])\n  -w, --thread                       worker thread number, default is 3 (int [=3])\n      --split                        split output by limiting total split file number with this option (2~999), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (int [=0])\n      --split_by_lines               split output by limiting lines of each file with this option(\u003e=1000), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (long [=0])\n      --split_prefix_digits          the digits for the sequential number padding (1~10), default is 4, so the filename will be padded as 0001.xxx, 0 to disable padding (int [=4])\n  -?, --help                         print this message\n```\n\n# citations\n### Shifu Chen. 2023. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta 2: e107. https://doi.org/10.1002/imt2.107\n### Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopengene%2Ffastplong","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopengene%2Ffastplong","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopengene%2Ffastplong/lists"}