{"id":13752394,"url":"https://github.com/OpenGene/repaq","last_synced_at":"2025-05-09T19:32:01.388Z","repository":{"id":62248336,"uuid":"141722807","full_name":"OpenGene/repaq","owner":"OpenGene","description":"A fast lossless FASTQ compressor with ultra-high compression ratio","archived":false,"fork":false,"pushed_at":"2024-10-22T01:52:08.000Z","size":190,"stargazers_count":133,"open_issues_count":8,"forks_count":21,"subscribers_count":21,"default_branch":"master","last_synced_at":"2025-04-05T14:03:01.651Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenGene.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-07-20T14:35:55.000Z","updated_at":"2025-03-05T01:45:23.000Z","dependencies_parsed_at":"2022-10-29T06:18:41.590Z","dependency_job_id":"df949d2e-6edc-4c0c-a76a-a64c639723b7","html_url":"https://github.com/OpenGene/repaq","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2Frepaq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2Frepaq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2Frepaq/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2Frepaq/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenGene","download_url":"https://codeload.github.com/OpenGene/repaq/tar.gz/r
efs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253312304,"owners_count":21888617,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:05.150Z","updated_at":"2025-05-09T19:32:01.073Z","avatar_url":"https://github.com/OpenGene.png","language":"C","funding_links":[],"categories":["Ranked by starred repositories"],"sub_categories":[],"readme":"[![install with conda](\nhttps://anaconda.org/bioconda/repaq/badges/version.svg)](https://anaconda.org/bioconda/repaq)\n# repaq\nA tool to compress FASTQ files with an ultra-high compression ratio and high speed. `repaq` supports compressing FASTQ to the `.rfq` or `.rfq.xz` format. Compressing to `.rfq` is ultra fast, while compressing to `.rfq.xz` provides a very high compression ratio. \n\nFor NovaSeq data, as an example:  \n* the `.rfq` file can be much smaller than `.fq.gz`, and the compression time is usually less than 1/5 of that of gzip compression.\n* the `.rfq.xz` file can be as small as 5% of the original FASTQ file, or smaller than 30% of the `.fq.gz` file.\n\nFor paired-end FASTQ files, `repaq` compresses them into one single file to provide a higher compression ratio.\n\nThis tool also supports non-Illumina format FASTQ (e.g. the BGI-SEQ format), but the compression ratio is not as good as for Illumina format FASTQ.\n\n*Citation: Chen S, Chen Y, Wang Z, Qin W, Zhang J, Nand H, Zhang J, Li J, Zhang X, Liang X and Xu M (2023) Efficient sequencing data compression and FPGA acceleration based on a two-step framework. Front. Genet. 14:1260531. 
doi: 10.3389/fgene.2023.1260531*   \n\n# take a look at the compression ratio\nHere we demonstrate the compression ratio on a pair of paired-end NovaSeq FASTQ files. You can download these files and test locally.\n* `nova.R1.fq`: 1704 MB, the original read1 file, http://opengene.org/repaq/testdata/nova.R1.fq\n* `nova.R2.fq`: 1704 MB, the original read2 file, http://opengene.org/repaq/testdata/nova.R2.fq\n* `nova.R1.fq.gz`: 308 MB (CR 18.08%), the gzipped read1, http://opengene.org/repaq/testdata/nova.R1.fq.gz\n* `nova.R2.fq.gz`: 325 MB (CR 19.07%), the gzipped read2, http://opengene.org/repaq/testdata/nova.R2.fq.gz\n* `nova.rfq`: 333 MB (CR 9.77%), the repacked file of read1+read2, http://opengene.org/repaq/testdata/nova.rfq\n* `nova.rfq.xz`: 134 MB (CR 3.93%), the xz compressed `nova.rfq`, http://opengene.org/repaq/testdata/nova.rfq.xz\n\nSee? The size of the final `nova.rfq.xz` is only 3.93% of the original FASTQ files! You can decompress it and check the md5 sums to verify that the output is identical to the original files! \n\nTypically with one single CPU core, it takes less than 1 minute to convert `nova.R1.fq + nova.R2.fq` to `nova.rfq`, and less than 5 minutes to compress `nova.rfq` to `nova.rfq.xz` with xz.\n\n# get repaq\n## install with Bioconda\n[![install with conda](\nhttps://anaconda.org/bioconda/repaq/badges/version.svg)](https://anaconda.org/bioconda/repaq)\n```shell\nconda install -c bioconda repaq\n```\n## download binary \nThis binary is only for Linux systems: http://opengene.org/repaq/repaq\n```shell\n# this binary was compiled on CentOS, and tested on CentOS/Ubuntu\nwget http://opengene.org/repaq/repaq\nchmod a+x ./repaq\n```\n## or compile from source\n```shell\n# get source (you can also use a browser to download from master or releases)\ngit clone https://github.com/OpenGene/repaq.git\n\n# build\ncd repaq\nmake\n\n# Install\nsudo make install\n```\n\n# usage\nFor single-end mode:\n```shell\n# compress to .rfq.xz\nrepaq -c -i in.fq -o out.rfq.xz\n\n# decompress from .rfq.xz\nrepaq 
-d -i in.rfq.xz -o out.fq\n```\n\nFor paired-end mode:\n```shell\n# compress to .rfq.xz\nrepaq -c -i in.R1.fq -I in.R2.fq -o out.rfq.xz\n\n# decompress from .rfq.xz\nrepaq -d -i in.rfq.xz -o out.R1.fq -O out.R2.fq\n```\n\nTips:\n* `-i` and `-I` always denote the first and second input files, while `-o` and `-O` always denote the first and second output files.\n* the FASTQ input/output files can be gzipped if their names end with `.gz`.\n* for paired-end data, the .rfq file created in paired-end mode is usually much smaller than the sum of the .rfq files created in single-end mode for R1 and R2 respectively. To obtain a high compression ratio, always use PE mode for PE data.\n* if you want higher speed and are not concerned with compression ratio, replace `xxx.rfq.xz` with `xxx.rfq`, then repaq will compress to or decompress from the `.rfq` format.\n\n# system requirements\n* Memory: 16G RAM\n* CPU: 4 cores\n\n# verify the compressed file\nrepaq offers a `compare` mode to check the consistency of the original FASTQ file(s) and the compressed .rfq or .rfq.xz file. 
\n* set `--compare` to enable the `compare` mode\n* specify the .rfq or .rfq.xz file with the `-r` option\n* specify the FASTQ files with the `-i` and `-I` options.\n\nExamples:\n```shell\n# for single-end data\nrepaq --compare -i original.R1.fq  -r compressed.rfq.xz\n\n# for paired-end data\nrepaq --compare -i original.R1.fq.gz -I original.R2.fq.gz  -r compressed.rfq.xz\n```\nIf no exception occurs, you will get JSON output like:\n```json\n{\n\t\"result\":\"passed\",\n\t\"msg\":\"\",\n\t\"fastq_reads\":50000,\n\t\"rfq_reads\":50000,\n\t\"fastq_bases\":7419082,\n\t\"rfq_bases\":7419082\n}\n```\nThe `result` will be \"failed\" if the compressed file is not consistent with the original FASTQ files.\n\n# STDIN and STDOUT\nrepaq can read the input from STDIN, and write the output to STDOUT.\n* specify `--stdin` if you want to read from STDIN for compression or decompression.\n* specify `--stdout` if you want to write to STDOUT for compression or decompression.\n* in decompression mode, if `--stdout` is specified, the output will be an interleaved PE stream.\n* if the STDIN is an interleaved paired-end stream, specify `--interleaved_in` to indicate that.\n* note that STDIN cannot be read when the input is a .xz file, and STDOUT cannot be written when the output is a .xz file\n\nHere is an example of compressing the interleaved PE output from fastp by directly using pipes:\n```shell\nfastp -i R1.fq -I R2.fq --stdout | repaq -c --interleaved_in --stdin -o out.rfq.xz\n```\n\n# FASTQ Format compatibility  \nrepaq was initially designed for compressing Illumina data, but it also works with data from other platforms, like BGI-Seq. 
To work with repaq, the FASTQ format must meet the following conditions:\n* only has bases A/T/C/G/N.\n* each FASTQ record has exactly four lines (name, sequence, strand, quality).\n* the name and strand lines cannot be longer than 255 bytes.\n* the number of different quality characters cannot be more than 127.\n\n`repaq` works best for Illumina data directly output by `bcl2fastq`.\n\n# all options\n```shell\noptions:\n  -i, --in1                    input file name (string [=])\n  -o, --out1                   output file name (string [=])\n  -I, --in2                    read2 input file name when encoding paired-end FASTQ files (string [=])\n  -O, --out2                   read2 output file name when decoding to paired-end FASTQ files (string [=])\n  -c, --compress               compress input to output\n  -d, --decompress             decompress input to output\n  -k, --chunk                  the chunk size (kilo bases) for encoding, default 1000=1000kb. (int [=1000])\n      --stdin                  input from STDIN. If the STDIN is interleaved paired-end FASTQ, please also add --interleaved_in.\n      --stdout                 write to STDOUT. When decompressing PE data, this option will result in interleaved FASTQ output for paired-end input. Disabled by defaut.\n      --interleaved_in         indicate that \u003cin1\u003e is an interleaved paired-end FASTQ which contains both read1 and read2. Disabled by defaut.\n  \n# following options are used to check the consistency of the compressed data\n  -p, --compare                compare the files read by read to check the compression consistency. \u003crfq_to_compare\u003e should be specified in this mode.\n  -r, --rfq_to_compare         the RFQ file to be compared with the input. This option is only used in compare mode. (string [=])\n  -j, --json_compare_result    the file to store the comparison result. This is optional since the result is also printed on STDOUT. 
(string [=])\n\n# options for .xz output\n  -t, --thread                 thread number for xz compression. Higher thread num means higher speed and lower compression ratio (1~16), default 1. (int [=1])\n  -z, --compression            compression level. Higher level means higher compression ratio, and more RAM usage (1~9), default 4. (int [=4])\n\n  -?, --help                   print this message\n```\n\n# external dependency\n`repaq` makes a system call in order to run the xz compression tool available on GNU/Linux systems. If xz isn't installed, `repaq` will fail with the message: \n\n```\nfailed to call xz, please confirm that xz is installed in your system\n```\n# citation\nShifu Chen, Yaru Chen, Zhouyang Wang, Wenjian Qin, Jing Zhang, Heera Nand, Jishuai Zhang, Jun Li, Xiaoni Zhang, Xiaoming Liang, Mingyan Xu. Efficient sequencing data compression and FPGA acceleration based on a two-step framework. Frontiers in Genetics, 2023, https://doi.org/10.3389/fgene.2023.1260531\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenGene%2Frepaq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FOpenGene%2Frepaq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenGene%2Frepaq/lists"}