{"id":48071164,"url":"https://github.com/refresh-bio/colord","last_synced_at":"2026-04-04T14:43:00.990Z","repository":{"id":42012727,"uuid":"377760217","full_name":"refresh-bio/colord","owner":"refresh-bio","description":"A versatile compressor of third generation sequencing reads.","archived":false,"fork":false,"pushed_at":"2024-03-24T00:07:48.000Z","size":5800,"stargazers_count":45,"open_issues_count":8,"forks_count":10,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-03-26T22:33:49.348Z","etag":null,"topics":["bioinformatics","compression","fastq-files","genomics","long-reads","oxford-nanopore","pac-bio","sequencing"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/refresh-bio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-06-17T08:33:22.000Z","updated_at":"2024-02-16T13:37:21.000Z","dependencies_parsed_at":"2023-12-08T17:24:11.516Z","dependency_job_id":"ecfb0c79-c8f2-4b85-a0b3-3ddf2ba4031f","html_url":"https://github.com/refresh-bio/colord","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/refresh-bio/colord","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/refresh-bio%2Fcolord","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/refresh-bio%2Fcolord/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/refresh-bio%2Fcolord/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/refresh-bio%2Fcolord/manifests","owner_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/owners/refresh-bio","download_url":"https://codeload.github.com/refresh-bio/colord/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/refresh-bio%2Fcolord/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31403408,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T10:20:44.708Z","status":"ssl_error","status_checked_at":"2026-04-04T10:20:06.846Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","compression","fastq-files","genomics","long-reads","oxford-nanopore","pac-bio","sequencing"],"created_at":"2026-04-04T14:43:00.935Z","updated_at":"2026-04-04T14:43:00.985Z","avatar_url":"https://github.com/refresh-bio.png","language":"C++","readme":"# CoLoRd - Compressing long reads\n\n[![GitHub downloads](https://img.shields.io/github/downloads/refresh-bio/colord/total.svg?style=flag\u0026label=GitHub%20downloads)](https://github.com/refresh-bio/colord/releases)\n[![Bioconda downloads](https://img.shields.io/conda/dn/bioconda/colord.svg?style=flag\u0026label=Bioconda%20downloads)](https://anaconda.org/bioconda/colord)\n[![GitHub Actions CI](../../actions/workflows/main.yml/badge.svg)](../../actions/workflows/main.yml)\n[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)\n\nA versatile 
compressor of third generation sequencing reads.\n\n## Quick start\n\n```bash\ngit clone --recurse-submodules https://github.com/refresh-bio/colord\ncd colord \u0026\u0026 make\ncd bin\n\nINPUT=./../test\n\n# default compression presets (lossy quality, memory priority)\n./colord compress-ont ${INPUT}/M.bovis.fastq ont.default \t\t# Oxford Nanopore\n./colord compress-pbhifi ${INPUT}/D.melanogaster.fastq hifi.default\t# PacBio HiFi \n./colord compress-pbraw ${INPUT}/A.thaliana.fastq clr.default \t\t# PacBio CLR/subreads\n\n# print ONT archive information and decompress\n./colord info ont.default\n./colord decompress ont.default ont.fastq\n\n# compress HiFi reads preserving original quality levels\n./colord compress-pbhifi -q org ${INPUT}/D.melanogaster.fastq hifi.lossless\n\n# compress CLR reads with ratio priority using 48 threads\n./colord compress-pbraw -p ratio -t 48 ${INPUT}/A.thaliana.fastq clr.ratio\n\n# compress ONT reads w.r.t. reference genome (embed the reference in the archive)\n./colord compress-ont -G ${INPUT}/M.bovis-reference.fna -s ${INPUT}/M.bovis.fastq ont.refbased\n\n# decompress the reference-based archive\n./colord decompress ont.refbased ont.refbased.fastq\n\n```\n\n## Installation and configuration\n\nCoLoRd comes with a set of [precompiled binaries](https://github.com/refresh-bio/colord/releases) for Windows, Linux, and macOS. They can be found under the Releases tab. 
\nThe software is also available on [Bioconda](https://anaconda.org/bioconda/colord):\n```\nconda install -c bioconda colord\n```\nFor detailed instructions on how to set up Bioconda, please refer to the [Bioconda manual](https://bioconda.github.io/user/install.html#install-conda).\nCoLoRd can also be built from the sources distributed as:\n\n* Visual Studio 2019 solution for Windows,\n* MAKE project (G++ 8.4 required) for Linux and macOS.\n\nTo install G++ under macOS, one can use the *Homebrew* package manager:\n```\n/usr/bin/ruby -e \"$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)\"\nbrew install gcc@10\n```\nBefore running CoLoRd on macOS, the current limit of file descriptors should be increased:\n```\nulimit -n 2048\n```\n\n## Usage\n\n### Compression\n\n`colord \u003cmode\u003e [options] \u003cinput\u003e \u003carchive\u003e`\n\nModes:\n* `compress-ont` - compress Oxford Nanopore reads,\n* `compress-pbhifi` - compress PacBio HiFi reads,\n* `compress-pbraw` - compress PacBio CLR/subreads.\n\nPositionals: \n* `input` - input FASTQ/FASTA path (gzipped or not),\n* `output` - archive path. 
\n\nOptions:\n* `-h, --help` - print help,\n* `-k, --kmer-len` - *k*-mer length (15-28, default: auto adjust),\n* `-t, --threads` - number of threads (default: 12),\n* `-p, --priority` - compression priority: `memory`, `balanced`, `ratio` (default: `memory`),\n* `-q, --qual` - quality compression mode:\n\t* `org` - original,\n\t* `none` - discard (Q0 for all bases),\n\t* `avg` - average over entire file,\n\t* `2-fix`,`4-fix`,`5-fix` - 2/4/5 bins with fixed representatives,\n\t* `2-avg`,`4-avg`,`5-avg` - 2/4/5 bins with averages as representatives; default value depends on the mode (`4-avg` for `ont`, `5-avg` for `pbhifi`, `none` for `pbraw`),\n* `-T, --qual-thresholds` - quality thresholds:\n\t* single value for `2-fix`/`2-avg` (default: 7),\n\t* three values for `4-fix`/`4-avg` (default: 7 14 26),\n\t* four values for `5-fix`/`5-avg` (default: 7 14 26 93),\n\t* not allowed for `avg`, `org` and `none` modes,\n* `-D, --qual-values` - bin representatives for decompression,\n   * single value for `none` mode (default: 0),\n   * two values for `2-fix` mode (default: 1 13),\n   * four values for `4-fix` mode (default: 3 10 18 35),\n   * five values for `5-fix` mode (default: 3 10 18 35 93),\n   * not allowed for `avg`, `org`, `2-avg`, `4-avg` and `5-avg` modes,\n* `-G, --reference-genome` - optional reference genome path (multi-FASTA gzipped or not), it enables reference-based mode which provides better compression ratios,\n* `-s, --store-reference` - stores the reference genome in the archive, use only with `-G` flag,\n* `-v, --verbose` - verbose mode.\n\nAdvanced options (default values may depend on the mode - please run `colord --help \u003cmode\u003e` to get the details):\n* `-a, --anchor-len` - anchor length (default: auto adjust),\n* `-L, --Lowest-count` - minimal *k*-mer count,\n* `-H, --Highest-count` - maximal *k*-mer count,\n* `-f, --filter-modulo` - *k*-mers for which *hash(k-mer) mod f != 0* will be filtered 
out before graph building,\n* `-c, --max-candidates` - maximal number of candidate reads considered as a reference,\n* `-e, --edit-script-mult` - multiplier for the predicted cost of storing a read part as an edit script,\n* `-r, --max-recurence-level` - maximal level of recurrence when considering alternative reference reads,\n* `--min-to-alt` - minimum length of the encoded part required to consider using an alternative read,\n* `--min-mmer-frac` - if *A* is the set of m-mers in the read *R* being encoded, the read is refused from encoding if *|A| \u003c min-mmer-frac * len(R)*,\n* `--min-mmer-force-enc` - if *A* is the set of m-mers in the read *R* being encoded, the read is always accepted for encoding if *|A| \u003e min-mmer-force-enc * len(R)*,\n* `--max-matches-mult` - if the number of matches between the read *R* being encoded and a reference read is *r*, the read is refused from encoding if *r \u003e max-matches-mult * len(R)*,\n* `--fill-factor-filtered-kmers` - fill factor of the filtered *k*-mers hash table,\n* `--fill-factor-kmers-to-reads` - fill factor of the *k*-mers to reads hash table,\n* `--min-anchors` - if the number of anchors common to the read being encoded and a reference candidate is lower than this value, the candidate is refused,\n* `-i, --identifier` - header compression mode: `main`/`none`/`org` (default: `org`),\n* `-R, --Ref-reads-mode` - reference reads mode: `all`/`sparse` (default: `sparse`),\n* `-g, --sparse-range` - sparse mode range. 
The probability of reference read acceptance is *1 / pow(id/range_reads, exponent)*, where range_reads is determined based on the number of symbols, which in turn is determined by the number of trusted unique *k*-mers (estimated genome length) multiplied by the value of this parameter,\n* `-x, --sparse-exponent` - sparse mode exponent.\n\n#### Hints\nWhile the number of CoLoRd parameters is large, in most cases the default values will work just fine.\nIn terms of compression, there is always a trade-off between compression ratio and resource requirements (mainly memory and compute time).\nIf the default behavior of CoLoRd is insufficient, the first thing to try is changing the compression priority mode (```-p``` parameter).\nThe compression priority modes aggregate multiple other parameters influencing compression ratio.\nThere are the following priority modes (ordered increasingly w.r.t. the compression efficiency and resource requirements):\n\n * ```memory```\n * ```balanced```\n * ```ratio```\n\nThe ```memory``` priority mode is the default.\n\nQuality scores have a high impact on the compression. They are hard to compress due to their nature and, at the same time (as presented in the paper), their resolution can be safely reduced without affecting downstream analyses. For this reason, in each priority mode, the quality scores are compressed lossily. If it is required to keep the original quality scores, one should use ```-q org```. 
Note that several other quality compression modes exist (see the paper).\n\nHere are compression results for a large set of human reads [NA12878](http://s3.amazonaws.com/nanopore-human-wgs/rel6/rel_6.fastq.gz) with a total size of 268,305,314,354 bytes.\n\n|                                            | Lossy           | Lossless        |\n| ------------------------------------------ | --------------- | --------------- |\n| Compressed in ```memory``` mode size [B]   | 42,120,596,486  | 105,807,350,384 |\n| Compressed in ```balanced``` mode size [B] | 39,833,878,505  | 103,367,993,362 |\n| Compressed in ```ratio``` mode size [B]    | 38,832,714,102  | 101,305,368,675 |\n| Time in ```memory``` mode [h:mm:ss]        | 1:12:42         | 1:26:02         |\n| Time in ```balanced``` mode [h:mm:ss]      | 1:33:18         | 2:11:21         |\n| Time in ```ratio``` mode [h:mm:ss]         | 3:18:46         | 4:57:09         |\n| Memory in ```memory``` mode [KB]           | 13,715,168      | 14,341,128      |\n| Memory in ```balanced``` mode [KB]         | 26,728,108      | 27,293,824      |\n| Memory in ```ratio``` mode [KB]            | 97,922,208      | 99,133,548      |\n\n\nIf one wants to check how much CoLoRd can squeeze the input data regardless of the resource requirements, the ```ratio``` mode should be used.\nIf more control over execution is needed, the remaining parameters may be configured. 
\nThe simplest way to decide in which direction to adjust a parameter, without having to understand its meaning, is to display the defaults for a given compression priority mode with the ```--help``` switch.\nFor example, let's say you want to find out if you should increase or decrease the ```-f``` parameter to improve the compression ratio while compressing ONT data.\nYou may run CoLoRd twice with the following parameters:\n```\n./colord compress-ont --help -p balanced\n./colord compress-ont --help -p ratio\n```\nYou will notice the default for ```-f``` is higher for ```balanced``` mode, which means lowering it will increase the compression ratio. The same approach may be applied to other parameters (```-L```, ```-H```, ```-c```, ```-r```, ```--min-to-alt```, etc.).\n\nIn the ```ratio``` priority mode, all the input reads may serve as a reference to encode other reads. This will increase RAM usage, especially for large datasets. In the remaining modes, only part of the reads may serve as a reference. If needed, ```-g``` and ```-x``` may be used to adjust this.\n\nThe values for ```-k``` and ```-a``` parameters are auto-adjusted based on the size of the data to be compressed. As a general rule, the larger the input, the higher the values of these parameters should be.\n\n\n\n### Decompression\n\n`colord decompress [options] \u003carchive\u003e \u003coutput\u003e`\n\nPositionals:\n* `input` - archive path,\n* `output` - output file path.\n\nOptions:\n* `-h, --help` - print help,\n* `-G, --reference-genome` - optional reference genome path (multi-FASTA gzipped or not), required for reference-based archives with no reference genome embedded (`-G` compression without `-s` switch),\n* `-v, --verbose` - verbose mode.\n\n\n### Archive information\n\n`colord info \u003carchive\u003e`\n\n## API\n\nCoLoRd comes with a C++ API allowing straightforward access to an existing archive. 
Below is an example of using the API in code.\n\n```c++\n#include \"colord_api.h\"\n#include \u003ciostream\u003e\n\nint main(int argc, char** argv) {\n\ttry {\n\t\tcolord::DecompressionStream stream(\"archive.colord\");\t// load a CoLoRd archive\n\t\tauto info = stream.GetInfo();\t\t\t\t// get and print archive information\n\t\tstd::cerr \u003c\u003c \"Archive info:\\n\\n\";\n\t\tinfo.ToOstream(std::cerr);\n\n\t\t// iterate over records in the archive\n\t\twhile (auto x = stream.NextRecord()) {\n\t\t\tif (info.isFastq) {\n\t\t\t\tstd::cout \u003c\u003c \"@\" \u003c\u003c x.ReadHeader() \u003c\u003c \"\\n\";\n\t\t\t\tstd::cout \u003c\u003c x.Read() \u003c\u003c \"\\n\";\n\t\t\t\tstd::cout \u003c\u003c \"+\" \u003c\u003c x.QualHeader() \u003c\u003c \"\\n\";\n\t\t\t\tstd::cout \u003c\u003c x.Qual() \u003c\u003c \"\\n\";\n\t\t\t} else {\n\t\t\t\tstd::cout \u003c\u003c \"\u003e\" \u003c\u003c x.ReadHeader() \u003c\u003c \"\\n\";\n\t\t\t\tstd::cout \u003c\u003c x.Read() \u003c\u003c \"\\n\";\n\t\t\t}\n\t\t}\n\t}\n\tcatch (const std::exception\u0026 ex) {\n\t\tstd::cerr \u003c\u003c \"Error: \" \u003c\u003c ex.what() \u003c\u003c \"\\n\";\n\t\treturn -1;\n\t}\n\treturn 0;\n}\n```\n\n### Compiling own code utilizing colord API\nTo use the API, one needs to include the ```colord_api.h``` header file and link against ```libcolord_api.a```. ```libcolord_api.a``` uses ```std::thread```s and zlib, so the ```-lpthread``` and ```-lz``` flags are needed for linking. 
For example, to compile and link the code above one could use the following command:\n```\ng++ -O3 $SRC_FILE -I$INCLUDE_DIR $LIB_DIR/libcolord_api.a -lz -lpthread -o example -no-pie\n```\nwhere\n * ```SRC_FILE``` is the path to the source file,\n * ```INCLUDE_DIR``` is the path to the directory containing ```colord_api.h``` (when ```colord``` is built from sources, an ```include``` directory is created at the same location as the ```Makefile```),\n * ```LIB_DIR``` is the path to the directory containing ```libcolord_api.a``` (when ```colord``` is built from sources, a ```bin``` directory is created at the same location as the ```Makefile```; it contains, among others, ```libcolord_api.a```).\n\n\n## Citing\n[Kokot, M., Gudyś, A., Li, H. and Deorowicz, S. (2022) CoLoRd: Compressing long reads. *Nature Methods*, https://doi.org/10.1038/s41592-022-01432-3](https://doi.org/10.1038/s41592-022-01432-3)\n\n\n\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frefresh-bio%2Fcolord","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frefresh-bio%2Fcolord","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frefresh-bio%2Fcolord/lists"}