{"id":43927040,"url":"https://github.com/algbio/ggcat","last_synced_at":"2026-02-06T23:09:05.882Z","repository":{"id":41086206,"uuid":"222917950","full_name":"algbio/ggcat","owner":"algbio","description":"Compacted and colored de Bruijn graph construction and querying","archived":false,"fork":false,"pushed_at":"2025-12-01T19:54:08.000Z","size":5017,"stargazers_count":85,"open_issues_count":8,"forks_count":12,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-12-04T09:55:03.024Z","etag":null,"topics":["bioinformatics","bioinformatics-pipeline","de-novo-assembly","debruijn-graph","genome-assembly","rust","sequencing"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/algbio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2019-11-20T11:03:17.000Z","updated_at":"2025-12-01T19:54:13.000Z","dependencies_parsed_at":"2024-03-09T17:28:28.713Z","dependency_job_id":"e086cb0c-da40-4b84-8a76-15c5314588f9","html_url":"https://github.com/algbio/ggcat","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/algbio/ggcat","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/algbio%2Fggcat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/algbio%2Fggcat/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/algbio%2Fggcat/releases","manifests_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/algbio%2Fggcat/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/algbio","download_url":"https://codeload.github.com/algbio/ggcat/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/algbio%2Fggcat/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29179641,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-06T22:12:24.066Z","status":"ssl_error","status_checked_at":"2026-02-06T22:12:09.859Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","bioinformatics-pipeline","de-novo-assembly","debruijn-graph","genome-assembly","rust","sequencing"],"created_at":"2026-02-06T23:09:05.227Z","updated_at":"2026-02-06T23:09:05.870Z","avatar_url":"https://github.com/algbio.png","language":"Rust","readme":"[![BioConda Install](https://img.shields.io/conda/dn/bioconda/ggcat.svg?style=flag\u0026label=BioConda%20install)](https://anaconda.org/bioconda/ggcat)\n[![License](https://img.shields.io/github/license/algbio/ggcat)](https://mit-license.org/)\n[![GitHub release (latest by date)](https://img.shields.io/github/v/release/algbio/ggcat)](https://github.com/algbio/ggcat/releases/)\n[![GitHub 
Downloads](https://img.shields.io/github/downloads/algbio/ggcat/total.svg?style=social\u0026logo=github\u0026label=Download)](https://github.com/algbio/ggcat/releases)\n\n# GGCAT - compacted and colored de Bruijn graph construction and querying\n\nGGCAT is a tool for building compacted (and optionally colored) de Bruijn graphs from raw sequencing data, or for merging multiple existing cDBGs into a single graph. It also supports sequence queries against either a colored or non-colored graph (i.e. the number/percentage of present kmers).\n\n## Install\n\nGGCAT can be downloaded from https://github.com/algbio/ggcat/releases or installed via conda:\n\n```\nconda install -c conda-forge -c bioconda ggcat\n```\n\n## Tool usage\n\n### Build a new graph\n\nTo build a new graph from some input files with a specified k, run:\n\n```\nggcat build -k \u003ck_value\u003e -j \u003cthreads_count\u003e \u003cinput_files\u003e -o \u003coutput_file\u003e\n```\n\nOr if you have a file with a list of input files:\n\n```\nggcat build -k \u003ck_value\u003e -j \u003cthreads_count\u003e -l \u003cinput_files_list\u003e -o \u003coutput_file\u003e\n```\n\n#### Building a colored graph\n\nTo build a colored graph, add the `-c` flag to the above commands.\n\nBy default the color name is equal to the file name. This behavior can be overridden\nby specifying color names with associated input files in a separate file and passing it to ggcat with the `-d` flag. 
The color and file in each line should be separated by one `\u003cTAB\u003e` character.\n\nExample `color_mapping.in`:\n\n```\ncolor1\tfile1.fa\ncolor2\tfile2.fa\ncolor2\tfile3.fa\ncolor1\tdir/file4.fa\ncolor3\tdir2/file5.fa\n```\n\nThen the graph can be built with the command:\n\n```\nggcat build -k \u003ck_value\u003e -j \u003cthreads_count\u003e -c -d color_mapping.in -o \u003coutput_file\u003e\n```\n\n#### Building links\n\nTo build links between maximal unitigs in a BCALM2-like format, use the `-e` flag.\n\n#### Building minimum plain-text representations of kmer sets\n\nUnitigs are a plain-text representation of the set of kmers in the input reads / genomes, but not of minimum size. GGCAT integrates the [matchtigs \u0026 eulertigs](https://github.com/algbio/matchtigs) libraries. These libraries take a set of maximal unitigs as input and compute such minimum representations, allowing or forbidding repetitions of kmers, respectively. To build greedy matchtigs, use the `--greedy-matchtigs` flag; to build eulertigs, use the `--eulertigs` flag; to build a greedy version of eulertigs, use the `--simplitigs` flag.\n\nHere are all the available options for graph building:\n\n```\n\u003e ggcat build --help\nUsage: ggcat build [OPTIONS] --kmer-length \u003cKMER_LENGTH\u003e [INPUT]...\n\nArguments:\n  [INPUT]...  
The input files\n\nOptions:\n  -l, --input-lists \u003cINPUT_LISTS\u003e\n          The lists of input files\n  -o, --output-file \u003cOUTPUT_FILE\u003e\n          [default: output.fasta.lz4]\n  -c, --colors\n          Enable colors\n  -d, --colored-input-lists \u003cCOLORED_INPUT_LISTS\u003e\n          The lists of input files with colors in format \u003cCOLOR_NAME\u003e\u003cTAB\u003e\u003cFILE_PATH\u003e\n  -s, --min-multiplicity \u003cMIN_MULTIPLICITY\u003e\n          Minimum multiplicity required to keep a kmer [default: 2]\n  -k, --kmer-length \u003cKMER_LENGTH\u003e\n          The k-mers length\n  -t, --temp-dir \u003cTEMP_DIR\u003e\n          Directory for temporary files [default: .temp_files]\n  -j, --threads-count \u003cTHREADS_COUNT\u003e\n          [default: 16]\n  -f, --forward-only\n          Treats reverse complementary kmers as different\n  -m, --memory \u003cMEMORY\u003e\n          Maximum suggested memory usage (GB) The tool will try use only up to this GB of memory to store temporary files without writing to disk. This usage does not include the needed memory for the processing steps. 
GGCAT can allocate extra memory for files if the current memory is not enough to complete the current operation [default: 2]\n  -p, --prefer-memory\n          Use all the given memory before writing to disk\n  -h, --help\n          Print help\n\nOutput mode:\n  -e, --generate-maximal-unitigs-links  Generate maximal unitigs connections references, in BCALM2 format L:\u003c+/-\u003e:\u003cother id\u003e:\u003c+/-\u003e\n      --simplitigs                      Generate simplitigs instead of maximal unitigs\n      --eulertigs                       Generate eulertigs instead of maximal unitigs\n      --greedy-matchtigs                Generate greedy matchtigs instead of maximal unitigs\n      --gfa-v1                          Output the graph in GFA format v1\n      --gfa-v2                          Output the graph in GFA format v2\n\nAdvanced Options:\n      --minimizer-length \u003cMINIMIZER_LENGTH\u003e\n          Overrides the default m-mers (minimizers) length\n      --keep-temp-files\n          Keep intermediate temporary files for debugging purposes\n  -w, --hash-type \u003cHASH_TYPE\u003e\n          Hash type used to identify kmers [default: auto] [possible values: auto, seq-hash, rabin-karp128]\n  -b, --buckets-count-log \u003cBUCKETS_COUNT_LOG\u003e\n          The log2 of the number of buckets\n      --intermediate-compression-level \u003cINTERMEDIATE_COMPRESSION_LEVEL\u003e\n          The level of lz4 compression to be used for the intermediate files\n```\n\n### Querying a graph\n\nTo query an uncolored graph use the command:\n\n```\nggcat query -k \u003ck_value\u003e -j \u003cthreads_count\u003e \u003cinput-graph\u003e \u003cinput-query\u003e\n```\n\nThe provided k value must match the one used for graph construction.\nTo query a colored graph use the command:\n\n```\nggcat query --colors -k \u003ck_value\u003e -j \u003cthreads_count\u003e \u003cinput-graph\u003e \u003cinput-query\u003e\n```\n\nThe tool automatically searches for the colormap file 
associated with the\ninput graph, which must have the same name as the graph with the extension '.colors.dat'.\n\nBy default, colors in the output are represented by integers. To recover the mapping between the integers\nand the color filenames, use the command `ggcat dump-colors \u003ccolormap\u003e \u003coutput_file\u003e`.\n\nIf you instead want the color file names to be written directly in the query output (leading to a potentially much bigger output file),\npass the option `--colored-query-output-format json-lines-with-names`.\n\nHere are all the available options for graph querying:\n\n```\n\u003e ggcat query --help\nUsage: ggcat query [OPTIONS] --kmer-length \u003cKMER_LENGTH\u003e \u003cINPUT_GRAPH\u003e \u003cINPUT_QUERY\u003e\n\nArguments:\n  \u003cINPUT_GRAPH\u003e  The input graph\n  \u003cINPUT_QUERY\u003e  The input query as a .fasta file\n\nOptions:\n  -c, --colors\n          Enable colors\n  -o, --output-file-prefix \u003cOUTPUT_FILE_PREFIX\u003e\n          [default: output]\n      --colored-query-output-format \u003cCOLORED_QUERY_OUTPUT_FORMAT\u003e\n          [possible values: json-lines-with-numbers, json-lines-with-names]\n  -x, --step \u003cSTEP\u003e\n          [default: MinimizerBucketing] [possible values: minimizer-bucketing, kmers-counting, counters-sorting, color-map-reading]\n  -k, --kmer-length \u003cKMER_LENGTH\u003e\n          The k-mers length\n  -t, --temp-dir \u003cTEMP_DIR\u003e\n          Directory for temporary files [default: .temp_files]\n  -j, --threads-count \u003cTHREADS_COUNT\u003e\n          [default: 16]\n  -f, --forward-only\n          Treats reverse complementary kmers as different\n  -m, --memory \u003cMEMORY\u003e\n          Maximum suggested memory usage (GB) The tool will try use only up to this GB of memory to store temporary files without writing to disk. This usage does not include the needed memory for the processing steps. 
GGCAT can allocate extra memory for files if the current memory is not enough to complete the current operation [default: 2]\n  -p, --prefer-memory\n          Use all the given memory before writing to disk\n  -h, --help\n          Print help\n\nAdvanced Options:\n      --minimizer-length \u003cMINIMIZER_LENGTH\u003e\n          Overrides the default m-mers (minimizers) length\n      --keep-temp-files\n          Keep intermediate temporary files for debugging purposes\n  -w, --hash-type \u003cHASH_TYPE\u003e\n          Hash type used to identify kmers [default: auto] [possible values: auto, seq-hash, rabin-karp128]\n  -b, --buckets-count-log \u003cBUCKETS_COUNT_LOG\u003e\n          The log2 of the number of buckets\n      --intermediate-compression-level \u003cINTERMEDIATE_COMPRESSION_LEVEL\u003e\n          The level of lz4 compression to be used for the intermediate files\n```\n\n## Building from source\n\nTo build the tool the Rust stable (\u003e= 1.75) toolchain is required, and can be downloaded with the following commands:\n\n### Linux/Mac\n\n```\ncurl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\nrustup toolchain install stable\n\n```\n\n### Windows\n\nFollow the instructions at the site:\nhttps://rustup.rs/\n\n### Additional opt-in features\n\nAdditional features can be enabled by specifying them in the command line while building/installing GGCAT (ex. --features \"feature1,feature2\"):\n\n- **kmer-counters**: Adds kmer abundance for each unitig, in a BCALM2 compatible format. 
If enabled, GGCAT uses more memory while building colored graphs.\n\n### Building\n\nThe tool can then be installed with the commands:\n\n```\ngit clone https://github.com/algbio/ggcat\ncd ggcat/\ncargo install --path crates/cmdline/ --locked\n```\n\nThe binary is automatically copied to `$HOME/.cargo/bin`.\n\nTo launch the tool directly from the command line, the above directory should be added to the `$PATH` variable.\n\n## API usage\n\nGGCAT has an API for both Rust and C++.\n\n### Rust\n\nAdd a dependency on the crates/api/ crate to use it in your project.\nCheck crates/api/example for usage examples.\n\n### C++\n\nRun the makefile inside crates/capi/ggcat-cpp-api to build the library.\nCheck crates/capi/ggcat-cpp-api/example for usage examples.\n\n## Citing\n\nIf you use GGCAT in your research, please cite the following article:\n\n### [GGCAT](https://doi.org/10.1101/gr.277615.122)\n\u003e _Extremely-fast construction and querying of compacted and colored de Bruijn graphs with GGCAT_   \n\u003e Andrea Cracco, Alexandru I. Tomescu  \n\u003e **Genome Research** 33, 1198--1207 (2023), DOI: [10.1101/gr.277615.122](https://doi.org/10.1101/gr.277615.122)  \n\nIf you use a matchtigs/eulertigs output, please also cite the following articles:  \n\n#### [Matchtigs](https://doi.org/10.1101/2021.12.15.472871)\n\u003e _Matchtigs: minimum plain text representation of kmer sets_  \n\u003e Sebastian Schmidt, Shahbaz Khan, Jarno N. Alanko, Giulio E. Pibiri \u0026 Alexandru I. Tomescu  \n\u003e **Genome Biology**, Volume 24, article number 136 (2023), DOI: [10.1101/2021.12.15.472871](https://doi.org/10.1101/2021.12.15.472871)  \n\n#### [Eulertigs](https://doi.org/10.1186/s13015-023-00227-1)  \n\u003e _Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time_  \n\u003e Sebastian Schmidt and Jarno N. 
Alanko  \n\u003e **Algorithms for Molecular Biology**, Volume 18, article number 5 (2023), DOI: [10.1186/s13015-023-00227-1](https://doi.org/10.1186/s13015-023-00227-1).  \n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falgbio%2Fggcat","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falgbio%2Fggcat","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falgbio%2Fggcat/lists"}