{"id":43927105,"url":"https://github.com/algbio/themisto","last_synced_at":"2026-02-06T23:09:18.631Z","repository":{"id":42066408,"uuid":"248765968","full_name":"algbio/themisto","owner":"algbio","description":"Space-efficient pseudoalignment with a colored de Bruijn graph","archived":false,"fork":false,"pushed_at":"2025-03-13T12:57:25.000Z","size":16956,"stargazers_count":52,"open_issues_count":14,"forks_count":4,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-12-08T17:29:26.686Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/algbio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-03-20T13:45:37.000Z","updated_at":"2025-09-09T09:07:55.000Z","dependencies_parsed_at":"2023-02-16T20:00:26.530Z","dependency_job_id":"f9c01900-69af-4bc0-90f8-a3be380053cc","html_url":"https://github.com/algbio/themisto","commit_stats":null,"previous_names":[],"tags_count":18,"template":false,"template_full_name":null,"purl":"pkg:github/algbio/themisto","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/algbio%2Fthemisto","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/algbio%2Fthemisto/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/algbio%2Fthemisto/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/algbio%2Fthemisto/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/algbio","download_url":"https:/
/codeload.github.com/algbio/themisto/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/algbio%2Fthemisto/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29179641,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-06T22:12:24.066Z","status":"ssl_error","status_checked_at":"2026-02-06T22:12:09.859Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-02-06T23:09:18.069Z","updated_at":"2026-02-06T23:09:18.618Z","avatar_url":"https://github.com/algbio.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# NEWS 1. May 2024\n\nThemisto version 3.2.2 is out, fixing a bug where the output was sometimes not fully flushed to disk when using gzipped sorted output. See the [release notes](https://github.com/algbio/themisto/releases/tag/v3.2.2) for details and pre-compiled Linux binaries.\n\n\n# About Themisto\nThemisto is a succinct colored k-mer index supporting pseudo-alignment against a database of reference sequences, similar to the tools Kallisto, Bifrost and Metagraph. For more information, see the [preprint](https://www.biorxiv.org/content/10.1101/2023.02.24.529942v3). 
This software is currently developed by the [Compressed Data Structures group](https://www.helsinki.fi/en/researchgroups/algorithmic-bioinformatics/teams/compressed-data-structures) at the University of Helsinki.\n\n## Installation\nPrecompiled binaries are available for\n- Linux x86_64\n- macOS arm64\n- macOS x86_64\n\nVisit the [Releases page](https://github.com/algbio/themisto/releases) to download a binary.\n\n## Compiling\n### Requirements\n\nWe currently support only Linux and macOS. For compilation, you will need a C++20 compliant compiler with OpenMP support, CMake v3.1 or newer, and [Rust](https://www.rust-lang.org/tools/install) 1.77. If compiling with g++, make sure that the version is at least g++-10, or you might run into compilation errors with the standard library \u0026lt;filesystem\u0026gt; header.\n\n### Linux\n\nThese instructions have been tested to work on Ubuntu 18.04 with the aforementioned dependencies installed. To build the software, enter the Themisto directory and run\n\n```\ngit submodule update --init --recursive\ncd build\ncmake .. -DMAX_KMER_LENGTH=31\nmake\n```\n\nHere 31 is the maximum k-mer length (node length) to support, up to 255. The larger the k-mer length, the more time and memory the index construction takes. Values that are one less than a multiple of 32 work the best. This will create the binary at `build/bin/themisto`. If there is a linking error at the very end, try running `make` again.\n\n**Troubleshooting**: If you run into problems involving the \u0026lt;filesystem\u0026gt; header, you probably need to update your compiler. The compiler `g++-10` should be sufficient. Install a new compiler and direct CMake to use it with the `-DCMAKE_CXX_COMPILER` option. For example, to set the compiler to `g++-10`, run CMake with the option `-DCMAKE_CXX_COMPILER=g++-10`.\n\n### macOS\n\nThe macOS build only works with the gcc and g++ compilers. 
Install those compilers on the system and make sure that the CXX environment variable is set to the g++ compiler. Also make sure that the g++ executable is actually the GCC g++ compiler and not just a link to Clang.\n\n```\ngit submodule update --init --recursive\ncd build\ncmake -DCMAKE_C_COMPILER=$(which gcc) -DCMAKE_CXX_COMPILER=$(which g++) -DMAX_KMER_LENGTH=31 ..\nmake\n```\n\nHere 31 is the maximum k-mer length to support, up to 255. The larger the k-mer length, the more time and memory the index construction takes. If you run into problems involving zlib, add `-DCMAKE_BUILD_ZLIB=1` to the cmake command.\n\nNote that macOS has a very low limit on the number of concurrently open files. Themisto can use temporary files to conserve RAM, and may run into this limit. To increase the limit, run the command\n\n```\nulimit -n 2048\n```\n\n# Usage\n\n## Quick start\n\nTo build the Themisto index for a set of genomes, you need to pass in a text file that contains the paths to the FASTA files of the genomes, one file per line. Each FASTA file is given a different color 0,1,2,3... in the same order as they appear in the list. There are three example genomes of E. coli in `example_input` and a file at `example_input/coli_file_list.txt` listing the file names. To build the index for this data, run the following command:\n\n```\n./build/bin/themisto build -k 31 -i example_input/coli_file_list.txt --index-prefix my_index --temp-dir temp --mem-gigas 2 --n-threads 4 --file-colors\n```\n\nThis builds an index with k = 31, writing the index files to `my_index.tdbg` and `my_index.tcolors`, using the directory `temp` as temporary storage, four threads, and up to 2 GB of memory. 
We recommend using a fast SSD for the temporary directory.\n\nTo align the four sequences in `example_input/queries.fna` against the index we just built, writing output to `out.txt`, run:\n\n```\n./build/bin/themisto pseudoalign --query-file example_input/queries.fna --index-prefix my_index --temp-dir temp --out-file out.txt --n-threads 4 --threshold 0.7\n```\n\nThis reports all colors such that at least a fraction 0.7 of the k-mers of the query are in the reference genome of the color, ignoring k-mers that are not found in any reference.\n\nThis should produce the following output file:\n\n```\n0 0 2\n1 0 1 2\n2 2\n3 2\n```\n\nThere is one line for each query sequence. The lines may appear in a different order if parallelism was used. The first integer on a line is the 0-based rank of a query sequence in the query file, and the rest of the integers are the colors that are pseudoaligned with the query. For example, here the query with rank 1 (i.e. the second sequence in the query file) pseudoaligns to colors 0, 1 and 2.\n\n## Full instructions for index construction\n\n```\nBuild the Themisto index:\nUsage:\n  build [OPTION...]\n\n Basic options:\n  -k, --node-length arg   The k of the k-mers. (default: 0)\n  -i, --input-file arg    The input sequences in FASTA or FASTQ format. The\n\t\t\t  format is inferred from the file extension. If\n\t\t\t  the extension is .txt, the file is interpreted as\n\t\t\t  a list of filenames, one per line\n  -o, --index-prefix arg  The de Bruijn graph will be written to\n\t\t\t  [prefix].tdbg and the color structure to\n\t\t\t  [prefix].tcolors.\n      --temp-dir arg      Directory for temporary files. This directory\n\t\t\t  should have fast I/O operations and should have\n\t\t\t  as much space as possible.\n  -v, --verbose           More verbose progress reporting into stderr.\n\n Coloring (give only one) options:\n  -f, --file-colors        Default if the input has multiple sequence\n\t\t\t   files. 
Creates a distinct color 0,1,2,... for\n\t\t\t   each file in the input file list, in the order\n\t\t\t   the files appear in the list\n  -e, --sequence-colors    Default if the input has just a single sequence\n\t\t\t   file. Creates a distinct color 0,1,2,... for\n\t\t\t   each sequence in the input.\n  -c, --manual-colors arg  A file containing one integer color per\n\t\t\t   sequence, one color per line. Colors may be\n\t\t\t   repeated. If there are multiple sequence files,\n\t\t\t   then this file should be a text file containing\n\t\t\t   the corresponding color filename for each\n\t\t\t   sequence file, one filename per line.\n      --no-colors          Build only the de Bruijn graph without colors.\n\t\t\t   Can be loaded later with --load-dbg (see\n\t\t\t   --help-advanced)\n\n Computational resources options:\n      --mem-gigas arg  Number of gigabytes allowed for external memory\n\t\t       algorithms (must be at least 2). (default: 2)\n  -t, --n-threads arg  Number of parallel exectuion threads. Default: 1\n\t\t       (default: 1)\n\n Advanced options:\n      --forward-strand-only     Do not add reverse complements of sequences\n\t\t\t\tto the index\n      --load-dbg                If given, loads a precomputed de Bruijn\n\t\t\t\tgraph from the index prefix. If this is\n\t\t\t\tgiven, the value of parameter -k is ignored\n\t\t\t\tbecause the order k is defined by the\n\t\t\t\tprecomputed de Bruijn graph.\n      --randomize-non-ACGT      Replace non-ACGT letters with random\n\t\t\t\tnucleotides. If this option is not given,\n\t\t\t\tk-mers containing a non-ACGT character are\n\t\t\t\tdeleted instead.\n  -d, --colorset-pointer-tradeoff arg\n\t\t\t\tThis option controls a time-space tradeoff\n\t\t\t\tfor storing and querying color sets. If\n\t\t\t\tgiven a value d, we store color set\n\t\t\t\tpointers only for every d nodes on every\n\t\t\t\tunitig. The higher the value of d, the\n\t\t\t\tsmaller then index, but the slower the\n\t\t\t\tqueries. 
The savings might be significant\n\t\t\t\tif the number of distinct color sets is\n\t\t\t\tsmall and the graph is large and has long\n\t\t\t\tunitigs. (default: 1)\n  -s, --coloring-structure-type arg\n\t\t\t\tType of coloring structure to build\n\t\t\t\t(\"sdsl-hybrid\", \"roaring\"). (default:\n\t\t\t\tsdsl-hybrid)\n      --from-index arg          Take as input a pre-built Themisto index.\n\t\t\t\tBuilds a new index in the format specified\n\t\t\t\tby --coloring-structure-type. This is\n\t\t\t\tcurrently implemented by decompressing the\n\t\t\t\tdistinct color sets in memory before\n\t\t\t\tre-encoding them, so this might take a lot\n\t\t\t\tof RAM.\n      --silent                  Print as little as possible to stderr (only\n\t\t\t\terrors).\n Help options:\n  -h, --help           Print usage instructions for commonly used options.\n      --help-advanced  Print advanced options usage.\n\nUsage example:\n./build/bin/themisto build -k 31 -i example_input/coli_file_list.txt --index-prefix my_index --temp-dir temp --mem-gigas 2 --n-threads 4 --file-colors\n```\n\n## Full instructions for `pseudoalign`\n\nThis program aligns query sequences against an index that has been built previously. The output is one line per input read. Each line consists of a space-separated list of integers. The first integer specifies the rank of the read in the input file, and the rest of the integers are the identifiers of the colors of the sequences that the read pseudoaligns with. If the program is run with more than one thread, the output lines are not necessarily in the same order as the reads in the input file. This can be fixed with the option --sort-output, but this will slow down the program.\n\nThe query can be given as one file, or as a file with a list of files. 
In the former case, we must specify one output file with the option --out-file, and in the latter case, we must give a file that lists one output filename per line using the option --out-file-list.\n\nThe query file(s) should be in fasta or fastq format. The format is inferred from the file extension. Recognized file extensions for fasta are: .fasta, .fna, .ffn, .faa and .frn. Recognized extensions for fastq are: .fastq and .fq. Gzipped sequence files with the extension .gz are also supported.\n\n```\n\nUsage:\n  pseudoalign [OPTION...]\n\n Basic options:\n  -q, --query-file arg       Input file of the query sequences (default:\n\t\t\t     \"\")\n      --query-file-list arg  A list of query filenames, one line per\n\t\t\t     filename (default: \"\")\n  -o, --out-file arg         Output filename. Print results if no output\n\t\t\t     filename is given. (default: \"\")\n      --out-file-list arg    A file containing a list of output filenames,\n\t\t\t     one per line. (default: \"\")\n  -i, --index-prefix arg     The index prefix that was given to the build\n\t\t\t     command.\n      --temp-dir arg         Directory for temporary files.\n      --gzip-output          Compress the output files with gzip.\n      --sort-output          Sort the lines of the out files by sequence\n\t\t\t     rank in the input files.\n  -v, --verbose              More verbose progress reporting into stderr.\n\n Algorithm options:\n      --threshold arg          Fraction of k-mer matches required to report\n\t\t\t       a color. If this is equal to 1, the\n\t\t\t       algorithm is implemented with a specialized\n\t\t\t       set intersection method. 
(default: 0.7)\n      --include-unknown-kmers  Include all k-mers in the pseudoalignment,\n\t\t\t       even those which do not occur in the index.\n      --report-relevant-kmer-count\n\t\t\t\tAppends to each output line a semicolon\n\t\t\t\tfollowed by a space and then the number of\n\t\t\t\tk-mers of the query that had at least 1\n\t\t\t\tcolor.\n      --relevant-kmers-fraction arg\n\t\t\t\tAccept a pseudoalignment only if at least\n\t\t\t\tthis fraction of k-mers of the read had at\n\t\t\t\tleast 1 color. (default: 0.0)\n\n Computational resources options:\n  -t, --n-threads arg  Number of parallel execution threads. Default: 1\n\t\t       (default: 1)\n\n Advanced options:\n      --rc                     Include reverse complement matches in the\n\t\t\t       pseudoalignment. This option only makes\n\t\t\t       sense if the index was built with\n\t\t\t       --forward-strand-only. Otherwise this option\n\t\t\t       has no effect except to slow down the query.\n      --buffer-size-megas arg  Size of the input buffer in megabytes in\n\t\t\t       each thread. If this is larger than the\n\t\t\t       number of nucleotides in the input divided\n\t\t\t       by the number of threads, then some threads\n\t\t\t       will be idle. So if your input files are\n\t\t\t       really small and you have a lot of threads,\n\t\t\t       consider using a small buffer. 
(default:\n\t\t\t       8.0)\n      --silent                 Print as little as possible to stderr (only\n\t\t\t       errors).\n\n Help options:\n  -h, --help           Print usage instructions for commonly used options.\n      --help-advanced  Print advanced usage instructions.\n\nUsage example:\npseudoalign pseudoalign --query-file example_input/queries.fna --index-prefix my_index --temp-dir temp --out-file out.txt --n-threads 4 --threshold 0.7\n\n```\n\nExamples:\n\nPseudoalign example_input/queries.fna against an index and print results:\n```\n./build/bin/themisto pseudoalign --query-file example_input/queries.fna --index-prefix my_index --temp-dir temp\n```\n\nPseudoalign example_input/queries.fna against an index and write results to out.txt:\n```\n./build/bin/themisto pseudoalign --query-file example_input/queries.fna --index-prefix my_index --temp-dir temp --out-file out.txt\n```\n\nPseudoalign a list of fasta files in input_list.txt into output filenames in output_list.txt:\n```\n./build/bin/themisto pseudoalign --query-file-list input_list.txt --index-prefix my_index --temp-dir temp --out-file-list output_list.txt\n```\n\n## Extracting unitigs with `extract-unitigs`\n\nThis command dumps the unitigs and optionally their colors out of an existing Themisto index.\n\n```\nUsage:\n  extract-unitigs [OPTION...]\n\n  -i, --index-prefix arg  The index prefix that was given to the build\n\t\t\t  command.\n      --fasta-out arg     Output filename for the unitigs in FASTA format\n\t\t\t  (optional). (default: \"\")\n      --gfa-out arg       Output the unitig graph in GFA1 format\n\t\t\t  (optional). (default: \"\")\n      --colors-out arg    Output filename for the unitig colors (optional).\n\t\t\t  If this option is not given, the colors are not\n\t\t\t  computed. 
Note that giving this option affects\n\t\t\t  the unitigs written to unitigs-out: if a unitig\n\t\t\t  has nodes with different color sets, the unitig\n\t\t\t  is split into maximal segments of nodes that have\n\t\t\t  equal color sets. The file format of the color\n\t\t\t  file is as follows: there is one line for each\n\t\t\t  unitig. The lines contain space-separated\n\t\t\t  strings. The first string on a line is the FASTA\n\t\t\t  header of a unitig (without the '\u003e'), and the\n\t\t\t  following strings on the line are the integer\n\t\t\t  color labels of the colors of that unitig. The\n\t\t\t  unitigs appear in the same order as in the FASTA\n\t\t\t  file. (default: \"\")\n      --min-colors arg    Extract maximal unitigs with at least (\u003e=)\n\t\t\t  min-colors in each node. Can't be used with\n\t\t\t  --colors-out. (optional) (default: 0)\n  -v, --verbose           More verbose progress reporting into stderr.\n  -h, --help              Print usage\n```\n\n## Extracting index statistics with `stats`\n\nThis command prints various statistics about the k-mers and colors in an existing index.\n\n```\nUsage:\n  stats [OPTION...]\n\n  -i, --index-prefix arg  The index prefix that was given to the build\n\t\t\t  command.\n      --unitigs           Also compute statistics on unitigs. This takes a\n\t\t\t  while and requires the temporary directory to be\n\t\t\t  set.\n      --temp-dir arg      Directory for temporary files.\n  -h, --help              Print usage\n```\n\n## Dumping the color matrix with `dump-color-matrix`\n\nThis command prints a file where each line corresponds to a k-mer in the index. The line starts with the k-mer, followed by space, followed by the color set of that k-mer. If `--sparse` is given, the color set is printed as a space-separated list of integers. 
Otherwise, the color set is printed as a string of zeroes and ones such that the i-th character is '1' iff color i is present in the color set.\n\nExample:\n\n```\n./build/bin/themisto dump-color-matrix -i my_index -o dump.txt --sparse\n```\n\nFull instructions:\n\n```\nUsage:\n  dump-color-matrix [OPTION...]\n\n  -i, --index-prefix arg  The index prefix that was given to the build\n\t\t\t  command.\n  -o, --output-file arg   The output file for the dump.\n  -v, --verbose           More verbose progress reporting into stderr.\n      --silent            Print as little as possible to stderr (only\n\t\t\t  errors).\n      --sparse            Print only the indices of non-zero entries.\n  -h, --help              Print usage\n```\n\n\n# For developers: building the tests\n\n```\ngit submodule init\ngit submodule update\ncd googletest\nmkdir build\ncd build\ncmake ..\nmake\ncd ../../build\ncmake .. -DCMAKE_BUILD_TYPE=Debug -DBUILD_THEMISTO_TESTS=1\nmake\n```\n\nThis builds the tests to `build/bin/themisto_tests`. The test executable must be run from the root of the repository; otherwise it won't find the test input files in `example_input`.\n\nTo build release binaries for Linux, use a machine with as old a libc as possible for maximum compatibility. It's also important to disable architecture-specific optimizations in Roaring, so use the following cmake command:\n\n```\ncmake .. -DCMAKE_BUILD_ZLIB=1 -DCMAKE_BUILD_BZIP2=1 -DROARING_DISABLE_NATIVE=ON -DCMAKE_BUILD_TYPE=Release\n```\n\nSee the wiki on GitHub for instructions on how to set up the build environment.\n\n# License\n\nThis software is licensed under GPLv2. See LICENSE.txt.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falgbio%2Fthemisto","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falgbio%2Fthemisto","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falgbio%2Fthemisto/lists"}