{"id":34653477,"url":"https://github.com/jermp/sshash","last_synced_at":"2026-04-21T22:03:34.193Z","repository":{"id":38414631,"uuid":"447541855","full_name":"jermp/sshash","owner":"jermp","description":"📖 🧬 SSHash is a compressed, associative, exact, and weighted dictionary for k-mers.","archived":false,"fork":false,"pushed_at":"2026-04-20T18:40:06.000Z","size":28888,"stargazers_count":98,"open_issues_count":0,"forks_count":18,"subscribers_count":4,"default_branch":"master","last_synced_at":"2026-04-20T20:34:25.782Z","etag":null,"topics":["bioinformatics","dictionary","hashing","k-mer","minimal-perfect-hash"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jermp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2022-01-13T09:38:46.000Z","updated_at":"2026-04-20T18:40:10.000Z","dependencies_parsed_at":"2023-12-30T18:30:40.244Z","dependency_job_id":"6237198c-fce8-4cd0-b727-f1705e35afa3","html_url":"https://github.com/jermp/sshash","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/jermp/sshash","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jermp%2Fsshash","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jermp%2Fsshash/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jermp%2Fsshash/releases","manifests_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/jermp%2Fsshash/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jermp","download_url":"https://codeload.github.com/jermp/sshash/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jermp%2Fsshash/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32112030,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-21T11:25:29.218Z","status":"ssl_error","status_checked_at":"2026-04-21T11:25:28.499Z","response_time":128,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","dictionary","hashing","k-mer","minimal-perfect-hash"],"created_at":"2025-12-24T17:59:36.776Z","updated_at":"2026-04-21T22:03:34.181Z","avatar_url":"https://github.com/jermp.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Build](https://github.com/jermp/sshash/actions/workflows/build.yml/badge.svg)](https://github.com/jermp/sshash/actions/workflows/build.yml)\n[![CodeQL](https://github.com/jermp/sshash/actions/workflows/codeql.yml/badge.svg)](https://github.com/jermp/sshash/actions/workflows/codeql.yml)\n[![install with 
bioconda](https://img.shields.io/conda/dn/bioconda/sshash.svg?style=flag\u0026logo=anaconda\u0026logoColor=lightgray\u0026labelColor=rgb(40,47,56)\u0026color=rgb(68,190,80)\u0026label=Install%20with%20bioconda)](http://bioconda.github.io/recipes/sshash/README.html)\n\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7772316.svg)](https://doi.org/10.5281/zenodo.7772316)\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7239205.svg)](https://doi.org/10.5281/zenodo.7239205)\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17582116.svg)](https://doi.org/10.5281/zenodo.17582116)\n\n\u003cpicture\u003e\n  \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"img/sshash_on_dark.png\"\u003e\n  \u003cimg src=\"img/sshash.png\" width=\"350\" alt=\"Logo\"\u003e\n\u003c/picture\u003e\n\n**SSHash** is a compressed dictionary data structure for k-mers\n(strings of length k over the DNA alphabet {A,C,G,T}), based on **S**parse and **S**kew **Hash**ing.\n\n**NEWS:** A Rust port of SSHash is available [here](https://github.com/COMBINE-lab/sshash-rs)!\n\nThe data structure is described in the following papers (most recent first):\n\n* [Optimizing sparse and skew hashing: faster k-mer dictionaries](https://www.biorxiv.org/content/10.64898/2026.01.21.700884v1) [1]\n* [On weighted k-mer dictionaries](https://almob.biomedcentral.com/articles/10.1186/s13015-023-00226-2) [2,3]\n* [Sparse and skew hashing of k-mers](https://doi.org/10.1093/bioinformatics/btac245) [4]\n\n**Please cite these papers if you use SSHash.**\n\nFor a dictionary of n k-mers,\ntwo basic queries are supported:\n\n- i = **Lookup**(x), where i is in [0,n) if the k-mer x is found in the dictionary or i = -1 otherwise;\n- x = **Access**(i), where x is the k-mer associated with the identifier i.\n\nIf the weights of the k-mers (their frequency counts) are also stored in the dictionary, then the dictionary is said to be *weighted* and it also supports:\n\n- w = **Weight**(i), where i is a given k-mer 
identifier and w is the weight of the k-mer.\n\nOther supported queries are:\n\n- **Membership Queries**: determine if a given k-mer is present in the dictionary or not.\n- **Streaming Queries**: stream through all k-mers of a given DNA file\n(.fasta or .fastq formats) to determine their membership in the dictionary.\n- **Navigational Queries**: given a k-mer x[1..k] determine if x[2..k]+c is present (forward neighbourhood) and if c+x[1..k-1] is present (backward neighbourhood), for c in {A,C,G,T} ('+' here means string concatenation).\nSSHash internally stores a set of strings, each associated with a distinct identifier.\nIf a string identifier is specified for a navigational query (rather than a k-mer), then the backward neighbourhood of the first k-mer and the forward neighbourhood of the last k-mer in the string are returned.\n\nIf you are interested in a **membership-only** version of SSHash, have a look at [SSHash-Lite](https://github.com/jermp/sshash-lite). It also works for input files with duplicate k-mers (e.g., [matchtigs](https://github.com/algbio/matchtigs) [5]). 
For a query sequence S and a given coverage threshold E in [0,1], the sequence is considered to be present in the dictionary if at least E*(|S|-k+1) of the k-mers of S are positive.\n\n**NOTE**: A k-mer and its *reverse complement* are considered to be the same k-mer.\n\n#### Table of contents\n* [Compiling the Code](#compiling-the-code)\n* [Dependencies](#dependencies)\n* [Tools and Usage](#tools-and-usage)\n* [Examples](#examples)\n* [Input Files](#input-files)\n* [Create a New Release](#create-a-new-release)\n* [Benchmarks](#benchmarks)\n* [References](#references)\n\nCompiling the Code\n------------------\n\nThe code is tested on Linux with `gcc` and on Mac with `clang`.\nTo build the code, [`CMake`](https://cmake.org/) is required.\n\nClone the repository with\n\n    git clone --recursive https://github.com/jermp/sshash.git\n\nIf you have cloned the repository **without** `--recursive`, be sure to pull the dependencies with the following command before\ncompiling:\n\n    git submodule update --init --recursive\n\nTo compile the code for a release environment (see file `CMakeLists.txt` for the used compilation flags), it is sufficient to do the following:\n\n    mkdir build\n    cd build\n    cmake ..\n    make -j\n\n**NOTE**: For best performance on `x86` architectures, the option `-D SSHASH_USE_ARCH_NATIVE` can be specified as well.\n\nFor a testing environment, use the following instead:\n\n    mkdir debug_build\n    cd debug_build\n    cmake .. 
-D CMAKE_BUILD_TYPE=Debug -D SSHASH_USE_SANITIZERS=On\n    make -j\n\n### Encoding of Nucleotides\n\nBy default, SSHash uses the following 2-bit encoding of nucleotides.\n\n\t A     65     01000.00.1 -\u003e 00\n\t C     67     01000.01.1 -\u003e 01\n\t G     71     01000.11.1 -\u003e 11\n\t T     84     01010.10.0 -\u003e 10\n\n\t a     97     01100.00.1 -\u003e 00\n\t c     99     01100.01.1 -\u003e 01\n\t g    103     01100.11.1 -\u003e 11\n\t t    116     01110.10.0 -\u003e 10\n\nIf you want to use the \"traditional\" encoding\n\n\t A     65     01000001 -\u003e 00\n\t C     67     01000011 -\u003e 01\n\t G     71     01000111 -\u003e 10\n\t T     84     01010100 -\u003e 11\n\n\t a     97     01100001 -\u003e 00\n\t c     99     01100011 -\u003e 01\n\t g    103     01100111 -\u003e 10\n\t t    116     01110100 -\u003e 11\n\nfor compatibility with other software, then\ncompile SSHash with the flag `-DSSHASH_USE_TRADITIONAL_NUCLEOTIDE_ENCODING=On`.\n\n### K-mer Length\n\nBy default, SSHash uses a maximum k-mer length of 31.\nIf you want to support k-mer lengths up to (and including) 63,\ncompile the library with the flag `-DSSHASH_USE_MAX_KMER_LENGTH_63=On`.\n\nDependencies\n------------\n\nThe repository has minimal dependencies: it only uses the [PTHash](https://github.com/jermp/pthash) library (for minimal perfect hashing), and `zlib` to read gzip-compressed streams.\n\nTo automatically pull the PTHash dependency, just clone the repo with\n`--recursive` as explained in [Compiling the Code](#compiling-the-code).\n\nIf you do not have `zlib` installed, you can do\n\n    sudo apt-get install zlib1g-dev\n\nif you are on Linux/Ubuntu, or\n\n    brew install zlib\n\nif you have a Mac.\n\nTools and Usage\n---------------\n\nAfter compilation, there is a single executable called `sshash`, which can be used to run a tool.\nRun `./sshash` to see a list of available tools.\n\nFor large-scale indexing, it could be necessary to increase the number of file 
descriptors that can be opened simultaneously:\n\n\tulimit -n 2048\n\nExamples\n--------\n\nFor the examples, we are going to use some collections\nof *stitched unitigs* from the directory `data/unitigs_stitched`.\n\n**Important note:** The value of k used during the formation of the unitigs\nis indicated in the name of each file and the dictionaries\n**must** be built with that value as well to ensure correctness.\n\nFor example, `data/unitigs_stitched/ecoli4_k31_ust.fa.gz` indicates the value k = 31, whereas `data/unitigs_stitched/se.ust.k63.fa.gz` indicates the value k = 63.\n\nFor all the examples below, we are going to use k = 31.\n\n(The directory `data/unitigs_stitched/with_weights` contains some files with k-mers' weights too.)\n\nIn the section [Input Files](#input-files), we explain how\nsuch collections of stitched unitigs can be obtained from raw FASTA files.\n\n### Example 1\n\n    ./sshash build -i ../data/unitigs_stitched/salmonella_enterica_k31_ust.fa.gz -k 31 -m 13 --check -o salmonella_enterica.sshash\n\nThis example builds a dictionary for the k-mers read from the file `../data/unitigs_stitched/salmonella_enterica_k31_ust.fa.gz`,\nwith k = 31 and m = 13. It also checks the correctness of the dictionary (`--check` option) and serializes the index on disk to the file `salmonella_enterica.sshash`.\n\nTo run a performance benchmark after construction of the index,\nuse:\n\n    ./sshash bench -i salmonella_enterica.sshash\n\nTo also store the weights, use the option `--weighted`:\n\n    ./sshash build -i ../data/unitigs_stitched/with_weights/salmonella_enterica.ust.k31.fa.gz -k 31 -m 13 --weighted --check --verbose\n\n### Example 2\n\n    ./sshash build -i ../data/unitigs_stitched/salmonella_100_k31_ust.fa.gz -k 31 -m 15 -o salmonella_100.sshash\n\nThis example builds a dictionary from the input file `../data/unitigs_stitched/salmonella_100_k31_ust.fa.gz` (a pangenome consisting of 100 genomes of *Salmonella enterica*), with k = 31 and m = 15. 
It also serializes the index on disk to the file `salmonella_100.sshash`.\n\nTo perform some streaming membership queries, use:\n\n    ./sshash query -i salmonella_100.sshash -q ../data/queries/SRR5833294.10K.fastq.gz\n\nif your queries are meant to be read from a FASTQ file, or\n\n    ./sshash query -i salmonella_100.sshash -q ../data/queries/salmonella_enterica.fasta.gz --multiline\n\nif your queries are to be read from a (multi-line) FASTA file.\n\n### Example 3\n\n    ./sshash build -i ../data/unitigs_stitched/salmonella_100_k31_ust.fa.gz -k 31 -m 13 --canonical -o salmonella_100.canon.sshash\n\nThis example builds a dictionary from the input file `../data/unitigs_stitched/salmonella_100_k31_ust.fa.gz` (the same file used in Example 2), with k = 31, m = 13, and with the canonical parsing modality (option `--canonical`). The dictionary is serialized on disk to the file `salmonella_100.canon.sshash`.\n\nThe \"canonical\" version of the dictionary offers more speed for only a small space increase, especially under low-hit workloads -- when the majority of k-mers are not found in the dictionary. 
(For all details, refer to the paper.)\n\nBelow is a comparison between the dictionary built in Example 2 (not canonical)\nand the one just built (Example 3, canonical).\n\n    ./sshash query -i salmonella_100.sshash -q ../data/queries/SRR5833294.10K.fastq.gz\n\n    ./sshash query -i salmonella_100.canon.sshash -q ../data/queries/SRR5833294.10K.fastq.gz\n\nBoth queries should produce the following report (reported here for reference):\n\n    ==== query report:\n    num_kmers = 460000\n    num_positive_kmers = 46 (0.01%)\n    num_searches = 42/46 (91.3043%)\n    num_extensions = 4/46 (8.69565%)\n\nThe canonical dictionary can be twice as fast as the regular dictionary\nfor low-hit workloads, even on this tiny example, for only +0.3 bits/k-mer.\n\n### Example 4\n\n    ./sshash permute -i ../data/unitigs_stitched/with_weights/ecoli_sakai.ust.k31.fa.gz -k 31 -o ecoli_sakai.permuted.fa\n\nThis command re-orders (and possibly reverse-complements) the strings in the collection so as to *minimize* the number of runs in the weights and, hence, optimize the encoding of the weights.\nThe result is saved to the file `ecoli_sakai.permuted.fa`.\n\nIn this example, for the E. coli collection (Sakai strain), we reduce the number of runs in the weights from 5820 to 3723.\n\nThen use the `build` command as usual to build the permuted collection:\n\n    ./sshash build -i ecoli_sakai.permuted.fa -k 31 -m 13 --weighted --verbose\n\nThe index built on the permuted collection\noptimizes the storage space for the weights, which results in a 15.1X better space than the empirical entropy of the weights.\n\nFor reference, the index built on the original collection:\n\n    ./sshash build -i ../data/unitigs_stitched/with_weights/ecoli_sakai.ust.k31.fa.gz -k 31 -m 13 --weighted --verbose\n\nalready achieves a 12.4X better space than the empirical entropy.\n\nInput Files\n-----------\n\nSSHash is meant to index k-mers from collections that **do not contain duplicates\nnor invalid k-mers** (strings 
containing symbols different from {A,C,G,T}).\nThese collections can be obtained, for example, by extracting the maximal unitigs of a de Bruijn graph, or eulertigs, using the [GGCAT](https://github.com/algbio/ggcat) algorithm.\n\n**NOTE**: Input files are expected to have **one DNA sequence per line**. If a sequence spans multiple lines (e.g., multi-fasta), the lines should be concatenated before indexing.\n\n#### Datasets\n\nThe script `scripts/download_and_preprocess_datasets.sh` of [this release](https://github.com/jermp/sshash/releases/tag/v3.0.0)\ncontains all the needed steps to download and pre-process\nthe datasets that we used in [1].\n\nFor the experiments in [2] and [3], we used the datasets available at [https://doi.org/10.5281/zenodo.7772316](https://doi.org/10.5281/zenodo.7772316).\n\nFor the latest benchmarks maintained in [this other repository](https://github.com/jermp/kmer_sets_benchmark)\nwe used the datasets described at [https://zenodo.org/records/17582116](https://zenodo.org/records/17582116).\n\n#### Weights\n\nUsing the option `-all-abundance-counts` of [BCALM2](https://github.com/GATB/bcalm), it is possible to also include the abundance counts of the k-mers in the BCALM2 output. Then, use the option `-a 1` of [UST](https://github.com/jermp/UST) to include such counts in the stitched unitigs.\n\nCreate a New Release\n--------------------\n\nIt is recommended to create a new release with the script `script/create_release.sh`, which\n**also includes the source code for the dependencies** in `external`\n(this is not done by GitHub).\n\nTo create a new release, run the following command *from the parent directory*:\n\n    bash script/create_release.sh --format zip [RELEASE-NAME]\n\nfor example:\n\n    bash script/create_release.sh --format zip v4.0.0\n\nThen upload the created archive at https://github.com/jermp/sshash/releases. 
It should appear under the \"Assets\" section of the corresponding release.\n\n**Note 1**: The sha256 hash code printed at the end is needed for distribution via Bioconda.\n\n**Note 2**: Avoid dashes in the name of the release because Bioconda does not like them.\n\nBenchmarks\n----------\n\nThe directory [`benchmarks`](/benchmarks) includes some performance benchmarks.\n\nReferences\n----------\n\n* [1] Giulio Ermanno Pibiri and Rob Patro. [Optimizing sparse and skew hashing: faster k-mer dictionaries](https://www.biorxiv.org/content/10.64898/2026.01.21.700884v1). BioRxiv. 2026.\n* [2] Giulio Ermanno Pibiri. [On weighted k-mer dictionaries](https://almob.biomedcentral.com/articles/10.1186/s13015-023-00226-2). Algorithms for Molecular Biology (ALGOMB). 2023.\n* [3] Giulio Ermanno Pibiri. [On weighted k-mer dictionaries](https://drops.dagstuhl.de/opus/volltexte/2022/17043/). International Workshop on Algorithms in Bioinformatics (WABI). 2022.\n* [4] Giulio Ermanno Pibiri. [Sparse and skew hashing of k-mers](https://doi.org/10.1093/bioinformatics/btac245). Bioinformatics. 2022.\n* [5] Schmidt, S., Khan, S., Alanko, J., Pibiri, G. E., and Tomescu, A. I. [Matchtigs: minimum plain text representation of k-mer sets](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02968-z). Genome Biology 24, 136. 2023.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjermp%2Fsshash","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjermp%2Fsshash","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjermp%2Fsshash/lists"}