{"id":22188427,"url":"https://github.com/vmikk/seqhasher","last_synced_at":"2025-07-25T02:10:07.955Z","repository":{"id":227733736,"uuid":"772268723","full_name":"vmikk/seqhasher","owner":"vmikk","description":"SeqHasher - A tool for hashing individual sequences in FASTA files","archived":false,"fork":false,"pushed_at":"2024-12-31T12:19:53.000Z","size":226,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-30T00:29:55.265Z","etag":null,"topics":["dna-sequences","fasta","fastq","hashing"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vmikk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-14T21:25:06.000Z","updated_at":"2024-12-31T12:19:56.000Z","dependencies_parsed_at":"2024-06-21T02:27:43.410Z","dependency_job_id":"7ea372bc-2fd3-4814-bae7-459c9ac8f7f3","html_url":"https://github.com/vmikk/seqhasher","commit_stats":null,"previous_names":["vmikk/rechimizer","vmikk/seqhasher"],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vmikk%2Fseqhasher","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vmikk%2Fseqhasher/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vmikk%2Fseqhasher/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vmikk%2Fseqhasher/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/owners/vmikk","download_url":"https://codeload.github.com/vmikk/seqhasher/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245344005,"owners_count":20599867,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dna-sequences","fasta","fastq","hashing"],"created_at":"2024-12-02T11:10:28.168Z","updated_at":"2025-03-24T20:14:46.327Z","avatar_url":"https://github.com/vmikk.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SeqHasher\n\n[![Go Test](https://github.com/vmikk/seqhasher/actions/workflows/go-test.yml/badge.svg)](https://github.com/vmikk/seqhasher/actions/workflows/go-test.yml)\n[![Integration Tests](https://github.com/vmikk/seqhasher/actions/workflows/bash.yml/badge.svg)](https://github.com/vmikk/seqhasher/actions/workflows/bash.yml)\n[![codecov](https://codecov.io/gh/vmikk/seqhasher/branch/main/graph/badge.svg)](https://codecov.io/gh/vmikk/seqhasher)\n[![DOI](https://zenodo.org/badge/772268723.svg)](https://doi.org/10.5281/zenodo.14311356)\n\n## Overview\n`seqhasher` is a high-performance command-line tool designed to calculate a hash (digest or fingerprint) for each sequence in a FASTA or FASTQ file and add it to the sequence header. 
It supports multiple hashing algorithms and offers various output options.\n\n## Features\n\n- Fast processing of FASTA/FASTQ files (thanks to [shenwei356/bio](https://github.com/shenwei356/bio) package)\n- Support for multiple hash algorithms: SHA-1, SHA-3, MD5, xxHash, CityHash, MurmurHash3, ntHash, and BLAKE3\n- Automatic support for compressed input files (`gzip`, `zstd`, `xz`, and `bzip2`)\n- Supports reading from STDIN and writing to STDOUT\n- Option to output only headers or full sequences\n- Case-sensitive hashing option\n- Customizable output format (e.g., include filename or a custom text string in the header)\n\n## Quick start\n\nInput data (e.g., `input.fasta`):\n```\n\u003eseq1\nAAAA\n\u003eseq2\nACTG\n\u003eseq3\naaaa\n``` \n\nBasic usage (default SHA1 hash):\n`seqhasher input.fasta -`\n```\n\u003einput.fasta;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq1\nAAAA\n\u003einput.fasta;65c89f59d38cdbf90dfaf0b0a6884829df8396b0;seq2\nACTG\n\u003einput.fasta;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq3\nAAAA\n```\n\nCustom name instead of input filename (e.g., useful when processing stdin):\n`seqhasher --name \"test_file\" input.fasta -`\n```\n\u003etest_file;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq1\nAAAA\n\u003etest_file;65c89f59d38cdbf90dfaf0b0a6884829df8396b0;seq2\nACTG\n\u003etest_file;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq3\nAAAA\n```\n\nOutput only headers:\n`seqhasher --headersonly input.fasta -`\n```\ninput.fasta;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq1\ninput.fasta;65c89f59d38cdbf90dfaf0b0a6884829df8396b0;seq2\ninput.fasta;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq3\n```\n\nOmit filename from output:\n`seqhasher --headersonly --nofilename input.fasta -`\n```\ne2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq1\n65c89f59d38cdbf90dfaf0b0a6884829df8396b0;seq2\ne2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq3\n```\n\nUse different hash functions (xxHash) and case-sensitive mode:\n`seqhasher --headersonly --nofilename --hash xxhash --casesensitive 
input.fasta -`\n```\ncf40b5b72bc43e77;seq1\n704b34bf20faedf2;seq2\n42a70d1abf84bf32;seq3\n```\n\nMultiple hashes (useful for reducing the risk of collisions):\n`seqhasher --headersonly --nofilename --hash sha1,xxhash --casesensitive input.fasta -`\n```\ne2512172abf8cc9f67fdd49eb6cacf2df71bbad3;cf40b5b72bc43e77;seq1\n65c89f59d38cdbf90dfaf0b0a6884829df8396b0;704b34bf20faedf2;seq2\n70c881d4a26984ddce795f6f71817c9cf4480e79;42a70d1abf84bf32;seq3\n```\n\n## Usage\n\n```plaintext\nseqhasher [options] \u003cinput_file\u003e [output_file]\n\nOptions:\n  -o, --headersonly   Output only sequence headers, excluding the sequences themselves\n  -H, --hash \u003ctype1,type2,...\u003e Hash algorithm(s): sha1 (default), sha3, md5, xxhash, cityhash, murmur3, nthash, blake3\n  -c, --casesensitive Take into account sequence case. By default, sequences are converted to uppercase\n  -n, --nofilename    Omit the file name from the sequence header\n  -f, --name \u003ctext\u003e   Replace the input file's name in the header with \u003ctext\u003e\n  -v, --version       Print the version of the program and exit\n  -h, --help          Show this help message and exit\n\nArguments:\n  \u003cinput_file\u003e     Path to the input FASTA/FASTQ file (supports gzip, zstd, xz, or bzip2 compression)\n                   or '-' for standard input (stdin)\n  [output_file]    Path to the output file or '-' for standard output (stdout)\n                   If omitted, output is sent to stdout.\n```\n\n### Description\n\nThe tool can read the input either from a specified file or from standard input (`stdin`), \nand similarly, it can write the output to a specified file or standard output (`stdout`).  \n\nThe `--name` option lets you customize the output header by specifying \ntext to replace the input file name.\n\nThe `--hash` option specifies which hash function(s) to use \n(multiple comma-separated values are allowed, e.g., `--hash sha1,nthash`). 
\nCurrently, the following hash functions are supported:  \n- `sha1`: [SHA-1](https://en.wikipedia.org/wiki/SHA-1) (default), 160-bit hash value\n- `sha3`: [SHA-3](https://en.wikipedia.org/wiki/SHA-3), Keccak-based secure cryptographic hash standard, 512-bit hash value\n- `md5`: [MD5](https://en.wikipedia.org/wiki/MD5), 128-bit hash value\n- `xxhash`: [xxHash](https://xxhash.com/), extremely fast algorithm, 64-bit hash value\n- `cityhash`: [CityHash](https://opensource.googleblog.com/2011/04/introducing-cityhash.html) (e.g., used in [VSEARCH](https://github.com/torognes/vsearch/)), 128-bit hash value\n- `murmur3`: [Murmur3](https://en.wikipedia.org/wiki/MurmurHash) (e.g., used in [Sourmash](https://github.com/sourmash-bio/sourmash), but 64-bit), 128-bit hash value\n- `nthash`: [ntHash](https://github.com/bcgsc/ntHash) (designed for DNA sequences), 64-bit hash value. This implementation uses the full length of the sequence as the k-mer size, effectively hashing the entire sequence at once using the non-canonical (forward) hash of the sequence\n- `blake3`: [BLAKE3](https://github.com/BLAKE3-team/BLAKE3) (fast cryptographic hash function), 256-bit hash value\n\n\u003e [!NOTE]\n\u003e The probability of a collision (when two given different DNA sequences end up with the same hash) \n\u003e is roughly 1 in 2\u003csup\u003e*nbits*\u003c/sup\u003e, where *nbits* is the length of the hash in bits. \n\u003e This means that functions with shorter bit lengths (e.g., `Murmur3` and `CityHash`) \n\u003e are more likely to have collisions as the dataset grows, \n\u003e while `SHA-3` has a much lower chance of collisions because of its larger bit length. 
\n\u003e However, shorter hashes are generally faster to compute \n\u003e and take up less space when saved to a file, \n\u003e making them more efficient for some tasks despite the higher collision risk.\n\n### Examples\n\nTo process a FASTA file and output to another file:\n```bash\nseqhasher input.fasta output.fasta\n```\n\nTo process a FASTA file from standard input and output to standard output, while replacing the file name in the header with 'Sample':\n```bash\ncat input.fasta | seqhasher --name 'Sample' - - \u003e output.fasta\n# OR\nseqhasher --name 'Sample' - - \u003c input.fasta \u003e output.fasta\n```\n\n## Benchmark\n\nTo evaluate the performance of the supported hash functions, \nwe used [`hyperfine`](https://github.com/sharkdp/hyperfine).\n\nBenchmarks were performed on a system with the following specifications:\n- CPU: Intel Core i7-10510U (Comet Lake)\n- Storage: NVMe SSD\n\n### Test data\n\nFirst, let's create the test data - \na FASTA file containing 500,000 sequences, each 30 to 3000 nucleotides long \n(this should take a couple of minutes):  \n\n```bash\nawk -v numSeq=500000 'BEGIN{\n    srand();\n    for(i=1; i\u003c=numSeq; i++){\n        seqLen=int(rand()*(2971))+30;\n        printf(\"\u003eseq_%d\\n\", i);\n        for(j=1; j\u003c=seqLen; j++){\n            r=rand();\n            if(r \u003c 0.25) nucleotide=\"A\";\n            else if(r \u003c 0.5) nucleotide=\"C\";\n            else if(r \u003c 0.75) nucleotide=\"G\";\n            else nucleotide=\"T\";\n            printf(\"%s\", nucleotide);\n        }\n        printf(\"\\n\");\n    }\n}' \u003e big.fasta\n```\nThe size of the file is ~760MB.\n\n\n### Hashing functions performance\n\n```bash\nhyperfine \\\n  --runs 10 --warmup 3 \\\n  --export-markdown hashing_benchmark.md \\\n  'seqhasher --headersonly --casesensitive --hash md5      big.fasta - \u003e /dev/null' \\\n  'seqhasher --headersonly --casesensitive --hash sha1     big.fasta - \u003e /dev/null' \\\n  
'seqhasher --headersonly --casesensitive --hash sha3     big.fasta - \u003e /dev/null' \\\n  'seqhasher --headersonly --casesensitive --hash xxhash   big.fasta - \u003e /dev/null' \\\n  'seqhasher --headersonly --casesensitive --hash murmur3  big.fasta - \u003e /dev/null' \\\n  'seqhasher --headersonly --casesensitive --hash cityhash big.fasta - \u003e /dev/null' \\\n  'seqhasher --headersonly --casesensitive --hash nthash   big.fasta - \u003e /dev/null' \\\n  'seqhasher --headersonly --casesensitive --hash blake3   big.fasta - \u003e /dev/null' \\\n  'seqhasher --headersonly --hash sha1,blake3    big.fasta - \u003e /dev/null' \\\n  'seqhasher --headersonly --hash xxhash,murmur3 big.fasta - \u003e /dev/null'\n```\n\n| Command          |      Mean [s] | Min [s] | Max [s] |    Relative |\n|:-----------------|--------------:|--------:|--------:|------------:|\n| `md5`            | 1.712 ± 0.069 |   1.651 |   1.847 | 1.75 ± 0.10 |\n| `sha1`           | 1.614 ± 0.021 |   1.586 |   1.645 | 1.65 ± 0.08 |\n| `sha3`           | 4.823 ± 0.135 |   4.707 |   5.090 | 4.93 ± 0.26 |\n| `xxhash`         | 0.977 ± 0.043 |   0.941 |   1.079 |        1.00 |\n| `murmur3`        | 1.106 ± 0.058 |   1.058 |   1.233 | 1.13 ± 0.08 |\n| `cityhash`       | 1.078 ± 0.019 |   1.048 |   1.111 | 1.10 ± 0.05 |\n| `nthash`         | 2.138 ± 0.022 |   2.112 |   2.170 | 2.19 ± 0.10 |\n| `blake3`         | 1.718 ± 0.066 |   1.645 |   1.864 | 1.76 ± 0.10 |\n| `sha1,blake3`    | 3.384 ± 0.096 |   3.290 |   3.640 | 3.46 ± 0.18 |\n| `xxhash,murmur3` | 2.234 ± 0.073 |   2.193 |   2.422 | 2.29 ± 0.13 |\n\n`Values are in seconds per 500,000 sequences (756,622,201 bp)`\n\nAs shown, xxHash provides the best performance, followed by CityHash and MurmurHash3. \nThese hash functions produce relatively short hash fingerprints (64 and 128 bits, respectively). \nIn contrast, SHA-3 is the slowest hash function in this benchmark, generating the longest hash (512 bits).  
\n\n\u003e [!NOTE]\n\u003e These values may depend on \n\u003e the instruction set of the CPU being used, as some processors may \n\u003e optimize specific algorithms differently (e.g., via `SIMD` or other hardware acceleration). \n\u003e For example, modern CPUs may use **SHA Extensions** to accelerate SHA-family algorithms. \n\u003e Additionally, the performance reported here is tied to the particular implementations \n\u003e of the hash algorithms used in `seqhasher`. Other implementations may yield different results, \n\u003e and these values should not be interpreted as a definitive ranking of the algorithms themselves.\n\n\n## Installation\n\n### Pre-built binaries\n\nDownload the latest release for your platform from the [Releases](https://github.com/vmikk/seqhasher/releases) page.\n\n### Building from source\n\nEnsure you have Go 1.23 or later [installed](https://go.dev/dl/).  \nThen, to install `seqhasher` v.1.1.1, run:\n\n``` bash\ngit clone --depth 1 --branch 1.1.1 https://github.com/vmikk/seqhasher\ncd seqhasher\ngo build -ldflags=\"-w -s\" seqhasher.go\n```\n\n## Known issues and limitations\n\n- Seqhasher does not take line wrapping in FASTA files into account (whitespace characters are stripped from the sequence before processing);\n- The tool may not work correctly with sequences containing non-ASCII characters;\n- IUPAC ambiguity codes (R,Y,S,W,K,M,B,D,H,V,N), characters denoting gaps ('-' or '.'), **and any other non-DNA characters** are handled \"as is\" (hash will depend on them);\n- Empty sequences return an empty hash.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvmikk%2Fseqhasher","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvmikk%2Fseqhasher","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvmikk%2Fseqhasher/lists"}