https://github.com/vmikk/seqhasher
SeqHasher - A tool for hashing individual sequences in FASTA files
- Host: GitHub
- URL: https://github.com/vmikk/seqhasher
- Owner: vmikk
- License: gpl-3.0
- Created: 2024-03-14T21:25:06.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-31T12:19:53.000Z (5 months ago)
- Last Synced: 2025-01-30T00:29:55.265Z (4 months ago)
- Topics: dna-sequences, fasta, fastq, hashing
- Language: Go
- Homepage:
- Size: 221 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
# SeqHasher

## Overview
`seqhasher` is a high-performance command-line tool designed to calculate a hash (digest or fingerprint) for each sequence in a FASTA or FASTQ file and add it to the sequence header. It supports multiple hashing algorithms and offers various output options.

## Features
- Fast processing of FASTA/FASTQ files (thanks to [shenwei356/bio](https://github.com/shenwei356/bio) package)
- Support for multiple hash algorithms: SHA-1, SHA-3, MD5, xxHash, CityHash, MurmurHash3, ntHash, and BLAKE3
- Automatic support for compressed input files (`gzip`, `zstd`, `xz`, and `bzip2`)
- Supports reading from STDIN and writing to STDOUT
- Option to output only headers or full sequences
- Case-sensitive hashing option
- Customizable output format (e.g., include filename or a custom text string in the header)

## Quick start
Input data (e.g., `input.fasta`):
```
>seq1
AAAA
>seq2
ACTG
>seq3
aaaa
```

Basic usage (default SHA1 hash):
`seqhasher input.fasta -`
```
>input.fasta;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq1
AAAA
>input.fasta;65c89f59d38cdbf90dfaf0b0a6884829df8396b0;seq2
ACTG
>input.fasta;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq3
AAAA
```

Custom name instead of input filename (e.g., useful when processing stdin):
`seqhasher --name "test_file" input.fasta -`
```
>test_file;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq1
AAAA
>test_file;65c89f59d38cdbf90dfaf0b0a6884829df8396b0;seq2
ACTG
>test_file;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq3
AAAA
```

Output only headers:
`seqhasher --headersonly input.fasta -`
```
input.fasta;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq1
input.fasta;65c89f59d38cdbf90dfaf0b0a6884829df8396b0;seq2
input.fasta;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq3
```

Omit filename from output:
`seqhasher --headersonly --nofilename input.fasta -`
```
e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq1
65c89f59d38cdbf90dfaf0b0a6884829df8396b0;seq2
e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq3
```

Use a different hash function (xxHash) and case-sensitive mode:
`seqhasher --headersonly --nofilename --hash xxhash --casesensitive input.fasta -`
```
cf40b5b72bc43e77;seq1
704b34bf20faedf2;seq2
42a70d1abf84bf32;seq3
```

Multiple hashes (useful to ensure absence of collisions):
`seqhasher --headersonly --nofilename --hash sha1,xxhash --casesensitive input.fasta -`
```
e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;cf40b5b72bc43e77;seq1
65c89f59d38cdbf90dfaf0b0a6884829df8396b0;704b34bf20faedf2;seq2
70c881d4a26984ddce795f6f71817c9cf4480e79;42a70d1abf84bf32;seq3
```

## Usage
```plaintext
seqhasher [options] <input_file> [output_file]

Options:
  -o, --headersonly    Output only sequence headers, excluding the sequences themselves
  -H, --hash           Hash algorithm(s): sha1 (default), sha3, md5, xxhash, cityhash, murmur3, nthash, blake3
  -c, --casesensitive  Take into account sequence case. By default, sequences are converted to uppercase
  -n, --nofilename     Omit the file name from the sequence header
  -f, --name           Replace the input file's name in the header with the specified text
  -v, --version        Print the version of the program and exit
  -h, --help           Show this help message and exit

Arguments:
  <input_file>         Path to the input FASTA/FASTQ file (supports gzip, zstd, xz, or bzip2 compression)
                       or '-' for standard input (stdin)
  [output_file]        Path to the output file or '-' for standard output (stdout)
                       If omitted, output is sent to stdout.
```

### Description
The tool reads input from a specified file or from standard input (`stdin`),
and writes output to a specified file or to standard output (`stdout`).

The `--name` option lets you customize the output header by specifying
text that replaces the input file name.

The `--hash` option specifies which hash function to use
(multiple comma-separated values are allowed, e.g., `--hash sha1,nthash`).
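
The header layout seen in the Quick start (`name;hash[;hash...];id`) can be sketched in a few lines of Python. This is only an illustration, not the tool's Go implementation: `build_header` is a hypothetical helper, it assumes the digest is computed over the sequence characters alone, and it uses Python `hashlib` algorithm names (e.g., `sha3_512`) rather than `seqhasher` option values.

```python
import hashlib


def build_header(seq_id, sequence, name=None, algorithms=("sha1",), case_sensitive=False):
    """Compose a header such as 'name;hash1;hash2;seq_id' (hypothetical helper)."""
    # Assumption: by default the sequence is upper-cased before hashing,
    # mirroring the documented behaviour without --casesensitive.
    data = sequence if case_sensitive else sequence.upper()
    digests = [hashlib.new(algo, data.encode("ascii")).hexdigest() for algo in algorithms]
    # name=None plays the role of --nofilename; a string plays the role of --name.
    fields = ([name] if name is not None else []) + digests + [seq_id]
    return ";".join(fields)


# One digest with a custom name, then two digests without a name field.
print(build_header("seq3", "aaaa", name="test_file"))
print(build_header("seq3", "aaaa", algorithms=("sha1", "md5"), case_sensitive=True))
```

Keeping the original sequence identifier as the last field means the hashes can be stripped again by splitting on `;`, provided the identifier itself contains no semicolons.
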
Currently, the following hash functions are supported:
- `sha1`: [SHA-1](https://en.wikipedia.org/wiki/SHA-1) (default), 160-bit hash value
- `sha3`: [SHA-3](https://en.wikipedia.org/wiki/SHA-3), Keccak-based secure cryptographic hash standard, 512-bit hash value
- `md5`: [MD5](https://en.wikipedia.org/wiki/MD5), 128-bit hash value
- `xxhash`: [xxHash](https://xxhash.com/), extremely fast algorithm, 64-bit hash value
- `cityhash`: [CityHash](https://opensource.googleblog.com/2011/04/introducing-cityhash.html) (e.g., used in [VSEARCH](https://github.com/torognes/vsearch/)), 128-bit hash value
- `murmur3`: [Murmur3](https://en.wikipedia.org/wiki/MurmurHash) (e.g., used in [Sourmash](https://github.com/sourmash-bio/sourmash), but 64-bit), 128-bit hash value
- `nthash`: [ntHash](https://github.com/bcgsc/ntHash) (designed for DNA sequences), 64-bit hash value. This implementation uses the full length of the sequence as the k-mer size, effectively hashing the entire sequence at once using the non-canonical (forward) hash of the sequence
- `blake3`: [BLAKE3](https://github.com/BLAKE3-team/BLAKE3) (fast cryptographic hash function), 256-bit hash value

> [!NOTE]
> The probability of a collision (when two different DNA sequences end up with the same hash)
> is roughly 1 in 2<sup>*nbits*</sup> for any given pair of sequences, where *nbits* is the length of the hash in bits.
> This means that functions with shorter bit-lengths (e.g., `Murmur3` and `CityHash`)
> are more likely to have collisions as the dataset grows,
> while `SHA-3` has a much lower chance of collisions because of its larger bit length.
> However, shorter hashes are generally faster to compute
> and take up less space when saved to a file,
> making them more efficient for some tasks despite the higher collision risk.

### Examples
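
To put the collision note above into numbers, the sketch below estimates the chance of at least one collision in a whole dataset using the standard birthday-bound approximation. It assumes an idealized hash with uniformly distributed outputs; `collision_probability` is a hypothetical helper and is not part of `seqhasher`.

```python
import math


def collision_probability(n_sequences: int, nbits: int) -> float:
    """Approximate P(at least one collision) among n distinct sequences.

    Birthday bound: P ~= 1 - exp(-n * (n - 1) / 2 / 2**nbits),
    assuming hash values are uniformly distributed.
    """
    pairs = n_sequences * (n_sequences - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x
    return -math.expm1(-pairs / 2.0 ** nbits)


# Hash widths listed above, for a dataset of 500,000 distinct sequences
for nbits in (64, 128, 160, 256, 512):
    print(nbits, collision_probability(500_000, nbits))
```

The number of sequence pairs grows quadratically with dataset size, so the per-pair figure of 1 in 2<sup>*nbits*</sup> understates the risk for large collections.
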
To process a FASTA file and output to another file:
```bash
seqhasher input.fasta output.fasta
```

To process a FASTA file from standard input and output to standard output, while replacing the file name in the header with 'Sample':
```bash
cat input.fasta | seqhasher --name 'Sample' - - > output.fasta
# OR
seqhasher --name 'Sample' - - < input.fasta > output.fasta
```

## Benchmark
To evaluate the performance of the supported hash functions,
we used [`hyperfine`](https://github.com/sharkdp/hyperfine).

Benchmarks were performed on a system with the following specifications:
- CPU: Intel Core i7-10510U (Comet Lake)
- Storage: NVMe SSD

### Test data
First, create the test data:
a FASTA file containing 500,000 sequences, each 30 to 3,000 nucleotides long
(this should take a couple of minutes):

```bash
awk -v numSeq=500000 'BEGIN{
  srand();
  for(i=1; i<=numSeq; i++){
    seqLen=int(rand()*(2971))+30;
    printf(">seq_%d\n", i);
    for(j=1; j<=seqLen; j++){
      r=rand();
      if(r < 0.25) nucleotide="A";
      else if(r < 0.5) nucleotide="C";
      else if(r < 0.75) nucleotide="G";
      else nucleotide="T";
      printf("%s", nucleotide);
    }
    printf("\n");
  }
}' > big.fasta
```
The size of the file is ~760 MB.

### Hashing functions performance
```bash
hyperfine \
--runs 10 --warmup 3 \
--export-markdown hashing_benchmark.md \
'seqhasher --headersonly --casesensitive --hash md5 big.fasta - > /dev/null' \
'seqhasher --headersonly --casesensitive --hash sha1 big.fasta - > /dev/null' \
'seqhasher --headersonly --casesensitive --hash sha3 big.fasta - > /dev/null' \
'seqhasher --headersonly --casesensitive --hash xxhash big.fasta - > /dev/null' \
'seqhasher --headersonly --casesensitive --hash murmur3 big.fasta - > /dev/null' \
'seqhasher --headersonly --casesensitive --hash cityhash big.fasta - > /dev/null' \
'seqhasher --headersonly --casesensitive --hash nthash big.fasta - > /dev/null' \
'seqhasher --headersonly --casesensitive --hash blake3 big.fasta - > /dev/null' \
'seqhasher --headersonly --hash sha1,blake3 big.fasta - > /dev/null' \
'seqhasher --headersonly --hash xxhash,murmur3 big.fasta - > /dev/null'
```

| Command          |      Mean [s] | Min [s] | Max [s] |    Relative |
|:-----------------|--------------:|--------:|--------:|------------:|
| `md5` | 1.712 ± 0.069 | 1.651 | 1.847 | 1.75 ± 0.10 |
| `sha1` | 1.614 ± 0.021 | 1.586 | 1.645 | 1.65 ± 0.08 |
| `sha3` | 4.823 ± 0.135 | 4.707 | 5.090 | 4.93 ± 0.26 |
| `xxhash` | 0.977 ± 0.043 | 0.941 | 1.079 | 1.00 |
| `murmur3` | 1.106 ± 0.058 | 1.058 | 1.233 | 1.13 ± 0.08 |
| `cityhash` | 1.078 ± 0.019 | 1.048 | 1.111 | 1.10 ± 0.05 |
| `nthash` | 2.138 ± 0.022 | 2.112 | 2.170 | 2.19 ± 0.10 |
| `blake3` | 1.718 ± 0.066 | 1.645 | 1.864 | 1.76 ± 0.10 |
| `sha1,blake3` | 3.384 ± 0.096 | 3.290 | 3.640 | 3.46 ± 0.18 |
| `xxhash,murmur3` | 2.234 ± 0.073 | 2.193 | 2.422 | 2.29 ± 0.13 |

`Values are in seconds per 500,000 sequences (756,622,201 bp)`

As shown, xxHash provides the best performance, followed by CityHash and MurmurHash3.
These hash functions produce relatively short fingerprints (64 bits for xxHash, 128 bits for CityHash and MurmurHash3).
In contrast, SHA-3 is the slowest hash function in this benchmark and generates the longest hash (512 bits).

> [!NOTE]
> However, it's important to note that these values may depend on
> the instruction set of the CPU being used, as some processors may
> optimize specific algorithms differently (e.g., via `SIMD` or other hardware acceleration).
> For example, modern CPUs may use **SHA Extensions** to accelerate SHA-family algorithms.
> Additionally, the performance reported here is tied to the particular implementations
> of the hash algorithms used in `seqhasher`. Other implementations may yield different results,
> and these values should not be interpreted as a definitive ranking of the algorithms themselves.

## Installation
### Pre-built binaries
Download the latest release for your platform from the [Releases](https://github.com/vmikk/seqhasher/releases) page.
### Building from source
Ensure you have Go 1.23 or later [installed](https://go.dev/dl/).
Then, to install `seqhasher` v1.1.1, run:

```bash
git clone --depth 1 --branch 1.1.1 https://github.com/vmikk/seqhasher
cd seqhasher
go build -ldflags="-w -s" seqhasher.go
```

## Known issues and limitations
- Seqhasher does not take line wrapping in FASTA files into account (whitespace characters are stripped from the sequence before processing);
- The tool may not work correctly with sequences containing non-ASCII characters;
- IUPAC ambiguity codes (R,Y,S,W,K,M,B,D,H,V,N), characters denoting gaps ('-' or '.'), **and any other non-DNA characters** are handled "as is" (hash will depend on them);
- Empty sequences return an empty hash.
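
The points above can be illustrated with a small Python sketch. It is an approximation of the described behaviour under stated assumptions, not the tool's actual code: `normalize`, `sha1_digest`, and `non_dna_characters` are hypothetical helpers, the empty-sequence case is modelled as an empty string, and non-ASCII input simply raises an error here.

```python
import hashlib

DNA_ALPHABET = set("ACGT")


def normalize(raw: str, case_sensitive: bool = False) -> str:
    """Drop all whitespace (joins wrapped lines); upper-case unless case-sensitive."""
    seq = "".join(raw.split())
    return seq if case_sensitive else seq.upper()


def sha1_digest(raw: str, case_sensitive: bool = False) -> str:
    """SHA-1 of the normalized sequence; an empty sequence yields an empty string."""
    seq = normalize(raw, case_sensitive)
    if not seq:
        return ""
    # Non-ASCII characters raise UnicodeEncodeError in this sketch
    return hashlib.sha1(seq.encode("ascii")).hexdigest()


def non_dna_characters(raw: str) -> set:
    """Characters other than A, C, G, T; they are hashed as is and change the digest."""
    return set(normalize(raw)) - DNA_ALPHABET


# Line wrapping does not affect the digest
print(sha1_digest("ACTG\nACTG\n") == sha1_digest("ACTGACTG"))  # True
# Ambiguity codes and gap characters can be flagged before hashing
print(sorted(non_dna_characters("ACNT-G")))  # ['-', 'N']
```

A pre-check like `non_dna_characters` can help decide whether sequences should be cleaned or filtered first, since two otherwise identical sequences that differ only in an ambiguity code will receive different hashes.
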