https://github.com/vmikk/seqhasher

SeqHasher - A tool for hashing individual sequences in FASTA files
https://github.com/vmikk/seqhasher
dna-sequences fasta fastq hashing
Last synced: 2 months ago
JSON representation
SeqHasher - A tool for hashing individual sequences in FASTA files
Host: GitHub
URL: https://github.com/vmikk/seqhasher
Owner: vmikk
License: gpl-3.0
Created: 2024-03-14T21:25:06.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2024-12-31T12:19:53.000Z (5 months ago)
Last Synced: 2025-01-30T00:29:55.265Z (4 months ago)
Topics: dna-sequences, fasta, fastq, hashing
Language: Go
Homepage:
Size: 221 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
Awesome Lists containing this project

README

        # SeqHasher

[![Go Test](https://github.com/vmikk/seqhasher/actions/workflows/go-test.yml/badge.svg)](https://github.com/vmikk/seqhasher/actions/workflows/go-test.yml)

[![Integration Tests](https://github.com/vmikk/seqhasher/actions/workflows/bash.yml/badge.svg)](https://github.com/vmikk/seqhasher/actions/workflows/bash.yml)

[![codecov](https://codecov.io/gh/vmikk/seqhasher/branch/main/graph/badge.svg)](https://codecov.io/gh/vmikk/seqhasher)

[![DOI](https://zenodo.org/badge/772268723.svg)](https://doi.org/10.5281/zenodo.14311356)

## Overview

`seqhasher` is a high-performance command-line tool designed to calculate a hash (digest or fingerprint) for each sequence in a FASTA or FASTQ file and add it to the sequence header. It supports multiple hashing algorithms and offers various output options.

## Features

- Fast processing of FASTA/FASTQ files (thanks to [shenwei356/bio](https://github.com/shenwei356/bio) package)

- Support for multiple hash algorithms: SHA-1, SHA-3, MD5, xxHash, CityHash, MurmurHash3, ntHash, and BLAKE3

- Automatic support for compressed input files (`gzip`, `zstd`, `xz`, and `bzip2`)

- Supports reading from STDIN and writing to STDOUT

- Option to output only headers or full sequences

- Case-sensitive hashing option

- Customizable output format (e.g., include filename or a custom text string in the header)

## Quick start

Input data (e.g., `input.fasta`):

```

>seq1

AAAA

>seq2

ACTG

>seq3

aaaa

``` 

Basic usage (default SHA1 hash):

`seqhasher input.fasta -`

```

>input.fasta;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq1

AAAA

>input.fasta;65c89f59d38cdbf90dfaf0b0a6884829df8396b0;seq2

ACTG

>input.fasta;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq3

AAAA

```

Custom name instead of input filename (e.g., useful when processing stdin):

`seqhasher --name "test_file" input.fasta -`

```

>test_file;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq1

AAAA

>test_file;65c89f59d38cdbf90dfaf0b0a6884829df8396b0;seq2

ACTG

>test_file;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq3

AAAA

```

Output only headers:

`seqhasher --headersonly input.fasta -`

```

input.fasta;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq1

input.fasta;65c89f59d38cdbf90dfaf0b0a6884829df8396b0;seq2

input.fasta;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq3

```

Omit filename from output:

`seqhasher --headersonly --nofilename input.fasta -`

```

e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq1

65c89f59d38cdbf90dfaf0b0a6884829df8396b0;seq2

e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq3

```

Use different hash functions (xxHash) and case-sensitive mode:

`seqhasher --headersonly --nofilename --hash xxhash --casesensitive input.fasta -`

```

cf40b5b72bc43e77;seq1

704b34bf20faedf2;seq2

42a70d1abf84bf32;seq3

```

Multiple hashes (useful to ensure absence of collisions):

`seqhasher --headersonly --nofilename --hash sha1,xxhash --casesensitive input.fasta -`

```

e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;cf40b5b72bc43e77;seq1

65c89f59d38cdbf90dfaf0b0a6884829df8396b0;704b34bf20faedf2;seq2

70c881d4a26984ddce795f6f71817c9cf4480e79;42a70d1abf84bf32;seq3

```

## Usage

```plaintext

seqhasher [options]  [output_file]

Options:

  -o, --headersonly   Output only sequence headers, excluding the sequences themselves

  -H, --hash  Hash algorithm(s): sha1 (default), sha3, md5, xxhash, cityhash, murmur3, nthash, blake3

  -c, --casesensitive Take into account sequence case. By default, sequences are converted to uppercase

  -n, --nofilename    Omit the file name from the sequence header

  -f, --name    Replace the input file's name in the header with 

  -v, --version       Print the version of the program and exit

  -h, --help          Show this help message and exit

Arguments:

       Path to the input FASTA/FASTQ file (supports gzip, zstd, xz, or bzip2 compression)

                   or '-' for standard input (stdin)

  [output_file]    Path to the output file or '-' for standard output (stdout)

                   If omitted, output is sent to stdout.

```

### Description

The tool can either read the input from a specified file or from standard input (`stdin`), 

and similarly, it can write the output to a specified file or standard output (`stdout`).  

The `--name` option allows to customize the header of the output by specifying 

a text to replace the input file name.

The `--hash` option allows to specify which hash function to use 

(multiple coma-separated values allowed, e.g., `--hash sha1,nthash`). 

Currently, the following hash functions are supported:  

- `sha1`: [SHA-1](https://en.wikipedia.org/wiki/SHA-1) (default), 160-bit hash value

- `sha3`: [SHA-3](https://en.wikipedia.org/wiki/SHA-3), Keccak-based secure cryptographic hash standard, 512-bit hash value

- `md5`: [MD5](https://en.wikipedia.org/wiki/MD5), 128-bit hash value

- `xxhash`: [xxHash](https://xxhash.com/), extremely fast algorithm, 64-bit hash value

- `cityhash`: [CityHash](https://opensource.googleblog.com/2011/04/introducing-cityhash.html) (e.g., used in [VSEARCH](https://github.com/torognes/vsearch/)), 128-bit hash value

- `murmur3`: [Murmur3](https://en.wikipedia.org/wiki/MurmurHash) (e.g., used in [Sourmash](https://github.com/sourmash-bio/sourmash), but 64-bit), 128-bit hash value

- `nthash`: [ntHash](https://github.com/bcgsc/ntHash) (designed for DNA sequences), 64-bit hash value. This implementation uses the full length of the sequence as the k-mer size, effectively hashing the entire sequence at once using the non-canonical (forward) hash of the sequence

- `blake3`: [BLAKE3](https://github.com/BLAKE3-team/BLAKE3) (fast cryptographic hash function), 256-bit hash value

> [!NOTE]

> The probability of a collision (when different DNA sequences end up with the same hash) 

> is roughly 1 in 2^*nbits*, where *nbits* is the length of the hash in bits. 

> This means that functions with shorter bit-lengths (e.g., `Murmur3` and `CityHash`) 

> are more likely to have collisions as the dataset grows, 

> while `SHA-3` has a much lower chance of collisions because of its larger bit length. 

> However, shorter hashes are generally faster to compute 

> and take up less space when saved to a file, 

> making them more efficient for some tasks despite the higher collision risk.

### Examples

To process a FASTA file and output to another file:

```bash

seqhasher input.fasta output.fasta

```

To process a FASTA file from standard input and output to standard output, while replacing the file name in the header with 'Sample':

```bash

cat input.fasta | seqhasher --name 'Sample' - - > output.fasta

# OR

seqhasher --name 'Sample' - - < input.fasta > output.fasta

```

## Benchmark

To evaluate the performance of two solutions for processing DNA sequences, 

we utilized [`hyperfine`](https://github.com/sharkdp/hyperfine).

Benchmarks were performed on a system with the following specifications:

- CPU: Intel Core i7-10510U (Comet Lake)

- Storage: NVMe SSD

### Test data

First, let's create the test data - 

a FASTA file containing 500,000 sequences, each 30 to 3000 nucleotides long 

(this should take a couple of minutes):  

```bash

awk -v numSeq=500000 'BEGIN{

    srand();

    for(i=1; i<=numSeq; i++){

        seqLen=int(rand()*(2971))+30;

        printf(">seq_%d\n", i);

        for(j=1; j<=seqLen; j++){

            r=rand();

            if(r < 0.25) nucleotide="A";

            else if(r < 0.5) nucleotide="C";

            else if(r < 0.75) nucleotide="G";

            else nucleotide="T";

            printf("%s", nucleotide);

        }

        printf("\n");

    }

}' > big.fasta

```

The size of the file is ~760MB.

### Hashing functions performance

```bash

hyperfine \

  --runs 10 --warmup 3 \

  --export-markdown hashing_benchmark.md \

  'seqhasher --headersonly --casesensitive --hash md5      big.fasta - > /dev/null' \

  'seqhasher --headersonly --casesensitive --hash sha1     big.fasta - > /dev/null' \

  'seqhasher --headersonly --casesensitive --hash sha3     big.fasta - > /dev/null' \

  'seqhasher --headersonly --casesensitive --hash xxhash   big.fasta - > /dev/null' \

  'seqhasher --headersonly --casesensitive --hash murmur3  big.fasta - > /dev/null' \

  'seqhasher --headersonly --casesensitive --hash cityhash big.fasta - > /dev/null' \

  'seqhasher --headersonly --casesensitive --hash nthash   big.fasta - > /dev/null' \

  'seqhasher --headersonly --casesensitive --hash blake3   big.fasta - > /dev/null' \

  'seqhasher --headersonly --hash sha1,blake3    big.fasta - > /dev/null' \

  'seqhasher --headersonly --hash xxhash,murmur3 big.fasta - > /dev/null'

```

| Command          |      Mean [s] | Min [s] | Max [s] |    Relative |

|:-----------------|--------------:|--------:|--------:|------------:|

| `md5`            | 1.712 ± 0.069 |   1.651 |   1.847 | 1.75 ± 0.10 |

| `sha1`           | 1.614 ± 0.021 |   1.586 |   1.645 | 1.65 ± 0.08 |

| `sha3`           | 4.823 ± 0.135 |   4.707 |   5.090 | 4.93 ± 0.26 |

| `xxhash`         | 0.977 ± 0.043 |   0.941 |   1.079 |        1.00 |

| `murmur3`        | 1.106 ± 0.058 |   1.058 |   1.233 | 1.13 ± 0.08 |

| `cityhash`       | 1.078 ± 0.019 |   1.048 |   1.111 | 1.10 ± 0.05 |

| `nthash`         | 2.138 ± 0.022 |   2.112 |   2.170 | 2.19 ± 0.10 |

| `blake3`         | 1.718 ± 0.066 |   1.645 |   1.864 | 1.76 ± 0.10 |

| `sha1,blake3`    | 3.384 ± 0.096 |   3.290 |   3.640 | 3.46 ± 0.18 |

| `xxhash,murmur3` | 2.234 ± 0.073 |   2.193 |   2.422 | 2.29 ± 0.13 |

`Values are in seconds per 500,000 sequences (756,622,201 bp)`

As shown, xxHash provides the best performance, followed by CityHash and MurmurHash3. 

These hash functions produce relatively short hash fingerprints (64 and 128 bits, respectively). 

In contrast, SHA-3 is the slowest hash function in this benchmark, generating the longest hash (512 bits).  

> [!NOTE]

> However, it's important to note that these values may depend on 

> the instruction set of the CPU being used, as some processors may 

> optimize specific algorithms differently (e.g., via `SIMD` or other hardware acceleration). 

> For example, modern CPUs may use **SHA Extensions** to accelerate SHA-family algorithms. 

> Additionally, the performance reported here is tied to the particular implementations 

> of the hash algorithms used in `seqhasher`. Other implementations may yield different results, 

> and these values should not be interpreted as a definitive ranking of the algorithms themselves.

## Installation

### Pre-built binaries

Download the latest release for your platform from the [Releases](https://github.com/vmikk/seqhasher/releases) page.

### Building from source

Ensure you have Go 1.23 or later [installed](https://go.dev/dl/).  

Then, to install `seqhasher` v.1.1.1 run:

``` bash

git clone --depth 1 --branch 1.1.1 https://github.com/vmikk/seqhasher

cd seqhasher

go build -ldflags="-w -s" seqhasher.go

```

## Known issues and limitations

- Seqhasher does not take line wrapping in FASTA file into account (whitespace characters are stripped from the sequence before processing);

- The tool may not work correctly with sequences containing non-ASCII characters;

- IUPAC ambiguity codes (R,Y,S,W,K,M,B,D,H,V,N), characters denoting gaps ('-' or '.'), **and any other non-DNA characters** are handled "as is" (hash will depend on them);

- Empty sequences return an empty hash;
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/vmikk/seqhasher

Awesome Lists containing this project

README