https://github.com/suchapalaver/kmerust

Bioinformatics 101 tool for counting unique k-length substrings in DNA
Host: GitHub
URL: https://github.com/suchapalaver/kmerust
Owner: suchapalaver
License: mit
Created: 2021-09-06T18:32:57.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2026-01-27T20:57:24.000Z (4 months ago)
Last Synced: 2026-01-28T08:54:14.649Z (4 months ago)
Topics: beginner-friendly, bioinformatics, bioinformatics-tool, bitpacking, bytes, clap, dashmap, genomics, insta, k-mer, k-mer-counting, k-mers, kmer, kmer-counting, kmers, needletail, parallelization, rayon, rust, rust-bio
Language: Rust
Homepage:
Size: 23.3 MB
Stars: 33
Watchers: 0
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# kmerust

[![Crates.io](https://img.shields.io/crates/v/kmerust.svg)](https://crates.io/crates/kmerust)

[![Documentation](https://docs.rs/kmerust/badge.svg)](https://docs.rs/kmerust)

[![CI](https://github.com/suchapalaver/kmerust/workflows/CI/badge.svg)](https://github.com/suchapalaver/kmerust/actions)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A fast, parallel [k-mer](https://en.wikipedia.org/wiki/K-mer) counter for DNA sequences in FASTA and FASTQ files.

## Features

- **Fast parallel processing** using [rayon](https://docs.rs/rayon) and [dashmap](https://docs.rs/dashmap)

- **FASTA and FASTQ support** with automatic format detection from file extension

- **Canonical k-mers** - outputs the lexicographically smaller of each k-mer and its reverse complement

- **Flexible k-mer lengths** from 1 to 32

- **Handles N bases** by skipping invalid k-mers

- **Jellyfish-compatible output** format for easy integration with existing pipelines

- **Tested for accuracy** against [Jellyfish](https://github.com/gmarcais/Jellyfish)
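
The 1 to 32 length limit fits 2 bits per base in a 64-bit integer. The sketch below illustrates packing, canonicalization, and N-skipping under that assumption. The function names and the A=0, C=1, G=2, T=3 encoding are illustrative choices, not necessarily what kmerust does internally.

```rust
/// Map a base to a 2-bit code (assumed encoding: A=0, C=1, G=2, T=3).
/// Any other byte, such as N, is treated as invalid.
fn encode_base(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack a k-mer of length 1..=32 into a u64; None if it contains an invalid base.
fn pack(kmer: &[u8]) -> Option<u64> {
    if kmer.is_empty() || kmer.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for &b in kmer {
        packed = (packed << 2) | encode_base(b)?;
    }
    Some(packed)
}

/// Reverse complement of a packed k-mer of length k.
fn reverse_complement(packed: u64, k: usize) -> u64 {
    let mut fwd = packed;
    let mut rc = 0u64;
    for _ in 0..k {
        // With this encoding the complement of code c is 3 - c.
        rc = (rc << 2) | (3 - (fwd & 0b11));
        fwd >>= 2;
    }
    rc
}

/// Canonical form: the smaller of the k-mer and its reverse complement.
/// Because A < C < G < T maps to 0 < 1 < 2 < 3, numeric order on the
/// packed values matches lexicographic order on the strings.
fn canonical(packed: u64, k: usize) -> u64 {
    packed.min(reverse_complement(packed, k))
}

/// Decode a packed k-mer back into text.
fn unpack(mut packed: u64, k: usize) -> String {
    let mut bases = vec![b'A'; k];
    for slot in bases.iter_mut().rev() {
        *slot = b"ACGT"[(packed & 0b11) as usize];
        packed >>= 2;
    }
    String::from_utf8(bases).expect("bases are ASCII")
}

fn main() {
    let seq = b"GGTNACCA";
    let k = 3;
    for window in seq.windows(k) {
        let text = String::from_utf8_lossy(window);
        match pack(window) {
            Some(p) => println!("{} -> {}", text, unpack(canonical(p, k), k)),
            None => println!("{} skipped (invalid base)", text),
        }
    }
}
```

Windows that overlap the `N` are dropped, and every remaining window is reported under its canonical form.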

## Installation

### From crates.io

```bash
cargo install kmerust
```

### From source

```bash
git clone https://github.com/suchapalaver/kmerust.git
cd kmerust
cargo install --path .
```

## Usage

```bash
kmerust <k> <path>
```

### Arguments

- `<k>` - K-mer length (1-32)

- `<path>` - Path to a FASTA or FASTQ file (use `-` or omit for stdin)

### Options

- `-f, --format <FORMAT>` - Output format: `fasta` (default), `tsv`, `json`, or `histogram`

- `-i, --input-format <FORMAT>` - Input format: `auto` (default), `fasta`, or `fastq`

- `-m, --min-count <N>` - Minimum count threshold (default: 1)

- `-Q, --min-quality <N>` - Minimum Phred quality score for FASTQ (0-93); k-mers containing bases below this are skipped

- `--save <PATH>` - Save k-mer counts to a binary index file for fast querying

- `-q, --quiet` - Suppress informational output

- `-h, --help` - Print help information

- `-V, --version` - Print version information

### Examples

Count 21-mers in a FASTA file:

```bash
kmerust 21 sequences.fa > kmers.txt
```

Count 21-mers in a FASTQ file (format auto-detected):

```bash
kmerust 21 reads.fq > kmers.txt
```

Count 5-mers:

```bash
kmerust 5 sequences.fa > kmers.txt
```

### Unix Pipeline Integration

kmerust supports reading from stdin, enabling seamless integration with Unix pipelines:

```bash
# Pipe from another command
cat genome.fa | kmerust 21

# Decompress and count
zcat large.fa.gz | kmerust 21 --format tsv > counts.tsv

# Sample reads and count
seqtk sample reads.fa 0.1 | kmerust 17

# Explicit stdin marker
cat genome.fa | kmerust 21 -

# FASTQ from stdin (specify format explicitly)
cat reads.fq | kmerust 21 --input-format fastq
zcat reads.fq.gz | kmerust 21 -i fastq --format tsv > counts.tsv
```

### Output Formats

Use `--format` to choose the output format:

```bash
# TSV format (tab-separated)
kmerust 21 sequences.fa --format tsv

# JSON format
kmerust 21 sequences.fa --format json

# FASTA-like format (default)
kmerust 21 sequences.fa --format fasta

# Histogram format (k-mer frequency spectrum)
kmerust 21 sequences.fa --format histogram
```

### Histogram Output

The histogram format outputs the k-mer frequency spectrum (count of counts), useful for genome size estimation and error detection:

```bash
kmerust 21 genome.fa --format histogram > spectrum.tsv
```

Output is tab-separated with columns `count` and `frequency`:

```
1       1523456    # 1.5M k-mers appear exactly once (likely errors)
2       234567     # 234K k-mers appear twice
10      45678      # 45K k-mers appear 10 times
...
```
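
As a rough illustration of what "count of counts" means, the sketch below derives a spectrum from a map of k-mer counts. The `spectrum` function and the toy input are hypothetical and only mirror the two-column output described above.

```rust
use std::collections::{BTreeMap, HashMap};

/// For each occurrence count, record how many distinct k-mers have that count.
/// A BTreeMap keeps the rows sorted by count for printing.
fn spectrum(counts: &HashMap<String, u64>) -> BTreeMap<u64, u64> {
    let mut hist = BTreeMap::new();
    for &c in counts.values() {
        *hist.entry(c).or_insert(0u64) += 1;
    }
    hist
}

fn main() {
    // Toy k-mer counts standing in for real counting output.
    let mut counts = HashMap::new();
    for kmer in ["AAC", "ACG", "AAC", "CGT", "AAC", "ACG"] {
        *counts.entry(kmer.to_string()).or_insert(0u64) += 1;
    }
    // Tab-separated rows: count, then number of distinct k-mers with that count.
    for (count, frequency) in spectrum(&counts) {
        println!("{count}\t{frequency}");
    }
}
```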

### Quality Filtering (FASTQ)

For FASTQ files, use `--min-quality` to filter out k-mers containing low-quality bases:

```bash
# Skip k-mers with any base below Q20
kmerust 21 reads.fq --min-quality 20

# Higher threshold for stricter filtering
kmerust 21 reads.fq -Q 30 --format tsv
```

K-mers containing bases with Phred quality scores below the threshold are skipped entirely.
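
The filtering rule can be sketched as follows. This assumes Phred+33 quality encoding, which is consistent with the 0-93 range but not stated explicitly here. The function names are hypothetical, not part of the kmerust API.

```rust
/// Decode one FASTQ quality character, assuming Phred+33 encoding.
fn phred(q: u8) -> u8 {
    q.saturating_sub(33)
}

/// Return the k-mers of `seq` in which every base has quality >= `min_quality`.
fn high_quality_kmers<'a>(
    seq: &'a [u8],
    qual: &[u8],
    k: usize,
    min_quality: u8,
) -> Vec<&'a [u8]> {
    assert_eq!(seq.len(), qual.len(), "sequence and quality lengths differ");
    if k == 0 || seq.len() < k {
        return Vec::new();
    }
    seq.windows(k)
        .zip(qual.windows(k))
        // Drop the whole k-mer if any base in the window is below the threshold.
        .filter(|(_, q)| q.iter().all(|&c| phred(c) >= min_quality))
        .map(|(kmer, _)| kmer)
        .collect()
}

fn main() {
    let seq = b"ACGTAC";
    let qual = b"IIII#I"; // 'I' decodes to Q40, '#' to Q2
    for kmer in high_quality_kmers(seq, qual, 3, 20) {
        println!("{}", String::from_utf8_lossy(kmer));
    }
}
```

A single low-quality base removes every k-mer window that overlaps it, so one bad base can suppress up to k k-mers.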

### Index Serialization

For large genomes, save k-mer counts to a binary index file to avoid re-counting:

```bash
# Count and save to index
kmerust 21 genome.fa --save counts.kmix

# Counts are also written to stdout as usual
kmerust 21 genome.fa --save counts.kmix --format tsv > counts.tsv
```

The index file uses a compact binary format with CRC32 checksums for integrity verification. Gzip compression is auto-detected from the `.gz` extension:

```bash
# Save with gzip compression
kmerust 21 genome.fa --save counts.kmix.gz
```
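
The layout of the index file is not described here, so the following is only a sketch of how a CRC32 checksum can guard a binary payload. It uses the common CRC-32 (IEEE) polynomial and a made-up framing (payload followed by a 4-byte little-endian checksum). Both the framing and the `seal`/`verify` names are assumptions for illustration.

```rust
/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // mask is all ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical framing: payload followed by its checksum (little-endian).
fn seal(payload: &[u8]) -> Vec<u8> {
    let mut out = payload.to_vec();
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out
}

/// Return the payload if the trailing checksum matches, otherwise None.
fn verify(framed: &[u8]) -> Option<&[u8]> {
    if framed.len() < 4 {
        return None;
    }
    let (payload, tail) = framed.split_at(framed.len() - 4);
    let stored = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);
    if crc32(payload) == stored {
        Some(payload)
    } else {
        None
    }
}

fn main() {
    let mut framed = seal(b"k=21;ACGT=42");
    println!("intact: {}", verify(&framed).is_some());
    framed[0] ^= 0x01; // simulate a corrupted byte
    println!("after corruption: {}", verify(&framed).is_some());
}
```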

### Querying a Saved Index

Use the `query` subcommand to look up k-mer counts from a saved index:

```bash
# Query a single k-mer
kmerust query counts.kmix ACGTACGTACGTACGTACGTA
# Output: 42 (or 0 if not found)

# Queries are case-insensitive and canonicalized
kmerust query counts.kmix acgtacgtacgtacgtacgta  # Same result
kmerust query counts.kmix TACGTACGTACGTACGTACGT  # Reverse complement, same result
```

The query k-mer length must match the index's k value.
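
A minimal sketch of the normalization implied above: upper-case the query, check its length against k, and pick the canonical form so that a k-mer and its reverse complement map to the same lookup key. The function names and error messages are hypothetical, and how kmerust itself reports these errors is not specified here.

```rust
/// Reverse complement of an upper-case DNA string; None if a non-ACGT base appears.
fn reverse_complement(kmer: &str) -> Option<String> {
    kmer.chars()
        .rev()
        .map(|c| match c {
            'A' => Some('T'),
            'C' => Some('G'),
            'G' => Some('C'),
            'T' => Some('A'),
            _ => None,
        })
        .collect()
}

/// Turn a user query into the key that would be looked up in an index built with `k`.
fn normalize_query(query: &str, k: usize) -> Result<String, String> {
    if query.len() != k {
        return Err(format!(
            "query length {} does not match index k = {}",
            query.len(),
            k
        ));
    }
    let upper = query.to_ascii_uppercase();
    let rc = reverse_complement(&upper)
        .ok_or_else(|| format!("invalid base in query: {query}"))?;
    // Canonical form: the lexicographically smaller of the two strands.
    Ok(if rc < upper { rc } else { upper })
}

fn main() {
    for q in [
        "ACGTACGTACGTACGTACGTA",
        "acgtacgtacgtacgtacgta",
        "TACGTACGTACGTACGTACGT",
    ] {
        println!("{q} -> {:?}", normalize_query(q, 21));
    }
}
```

All three queries in `main` resolve to the same key, which is why they return the same count.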

### Sequence Readers

kmerust supports two sequence readers via feature flags, both supporting FASTA and FASTQ:

- `rust-bio` (default) - Uses the [rust-bio](https://docs.rs/bio) library

- `needletail` - Uses the [needletail](https://docs.rs/needletail) library

To use needletail instead:

```bash
cargo run --release --no-default-features --features needletail -- 21 sequences.fa
```

With needletail, format is auto-detected from file content. With rust-bio, format is detected from file extension (`.fa`, `.fasta`, `.fna` for FASTA; `.fq`, `.fastq` for FASTQ).
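
Extension-based detection with the extensions listed above could look roughly like this. The `Format` enum and `detect_format` function are illustrative names, the case-insensitive match is an assumption, and what kmerust does for an unrecognized extension is not covered here.

```rust
use std::path::Path;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Format {
    Fasta,
    Fastq,
}

/// Guess the sequence format from the file extension; None if unrecognized.
fn detect_format(path: &Path) -> Option<Format> {
    // Assumption: extensions are compared case-insensitively.
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "fa" | "fasta" | "fna" => Some(Format::Fasta),
        "fq" | "fastq" => Some(Format::Fastq),
        _ => None,
    }
}

fn main() {
    for name in ["sequences.fa", "genome.fna", "reads.fastq", "notes.txt"] {
        println!("{name}: {:?}", detect_format(Path::new(name)));
    }
}
```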

### Production Features

Enable production features for additional capabilities:

```bash
cargo build --release --features production
```

Or enable individual features:

- `gzip` - Read gzip-compressed FASTA files (`.fa.gz`)

- `mmap` - Memory-mapped I/O for large files

- `tracing` - Structured logging and diagnostics

#### Gzip Compressed Input

With the `gzip` feature, kmerust can directly read gzip-compressed files:

```bash
cargo run --release --features gzip -- 21 sequences.fa.gz
```

#### Tracing/Logging

With the `tracing` feature, use the `RUST_LOG` environment variable for diagnostic output:

```bash
RUST_LOG=kmerust=debug cargo run --features tracing -- 21 sequences.fa
```

## Output Format

By default, output is written to stdout in a FASTA-like format, with the count on the header line and the canonical k-mer on the next line:

```
>{count}
{canonical_kmer}
```

Example output:

```
>114928
ATGCC
>289495
AATCA
```
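
For readers post-processing counts themselves, a writer for this layout can be sketched as below. `write_fasta_like` is a hypothetical helper, and its `min_count` parameter only imitates the effect of the `--min-count` option.

```rust
use std::io::{self, Write};

/// Write each (k-mer, count) pair as ">{count}" followed by the k-mer,
/// skipping entries whose count is below `min_count`.
fn write_fasta_like<W: Write>(
    mut out: W,
    counts: &[(String, u64)],
    min_count: u64,
) -> io::Result<()> {
    for (kmer, count) in counts {
        if *count >= min_count {
            writeln!(out, ">{count}\n{kmer}")?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let counts = vec![
        ("ATGCC".to_string(), 114928),
        ("AATCA".to_string(), 289495),
    ];
    write_fasta_like(io::stdout().lock(), &counts, 1)
}
```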

## Library Usage

kmerust can also be used as a library:

```rust
use kmerust::run::count_kmers;
use std::path::PathBuf;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Works with both FASTA and FASTQ (format auto-detected)
    let path = PathBuf::from("sequences.fa");
    let counts = count_kmers(&path, 21)?;

    for (kmer, count) in counts {
        println!("{kmer}: {count}");
    }

    Ok(())
}
```

### Explicit Format Selection

When using the builder API, you can explicitly specify the input format:

```rust
use kmerust::builder::KmerCounter;
use kmerust::format::SequenceFormat;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = KmerCounter::new()
        .k(21)?
        .input_format(SequenceFormat::Fastq)
        .count("reads.fq")?;

    Ok(())
}
```

### Progress Reporting

Monitor progress during long-running operations:

```rust
use kmerust::run::count_kmers_with_progress;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_with_progress("genome.fa", 21, |progress| {
        eprintln!(
            "Processed {} sequences ({} bases)",
            progress.sequences_processed,
            progress.bases_processed
        );
    })?;

    Ok(())
}
```

### Memory-Mapped I/O

For large files, use memory-mapped I/O (requires `mmap` feature):

```rust
use kmerust::run::count_kmers_mmap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_mmap("large_genome.fa", 21)?;
    println!("Found {} unique k-mers", counts.len());

    Ok(())
}
```

### Streaming API

For memory-efficient processing:

```rust
use kmerust::streaming::count_kmers_streaming;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_streaming("genome.fa", 21)?;
    println!("Found {} unique k-mers", counts.len());

    Ok(())
}
```

### Reading from Any Source

Count k-mers from any `BufRead` source, including stdin or in-memory data:

```rust
use kmerust::streaming::count_kmers_from_reader;
use std::io::BufReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // From in-memory data
    let fasta_data = b">seq1\nACGTACGT\n>seq2\nTGCATGCA\n";
    let reader = BufReader::new(&fasta_data[..]);
    let counts = count_kmers_from_reader(reader, 4)?;

    // From stdin
    // use kmerust::streaming::count_kmers_stdin;
    // let counts = count_kmers_stdin(21)?;

    Ok(())
}
```

## Performance

kmerust uses parallel processing to efficiently count k-mers:

- Sequences are processed in parallel using rayon

- A concurrent hash map (dashmap) lets threads update counts concurrently, using sharded locks to keep contention low

- FxHash provides fast hashing for 64-bit packed k-mers
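
To show the general shape of parallel counting without pulling in external crates, the sketch below uses only the standard library: scoped threads each build a local map, and the maps are merged at the end. It is a stand-in for the rayon and dashmap approach named above, not kmerust's implementation, and it omits canonicalization and bit-packing for brevity.

```rust
use std::collections::HashMap;
use std::thread;

/// Count k-mers across sequences using `workers` threads.
/// Each thread counts into its own map; the maps are merged afterwards.
fn count_parallel(seqs: &[&[u8]], k: usize, workers: usize) -> HashMap<Vec<u8>, u64> {
    if k == 0 || seqs.is_empty() {
        return HashMap::new();
    }
    let workers = workers.max(1);
    // Ceiling division so every sequence lands in exactly one chunk.
    let chunk_size = ((seqs.len() + workers - 1) / workers).max(1);

    let partials: Vec<HashMap<Vec<u8>, u64>> = thread::scope(|scope| {
        let handles: Vec<_> = seqs
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    let mut local = HashMap::new();
                    for seq in chunk {
                        for window in seq.windows(k) {
                            *local.entry(window.to_vec()).or_insert(0u64) += 1;
                        }
                    }
                    local
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    });

    // Merge the per-thread maps into one result.
    let mut total = HashMap::new();
    for partial in partials {
        for (kmer, count) in partial {
            *total.entry(kmer).or_insert(0) += count;
        }
    }
    total
}

fn main() {
    let seqs: Vec<&[u8]> = vec![&b"ACGTACGT"[..], &b"TGCATGCA"[..], &b"ACGTTTTT"[..]];
    let counts = count_parallel(&seqs, 4, 2);
    println!("Found {} unique k-mers", counts.len());
}
```

Per-thread maps avoid any shared-state synchronization during counting at the cost of a merge step. A shared concurrent map trades that merge for synchronized updates.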

## License

MIT License - see [LICENSE](LICENSE) for details.
