https://github.com/suchapalaver/kmerust
Bioinformatics 101 tool for counting unique k-length substrings in DNA
- Host: GitHub
- URL: https://github.com/suchapalaver/kmerust
- Owner: suchapalaver
- License: mit
- Created: 2021-09-06T18:32:57.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2026-01-27T20:57:24.000Z (about 1 month ago)
- Last Synced: 2026-01-28T08:54:14.649Z (about 1 month ago)
- Topics: beginner-friendly, bioinformatics, bioinformatics-tool, bitpacking, bytes, clap, dashmap, genomics, insta, k-mer, k-mer-counting, k-mers, kmer, kmer-counting, kmers, needletail, parallelization, rayon, rust, rust-bio
- Language: Rust
- Homepage:
- Size: 23.3 MB
- Stars: 33
- Watchers: 0
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# kmerust
[crates.io](https://crates.io/crates/kmerust)
[docs.rs](https://docs.rs/kmerust)
[CI](https://github.com/suchapalaver/kmerust/actions)
[License: MIT](https://opensource.org/licenses/MIT)
A fast, parallel [k-mer](https://en.wikipedia.org/wiki/K-mer) counter for DNA sequences in FASTA and FASTQ files.
## Features
- **Fast parallel processing** using [rayon](https://docs.rs/rayon) and [dashmap](https://docs.rs/dashmap)
- **FASTA and FASTQ support** with automatic format detection from file extension
- **Canonical k-mers** - outputs the lexicographically smaller of each k-mer and its reverse complement
- **Flexible k-mer lengths** from 1 to 32
- **Handles N bases** by skipping invalid k-mers
- **Jellyfish-compatible output** format for easy integration with existing pipelines
- **Tested for accuracy** against [Jellyfish](https://github.com/gmarcais/Jellyfish)
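To make the canonical k-mer and N-handling features concrete, here is a minimal, stdlib-only sketch. It is not kmerust's implementation: the function names `complement`, `canonical`, and `count_canonical` are hypothetical, and it uses strings and a plain `HashMap` rather than packed k-mers and a concurrent map.

```rust
use std::collections::HashMap;

/// Complement of a single DNA base; `None` for anything other than A, C, G, T.
fn complement(base: u8) -> Option<u8> {
    match base.to_ascii_uppercase() {
        b'A' => Some(b'T'),
        b'C' => Some(b'G'),
        b'G' => Some(b'C'),
        b'T' => Some(b'A'),
        _ => None,
    }
}

/// Canonical form of a k-mer: the lexicographically smaller of the k-mer
/// and its reverse complement. Returns `None` if any base is invalid (e.g. N).
fn canonical(kmer: &[u8]) -> Option<String> {
    let forward: Vec<u8> = kmer.iter().map(|b| b.to_ascii_uppercase()).collect();
    let revcomp: Option<Vec<u8>> = kmer.iter().rev().map(|&b| complement(b)).collect();
    let revcomp = revcomp?;
    let smaller = if forward <= revcomp { forward } else { revcomp };
    String::from_utf8(smaller).ok()
}

/// Count canonical k-mers in one sequence, skipping every window that
/// contains an invalid base.
fn count_canonical(seq: &[u8], k: usize) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    if k == 0 || seq.len() < k {
        return counts;
    }
    for window in seq.windows(k) {
        if let Some(key) = canonical(window) {
            *counts.entry(key).or_insert(0) += 1;
        }
    }
    counts
}
```

For the sequence `ACGTNACG` with k = 3, the windows `ACG`, `CGT`, and the final `ACG` all map to the canonical key `ACG`, while the three windows that overlap the `N` are skipped.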
## Installation
### From crates.io
```bash
cargo install kmerust
```
### From source
```bash
git clone https://github.com/suchapalaver/kmerust.git
cd kmerust
cargo install --path .
```
## Usage
```bash
kmerust <k> <path>
```
### Arguments
- `<k>` - K-mer length (1-32)
- `<path>` - Path to a FASTA or FASTQ file (use `-` or omit for stdin)
### Options
- `-f, --format <FORMAT>` - Output format: `fasta` (default), `tsv`, `json`, or `histogram`
- `-i, --input-format <FORMAT>` - Input format: `auto` (default), `fasta`, or `fastq`
- `-m, --min-count <N>` - Minimum count threshold (default: 1)
- `-Q, --min-quality <Q>` - Minimum Phred quality score for FASTQ (0-93); bases below this are skipped
- `--save <PATH>` - Save k-mer counts to a binary index file for fast querying
- `-q, --quiet` - Suppress informational output
- `-h, --help` - Print help information
- `-V, --version` - Print version information
### Examples
Count 21-mers in a FASTA file:
```bash
kmerust 21 sequences.fa > kmers.txt
```
Count 21-mers in a FASTQ file (format auto-detected):
```bash
kmerust 21 reads.fq > kmers.txt
```
Count 5-mers:
```bash
kmerust 5 sequences.fa > kmers.txt
```
### Unix Pipeline Integration
kmerust reads from stdin when the path is omitted or given as `-`, so it composes with standard Unix pipelines:
```bash
# Pipe from another command
cat genome.fa | kmerust 21
# Decompress and count
zcat large.fa.gz | kmerust 21 --format tsv > counts.tsv
# Sample reads and count
seqtk sample reads.fa 0.1 | kmerust 17
# Explicit stdin marker
cat genome.fa | kmerust 21 -
# FASTQ from stdin (specify format explicitly)
cat reads.fq | kmerust 21 --input-format fastq
zcat reads.fq.gz | kmerust 21 -i fastq --format tsv > counts.tsv
```
### Output Formats
Use `--format` to choose the output format:
```bash
# TSV format (tab-separated)
kmerust 21 sequences.fa --format tsv
# JSON format
kmerust 21 sequences.fa --format json
# FASTA-like format (default)
kmerust 21 sequences.fa --format fasta
# Histogram format (k-mer frequency spectrum)
kmerust 21 sequences.fa --format histogram
```
### Histogram Output
The histogram format outputs the k-mer frequency spectrum (count of counts), useful for genome size estimation and error detection:
```bash
kmerust 21 genome.fa --format histogram > spectrum.tsv
```
Output is tab-separated with columns `count` and `frequency`:
```
1 1523456 # 1.5M k-mers appear exactly once (likely errors)
2 234567 # 234K k-mers appear twice
10 45678 # 45K k-mers appear 10 times
...
```
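The "count of counts" idea can be sketched in a few lines. This is an illustrative, stdlib-only example and not kmerust's code; the names `spectrum` and `to_tsv` are hypothetical, and the input is assumed to be a map from k-mer to count.

```rust
use std::collections::{BTreeMap, HashMap};

/// Build a k-mer frequency spectrum: for each count value, how many
/// distinct k-mers occur exactly that many times.
fn spectrum(counts: &HashMap<String, u64>) -> BTreeMap<u64, u64> {
    let mut hist = BTreeMap::new();
    for &count in counts.values() {
        *hist.entry(count).or_insert(0) += 1;
    }
    hist
}

/// Render the spectrum as tab-separated `count<TAB>frequency` lines,
/// ordered by count (BTreeMap iterates in key order).
fn to_tsv(hist: &BTreeMap<u64, u64>) -> String {
    hist.iter().map(|(c, f)| format!("{c}\t{f}\n")).collect()
}
```

Two k-mers seen once and one k-mer seen three times would give the rows `1	2` and `3	1`.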
### Quality Filtering (FASTQ)
For FASTQ files, use `--min-quality` to filter out k-mers containing low-quality bases:
```bash
# Skip k-mers with any base below Q20
kmerust 21 reads.fq --min-quality 20
# Higher threshold for stricter filtering
kmerust 21 reads.fq -Q 30 --format tsv
```
K-mers containing bases with Phred quality scores below the threshold are skipped entirely.
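The filtering rule can be sketched as follows. This is an assumption-laden illustration, not kmerust's implementation: it assumes Phred+33 quality encoding (consistent with the 0-93 range above), and the names `phred` and `high_quality_kmers` are hypothetical.

```rust
/// Decode one FASTQ quality character into a Phred score (assumes Phred+33).
fn phred(q: u8) -> u8 {
    q.saturating_sub(33)
}

/// Return the k-mers of `seq` in which every base has quality >= `min_quality`.
/// A k-mer with even one low-quality base is dropped entirely.
fn high_quality_kmers<'a>(
    seq: &'a [u8],
    qual: &[u8],
    k: usize,
    min_quality: u8,
) -> Vec<&'a [u8]> {
    if k == 0 || seq.len() < k || seq.len() != qual.len() {
        return Vec::new();
    }
    seq.windows(k)
        .zip(qual.windows(k))
        .filter(|(_, q)| q.iter().all(|&c| phred(c) >= min_quality))
        .map(|(kmer, _)| kmer)
        .collect()
}
```

With the read `ACGTAC`, qualities `IIII#I` (`I` is Q40, `#` is Q2), k = 3, and a threshold of 20, only `ACG` and `CGT` survive; both windows that touch the Q2 base are skipped.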
### Index Serialization
For large genomes, save k-mer counts to a binary index file to avoid re-counting:
```bash
# Count and save to index
kmerust 21 genome.fa --save counts.kmix
# Counts are also written to stdout as usual
kmerust 21 genome.fa --save counts.kmix --format tsv > counts.tsv
```
The index file uses a compact binary format with CRC32 checksums for integrity verification. Gzip compression is auto-detected from the `.gz` extension:
```bash
# Save with gzip compression
kmerust 21 genome.fa --save counts.kmix.gz
```
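The layout of the index file is not described here, so the following is only a sketch of how a CRC32 checksum can guard a binary payload. The trailer layout (payload followed by a little-endian CRC32) and the names `seal` and `open` are hypothetical; the `crc32` function is the standard bitwise CRC-32 (IEEE) algorithm.

```rust
/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Append a little-endian CRC-32 trailer to a payload (hypothetical layout).
fn seal(payload: &[u8]) -> Vec<u8> {
    let mut out = payload.to_vec();
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out
}

/// Split off the trailer and return the payload only if the checksum matches.
fn open(bytes: &[u8]) -> Option<&[u8]> {
    if bytes.len() < 4 {
        return None;
    }
    let (payload, trailer) = bytes.split_at(bytes.len() - 4);
    let stored = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    (crc32(payload) == stored).then_some(payload)
}
```

Flipping any byte of a sealed buffer makes `open` return `None`, which is the kind of corruption check a checksum in an index file provides.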
### Querying a Saved Index
Use the `query` subcommand to look up k-mer counts from a saved index:
```bash
# Query a single k-mer
kmerust query counts.kmix ACGTACGTACGTACGTACGTA
# Output: 42 (or 0 if not found)
# Queries are case-insensitive and canonicalized
kmerust query counts.kmix acgtacgtacgtacgtacgta # Same result
kmerust query counts.kmix TACGTACGTACGTACGTACGT # Reverse complement, same result
```
The query k-mer length must match the index's k value.
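A rough sketch of the normalization a query presumably goes through before lookup: uppercase it, check its length against k, and take the canonical form. The names `reverse_complement` and `normalize_query` are hypothetical, and the error messages are illustrative rather than kmerust's own.

```rust
/// Reverse complement of an uppercase DNA string; `None` on a non-ACGT base.
fn reverse_complement(kmer: &str) -> Option<String> {
    kmer.chars()
        .rev()
        .map(|c| match c {
            'A' => Some('T'),
            'C' => Some('G'),
            'G' => Some('C'),
            'T' => Some('A'),
            _ => None,
        })
        .collect()
}

/// Normalize a query to the key form assumed for the index:
/// uppercase, length-checked against `k`, and canonicalized.
fn normalize_query(query: &str, k: usize) -> Result<String, String> {
    let upper = query.to_ascii_uppercase();
    if upper.len() != k {
        return Err(format!(
            "query has length {}, index uses k = {}",
            upper.len(),
            k
        ));
    }
    let revcomp = reverse_complement(&upper)
        .ok_or_else(|| "query contains a non-ACGT base".to_string())?;
    Ok(if upper <= revcomp { upper } else { revcomp })
}
```

Under this scheme `ACGTACGTACGTACGTACGTA`, its lowercase spelling, and its reverse complement `TACGTACGTACGTACGTACGT` all normalize to the same key, and a query of the wrong length is rejected before any lookup.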
### Sequence Readers
kmerust supports two sequence readers via feature flags, both supporting FASTA and FASTQ:
- `rust-bio` (default) - Uses the [rust-bio](https://docs.rs/bio) library
- `needletail` - Uses the [needletail](https://docs.rs/needletail) library
To use needletail instead:
```bash
cargo run --release --no-default-features --features needletail -- 21 sequences.fa
```
With needletail, format is auto-detected from file content. With rust-bio, format is detected from file extension (`.fa`, `.fasta`, `.fna` for FASTA; `.fq`, `.fastq` for FASTQ).
### Production Features
Enable production features for additional capabilities:
```bash
cargo build --release --features production
```
Or enable individual features:
- `gzip` - Read gzip-compressed FASTA files (`.fa.gz`)
- `mmap` - Memory-mapped I/O for large files
- `tracing` - Structured logging and diagnostics
#### Gzip Compressed Input
With the `gzip` feature, kmerust can directly read gzip-compressed files:
```bash
cargo run --release --features gzip -- 21 sequences.fa.gz
```
#### Tracing/Logging
With the `tracing` feature, use the `RUST_LOG` environment variable for diagnostic output:
```bash
RUST_LOG=kmerust=debug cargo run --features tracing -- 21 sequences.fa
```
## Output Format
By default, output is written to stdout in a FASTA-like format, with the count on the header line and the canonical k-mer on the next line:
```
>{count}
{canonical_kmer}
```
Example output:
```
>114928
ATGCC
>289495
AATCA
```
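For readers producing or consuming this format from their own code, a minimal writer might look like the sketch below. The function name `write_fasta_like` is hypothetical and this is not kmerust's output code; it only reproduces the record shape shown above.

```rust
use std::io::{self, Write};

/// Write `(kmer, count)` pairs as FASTA-like records:
/// a `>{count}` header line followed by the k-mer on its own line.
fn write_fasta_like<W: Write>(mut out: W, counts: &[(String, u64)]) -> io::Result<()> {
    for (kmer, count) in counts {
        writeln!(out, ">{count}\n{kmer}")?;
    }
    Ok(())
}
```

Because the function is generic over `Write`, the same code can target stdout, a file, or an in-memory buffer in tests.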
## Library Usage
kmerust can also be used as a library:
```rust
use kmerust::run::count_kmers;
use std::path::PathBuf;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Works with both FASTA and FASTQ (format auto-detected)
    let path = PathBuf::from("sequences.fa");
    let counts = count_kmers(&path, 21)?;
    for (kmer, count) in counts {
        println!("{kmer}: {count}");
    }
    Ok(())
}
```
### Explicit Format Selection
When using the builder API, you can explicitly specify the input format:
```rust
use kmerust::builder::KmerCounter;
use kmerust::format::SequenceFormat;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = KmerCounter::new()
        .k(21)?
        .input_format(SequenceFormat::Fastq)
        .count("reads.fq")?;
    Ok(())
}
```
### Progress Reporting
Monitor progress during long-running operations:
```rust
use kmerust::run::count_kmers_with_progress;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_with_progress("genome.fa", 21, |progress| {
        eprintln!(
            "Processed {} sequences ({} bases)",
            progress.sequences_processed,
            progress.bases_processed
        );
    })?;
    Ok(())
}
```
### Memory-Mapped I/O
For large files, use memory-mapped I/O (requires `mmap` feature):
```rust
use kmerust::run::count_kmers_mmap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_mmap("large_genome.fa", 21)?;
    println!("Found {} unique k-mers", counts.len());
    Ok(())
}
```
### Streaming API
For memory-efficient processing:
```rust
use kmerust::streaming::count_kmers_streaming;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_streaming("genome.fa", 21)?;
    println!("Found {} unique k-mers", counts.len());
    Ok(())
}
```
### Reading from Any Source
Count k-mers from any `BufRead` source, including stdin or in-memory data:
```rust
use kmerust::streaming::count_kmers_from_reader;
use std::io::BufReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // From in-memory data
    let fasta_data = b">seq1\nACGTACGT\n>seq2\nTGCATGCA\n";
    let reader = BufReader::new(&fasta_data[..]);
    let counts = count_kmers_from_reader(reader, 4)?;

    // From stdin
    // use kmerust::streaming::count_kmers_stdin;
    // let counts = count_kmers_stdin(21)?;
    Ok(())
}
```
## Performance
kmerust uses parallel processing to efficiently count k-mers:
- Sequences are processed in parallel using rayon
- A concurrent hash map (dashmap) lets threads update counts concurrently; it shards the map internally so that writers rarely contend for the same lock
- FxHash provides fast hashing for 64-bit packed k-mers
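The 64-bit packing also explains the upper limit of k = 32: at 2 bits per base, 32 bases fill a `u64` exactly. The sketch below illustrates the idea with an assumed encoding (A=00, C=01, G=10, T=11); kmerust's actual bit layout may differ, and the names `pack` and `unpack` are hypothetical.

```rust
/// Pack a k-mer (1 <= k <= 32) into a u64 using 2 bits per base
/// (assumed encoding: A=00, C=01, G=10, T=11).
/// Returns `None` for invalid bases or unsupported lengths.
fn pack(kmer: &[u8]) -> Option<u64> {
    if kmer.is_empty() || kmer.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for &base in kmer {
        let bits = match base {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None,
        };
        packed = (packed << 2) | bits;
    }
    Some(packed)
}

/// Unpack `k` bases from a packed value back into an uppercase string.
fn unpack(mut packed: u64, k: usize) -> String {
    let mut bases = vec![b'A'; k];
    // The last base sits in the lowest two bits, so fill from the end.
    for slot in bases.iter_mut().rev() {
        *slot = b"ACGT"[(packed & 3) as usize];
        packed >>= 2;
    }
    String::from_utf8(bases).expect("only ASCII bases are written")
}
```

With this particular encoding, comparing two packed k-mers of the same length as integers gives the same order as comparing them as strings, so choosing a canonical form reduces to an integer comparison. Hashing a single `u64` is also much cheaper than hashing a string of up to 32 bytes.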
## License
MIT License - see [LICENSE](LICENSE) for details.