https://github.com/refresh-bio/kmer-db
Kmer-db is a fast and memory-efficient tool for large-scale k-mer analyses (indexing, querying, estimating evolutionary relationships, etc.).
- Host: GitHub
- URL: https://github.com/refresh-bio/kmer-db
- Owner: refresh-bio
- License: gpl-3.0
- Created: 2018-02-09T12:57:04.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2025-05-20T09:19:14.000Z (about 1 year ago)
- Last Synced: 2025-05-20T10:31:49.390Z (about 1 year ago)
- Topics: bioinformatics, genomics, indexing, k-mer, sequence-similarity
- Language: C++
- Homepage:
- Size: 5.6 MB
- Stars: 89
- Watchers: 6
- Forks: 18
- Open Issues: 11
Metadata Files:
- Readme: README.md
- License: LICENSE
# Kmer-db
Kmer-db is a fast and memory-efficient tool for large-scale k-mer analyses (indexing, querying, estimating evolutionary relationships, etc.).
## Quick start
```bash
git clone --recurse-submodules https://github.com/refresh-bio/kmer-db
cd kmer-db && gmake
INPUT=./test/virus
OUTPUT=./output
mkdir $OUTPUT
# build a database from all 18-mers (default) contained in a set of sequences
./bin/kmer-db build $INPUT/seqs.part1.list $OUTPUT/k18.db
# establish numbers of common k-mers between new sequences and the database
./bin/kmer-db new2all $OUTPUT/k18.db $INPUT/seqs.part2.list $OUTPUT/n2a.csv
# calculate jaccard index from common k-mers
./bin/kmer-db distance jaccard $OUTPUT/n2a.csv $OUTPUT/n2a.jaccard
# extend the database with new sequences
./bin/kmer-db build -extend $INPUT/seqs.part2.list $OUTPUT/k18.db
# establish numbers of common k-mers between all sequences in the database
./bin/kmer-db all2all $OUTPUT/k18.db $OUTPUT/a2a.csv
# build a database from 10% of 25-mers using 16 threads
./bin/kmer-db build -k 25 -f 0.1 -t 16 $INPUT/seqs.part1.list $OUTPUT/k25.db
# establish number of common 25-mers between single sequence and the database
# (minhash filtering that retains 10% of MT159713 k-mers is done prior to the comparison)
./bin/kmer-db one2all $OUTPUT/k25.db $INPUT/data/MT159713.fasta $OUTPUT/MT159713.csv
# build two partial databases
./bin/kmer-db build $INPUT/seqs.part1.list $OUTPUT/k18.parts1.db
./bin/kmer-db build $INPUT/seqs.part2.list $OUTPUT/k18.parts2.db
# establish numbers of common k-mers between all sequences in the databases,
# computations are done in the sparse mode, the output matrix is also sparse
echo $OUTPUT/k18.parts1.db > $OUTPUT/db.list
echo $OUTPUT/k18.parts2.db >> $OUTPUT/db.list
./bin/kmer-db all2all-parts $OUTPUT/db.list $OUTPUT/k18.parts.csv
```
### Table of contents
1. [Installation](#1-installation)
2. [Usage](#2-usage)
    1. [Building a database](#21-building-a-database)
    2. [Counting common k-mers](#22-counting-common-k-mers)
    3. [Calculating similarities or distances](#23-calculating-similarities-or-distances)
    4. [Storing minhashed k-mers](#24-storing-minhashed-k-mers)
3. [Datasets](#3-datasets)
# 1. Installation
Kmer-db comes with a set of [precompiled binaries](https://github.com/refresh-bio/kmer-db/releases) for Linux, macOS, and Windows.
The software is also available on [Bioconda](https://anaconda.org/bioconda/kmer-db):
```
conda install -c bioconda kmer-db
```
For detailed instructions on how to set up Bioconda, please refer to the [Bioconda manual](https://bioconda.github.io/).
Kmer-db can also be built from the sources distributed as:
* GNU Make project for Linux and macOS (gmake 4.3 and gcc/g++ 11 or newer required),
* Visual Studio 2022 solution for Windows.
## Vector extensions
Kmer-db can be built for x86-64 and ARM64 8 architectures (including Apple Mx based on ARM64 8.4 core) and takes advantage of AVX2 (x86-64) and NEON (ARM) CPU extensions. The default target platform is x86-64 with AVX2 extensions. This, however, can be changed by setting `PLATFORM` variable for `make`:
```bash
make PLATFORM=none # unspecified platform, no extensions
make PLATFORM=sse2 # x86-64 with SSE2
make PLATFORM=avx # x86-64 with AVX
make PLATFORM=avx2 # x86-64 with AVX2 (default)
make PLATFORM=native # x86-64 with AVX2 and native architecture
make PLATFORM=arm8 # ARM64 8 with NEON
make PLATFORM=m1 # ARM64 8.4 (especially Apple M1) with NEON
```
Note that x86-64 binaries determine the supported extensions at runtime, which makes them backwards-compatible. For instance, the AVX executable will also work on an SSE-only platform, but with limited performance.
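The choice of `PLATFORM` can be scripted. The sketch below is only an illustration: the helper name `pick_platform` and the mapping from `uname` output to `PLATFORM` values are assumptions, not part of the Kmer-db build system.

```bash
# Hypothetical helper: map a machine architecture (as printed by `uname -m`)
# and an OS name (as printed by `uname -s`) to one of the PLATFORM values
# listed above.
pick_platform() {
    arch="$1"
    os="$2"
    case "$arch" in
        x86_64|amd64)
            echo "avx2" ;;              # documented default for x86-64
        arm64|aarch64)
            if [ "$os" = "Darwin" ]; then
                echo "m1"               # Apple silicon
            else
                echo "arm8"             # generic ARM64 with NEON
            fi ;;
        *)
            echo "none" ;;              # unknown platform, no extensions
    esac
}

# Print the make command for the current machine instead of running it.
echo "make PLATFORM=$(pick_platform "$(uname -m)" "$(uname -s)")"
```

On x86-64 hosts without AVX2, a lower setting such as `sse2` would have to be chosen by hand; the sketch does not inspect CPU flags.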
# 2. Usage
`kmer-db <mode> [options] <positional arguments>`
Kmer-db operates in one of the following modes:
* `build` - building a database from samples,
* `all2all` - counting common k-mers - all samples in the database,
* `all2all-sp` - counting common k-mers - all samples in the database (sparse computation),
* `all2all-parts` - counting common k-mers - all samples from multiple partial databases (sparse computation),
* `new2all` - counting common k-mers - set of new samples versus database,
* `one2all` - counting common k-mers - single sample versus database,
* `distance` - calculating similarities/distances,
* `minhash` - storing minhashed k-mers.
Common options:
* `-t <threads>` - number of threads (default: number of available cores).
The meaning of other options and positional arguments depends on the selected mode.
## 2.1. Building a database
Constructing a k-mer database is an obligatory step for further analyses. The procedure accepts several input types:
* compressed or uncompressed genomes/reads:

  `kmer-db build [-k <kmer-length>] [-f <fraction>] [-multisample-fasta] [-extend] [-alphabet <alphabet_type>] [-preserve-strand] [-t <threads>] <samples> <database>`
* [KMC-generated](https://github.com/refresh-bio/KMC) k-mers:

  `kmer-db build -from-kmers [-f <fraction>] [-extend] [-t <threads>] <samples> <database>`
* [minhashed k-mers](#24-storing-minhashed-k-mers) produced by `minhash` mode:

  `kmer-db build -from-minhash [-extend] [-t <threads>] <samples> <database>`

Parameters:
* `samples` (input) - one of the following:
* FASTA file (*fa*, *fna*, *fasta*, *fa.gz*, *fna.gz*, *fasta.gz*) with one or multiple (`-multisample-fasta` switch) samples
* file with a newline-separated list of samples:
```
sample_file_1
sample_file_2
sample_file_3
...
```
Every file can be in one of the formats:
1. FASTA genomes/reads (default). If a file on the list cannot be found, the following extensions are tested: *fa*, *fna*, *fasta*, *gz*, *fa.gz*, *fna.gz*, *fasta.gz*.
2. [KMC-generated](https://github.com/refresh-bio/KMC) k-mer files (`-from-kmers` switch specified). A set of two KMC files (*.kmc_pre* + *.kmc_suf*) is required for every list entry.
3. minhashed k-mers (`-from-minhash` switch specified). Minhashed k-mer files (*.minhash*) must be generated by `minhash` command [prior to the database construction](#24-storing-minhashed-k-mers).
Note, that minhashing may be also done during the database construction by specyfying `-f` option.
* `database` (output) - file with generated k-mer database,
* `-k <kmer-length>` - length of k-mers (default: 18); ignored when `-from-kmers` or `-from-minhash` switch is specified,
* `-f <fraction>` - fraction of all k-mers to be accepted by the minhash filter during database construction (default: 1); ignored when `-from-minhash` switch is present,
* `-multisample-fasta` - each sequence in a FASTA file is treated as a separate sample,
* `-extend` - extend the existing database with new samples,
* `-alphabet <alphabet_type>` - alphabet:
    * `nt` (4 symbol nucleotide with indistinguishable T/U; default),
    * `aa` (20 symbol amino acid),
    * `aa12_mmseqs` (amino acid reduced to 12 symbols as in MMseqs: AST,C,DN,EQ,FY,G,H,IV,KR,LM,P,W),
    * `aa11_diamond` (amino acid reduced to 11 symbols as in Diamond: KREDQN,C,G,H,ILV,M,F,Y,W,P,STA),
    * `aa6_dayhoff` (amino acid reduced to 6 symbols as proposed by Dayhoff: STPAG,NDEQ,HRK,MILV,FYW,C),
* `-preserve-strand` - preserve strand instead of taking canonical k-mers (allowed only in `nt` alphabet; default: off),
* `-t <threads>` - number of threads (default: number of available cores).
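A sample list for `build` is a plain text file with one entry per line, so it can be generated with a few lines of shell. The helper below is a minimal sketch: the function name `make_sample_list` and the restriction to `*.fasta` files are assumptions made for illustration. It writes paths exactly as they are given on the command line.

```bash
# Hypothetical helper: write a newline-separated sample list from all
# *.fasta files found directly in a directory (sorted by the shell glob).
make_sample_list() {
    dir="$1"
    list="$2"
    : > "$list"                      # create or truncate the list file
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue      # skip the unexpanded pattern if nothing matches
        echo "$f" >> "$list"
    done
}
```

For example, `make_sample_list ./genomes ./genomes.list` (both paths are placeholders) produces a file that can then be passed as the `samples` argument of `kmer-db build`.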
## 2.2. Counting common k-mers
### Samples in the database against each other:
Dense computations - recommended when the distance matrix contains few zeros. Output can be stored in the dense or sparse form (`-sparse` switch).
`kmer-db all2all [-buffer <size_mb>] [-t <threads>] [-sparse [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* ] <database> <common_table>`
Sparse computations - recommended when the distance matrix contains many zeros. Output matrix is always in the sparse form:
`kmer-db all2all-sp [-buffer <size_mb>] [-t <threads>] [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* [-sample-rows [<criterion>:]<count>] <database> <common_table>`
Sparse computations, partial databases - use when the distance matrix contains many zeros and there are multiple partial databases. Output matrix is always in the sparse form:
`kmer-db all2all-parts [-buffer <size_mb>] [-t <threads>] [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* [-sample-rows [<criterion>:]<count>] <db_list> <common_table>`
Parameters:
* `database` (input) - k-mer database file created by `build` mode,
* `db_list` (input) - file containing a list of database files created by `build` mode,
* `common_table` (output) - file containing table with common k-mer counts,
* `-buffer <size_mb>` - size of cache buffer in megabytes; use L3 size for Intel CPUs and L2 for AMD for best performance; default: 8,
* `-t <threads>` - number of threads (default: number of available cores),
* `-sparse` - stores output matrix in a sparse form (always on in `all2all-sp` and `all2all-parts` modes),
* `-min [<criterion>:]<value>` - retains elements with `criterion` greater than or equal to `value` (see details below),
* `-max [<criterion>:]<value>` - retains elements with `criterion` lower than or equal to `value` (see details below),
* `-sample-rows [<criterion>:]<count>` - retains `count` elements in every row using one of the strategies: (i) random selection (no `criterion`); (ii) the best elements with respect to `criterion`.

`criterion` can be `num-kmers` (number of common k-mers) or one of the distance/similarity measures: `jaccard`, `min`, `max`, `cosine`, `mash`, `ani`, `ani-shorter` (see 2.3 for definitions). No `criterion` indicates `num-kmers` (filtering) or random elements selection (sampling). Multiple filters can be combined.
### New samples against the database:
`kmer-db new2all [-multisample-fasta | -from-kmers | -from-minhash] [-t <threads>] [-sparse [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* ] <database> <samples> <common_table>`
Parameters:
* `database` (input) - k-mer database file created by `build` mode,
* `samples` (input) - file containing samples in one of the supported formats (see `build` mode); if samples are given as genomes (default) or k-mers (`-from-kmers` switch), the minhashing is done automatically with the same filter as in the database,
* `common_table` (output) - file containing table with common k-mer counts,
* `-multisample-fasta` / `-from-kmers` / `-from-minhash` - see `build` mode for details,
* `-t <threads>` - number of threads (default: number of available cores),
* `-sparse` - stores output matrix in a sparse form,
* `-min [<criterion>:]<value>` - retains elements with `criterion` greater than or equal to `value` (see details below),
* `-max [<criterion>:]<value>` - retains elements with `criterion` lower than or equal to `value` (see details below).

`criterion` can be `num-kmers` (number of common k-mers) or one of the distance/similarity measures: `jaccard`, `min`, `max`, `cosine`, `mash`, `ani`, `ani-shorter` (see 2.3 for definitions). No `criterion` indicates `num-kmers`. Multiple filters can be combined.
### Single sample against the database:
`kmer-db one2all [-from-kmers | -from-minhash] [-t <threads>] <database> <sample> <common_table>`
The meaning of the parameters is the same as in `new2all` mode, but instead of specifying a file with a sample list, a single sample file is used as a query.
### Output format
Modes `all2all`, `all2all-sp`, `all2all-parts`, `new2all`, and `one2all` produce a comma-separated table with numbers of common k-mers. For `all2all`, `new2all`, and `one2all` modes, the table is by default stored in a dense form:
| | | | | | |
| :---: | :---: | :---: | :---: | :---: | :---: |
| kmer-length: *k* fraction: *f* | db-samples | *s1* | *s2* | ... | *sn* |
| query-samples | total-kmers | \|*s1*\| | \|*s2*\| | ... | \|*sn*\| |
| *q1* | \|*q1*\| | \|*q1 ∩ s1*\| | \|*q1 ∩ s2*\| | ... | \|*q1 ∩ sn*\| |
| *q2* | \|*q2*\| | \|*q2 ∩ s1*\| | \|*q2 ∩ s2*\| | ... | \|*q2 ∩ sn*\| |
| ... | ... | ... | ... | ... | ... |
| *qm* | \|*qm*\| | \|*qm ∩ s1*\| | \|*qm ∩ s2*\| | ... | \|*qm ∩ sn*\| |

where:
* *k* - k-mer length,
* *f* - minhash fraction (1, when minhashing is disabled),
* *s1*, *s2*, ..., *sn* - database sample names,
* *q1*, *q2*, ..., *qm* - query sample names,
* |*a*| - number of k-mers in sample *a*,
* |*a ∩ b*| - number of k-mers common for samples *a* and *b*.
When `-sparse` switch is specified or `all2all-sp`, `all2all-parts` modes are used, the table is stored in a sparse form. In particular, zeros are omitted while non-zero elements are represented as pairs (*column_id*: *value*) with 1-based column indexing. Thus, rows may have different number of elements, e.g.:
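| | | | | | |
| :---: | :---: | :---: | :---: | :---: | :---: |
| kmer-length: *k* fraction: *f* | db-samples | *s1* | *s2* | ... | *sn* |
| query-samples | total-kmers | \|*s1*\| | \|*s2*\| | ... | \|*sn*\| |
| *q1* | \|*q1*\| | *i<sub>11</sub>*: \|*q1 ∩ s<sub>i<sub>11</sub></sub>*\| | *i<sub>12</sub>*: \|*q1 ∩ s<sub>i<sub>12</sub></sub>*\| | | |
| *q2* | \|*q2*\| | *i<sub>21</sub>*: \|*q2 ∩ s<sub>i<sub>21</sub></sub>*\| | *i<sub>22</sub>*: \|*q2 ∩ s<sub>i<sub>22</sub></sub>*\| | *i<sub>23</sub>*: \|*q2 ∩ s<sub>i<sub>23</sub></sub>*\| | |
| *q3* | \|*q3*\| | | | | |
| ... | ... | ... | | | |
| *qm* | \|*qm*\| | *i<sub>m1</sub>*: \|*qm ∩ s<sub>i<sub>m1</sub></sub>*\| | | | |
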
For performance reasons, `all2all`, `all2all-sp`, and `all2all-parts` modes produce a lower triangular matrix.
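Downstream tools often expect one pair per line rather than a sparse matrix. The sketch below converts a sparse table into `query,db-sample,count` triples with `awk`. It assumes the layout described above (two header rows, then `column_id:value` pairs with 1-based column ids) and that sample names contain neither commas nor colons. The function name `sparse_to_triples` is hypothetical.

```bash
# Hypothetical helper: flatten a sparse common-k-mer table into
# "query,db-sample,count" lines.
sparse_to_triples() {
    awk -F',' '
        # Row 1: database sample names start at the third field.
        NR == 1 { for (i = 3; i <= NF; i++) name[i - 2] = $i; next }
        # Row 2: total k-mer counts of database samples (not needed here).
        NR == 2 { next }
        {
            for (i = 3; i <= NF; i++) {
                if ($i !~ /:/) continue          # skip empty fields
                split($i, kv, ":")
                gsub(/ /, "", kv[1])             # tolerate spaces around the colon
                gsub(/ /, "", kv[2])
                print $1 "," name[kv[1] + 0] "," kv[2]
            }
        }
    ' "$1"
}
```

Rows without any non-zero element simply produce no output lines.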
## 2.3. Calculating similarities or distances
`kmer-db distance <measure> [-sparse [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* ] <common_table> <output_table>`
Parameters:
* `measure` - name of the similarity/distance measure to be calculated; can be one of the following:
    * `jaccard`: $J(q,s) = |q \cap s| / |q \cup s|$,
    * `min`: $\min(q,s) = |q \cap s| / \min(|q|,|s|)$,
    * `max`: $\max(q,s) = |q \cap s| / \max(|q|,|s|)$,
    * `cosine`: $\cos(q,s) = |q \cap s| / \sqrt{|q| \cdot |s|}$,
    * `mash` (Mash distance): $\textrm{Mash}(q,s) = -\frac{1}{k}\ln\frac{2 \cdot J(q,s)}{1 + J(q,s)}$,
    * `ani` (average nucleotide identity): $\textrm{ANI}(q,s) = 1 - \textrm{Mash}(q,s)$,
    * `ani-shorter` - same as `ani` but with `min` used instead of `jaccard`.
* `common_table` (input) - file containing table with numbers of common k-mers produced by `all2all`, `new2all`, or `one2all` mode (both dense and sparse matrices are supported),
* `output_table` (output) - file containing table with calculated distance measure,
* `-phylip-out` - store output distance matrix in a Phylip format,
* `-sparse` - outputs a sparse matrix (only for dense input matrices - sparse inputs always produce sparse outputs),
* `-min [<criterion>:]<value>` - retains elements with `criterion` greater than or equal to `value` (see details below),
* `-max [<criterion>:]<value>` - retains elements with `criterion` lower than or equal to `value` (see details below).

`criterion` can be `num-kmers` (number of common k-mers) or one of the distance/similarity measures: `jaccard`, `min`, `max`, `cosine`, `mash`, `ani`, `ani-shorter` (see the definitions above). If no `criterion` is specified, `measure` argument is used by default. Multiple filters can be combined.
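To check a single table entry by hand, the measures can be recomputed from three numbers: the count of common k-mers and the total k-mer counts of the two samples. The sketch below is an illustration rather than the tool's implementation. The function name `kmer_measure` and its argument order are assumptions. It relies on the identity that the size of the union equals the sum of the sample sizes minus the size of the intersection, and it expects a non-zero number of common k-mers for the logarithm-based measures.

```bash
# Hypothetical helper: kmer_measure <measure> <common> <size_q> <size_s> <k>
kmer_measure() {
    awk -v m="$1" -v c="$2" -v q="$3" -v s="$4" -v k="$5" 'BEGIN {
        j  = c / (q + s - c)                 # Jaccard: intersection over union
        lo = (q < s) ? q : s                 # size of the smaller sample
        hi = (q > s) ? q : s                 # size of the larger sample
        if      (m == "jaccard")     v = j
        else if (m == "min")         v = c / lo
        else if (m == "max")         v = c / hi
        else if (m == "cosine")      v = c / sqrt(q * s)
        else if (m == "mash")        v = -log(2 * j / (1 + j)) / k
        else if (m == "ani")         v = 1 + log(2 * j / (1 + j)) / k
        else if (m == "ani-shorter") v = 1 + log(2 * (c / lo) / (1 + c / lo)) / k
        else { print "unknown measure: " m > "/dev/stderr"; exit 1 }
        printf "%.4f\n", v
    }'
}
```

For instance, `kmer_measure jaccard 50 100 150 18` prints `0.2500`, because 50 common k-mers out of 100 + 150 - 50 = 200 distinct ones give a Jaccard index of 0.25.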
## 2.4. Storing minhashed k-mers
This is an optional analysis step which stores minhashed k-mers on the hard disk to be later consumed by `build`, `new2all`, or `one2all` modes with `-from-minhash` switch. It can be skipped if one wants to use all k-mers from samples for distance estimation or employs minhashing during database construction. Syntax:
`kmer-db minhash [-f <fraction>] [-k <kmer-length>] [-multisample-fasta] [-alphabet <alphabet_type>] [-preserve-strand] <sample_list>`
`kmer-db minhash -from-kmers [-f <fraction>] <sample_list>`
Parameters:
* `sample_list` (input) - file containing list of samples in one of the supported formats (see `build` mode),
* `-f <fraction>` - fraction of all k-mers to be accepted by the minhash filter (default: 0.01),
* `-k <kmer-length>` - length of k-mers (default: 18; maximum: 30); ignored when `-from-kmers` switch is specified,
* `-multisample-fasta` / `-from-kmers` - see `build` mode for details,
* `-alphabet <alphabet_type>` - alphabet:
    * `nt` (4 symbol nucleotide with indistinguishable T/U; default),
    * `aa` (20 symbol amino acid),
    * `aa12_mmseqs` (amino acid reduced to 12 symbols as in MMseqs: AST,C,DN,EQ,FY,G,H,IV,KR,LM,P,W),
    * `aa11_diamond` (amino acid reduced to 11 symbols as in Diamond: KREDQN,C,G,H,ILV,M,F,Y,W,P,STA),
    * `aa6_dayhoff` (amino acid reduced to 6 symbols as proposed by Dayhoff: STPAG,NDEQ,HRK,MILV,FYW,C),
* `-preserve-strand` - preserve strand instead of taking canonical k-mers (allowed only in `nt` alphabet; default: off).

For each sample from the list, a binary file with *.minhash* extension containing filtered k-mers is created.
# 3. Datasets
The list of pathogens investigated in the Kmer-db study can be found [here](https://github.com/refresh-bio/kmer-db/tree/master/data).
## Citing
[Deorowicz, S., Gudyś, A., Długosz, M., Kokot, M., Danek, A. (2019) Kmer-db: instant evolutionary distance estimation, Bioinformatics, 35(1): 133–136](https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty610/5050791?redirectedFrom=fulltext)
[Zielezinski A, Gudyś A, Barylski J, Siminski K, Rozwalak P, Dutilh BE, Deorowicz S. Ultrafast and accurate sequence alignment and clustering of viral genomes. Nat Methods. https://doi.org/10.1038/s41592-025-02701-7](https://doi.org/10.1038/s41592-025-02701-7)