https://github.com/refresh-bio/kmer-db

Kmer-db is a fast and memory-efficient tool for large-scale k-mer analyses (indexing, querying, estimating evolutionary relationships, etc.).
https://github.com/refresh-bio/kmer-db
bioinformatics genomics indexing k-mer sequence-similarity
Last synced: 4 months ago
JSON representation
Kmer-db is a fast and memory-efficient tool for large-scale k-mer analyses (indexing, querying, estimating evolutionary relationships, etc.).
Host: GitHub
URL: https://github.com/refresh-bio/kmer-db
Owner: refresh-bio
License: gpl-3.0
Created: 2018-02-09T12:57:04.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2025-05-20T09:19:14.000Z (about 1 year ago)
Last Synced: 2025-05-20T10:31:49.390Z (about 1 year ago)
Topics: bioinformatics, genomics, indexing, k-mer, sequence-similarity
Language: C++
Homepage:
Size: 5.6 MB
Stars: 89
Watchers: 6
Forks: 18
Open Issues: 11
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # Kmer-db

[![GitHub downloads](https://img.shields.io/github/downloads/refresh-bio/kmer-db/total.svg?style=flag&label=GitHub%20downloads)](https://github.com/refresh-bio/kmer-db/releases)

[![Bioconda downloads](https://img.shields.io/conda/dn/bioconda/kmer-db.svg?style=flag&label=Bioconda%20downloads)](https://anaconda.org/bioconda/kmer-db)

[![Build and tests](../../workflows/Build%20and%20tests/badge.svg)](../../actions/workflows/main.yml)

[![License](https://anaconda.org/bioconda/famsa/badges/license.svg)](https://www.gnu.org/licenses/gpl-3.0.html)

![x86-64](https://img.shields.io/static/v1?label=%E2%80%8B&message=x86-64&color=yellow&logo=PCGamingWiki&logoColor=white)

![ARM](https://img.shields.io/static/v1?label=%E2%80%8B&message=ARM&color=yellow&logo=Raspberry%20Pi&logoColor=white)

![Apple M](https://img.shields.io/static/v1?label=%E2%80%8B&message=Apple%20M&color=yellow&logo=Apple&logoColor=white)

![Windows](https://img.shields.io/badge/%E2%80%8B-Windows-00A98F?logo=windows)

![Linux](https://img.shields.io/static/v1?label=%E2%80%8B&message=Linux&color=00A98F&logo=linux&logoColor=white)

![macOS](https://img.shields.io/badge/%E2%80%8B-macOS-00A98F?logo=apple)

Kmer-db is a fast and memory-efficient tool for large-scale k-mer analyses (indexing, querying, estimating evolutionary relationships, etc.). 

## Quick start

```bash

git clone --recurse-submodules https://github.com/refresh-bio/kmer-db

cd kmer-db && gmake

INPUT=./test/virus

OUTPUT=./output

mkdir $OUTPUT

# build a database from all 18-mers (default) contained in a set of sequences

./bin/kmer-db build $INPUT/seqs.part1.list $OUTPUT/k18.db

# establish numbers of common k-mers between new sequences and the database

./bin/kmer-db new2all $OUTPUT/k18.db $INPUT/seqs.part2.list $OUTPUT/n2a.csv

# calculate jaccard index from common k-mers

./bin/kmer-db distance jaccard $OUTPUT/n2a.csv $OUTPUT/n2a.jaccard

# extend the database with new sequences

./bin/kmer-db build -extend $INPUT/seqs.part2.list $OUTPUT/k18.db

# establish numbers of common k-mers between all sequences in the database

./bin/kmer-db all2all $OUTPUT/k18.db $OUTPUT/a2a.csv

# build a database from 10% of 25-mers using 16 threads

./bin/kmer-db build -k 25 -f 0.1 -t 16 $INPUT/seqs.part1.list $OUTPUT/k25.db

# establish number of common 25-mers between single sequence and the database 

# (minhash filtering that retains 10% of MT159713 k-mers is done prior to the comparison)  

./bin/kmer-db one2all $OUTPUT/k25.db $INPUT/data/MT159713.fasta $OUTPUT/MT159713.csv

# build two partial databases

./bin/kmer-db build $INPUT/seqs.part1.list  $OUTPUT/k18.parts1.db

./bin/kmer-db build $INPUT/seqs.part2.list  $OUTPUT/k18.parts2.db

# establish numbers of common k-mers between all sequences in the databases,

# computations are done in the sparse mode, the output matrix is also sparse

echo $OUTPUT/k18.parts1.db > $OUTPUT/db.list

echo $OUTPUT/k18.parts2.db >> $OUTPUT/db.list

./bin/kmer-db all2all-parts $OUTPUT/db.list $OUTPUT/k18.parts.csv

```

### Table of contents

1. [Installation](#1-installation)

2. [Usage](#2-usage)

    1. [Building a database](#21-building-a-database)

    2. [Counting common k-mers](#22-counting-common-k-mers)

    3. [Calculating similarities or distances](#23-calculating-similarities-or-distances)

    4. [Storing minhashed k-mers](#24-storing-minhashed-k-mers)

3. [Datasets](#3-datasets)

# 1. Installation

Kmer-db comes with a set of [precompiled binaries](https://github.com/refresh-bio/kmer-db/releases) for Linux, macOS, and Windows. 

The software is also available on [Bioconda](https://anaconda.org/bioconda/kmer-db):

```

conda install -c bioconda kmer-db

```

For detailed instructions how to set up Bioconda, please refer to the [Bioconda manual](https://bioconda.github.io/).

Kmer-db can be also built from the sources distributed as:

* GNU Make project for Linux and macOS (gmake 4.3 and gcc/g++ 11 or newer required),

* Visual Studio 2022 solution for Windows.

## Vector extensions

Kmer-db can be built for x86-64 and ARM64 8 architectures (including Apple Mx based on ARM64 8.4 core) and takes advantage of AVX2 (x86-64) and NEON (ARM) CPU extensions. The default target platform is x86-64 with AVX2 extensions. This, however, can be changed by setting `PLATFORM` variable for `make`:

```bash

make PLATFORM=none    # unspecified platform, no extensions

make PLATFORM=sse2    # x86-64 with SSE2 

make PLATFORM=avx     # x86-64 with AVX 

make PLATFORM=avx2    # x86-64 with AVX2 (default)

make PLATFORM=native  # x86-64 with AVX2 and native architecture

make PLATFORM=arm8    # ARM64 8 with NEON  

make PLATFORM=m1      # ARM64 8.4 (especially Apple M1) with NEON 

```   

Note, that x86-64 binaries determine the supported extensions at runtime, which makes them backwards-compatible. For instance, the AVX executable will also work on SSE-only platform, but with limited performance.

# 2. Usage

`kmer-db  [options] `

Kmer-db operates in one of the following modes:

* `build` - building a database from samples,

* `all2all` - counting common k-mers - all samples in the database,

* `all2all-sp` - counting common k-mers - all samples in the database (sparse computation),

* `all2all-parts` - counting common k-mers - all samples within from databases (sparse computation),

* `new2all` - counting common k-mers - set of new samples versus database,

* `one2all` - counting common k-mers - single sample versus database,

* `distance` - calculating similarities/distances,

* `minhash` - storing minhashed k-mers.

    

Common options:

* `-t ` - number of threads (default: number of available cores),

    

The meaning of other options and positional arguments depends on the selected mode.

    

## 2.1. Building a database

Construction of k-mers database is an obligatory step for further analyses. The procedure accepts several input types:

* compressed or uncompressed genomes/reads:

    ```kmer-db build [-k ] [-f ] [-multisample-fasta] [-extend] [-alphabet ] [-preserve-strand] [-t ]  ```

* [KMC-generated](https://github.com/refresh-bio/KMC) k-mers: 

    ```kmer-db build -from-kmers [-f ] [-extend] [-t ]  ```

  

* [minhashed k-mers](#24-storing-minhashed-k-mers) produced by `minhash` mode:

    ```kmer-db build -from-minhash [-extend] [-t ]  ```

Parameters:

 * `samples` (input) - one of the following: 

    * FASTA file (*fa*, *fna*, *fasta*, *fa.gz*, *fna.gz*, *fasta.gz*) with one or multiple (`-multisample-fasta` switch) samples

    * file with a newline-separated list of samples:

      ```

      sample_file_1

      sample_file_2

      sample_file_3

      ...

      ```

   Every file can be in one of the formats:

     1. FASTA genomes/reads (default). If a file on the list cannot be found, the following extensions are tested: *fa*, *fna*, *fasta*, *gz*, *fa.gz*, *fna.gz*, *fasta.gz*.

     2. [KMC-generated](https://github.com/refresh-bio/KMC) k-mer files (`-from-kmers` switch specified). A set of two KMC files (*.kmc_pre* + *.kmc_suf*) is required for every list entry.

     3. minhashed k-mers (`-from-minhash` switch specified). Minhashed k-mer files (*.minhash*) must be generated by `minhash` command [prior to the database construction](#24-storing-minhashed-k-mers).  

	Note, that minhashing may be also done during the database construction by specyfying `-f` option.

* `database` (output) - file with generated k-mer database, 

* `-k ` - length of k-mers (default: 18); ignored when `-from-kmers` or `-from-minhash` switch is specified,

* `-f ` - fraction of all k-mers to be accepted by the minhash filter during database construction (default: 1); ignored when `-from-minhash` switch is present,

* `-multisample-fasta` - each sequence in a FASTA file is treated as a separate sample,

* `-extend` - extend the existing database with new samples,

* `-alphabet` - alphabet:

  * `nt` (4 symbol nucleotide with indistinguishable T/U; default) 

  * `aa` (20 symbol amino acid)

  * `aa12_mmseqs` (amino acid reduced to 12 symbols as in MMseqs: AST,C,DN,EQ,FY,G,H,IV,KR,LM,P,W

  * `aa11_diamond` (amino acid reduced to 11 symbols as in Diamond: KREDQN,C,G,H,ILV,M,F,Y,W,P,STA

  * `aa6_dayhoff` (amino acid reduced to 6 symbols as proposed by Dayhoff: STPAG,NDEQ,HRK,MILV,FYW,C

* `-preserve-strand`- preserve strand instead of taking canonical k-mers (allowed only in `nt` alphabet; default: off)

* `-t ` - number of threads (default: number of available cores).

## 2.2. Counting common k-mers 

### Samples in the database against each other:

Dense computations - recomended when the distance matrix contains few zeros. Output can be stored in the dense or sparse form (`-sparse` switch).

`kmer-db all2all [-buffer ] [-t ] [-sparse [-min [:]]* [-max [:]]* ]  `

 

Sparse computations - recommended when the distance matrix contains many zeros. Output matrix is always in the sparse form:

`kmer-db all2all-sp [-buffer ] [-t ] [-min [:]]* [-max [:]]* [-sample-rows [:]]  `

Sparse computations, partial databases - use when the distance matrix contains many zeros and there are multiple partial databases. Output matrix is always in the sparse form:

`kmer-db all2all-parts [-buffer ] [-t ] [-min [:]]* [-max [:]]* [-sample-rows [:]]  `

 

Parameters:

* `database` (input) - k-mer database file created by `build` mode,

* `db_list` (input) - file containing list of databases files created by `build` mode,

* `common_table` (output) - file containing table with common k-mer counts,

* `-buffer ` - size of cache buffer in megabytes; use L3 size for Intel CPUs and L2 for AMD for best performance; default: 8,

* `-t ` - number of threads (default: number of available cores),

* `-sparse` - stores output matrix in a sparse form (always on in `all2all-sp` and `all2all-parts` modes),

* `-min [:]` - retains elements with `criterion` greater than or equal to `value` (see details below), 

* `-max [:]` - retains elements with `criterion` lower than or equal to `value` (see details below),

* `-sample-rows [:]` - retains `count` elements in every row using one of the strategies: (i) random selection (no `criterion`); (ii) the best elements with respect to `criterion`.

`criterion` can be `num-kmers` (number of common k-mers) or one of the distance/similarity measures: `jaccard`, `min`, `max`, `cosine`, `mash`, `ani`, `ani-shorder` (see 2.3 for definitions). No `criterion` indicates `num-kmers` (filtering) or random elements selection (sampling). Multiple filters can be combined. 

### New samples against the database:

`kmer-db new2all [-multisample-fasta | -from-kmers | -from-minhash] [-t ]  [-sparse [-min [:]]* [-max [:]]* ]   `

Parameters:

* `database` (input) - k-mer database file created by `build` mode,

* `samples` (input) - file containing samples in one of the supported formats (see `build` mode); if samples are given as genomes (default) or k-mers (`-from-kmers` switch), the minhashing is done automatically with the same filter as in the database,

* `common_table` (output) - file containing table with common k-mer counts,

* `-multisample-fasta` / `-from-kmers` / `-from-minhash` - see `build` mode for details,

* `-t ` - number of threads (default: number of available cores),

* `-sparse` - stores output matrix in a sparse form,

* `-min [:]` - retains elements with `criterion` greater than or equal to `value` (see details below), 

* `-max [:]` - retains elements with `criterion` lower than or equal to `value` (see details below),

`criterion` can be `num-kmers` (number of common k-mers) or one of the distance/similarity measures: `jaccard`, `min`, `max`, `cosine`, `mash`, `ani`, `ani-shorder` (see 2.3 for definitions). No `criterion` indicates `num-kmers`. Multiple filters can be combined.

 

### Single sample against the database:

`kmer-db one2all [-from-kmers | -from-minhash] [-t ]   `

The meaning of the parameters is the same as in `new2all` mode, but instead of specifying file with sample list, a single sample file is used as a query.

### Output format

Modes `all2all`, `all2all-sp`, `all2all-parts`, `new2all`, and `one2all` produce a comma-separated table with numbers of common k-mers. For `all2all`, `new2all`, and `one2all` modes, the table is by default stored in a dense form:

| 									| 								| 					| 					|		|			|	

| :---: 							| :---: 						| :---: 			| :---:				| :---:	|  :---:	| 

| kmer-length: *k* fraction: *f* 	| db-samples 					| *s₁*									| *s₂* 										| ... 	|  *s_n* |

| query-samples 					| total-kmers 					| |*s₁*|						| |*s₂*| 							| ... 	|  |*s_n*| |

| *q₁* 					| |*q₁*|	| |*q₁ ∩ s₁*|	| |*q₁ ∩ s₂*| 	| ... 	|  |*q₁ ∩ s_n*| |

| *q₂* 					| |*q₂*|	| |*q₂ ∩ s₁*|	| |*q₂ ∩ s₂*| 	| ... 	|  |*q₂ ∩ s_n*| |

| ... 								| ...							| ...												| ...													| ...	|	...												 |

| *q_m* 					| |*q_m*|	| |*q_m ∩ s₁*|	| |*q_m ∩ s₂*| 	| ... 	|  |*q_m ∩ s_n*| |

where:

* *k* - k-mer length,

* *f* - minhash fraction (1, when minhashing is disabled), 

* *s₁*, *s₂*,  ...,   *s_n* - database sample names,

* *q₁*, *q₂*,  ...,   *q_m* - query sample names,

* |*a*| - number of k-mers in sample *a*,

* |*a ∩ b*| - number of k-mers common for samples *a* and *b*.

When `-sparse` switch is specified or `all2all-sp`, `all2all-parts` modes are used, the table is stored in a sparse form. In particular, zeros are omitted while non-zero elements are represented as pairs (*column_id*: *value*) with 1-based column indexing. Thus, rows may have different number of elements, e.g.:

| 									| 								| 					| 				|		|			|	

| :---: 							| :---: 						| :---: 			| :---:			| :---:	|  :---:	| 

| kmer-length: *k* fraction: *f* 	| db-samples 					| *s₁*					| *s₂* | ... 	|  *s_n* |

| query-samples 					| total-kmers 					| |*s₁*|		| |*s₂*| 	| ... 	|  |*s_n*| |

| *q₁* 					| |*q₁*|	| *i₁₁*: |*q₁ ∩ s_i₁₁*|	| *i₁₂*: |*q₁ ∩ s_i₁₂*| | ||

| *q₂* 					| |*q₂*|	| *i₂₁*: |*q₂ ∩ s_i₂₁*|	| *i₂₂*: |*q₂ ∩ s_i₂₂*| 	| *i₂₃*: |*q₂ ∩ s_i₂₃*|  	| |   

| *q₂* 					| |*q₂*|	| ||||

| ... 								| ...							| ... ||||

| *q_m* 					| |*q_m*|	| *i_m1*: |*q_m ∩ s_{i_m1}*|	| |||

For performance reasons, `all2all`, `all2all-sp`, and `all2all-parts` modes produce a lower triangular matrix.

 

 ## 2.3. Calculating similarities or distances

`kmer-db distance  [-sparse [-min [:]]* [-max [:]]* ]  `

Parameters:

* `measure` - names of the similarity/distance measure to be calculated, can be one of the following: 

  * `jaccard`: $J(q,s) = |p \cap q| / |p \cup q|$, 

  * `min`: $\min(q,s) =  |p \cap q| / \min(|p|,|q|)$, 

  * `max`: $\max(q,s) =  |p \cap q| / \max(|p|,|q|)$, 

  * `cosine`: $\cos(q,s) = |p \cap q| / \sqrt{|p| \cdot |q|}$,  

  * `mash` (Mash distance): $\textrm{Mash}(q,s) = -\frac{1}{k}ln\frac{2 \cdot J(q,s)}{1 + J(q,s)}$, 

  * `ani` (average nucleotide identity): $\textrm{ANI}(q,s) = 1 - \textrm{Mash}(p,q)$,

  * `ani-shorter` - same as `ani` but with `min` used instead of `jaccard`.

* `common_table` (input) - file containing table with numbers of common k-mers produced by `all2all`, `new2all`, or `one2all` mode (both, dense and sparse matrices are supported), 

* `output_table` (output) - file containing table with calculated distance measure,  

* `-phylip-out` - store output distance matrix in a Phylip format,

* `-sparse` - outputs a sparse matrix (only for dense input matrices - sparse inputs always produce sparse outputs),

* `-min [:]` - retains elements with `criterion` greater than or equal to `value` (see details below), 

* `-max [:]` - retains elements with `criterion` lower than or equal to `value` (see details below),

`criterion` can be `num-kmers` (number of common k-mers) or one of the distance/similarity measures: `jaccard`, `min`, `max`, `cosine`, `mash`, `ani`, `ani-shorder` (see 2.3 for definitions). If no `criterion` is specified, `measure` argument is used by default. Multiple filters can be combined.

 

        

## 2.4. Storing minhashed k-mers

This is an optional analysis step which stores minhashed k-mers on the hard disk to be later consumed by `build`, `new2all`, or `one2all` modes with `-from-minhash` switch. It can be skipped if one wants to use all k-mers from samples for distance estimation or employs minhashing during database construction. Syntax:

`kmer-db minhash [-f ] [-k ] [-multisample-fasta] [-alphabet ] [-preserve-strand] `

`kmer-db minhash -from-kmers [-f ] `

Parameters:

 * `sample_list` (input) - file containing list of samples in one of the supported formats (see `build` mode), 

 * `-f ` - fraction of all k-mers to be accepted by the minhash filter (default: 0.01),

 * `-k ` - length of k-mers (default: 18; maximum: 30); ignored when `-from-kmers` switch is specified,

 * `-multisample-fasta` / `-from-kmers` - see `build` mode for details.

 * `-alphabet` - alphabet:

   * `nt` (4 symbol nucleotide with indistinguishable T/U; default) 

   * `aa` (20 symbol amino acid)

   * `aa12_mmseqs` (amino acid reduced to 12 symbols as in MMseqs: AST,C,DN,EQ,FY,G,H,IV,KR,LM,P,W

   * `aa11_diamond` (amino acid reduced to 11 symbols as in Diamond: KREDQN,C,G,H,ILV,M,F,Y,W,P,STA

   * `aa6_dayhoff` (amino acid reduced to 6 symbols as proposed by Dayhoff: STPAG,NDEQ,HRK,MILV,FYW,C

* `-preserve-strand`- preserve strand instead of taking canonical k-mers (allowed only in `nt` alphabet; default: off)

For each sample from the list, a binary file with *.minhash* extension containing filtered k-mers is created.

# 3. Datasets

List of the pathogens investigated in Kmer-db study can be found [here](https://github.com/refresh-bio/kmer-db/tree/master/data)

## Citing

[Deorowicz, S., Gudyś, A., Długosz, M., Kokot, M., Danek, A. (2019) Kmer-db: instant evolutionary distance estimation, Bioinformatics, 35(1): 133–136](https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty610/5050791?redirectedFrom=fulltext)

[Zielezinski A, Gudyś A, Barylski J, Siminski K, Rozwalak P, Dutilh BE, Deorowicz S. Ultrafast and accurate sequence alignment and clustering of viral genomes. Nat Methods. https://doi.org/10.1038/s41592-025-02701-7](https://doi.org/10.1038/s41592-025-02701-7)
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/refresh-bio/kmer-db

Awesome Lists containing this project

README