https://github.com/alexpreynolds/kmer-counter

Count kmers with a more efficient (faster) hash table
bioinformatics cpp14 genomics hash-map kmer kmer-counting

Host: GitHub
URL: https://github.com/alexpreynolds/kmer-counter
Owner: alexpreynolds
License: mit
Created: 2017-01-31T21:15:32.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2024-06-08T04:09:33.000Z (over 1 year ago)
Last Synced: 2025-03-24T21:05:07.958Z (7 months ago)
Topics: bioinformatics, cpp14, genomics, hash-map, kmer, kmer-counting
Language: C++
Homepage:
Size: 44.9 KB
Stars: 24
Watchers: 2
Forks: 5
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff

README

# kmer-counter

Compilation
-----------

Run `make` to build the `kmer-counter` binary.

This has been compiled under Ubuntu 18.04.4, Cygwin 3.1.4, and Mac OS X 10.15.3, using the GCC/glibc and Clang toolchains current for each of those releases.

Usage
-----

Run `kmer-counter --help` for a list of options.

There are two ways to use `kmer-counter`.

1. You can provide a single-line FASTA input and write counts to standard output, *e.g.*:

```
$ ./kmer-counter --fasta --k=6 sequences.fa
>foo CGTTAA:1 TTAACG:1
>bar TTCTTA:1 TAGGGC:1 AAATTC:1 GTGGAA:1 AACTTC:1 ...
...
```
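As a rough illustration of what counting k-mers means, here is a minimal sketch that slides a window of length *k* across one sequence and tallies each substring. The function name `count_kmers` is hypothetical, and `std::unordered_map` stands in for the tool's hash table. This is not taken from the `kmer-counter` source and makes no attempt to reproduce its exact output (for instance, any handling of strands or non-ACGT characters).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper, not taken from the kmer-counter source:
// counts every overlapping window of length k in `sequence`.
std::unordered_map<std::string, std::size_t>
count_kmers(const std::string& sequence, std::size_t k) {
    assert(k > 0 && "k must be positive");
    std::unordered_map<std::string, std::size_t> counts;
    // The loop condition also covers sequences shorter than k:
    // the body never runs and an empty map is returned.
    for (std::size_t i = 0; i + k <= sequence.size(); ++i) {
        ++counts[sequence.substr(i, k)];
    }
    return counts;
}
```

A sequence of length *n* yields at most *n - k + 1* windows, so short sequences produce few (or no) entries.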

2. For a more complex use case, you can provide a four-column BED file in which the fourth column (*i.e.*, the ID field) holds the interval's genomic sequence. You also specify the number *k* for the k-mers you want to count, an *offset* value for mer-keys (explained below), and a *results directory* to write results to, *e.g.*:

```
$ ./kmer-counter --bed --k=6 --offset=12195 --results-dir="6mers" intervals.bed4
```

The above example counts the 6-mers in the sequences from the file `intervals.bed4`.

The results are stored in a folder called `6mers`, which contains two files: `count.bed` and `map.txt`.

The first file, `count.bed`, is a BED file of the intervals from `intervals.bed4`, where the fourth column contains space-delimited pairs, each giving a mer-key and the number of times that key is seen. Mer-keys are numbers which begin at the `offset` value provided on the command line.

The second file, `map.txt`, contains tab-delimited pairs of each mer and its mer-key, as used in `count.bed`.
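To make the mer-key idea concrete, the following is a minimal sketch of one way such keys could be handed out. It assumes keys are consecutive integers assigned in the order that distinct mers are first seen, and that each mapping line is written as mer, tab, key. Both are assumptions for illustration, as are the class name `MerKeyTable` and its methods; the tool's actual assignment order and file layout may differ.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: hands out consecutive integer keys, starting at
// `offset`, in the order that distinct mers are first seen.
class MerKeyTable {
public:
    explicit MerKeyTable(long offset) : next_key_(offset) {}

    // Returns the existing key for `mer`, or assigns the next unused one.
    long key_for(const std::string& mer) {
        auto it = keys_.find(mer);
        if (it != keys_.end()) return it->second;
        keys_.emplace(mer, next_key_);
        order_.push_back(mer);
        return next_key_++;
    }

    // Renders one "mer<TAB>key" line per distinct mer, in first-seen order.
    // The column order is an assumption, not read from a real map.txt.
    std::string to_map_text() const {
        assert(order_.size() == keys_.size());
        std::string text;
        for (const auto& mer : order_) {
            text += mer + '\t' + std::to_string(keys_.at(mer)) + '\n';
        }
        return text;
    }

private:
    long next_key_;                               // next key to hand out
    std::unordered_map<std::string, long> keys_;  // mer -> key
    std::vector<std::string> order_;              // mers in first-seen order
};
```

With an offset of 12195, the first distinct mer passed to `key_for` receives key 12195, the second 12196, and so on; asking for a mer already seen returns its earlier key.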

Notes
-----

I am using a [hash table](https://en.wikipedia.org/wiki/Hash_table) implementation from [Emil Ernerfeldt](https://github.com/emilk/emilib/blob/master/emilib/hash_map.hpp). A discussion about performance characteristics compared with the C++ STL `std::unordered_map` is [available from the author](http://www.ilikebigbits.com/blog/2016/8/28/designing-a-fast-hash-table).
