---
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
collapse = TRUE,
comment = "#>",
out.width = "100%"
)
```

```{r, include = FALSE}
library(seqR)
```

# seqR - fast and comprehensive k-mer counting package

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/seqR)](https://cran.r-project.org/package=seqR)
[![R build status](https://github.com/slowikj/seqR/workflows/R-CMD-check/badge.svg)](https://github.com/slowikj/seqR/actions)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![codecov.io](https://codecov.io/github/slowikj/seqR/coverage.svg?branch=master)](https://codecov.io/github/slowikj/seqR?branch=master)
[![Code Quality Status](https://www.code-inspector.com/project/23909/status/svg)](https://www.code-inspector.com/project/23909/status/svg)
[![Code Quality Score](https://www.code-inspector.com/project/23909/score/svg)](https://www.code-inspector.com/project/23909/score/svg)

## About

`seqR` is an R package for fast k-mer counting. It provides a

* **highly optimized** (the core algorithm is written in C++),
* **in-memory**,
* **probabilistic** (with a configurable dimensionality of the hash value used for storing k-mers internally),
* **multi-threaded** (with a configurable number of sequences, `batch_size`, processed in a single step; if `batch_size` equals 1, the multi-threaded mode is disabled, which may lengthen the computation)

implementation that supports

* **various variants of k-mers** (contiguous, gapped, and their positional counterparts),
* **all biological sequences** (e.g., nucleic acids and proteins).

Moreover, results are returned as **sparse matrices**
(see [package Matrix](https://CRAN.R-project.org/package=Matrix)),
which reduces memory consumption. These matrices are
compatible with machine learning packages
such as [ranger](https://CRAN.R-project.org/package=ranger)
and [xgboost](https://CRAN.R-project.org/package=xgboost).
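
To give a feel for why a sparse representation matters for k-mer counts, here is a minimal sketch that uses only the `Matrix` package, not `seqR` itself. The matrix dimensions and the number of non-zero entries are made-up values chosen for illustration.

```r
library(Matrix)

# Hypothetical count matrix: 200 sequences x 2000 possible k-mers,
# with only 500 non-zero entries (illustrative numbers, not seqR output).
set.seed(1)
n_rows <- 200
n_cols <- 2000
n_nonzero <- 500

sparse_counts <- sparseMatrix(
  i = sample(n_rows, n_nonzero, replace = TRUE),
  j = sample(n_cols, n_nonzero, replace = TRUE),
  x = rep(1, n_nonzero),  # repeated (i, j) pairs are summed
  dims = c(n_rows, n_cols)
)

# The same data stored densely, zeros included.
dense_counts <- as.matrix(sparse_counts)

object.size(sparse_counts)
object.size(dense_counts)
```

The sparse object stores only the non-zero entries, so the gap widens as the k-mer space grows.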

## How to...

### How to install

To install `seqR` from CRAN:

```{r, eval=FALSE}
install.packages("seqR")
```

Alternatively, if you want to use the latest development version:

```{r, eval=FALSE}
# install.packages("devtools")
devtools::install_github("slowikj/seqR")
```

### How to use

The package provides two functions that facilitate k-mer counting:

* `count_kmers` (used for counting k-mers of one type)
* `count_multimers` (a wrapper of `count_kmers`, used for counting k-mers of many types in a single invocation of the function)

and one function used for custom processing of k-mer matrices:

* `rbind_columnwise` (a helper function used for merging several k-mer matrices that do not have the same sets of columns)
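
The idea behind merging matrices with different column sets can be sketched in base R. This is an illustration only, not the implementation of `rbind_columnwise`: the helper `naive_rbind_columnwise` is hypothetical, works on dense matrices, and assumes that a k-mer absent from one matrix counts as zero there.

```r
# Row-bind two count matrices whose column (k-mer) sets differ,
# filling k-mers that are missing from one matrix with zeros.
# Hypothetical helper for illustration; not part of seqR.
naive_rbind_columnwise <- function(a, b) {
  all_cols <- union(colnames(a), colnames(b))
  expand <- function(m) {
    out <- matrix(0, nrow = nrow(m), ncol = length(all_cols),
                  dimnames = list(rownames(m), all_cols))
    out[, colnames(m)] <- m
    out
  }
  rbind(expand(a), expand(b))
}

a <- matrix(c(2, 1), nrow = 1, dimnames = list(NULL, c("AA", "AC")))
b <- matrix(c(3, 4), nrow = 1, dimnames = list(NULL, c("AC", "GG")))

merged <- naive_rbind_columnwise(a, b)
merged
```

The merged matrix has one column per k-mer seen in either input, so rows from different runs stay comparable.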

To learn more, see [features overview vignette](https://slowikj.github.io/seqR/articles/features-overview.html)
and [reference](https://slowikj.github.io/seqR/reference/index.html).

#### Examples

##### Counting 5-mers

```{r}
count_kmers(sequences = c("AAAAAVVAVFF", "DFGSADFGSA"),
            k = 5)
```
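
For readers new to the term, a contiguous k-mer is simply every substring of length k. The following base-R sketch spells that out; the helper `naive_count_kmers` is hypothetical and is not how `seqR` computes its result (the package uses a hashed C++ implementation and returns a sparse matrix).

```r
# Naive contiguous k-mer counting for a single sequence.
# Illustration only; hypothetical helper, not part of seqR.
naive_count_kmers <- function(sequence, k) {
  n <- nchar(sequence)
  if (n < k) return(table(character(0)))
  starts <- seq_len(n - k + 1)
  # Every window of width k, sliding one position at a time.
  kmers <- substring(sequence, starts, starts + k - 1)
  table(kmers)
}

counts_first <- naive_count_kmers("AAAAAVVAVFF", k = 5)
counts_second <- naive_count_kmers("DFGSADFGSA", k = 5)

counts_first
counts_second
```

A sequence of length n yields n - k + 1 windows, so the first sequence above contributes 7 k-mers and the second contributes 6.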

##### Counting gapped 5-mers with gaps (0, 1, 0, 2) (XX_XX__X)

```{r}
count_kmers(sequences = c("AAAAAVVAVFF", "DFGSADFGSA"),
            kmer_gaps = c(0, 1, 0, 2))
```
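
The correspondence between the gap vector `(0, 1, 0, 2)` and the pattern `XX_XX__X` can be made explicit with a short base-R sketch. Both helpers below are hypothetical and serve only to show which positions a gapped k-mer reads; they are not the `seqR` implementation.

```r
# Convert a gap vector into 0-based offsets of the k-mer elements.
# Each gap g means "skip g positions before the next element".
# Hypothetical helper for illustration; not part of seqR.
gap_offsets <- function(kmer_gaps) {
  c(0, cumsum(kmer_gaps + 1))
}

# Extract every gapped k-mer from one sequence, in order of position.
naive_gapped_kmers <- function(sequence, kmer_gaps) {
  offsets <- gap_offsets(kmer_gaps)
  chars <- strsplit(sequence, "")[[1]]
  span <- max(offsets) + 1  # total width of the pattern, gaps included
  if (length(chars) < span) return(character(0))
  starts <- seq_len(length(chars) - span + 1)
  vapply(starts,
         function(s) paste(chars[s + offsets], collapse = ""),
         character(1))
}

offsets <- gap_offsets(c(0, 1, 0, 2))
gapped <- naive_gapped_kmers("AAAAAVVAVFF", c(0, 1, 0, 2))

offsets
gapped
```

Here the offsets are 0, 1, 3, 4, 7, which spans 8 positions, so a sequence of length 11 yields 4 gapped 5-mers.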

##### Counting 1-mers and 2-mers

```{r}
data(CsgA)

CsgA[1:2]

count_multimers(sequences = CsgA,
                k_vector = c(1, 2))
```

### How to cite

To get the citation, run:

```{r, eval=FALSE}
citation("seqR")
```

or use:

Jadwiga Słowik and Michał Burdukiewicz (2021). seqR: fast and comprehensive k-mer counting package. R package version 1.0.0.

## Benchmarks

The `seqR` package has been compared with other existing R packages for k-mer counting:
[biogram](https://CRAN.R-project.org/package=biogram),
[kmer](https://CRAN.R-project.org/package=kmer),
[seqinr](https://CRAN.R-project.org/package=seqinr),
and [Biostrings](https://bioconductor.org/packages/Biostrings).

All benchmarks were run on an Intel Core i7-6700HQ (2.60GHz, 8 cores) and timed with the [microbenchmark](https://CRAN.R-project.org/package=microbenchmark) R package.

### Contiguous k-mers

#### Changing k

The input consists of one `DNA` sequence of length `3 000`.

#### Changing the number of sequences

Each `DNA` sequence has `3 000` elements; contiguous 5-mers are counted.

### Gapped k-mers

#### Changing the first contiguous part of a k-mer

The input consists of one `DNA` sequence of length `1 000 000`. Gapped 5-mers are counted with base gaps `(1, 0, 0, 1)`.

#### Changing the first gap size

The input consists of one `DNA` sequence of length `100 000`. Gapped 5-mers are counted with base gaps `(1, 0, 0, 1)`.