Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/slowikj/seqr
fast and comprehensive k-mer counting package
https://github.com/slowikj/seqr
bioinformatics bioinformatics-tool dna-processing feature-engineering feature-extraction genomics hashing hashing-algorithms k-mer k-mer-counting kmer kmer-counting kmer-frequency-count kmers ngram ngrams protein-sequences rcpp rcppparallel rpackage
Last synced: about 7 hours ago
JSON representation
fast and comprehensive k-mer counting package
- Host: GitHub
- URL: https://github.com/slowikj/seqr
- Owner: slowikj
- Created: 2019-11-14T06:20:51.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2021-09-27T16:40:05.000Z (about 3 years ago)
- Last Synced: 2023-10-25T13:36:32.641Z (about 1 year ago)
- Topics: bioinformatics, bioinformatics-tool, dna-processing, feature-engineering, feature-extraction, genomics, hashing, hashing-algorithms, k-mer, k-mer-counting, kmer, kmer-counting, kmer-frequency-count, kmers, ngram, ngrams, protein-sequences, rcpp, rcppparallel, rpackage
- Language: C++
- Homepage: https://slowikj.github.io/seqR/
- Size: 1.72 MB
- Stars: 14
- Watchers: 3
- Forks: 1
- Open Issues: 7
-
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project
README
---
output: github_document
---```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
collapse = TRUE,
comment = "#>",
out.width = "100%"
)
``````{r, include = FALSE}
library(seqR)
```# seqR - fast and comprehensive k-mer counting package
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/seqR)](https://cran.r-project.org/package=seqR)
[![R build status](https://github.com/slowikj/seqR/workflows/R-CMD-check/badge.svg)](https://github.com/slowikj/seqR/actions)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![codecov.io](https://codecov.io/github/slowikj/seqR/coverage.svg?branch=master)](https://codecov.io/github/slowikj/seqR?branch=master)
[![Code Quality Status](https://www.code-inspector.com/project/23909/status/svg)](https://www.code-inspector.com/project/23909/status/svg)
[![Code Quality Score](https://www.code-inspector.com/project/23909/score/svg)](https://www.code-inspector.com/project/23909/score/svg)## About
`seqR` is an R package for fast k-mer counting. It provides
* **highly optimized** (the core algorithm is written in C++)
* **in-memory**
* **probabilistic** (with configurable dimensionality of a hash value
used for storing k-mers internally),
* **multi-threaded** (with a configurable size of the batch of sequences (`batch_size`) to process in a single step. If `batch_size` equals 1, the multi-threaded mode is disabled, which potentially causes a longer computation time)implementation that supports
* **various variants of k-mers** (contiguous, gapped, and positional counterparts)
* **all biological sequences** (e.g., nucleic acids and proteins)Moreover, the result optimizes memory consumption by the application of **sparse matrices**
(see [package Matrix](https://CRAN.R-project.org/package=Matrix)),
compatible with machine learning packages
such as [ranger](https://CRAN.R-project.org/package=ranger)
and [xgboost](https://CRAN.R-project.org/package=xgboost).## How to...
### How to install
To install `seqR` from CRAN:
```{r, eval=FALSE}
install.packages("seqR")
```Alternatively, if you want to use the latest development version:
```{r, eval=FALSE}
# install.packages("devtools")
devtools::install_github("slowikj/seqR")
```### How to use
The package provides two functions that facilitate k-mer counting
* `count_kmers` (used for counting k-mers of one type)
* `count_multimers` (a wrapper of `count_kmers`, used for counting k-mers of many types in a single invocation of the function)and one function used for custom processing of k-mer matrices:
* `rbind_columnwise` (a helper function used for merging several k-mer matrices that do not have same sets of columns)
To learn more, see [features overview vignette](https://slowikj.github.io/seqR/articles/features-overview.html)
and [reference](https://slowikj.github.io/seqR/reference/index.html).#### Examples
##### counting 5-mers
```{r}
count_kmers(sequences=c("AAAAAVVAVFF", "DFGSADFGSA"),
k=5)
```##### counting gapped 5-mers with gaps (0, 1, 0, 2) (XX_XX__X)
```{r}
count_kmers(sequences=c("AAAAAVVAVFF", "DFGSADFGSA"),
kmer_gaps=c(0, 1, 0, 2))
```##### counting 1-mers and 2-mers
```{r}
data(CsgA)CsgA[1L:2]
count_multimers(sequences=CsgA,
k_vector = c(1, 2))
```### How to cite
For citation type:
```{r, eval=FALSE}
citation("seqR")
```or use:
Jadwiga Słowik and Michał Burdukiewicz (2021). seqR: fast and comprehensive k-mer counting package. R package version 1.0.0.
## Benchmarks
The `seqR` package has been compared with other existing k-mer counting R packages:
[biogram](https://CRAN.R-project.org/package=biogram),
[kmer](https://CRAN.R-project.org/package=kmer),
[seqinr](https://CRAN.R-project.org/package=seqinr),
and [biostrings](https://bioconductor.org/packages/Biostrings).All benchmark experiments have been performed using Intel Core i7-6700HQ 2.60GHz 8 cores, using the [microbenchmark](https://CRAN.R-project.org/package=microbenchmark) R package.
### Contiguous k-mers
#### Changing k
The input consists of one `DNA` sequence of length `3 000`.
#### Changing the number of sequences
Each `DNA` sequence has `3 000` elements, `contiguous 5-mer` counting.
### Gapped k-mers
#### Changing the first contiguous part of a k-mer
The input consists of one `DNA` sequence of length `1 000 000`. `Gapped 5-mers` counting with base gaps `(1, 0, 0, 1)`.
#### Changing the first gap size
The input consists of one `DNA` sequence of length `100 000`. `Gapped 5-mers` counting with base gaps `(1, 0, 0, 1)`.