Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/biogenies/kmerfilters

Tools for filtering k-mer data
https://github.com/biogenies/kmerfilters

genomics k-mer n-gram peptide-identification proteomics

Last synced: about 1 month ago
JSON representation

Tools for filtering k-mer data

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# kmerFilters : R Package for k-mer Data Simulation and Filtering

[![Codecov test coverage](https://codecov.io/gh/BioGenies/kmerFilters/branch/main/graph/badge.svg)](https://app.codecov.io/gh/BioGenies/kmerFilters?branch=main)

## Overview

k-mers (n-grams) refer to k-length substrings derived from longer sequences, which can be continuous (representing a block of subsequent residues) or discontinuous (where the wildcards representing the gaps between residues are allowed). In biological applications, k-mers are used for various purposes, including genome assembly, sequence alignment, motif discovery, variant calling, phylogenetics, and CRISPR target identification. Due to the vast number of variables introduced by k-mer notation, we need tools to filter the variables.

In this package we provide tools for simulating k-mer data and benchmarking various filtering techniques. It is designed to assist researchers and practitioners in evaluating and comparing different approaches for preprocessing sequence data, particularly for applications such as protein function prediction.

## Features

Our package provides fast and efficient generation of synthetic k-mer datasets with customizable parameters, including sequence length, number of sequences, k-mer size and motifs impact. The latest embraces modelling the response variable in discrete or continuous way depending on few biological cases, such as

- additive impact of each motif,

- additive impact of interactions of motifs,

- additive impact of custom interactions (logical expressions using motifs).

To learn more, please visit: https://biogenies.info/kmerFilters/

Specific resources:

- [Simulation description](https://biogenies.info/kmerFilters/articles/simulation_description.html)

## How to install

You can install the latest version from GitHub:

```
# install.packages("devtools")
devtools::install_github("BioGenies/kmerFilters")
```

## Examples of usage

### How to simulate data

You can use the `generate_motifs` function for motifs generation. For example:

```{r}
library(kmerFilters)

alph <- letters[1:4]
n_injections <- 4

motifs <- generate_motifs(alphabet = alph,
n_motifs = 4,
n_injections = 4,
n = 4,
d = 6)

motifs
```

Using simulated motifs we can simulate positive and negative sequences and consrtuct a k-mer feature space:

```{r}
results <- generate_kmer_data(n_seq = 200,
sequence_length = 20,
alphabet = alph,
motifs = motifs,
n_injections = 4)
results[1:10, ]
```

Using obtained data you can choose how to generate a response variable. We provide three functions for that:

- `get_target_additive()`,
- `get_target_interactions()`,
- `get_target_logic()`.

For example, the following code:

```{r}
target <- get_target_additive(results)
target
```

creates a binary response variable based on the logistic regression model assumptions.

### How to filter

We have implemented many filtering methods. You can list them easily using the function

```{r}
list_filters()
```

Using k-mer space and target variable you can filter k-mers as follows:

```{r}
filter_quipt(target, results, significance_level = 0.05)
```

## Contributing

Contributions to kmerFilters are welcome! If you have suggestions for new features, improvements, or bug fixes, please submit an issue or pull request on GitHub.