https://github.com/loukesio/barcosim

BarcoSim is a specialized R package for simulating barcoded amplicon sequences. Whether you are testing algorithms or forecasting experimental outcomes, BarcoSim provides a framework for mimicking real-world sequencing scenarios, so that you can plan your genomic projects more efficiently.

Host: GitHub
URL: https://github.com/loukesio/barcosim
Owner: loukesio
License: other
Created: 2023-04-15T19:24:30.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-12-09T12:31:06.000Z (16 days ago)
Last Synced: 2024-12-09T13:34:29.043Z (16 days ago)
Language: R
Homepage:
Size: 2.1 MB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

[![License: MIT](https://img.shields.io/badge/License-MIT-brightgreen.svg)](https://opensource.org/licenses/MIT)

[![Last Commit](https://img.shields.io/github/last-commit/loukesio/barcosim.svg)](https://github.com/loukesio/barcosim/commits/main)

[![GitHub Contributors](https://img.shields.io/github/contributors/loukesio/barcosim.svg)](https://github.com/loukesio/barcosim/graphs/contributors)

[![Codecov test coverage](https://codecov.io/gh/loukesio/barcosim/branch/main/graph/badge.svg)](https://codecov.io/gh/loukesio/barcosim?branch=main)

[![Lifecycle: Experimental](https://img.shields.io/badge/lifecycle-experimental-blueviolet.svg)](https://www.tidyverse.org/lifecycle/#experimental)

## Install the BarcoSim package

Install the package using the following commands:

```r
# For now, you can install the development version of BarcoSim.
# First, install the devtools package if you have not already done so
install.packages("devtools")
# and load it
library(devtools)
# Then install the development version of BarcoSim from GitHub
devtools::install_github("loukesio/BarcoSim")
# and load it
library(BarcoSim)
```

### 1. Use the `gpseq` command to generate the parent sequences.

 Parameters:

- `num_sequences`: An integer specifying the number of DNA sequences to generate.

- `seq_length`: An integer representing the length of each DNA sequence.

- `range_start`: An integer indicating the start position of the barcoded sequence.

- `range_end`: An integer indicating the end position of the barcoded sequence.

``` r
library(Biostrings) # Tools for working with biological sequences such as DNA, RNA, and protein sequences
library(BarcoSim)   # BarcoSim: a package for simulating barcoded sequencing data
library(dplyr)      # Data manipulation and transformation; provides the %>% pipe used below

set.seed(123)       # Set the random seed so that the randomly generated sequences are reproducible

# Create 5 parent sequences, each 10 base pairs long, with a single barcode area spanning bases 3 to 6.
df1.1 <- gpseq(num_sequences=5, seq_length=10, range_start=3, range_end=6)

df1.1 %>%
  DNAStringSet()
#> DNAStringSet object of length 5:
#>     width seq
#> [1]    10 CGCAGCGTAA
#> [2]    10 CGTGTTGTAA
#> [3]    10 CGGCAAGTAA
#> [4]    10 CGTCGGGTAA
#> [5]    10 CGATGCGTAA

# Create 5 parent sequences, each 10 base pairs long, with two barcoded regions:
# one spanning bases 2 to 3 and one spanning bases 6 to 8.
df1.2 <- gpseq(5, 10, range_start=c(2,6), range_end=c(3,8))

df1.2 %>%
  DNAStringSet()
#> DNAStringSet object of length 5:
#>     width seq
#> [1]    10 GCTTAGGACG
#> [2]    10 GTGTATGGCG
#> [3]    10 GCGTACTCCG
#> [4]    10 GGGTATGTCG
#> [5]    10 GATTAGCTCG
```

<sup>Created on 2023-04-15 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
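To make the roles of `range_start` and `range_end` concrete, the base R sketch below builds sequences with the same shape as the output above: one random backbone shared by all parents, with the barcode positions redrawn independently for each parent. `sketch_parent_seqs` is a hypothetical helper written for illustration only. It is not BarcoSim's implementation, and it will not reproduce the exact sequences shown above.

```r
# Hypothetical helper: illustrates the idea only; not part of BarcoSim.
sketch_parent_seqs <- function(num_sequences, seq_length, range_start, range_end) {
  stopifnot(length(range_start) == length(range_end),
            all(range_start <= range_end),
            all(range_start >= 1), all(range_end <= seq_length))
  bases <- c("A", "C", "G", "T")
  # One random backbone shared by every parent (the conserved positions).
  backbone <- sample(bases, seq_length, replace = TRUE)
  # All positions covered by any barcode range.
  barcode_pos <- unlist(Map(seq, range_start, range_end))
  vapply(seq_len(num_sequences), function(i) {
    s <- backbone
    # Redraw the barcode positions independently for each parent.
    s[barcode_pos] <- sample(bases, length(barcode_pos), replace = TRUE)
    paste(s, collapse = "")
  }, character(1))
}

set.seed(1)
parents <- sketch_parent_seqs(5, 10, range_start = c(2, 6), range_end = c(3, 8))
parents
```

Positions outside the barcode ranges are identical across all returned sequences, because they are copied from the shared backbone.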

For `df1.1`, the output of `gpseq` contains conserved sequences at positions 1-2 and 7-10 and barcode sequences at positions 3-6 (see Figure 1). In addition, the `calcSeqSim` function quantifies the similarity among the sequences at each base pair.
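The split into conserved and barcode positions can be checked directly with base R string functions. This is a small sketch that uses the `df1.1` sequences printed above, entered by hand as a character vector. It does not rely on any BarcoSim function.

```r
# Parent sequences copied from the df1.1 output above.
parents <- c("CGCAGCGTAA", "CGTGTTGTAA", "CGGCAAGTAA", "CGTCGGGTAA", "CGATGCGTAA")

# Barcode region: positions 3 to 6.
barcodes <- substr(parents, 3, 6)

# Conserved flanks: positions 1 to 2 and 7 to 10.
left_flank  <- substr(parents, 1, 2)
right_flank <- substr(parents, 7, 10)

barcodes
length(unique(left_flank)) == 1   # TRUE when every parent shares the left flank
length(unique(right_flank)) == 1  # TRUE when every parent shares the right flank
```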



  

  



### 2. Use the `calcSeqSim` function to plot sequence similarity across the parent sequences

Parameters:

- `dna_seq`: A character vector of DNA sequences obtained as the output of the `gpseq` function.

Plotting sequence similarity in a sequence with a single barcoded area:

``` r
library(BarcoSim)
library(ggplot2)
library(dplyr)

df1.1 <- c("CGCAGCGTAA", "CGTGTTGTAA", "CGGCAAGTAA", "CGTCGGGTAA", "CGATGCGTAA")
df1.1
#> [1] "CGCAGCGTAA" "CGTGTTGTAA" "CGGCAAGTAA" "CGTCGGGTAA" "CGATGCGTAA"

# Find the sequence similarity at each position
calSeqSim(df1.1)
#> [1] 100 100  40  40  60  40 100 100 100 100

# Plotting example
# Create the tibble, with one row per position
df1.1_data <- calSeqSim(df1.1) %>%
  tibble() %>%
  dplyr::rename(similarity = 1)

# Create the area plot
ggplot(df1.1_data, aes(x = 1:nrow(df1.1_data), y = similarity)) +
  geom_area(color = "#333333", fill = "#edae49") +
  xlab("Base pair (bp)") +
  ylab("Percentage of similarity") +
  ggtitle("Sequence similarity plot per bp") +
  scale_x_continuous(breaks = 1:10) +
  theme(panel.grid = element_blank(),
        panel.background = element_rect(fill = "white"),
        plot.title = element_text(hjust = 0.5),
        axis.text = element_text(size = 12),
        axis.ticks.length = unit(.2, "cm")) +
  # Dashed lines mark the start and end of the barcode region
  geom_vline(xintercept = c(3, 6), linetype = "dashed")

# Similarly, you can plot the df1.2 data
```
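One way to obtain per-position values like those printed above is to take, at each position, the percentage of sequences that carry the most common base. The sketch below implements that definition in base R. `sketch_position_similarity` is a hypothetical name, and the package's own function may compute similarity differently, although this definition gives the same values for the `df1.1` example.

```r
# Hypothetical helper: one possible definition of per-position similarity.
sketch_position_similarity <- function(dna_seq) {
  stopifnot(length(unique(nchar(dna_seq))) == 1)  # equal-length sequences only
  # Rows are sequences, columns are positions.
  base_matrix <- do.call(rbind, strsplit(dna_seq, ""))
  # Percentage of sequences carrying the most common base in each column.
  apply(base_matrix, 2, function(column) 100 * max(table(column)) / length(column))
}

df1.1 <- c("CGCAGCGTAA", "CGTGTTGTAA", "CGGCAAGTAA", "CGTCGGGTAA", "CGATGCGTAA")
sketch_position_similarity(df1.1)
#> [1] 100 100  40  40  60  40 100 100 100 100
```

Under this definition, fully conserved positions score 100, and the lowest possible score for five sequences over four bases is 40.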



  

  



### 3. Use the `r_gpseq` command to replicate parent sequences and make a barcode data set. 

 Parameters:

- `dna_seq`: A character vector of DNA sequences, obtained as the output of the `gpseq` function.

- `num_replicates`: An integer specifying the number of times each parent sequence should be replicated.

- `error_rate`: A numeric value between 0 and 1 representing the probability of an error during the replication process.

``` r
library(BarcoSim)

print(df1.1)
#> [1] "CGCAGCGTAA" "CGTGTTGTAA" "CGGCAAGTAA" "CGTCGGGTAA" "CGATGCGTAA"

# With these parameters, r_gpseq replicates each parent DNA sequence in dna_seq
# twice, with an error probability of 0.1.
r_gpseq(dna_seq=df1.1, num_replicates=2, error_rate=0.1)
#>    parent parent_seq  offspring
#> 1       1 CGCAGCGTAA CGGTGCGTAA
#> 2       1 CGCAGCGTAA CGCAGCGTAA
#> 3       2 CGTGTTGTAA CGTGTTGTAC
#> 4       2 CGTGTTGTAA CGTGTTGTAA
#> 5       3 CGGCAAGTAA CGGCAAGTAA
#> 6       3 CGGCAAGTAA CCGCAAGTAA
#> 7       4 CGTCGGGTAA CGTCGGGTAC
#> 8       4 CGTCGGGTAA  GTCGGGTAA
#> 9       5 CGATGCGTAA  GATGCGTAA
#> 10      5 CGATGCGTAA CCATGCGTAA
```

<sup>Created on 2023-06-24 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
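The output above contains both substituted bases and offspring that are one base shorter than their parent. A simple error model that produces this kind of result is sketched below, under the assumption that each base is hit independently with probability `error_rate` and is then replaced by a uniform draw from A, C, G, T, or nothing (a deletion). `sketch_replicate` is a hypothetical helper and is not BarcoSim's code.

```r
# Hypothetical helper: a simple per-base error model, not BarcoSim's code.
sketch_replicate <- function(dna_seq, num_replicates, error_rate) {
  stopifnot(error_rate >= 0, error_rate <= 1)
  # "" stands for a deleted base, which shortens the offspring.
  outcomes <- c("A", "C", "G", "T", "")
  mutate_once <- function(parent_seq) {
    bases <- strsplit(parent_seq, "")[[1]]
    # Decide independently for each base whether an error occurs.
    hit <- runif(length(bases)) < error_rate
    # A hit may redraw the original base, leaving that position unchanged.
    bases[hit] <- sample(outcomes, sum(hit), replace = TRUE)
    paste(bases, collapse = "")
  }
  parent <- rep(seq_along(dna_seq), each = num_replicates)
  data.frame(
    parent = parent,
    parent_seq = dna_seq[parent],
    offspring = vapply(dna_seq[parent], mutate_once, character(1), USE.NAMES = FALSE)
  )
}

set.seed(42)
sketch_replicate(c("CGCAGCGTAA", "CGTGTTGTAA"), num_replicates = 2, error_rate = 0.1)
```

With `error_rate = 0`, every offspring equals its parent, which is a quick sanity check for a model like this.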

### 4. Use the `r_gpseq_csub` command to replicate parent sequences with custom substitution rates

`r_gpseq_csub` replicates parent sequences with a specified error rate while allowing you to specify a substitution rate for each base. It is an extension of the `r_gpseq` function.

Parameters:

- `dna_seq`: A character vector of DNA sequences, obtained as the output of the `gpseq` function.

- `num_replicates`: An integer specifying the number of times each parent sequence should be replicated.

- `error_rate`: A numeric value between 0 and 1 representing the probability of an error during the replication process.

- `substitution_probs`: A list of length 5 containing the substitution probabilities for each base (A, C, G, T) and for the empty entry (written as `" "` in the example below).

``` r
library(BarcoSim)

df1.1 <- c("GCTTAGGACG", "GTGTATGGCG", "GCGTACTCCG", "GGGTATGTCG", "GATTAGCTCG")

substitution_probs <- list("A" = 0.1, "C" = 0.2, "G" = 0.3, "T" = 0.4, " " = 0.1)

r_gpseq_csub(dna_seq=df1.1, num_replicates=2, error_rate=0.1, substitution_probs)
#>    parent parent_seq  offspring
#> 1       1 GCTTAGGACG GCTT AGACG
#> 2       1 GCTTAGGACG GCTTACGACG
#> 3       2 GTGTATGGCG GTGTAGGGCG
#> 4       2 GTGTATGGCG GTGTATAGCG
#> 5       3 GCGTACTCCG GCGTGCTCCG
#> 6       3 GCGTACTCCG GCGTACGCCG
#> 7       4 GGGTATGTCG GGGTATGTCG
#> 8       4 GGGTATGTCG GGGTATGTCG
#> 9       5 GATTAGCTCG GGTTGGCTCG
#> 10      5 GATTAGCTCG GAT GGC CG
```

<sup>Created on 2023-06-24 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
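The example weights in `substitution_probs` sum to 1.1 rather than 1. Base R's `sample()` accepts weights that do not sum to one, so a weighted draw still works, but normalizing first makes the effective probabilities explicit. The sketch below shows both steps. `sketch_draw_substitutions` is a hypothetical helper, and whether BarcoSim normalizes the weights internally is not stated in this README.

```r
substitution_probs <- list("A" = 0.1, "C" = 0.2, "G" = 0.3, "T" = 0.4, " " = 0.1)

# Flatten the named list into a named numeric weight vector.
weights <- unlist(substitution_probs)
sum(weights)                           # the example weights sum to 1.1, not 1
normalized <- weights / sum(weights)   # effective probabilities after scaling

# Hypothetical helper: draw replacement characters according to the weights.
sketch_draw_substitutions <- function(n, probs) {
  sample(names(probs), size = n, replace = TRUE, prob = probs)
}

set.seed(7)
draws <- sketch_draw_substitutions(10000, normalized)
round(table(draws) / length(draws), 2)  # empirical shares, close to `normalized`
```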