https://github.com/chrisarg/bio-seqalignment-examples-tailingpolyester

Repository related to my talk at TPRC2024

bioinformatics rna-seq simulator

Host: GitHub
URL: https://github.com/chrisarg/bio-seqalignment-examples-tailingpolyester
Owner: chrisarg
License: other
Created: 2024-04-17T01:52:43.000Z (9 months ago)
Default Branch: main
Last Pushed: 2024-06-12T12:26:50.000Z (7 months ago)
Last Synced: 2024-12-02T04:43:18.271Z (27 days ago)
Topics: bioinformatics, rna-seq, simulator
Language: Perl
Homepage:
Size: 610 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README
- Changelog: Changes
- License: LICENSE

README

NAME
Bio::SeqAlignment::Examples::TailingPolyester - extending the Polyester
RNA sequencing simulator by including polyA tails

VERSION
version 0.01

SYNOPSIS
A collection of examples that demonstrate how to extend the polyester
RNA sequencing tool by including polyA tails in the reference RNA used
to generate the simulated RNA sequencing data. The module also shows
the present-day relevance of Perl for constructing bioinformatics
applications related to sequence mapping.

DESCRIPTION
This distribution provides examples of the use of Perl, BioPerl and the
Perl Data Language (PDL) to extend the polyester RNA sequencing tool so
that it can include polyA tails in the reference RNA used to generate
the simulated RNA sequencing data. It also shows how to use these
sequences for RNA sequence mapping. The main module created for the
example is found under the namespace
Bio::SeqAlignment::Applications::Sequencing::Simulators::RNASeq::Polyester.
It is a command line tool that wraps the Polyester simulator, which is
itself an R-based Bioconductor package. Our extension gives polyester
the ability to add a tail to the RNA sequences it simulates. To do so,
we also created a pure R command line tool for polyester and put it
under the control of Perl. This example requires a few other modules
that may be of some general use. Some of these modules are imported
under the Bio::SeqAlignment::Examples::TailingPolyester namespace. Other
modules were given their own namespace under Bio::SeqAlignment. These
modules fall into the following categories:

A. Modules related to the simulation of random values from truncated
distributions. These are functional and will eventually move to their
own namespace once I figure out which one this will be! Until then, one
can load them by importing the relevant module under
Bio::SeqAlignment::Examples::TailingPolyester:

1. SimulatePDLGSL: module that uses the GNU Scientific Library (GSL) and
the Perl Data Language (PDL) to simulate random numbers from truncated
versions of the distributions provided by the GSL using two role
plugins: one for simulating random numbers from the uniform
distribution, and one for computing the CDF (cumulative distribution
function) of the truncated distribution and its inverse.

2. SimulateMathGSL: module that uses the GNU Scientific Library (GSL)
and base Perl to simulate random numbers from truncated versions of the
distributions in GSL using two role plugins: one for simulating random
numbers from the uniform distribution, and one for computing the CDF
(cumulative distribution function) of the truncated distribution and
its inverse.

3. SimulateTruncatedRNGPDL: a role plugin that implements the inverse
CDF method for drawing random numbers from a possibly truncated version
of a distribution using the Perl Data Language (PDL).

4. SimulateTruncatedRNG: a role plugin that implements the inverse CDF
method for drawing random numbers from a possibly truncated version of a
distribution in base Perl.

5. PDLRNG: a role plugin that draws random numbers from the uniform
distribution using the Xoshiro256+ algorithm in the Perl Data Language
(PDL).

6. GSLRNG: a role plugin that draws random numbers from the uniform
distribution using the uniform (flat) distribution in the PDL::GSL
module of PDL.

7. PERLRNGPDL: a role plugin that draws random numbers from the uniform
distribution using the builtin rand() function in Perl and returns an
ndarray with these values.

8. PERLRNG: a role plugin that draws random numbers from the uniform
distribution using the builtin rand() function in Perl and returns a
reference to an array of said values.
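
The inverse CDF method named in items 1 to 4 can be illustrated with a
short sketch. It is written in Python with the standard library only
and is not the code of the modules above: the function names
truncated_sample and truncated_lognormal are hypothetical, and the
lognormal is only an example distribution. The uniform generator is
passed in as a callable, loosely mirroring how the role plugins let the
source of uniform numbers be swapped.

```python
import math
import random
from statistics import NormalDist


def truncated_sample(cdf, inv_cdf, low, high, uniform=random.random):
    """Draw one value from a distribution truncated to [low, high].

    cdf and inv_cdf describe the untruncated distribution. uniform is
    any callable returning a float in [0, 1), so the RNG is pluggable.
    """
    p_low, p_high = cdf(low), cdf(high)
    # Rescale a uniform draw onto the probability mass between the bounds.
    u = p_low + (p_high - p_low) * uniform()
    # Clamp to guard against floating point round-off at the edges.
    return min(max(inv_cdf(u), low), high)


def truncated_lognormal(mu, sigma, low, high, uniform=random.random):
    """Truncated lognormal, sampled as a truncated normal on the log scale."""
    normal = NormalDist(mu, sigma)
    x = truncated_sample(normal.cdf, normal.inv_cdf,
                         math.log(low), math.log(high), uniform)
    return min(max(math.exp(x), low), high)


rng = random.Random(2024)  # seeded generator used as the uniform "plugin"
draws = [truncated_lognormal(0.0, 1.0, 0.5, 3.0, rng.random) for _ in range(5)]
print(draws)
```

This simple version assumes the bounds are not so far in the tails that
the CDF evaluates to exactly 0 or 1, in which case the inverse CDF call
would fail.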

B. Modules related to generic tasks such as reading and processing
collections of BioX::Seq objects, tailing of sequences, documenting
sequence modifications, polyA processing, and removal of such tails
from sequencing data. BioX::Seq is a lightweight framework for
representing biological sequences such as those that come from
sequencing instruments. It is a simple object that holds the sequence
data, the quality data, and the name of the sequence. It is used as a
lightweight alternative to the BioPerl Bio::Seq object. It can handle
both FASTA and FASTQ files, including their compressed versions. The
modules that fall under this category are:

1. Bio::SeqAlignment::Components::Conversions::BioXFASTX: This module
handles the conversion of lists of BioX::Seq objects to FASTX files
(where X is either A or Q, indicating FASTA or FASTQ) on disk. The
module is used as an example of input/output plugins for the
Bio::SeqAlignment::Components::TrimTail module.

2. Bio::SeqAlignment::Components::Sundry::IOHelpers: a collection of
modules that read, write and split FASTX (either FASTA or FASTQ) files.
It provides convenience functions to read/write such files using the
lightweight module BioX::Seq::Stream.

3. Bio::SeqAlignment::Components::Sundry::Tailing: This module provides
functions to add various tails to the 3' end of biological sequences.
Such modifications are useful for e.g. simulating polyA tails in RNAseq,
adding UMI (Unique Molecular Identifier) tags to sequences, etc. The
function add_polyA is used by the
Bio::SeqAlignment::Applications::Sequencing::Simulators::RNASeq::Polyester
module to add polyA tails in the extension of Polyester presented in
the talk.

4. Bio::SeqAlignment::Components::Sundry::DocumentSequenceModifications:
This module is used to store modifications to sequences that are carried
out by components of the simulator (or the modules that process
sequences for mapping). During the execution of the Perl code, we use
hash structures to store such modifications (a type of in-memory log)
and then write them out in YAML, JSON or MessagePack formats. These
files may be loaded at a subsequent point and used to analyze the
results of whatever sequence modification was carried out in the source
data.
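
As a rough illustration of items 3 and 4, the following Python sketch
appends a polyA tail to each sequence and keeps an in-memory log of the
modification that is then serialized to JSON. It does not reproduce the
interface of add_polyA or of DocumentSequenceModifications: the function
names, the log fields and the uniform choice of tail length are all
assumptions made for the example.

```python
import json
import random


def add_poly_a_tail(seq, tail_length):
    """Return seq with tail_length adenines appended to its 3' end."""
    if tail_length < 0:
        raise ValueError("tail_length must be non-negative")
    return seq + "A" * tail_length


def tail_sequences(seqs, max_tail=250, rng=None):
    """Tail every sequence in a {name: sequence} dict.

    Returns the tailed sequences and a dict logging each modification.
    Tail lengths are drawn uniformly from 0..max_tail (an assumption).
    """
    rng = rng or random.Random()
    tailed, log = {}, {}
    for name, seq in seqs.items():
        n = rng.randint(0, max_tail)
        tailed[name] = add_poly_a_tail(seq, n)
        log[name] = {
            "modification": "polyA",   # hypothetical field names
            "tail_length": n,
            "original_length": len(seq),
        }
    return tailed, log


reference = {"tx1": "ACGTTGCA", "tx2": "GGCATTAC"}
tailed, log = tail_sequences(reference, max_tail=20, rng=random.Random(1))
print(json.dumps(log, indent=2))
```

Because the log records the original length of every sequence, a later
analysis step can check how much of each simulated tail a trimming
method actually removed.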

A single application script is provided in the bin directory of the
distribution. This script is called polyester.pl and is used to attach
the polyA tails to the reference sequences, before calling the
polyester R script.

In addition, this distribution contains example scripts for the use of
these modules and comparator scripts for high performance random
number generation against R and Python. PDL just shines in this area.

All modules and application scripts were used for the talk given to the
Science Track of the Perl & Raku conference 2024.
https://tprc.us/tprc-2024-las/
https://blogs.perl.org/users/oodler_577/2024/01/perl-raku-conference-2024-to-host-a-science-track.html

scripts
This is a directory that holds various scripts in Perl and R that are
used to generate and analyze performance data of various aspects covered
in this talk. The generated data are found in the subfolder data, while
the results of these analyses are stored as image files under 'scripts'.
The following files are found under this location:

cutadapt_polyA_algo_timing.pl
This script benchmarks various potential approaches to trimming the
polyA tail from sequences, including various native Perl implementations
of the cutadapt algorithm, as well as PDL and C implementations of the
same algorithm. It also includes an implementation of a changepoint
method in C.

cutadapt_polyA_algo_timing.py
A Python script for the implementation of the cutadapt algorithm for
trimming polyA tails from sequences and a modified version developed for
benchmarking. This script is used to compare the performance of various
implementations of the cutadapt algorithm in Perl, Python, and C.
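
For orientation, a scoring-based polyA trimmer in the style of cutadapt
can be sketched as follows. This is not the code in the scripts above.
The constants (+1 for an A, -2 for any other base, at most 20% non-A
bases in the tail, tails shorter than 3 ignored) are my recollection of
cutadapt's approach and should be treated as assumptions, as should the
function names.

```python
def poly_a_trim_index(seq):
    """Return the index at which a 3' polyA tail starts.

    Scans from the 3' end, scoring +1 for each A and -2 for any other
    base, and keeps the best-scoring start position whose tail has at
    most 20% non-A bases. Returns len(seq) when no tail is found.
    """
    n = len(seq)
    best_index, best_score = n, 0
    score = errors = 0
    for i in range(n - 1, -1, -1):
        if seq[i] == "A":
            score += 1
        else:
            score -= 2
            errors += 1
        if score > best_score and errors <= 0.2 * (n - i):
            best_index, best_score = i, score
    # Ignore tails shorter than 3 bases.
    if best_index > n - 3:
        best_index = n
    return best_index


def trim_poly_a(seq):
    """Return seq with its 3' polyA tail removed."""
    return seq[:poly_a_trim_index(seq)]


for read in ("ACGTACGTAAAAAAAA", "ACGTAAAAGAAAAAA", "ACGTAA"):
    print(read, "->", trim_poly_a(read))
```

The running score lets the scan pass over an isolated non-A base inside
a long tail, while the error-rate check stops it from absorbing
ordinary sequence that merely happens to be A-rich.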

testRNG_performance.pl
This script tests different combinations of random number generators,
and implementations of the inverse CDF method for sampling from
truncated distributions. Its main output is a comma-separated file of
timing data.
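
The general shape of such a benchmark, timing several samplers and
writing one row of timing data per run, might look like the Python
sketch below. The column names, the sampler labels and the choice of
samplers are hypothetical and do not reflect the layout of the CSV file
produced by the Perl script.

```python
import csv
import io
import random
import time


def time_sampler(sampler, draws, repeats):
    """Return the wall-clock seconds of each of `repeats` timed runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(draws):
            sampler()
        timings.append(time.perf_counter() - start)
    return timings


def benchmark_to_csv(samplers, draws=10_000, repeats=5, out=None):
    """Write one CSV row per (sampler, repeat) and return the stream."""
    out = out if out is not None else io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["sampler", "repeat", "draws", "seconds"])
    for name, sampler in samplers.items():
        for i, secs in enumerate(time_sampler(sampler, draws, repeats)):
            writer.writerow([name, i, draws, f"{secs:.6f}"])
    return out


samplers = {
    "uniform": random.random,
    "lognormal": lambda: random.lognormvariate(0.0, 1.0),
}
report = benchmark_to_csv(samplers, draws=1_000, repeats=3)
print(report.getvalue())
```

Keeping one row per repeat, rather than a single mean, preserves the
spread of the timings, which is what a violin plot of the results needs.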

testsimsGSL.R
This script is used to test the performance of the GSL RNGs against the
inverse CDF implemented via a procedural logic in R. It outputs a single
PNG file with the violin plots (a combination of box plots and kernel
density) of the timing data for different possible implementations of
the inverse CDF method in either R or Perl.

vioplot_Perl_R_lognormal.png
Performance comparison of Perl and R for the generation of truncated
lognormal variates. It is produced by testsimsGSL.R

testPerl.csv
This is a CSV file that contains the timing data for the Perl RNGs and
the inverse CDF method implemented in PDL. It is produced by
testRNG_performance.pl

perl_timing.txt
This is a text file that contains the timing data for the various
implementations of cutadapt in native Perl, PDL and PDL/C methods. It is
produced by the script cutadapt_polyA_algo_timing.pl

python_timing.txt
This is a text file that contains the timing data for the various
implementations of cutadapt in native Python. It is produced by the
script cutadapt_polyA_algo_timing.py