Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/chrisarg/bio-seqalignment-examples-enhancing-edlib

Parallelizing the Edlib sequencing library using MCE and OpenMP in Perl
https://github.com/chrisarg/bio-seqalignment-examples-enhancing-edlib

bioinformatics rna-seq sequence-alignment sequencing-rna

Last synced: about 1 month ago
JSON representation

Parallelizing the Edlib sequencing library using MCE and OpenMP in Perl

Awesome Lists containing this project

README

        

NAME
Bio::SeqAlignment::Examples::EnhancingEdlib - Parallelizing Edlib with
MCE, OpenMP using Inline and FFI::Platypus

VERSION
version 0.03

SYNOPSIS
A collection of examples that demonstrate how to enhance the sequence
alignment library Edlib by adding process level parallelism (through
MCE) and thread level parallelism (through OpenMP). These examples
hopefully demonstrate the general present day relevance of Perl for
constructing bioinformatic applications related to sequence mapping.

DESCRIPTION
This distribution provides examples of sequence mapping applications
build on top of the Edlib library. The examples demonstrate how to use
the MCE and OpenMP parallelism libraries to speed up the sequence
alignment process through process and thread level parallelism. The
examples are meant to be illustrative of the capabilities of Perl to
create parallel components for sequencing applications. These examples
educational, and are not meant to be used in production code. However,
they can be used as a starting point for creating such applications,
which is the point we are trying to make in the presentation given at
the Perl and Raku conference 2024.

All scripts were used to generate data for the talk given to the Science
Track of the Perl & Raku conference 2024 https://tprc.us/tprc-2024-las/
https://blogs.perl.org/users/oodler_577/2024/01/perl-raku-conference-202
4-to-host-a-science-track.html

db
This is a directory that holds the sequence data used in the examples.
It contains two files : Homo_sapiens.GRCh38.cds.sample.strip.fasta :
1000 sequences Homo_sapiens.GRCh38.sample.end.strip.fasta : 648
sequences from the human transcriptome.

scripts
This is a directory that holds various scripts in Perl and R that are
used to generate and analyze performance data for the MCE and openMP
enhanced versions of edlib.

testMapMCE_openMP.pl
Perl script that illustrates the use of the Linear dataflow and the
Generic Mapper dicussed in Bio::SeqAlignment::Components::SeqMapping
for
the creation of an MCE and OpenMP enhanced version of Edlib. Running the
script generates the results file timings_openMP.txt . Before running in
your own machine, please make sure that the number of workers and the
number of threads are set between 1 and the maximum number of
hyperthreads for your system.

testMapMCE_overC.pl
Perl script that illustrates the use of the Linear dataflow and the
Generic Mapper dicussed in Bio::SeqAlignment::Components::SeqMapping
for
the creation of an MCE enhanced version of Edlib. Running the script
generates the results file timings.txt . Before running in your own
machine, please make sure that the number of workers is set between 1
and the maximum number of hyperthreads for your system. Feel free to
experiment with different chunk sizes (the script only checks chunk
sizes between 1 and 20 query sequences).

plot_timings.R
R script that generates the figures for the benchmarking between the MCE
and MCE/OpenMP enhanced versions of Edlib.

timings.txt
Execution timings of the MCE version of the enhanced Edlib library. This
is a tab separated file with three columns: Workers : number of MCE
workers Time : execution time in seconds Chunk_Size : MCE chunk size

timings_openMP.txt
Execution timings of the MCE/openMP version of the enhanced Edlib
library. This is a tab separated file with three columns: Workers :
number of MCE workers Time : execution time in seconds Num_Threads :
Number of OpenMP threads

timings_chunk.png
Plot of the timings.txt file

timings_MCE_OMP.png
Plot of the execution times of the MCE enhanced version of Edlib (for
chunk size of one) for an increasing number of MCE workers v.s. the
MCE/OpenMP enhancement for an increasing number of threads for single
worker.

timings_MCE_with_OMP.png
Plot of the execution times of the MCE/OpenMP enhanced version of Edlib
for various combinations of workers and threads. The figure is a two
dimensional histogram (rendered as a contour plot) with the 5% best
performing combinations marked with a '+' sign.

spacetime_MCE_OMP.png
Plot of the resources (copies of data in memory x execution time) used
by the MCE enhanced version of Edlib (for chunk size of one) for an
increasing number of MCE workers v.s. the MCE/OpenMP enhancement for an
increasing number of threads for single worker.

spacetime_MCE_with_OMP.png
Plot of the resources (copies of data in memory x execution time) used
by the MCE/OpenMP enhanced version of Edlib for various combinations of
workers and threads. The figure is a two dimensional histogram (rendered
as a contour plot) with the 5% best performing combinations marked with
a '+' sign.

Worker_Thread_effects.png
Main effects of the number of workers and threads on the execution time
of the MCE/OpenMP enhanced version of Edlib. The figure is generated by
fitting a Generalized Additive Model (GAM) analysis of the data in the
timings_MCE_with_OMP.txt file and plotting the main effects of the
number of workers and threads.

Worker_Thread_interaction.png
Interaction effects of the number of workers and threads on the
execution time of the MCE/OpenMP enhanced version of Edlib. The figure
is generated by fitting a Generalized Additive Model (GAM) analysis of
the data in the timings_MCE_with_OMP.txt file and plotting the
generalized interaction term between the number of workers and threads.

SEE ALSO
* edlib

This distribution provides edlib static and shared libraries so that
they can be used by other Perl distributions that are on CPAN.

* Bio::SeqAlignment

A collection of tools and libraries for aligning biological
sequences from within Perl.

* Bio::SeqAlignment::Components::SeqMapping

Components (modules for dataflow in sequencing applications and
generic mappers) to create sequence mapping applications in Perl.

* Bio::SeqAlignment::Components::Libraries::edlib

Edlib sequence mapping library binding for Perl. Two versions are
provided, one that is FFI based and one that is Inline based.

* MCE

MCE (Many-Core Engine) is a high performance parallel processing
library for Perl. It is designed to take advantage of the many cores
available on modern processors and provides a simple interface for
creating parallel components in Perl.

* OpenMP

The OpenMP ARB (Architecture Review Boards) mission is to
standardize directive-based multi-language high-level parallelism
that is performant, productive and portable. Jointly defined by a
group of major computer hardware and software vendors and major
parallel computing user facilities, the OpenMP API is a portable,
scalable model that gives shared-memory parallel programmers a
simple and flexible interface for developing parallel applications
on platforms ranging from embedded systems and accelerator devices
to multicore systems and shared-memory systems.

* PDL

The Perl Data Language (PDL) gives standard Perl the ability to
compactly store and speedily manipulate the large N-dimensional data
arrays which are the bread and butter of scientific computing. PDL
turns Perl into a free, array-oriented, numerical language that can
be a very solid alternative to switching to Python or R for
numerical computations during complex data analysis tasks and
pipelines.

AUTHOR
Christos Argyropoulos

COPYRIGHT AND LICENSE
This software is copyright (c) 2024 by Christos Argyropoulos.

This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.