https://github.com/citiususc/pastaspark
PASTASpark is an extension to PASTA (Practical Alignments using SATé and TrAnsitivity) that allows it to run on a distributed-memory cluster using Apache Spark.
- Host: GitHub
- URL: https://github.com/citiususc/pastaspark
- Owner: citiususc
- License: gpl-3.0
- Created: 2017-02-15T13:15:10.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2019-03-19T08:48:28.000Z (about 7 years ago)
- Last Synced: 2024-04-16T11:35:32.333Z (about 2 years ago)
- Language: Python
- Homepage:
- Size: 32.2 MB
- Stars: 10
- Watchers: 9
- Forks: 6
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# What is PASTASpark about?
**PASTASpark** is a tool that uses the Big Data engine Apache Spark to boost the performance of the alignment phase of [PASTA][1] (Practical Alignments using SATé and TrAnsitivity). PASTASpark noticeably reduces the execution time of PASTA by running the most costly part of the original code as a distributed Spark application. In this way, PASTASpark guarantees scalability and fault tolerance, and makes it possible to obtain multiple sequence alignments (MSAs) from very large datasets in reasonable time.
If you use **PASTASpark**, please cite this article:
> José M. Abuin, Tomás F. Pena and Juan C. Pichel. ["PASTASpark: multiple sequence alignment meets Big Data"][4]. *Bioinformatics*, Vol. 33, Issue 18, pages 2948-2950, 2017.
**PASTASpark** started as a fork of [PASTA][1], made in November 2016 ([here][2] and [here][3]). It later became a standalone project in this repository. The original PASTA papers are:
> Mirarab, S., Nguyen, N., and Warnow, T. (2014). ["PASTA: Ultra-Large Multiple Sequence Alignment"][5]. In R. Sharan (Ed.), *Research in Computational Molecular Biology*, (pp. 177–191).
> Mirarab, S., Nguyen, N., Guo, S., Wang, L., Kim, J. and Warnow, T. ["PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences"][6]. *Journal of Computational Biology*, (2014).
## Installation
**PASTASpark** only works on Linux systems.
### Compilation from sources
You need Python 2.7 and git installed.
1. Clone the repository:
```
git clone https://github.com/citiususc/pastaspark.git
```
2. Enter the created directory and run the install command:
```
cd pastaspark
python setup.py develop --user
```
## Running PASTASpark
### Running Dependencies
1. Python 2.7 or later.
2. Java 8 (required for Opal, which PASTA uses by default for merging small alignments).
3. A cluster with Hadoop/YARN and Spark installed and running. Tested with Hadoop 2.7.1 and Spark 1.6.1.
4. A folder shared among the compute nodes, used to store the results on the cluster.
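Part of the list above can be checked before launching a job. The following is a minimal sketch, not part of PASTASpark: the helper names `find_on_path` and `check_dependencies` are hypothetical, and the tool names `java` and `spark-submit` are assumed to be the executables you expect on your `PATH`. It only reports whether the executables are found; it does not check the Java version, whether Hadoop/YARN is running, or whether the shared folder exists.

```python
import os
import sys


def find_on_path(name, path=None):
    """Return the full path of executable `name`, or None if it is not found."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def check_dependencies(tools=("java", "spark-submit")):
    """Map each tool name to its location on PATH (None when missing)."""
    return {tool: find_on_path(tool) for tool in tools}


if __name__ == "__main__":
    # Report the interpreter version and where each tool was found.
    print("Python %d.%d" % sys.version_info[:2])
    for tool, location in sorted(check_dependencies().items()):
        print("%-12s %s" % (tool, location or "NOT FOUND"))
```

The sketch avoids `shutil.which` so that it also runs under Python 2.7, the version named in the installation section.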
### Working modes
**PASTASpark** can run in two modes: like the original PASTA, or on a YARN/Spark cluster. If you launch **PASTASpark** within a Spark context, it runs on your Spark cluster. You can find more information about this topic in the next section.
### Examples
A basic example of how to run **PASTASpark** on your local machine with a working Spark setup is:
```
spark-submit --master local run_pasta.py -i data/small.fasta -t data/small.tree
```
The following bash script launches PASTASpark on a YARN cluster, taking as input the files stored in the `data` directory:
```
#!/bin/bash
# Launch PASTASpark on a YARN cluster with the sample data in ./data.
SPARK_COMMAND="spark-submit --master yarn --deploy-mode cluster"
DRIVER_MEM="25G"
EXEC_MEM="5G"
CURRENT_DIR=$(pwd)
HOME="/home/jmabuin"  # adjust to your own home directory
NUM_EXECUTORS="8"
DRIVER_CORES="4"
EXECUTOR_CORES="1"
ARCHIVES="pasta.zip"
PY_FILES="pasta.zip,$HOME/.local/lib/python2.7/site-packages/DendroPy-3.12.3-py2.7.egg"
INPUT_DATA="$CURRENT_DIR/data/small.fasta"
INPUT_TREE="$CURRENT_DIR/data/small.tree"

# SPARK_COMMAND is left unquoted on purpose so that it splits into words.
$SPARK_COMMAND --name PastaSpark_Small_8Exec \
  --driver-memory "$DRIVER_MEM" --executor-memory "$EXEC_MEM" \
  --num-executors "$NUM_EXECUTORS" \
  --driver-cores "$DRIVER_CORES" --executor-cores "$EXECUTOR_CORES" \
  --archives "$ARCHIVES" --py-files "$PY_FILES" \
  run_pasta.py --temporaries=./ -i "$INPUT_DATA" -t "$INPUT_TREE" \
  --num-cpus="$DRIVER_CORES" --num-cpus-spark="$EXECUTOR_CORES" \
  --num-partitions="$NUM_EXECUTORS"
```
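The same launch can also be assembled from Python, which makes the relationship between the options explicit: in the script above, `--num-cpus` follows the driver cores, `--num-cpus-spark` follows the executor cores, and `--num-partitions` follows the number of executors. The following is a minimal sketch that reuses only the flags and values shown above. The function name `build_spark_submit` is hypothetical, and the DendroPy egg is left out of `--py-files` for brevity.

```python
def build_spark_submit(spark_options, script, script_args):
    """Return the spark-submit command as a list of arguments.

    spark_options is a list of (flag, value) pairs placed before the script;
    script_args is passed through to the script unchanged.
    """
    command = ["spark-submit"]
    for flag, value in spark_options:
        command.extend([flag, str(value)])
    command.append(script)
    command.extend(script_args)
    return command


executors = 8
driver_cores = 4
executor_cores = 1

command = build_spark_submit(
    [
        ("--master", "yarn"),
        ("--deploy-mode", "cluster"),
        ("--name", "PastaSpark_Small_8Exec"),
        ("--driver-memory", "25G"),
        ("--executor-memory", "5G"),
        ("--num-executors", executors),
        ("--driver-cores", driver_cores),
        ("--executor-cores", executor_cores),
        ("--archives", "pasta.zip"),
        # Append the DendroPy egg path here, as in the bash script.
        ("--py-files", "pasta.zip"),
    ],
    "run_pasta.py",
    [
        "--temporaries=./",
        "-i", "data/small.fasta",
        "-t", "data/small.tree",
        "--num-cpus=%d" % driver_cores,
        "--num-cpus-spark=%d" % executor_cores,
        "--num-partitions=%d" % executors,
    ],
)

# Print the command for inspection instead of running it.
print(" ".join(command))
```

To actually submit the job, the resulting list could be passed to `subprocess.check_call(command)` on a machine where `spark-submit` is available.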
To see the original PASTA documentation, click [here](README_PASTA.md).
[1]: https://github.com/smirarab/pasta
[2]: https://github.com/jmabuin/pasta
[3]: https://github.com/tarabelo/pasta
[4]: https://doi.org/10.1093/bioinformatics/btx354
[5]: https://link.springer.com/chapter/10.1007%2F978-3-319-05269-4_15
[6]: http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.0156