{"id":22155930,"url":"https://github.com/citiususc/pastaspark","last_synced_at":"2025-10-26T11:52:41.620Z","repository":{"id":68812712,"uuid":"82062348","full_name":"citiususc/pastaspark","owner":"citiususc","description":"PASTASpark is an extension to PASTA (Practical Alignments using SATé and TrAnsitivity) that allows to execute it on a distributed memory cluster making use of Apache Spark.","archived":false,"fork":false,"pushed_at":"2019-03-19T08:48:28.000Z","size":33811,"stargazers_count":10,"open_issues_count":1,"forks_count":6,"subscribers_count":9,"default_branch":"master","last_synced_at":"2024-04-16T11:35:32.333Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/citiususc.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-02-15T13:15:10.000Z","updated_at":"2024-03-12T08:14:23.000Z","dependencies_parsed_at":null,"dependency_job_id":"e2d25219-49e0-4d2c-97b0-9cde500898a3","html_url":"https://github.com/citiususc/pastaspark","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/citiususc%2Fpastaspark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/citiususc%2Fpastaspark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/citiususc%2Fpastaspark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/citiususc%2Fpastaspark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/citiusu
sc","download_url":"https://codeload.github.com/citiususc/pastaspark/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227660738,"owners_count":17800412,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-02T02:33:19.623Z","updated_at":"2025-10-26T11:52:36.590Z","avatar_url":"https://github.com/citiususc.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# What is PASTASpark about?\n\n**PASTASpark** is a tool that uses the Big Data engine Apache Spark to boost the performance of the alignment phase of [PASTA][1] (Practical Alignments using SATé and TrAnsitivity). PASTASpark noticeably reduces the execution time of PASTA by running the most costly part of the original code as a distributed Spark application. In this way, PASTASpark guarantees scalability and fault tolerance, and makes it possible to obtain MSAs from very large datasets in reasonable time. \n\nIf you use **PASTASpark**, please cite this article:\n\n\u003e José M. Abuin, Tomás F. Pena and Juan C. Pichel. [\"PASTASpark: multiple sequence alignment meets Big Data\"][4]. *Bioinformatics*, Vol. 33, Issue 18, pages 2948-2950, 2017.\n\n\n**PASTASpark** was originally a fork of [PASTA][1] (forked in November 2016), hosted [here][2] and [here][3]. Later, it became a project of its own in this repository. The original PASTA paper can be found in these references:\n\n\u003e Mirarab, S., Nguyen, N., and Warnow, T. (2014). [\"PASTA: Ultra-Large Multiple Sequence Alignment\"][5]. In R. 
Sharan (Ed.), *Research in Computational Molecular Biology*, (pp. 177–191).\n\n\u003e Mirarab, S., Nguyen, N., Guo, S., Wang, L., Kim, J. and Warnow, T. [\"PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences\"][6]. *Journal of Computational Biology*, (2014).\n\n## Installation\n\n**PASTASpark** only works on Linux systems.\n\n### Compilation from sources\n\nYou need Python 2.7 and git installed.\n\n1. Clone the repository:\n```\ngit clone https://github.com/citiususc/pastaspark.git\n```\n\n2. Enter the created directory and run the install command:\n```\ncd pastaspark\npython setup.py develop --user\n```\n\n## Running PASTASpark\n\n### Running Dependencies\n\n1. Python 2.7 or later.\n2. Java 8 (required for Opal, which PASTA uses by default for merging small alignments).\n3. A cluster with Hadoop/YARN and Spark installed and running. Tested with Hadoop 2.7.1 and Spark 1.6.1.\n4. A folder shared among the computing nodes to store the results in the cluster.\n\n\n### Working modes\n\n**PASTASpark** can be executed like the original PASTA or on a YARN/Spark cluster. If you launch **PASTASpark** within a Spark context, it will be executed on your Spark cluster. 
You can find more information about this topic in the next section.\n\n### Examples\n\nA basic example of how to execute **PASTASpark** on your local machine with a working Spark setup is:\n\n```\nspark-submit --master local run_pasta.py -i data/small.fasta -t data/small.tree\n```\n\nThe following is an example of how to launch PASTASpark using a bash script, taking as input the files stored in the `data` directory:\n```\n#!/bin/bash\n\nSPARK_COMMAND=\"spark-submit --master yarn --deploy-mode cluster\"\nDRIVER_MEM=\"25G\"\nEXEC_MEM=\"5G\"\n\nCURRENT_DIR=`pwd`\nHOME=\"/home/jmabuin\"\n\nNUM_EXECUTORS=\"8\"\nDRIVER_CORES=\"4\"\nEXECUTOR_CORES=\"1\"\nARCHIVES=\"pasta.zip\"\nPY_FILES=\"pasta.zip,$HOME/.local/lib/python2.7/site-packages/DendroPy-3.12.3-py2.7.egg\"\n\nINPUT_DATA=\"$CURRENT_DIR/data/small.fasta\"\nINPUT_TREE=\"$CURRENT_DIR/data/small.tree\"\n\n$SPARK_COMMAND --name PastaSpark_Small_8Exec --driver-memory $DRIVER_MEM --executor-memory $EXEC_MEM --num-executors $NUM_EXECUTORS --driver-cores $DRIVER_CORES --executor-cores $EXECUTOR_CORES --archives $ARCHIVES --py-files $PY_FILES run_pasta.py --temporaries=./ -i $INPUT_DATA -t $INPUT_TREE --num-cpus=$DRIVER_CORES --num-cpus-spark=$EXECUTOR_CORES --num-partitions=$NUM_EXECUTORS\n\n```\nTo see the original PASTA documentation, click [here](README_PASTA.md).\n\n[1]: https://github.com/smirarab/pasta\n[2]: https://github.com/jmabuin/pasta\n[3]: https://github.com/tarabelo/pasta\n[4]: https://doi.org/10.1093/bioinformatics/btx354\n[5]: https://link.springer.com/chapter/10.1007%2F978-3-319-05269-4_15\n[6]: http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.0156\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcitiususc%2Fpastaspark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcitiususc%2Fpastaspark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcitiususc%2Fpastaspark/lists"}