https://github.com/passionlab/pastis
Distributed Protein Sequence Aligner
https://github.com/passionlab/pastis
Last synced: 5 months ago
JSON representation
Distributed Protein Sequence Aligner
- Host: GitHub
- URL: https://github.com/passionlab/pastis
- Owner: PASSIONLab
- License: other
- Created: 2019-06-13T17:47:55.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2024-10-03T21:34:37.000Z (over 1 year ago)
- Last Synced: 2025-06-10T13:49:07.945Z (8 months ago)
- Language: C++
- Size: 11.8 MB
- Stars: 7
- Watchers: 5
- Forks: 0
- Open Issues: 6
-
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
README
# PASTIS: Distributed Many-to-Many Protein Sequence Alignment Using Sparse Matrices
PASTIS is a highly-parallel distributed protein sequence aligner. It relies on distributed sparse matrix and graph operations to scale expensive many-against-many sequence alignments. PASTIS uses novel techniques such as Blocked 2D Sparse SUMMA to reduce high memory overhead, custom matrix operations for load balancing and can take advantage of accelerators for faster alignments. In 2022, PASTIS was among the 6 finalists for the [Gordon Bell Prize](https://awards.acm.org/bell), completing a many-against-many search of 400 million protein sequences using over 20,000 GPUs in less than 3.5 hours and sustaining a rate of 690 million alignments per second. Accelerator-enabled GPU ports can be accessed under [CUDA](https://github.com/PASSIONLab/PASTIS/tree/CUDA) and [HIP](https://github.com/PASSIONLab/PASTIS/tree/HIP) branches in the repository. Refer to these [works](https://github.com/PASSIONLab/PASTIS?tab=readme-ov-file#citation) for more details.
## Prerequisites
1. Operating System.
* PASTIS is tested and known to work on the following operating systems.
* SUSE Linux Enterprise Server 15.
* Ubuntu 14.10.
* MacOS.
2. GCC/G++ version 8.2.0 or above.
3. CMake 3.11 or above.
## Dependencies
1. PASTIS requires CombBLAS and SeqAn libraries. These are included within PASTIS as
submodules. To get them all, use:
```
git clone --recursive https://github.com/PASSIONLab/PASTIS
```
If you already happen to have those libraries, they are needed within the PASTIS
directory. So, link the installation directories under names "CombBLAS" and
"seqan". Note that PASTIS uses master branch of CombBLAS and develop branch of
SeqAn.
2. Build CombBLAS; the following commands can be used to build and install CombBLAS:
```
cd CombBLAS
mkdir build
mkdir install
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install ../
make -j4
make install
```
3. For SeqAn, switch to develop branch (needed for threads and vectorization):
```
cd seqan
git fetch
get checkout develop
```
SeqAn is a header only library, so there's no need to build it.
## Build and Test PASTIS
To build PASTIS after cloning or downloading the source, use:
* mkdir build install
* cd build
* cmake .. -DCMAKE_INSTALL_PREFIX=../install
* make install
To run tests, use:
* ctest -V
This will run PASTIS with the fasta file provided under tests directory, and
produce output alignment information in the same directory.
## Run PASTIS
You can run PASTIS in parallel by specifying the number of processes to the mpirun or mpiexec command. The number of processes must be perfect square value.
The main parameters of PASTIS are as follows:
- ```-i ```: Input FASTA file.
- ```-c ```: Number of sequences in the FASTA file.
- ```-k ```: K-mer length.
- ```--na```: Do not perform alignment.
- ```--sfa```: Seqan Smith-Waterman alignment.
- ```--sxa ```: Seqan X-drop alignment with the indicated drop value.
- ```--absw```: ADEPT Striped Smith-Waterman (only available for GPUs).
- ```--af ```: Final similarity matrix in matrix market file format.
The additional parameters of PASTIS are as follows:
- ```-O ```: Fasta file parallel read overlap in bytes. ```[default: 10000]```
- ```--sc ```: Maximum seed count. ```[default: 2]```
- ```-s ```: K-mer stride. ```[default: 1]```
- ```-g ```: Gap open penalty (negative). ```[default: -11]```
- ```-e ```: Gap extension penalty (negative). ```[default: -2]```
- ```--ckthr ```: Common k-mer threshold. The sequence pairs that have
less or equal to this number of common k-mers are not aligned. ```[default: 0]```
- ```--mosthr ```: Maximum overlap score threshold. The maximum overlap
score is given by the highest overlap region of k-mers divided by the number
k-mer length. The pairs whose such score is less than or equal to ```mosthr```
are not aligned. ```[default: 0]```
- ```--alph ```: Reduced alphabet to use. Default is not to use any kind of reduced alphabet. Choose from ```{pdefault, murphy10, dssp10, gbmr10, td10, diamond}```.
- ```--bsz ```: Batch alignment size. ```[default: 1e7]```
- ```--br ```: Block multiplication row dimension. ```[default: 1]```
- ```--bc ```: Block multiplication column dimension. ```[default: 1]```
Example run:
```mpirun -N 4 ./build/pastis -i ./tests/scope_77k.fa -c 77040 --af sim_mat.mtx --ckthr 1 --sxa 49```
## GPU support
PASTIS has GPU support for NVIDIA and AMD GPUs. Switch to CUDA or HIP branch to run PASTIS on systems with accelerators. The alignment option should be provided as ```--absw``` when running on GPUs as it is the alignment option supported for GPUs. To compile the ADEPT library, use:
* cd ADEPT
* mkdir build
* cd build
* cmake ..
* make
In building the HIP branch, to build ADEPT you need to use the custom cmake file CMakeLists.txt-ADEPT provided in the repo. Copy this file into ADEPT directory by ```cp CMakeLists.txt-ADEPT ADEPT/CMakeLists.txt``` and follow the instructions provided above. In order to use HIP compiler, use ```cmake .. -DCMAKE_CXX_COMPILER=hipcc``` to configure make.
## Notes for running PASTIS on large datasets
PASTIS supports a blocking mode for large datasets to perform similarity search in a blocked manner and save memory. In default, this mode is not enabled. If you run into out of memory errors, provide the number of blocks you want PASTIS to use with the options ```--br ``` and ```--bc ```, where the target number of blocks is equal to the multiplication of these two parameters.
## Notes for running PASTIS on NERSC Cori
The necessary modules for running PASTIS on Cori are as follows:
* module swap PrgEnv-intel PrgEnv-gnu
* module load cmake
Build PASTIS with the wrappers:
* cmake .. -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC
* make
## Citation
If you use PASTIS in your research, please cite our paper describing the algorithm and the implementation:
* Oguz Selvitopi*, Saliya Ekanayake*, Giulia Guidi, Georgios Pavlopoulos, Ariful Azad, and Aydın Buluç. Distributed Many-to-Many Protein Sequence Alignment Using Sparse Matrices. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’20), 2020. (*:joint first authors) [pdf](https://people.eecs.berkeley.edu/~aydin/PASTIS-SC20.pdf)
* Selvitopi, Oguz, Saliya Ekanayake, Giulia Guidi, Muaaz G. Awan, Georgios A. Pavlopoulos, Ariful Azad, Nikos Kyrpides, Leonid Oliker, Katherine Yelick, and Aydin Buluç. "Extreme-scale many-against-many protein similarity search." In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. IEEE, 2022. **SC'22 Gordon Bell Prize Finalist**
## Copyright
PASTIS Copyright (c) 2020, The Regents of the University of California,
through Lawrence Berkeley National Laboratory (subject to receipt of
any required approvals from the U.S. Dept. of Energy). All rights reserved.
If you have questions about your rights to use or distribute this software,
please contact Berkeley Lab's Intellectual Property Office at
IPO@lbl.gov.
NOTICE. This Software was developed under funding from the U.S. Department
of Energy and the U.S. Government consequently retains certain rights. As
such, the U.S. Government has been granted for itself and others acting on
its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the
Software to reproduce, distribute copies to the public, prepare derivative
works, and perform publicly and display publicly, and to permit others to do so.