https://github.com/fwcd/hpc-smith-waterman
GPU-accelerated Smith-Waterman algorithm implementation in Rust
https://github.com/fwcd/hpc-smith-waterman
Last synced: 3 months ago
JSON representation
GPU-accelerated Smith-Waterman algorithm implementation in Rust
- Host: GitHub
- URL: https://github.com/fwcd/hpc-smith-waterman
- Owner: fwcd
- Created: 2022-02-15T21:56:56.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2022-02-21T17:11:09.000Z (over 4 years ago)
- Last Synced: 2026-02-15T16:40:19.816Z (4 months ago)
- Language: Rust
- Homepage:
- Size: 127 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Smith Waterman in Rust for HPC
A GPU-accelerated implementation of the Smith-Waterman algorithm for finding optimal local sequence aligments in Rust.
## Usage
> Note: If you want to build the program from source rather than use a prebuilt binary, substitute `cargo run --release --` for every occurrence of `hpc-smith-waterman` in the following commands. Prebuilt binaries for x86-64 Linux (using OpenCL ICD) and arm64 Darwin/macOS can be found in `bin`.
The program includes two main modes: `run` and `bench`.
### Run Mode
In the first mode, the program will run every engine on a single database/query pair. E.g.
```
hpc-smith-waterman run
```
will run the algorithm on the default pair of sequences (`TGTTACGG` and `GGTTGACTA`). You can, however, also specify a custom pair of sequences
```
hpc-smith-waterman run GATT ATBTAG
```
will run the algorithm on the given pair (`GATT` and `ATBAG`).
### Bench Mode
> Note: To use bench mode, you need to either make sure that a dataset exists at `data/uniprot_sprot.fasta` from your cwd (you can download this dataset with the script `scripts/download-dataset`) or point to a custom FASTA-dataset with `--path`.
In the second mode, the program will read a dataset and then compare the first sequence to all of the remaining sequences, again using each engine. During this, the elapsed time and the Giga-CUPS (Cell Operations Per Second) will be recorded.
The simplest way to invoke this mode is to not pass any arguments:
```
hpc-smith-waterman bench
```
This will use the aforementioned dataset (`data/unipro_sprot.fasta`) and run every engine. If you only wish to run a subset of engines, you can pass the engines you wish to run as arguments. The following engines are supported:
| Flag | Description |
| ---- | ----------- |
| `--naive` | A naive CPU engine |
| `--diagonal` | A CPU engine that parallelizes over diagonals |
| `--opencl-diagonal` | A GPU engine that parallelizes over diagonals |
| `--opencl-diagonal` | A GPU engine that parallelizes over diagonals |
| `--optimized-diagonal` | A CPU engine that parallelizes over diagonals and uses a cache-optimized (diagonal-major) matrix layout |
| `--optimized-opencl-diagonal` | A GPU engine that parallelizes over diagonals and uses a cache-optimized (diagonal-major) matrix layout |
For example, if you wish to bench the naive engine and the OpenCL diagonal engine, you could invoke the program as follows:
```
hpc-smith-waterman bench --naive --opencl-diagonal
```
You can customize the maximum number of query sequenced benchmarked against using `--number` aka. `-n`:
```
hpc-smith-waterman bench -n 2000
```
If you wish, you can also let each query sequence benchmarked against be repeated/cycled a certain number of times using `--repeat` aka. `-r` to make it longer (useful for benchmarking the performance of very long sequences). For example, to bench against 50 sequences, each repeated/cycled 10 times in length, run
```
hpc-smith-waterman bench -n 50 -r 10
```
If you have multiple GPUs installed, you can choose the GPU for OpenCL using `--gpu-index` (the default is 0), e.g. like this:
```
hpc-smith-waterman --gpu-index 1 bench --opencl-diagonal
```
## Performance Considerations
While the benchmarks already parallelize over the examples using CPU threads, there are some observations to keep in mind:
- The GPU engines generally only outperform the CPU engines on large sequences (since those let us parallelize the kernel well due to lots of diagonals)
- Additionally, there is overhead to using OpenCL (e.g. configuring kernels, queueing them, etc.), which makes the CPU variants often faster when benchmarking lots of short sequences
- The naive CPU variant is already pretty fast due to good cache coherency (we iterate the matrix in a natural way, the inner loop visits adjacent elements)
## Example Results
Example benchmark results on the Apple M1 Pro:
### Lots of short-ish sequences
```
$ hpc-smith-waterman bench -n 10000
┌──────────────────────────┐
│ Naive (CPU) (sequential) │
└──────────────────────────┘
Elapsed: 9.79s
Giga-CUPS: 0.38
Pairs: 10000
┌────────────────────────┐
│ Naive (CPU) (parallel) │
└────────────────────────┘
Elapsed: 1.59s
Giga-CUPS: 2.34
Pairs: 10000
┌───────────────────────────┐
│ Diagonal (CPU) (parallel) │
└───────────────────────────┘
Elapsed: 4.79s
Giga-CUPS: 0.78
Pairs: 10000
┌─────────────────────────────────────┐
│ Optimized Diagonal (CPU) (parallel) │
└─────────────────────────────────────┘
Elapsed: 3.38s
Giga-CUPS: 1.10
Pairs: 10000
┌────────────────────────────────────────────────┐
│ OpenCL Diagonal (GPU: Apple M1 Pro) (parallel) │
└────────────────────────────────────────────────┘
Elapsed: 14.75s
Giga-CUPS: 0.25
Pairs: 10000
┌──────────────────────────────────────────────────────────┐
│ Optimized OpenCL Diagonal (GPU: Apple M1 Pro) (parallel) │
└──────────────────────────────────────────────────────────┘
Elapsed: 19.68s
Giga-CUPS: 0.19
Pairs: 10000
```
Observation: CPU variants outperform GPU variants by quite a bit.
### Few, very large sequences
> We exclude the naive engine since it's too slow.
```
$ hpc-smith-waterman bench --diagonal --opencl-diagonal --optimized-diagonal --optimized-opencl-diagonal -n 5 -r 36
┌───────────────────────────┐
│ Diagonal (CPU) (parallel) │
└───────────────────────────┘
Elapsed: 10.71s
Giga-CUPS: 0.18
Pairs: 5
┌─────────────────────────────────────┐
│ Optimized Diagonal (CPU) (parallel) │
└─────────────────────────────────────┘
Elapsed: 4.90s
Giga-CUPS: 0.39
Pairs: 5
┌────────────────────────────────────────────────┐
│ OpenCL Diagonal (GPU: Apple M1 Pro) (parallel) │
└────────────────────────────────────────────────┘
Elapsed: 7.46s
Giga-CUPS: 0.25
Pairs: 5
┌──────────────────────────────────────────────────────────┐
│ Optimized OpenCL Diagonal (GPU: Apple M1 Pro) (parallel) │
└──────────────────────────────────────────────────────────┘
Elapsed: 2.14s
Giga-CUPS: 0.89
Pairs: 5
```
Observation: The GPU is good at crunching large matrices with lots of diagonals in parallel.