
# SARS: Lightweight Suffix Arrays for Rust

## Introduction

I implemented the project in Rust, using `clap` for argument handling, `rust-bio` for suffix arrays, `bincode`/`serde` for serialization, `rustc-hash` for its _FxHash_ hash functions, and `bstr` for _byte/string_ conversions.

Github Link: [https://github.com/alpaylan/sars](https://github.com/alpaylan/sars)

Crates.io Link: [https://crates.io/crates/sars](https://crates.io/crates/sars)

## Running

There are two executables; both use the API provided by `sars`.

### buildsa

`buildsa` lets you build a suffix array over a custom `fasta` file.

After downloading the project, you should be able to do
`cargo run --bin buildsa -- --help`
to see the options.

```text
sars 0.1.0
Alperen Keles
SARS is a Lightweight Suffix Arrays Library for Rust

USAGE:
    buildsa [OPTIONS] [ARGS]

ARGS:
    the path to a FASTA format file containing the reference of which you
    will build the suffix array
    the program will write a single binary output file to a file with this
    name, that contains a serialized version of the input string and the
    suffix array

OPTIONS:
    -h, --help       Print help information
    -p, --preftab    if the option --preftab is passed to the buildsa
                     executable (with the parameter k), then a prefix table
                     will be built atop the suffix array, capable of jumping
                     to the suffix array interval corresponding to any
                     prefix of length k
    -V, --version    Print version information
```

You will see the help text above. Provide the relevant arguments to build your suffix array, for example:

```sh
# with a prefix table of prefix length 3
buildsa --preftab 3 reference_file_path output_file_path

# without a prefix table
buildsa reference_file_path output_file_path
```

### querysa

`querysa` lets you run queries from a custom `fasta` file against a prebuilt index.

After downloading the project, you should be able to do
`cargo run --bin querysa -- --help`
to see the options.

```text
sars 0.1.0
Alperen Keles
SARS is a Lightweight Suffix Arrays Library for Rust

USAGE:
    querysa [ARGS]

ARGS:
    the path to the binary file containing your serialized suffix array
    the path to an input file in FASTA format containing a set of records.
    You will need to care about both the name and sequence of these fasta
    records, as you will report the output using the name that appears for
    a record. Note, query sequences can span more than one line (headers
    will always be on one line).
    this argument should be one of two strings; either naive or simpaccel.
    If the string is naive you should perform your queries using the naive
    binary search algorithm. If the string is simpaccel you should perform
    your queries using the “simple accelerant” algorithm we covered in
    class [possible values: naive, simpaccel]
    the name to use for the resulting output.

OPTIONS:
    -h, --help       Print help information
    -V, --version    Print version information
```

You will see the help text above. Provide the relevant arguments to query your suffix array:

```sh
querysa index_file_path query_file_path query_mode output_file_path
```

## Build

**-- What did you find to be the most challenging part of implementing the buildsa program?**
I had to migrate my implementation from Rust `String` constructs to `Vec<u8>` constructs, because the first version took an enormous amount of time due to the allocations needed for substring equality checks. That migration was the most exhausting part of the building process.
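
As a minimal sketch of the difference (illustrative code, not the project's actual implementation): comparing suffixes as byte slices of the reference avoids allocating a fresh `String` for every comparison.

```rust
// Allocation-heavy version: materializes a fresh String per comparison.
fn starts_with_alloc(reference: &str, offset: usize, query: &str) -> bool {
    let suffix: String = reference[offset..].to_string(); // new allocation every call
    suffix.starts_with(query)
}

// Allocation-free version: compares byte slices in place, no allocation.
fn starts_with_slice(reference: &[u8], offset: usize, query: &[u8]) -> bool {
    reference[offset..].starts_with(query)
}
```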

**-- For references of various sizes:**
**--- How long does building the suffix array take?**
For the given _E. coli_ data, it takes around 13 seconds to build the suffix array. Its growth approximates $O(N)$: each halving of the reference size corresponds to a halving of the build time.

| Reference Size (bytes) | Build Time (ms) |
|---|---|
| 4639676 | 12536 |
| 2319838 | 6185 |
| 1159919 | 3092 |
| 579960 | 1531 |
| 289980 | 752 |
| 144990 | 374 |
| 72495 | 188 |
| 36248 | 93 |
| 18124 | 45 |
| 9062 | 22 |
| 4531 | 10 |
| 2266 | 5 |
| 1133 | 2 |
| 567 | 1 |

Reference size vs. build time. All measurements were made by repeatedly halving the given _E. coli_ data.
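
As a sanity check on the linear estimate (computed from the table above), the per-byte cost stays roughly constant across the whole range: $12536\,\text{ms} / 4639676\,\text{B} \approx 2.7\,\mu s$ per byte at the largest size and $45\,\text{ms} / 18124\,\text{B} \approx 2.5\,\mu s$ per byte near the smallest.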

**--- How large is the resulting serialized file?**
The size of the serialized suffix array is directly proportional to the size of the reference: when we halve the reference, the file size drops by the same factor.
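
As a rough breakdown (this assumes the on-disk format stores one 64-bit integer per suffix, which is my assumption about the serialization, not a direct measurement): $42\,\text{MB} / 4.64\,\text{MB} \approx 9$ bytes of index per reference byte, consistent with an 8-byte suffix offset plus the stored reference character for each position.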

**--- For the times and sizes above, how do they change if you use the --preftab option with some different values of k?**
We can see that up to prefix length 7 the prefix table's size is negligible, but from that point on the exponential growth in the number of possible prefixes causes a very quick rise.

| Prefix Length | Serialized File Size (full _E. coli_ index) |
|---|---|
| None | 42M |
| 1 | 42M |
| 2 | 42M |
| 3 | 42M |
| 4 | 42M |
| 5 | 42M |
| 6 | 42M |
| 7 | 42M |
| 8 | 44M |
| 9 | 50M |
| 10 | 72M |
| 11 | 119M |
| 12 | 167M |
| 13 | 196M |
| 14 | 210M |
| 15 | 218M |

Prefix length vs. serialized file size.
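
One way to read the plateau (an interpretation of the numbers, not a separate measurement): over the 4-letter DNA alphabet, the number of distinct length-$k$ prefixes is bounded by $\min(4^k, N)$ for a reference of length $N$. With $N \approx 4.6 \times 10^6$, $4^k$ overtakes $N$ near $k = \log_4 N \approx 11$, which matches where the growth in the table starts slowing toward the ~218M ceiling.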

| Prefix Length | Prefix Table Build Time (ms) |
|---|---|
| 1 | 746 |
| 2 | 1148 |
| 3 | 1617 |
| 4 | 2497 |
| 5 | 2449 |
| 6 | 2844 |
| 7 | 3376 |
| 8 | 9317 |
| 9 | 8504 |
| 10 | 18488 |
| 11 | 17606 |
| 12 | 52644 |
| 13 | 32074 |
| 14 | 30912 |
| 15 | |

Time to create prefix tables on the full reference.

**-- Given the scaling above, how large of a genome do you think you could construct the suffix array for on a machine with 32GB of RAM, why?**
We have a 42 MB serialized file for a roughly 4.6 MB reference, so the index is about nine times the size of the reference. Without a prefix table, a machine with 32 GB of RAM could therefore handle a reference of roughly 3 GB, about 650 times our current reference, which corresponds to about 3 billion nucleotides.
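
The underlying arithmetic (a back-of-the-envelope estimate; real capacity also depends on OS overhead and working memory during construction): $32\,\text{GB} \times (4.64\,\text{MB} / 42\,\text{MB}) \approx 3.5\,\text{GB}$ of reference, rounded down to about 3 GB for headroom.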

## Query

**-- What did you find to be the most challenging part of implementing the query program?**
I changed my data representation halfway through the project for efficiency reasons, which resulted in various bugs in my `longest_common_prefix` and `simpaccel_search` functions; I spent a fair amount of time finding and fixing them. The indirection of using an offset into the reference instead of actual strings makes debugging much harder, because the data is inherently hidden and takes extra work to reconstruct; see the sketch below.
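
For illustration, a hypothetical helper (not part of `sars`) that materializes a readable prefix of the suffix at a given offset, so offset-based state can actually be printed while debugging:

```rust
// Hypothetical debugging aid: turn (reference, offset) back into visible
// text by materializing the first `n` bytes of the suffix at `offset`.
fn suffix_preview(reference: &[u8], offset: usize, n: usize) -> String {
    let end = (offset + n).min(reference.len());
    String::from_utf8_lossy(&reference[offset..end]).into_owned()
}

fn main() {
    let reference = b"ACGTACGTTT";
    // The suffix at offset 4 is "ACGTTT"; preview its first 4 bytes.
    println!("{}", suffix_preview(reference, 4, 4)); // prints "ACGT"
}
```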

**-- For references of various sizes:**
**--- How long does query take on references of different size, and on queries of different length?**

| Reference Size (bytes) | Query Time (s) |
|---|---|
| 4639676 | 66 |
| 2319838 | 35 |
| 1159919 | 18 |
| 579960 | 7 |
| 289980 | 3 |
| 144990 | 2 |
| 72495 | 1 |

Time for the naive algorithm to run on references of different sizes.

**--- How does the speed of the naive lookup algorithm compare to the speed of the simpaccel lookup algorithm?**
**-- How does the speed further compare when not using a prefix lookup table versus using a prefix lookup table with different values of k?**

| Prefix Length | Naive (1) | Naive (2) | Simpaccel (1) | Simpaccel (2) |
|---|---|---|---|---|
| None | 67 | 67 | 2 | 2 |
| 1 | 31 | 32 | 1 | 1 |
| 2 | 46 | 48 | 2 | 2 |
| 3 | 53 | 54 | 2 | 2 |
| 4 | 41 | 40 | 1 | 1 |
| 5 | 38 | 37 | 1 | 1 |
| 6 | 42 | 34 | 1 | 1 |
| 7 | 20 | 21 | 0 | 0 |
| 8 | 16 | 16 | 0 | 0 |
| 9 | 10 | 10 | 0 | 0 |
| 10 | 8 | 8 | 0 | 0 |
| 11 | 5 | 5 | 0 | 0 |
| 12 | 4 | 4 | 0 | 0 |
| 13 | 3 | 3 | 0 | 0 |
| 14 | 3 | 3 | 0 | 0 |
| 15 | 3 | 3 | 0 | 0 |

Time (in seconds) to run 10 queries on the full _E. coli_ data.
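
For context, here is a minimal sketch of the “simple accelerant” idea (my own illustrative version, not the project's `simpaccel_search`): the binary search tracks the length of the query's common prefix with both search boundaries and skips the minimum of the two at every comparison, since every suffix between the boundaries is known to agree with the query on that many characters.

```rust
/// Longest common prefix length of `a` and `b`, scanning from `skip`
/// (the caller guarantees the first `skip` bytes already match).
fn lcp_from(a: &[u8], b: &[u8], skip: usize) -> usize {
    let mut i = skip;
    while i < a.len() && i < b.len() && a[i] == b[i] {
        i += 1;
    }
    i
}

/// First suffix-array position whose suffix is >= `query`, skipping
/// `min(lcp_lo, lcp_hi)` known-equal bytes at each comparison.
fn simpaccel_lower_bound(text: &[u8], sa: &[usize], query: &[u8]) -> usize {
    let (mut lo, mut hi) = (0, sa.len());
    let (mut lcp_lo, mut lcp_hi) = (0, 0);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let suffix = &text[sa[mid]..];
        let l = lcp_from(query, suffix, lcp_lo.min(lcp_hi));
        // The suffix sorts before the query iff the first mismatch favors the query.
        if l < query.len() && (l == suffix.len() || suffix[l] < query[l]) {
            lo = mid + 1;
            lcp_lo = l;
        } else {
            hi = mid;
            lcp_hi = l;
        }
    }
    lo
}

fn main() {
    let text = b"banana";
    let sa = [5usize, 3, 1, 0, 4, 2]; // suffixes of "banana" in sorted order
    assert_eq!(simpaccel_lower_bound(text, &sa, b"ana"), 1);
}
```

The naive variant is the same loop with `skip` fixed at 0; avoiding those re-comparisons is consistent with the gap between the Naive and Simpaccel columns above.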

**-- Given the scaling above, and the memory requirements of each type of index, what kind of tradeoff do you personally think makes sense in terms of using more memory in exchange for faster search.**

We can see diminishing returns around prefix lengths 7-10. Since, as shown above, the prefix table's size is negligible up to length 7, I think it makes sense to keep k in that region, with the exact choice depending on our time and space requirements.