{"id":22194162,"url":"https://github.com/alpaylan/sars","last_synced_at":"2025-03-24T21:26:14.037Z","repository":{"id":109441441,"uuid":"485234066","full_name":"alpaylan/sars","owner":"alpaylan","description":"Suffix Array Library for Rust","archived":false,"fork":false,"pushed_at":"2022-04-27T03:25:18.000Z","size":19888,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-30T01:41:48.722Z","etag":null,"topics":["fasta","prefix-table","suffix-array"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alpaylan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-04-25T05:26:06.000Z","updated_at":"2022-04-27T03:26:00.000Z","dependencies_parsed_at":"2023-04-04T12:33:16.003Z","dependency_job_id":null,"html_url":"https://github.com/alpaylan/sars","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alpaylan%2Fsars","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alpaylan%2Fsars/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alpaylan%2Fsars/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alpaylan%2Fsars/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alpaylan","download_url":"https://codeload.github.com/alpaylan/sars/tar.gz/refs/heads/master","host":{"name":"GitHub","url
":"https://github.com","kind":"github","repositories_count":245353853,"owners_count":20601441,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fasta","prefix-table","suffix-array"],"created_at":"2024-12-02T13:11:42.076Z","updated_at":"2025-03-24T21:26:14.001Z","avatar_url":"https://github.com/alpaylan.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SARS:  Lightweight Suffix Arrays for Rust\n\n## Introduction\n\nI implemented the project in Rust, using `clap` for argument handling, \n`rust-bio` for suffix arrays, \n`bincode/serde` for serialization, \n`rustc-hash` for _FxHash_ \nfunctions and `bstr` for\n_byte/string_ conversions. \n\nGithub Link: [https://github.com/alpaylan/sars](https://github.com/alpaylan/sars)\n\nCrates.io Link: [https://crates.io/crates/sars](https://crates.io/crates/sars)\n\n## Running\n\nThere are two executables, both use API provided by sars.\n\n### buildsa\n\nBuildsa allows you to use custom `fasta` files for building suffix arrays over them.\n\nAfter downloading the project, you should be able to do   \n`cargo run --bin buildsa -- --help`   \nto see the options. 
\n\n```yaml\nsars 0.1.0\nAlperen Keles\nSARS is a Lightweight Suffix Arrays Library for Rust\n\nUSAGE:\n    buildsa [OPTIONS] [ARGS]\n\nARGS:\n    \u003creference\u003e    the path to a FASTA format file containing the reference of which you will\n                   build the suffix array\n    \u003coutput\u003e       the program will write a single binary output file to a file with this name,\n                   that contains a serialized version of the input string and the suffix array\n\nOPTIONS:\n    -h, --help           Print help information\n    -p, --preftab \u003ck\u003e    if the option --preftab is passed to the buildsa executable (with the\n                         parameter k), then a prefix table will be built atop the suffix array,\n                         capable of jumping to the suffix array interval corresponding to any prefix\n                         of length k\n    -V, --version        Print version information\n```\n\nYou will see the help text above. You can provide the relevant arguments for building your suffix array.\n\n```asm\nbuildsa --preftab 3 reference_file_path output_file_path\n=======================================================\nbuildsa reference_file_path output_file_path\n```\n\n\n### querysa\n\nQuerysa allows you to use custom `fasta` files for doing queries over an index.\n\nAfter downloading the project, you should be able to do   \n`cargo run --bin querysa -- --help`   \nto see the options.\n\n```yaml\nsars 0.1.0\n  Alperen Keles\n  SARS is a Lightweight Suffix Arrays Library for Rust\n\nUSAGE:\n  querysa [ARGS]\n\nARGS:\n  \u003cindex\u003e        the path to the binary file containing your serialized suffix array\n  \u003cqueries\u003e      the path to an input file in FASTA format containing a set of records. You\n  will need to care about both the name and sequence of these fasta records, as\n  you will report the output using the name that appears for a record. 
Note,\n  query sequences can span more than one line (headers will always be on one\n  line).\n  \u003cquerymode\u003e    this argument should be one of two strings; either naive or simpaccel. If the\n  string is naive you should perform your queries using the naive binary search\n  algorithm. If the string is simpaccel you should perform your queries using\n  the “simple accelerant” algorithm we covered in class [possible values;\n  naive, simpaccel]\n  \u003coutput\u003e       the name to use for the resulting output.\n\nOPTIONS:\n  -h, --help       Print help information\n  -V, --version    Print version information\n```\n\nYou will see the prompt above. You can provide the relevant arguments for querying your suffix array.\n\n```asm\nquerysa index_file_path query_file_path query_mode output_file_path\n```\n\n## Build\n\n**-- What did you find to be the most challenging part of implementing the buildsa program?**  \nI had to change my implementation that used Rust String constructs to Vec\u003cu8\u003e constructs, because the first implementation took an enormous time because of allocations I needed to do for substring equality check. It was the most exhausting part of the building process.\n\n**-- For references of various size:**  \n**--- How long does building the suffix array take?**  \nFor the given _ecoli_ data, it takes around 13 seconds to build the suffix array. 
The build time grows approximately as $O(N)$: each halving of the reference size roughly halves the time.\n\n| Reference Size (bytes) | Time (ms) |\n|--------------------------| ------- |\n| 4639676                  |  12536 |\n| 2319838                  |  6185 |\n| 1159919                  |  3092 |\n| 579960                   |  1531 |\n| 289980                   |  752 |\n| 144990                   |  374 |\n| 72495                    |  188 |\n| 36248                    |  93 |\n| 18124                    |  45 |\n| 9062                     |  22 |\n| 4531                     |  10 |\n| 2266                     |  5 |\n| 1133                     |  2 |\n| 567                      |  1 |\n\n    Table of reference size/time for different sizes. All measurements are done by cutting the given ecoli data by half each time\n\n\n**--- How large is the resulting serialized file?**  \nThe size of the suffix array is directly proportional to the size of the reference. When we halve the reference size, the suffix array size drops by the same factor.\n\n**--- For the times and sizes above, how do they change if you use the --preftab option with some different values of k?**  \nWe can see that up to prefix length 7, the size of the prefix table is negligible. 
But due to exponential growth, we see a very quick rise from that point on.\n\n| Prefix Length | Resulting Serialized File for Full Ecoli Index |\n|---------------| --- |\n| None          | 42M   |\n| 1             | 42M   |\n| 2             | 42M   |\n| 3             | 42M   |\n| 4             | 42M   |\n| 5             | 42M   |\n| 6             | 42M   |\n| 7             | 42M   |\n| 8             | 44M   |\n| 9             | 50M   |\n| 10            | 72M   |\n| 11            | 119M  |\n| 12            | 167M  |\n| 13            | 196M  |\n| 14            | 210M  |\n|  15           | 218M  |\n\n    Table of prefix length/file sizes\n\n| Prefix Length | Time to Create Prefix Table(As Milliseconds) |\n| --- |----------------------------------------------|\n| 1 | 746                                          |\n| 2 | 1148                                         |\n| 3 | 1617                                         |\n| 4 | 2497                                         |\n| 5 | 2449                                         |\n| 6 | 2844                                         |\n| 7 | 3376                                         |\n| 8 | 9317                                         |\n| 9 | 8504                                         |\n| 10 | 18488                                        |\n| 11 | 17606                                        |\n| 12 | 52644                                        |\n| 13 | 32074                                        |\n| 14 | 30912                                        |\n| 15 |                                              |\n    Time to create prefix tables on full reference\n\n**-- Given the scaling above, how large of a genome do you think you could construct the suffix array for on a machine with 32GB of RAM, why?**   \nWe have a 42MB serialized file for a roughly 4MB reference. Hence, the ratio is close to 1/10. 
Without a prefix table, we could scale up to a reference of roughly 3GB, about 700 times our current reference, which would be a genome of around 3 billion nucleotides.\n\n## Query\n**-- What did you find to be the most challenging part of implementing the query program?**  \nI changed my data representation halfway through the project for efficiency reasons, which resulted in various bugs in my `longest_common_prefix` and `simpaccel_search` functions; I spent a fair amount of time tracking down and fixing them. Using offsets into the reference instead of actual strings makes debugging much harder, because the data is hidden behind an indirection and it takes extra work to reconstruct it.\n\n**--- For references of various size:**  \n**--- How long does query take on references of different size, and on queries of different length?**  \n\n| Reference Size (bytes) | Time (s) | \n|--------------------------| --- |\n | 4639676                  | 66 |\n | 2319838                  | 35 |\n | 1159919                  | 18 |\n | 579960                   | 7 |\n | 289980                   | 3 |\n | 144990                   | 2 |\n | 72495                    | 1 |\n\n    Time for the naive algorithm to run on reference\n    \n\n**--- How does the speed of the naive lookup algorithm compare to the speed of the simpleaccel lookup algorithm?**  \n**-- How does the speed further compare when not using a prefix lookup table versus using a prefix lookup table with different values of k?**\n\n| Prefix Length  | Naive(1) | Naive(2) | Simpaccel(1) | Simpaccel(2) |\n|----------------| --- | --- | --- | --- |\n| None           | 67 | 67 | 2 | 2 |\n| 1              | 31 | 32 | 1 | 1 |\n| 2              | 46 | 48 | 2 | 2 |\n| 3              | 53 | 54 | 2 | 2 |\n| 4              | 41 | 40 | 1 | 1 |\n| 5              | 38 | 37 | 1 | 1 |\n| 6              | 42 | 34 | 1 | 1 |\n | 7              | 20 | 21 | 0 | 0 |\n | 8              | 16 | 
16 | 0 | 0 |\n | 9              | 10 | 10 | 0 | 0 |\n | 10             | 8 | 8 | 0 | 0 |\n | 11             | 5 | 5 | 0 | 0 |\n | 12             | 4 | 4 | 0 | 0 |\n | 13             | 3 | 3 | 0 | 0 |\n | 14             | 3 | 3 | 0 | 0 |\n | 15             | 3 | 3 | 0 | 0 |\n    \n    Time (in seconds) to do 10 queries for Ecoli Data\n    \n\n\n**-- Given the scaling above, and the memory requirements of each type of index, what kind of tradeoff do you personally think makes sense in terms of using more memory in exchange for faster search.** \n\nWe can see diminishing returns for prefix lengths around 7-10. Since the size of the prefix table is also negligible up to length 7, I think it makes sense to keep the prefix length in that region, depending on our time and space requirements. \n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falpaylan%2Fsars","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falpaylan%2Fsars","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falpaylan%2Fsars/lists"}