{"id":20810102,"url":"https://github.com/lh3/mssa-bench","last_synced_at":"2025-04-15T07:20:38.444Z","repository":{"id":243116441,"uuid":"809945652","full_name":"lh3/mssa-bench","owner":"lh3","description":"Evaluating the performance of multi-string SA construction","archived":false,"fork":false,"pushed_at":"2025-03-22T01:07:23.000Z","size":160,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-22T01:29:56.275Z","etag":null,"topics":["suffix-array"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lh3.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-03T18:49:20.000Z","updated_at":"2025-03-22T01:07:26.000Z","dependencies_parsed_at":"2025-03-12T04:38:55.378Z","dependency_job_id":null,"html_url":"https://github.com/lh3/mssa-bench","commit_stats":null,"previous_names":["lh3/mssa-bench"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fmssa-bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fmssa-bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fmssa-bench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fmssa-bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lh3","download_url":"https://codeload.github.com/lh3/mssa-bench/tar.gz/refs/heads/master","host":{"name":"Gi
tHub","url":"https://github.com","kind":"github","repositories_count":249023976,"owners_count":21200003,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["suffix-array"],"created_at":"2024-11-17T20:19:43.042Z","updated_at":"2025-04-15T07:20:38.439Z","avatar_url":"https://github.com/lh3.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"This repo evaluates the performance of portable libraries for constructing the\nsuffix array (SA) of string sets. These libraries only include a few source\nfiles and have no dependencies. They can be building blocks of larger projects.\n\nThere are [different ways][ss-review] to define the SA of a string set. Here we\nfocus on the most common definition, as follows.  Let\n$`\\mathcal{T}=\\{T_1,T_2,\\ldots,T_n\\}`$ be a set of strings over $\\Sigma$. Their\nconcatenation is $`T=T_1\\$_1T_2\\$_2\\cdots T_n\\$_n`$ where\n$`\\$_1\u003c\\$_2\u003c\\cdots\u003c\\$_n`$ are smaller than all symbols in $\\Sigma$. The SA of\nstring set $`\\mathcal{T}`$ is defined as the SA of string $T$.\n\nFew SA construction libraries directly support string sets. Nonetheless, we can\nachieve the goal with any library that supports integer alphabets, such as\n[libsais][libsais], by converting $T$ to an integer array $X$:\n```math\nX[k]=\\left\\{\\begin{array}{ll}\ni \u0026 \\mbox{if }T[k]=\\$_i \\\\\nT[k]+n \u0026 \\mbox{otherwise}\n\\end{array}\\right.\n```\nThen the SA of $X$ will be identical to the SA of $T$. A disadvantage of this\nmethod is that we need to convert 8-bit characters to 32-bit or 64-bit\nintegers. 
This increases the memory footprint.\n\nTo alleviate the issue, I developed [ksa][ksa] in 2011 by adapting an old\nversion of [Yuta Mori][mori]'s sais. Briefly, during symbol comparisons, ksa\nimplicitly replaces a sentinel $`\\$_i`$ with $`j-|T|`$ where $j$ is the offset\nof $`\\$_i`$ in $T$. Symbol comparisons take more time, but we no longer\nneed to convert $T$ to an integer array and can thus save memory.\nThis repo includes an updated version named [msais][msais].\n\n[Published in 2017][gsacak-paper], [gSACA-K][gsacak] is another library based\non the linear-time SAIS algorithm. Please read its paper for details.\n\nHere is the timing for constructing the SA of the [CHM13v2 genome][chm13] on both strands (6.2\nbillion symbols in total) on a Xeon Gold 6130:\n\n|             |msais|gsaca-k|sais-t1|sais-t4|sais-t8|sais-t8b|sais-t8c|sais16-t8c|\n|:------------|---:|------:|------:|------:|------:|-------:|-------:|---------:|\n|# threads    |   1|      1|      1|      4|      8|       8|       8|         8|\n|Elapsed (s)  |1211|   3356|    588|    386|    260|     374|     473|       296|\n|CPU time (s) |1209|   3349|    587|   1152|   1439|    1895|    2602|      1146|\n|Peak RSS (GB)|52.3|   53.5|   92.9|   92.9|   92.9|    92.9|    92.9|      58.4|\n\nSome notes and observations:\n\n* libsais is clearly the fastest even on a single thread and we see noticeable\n  speedup with multiple threads. A caveat is that the multi-threading\n  performance of libsais appears to fluctuate considerably. For example,\n  the three sais-t8 runs were performed on different nodes with the same configuration but\n  their speeds were quite different. sais-t8c and sais16-t8c were run on the same\n  machine.\n\n* gSACA-K would crash if compiled with `-fopenmp`. I am not sure why.\n\n* msais is faster than gSACA-K and has a similar memory footprint. 
It would be\n  good to apply this msais strategy to libsais to reduce its peak memory.\n\n* We omitted [ropebwt2][rb2] and [BEETL][beetl] because they are slow for\n  chromosome-long strings and we omitted [grlBWT][grl] because it writes\n  temporary files and is not designed as a library. We did not evaluate\n  [eGAP][egap] because gSACA-K appears to be faster in multiple third-party\n  benchmarks.\n\n[libsais]: https://github.com/IlyaGrebnov/libsais\n[chm13]: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=T2T/CHM13/assemblies/analysis_set/\n[mori]: https://github.com/y-256\n[gsacak]: https://github.com/felipelouza/gsa-is\n[gsacak-paper]: https://www.sciencedirect.com/science/article/pii/S0304397517302621\n[ksa]: https://github.com/lh3/fermi/blob/master/ksa.c\n[fermi]: https://github.com/lh3/fermi\n[fermi-paper]: https://academic.oup.com/bioinformatics/article/28/14/1838/218887\n[ss-review]: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae333/7681884\n[rb2]: https://github.com/lh3/ropebwt2\n[grl]: https://github.com/ddiazdom/grlBWT\n[beetl]: https://github.com/BEETL/BEETL\n[egap]: https://github.com/felipelouza/egap\n[msais]: https://github.com/lh3/msais-lite\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fmssa-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flh3%2Fmssa-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fmssa-bench/lists"}