{"id":21963049,"url":"https://github.com/kampersanda/rcomp","last_synced_at":"2025-04-23T22:27:56.710Z","repository":{"id":87855174,"uuid":"365129372","full_name":"kampersanda/rcomp","owner":"kampersanda","description":"C++17 implementation of online RLBWT construction in optimal-time and BWT-runs bounded space","archived":false,"fork":false,"pushed_at":"2022-06-28T15:24:59.000Z","size":254,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-30T04:11:18.652Z","etag":null,"topics":["bwt","compression","rindex"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kampersanda.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-05-07T05:57:43.000Z","updated_at":"2023-01-30T23:45:45.000Z","dependencies_parsed_at":"2023-03-24T13:20:39.132Z","dependency_job_id":null,"html_url":"https://github.com/kampersanda/rcomp","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kampersanda%2Frcomp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kampersanda%2Frcomp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kampersanda%2Frcomp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kampersanda%2Frcomp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kampersanda","download_url":"https
://codeload.github.com/kampersanda/rcomp/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250525650,"owners_count":21445067,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bwt","compression","rindex"],"created_at":"2024-11-29T10:59:33.521Z","updated_at":"2025-04-23T22:27:56.695Z","avatar_url":"https://github.com/kampersanda.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# R-comp: Online RLBWT compression in optimal-time and BWT-runs bounded space\n\nThis is an experimental library of R-comp, an online RLBWT compression algorithm in optimal-time and BWT-runs bounded space, described in the paper: [An Optimal-Time RLBWT Construction in BWT-runs Bounded Space](https://arxiv.org/abs/2202.07885) (ICALP 2022)\nby Takaaki Nishimoto, Shunsuke Kanda, and Yasuo Tabei.\n\n## Build instructions\n\nYou can download and compile the library with the following commands:\n\n```shell\n$ git clone https://github.com/kampersanda/rcomp.git\n$ cd rcomp\n$ mkdir build\n$ cd build\n$ cmake ..\n$ make -j\n```\n\nThe code is written in C++17, so please install g++ \u003e= 7.0 or clang \u003e= 4.0. 
For the build system, CMake \u003e= 3.0 must be installed to compile the library.\n\nThe library employs the third-party libraries [cmd\\_line\\_parser](https://github.com/jermp/cmd_line_parser), [doctest](https://github.com/onqtam/doctest), [nameof](https://github.com/Neargye/nameof) and [tinyformat](https://github.com/c42f/tinyformat), whose header files are contained in this repository.\n\nThe code has been tested only on Mac OS X and Linux. That is, this library supports only UNIX-compatible operating systems.\n\n## Implementations\n\nThe library implements several data structures and provides the following variants of R-comp defined in `rlbwt_types`:\n\n- `rlbwt_types::lfig_naive` is a straightforward implementation of the LF-interval graph with `O(r)` nodes, and\n- `rlbwt_types::glfig_serialized\u003cg\u003e` is a space-efficient implementation of the LF-interval graph with `O(r/g)` nodes,\n\nwhere `r` is the number of BWT-runs.\n\nAlso, the library implements r-index on these data structures, providing `count` and `locate` queries in the compressed space. In the same manner as `rlbwt_types`, the variants are defined in `rindex_types`.\n\n## Limitations\n\n- An input text must NOT contain the `0x00` character because it is used as a special end marker.\n- In the current version, class `GroupedFIndex` resorts to static global variables. 
Please do NOT create multiple instances of `glfig_serialized`.\n\n## Sample usage\n\n### RLBWT\n\n`sample/sample_rlbwt.cpp` provides a sample usage.\n\n```c++\n#include \u003cstring\u003e\n\n#include \"rlbwt_types.hpp\"\n#include \"utils.hpp\"\n\nusing namespace rcomp;\n\nint main(int argc, char** argv) {\n    // Input text\n    const std::string text = \"abaababaab\";\n\n    // Construct the RLBWT by appending characters (with end-marker $) in reverse.\n    // Note that '\\0' is used for the end marker (i.e., the text should not contain '\\0').\n    rlbwt_types::glfig_serialized\u003c8\u003e::type rlbwt;\n    for (size_t i = 1; i \u003c= text.size(); i++) {\n        rlbwt.extend(text[text.size() - i]);\n    }\n\n    // Extract the resulted BWT-runs\n    std::cout \u003c\u003c \"BWT-runs: \";\n    rlbwt.output_runs([](const run_type\u0026 r) { std::cout \u003c\u003c r \u003c\u003c \",\"; });\n    std::cout \u003c\u003c std::endl;\n\n    // Decode the original text (except $) from the RLBWT\n    std::string decoded;\n    rlbwt.decode_text([\u0026](uchar_type c) { decoded.push_back(c); });\n    std::reverse(decoded.begin(), decoded.end());  // need to be reversed\n    std::cout \u003c\u003c \"text:    \" \u003c\u003c text \u003c\u003c std::endl;\n    std::cout \u003c\u003c \"decoded: \" \u003c\u003c decoded \u003c\u003c std::endl;\n\n    return 0;\n}\n```\n\nThe output will be\n\n```\nBWT-runs: (b,3),(a,1),(b,1),($,1),(a,5),\ntext:    abaababaab\ndecoded: abaababaab\n```\n\n### r-index\n\n`sample/sample_rindex.cpp` provides a sample usage.\n\n```c++\n#include \u003cstring\u003e\n\n#include \"rindex_types.hpp\"\n#include \"utils.hpp\"\n\nusing namespace rcomp;\n\nint main(int argc, char** argv) {\n    // Input text\n    const std::string text = \"abaababaab\";\n\n    // Construct the r-index by appending characters (with end-marker $) in reverse.\n    // Note that '\\0' is used for the end marker (i.e., the text should not contain '\\0').\n    
rindex_types::glfig_naive\u003c8\u003e::type rindex;\n    for (size_t i = 1; i \u003c= text.size(); i++) {\n        rindex.extend(text[text.size() - i]);\n    }\n\n    // Extract the resulted BWT-runs\n    std::cout \u003c\u003c \"BWT-runs: \";\n    rindex.output_runs([](const run_type\u0026 r) { std::cout \u003c\u003c r \u003c\u003c \",\"; });\n    std::cout \u003c\u003c std::endl;\n\n    // Decode the original text (except $) from the index\n    std::string decoded;\n    rindex.decode_text([\u0026](uchar_type c) { decoded.push_back(c); });\n    std::reverse(decoded.begin(), decoded.end());  // need to be reversed\n    std::cout \u003c\u003c \"text:    \" \u003c\u003c text \u003c\u003c std::endl;\n    std::cout \u003c\u003c \"decoded: \" \u003c\u003c decoded \u003c\u003c std::endl;\n\n    // Count the occurrences of the (reversed) query\n    const std::string query = \"aaba\";  // i.e., \"abaa\" in the original order\n    const size_type occ = rindex.count(make_range(query));\n    std::cout \u003c\u003c \"count(\" \u003c\u003c query \u003c\u003c \") = \" \u003c\u003c occ \u003c\u003c std::endl;\n\n    // Locate the (reversed) query\n    std::cout \u003c\u003c \"locate(\" \u003c\u003c query \u003c\u003c \") = {\";\n    rindex.locate(make_range(query), [\u0026](size_type pos) {\n        // We will get pos such that text_r[..pos] = \"..aaba\", where text_r = reverse(text+'$').\n        // In other words, we can extract the original position starting at \"abaa\" as follows.\n        std::cout \u003c\u003c text.size() - pos \u003c\u003c \",\";\n    });\n    std::cout \u003c\u003c \"}\" \u003c\u003c std::endl;\n\n    return 0;\n}\n```\n\nThe output will be\n\n```\nBWT-runs: (b,3),(a,1),(b,1),($,1),(a,5),\ntext:    abaababaab\ndecoded: abaababaab\ncount(aaba) = 2\nlocate(aaba) = {5,0,}\n```\n\n## Performance test\n\n### RLBWT\n\nThe executable `perf/perf_rlbwt` measures the performance of R-comp. 
The command line options are printed by specifying the parameter `-h`.\n\n```\n$ ./perf/perf_rlbwt -h\nUsage: ./perf/perf_rlbwt [-h,--help] input_path [-t rlbwt_type] [-r reverse_mode] [-T enable_test]\n\n input_path\n    Input file path of text\n [-t rlbwt_type]\n    Rlbwt data structure type: lfig | glfig_[8|16|32|64] (default=glfig_16)\n [-r reverse_mode]\n    Load the text in reverse? (default=1)\n [-T enable_test]\n    Test the data structure? (default=1)\n [-h,--help]\n    Print this help text and silently exits.\n```\n\n- For data structure type `t`, `lfig` is a straightforward implementation, and `glfig_g` is a memory-efficient implementation that uses the grouping technique with group size `g`.\n- When `r` is set to `1`, the text will be input in reverse order to build the RLBWT for the text in the original order.\n- When `T` is set to `1`, the correctness of the result will be tested.\n\nFor example, for dataset `alice29.txt`, the following command measures the performance of R-comp with data structure `glfig_16` and shows the detailed statistics.\n\n```\n$ ./perf/perf_rlbwt alice29.txt\n[Input_Params]\ninput_path:     alice29.txt\nrlbwt_type:     rcomp::Rlbwt_GLFIG\u003crcomp::GroupedLFIntervalGraph\u003crcomp::GroupedLData_Serialized\u003c16, false, 2, true, true\u003e, rcomp::GroupedFData_Serialized\u003crcomp::GroupedLData_Serialized\u003c16, false, 2, true, true\u003e \u003e, 7\u003e \u003e\nreverse_mode:   1\n[Progress_Report]\nnum_chars:      10000\nconstruction_sec:       0.002\n[Progress_Report]\nnum_chars:      100000\nconstruction_sec:       0.03\n[Final_Report]\nconstruction_sec:       0.048\nnum_runs:       66903\nnum_chars:      152090\ncompression_ratio:      0.439891\nalloc_memory_in_bytes:  2892218\nalloc_memory_in_MiB:    2.75823\npeak_memory_in_bytes:   3604480\npeak_memory_in_MiB:     3.4375\nTesting the data structure now...\nNo Problem!\nTesting the decoded text now...\nNo Problem!\n```\n\n### r-index\n\nThe executable `perf/perf_rindex` 
measures the performance of r-index on R-comp. The command line options are printed by specifying the parameter `-h`.\n\n```\n$ ./perf/perf_rindex -h\nUsage: ./perf/perf_rindex [-h,--help] input_path [-t rindex_type] [-r reverse_mode] [-T enable_test]\n\n input_path\n        Input file path of text\n [-t rindex_type]\n        Rindex data structure type: lfig | glfig_[8|16|32|64] (default=glfig_16)\n [-r reverse_mode]\n        Loading the text in reverse? (default=1)\n [-T enable_test]\n        Testing the data structure? (default=1)\n [-h,--help]\n        Print this help text and silently exits.\n```\n\nThe parameter settings are the same as for `perf_rlbwt`. The following command measures the performance of r-index on R-comp with data structure `glfig_16`.\n\n```\n$ ./perf/perf_rindex alice29.txt \n[Input_Params]\ninput_path:     alice29.txt\nrindex_type:    rcomp::Rindex_GLFIG\u003crcomp::GroupedLFIntervalGraph\u003crcomp::GroupedLData_Serialized\u003c16, true, 2, true, true\u003e, rcomp::GroupedFData_Serialized\u003crcomp::GroupedLData_Serialized\u003c16, true, 2, true, true\u003e \u003e, 7\u003e \u003e\nreverse_mode:   1\n[Progress_Report]\nnum_chars:      10000\nconstruction_sec:       0.005\n[Progress_Report]\nnum_chars:      100000\nconstruction_sec:       0.07\n[Final_Report]\nconstruction_sec:       0.116\nnum_runs:       66903\nnum_chars:      152090\ncompression_ratio:      0.439891\nalloc_memory_in_bytes:  6408190\nalloc_memory_in_MiB:    6.11133\npeak_memory_in_bytes:   8417280\npeak_memory_in_MiB:     8.02734\n[Search_Settings]\nnum_trials:     10\nnum_queries:    1000\nquery_length:   8\nquery_seed:     13\nWarming up now...\ndummy:  18858\n[Count_Query]\nocc_per_query:  18.858\nmicrosec_per_query:     4.19685\n[Locate_Query]\nocc_per_query:  37.716\nmicrosec_per_query:     7.14312\nmicrosec_per_occ:       0.189392\nTesting the data structure now...\nNo Problem!\nTesting the decoded text now...\nNo Problem!\n```\n\n## BWT tool\n\nThe executable 
`tool/transform` constructs the BWT text from a given text. You need to set `r` to `1` to output the BWT for the text in the original order.\n\n```\n$ ./tool/transform alice29.txt alice29.bwt -r 1\nConstructing now...\nRLBWT was constructed for 152090 chars.\nOutputting now...\nBWT-text was output.\nThe number of resulting runs was 66903.\n```\n\nNote that, when `r` is set to `0`, the BWT for the reversed text will be constructed.\n\n```\n$ ./tool/transform alice29.txt alice29.bwt -r 0\nWARNING: Since the text will be input in the original order, the BWT for the reversed text will be constructed.\nConstructing now...\nRLBWT was constructed for 152090 chars.\nOutputting now...\nBWT-text was output.\nThe number of resulting runs was 66186.\n```\n\n## r-index demo\n\nThe executable `tool/demo_rindex` offers a demo of `count` and `locate` queries using r-index.\n\n```\n$ ./tool/demo_rindex alice29.txt \nConstructing r-index...\n152090 characters indexed in 6327169 bytes = 6178.88 KiB = 6.03406 MiB.\n1. Enter query string to search.\n2. Enter \"exit\" to continue indexing.\n\u003e Dinah\nCount(\"Dinah\") = 14, done in 36.6 micro sec.\n1. Enter '1' to run locate with print.\n2. Enter '2' to run locate without print.\n3. Enter another not to run locate.\n\u003e 1\nLocate(\"Dinah\") = {43681, 4612, 5237, 33563, 35845, 5189, 33713, 32627, 32751, 21324, 36155, 4532, 4475, 32894, }\nLocate query, done in 76 micro sec, 5.42857 micro sec per occ.\n1. Enter query string to search.\n2. Enter \"exit\" to continue indexing.\n\u003e exit\nThanks!\n```\n\n## Unit test\n\nThe unit tests are written using [doctest](https://github.com/onqtam/doctest). 
After compiling, you can run the tests with the following command.\n\n```\n$ make test\n```\n\n## Authors\n\n- [Takaaki Nishimoto](https://github.com/TNishimoto)\n- [Shunsuke Kanda](https://github.com/kampersanda) (Creator)\n- [Yasuo Tabei](https://github.com/tb-yasu)\n\n## Licensing\n\nThis program is basically available only for academic use. For academic use, please comply with the [MIT License](https://github.com/kampersanda/rcomp/blob/main/LICENSE). For commercial use, please comply with GPL 2.0 and contact one of the authors.\n\nIf you use the library, please cite the following paper:\n\n```\n@inproceedings{nishimoto2022optimal,\n  author =\t{Nishimoto, Takaaki and Kanda, Shunsuke and Tabei, Yasuo},\n  title =\t{{An Optimal-Time RLBWT Construction in BWT-Runs Bounded Space}},\n  booktitle =\t{49th International Colloquium on Automata, Languages, and Programming (ICALP 2022)},\n  pages =\t{99:1--99:20},\n  year =\t{2022},\n  doi =\t\t{10.4230/LIPIcs.ICALP.2022.99},\n}\n```\n\n## Related software\n\n- [renum](https://github.com/TNishimoto/renum) is a C++ implementation of enumeration of characteristic substrings in BWT-runs bounded space.\n- [rlbwt\\_iterator](https://github.com/TNishimoto/rlbwt_iterator) is a C++ implementation of some iterators in BWT-runs bounded space.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkampersanda%2Frcomp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkampersanda%2Frcomp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkampersanda%2Frcomp/lists"}