{"id":20657386,"url":"https://github.com/epfl-systemf/regelk","last_synced_at":"2025-04-19T12:37:44.164Z","repository":{"id":231528833,"uuid":"781900388","full_name":"epfl-systemf/RegElk","owner":"epfl-systemf","description":"Ocaml Linear Engine for JavaScript Regexes, implementing the algorithms described in Linear Matching of JavaScript Regular Expressions at PLDI24","archived":false,"fork":false,"pushed_at":"2024-05-29T15:21:28.000Z","size":44871,"stargazers_count":9,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-05-30T04:59:16.396Z","etag":null,"topics":["javascript","linear","regex"],"latest_commit_sha":null,"homepage":"","language":"OCaml","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/epfl-systemf.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-04-04T08:59:05.000Z","updated_at":"2024-05-29T15:21:31.000Z","dependencies_parsed_at":"2024-04-04T13:58:36.371Z","dependency_job_id":null,"html_url":"https://github.com/epfl-systemf/RegElk","commit_stats":null,"previous_names":["epfl-systemf/regelk"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfl-systemf%2FRegElk","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfl-systemf%2FRegElk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfl-systemf%2FRegElk/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfl-systemf%2FRegElk/manifests","owner_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/owners/epfl-systemf","download_url":"https://codeload.github.com/epfl-systemf/RegElk/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224953364,"owners_count":17397688,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["javascript","linear","regex"],"created_at":"2024-11-16T18:20:26.022Z","updated_at":"2024-11-16T18:20:26.529Z","avatar_url":"https://github.com/epfl-systemf.png","language":"OCaml","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RegElk - OCaml Linear Engine for JavaScript Regexes\nAuthors: [Aurèle Barrière](https://aurele-barriere.github.io/) and [Clément Pit-Claudel](https://pit-claudel.fr/clement/).\n\n## About\nThis is a linear regular expression engine for a subset of JavaScript regexes.\nThe underlying algorithm is an extension of the [PikeVM](https://swtch.com/~rsc/regexp/regexp2.html), supporting more JavaScript features.\nThis engine implements the algorithms described in the paper [Linear Matching of JavaScript Regular Expressions](https://arxiv.org/abs/2311.17620) by the same authors.\n\nIn particular, it supports, for the first time with linear time and space complexity:\n- nullable JavaScript quantifiers (these have different semantics than in other regex languages, see for instance `(a?b??)*` on string \"ab\")\n- capture reset, a JavaScript-specific property where capture groups are reset at each quantifier iteration (for instance `((a)|(b))*` on string \"ab\")\n- all lookarounds (lookaheads and lookbehinds), even with capture groups inside\n- 
linear matching of the greedy or nullable plus.\n\nRegElk means **Reg**ex **E**ngine with **L**inear loo**K**arounds. \nElks are [diagonal walkers](https://ecowellness.com/animal-tracking-part-2-common-gait-patterns/), meaning that they reuse their front leg prints for their rear legs to conserve energy, evoking how a PikeVM merges threads reaching the same state to preserve linearity.\n\n![RegElk](etc/regelk_logo.jpg)\n\n## Complexity\n\nGiven a regex of size `|r|` and a string of size `|s|`, this engine has worst-case time complexity linear in both of them: `O(|r|*|s|)`.\nWhile counted quantifiers are supported, they increase the regex size.\nFor instance, `e{4,8}` will multiply the size of `e` 8 times.\nHowever, the greedy plus (`+` or `{1,}`) and the nonnullable lazy plus (as in `(ab)+?`) are handled without duplication.\n\nThe engine also has `O(|r|*|s|)` space complexity.\nIf one wants to avoid a space complexity that depends on the string size, we provide alternative register data-structures, offering various time-space complexity tradeoffs.\n\n|                | Time Complexity             | Space Complexity |\n|----------------|-----------------------------|------------------|\n| List (default) | `O(\\|r\\|*\\|s\\|)`            | `O(\\|r\\|*\\|s\\|)` |\n| Array          | `O(\\|r\\|^2*\\|s\\|)`          | `O(\\|r\\|^2)`     |\n| Tree           | `O(\\|r\\|*log(\\|r\\|)*\\|s\\|)` | `O(\\|r\\|^2)`     |\n\nNote however that a `O(|r|*|s|)` space complexity cannot be avoided when using our linear lookaround algorithm.\n\n## Supported Features\n\n| Feature                       | Example                                   |\n|-------------------------------|-------------------------------------------|\n| Lookaheads                    | `a(?=(b))`, `a(?!b)`                      |\n| Lookbehinds                   | `(?\u003c=b)a`, `(?\u003c!b)a`                      |\n| Capture Groups                | `(a*)b`                                   |\n| Noncapturing Groups       
    | `(?:a*)b`                                 |\n| Greedy Quantifiers            | `*`, `+`, `?`                             |\n| Lazy Quantifiers              | `*?`, `+?`, `??`                          |\n| Counted Quantifiers           | `a{6,12}`, `a{7,}`, `a{9}`, `a{4,5}?`     |\n| Character Classes             | `[a-z]`, `[^h]`, `[aeiouy]`               |\n| Character Groups              | `\\w`, `\\d`, `\\s`, `\\W`, `\\D`, `\\S`        |\n| Anchors                       | `$`, `^`                                  |\n| Word Boundaries               | `\\b`, `\\B`                                |\n\nBackreferences are not supported, as they make the matching problem [NP-hard](https://perl.plover.com/NPC/NPC-3SAT.html).\nNamed capture groups, hexadecimal escapes, unicode escapes, unicode properties and regex flags are not supported yet, although they could be in the future.\n\n\n## Dependencies\nYou need the following Opam packages.\nOther version numbers may also work.\n- Ocaml 5.0\n- ocamlbuild 0.14.1\n- Menhir 20220210\n- ocaml_intrinsics v0.15.2\n- core_bench v0.15.0\n- core v0.15.1\n- core_unix v0.15.2\n- yojson 2.1.0\n\nYou also need to install Node.JS and have `node` in your path.\n\n## Usage\nBuild all executables with `make`. 
\nMake sure your opam switch is configured with all the dependencies listed above.\nThis creates several executables:\n\n- `main.native` is the OCaml matcher\n- `fuzzer.native` is a fuzzer that compares the OCaml matcher to the Irregexp engine of V8 in Node\n- `tests.native` contains a battery of tests that should all succeed\n- `stats.native` computes regex feature usage statistics from corpora of regexes\n- `benchmark.native` allows you to run benchmarks\n- `matcher.native` and `linearbaseline.native` are only used for the benchmarks and you should not run them directly\n\n\n`main.native`, `fuzzer.native` and `benchmark.native` have command line options that can be printed with the argument `--help`.\n\n## Files\n\n### OCaml Engine Files\n- the main entry point of the engine is in the `main.ml` file\n- regexes and regex annotations are found in `regex.ml`\n- the extended bytecode NFA representation is defined in `bytecode.ml`\n- compilation from a regex to bytecode is found in `compiler.ml`\n- the NFA simulation algorithm, with all our extensions, is implemented in `interpreter.ml`\n- the CDN plus formulas are defined in `cdn.ml`\n- the oracles used by the lookaround algorithm are defined in `oracle.ml`\n- the three capture register implementations are defined in `regs.ml`\n- character classes are implemented in `charclasses.ml`\n- the ECMA-style regex parser is defined in `parser_src/`\n\n### Other Tools\n- the differential fuzzer is implemented in `fuzzer.ml`\n- the computation of statistics on regex feature usage is defined in `stats.ml`\n- the `scripts_bench` directory contains scripts called by the benchmarks or the fuzzer to compare the OCaml engine to other engines.\n\n## Correspondence between the Paper and the Code\n\n### Renamings\n- The linear engine from V8 is called \"V8Linear\" in the paper. 
It is sometimes called \"Experimental\" or \"Exp\" in the code (as this is the name used by the V8 developers).\n- The bytecode instructions `Consume` and `ConsumeAny` from Figure 4 are replaced by a single `Consume` instruction in `bytecode.ml`, which takes as argument either a character or a list of character ranges.\n- The `Jump` instruction is called `Jmp` in the code.\n- `SetReg` is called `SetRegisterToCP`.\n- `SetQuant` is called `SetQuantToClock`.\n- `CheckNull` is called `CheckNullable`.\n- `SetNullPlus` in the paper is not an independent instruction in the code. Instead, `SetQuantToClock` encodes both `SetQuant` and `SetNullPlus`: it takes a boolean argument indicating if this quantifier is a nulled plus or not.\n- The \"Balanced Tree\" register implementation in the paper is renamed to `Map_Regs` in the code.\n\n### Algorithm 1\nThis is implemented by the functions `advance_epsilon` and `find_match` in `interpreter.ml`.\n\n### Section 4.1\n- In `compiler.ml`, line 112, you can see the bytecode compilation of a quantifier, where `BeginLoop` and `EndLoop` instructions are inserted.\n- In `interpreter.ml`, the thread boolean `exit_allowed` encodes which of the automata the thread is in. 
See at lines 412 and 417 how the two new instructions are implemented.\n\n### Section 4.2\n- Threads are augmented with clocks in `quant_regs` (line 107 of `interpreter.ml`).\n- The filtering algorithm is implemented at line 285 of `interpreter.ml`.\n\n### Section 4.3\n- The oracle table is defined in `oracle.ml`.\n- The first phase is implemented as the `build_oracle` function, line 651 of `interpreter.ml`.\n- The second phase is simply the `find_match` function defined previously.\n- The third phase is implemented as the `build_capture` function, line 680 of `interpreter.ml`.\n\n### Section 4.4\nSwitch to the `strlb` directory for this algorithm, and see the corresponding `README.md` file.\n\n### Section 4.5\n- The nullability analysis of Section 4.5.1 and Figure 12 starts at line 139 in `regex.ml`.\n- The non-nullable plus case of Section 4.5.2 is implemented at line 78 in `compiler.ml`.\n- For Section 4.5.3, see lines 90 and 101 of `compiler.ml`. \n- The nullability formulas of footnote 8 are defined in `cdn.ml` and called \"CDN formulas\".\n\n### Section 4.6\n- The three different register data-structures are defined in `regs.ml`: `Array_Regs`, `List_Regs` and `Map_Regs`.\n- At line 28 of `interpreter.ml`, see that the interpreter is parameterized by a register implementation (a `REGS` module).\n- The benchmark used for Figure 15 is defined as `dsarray`, `dslist` and `dstree` in `benchmark_vectors.ml`.\n\n### Section 5.1\n- All regex corpora are in the `corpus` directory.\n- The `stats.ml` file uses the parser to analyze each regex and see in which categories it belongs to.\n\n### Section 5.2\n- (C1) the benchmark used for Figure 17 is named \"Clocks\", defined line 117 of `benchmark_vectors`.\n- (C2) the benchmark used for Figure 18 is named \"NNPlus\", defined line 69.\n- (C3) the benchmark used for Figure 19 is named \"CDN\", defined line 95.\n- (C4) the benchmark used for Figure 20 is named \"LBstr\", defined line 179.\n- (C5) the benchmark used for Figure 
21 is named \"LAreg\", defined line 138.\n- (C5) the benchmark used for Figure 22 is named \"LAstr\", defined line 157.\n\n### Tests\n- Every pair of regex and string discussed in the paper has been added to our test suite.\n- These tests form the `paper_tests` list at line 290 of `tests.ml`.\n- This test suite is executed for each of the three register implementations when running `tests.native`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepfl-systemf%2Fregelk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fepfl-systemf%2Fregelk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepfl-systemf%2Fregelk/lists"}