https://github.com/epfl-systemf/regelk

OCaml linear engine for JavaScript regexes, implementing the algorithms described in "Linear Matching of JavaScript Regular Expressions" (PLDI 2024).

# RegElk - OCaml Linear Engine for JavaScript Regexes
Authors: [Aurèle Barrière](https://aurele-barriere.github.io/) and [Clément Pit-Claudel](https://pit-claudel.fr/clement/).

## About
This is a linear regular expression engine for a subset of JavaScript regexes.
The underlying algorithm is an extension of the [PikeVM](https://swtch.com/~rsc/regexp/regexp2.html), supporting more JavaScript features.
This engine implements the algorithms described in the paper [Linear Matching of JavaScript Regular Expressions](https://arxiv.org/abs/2311.17620) by the same authors.

In particular, it supports, for the first time with linear time and space complexity:
- nullable JavaScript quantifiers (these have different semantics than in other regex languages, see for instance `(a?b??)*` on string "ab")
- capture reset, a JavaScript-specific property where capture groups are reset at each quantifier iteration (for instance `((a)|(b))*` on string "ab")
- all lookarounds (lookaheads and lookbehinds), even with capture groups inside
- linear matching of the greedy or nullable plus.
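The two JavaScript-specific behaviors named in the list above can be observed directly in any JavaScript runtime. The snippet below is only an illustration of the semantics: it runs the host's built-in (backtracking) regex engine, not RegElk, and the variable names are ours.

```javascript
// Nullable quantifier: in JavaScript, a star iteration that matches the
// empty string is rejected, so the lazy `b??` is forced to consume "b"
// during the second iteration.
const nullable = /(a?b??)*/.exec("ab");
console.log(nullable[0], nullable[1]); // ab b

// Capture reset: groups inside a quantifier are cleared at the start of
// each iteration, so group 2 (set to "a" in the first iteration) is
// undefined once the second iteration has matched "b".
const reset = /((a)|(b))*/.exec("ab");
console.log(reset[0], reset[1], reset[2], reset[3]); // ab b undefined b
```

An engine that claims JavaScript compatibility has to return these same matches and capture values.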

RegElk means **Reg**ex **E**ngine with **L**inear loo**K**arounds.
Elks are [diagonal walkers](https://ecowellness.com/animal-tracking-part-2-common-gait-patterns/), meaning that they reuse their front legs prints for their rear legs to conserve energy, evoking how a PikeVM merges threads reaching the same state to preserve linearity.

![RegElk](etc/regelk_logo.jpg)
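The thread merging mentioned above can be sketched with a minimal Pike-style simulation. This is a simplified illustration and not RegElk's implementation: it has no capture groups, no lookarounds and none of the JavaScript-specific semantics, it only answers whether the whole input matches, and the instruction names (`char`, `split`, `jmp`, `match`) and the hand-written program are assumptions made for the example.

```javascript
// Hypothetical bytecode for (a*)*b, a pattern that makes backtracking
// engines take exponential time on inputs such as "aaaa...a".
const program = [
  { op: "split", first: 1, second: 5 }, // 0: enter the outer star or leave it
  { op: "split", first: 2, second: 4 }, // 1: enter the inner star or leave it
  { op: "char", c: "a" },               // 2
  { op: "jmp", to: 1 },                 // 3: loop the inner star
  { op: "jmp", to: 0 },                 // 4: loop the outer star
  { op: "char", c: "b" },               // 5
  { op: "match" },                      // 6
];

// Returns true when the whole input is matched by the program.
function pikeMatches(prog, input) {
  // Adds a thread at `pc`, following jumps and splits eagerly.
  // `seen` ensures each program counter is added at most once per step:
  // a later thread reaching an already-seen state is merged (dropped).
  const addThread = (list, seen, pc) => {
    if (seen.has(pc)) return;
    seen.add(pc);
    const inst = prog[pc];
    if (inst.op === "jmp") {
      addThread(list, seen, inst.to);
    } else if (inst.op === "split") {
      addThread(list, seen, inst.first);
      addThread(list, seen, inst.second);
    } else {
      list.push(pc);
    }
  };

  let current = [];
  addThread(current, new Set(), 0);
  for (let i = 0; i <= input.length; i++) {
    const next = [];
    const seen = new Set();
    for (const pc of current) {
      const inst = prog[pc];
      if (inst.op === "match") {
        if (i === input.length) return true;
      } else if (i < input.length && input[i] === inst.c) {
        addThread(next, seen, pc + 1);
      }
    }
    current = next;
  }
  return false;
}

console.log(pikeMatches(program, "aab"));         // true
console.log(pikeMatches(program, "a".repeat(30))); // false
```

Because of the `seen` set, a step never holds more threads than there are instructions, so the total work is bounded by the program size times the input length.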

## Complexity

Given a regex of size `|r|` and a string of size `|s|`, this engine has a worst-case time complexity of `O(|r|*|s|)`, linear in both.
While counted quantifiers are supported, they increase the regex size.
For instance, `e{4,8}` duplicates `e` 8 times.
However, the greedy plus (`+` or `{1,}`) and the non-nullable lazy plus (as in `(ab)+?`) are handled without duplication.
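To see why a counted quantifier costs regex size, here is a small sketch that unrolls `e{min,max}` into `min` mandatory copies followed by `max - min` nested optional copies. The helper `expandCounted` is hypothetical and only illustrates the size growth; the source does not describe how the engine actually performs this expansion, and the textual rewriting below is only faithful for sub-patterns without capture groups.

```javascript
// Hypothetical helper: unrolls e{min,max} (greedy, max finite) into
// copies of e. The result contains exactly `max` copies of the sub-pattern.
function expandCounted(e, min, max) {
  const atom = `(?:${e})`;
  let optionalTail = "";
  for (let i = 0; i < max - min; i++) {
    // Each extra repetition is optional and nested inside the previous one.
    optionalTail = `(?:${atom}${optionalTail})?`;
  }
  return atom.repeat(min) + optionalTail;
}

const unrolled = expandCounted("e", 4, 8);
// Count how many copies of the sub-pattern the unrolled regex contains.
const copies = unrolled.split("(?:e)").length - 1;
console.log(copies); // 8
```

The size of the unrolled pattern grows with `max`, which is why a large bound on a large sub-pattern makes `|r|` grow quickly.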

The engine also has `O(|r|*|s|)` space complexity.
To avoid a space complexity that depends on the string size, we provide alternative register data structures, which offer different time-space tradeoffs.

| Register structure | Time Complexity | Space Complexity |
|----------------|-----------------------------|------------------|
| List (default) | `O(\|r\|*\|s\|)` | `O(\|r\|*\|s\|)` |
| Array | `O(\|r\|^2*\|s\|)` | `O(\|r\|^2)` |
| Tree | `O(\|r\|*log(\|r\|)*\|s\|)` | `O(\|r\|^2)` |

Note, however, that an `O(|r|*|s|)` space complexity cannot be avoided when using our linear lookaround algorithm.

## Supported Features

| Feature | Example |
|-------------------------------|-------------------------------------------|
| Lookaheads | `a(?=(b))`, `a(?!b)` |
| Lookbehinds | `(?<=b)a`, `(?