https://github.com/burntsushi/rebar

A biased barometer for gauging the relative speed of some regex engines on a curated set of tasks.
- Host: GitHub
- URL: https://github.com/burntsushi/rebar
- Owner: BurntSushi
- License: unlicense
- Created: 2023-01-27T12:46:21.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2024-10-01T15:34:25.000Z (about 1 year ago)
- Last Synced: 2025-03-28T16:45:59.037Z (7 months ago)
- Language: Python
- Size: 39.9 MB
- Stars: 247
- Watchers: 6
- Forks: 17
- Open Issues: 9
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
rebar
=====
A biased barometer for gauging the relative speed of some regex engines on a
curated set of tasks.

## Links
* [METHODOLOGY](METHODOLOGY.md) describes the motivation, design, benchmark
selection and evaluation protocol used by rebar.
* [BUILD](BUILD.md) describes how to build `rebar` and the regex engines it
measures.
* [TUTORIAL](TUTORIAL.md) provides a guided exploration of some of the most
useful `rebar` sub-commands.
* [CONTRIBUTING](CONTRIBUTING.md) describes how to add new benchmarks and how
to add a new regex engine to benchmark.
* [MODELS](MODELS.md) describes the different types of workloads measured.
* [FORMAT](FORMAT.md) describes the directory hierarchy and TOML format for
how benchmarks are defined.
* [KLV](KLV.md) describes the format of the data given to regex engine runner
programs that tells them how to execute a benchmark.
* [BIAS](BIAS.md) is a work-in-progress document describing the bias of this
barometer.
* [WANTED](WANTED.md) provides some ideas for other regex engines to add to
rebar.
* [BYOB](BYOB.md) discusses how to "bring your own benchmarks." That is, anyone
can use `rebar` with their own engine and benchmark definitions.

## Results
This section shows the results of a _curated and [biased](BIAS.md)_ set of
benchmarks. These reflect only a small subset of the benchmarks defined in
this repository, but were carefully crafted to attempt to represent a broad
range of use cases and annotated where possible with analysis to aid in the
interpretation of results.

The results begin with a summary, then a list of links to each benchmark group
and then finally the results for each group. Results are shown one benchmark
group at a time, where a single group is meant to combine related regexes or
workloads in a way that makes it useful to see how results change across regex
engines. Analysis is provided, at minimum, for every group, although the
analysis is heavily biased towards Rust's regex crate, as it is what this
author knows best. Contributions that discuss other regex engines are very
welcome.

Below each group of results are the parameters for each individual benchmark
within that group. An individual benchmark may contain some analysis specific
to it, but it will at least contain a summary of the benchmark details. Some
parameters, such as the haystack, are usually too big to show in this README.
One can use rebar to look at the haystack directly. Just take the `full name`
of the benchmark and give it to the `rebar haystack` command. For example:

```
$ rebar haystack unicode/compile/fifty-letters
ͱͳͷΐάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώϙϛϝϟϡϸϻͱͳͷΐάέή
```

Similarly, the full benchmark execution details (including the haystack) can
be seen with the `rebar klv` command:

```
$ rebar klv unicode/compile/fifty-letters
name:29:unicode/compile/fifty-letters
model:7:compile
pattern:7:\pL{50}
case-insensitive:5:false
unicode:4:true
haystack:106:ͱͳͷΐάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώϙϛϝϟϡϸϻͱͳͷΐάέή
max-iters:1:0
max-warmup-iters:1:0
max-time:1:0
max-warmup-time:1:0
```

Finally, you can run the benchmark yourself and look at results on the command
line:

```
$ rebar measure -f '^unicode/compile/fifty-letters$' | tee results.csv
$ rebar cmp results.csv
```

### Summary
Below are two tables summarizing the results of regex engines benchmarked.
Each regex engine includes its version at the time measurements were captured,
a summary score that ranks it relative to other regex engines across all
benchmarks and the total number of measurements collected.

The first table ranks regex engines based on search time. The second table
ranks regex engines based on compile time.

The summary statistic used is the [geometric mean] of the speed ratios for
each regex engine across all benchmarks that include it. The ratio within
each benchmark is computed by taking the median of all timing samples for an
engine and dividing it by the best median among the regex engines that
participated in the
benchmark. For example, given two regex engines `A` and `B` with results `35
ns` and `25 ns` on a single benchmark, `A` has a speed ratio of `1.4` and
`B` has a speed ratio of `1.0`. The geometric mean reported here is then the
"average" speed ratio for that regex engine across all benchmarks.

If you're looking to compare two regex engines specifically, then it is better
to do so based only on the benchmarks that they both participate in. For
example, to compare based on the results recorded on 2023-05-04, one can do:

```
$ rebar rank record/all/2023-05-04/*.csv -f '^curated/' -e '^(rust/regex|hyperscan)$' --intersection -M compile
Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
hyperscan 5.4.1 2023-02-22 2.03 25
rust/regex 1.8.1 2.13 25
```

**Caution**: Using a single number to describe the overall performance of a
regex engine is a fraught endeavor, and it is debatable whether it should be
included here at all. It is included primarily because the number of benchmarks
is quite large and overwhelming. It can be quite difficult to get a general
sense of things without a summary statistic. In particular, a summary statistic
is also useful to observe how the _overall picture_ itself changes as changes
are made to the barometer. (Whether it be by adding new regex engines or
adding/removing/changing existing benchmarks.) One particular word of caution
is that while geometric mean is more robust with respect to outliers than
arithmetic mean, it is not unaffected by them. Therefore, it is still critical
to examine individual benchmarks if one wants to better understand the
performance profile of any specific regex engine or workload.

[geometric mean]: https://dl.acm.org/doi/pdf/10.1145/5666.5673
#### Summary of search-time benchmarks
| Engine | Version | Geometric mean of speed ratios | Benchmark count |
| ------ | ------- | ------------------------------ | --------------- |
| [hyperscan](benchmarks/../engines/hyperscan) | 5.4.2 2023-04-22 | 2.40 | 28 |
| [rust/regex](benchmarks/../engines/rust/regex) | 1.10.2 | 3.14 | 38 |
| [dotnet/compiled](benchmarks/../engines/dotnet) | 8.0.0 | 4.35 | 34 |
| [pcre2/jit](benchmarks/../engines/pcre2) | 10.42 2022-12-11 | 6.03 | 34 |
| [dotnet/nobacktrack](benchmarks/../engines/dotnet) | 8.0.0 | 8.75 | 29 |
| [re2](benchmarks/../engines/re2) | 2023-11-01 | 10.45 | 31 |
| [javascript/v8](benchmarks/../engines/javascript) | 21.4.0 | 13.80 | 32 |
| [d/ldc/std-regex](benchmarks/../engines/d) | 2.105 | 22.24 | 31 |
| [regress](benchmarks/../engines/regress) | 0.7.1 | 32.44 | 32 |
| [java/hotspot](benchmarks/../engines/java) | 21.0.1+12-LTS-29 | 42.08 | 34 |
| [python/regex](benchmarks/../engines/python) | 2023.12.25 | 42.45 | 34 |
| [perl](benchmarks/../engines/perl) | 5.38.1 | 42.91 | 33 |
| [python/re](benchmarks/../engines/python) | 3.11.6 | 43.23 | 33 |
| [icu](benchmarks/../engines/icu) | 72.1.0 | 50.50 | 34 |
| [go/regexp](benchmarks/../engines/go) | 1.21.5 | 76.38 | 31 |
| [pcre2](benchmarks/../engines/pcre2) | 10.42 2022-12-11 | 117.74 | 33 |
| [rust/regex/lite](benchmarks/../engines/rust/regex-lite) | 0.1.5 | 162.11 | 28 |

#### Summary of compile-time benchmarks
| Engine | Version | Geometric mean of speed ratios | Benchmark count |
| ------ | ------- | ------------------------------ | --------------- |
| [pcre2](benchmarks/../engines/pcre2) | 10.42 2022-12-11 | 1.37 | 10 |
| [rust/regex/lite](benchmarks/../engines/rust/regex-lite) | 0.1.5 | 2.77 | 10 |
| [regress](benchmarks/../engines/regress) | 0.7.1 | 3.08 | 9 |
| [icu](benchmarks/../engines/icu) | 72.1.0 | 3.62 | 11 |
| [pcre2/jit](benchmarks/../engines/pcre2) | 10.42 2022-12-11 | 5.73 | 11 |
| [go/regexp](benchmarks/../engines/go) | 1.21.5 | 5.79 | 10 |
| [rust/regex](benchmarks/../engines/rust/regex) | 1.10.2 | 11.89 | 14 |
| [re2](benchmarks/../engines/re2) | 2023-11-01 | 12.58 | 10 |
| [dotnet/compiled](benchmarks/../engines/dotnet) | 8.0.0 | 20.51 | 10 |
| [python/re](benchmarks/../engines/python) | 3.11.6 | 37.71 | 11 |
| [python/regex](benchmarks/../engines/python) | 2023.12.25 | 104.77 | 11 |
| [dotnet/nobacktrack](benchmarks/../engines/dotnet) | 8.0.0 | 146.01 | 6 |
| [hyperscan](benchmarks/../engines/hyperscan) | 5.4.2 2023-04-22 | 564.38 | 7 |

### Benchmark Groups
Below is a list of links to each benchmark group in this particular barometer.
Each benchmark group contains 1 or more related benchmarks. The idea of each
group is to tell some kind of story about related workloads, and to give
a sense of how performance changes based on the variations between each
benchmark.

This report was generated by `rebar 0.1.0 (rev 79305bcb5f)`.

* [literal](#literal)
* [literal-alternate](#literal-alternate)
* [date](#date)
* [ruff-noqa](#ruff-noqa)
* [lexer-veryl](#lexer-veryl)
* [cloud-flare-redos](#cloud-flare-redos)
* [unicode-character-data](#unicode-character-data)
* [words](#words)
* [aws-keys](#aws-keys)
* [bounded-repeat](#bounded-repeat)
* [unstructured-to-json](#unstructured-to-json)
* [dictionary](#dictionary)
* [noseyparker](#noseyparker)
* [quadratic](#quadratic)

### literal
This group of benchmarks measures regex patterns that are simple literals. When
possible, we also measure case insensitive versions of the same pattern. We do
this across three languages: English, Russian and Chinese. For English, Unicode
mode is disabled while it is enabled for Russian and Chinese. (Which mostly
only matters for the case insensitive benchmarks.)

This group is mainly meant to demonstrate two things. The first is whether the
regex engine does some of the most basic forms of optimization by recognizing
that a pattern is just a literal, and that a full blown regex engine is
probably not needed. Indeed, naively using a regex engine for this case is
likely to produce measurements much worse than most regex engines. The second is
how the performance of simple literal searches changes with respect to both
case insensitivity and Unicode. Namely, substring search algorithms that work
well on ASCII text don't necessarily also work well on UTF-8 that contains many
non-ASCII codepoints. This is especially true for case insensitive searches.

Notice, for example, how RE2 seems to be faster in the `sherlock-casei-ru`
benchmark than in the `sherlock-ru` benchmark, even though the latter is "just"
a simple substring search whereas the former is a multiple substring search.
In the case of `sherlock-ru`, RE2 actually attempts a literal optimization that
likely gets caught up in dealing with a high false positive rate of candidates.
Whereas in the case of `sherlock-casei-ru`, no literal optimization is
attempted and instead its lazy DFA is used. The high false positive rate in the
simpler literal case winds up making it overall slower than it likely would be
if it just used the DFA.

This is not in any way to pick on RE2. Every regex engine that does literal
optimizations (and most do) will suffer from this kind of setback in one way
or another.

| Engine | sherlock-en | sherlock-casei-en | sherlock-ru | sherlock-casei-ru | sherlock-zh |
| - | - | - | - | - | - |
| d/ldc/std-regex | 8.1 GB/s | 1933.0 MB/s | 836.8 MB/s | 2.0 GB/s | 2.7 GB/s |
| dotnet/compiled | 14.0 GB/s | 13.1 GB/s | 24.7 GB/s | 12.6 GB/s | 30.4 GB/s |
| dotnet/nobacktrack | 8.8 GB/s | 7.9 GB/s | 8.7 GB/s | 5.2 GB/s | 33.2 GB/s |
| go/regexp | 4.2 GB/s | 47.1 MB/s | 2.1 GB/s | 35.6 MB/s | 2.1 GB/s |
| hyperscan | 29.9 GB/s | **29.1 GB/s** | 4.3 GB/s | 7.5 GB/s | **50.6 GB/s** |
| icu | 1603.6 MB/s | 451.4 MB/s | 3.1 GB/s | 283.1 MB/s | 4.2 GB/s |
| java/hotspot | 2.4 GB/s | 275.7 MB/s | 3.9 GB/s | 223.6 MB/s | 4.5 GB/s |
| javascript/v8 | 6.1 GB/s | 3.0 GB/s | **43.4 GB/s** | 3.3 GB/s | 10.5 GB/s |
| pcre2 | 7.1 GB/s | 849.1 MB/s | 2.1 MB/s | 2047.9 KB/s | 57.8 MB/s |
| pcre2/jit | 26.4 GB/s | 16.9 GB/s | 32.0 GB/s | **17.7 GB/s** | 36.9 GB/s |
| perl | 2.8 GB/s | 546.2 MB/s | 3.4 GB/s | 102.0 MB/s | 7.2 GB/s |
| python/re | 3.8 GB/s | 343.0 MB/s | 6.9 GB/s | 477.0 MB/s | 11.0 GB/s |
| python/regex | 3.5 GB/s | 2.8 GB/s | 4.5 GB/s | 3.9 GB/s | 6.8 GB/s |
| re2 | 11.1 GB/s | 2.5 GB/s | 764.2 MB/s | 948.0 MB/s | 2.7 GB/s |
| regress | 4.7 GB/s | 1133.9 MB/s | 4.7 GB/s | 296.0 MB/s | 4.8 GB/s |
| rust/regex | 27.8 GB/s | 11.4 GB/s | 29.3 GB/s | 8.8 GB/s | 39.9 GB/s |
| rust/regex/lite | 55.9 MB/s | 56.4 MB/s | 118.2 MB/s | - | 162.0 MB/s |
| rust/regexold | **32.0 GB/s** | 7.9 GB/s | 32.4 GB/s | 6.2 GB/s | 34.6 GB/s |

Show individual benchmark parameters.
**sherlock-en**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/01-literal/sherlock-en` |
| model | [`count`](MODELS.md#count) |
| regex | `````Sherlock Holmes````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`opensubtitles/en-sampled.txt`](benchmarks/haystacks/opensubtitles/en-sampled.txt) |
| count(`.*`) | 513 |

**sherlock-casei-en**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/01-literal/sherlock-casei-en` |
| model | [`count`](MODELS.md#count) |
| regex | `````Sherlock Holmes````` |
| case-insensitive | `true` |
| unicode | `false` |
| haystack-path | [`opensubtitles/en-sampled.txt`](benchmarks/haystacks/opensubtitles/en-sampled.txt) |
| count(`.*`) | 522 |

**sherlock-ru**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/01-literal/sherlock-ru` |
| model | [`count`](MODELS.md#count) |
| regex | `````Шерлок Холмс````` |
| case-insensitive | `false` |
| unicode | `true` |
| haystack-path | [`opensubtitles/ru-sampled.txt`](benchmarks/haystacks/opensubtitles/ru-sampled.txt) |
| count(`.*`) | 724 |

**sherlock-casei-ru**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/01-literal/sherlock-casei-ru` |
| model | [`count`](MODELS.md#count) |
| regex | `````Шерлок Холмс````` |
| case-insensitive | `true` |
| unicode | `true` |
| haystack-path | [`opensubtitles/ru-sampled.txt`](benchmarks/haystacks/opensubtitles/ru-sampled.txt) |
| count(`.*`) | 746 |

`rust/regex/lite` is not included because it doesn't support Unicode-aware
case insensitive matching.

**sherlock-zh**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/01-literal/sherlock-zh` |
| model | [`count`](MODELS.md#count) |
| regex | `````夏洛克·福尔摩斯````` |
| case-insensitive | `false` |
| unicode | `true` |
| haystack-path | [`opensubtitles/zh-sampled.txt`](benchmarks/haystacks/opensubtitles/zh-sampled.txt) |
| count(`.*`) | 30 |

### literal-alternate
This group is like `literal`, but expands the complexity from a simple literal
to a small alternation of simple literals, including case insensitive variants
where applicable. Once again, we do this across three languages: English,
Russian and Chinese. We disable Unicode mode for English but enable it for
Russian and Chinese. Enabling Unicode here generally only means that case
insensitivity takes Unicode case folding rules into account.

This benchmark ups the ante when it comes to literal optimizations. Namely,
for a regex engine to optimize this case, it generally needs to be capable of
reasoning about literal optimizations that require one or more literals from
a set to match. Many regex engines don't deal with this case well, or at all.
For example, after quickly comparing the `sherlock-en` benchmark here with the
one in the previous `literal` group, one thing that should stand out is the
proportion of regex engines that now measure throughput in MB/s instead of
GB/s.

One of the difficulties in optimizing for this case is that multiple substring
search is difficult to do in a way that is fast. In particular, this benchmark
carefully selected each alternation literal to start with a different character
than the other alternation literals. This, for example, inhibits clever regex
engines from noticing that all literals begin with the same byte (or small
number of bytes). Consider an alternation like `foo|far|fight`. It is not hard
to see that a regex engine _could_ just scan for the letter `f` as a prefilter
optimization. Here, we pick our regex such that this sort of shortcut isn't
available. For the regex engine to optimize this case, it really needs to deal
with the problem of multiple substring search.

Multiple substring search _can_ be implemented via a DFA, and perhaps in some
cases, quite quickly via a [shift DFA]. Beyond that though, multiple substring
search can be implemented by other various algorithms such as Aho-Corasick or
Rabin-Karp. (The standard Aho-Corasick formulation is an NFA, but it can also
be converted to a DFA by pre-computing all failure transitions. This winds up
with a similar result as using Thompson's construction to produce an NFA and
then powerset construction to get a DFA, but the Aho-Corasick construction
algorithm is usually quite a bit faster because it doesn't need to deal with a
full NFA.)

The problem here is that DFA speeds may or may not help you. For example, in
the case of RE2 and Rust's regex engine, they already get DFA speeds by
virtue of their lazy DFAs. Indeed, in this group, RE2 performs roughly the same
across all benchmarks. So even if you, say, build an Aho-Corasick DFA, it's not
going to help much if at all. So it makes sense to avoid it.

But Rust's regex crate achieves considerably higher throughput than RE2 on most of
the benchmarks in this group. So how is it done? Currently, this is done via
the [Teddy] algorithm, which was ported out of [Hyperscan]. It is an algorithm
that makes use of SIMD to accelerate searching for a somewhat small set of
literals. Most regex engines don't have this sort of optimization, and indeed,
it seems like Teddy is not particularly well known. Alas, regex engines that
want to move past typical DFA speeds for multiple substring search likely need
some kind of vectorized algorithm to do so. (Teddy is also used by Rust's
regex crate in the previous `literal` group of benchmarks for accelerating
case insensitive searches. Namely, it enumerates some finite set of prefixes
like `she`, `SHE`, `ShE` and so on, and then looks for matches of those as a
prefilter.)

[shift DFA]: https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725
[Teddy]: https://github.com/BurntSushi/aho-corasick/tree/4e7fa3b85dd3a3ce882896f1d4ee22b1f271f0b4/src/packed/teddy
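
To make this concrete, here is a minimal sketch of multiple substring search
using the [`aho-corasick`](https://github.com/BurntSushi/aho-corasick) crate
(version 1.x assumed), which is also where the Teddy implementation linked
above lives; whether Teddy actually kicks in internally depends on the
patterns and the target CPU:

```rust
use aho_corasick::AhoCorasick;

fn main() {
    // Literals chosen, like the benchmark's, to start with distinct
    // characters, so a single-byte prefilter can't shortcut the search.
    let patterns = ["Sherlock Holmes", "John Watson", "Irene Adler"];
    let ac = AhoCorasick::new(patterns).unwrap();
    let haystack = "It was John Watson who introduced me to Sherlock Holmes.";
    for m in ac.find_iter(haystack) {
        println!("{:?} at {}..{}", patterns[m.pattern().as_usize()], m.start(), m.end());
    }
}
```
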
[Hyperscan]: https://github.com/intel/hyperscan

| Engine | sherlock-en | sherlock-casei-en | sherlock-ru | sherlock-casei-ru | sherlock-zh |
| - | - | - | - | - | - |
| d/ldc/std-regex | 1290.3 MB/s | 1168.4 MB/s | 1560.2 MB/s | 1440.2 MB/s | 2.1 GB/s |
| dotnet/compiled | 3.6 GB/s | 924.6 MB/s | 1179.4 MB/s | 1161.1 MB/s | **27.8 GB/s** |
| dotnet/nobacktrack | 3.0 GB/s | 418.3 MB/s | 546.6 MB/s | 151.3 MB/s | 17.4 GB/s |
| go/regexp | 28.5 MB/s | 16.8 MB/s | 37.0 MB/s | 9.8 MB/s | 52.2 MB/s |
| hyperscan | 13.9 GB/s | **13.4 GB/s** | 4.6 GB/s | **4.0 GB/s** | 19.8 GB/s |
| icu | 675.3 MB/s | 115.0 MB/s | 168.9 MB/s | 107.6 MB/s | 338.8 MB/s |
| java/hotspot | 70.6 MB/s | 62.0 MB/s | 119.3 MB/s | 55.9 MB/s | 184.3 MB/s |
| javascript/v8 | 686.1 MB/s | 670.0 MB/s | 936.1 MB/s | 601.5 MB/s | 6.4 GB/s |
| pcre2 | 866.2 MB/s | 159.4 MB/s | 1726.2 KB/s | 1630.5 KB/s | 8.6 MB/s |
| pcre2/jit | 1558.2 MB/s | 649.7 MB/s | 1188.7 MB/s | 297.8 MB/s | 2.5 GB/s |
| perl | 1113.8 MB/s | 116.8 MB/s | 108.7 MB/s | 76.0 MB/s | 236.5 MB/s |
| python/re | 437.5 MB/s | 42.0 MB/s | 309.5 MB/s | 54.5 MB/s | 635.9 MB/s |
| python/regex | 298.8 MB/s | 67.3 MB/s | 287.5 MB/s | 86.6 MB/s | 929.0 MB/s |
| re2 | 927.1 MB/s | 926.6 MB/s | 936.1 MB/s | 930.3 MB/s | 966.5 MB/s |
| regress | 1512.5 MB/s | 288.7 MB/s | 223.2 MB/s | 105.6 MB/s | 260.3 MB/s |
| rust/regex | 12.7 GB/s | 3.0 GB/s | **6.6 GB/s** | 1668.2 MB/s | 12.1 GB/s |
| rust/regex/lite | 31.2 MB/s | 21.8 MB/s | 46.8 MB/s | - | 70.7 MB/s |
| rust/regexold | **15.9 GB/s** | 2.7 GB/s | 3.0 GB/s | 452.5 MB/s | 19.5 GB/s |

Show individual benchmark parameters.
**sherlock-en**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/02-literal-alternate/sherlock-en` |
| model | [`count`](MODELS.md#count) |
| regex | `````Sherlock Holmes\|John Watson\|Irene Adler\|Inspector Lestrade\|Professor Moriarty````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`opensubtitles/en-sampled.txt`](benchmarks/haystacks/opensubtitles/en-sampled.txt) |
| count(`.*`) | 714 |

**sherlock-casei-en**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/02-literal-alternate/sherlock-casei-en` |
| model | [`count`](MODELS.md#count) |
| regex | `````Sherlock Holmes\|John Watson\|Irene Adler\|Inspector Lestrade\|Professor Moriarty````` |
| case-insensitive | `true` |
| unicode | `false` |
| haystack-path | [`opensubtitles/en-sampled.txt`](benchmarks/haystacks/opensubtitles/en-sampled.txt) |
| count(`.*`) | 725 |

**sherlock-ru**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/02-literal-alternate/sherlock-ru` |
| model | [`count`](MODELS.md#count) |
| regex | `````Шерлок Холмс\|Джон Уотсон\|Ирен Адлер\|инспектор Лестрейд\|профессор Мориарти````` |
| case-insensitive | `false` |
| unicode | `true` |
| haystack-path | [`opensubtitles/ru-sampled.txt`](benchmarks/haystacks/opensubtitles/ru-sampled.txt) |
| count(`.*`) | 899 |

**sherlock-casei-ru**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/02-literal-alternate/sherlock-casei-ru` |
| model | [`count`](MODELS.md#count) |
| regex | `````Шерлок Холмс\|Джон Уотсон\|Ирен Адлер\|инспектор Лестрейд\|профессор Мориарти````` |
| case-insensitive | `true` |
| unicode | `true` |
| haystack-path | [`opensubtitles/ru-sampled.txt`](benchmarks/haystacks/opensubtitles/ru-sampled.txt) |
| count(`.*`) | 971 |

`rust/regex/lite` is not included because it doesn't support Unicode-aware
case insensitive matching.

**sherlock-zh**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/02-literal-alternate/sherlock-zh` |
| model | [`count`](MODELS.md#count) |
| regex | `````夏洛克·福尔摩斯\|约翰华生\|阿德勒\|雷斯垂德\|莫里亚蒂教授````` |
| case-insensitive | `false` |
| unicode | `true` |
| haystack-path | [`opensubtitles/zh-sampled.txt`](benchmarks/haystacks/opensubtitles/zh-sampled.txt) |
| count(`.*`) | 207 |

### date
This is a monster regex for extracting dates from unstructured text from
the [datefinder] project written in Python. The regex itself was taken from
[printing the `DATES_PATTERN`][datefinder-regex] variable in the `datefinder`
project. I then removed all names from the capture groups, unnecessary escapes
and collapsed it to a single line (because not all regex engines support
verbose mode).

The regex is more akin to a tokenizer, and the `datefinder` library attempts to
combine these tokens into timestamps.

We measure an ASCII-only version of it and a Unicode-aware version of it.
Unicode is relevant here because of case insensitivity, and because the regex
makes use of the character classes `\s` and `\d`, which are bigger when they're
Unicode aware. We also measure the compilation time of each.

The results here can be a little tricky to interpret. Namely, it looks like
backtrackers tend to do worse than automata oriented regex engines, but
`go/regexp` uses automata and is itself quite slow here. Notice, though, that
`hyperscan`, `re2` and `rust/regex` do well here. While I'm less familiar with
`hyperscan`, the explanation for `re2` and `rust/regex` is obvious once you
look at a profile: it's the lazy DFA. Both have implementations of a regex
engine that build a DFA during search time, with at most one new transition
(and one new state) being created per byte of haystack. In practice, most
transitions get reused, which means that it tends to act like a real DFA most
of the time for most regexes on most haystacks.

Compilation time of this monster regex is also all over the place. PCRE2 does
the best, and Hyperscan winds up being quite slow. Once you enable Unicode
mode, compilation time generally gets worse, and especially so for `re2` and
`rust/regex`. In particular, both compile _byte oriented_ automata, which means
the transitions are defined over bytes and not codepoints. That means large
Unicode classes like `\d` tend to balloon in size, because they get converted
into UTF-8 automata.

[datefinder]: https://github.com/akoumjian/datefinder/tree/master
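
One rough way to observe the Unicode compile-time effect with the regex crate
is to time compilation of `\d` with Unicode mode on and off. This is a
back-of-the-envelope sketch, not rebar's measurement model, so the absolute
numbers will differ from the table below:

```rust
use std::time::Instant;

fn time_compile(pattern: &str) {
    let start = Instant::now();
    let _re = regex::Regex::new(pattern).unwrap();
    println!("{pattern} compiled in {:?}", start.elapsed());
}

fn main() {
    // ASCII-only \d is a tiny byte class...
    time_compile(r"(?-u)\d{5}");
    // ...while Unicode-aware \d covers every decimal digit codepoint and
    // must be expanded into a (much larger) UTF-8 byte automaton.
    time_compile(r"\d{5}");
}
```
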
[datefinder-regex]: https://github.com/akoumjian/datefinder/blob/5376ece0a522c44762b1ab656fc80737b427ed16/datefinder/constants.py#L112-L124

| Engine | ascii | unicode | compile-ascii | compile-unicode |
| - | - | - | - | - |
| d/ldc/std-regex | 727.5 KB/s | 648.1 KB/s | - | - |
| dotnet/compiled | 1162.4 KB/s | 1167.8 KB/s | - | 1.44ms |
| go/regexp | 273.4 KB/s | - | 1.26ms | - |
| hyperscan | 104.2 MB/s | - | 642.83ms | - |
| icu | 318.0 KB/s | 315.6 KB/s | 451.99us | 451.34us |
| java/hotspot | 2.0 MB/s | 1671.2 KB/s | - | - |
| javascript/v8 | 34.6 MB/s | 31.6 MB/s | - | - |
| pcre2 | 1123.4 KB/s | 176.2 KB/s | **114.15us** | **132.53us** |
| pcre2/jit | 21.1 MB/s | 13.0 MB/s | 680.46us | 941.76us |
| perl | 2.7 MB/s | - | - | - |
| python/re | 1106.3 KB/s | 859.5 KB/s | 3.78ms | 3.96ms |
| python/regex | 1140.5 KB/s | 1023.6 KB/s | 9.80ms | 29.73ms |
| re2 | 80.4 MB/s | - | 1.17ms | - |
| regress | 1894.0 KB/s | 1883.3 KB/s | 1.08ms | 1.08ms |
| rust/regex | **158.2 MB/s** | **156.3 MB/s** | 1.37ms | 4.88ms |
| rust/regex/lite | 971.2 KB/s | - | 355.71us | - |
| rust/regexold | 148.2 MB/s | 420.2 KB/s | 1.55ms | 5.22ms |

Show individual benchmark parameters.
**ascii**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/03-date/ascii` |
| model | [`count-spans`](MODELS.md#count-spans) |
| regex-path | [`wild/date.txt`](benchmarks/regexes/wild/date.txt) |
| case-insensitive | `true` |
| unicode | `false` |
| haystack-path | [`rust-src-tools-3b0d4813.txt`](benchmarks/haystacks/rust-src-tools-3b0d4813.txt) |
| count(`d/.*/std-regex`) | 111841 |
| count(`dotnet.*`) | 111825 |
| count(`hyperscan`) | 547662 |
| count(`icu`) | 111825 |
| count(`javascript/v8`) | 111825 |
| count(`regress`) | 111841 |
| count(`.*`) | 111817 |

As with many other benchmarks, Hyperscan reports all matches, even ones that
are overlapping. This particular regex is too big to analyze closely, but it
seems plausible one could still use it (possibly with a slightly tweaked regex)
for this task.

**unicode**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/03-date/unicode` |
| model | [`count-spans`](MODELS.md#count-spans) |
| regex-path | [`wild/date.txt`](benchmarks/regexes/wild/date.txt) |
| case-insensitive | `true` |
| unicode | `true` |
| haystack-path | [`rust-src-tools-3b0d4813.txt`](benchmarks/haystacks/rust-src-tools-3b0d4813.txt) |
| count(`dotnet/compiled|icu|java/hotspot|javascript/v8`) | 111825 |
| count(`.*`) | 111841 |

ECMAScript engines such as `d/.*/std-regex`, `javascript/v8` and `regress`
are included here despite their `\d` not being Unicode-aware (as required by
ECMAScript). Notably, their `\s` _is_ Unicode aware. (`\w` is too, but it's not
used in this regex.) In this particular haystack, `\d` being ASCII-only doesn't
impact the match count.

However, neither `re2` nor `go/regexp` is included here because neither `\d`
nor `\s` are Unicode-aware, and the `\s` being ASCII-only does impact the match
count.

`hyperscan` is excluded here because the pattern results in a "too large"
compilation error. As far as I know, Hyperscan doesn't expose any knobs for
increasing this limit.

`dotnet/compiled` gets a different count here, but it's not clear why.

`perl` is left out of this benchmark because it times out.

`rust/regex/lite` is excluded because it doesn't support Unicode-aware `\w`,
`\d` or `\s`.

**compile-ascii**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/03-date/compile-ascii` |
| model | [`compile`](MODELS.md#compile) |
| regex-path | [`wild/date.txt`](benchmarks/regexes/wild/date.txt) |
| case-insensitive | `true` |
| unicode | `false` |
| haystack | `2010-03-14` |
| count(`hyperscan`) | 10 |
| count(`.*`) | 5 |

Notice that ECMAScript engines such as `d/.*/std-regex` and `regress` are
included in this ASCII benchmark, because in `compile-unicode` we specifically
test that the `\d` used in this regex is Unicode-aware. `regress` does not make
`\d` Unicode-aware, so it gets thrown into the ASCII group. But do note that it
does appear to have some Unicode awareness.

**compile-unicode**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/03-date/compile-unicode` |
| model | [`compile`](MODELS.md#compile) |
| regex-path | [`wild/date.txt`](benchmarks/regexes/wild/date.txt) |
| case-insensitive | `true` |
| unicode | `true` |
| haystack | `۲۰۱۰-۰۳-۱۴` |
| count(`javascript/v8|regress`) | 2 |
| count(`.*`) | 5 |

We use "extended Arabic-Indic digits" to represent the same date, `2010-03-14`,
that we use for verification in `compile-ascii`. These digits are part of `\d`
when it is Unicode aware.

### ruff-noqa
The regex benchmarked here comes from the [Ruff project][ruff], which is a
Python linter written in Rust. The project uses many regexes, but [we pluck
one out in particular][noqa] that is likely to be run more frequently than the
others:

```
(\s*)((?:# [Nn][Oo][Qq][Aa])(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?)
```

This is a regex that looks for `# noqa` annotations on each line. The `noqa`
annotation generally causes the linter to ignore those lines with respect to
warnings it emits. The regex also tries to extract annotations following the
`noqa` that permit ignoring only specific rules in the linter.

We also remove the `i` inline flag and instead use `[Nn][Oo][Qq][Aa]` to search
for `noqa` case insensitively. We do this so that this benchmark can run on
regex engines that don't support inline flags, such as those that follow the
ECMAScript specification (at time of writing). This includes `d/.*/std-regex`,
`javascript/v8` and `regress`.

This benchmark has a few interesting characteristics worth pointing out:

* It is line oriented, which means the haystacks it searches are likely to be
small. This in turn means that the overhead of the regex engine is likely to
matter more than in throughput oriented benchmarks.
* On this particular haystack (the CPython source code), the number of matches
is quite small. Therefore, it is quite beneficial here to be able to have a
fast path to say "there is no match" without doing any extra work. While the
number of matches here is perhaps uncharacteristically small for a Python
project, you would generally expect _most_ lines to not have `# noqa` in them,
and so the presumption of a fast rejection is probably a decent assumption for
this particular regex.
* Ruff uses capturing groups to pick out parts of the match, so when a match
is found, the regex engine needs to report additional information beyond just
the overall match spans. The spans of each matching capture group also need
to be reported.
* There are no prefix (or suffix) literals in the regex to enable any
straightforward prefilter optimizations.

With respect to the point about no prefix or suffix literals, we also include
a tweaked version of the regex that removes the leading `(\s*)`:

```
(?:# [Nn][Oo][Qq][Aa])(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?
```

In this case, the regex now starts with a literal, albeit one that is asked
to match case insensitively. We can actually see pretty clearly the impact
the tweaked version has on the speed for each regex engine. `pcre2/jit`, for
example, improves its throughput from around 500 MB/s to 1.5 GB/s. `go/regexp`
has an even more dramatic (relatively speaking) improvement.

`rust/regex` is a little different in that it's quite fast in both cases.
The key optimization that applies for `rust/regex` is the "reverse inner"
optimization. Even in the original regex, `rust/regex` will pluck out the `#
noqa` literal and search for it case insensitively. When a candidate is found,
it then searches for `(\s*)` in reverse to find the start position, and then
finally does a standard forward search from that point to find the end
position.

[ruff]: https://github.com/charliermarsh/ruff
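
To give a flavor of that strategy, here is a hand-rolled sketch of the
candidate-then-confirm idea using the `aho-corasick` and `regex` crates. This
mimics the effect by expanding each candidate to its enclosing line rather
than reproducing `rust/regex`'s internal reverse search:

```rust
use aho_corasick::AhoCorasick;
use regex::Regex;

fn main() {
    // Prefilter: find the required "# noqa" literal case insensitively.
    let prefilter = AhoCorasick::builder()
        .ascii_case_insensitive(true)
        .build(["# noqa"])
        .unwrap();
    // Full regex, only run near candidates reported by the prefilter.
    let full = Regex::new(
        r"(\s*)((?:# [Nn][Oo][Qq][Aa])(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?)",
    ).unwrap();

    let haystack = "x = 1\nimport os  # NOQA: E402\ny = 2\n";
    for cand in prefilter.find_iter(haystack) {
        // Expand the candidate to its enclosing line and confirm there.
        let start = haystack[..cand.start()].rfind('\n').map_or(0, |i| i + 1);
        let end = haystack[cand.end()..]
            .find('\n')
            .map_or(haystack.len(), |i| cand.end() + i);
        if let Some(caps) = full.captures(&haystack[start..end]) {
            println!("noqa codes: {:?}", caps.get(3).map(|m| m.as_str()));
        }
    }
}
```
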
[noqa]: https://github.com/charliermarsh/ruff/blob/5c987874c48e6ed5d0ef7f9a09c4cb1940bd4018/crates/ruff/src/noqa.rs#L22

| Engine | real | tweaked | compile-real |
| - | - | - | - |
| d/ldc/std-regex | 64.6 MB/s | 114.4 MB/s | - |
| dotnet/compiled | 181.3 MB/s | 837.6 MB/s | 45.20us |
| dotnet/nobacktrack | 307.7 MB/s | 678.4 MB/s | 367.20us |
| go/regexp | 34.4 MB/s | 715.5 MB/s | 2.99us |
| icu | 29.3 MB/s | 336.8 MB/s | 8.01us |
| java/hotspot | 38.4 MB/s | 232.2 MB/s | - |
| javascript/v8 | 129.6 MB/s | 283.5 MB/s | - |
| pcre2 | 123.4 MB/s | 1343.5 MB/s | **1.15us** |
| pcre2/jit | 569.6 MB/s | 1469.6 MB/s | 6.78us |
| perl | 101.3 MB/s | 129.2 MB/s | - |
| python/re | 29.8 MB/s | 117.0 MB/s | 67.69us |
| python/regex | 77.6 MB/s | 100.5 MB/s | 159.49us |
| re2 | 552.2 MB/s | 962.7 MB/s | 7.10us |
| regress | 39.8 MB/s | 600.1 MB/s | 3.54us |
| rust/regex | **1541.2 MB/s** | **1475.9 MB/s** | 56.66us |
| rust/regex/lite | 30.1 MB/s | 59.4 MB/s | 2.22us |
| rust/regexold | 173.0 MB/s | 1118.2 MB/s | 40.84us |

Show individual benchmark parameters.
**real**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/04-ruff-noqa/real` |
| model | [`grep-captures`](MODELS.md#grep-captures) |
| regex | `````(\s*)((?:# [Nn][Oo][Qq][Aa])(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?)````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`wild/cpython-226484e4.py`](benchmarks/haystacks/wild/cpython-226484e4.py) |
| count(`d/.*/std-regex`) | 80 |
| count(`.*`) | 84 |

**tweaked**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/04-ruff-noqa/tweaked` |
| model | [`grep-captures`](MODELS.md#grep-captures) |
| regex | `````(?:# [Nn][Oo][Qq][Aa])(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`wild/cpython-226484e4.py`](benchmarks/haystacks/wild/cpython-226484e4.py) |
| count(`d/.*/std-regex`) | 40 |
| count(`.*`) | 44 |

**compile-real**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/04-ruff-noqa/compile-real` |
| model | [`compile`](MODELS.md#compile) |
| regex | `````(\s*)((?:# [Nn][Oo][Qq][Aa])(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?)````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack | `# noqa` |
| count(`.*`) | 1 |

### lexer-veryl
This group benchmarks a "lexer" where it combines a whole bunch of different
patterns that identify tokens in a language into a single regex. It then uses
capture groups to determine which branch of the alternation actually matched,
and thus, which token matched. We also benchmark a variant of this that asks
the regex engine to search for each pattern individually (most regex engines
don't support this mode).

This is used by the [Veryl] project by way of the [Parol] parser generator. The
regex was [extracted by the Parol maintainers upon my request][parol-issue].

We use this regex to represent the "lexing" use case, where sometimes folks
will build a pretty big regex with a bunch of small regexes for identifying
tokens. Usually the idea is that the lexer matches literally everything in the
haystack (indeed, the last branch in this regex is a `.` and the first is any
newline), and thus these sorts of regexes tend to be quite latency sensitive.
Namely, it really matters just how much overhead is involved in reporting
matches. This is likely one of the reasons why most regex engines are overall
pretty slow here.

The other aspect of this that's quite difficult is the sheer number of
capturing groups. There are several dozen of them, which means regex engines
have to keep track of a fair bit of state to handle them.

You might think this would be bad for backtrackers and good for automata
engines, since automata engines are *supposed* to be able to handle large
alternations better than backtrackers. But that's not the case here. Even
Python's regex engine (a backtracker), for example, beats RE2 (an automata
engine). My hypothesis for why this happens is latency. Automata engines tend
to have multiple engines
internally and therefore tend to have higher latency, and sometimes multiple
engines run to service one search. Backtrackers tend to have one engine that
handles everything. But still, shouldn't the huge alternation be disastrous for
the backtracker? Perhaps, unless many of the matches occur in an early branch,
which is likely the case here. Namely, the second alternation matches a ` `
(single ASCII space), which is probably the most frequently occurring byte in
the haystack. An automata engine that doesn't use a DFA (which might be the
case here, because the regex is so big), will wind up spending a lot of time
keeping track of all branches of the alternation, even if it doesn't need to
explore all of them. In contrast, a backtracker will try one after the other,
and if most cases match an early branch, the backtracker is likely to take less
overall time.

Most regex engines are stuck in the 1 MB/s (or less) range. The regex crate and
PCRE2's JIT get up to about 10 MB/s, with PCRE2 edging out the regex crate.

Note that the regex was lightly modified from the original to increase
portability across different regex engines. For example, the `[\s--\r\n]` class
was changed to `[\t\v\f ]`.

As for the second benchmark, `multiple`, it uses the same patterns from each
alternation in the `single` benchmark, but treats each one as a distinct
pattern. Doing this requires explicit support for searching multiple regex
patterns. (RE2's and Rust's regex crate "regex set" functionality is not enough
for this, as it only reports which patterns match a haystack, and not where
they match. That's partially why the `rust/regex` engine in this barometer
actually just uses the lower level `meta::Regex` APIs from the `regex-automata`
crate.)

In the `multiple` case, `rust/regex` does very well, and the key reason is
that capture groups are no longer needed to determine which token
matched. Namely, now we can simply use a pattern ID from the match to determine
which "branch" in the original regex was taken. We no longer need to ask for or
inspect capture groups. This gives a critical benefit to automata engines that
support searching for multiple patterns, because it no longer requires them to
use slower engines for resolving capturing groups.

[Veryl]: https://github.com/dalance/veryl
[Parol]: https://github.com/jsinger67/parol
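
For a sense of what the multi-pattern API looks like, here is a toy sketch
using `regex-automata`'s `meta::Regex` (illustrative patterns, not the actual
Veryl lexer):

```rust
use regex_automata::meta::Regex;

fn main() {
    // Each token kind is its own pattern. The pattern ID of a match tells
    // us which token matched, with no capture groups required.
    let re = Regex::new_many(&[
        r"[a-zA-Z_][a-zA-Z0-9_]*", // 0: identifier
        r"[0-9]+",                 // 1: number
        r"[ \t]+",                 // 2: whitespace
    ]).unwrap();
    let names = ["ident", "number", "space"];
    for m in re.find_iter("foo 42 bar") {
        println!("{}: {:?}", names[m.pattern().as_usize()], m.span());
    }
}
```
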
[parol-issue]: https://github.com/jsinger67/parol/issues/56

| Engine | single | compile-single | multi |
| - | - | - | - |
| dotnet/compiled | 1792.0 KB/s | 237.80us | - |
| go/regexp | 358.3 KB/s | 62.06us | - |
| hyperscan | - | - | 17.8 MB/s |
| icu | 934.7 KB/s | 59.23us | - |
| java/hotspot | 6.1 MB/s | - | - |
| javascript/v8 | 7.1 MB/s | - | - |
| pcre2 | 2.7 MB/s | **24.66us** | - |
| pcre2/jit | **12.4 MB/s** | 124.57us | - |
| perl | 1111.1 KB/s | - | - |
| python/re | 1850.4 KB/s | 910.50us | - |
| python/regex | 1662.4 KB/s | 2.39ms | - |
| re2 | 1185.4 KB/s | 148.68us | - |
| regress | 8.5 MB/s | - | - |
| rust/regex | 9.2 MB/s | 277.80us | **88.7 MB/s** |
| rust/regex/lite | 492.5 KB/s | 46.09us | - |
| rust/regexold | 248.2 KB/s | 220.44us | - |

Show individual benchmark parameters.
**single**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/05-lexer-veryl/single` |
| model | [`count-captures`](MODELS.md#count-captures) |
| regex-path | [`wild/parol-veryl.txt`](benchmarks/regexes/wild/parol-veryl.txt) |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`wild/parol-veryl.vl`](benchmarks/haystacks/wild/parol-veryl.vl) |
| count(`.*`) | 124800 |

`d/.*/std-regex` is excluded because its match count is 5,491,200. This
suggests it is either buggy or something funny is going on.

`dotnet/nobacktrack` is excluded because it gives a "too big" error.

`hyperscan` is excluded because it doesn't support the `count-captures`
benchmark model. It is included in the `multiple` benchmark below, which
doesn't require capture groups.

**compile-single**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/05-lexer-veryl/compile-single` |
| model | [`compile`](MODELS.md#compile) |
| regex-path | [`wild/parol-veryl.txt`](benchmarks/regexes/wild/parol-veryl.txt) |
| case-insensitive | `false` |
| unicode | `false` |
| haystack | `abcdefg_foobar` |
| count(`.*`) | 1 |

This measures how long it takes to compile a moderately large lexer.

`d/.*/std-regex` is excluded because its match count is 5,491,200. This
suggests it is either buggy or something funny is going on.

`dotnet/nobacktrack` is excluded because it gives a "too big" error.

`hyperscan` is excluded because it doesn't support the `count-captures`
benchmark model. It is included in the `multiple` benchmark below, which
doesn't require capture groups.

**multi**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/05-lexer-veryl/multi` |
| model | [`count-spans`](MODELS.md#count-spans) |
| regex-path | [`wild/parol-veryl.txt`](benchmarks/regexes/wild/parol-veryl.txt) |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`wild/parol-veryl.vl`](benchmarks/haystacks/wild/parol-veryl.vl) |
| count(`hyperscan`) | 669500 |
| count(`.*`) | 150600 |

Hyperscan reports everything that matches, including overlapping matches,
and that's why its count is higher. It is likely still serviceable for
this use case, but might in practice require changing the regex to suit
Hyperscan's match semantics. Still, it's a decent barometer to include it here,
particularly because of its multi-regex support.

Most regex engines do not support searching for multiple patterns and finding
the corresponding match offsets, which is why this benchmark has very few
entries.

### cloud-flare-redos
This benchmark uses a regex that helped cause an [outage at
Cloudflare][cloudflare-blog]. This class of vulnerability is typically called a
"regular expression denial of service," or "ReDoS" for short. It doesn't always
require a malicious actor to trigger. Since it can be difficult to reason about
the worst case performance of a regex when using an unbounded backtracking
implementation, it might happen entirely accidentally on valid inputs.The particular regex that contributed to the outage was:
```
(?:(?:"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|`|\-|\+)+[)]*;?((?:\s|-|~|!|\{\}|\|\||\+)*.*(?:.*=.*)))
```

As discussed in Cloudflare's post mortem, the specific problematic portion of
the regex is:

```
.*(?:.*=.*)
```

Or more simply:

```
.*.*=.*;
```

We benchmark the original regex along with the simplified variant. We also
split the simplified variant into one with a short haystack (about 100 bytes)
and one with a long haystack (about 10,000 bytes). The benchmark results for
the original and simplified short variant should be roughly similar, but the
difference between the short and long variant is where things get interesting.
The automata based engines generally maintain a similar throughput for both the
short and long benchmarks, but the backtrackers slow way down. This is because
the backtracking algorithm for this specific regex and haystack doesn't scale
linearly with increases in the size of the haystack.

The purpose of this benchmark is to show a real world scenario where the use of
a backtracking engine can bite you in production if you aren't careful.

We include Hyperscan in this benchmark, although it is questionable to do so.
Hyperscan reports many overlapping matches from the regex used by Cloudflare
because of the trailing `.*`, so it is probably not a great comparison.
In particular, this regex was originally used in a firewall, so it seems
likely that it would be used in an "is a match" or "not a match" scenario, but
our benchmark here reproduces the analysis in the appendix of Cloudflare's
post mortem. The real utility in including Hyperscan here is that it
demonstrates that it is not a backtracking engine. While its throughput is not
as high as some other engines, it remains roughly invariant with respect to
haystack length, just like other automata oriented engines.

Note that `rust/regex` has very high throughput here because the regex is
small enough to get compiled into a full DFA. The compilation process also
"accelerates" some states, particularly the final `.*`. This acceleration works
by noticing that almost all of the state's transitions loop back on itself, and
only a small number transition to another state. The final `.*` for example
only leaves its state if it sees the end of the haystack or a `\n`. So the DFA
will actually run `memchr` on `\n` and skip right to the end of the haystack.

[cloudflare-blog]: https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/
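
As a quick way to see the automata-engine behavior for yourself, the sketch
below times the simplified regex on growing haystacks with the regex crate;
search time should grow roughly linearly with haystack length. (Single-shot
timings like this are noisy; rebar's actual model takes many samples.) Running
the same inputs through a backtracking engine is what produces the
super-linear blowup described above:

```rust
use std::time::Instant;

fn main() {
    let re = regex::Regex::new(r".*.*=.*").unwrap();
    for size in [100, 1_000, 10_000, 100_000] {
        // "x=" followed by a long run of "x"s, like the benchmark haystacks.
        let haystack = format!("x={}", "x".repeat(size));
        let start = Instant::now();
        let matched = re.is_match(&haystack);
        println!("len {:>6}: matched={matched} in {:?}", haystack.len(), start.elapsed());
    }
}
```
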
| Engine | original | simplified-short | simplified-long |
| - | - | - | - |
| d/ldc/std-regex | 17.0 MB/s | 31.4 MB/s | 34.8 MB/s |
| dotnet/compiled | 170.1 MB/s | **95.0 GB/s** | 18.6 GB/s |
| dotnet/nobacktrack | 255.1 MB/s | 243.2 MB/s | 302.8 MB/s |
| go/regexp | 43.8 MB/s | 46.5 MB/s | 51.3 MB/s |
| hyperscan | 85.0 MB/s | 80.4 MB/s | 84.7 MB/s |
| icu | 3.4 MB/s | 3.6 MB/s | 43.0 KB/s |
| java/hotspot | 9.2 MB/s | 6.2 MB/s | 91.0 KB/s |
| javascript/v8 | 19.4 MB/s | 18.9 MB/s | 335.6 KB/s |
| pcre2 | 2.8 MB/s | 2.7 MB/s | 30.1 KB/s |
| pcre2/jit | 49.8 MB/s | 42.5 MB/s | 671.2 KB/s |
| perl | 10.3 MB/s | 9.9 MB/s | 176.4 KB/s |
| python/re | 22.3 MB/s | 21.9 MB/s | 383.3 KB/s |
| python/regex | 6.3 MB/s | 6.2 MB/s | 91.9 KB/s |
| re2 | 349.5 MB/s | 327.5 MB/s | 493.7 MB/s |
| regress | 8.0 MB/s | 7.7 MB/s | 96.7 KB/s |
| rust/regex | **566.9 MB/s** | 1594.7 MB/s | **77.6 GB/s** |
| rust/regex/lite | 17.4 MB/s | 20.5 MB/s | 21.0 MB/s |
| rust/regexold | 443.7 MB/s | 481.6 MB/s | 618.1 MB/s |

Show individual benchmark parameters.
**original**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/06-cloud-flare-redos/original` |
| model | [`count-spans`](MODELS.md#count-spans) |
| regex | `````(?:(?:"\|'\|\]\|\}\|\\\|\d\|(?:nan\|infinity\|true\|false\|null\|undefined\|symbol\|math)\|`\|-\|\+)+[)]*;?((?:\s\|-\|~\|!\|\{\}\|\\|\\|\|\+)*.*(?:.*=.*)))````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack | `math x=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [.. snip ..]` |
| count(`hyperscan`) | 5757 |
| count(`.*`) | 107 |

**simplified-short**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/06-cloud-flare-redos/simplified-short` |
| model | [`count-spans`](MODELS.md#count-spans) |
| regex | `````.*.*=.*````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack | `x=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [.. snip ..]` |
| count(`hyperscan`) | 5252 |
| count(`.*`) | 102 |

**simplified-long**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/06-cloud-flare-redos/simplified-long` |
| model | [`count-spans`](MODELS.md#count-spans) |
| regex | `````.*.*=.*````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`cloud-flare-redos.txt`](benchmarks/haystacks/cloud-flare-redos.txt) |
| count(`hyperscan`) | 50004999 |
| count(`.*`) | 10000 |

### unicode-character-data
This regex parses data from `UnicodeData.txt`, which is part of the [Unicode
Character Database][ucd]. This regex was [extracted from the `ucd-parse`
crate][ucd-parse-regex], which is part of the [ucd-generate] project.

This benchmark works by iterating over every line in the haystack and then
running the regex on each line. Every line matches the regex, so regex engines
that attempt to do some extra work to reject non-matches quickly will get
penalized. For example, `rust/regex` looks for a semi-colon first via its
"reverse inner" optimization, since a semi-colon is a required part of the
regex. But this optimization is just extra work here. Indeed, disabling it will
improve the throughput of `rust/regex` on this benchmark.

`pcre2/jit` does remarkably well here, and these types of regexes are one of
the many things that `pcre2/jit` does quickly compared to most other regex
engines.

We also include compilation time for this regex, where PCRE2 again does quite
well.

[ucd]: https://unicode.org/ucd/
[ucd-generate]: https://github.com/BurntSushi/ucd-generate
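
The `grep-captures` model used here boils down to a loop like the following
sketch, shown with a toy stand-in regex that grabs just the first two fields
of a `UnicodeData.txt` line (the real `ucd-parse` regex linked above is much
bigger):

```rust
use regex::Regex;

fn main() {
    // Toy stand-in for the real ucd-parse regex: codepoint and name only.
    let re = Regex::new(r"^([0-9A-F]+);([^;]*);").unwrap();
    let haystack = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;\n\
                    0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;\n";
    let mut count = 0;
    for line in haystack.lines() {
        // Every line matches, so any work a regex engine does to reject
        // non-matching haystacks quickly is wasted in this benchmark.
        if let Some(caps) = re.captures(line) {
            println!("U+{} is {}", &caps[1], &caps[2]);
            count += 1;
        }
    }
    println!("{count} lines matched");
}
```
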
[ucd-parse-regex]: https://github.com/BurntSushi/ucd-generate/blob/47ae5cbe739d46d3d2eed75e1326d9814d940c3f/ucd-parse/src/unicode_data.rs#L103-L124

| Engine | parse-line | compile |
| - | - | - |
| dotnet/compiled | 121.8 MB/s | 38.20us |
| dotnet/nobacktrack | 31.6 MB/s | 125.80us |
| go/regexp | 79.0 MB/s | 11.54us |
| icu | 139.0 MB/s | 13.98us |
| java/hotspot | 208.1 MB/s | - |
| javascript/v8 | 243.3 MB/s | - |
| pcre2 | 201.2 MB/s | **2.12us** |
| pcre2/jit | **699.3 MB/s** | 12.13us |
| perl | 23.2 MB/s | - |
| python/re | 52.0 MB/s | 102.51us |
| python/regex | 36.2 MB/s | 266.36us |
| re2 | 101.0 MB/s | 14.46us |
| regress | 207.9 MB/s | 6.40us |
| rust/regex | 362.8 MB/s | 27.33us |
| rust/regex/lite | 30.3 MB/s | 4.26us |
| rust/regexold | 90.8 MB/s | 17.46us |

Show individual benchmark parameters.
**parse-line**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/07-unicode-character-data/parse-line` |
| model | [`grep-captures`](MODELS.md#grep-captures) |
| regex-path | [`wild/ucd-parse.txt`](benchmarks/regexes/wild/ucd-parse.txt) |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`wild/UnicodeData-15.0.0.txt`](benchmarks/haystacks/wild/UnicodeData-15.0.0.txt) |
| count(`.*`) | 558784 |

`d/.*/std-regex` is omitted because its match count, `523860`, differs from
everything else. It's not clear whether it has a bug or not.

**compile**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/07-unicode-character-data/compile` |
| model | [`compile`](MODELS.md#compile) |
| regex-path | [`wild/ucd-parse.txt`](benchmarks/regexes/wild/ucd-parse.txt) |
| case-insensitive | `false` |
| unicode | `false` |
| haystack | `249D;PARENTHESIZED LATIN SMALL LETTER B;So;0;L; 0028 [.. snip ..]` |
| count(`.*`) | 1 |

`d/.*/std-regex` is omitted because its match count in the `parse-line`
benchmark, `523860`, differs from everything else. It's not clear whether it
has a bug or not.

### words
This benchmark measures how long it takes for a regex engine to find words in
a haystack. We compare one regex that finds all words, `\b\w+\b` and another
regex that only looks for longer words, `\b\w{12,}\b`. We also compare ASCII
regexes on English text with Unicode regexes on Russian text.

The split between finding all words and finding only long words tends to
highlight the overhead of matching in each regex engine. Regex engines that are
quicker to get in and out of their match routine do better at finding all words
than regex engines that have higher overhead. For example, `regress` is faster
than `rust/regex` on `all-english`, but substantially slower than `rust/regex`
on `long-english`. This is likely because `rust/regex` is doing more work per
search call than `regress`, which is in part rooted in the optimizations it
performs to gain higher throughput.

Otherwise, `pcre2/jit` does quite well here across the board, but especially on
the Unicode variants. When comparing it against `rust/regex` for example, it
is substantially faster. In the case of `rust/regex`, its faster DFA oriented
engines cannot handle the Unicode aware `\b` on non-ASCII haystacks, and this
causes `rust/regex` to use a slower internal engine. It's so slow in fact
that `python/re` and `python/regex` are both faster than `rust/regex` for the
Unicode benchmarks. For the ASCII `long-english` benchmark, `rust/regex` and
`re2` both do well because most of the time is spent in its lazy DFA, which has
pretty good throughput performance when compared to a pure backtracker.

Note that several regex engines can't be used in the Unicode variants because
either they don't support a Unicode aware `\w` or because they don't support a
Unicode aware `\b` (or both).

| Engine | all-english | all-russian | long-english | long-russian |
| - | - | - | - | - |
| d/ldc/std-regex | 47.9 MB/s | 72.1 MB/s | **1570.6 MB/s** | 102.8 MB/s |
| dotnet/compiled | 139.3 MB/s | 217.8 MB/s | 120.8 MB/s | 160.3 MB/s |
| dotnet/nobacktrack | 45.0 MB/s | 62.2 MB/s | 160.9 MB/s | 160.9 MB/s |
| go/regexp | 19.9 MB/s | - | 48.9 MB/s | - |
| hyperscan | 157.7 MB/s | - | 439.4 MB/s | - |
| icu | 81.2 MB/s | 108.4 MB/s | 41.9 MB/s | 56.0 MB/s |
| java/hotspot | 72.1 MB/s | 143.9 MB/s | 68.1 MB/s | 112.6 MB/s |
| javascript/v8 | 160.3 MB/s | - | 196.3 MB/s | - |
| pcre2 | 98.1 MB/s | 134.0 KB/s | 70.1 MB/s | 6.4 MB/s |
| pcre2/jit | **191.1 MB/s** | **228.6 MB/s** | 245.6 MB/s | **196.0 MB/s** |
| perl | 14.1 MB/s | 791.6 KB/s | 108.5 MB/s | 29.8 MB/s |
| python/re | 38.8 MB/s | 47.1 MB/s | 121.8 MB/s | 128.0 MB/s |
| python/regex | 22.7 MB/s | 44.1 MB/s | 33.7 MB/s | 105.5 MB/s |
| re2 | 66.8 MB/s | - | 925.0 MB/s | - |
| regress | 158.2 MB/s | - | 146.3 MB/s | - |
| rust/regex | 118.5 MB/s | 14.1 MB/s | 802.3 MB/s | 25.1 MB/s |
| rust/regex/lite | 31.3 MB/s | - | 44.7 MB/s | - |
| rust/regexold | 124.9 MB/s | 6.9 MB/s | 805.7 MB/s | 27.9 MB/s |

Show individual benchmark parameters.
**all-english**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/08-words/all-english` |
| model | [`count-spans`](MODELS.md#count-spans) |
| regex | `````\b[0-9A-Za-z_]+\b````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`opensubtitles/en-sampled.txt`](benchmarks/haystacks/opensubtitles/en-sampled.txt) |
| count(`d/.*/std-regex`) | 56601 |
| count(`dotnet/compiled`) | 56601 |
| count(`dotnet/nobacktrack`) | 56601 |
| count(`icu`) | 56601 |
| count(`.*`) | 56691 |

We specifically write out `[0-9A-Za-z_]` instead of using `\w` because some
regex engines, such as the one found in .NET, make `\w` Unicode aware and there
doesn't appear to be any easy way of disabling it.

Also, the .NET engine makes `\b` Unicode-aware, which also appears impossible
to disable. To account for that, we permit a different count. The same goes for
D's std.regex here.

**all-russian**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/08-words/all-russian` |
| model | [`count-spans`](MODELS.md#count-spans) |
| regex | `````\b\w+\b````` |
| case-insensitive | `false` |
| unicode | `true` |
| haystack-path | [`opensubtitles/ru-sampled.txt`](benchmarks/haystacks/opensubtitles/ru-sampled.txt) |
| count(`dotnet.*`) | 53960 |
| count(`icu`) | 53960 |
| count(`java.*`) | 53960 |
| count(`perl`) | 53960 |
| count(`.*`) | 107391 |

`rust/regex/lite`, `regress`, `re2` and `go/regexp` are excluded because `\w`
is not Unicode aware. `hyperscan` is excluded because it doesn't support a
Unicode aware `\b`.

For `dotnet/compiled`, since the length of matching spans is in the number of
UTF-16 code units, its expected count is smaller.

For `perl`, it has the same count as `dotnet/compiled`, but only because it
counts total encoded codepoints. Since every match span in this benchmark
seemingly corresponds to codepoints in the basic multi-lingual plane, it
follows that the number of UTF-16 code units is equivalent to the number of
codepoints.

**long-english**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/08-words/long-english` |
| model | [`count-spans`](MODELS.md#count-spans) |
| regex | `````\b[0-9A-Za-z_]{12,}\b````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`opensubtitles/en-sampled.txt`](benchmarks/haystacks/opensubtitles/en-sampled.txt) |
| count(`.*`) | 839 |

We specifically write out `[0-9A-Za-z_]` instead of using `\w` because some
regex engines, such as the one found in .NET, make `\w` Unicode aware and there
doesn't appear to be any easy way of disabling it.

Also, the fact that `\b` is Unicode-aware in .NET does not seem to impact the
match counts in this benchmark.

**long-russian**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/08-words/long-russian` |
| model | [`count-spans`](MODELS.md#count-spans) |
| regex | `````\b\w{12,}\b````` |
| case-insensitive | `false` |
| unicode | `true` |
| haystack-path | [`opensubtitles/ru-sampled.txt`](benchmarks/haystacks/opensubtitles/ru-sampled.txt) |
| count(`dotnet.*`) | 2747 |
| count(`icu`) | 2747 |
| count(`java.*`) | 2747 |
| count(`perl`) | 2747 |
| count(`.*`) | 5481 |

`rust/regex/lite`, `regress`, `re2` and `go/regexp` are excluded because `\w`
is not Unicode aware. `hyperscan` is excluded because it doesn't support a
Unicode aware `\b`.

For `dotnet/compiled`, since the length of matching spans is in the number of
UTF-16 code units, its expected count is smaller.

`perl` has the same count as `dotnet/compiled`, but only because it
counts total encoded codepoints. Since every match span in this benchmark
seemingly corresponds to codepoints in the basic multi-lingual plane, it
follows that the number of UTF-16 code units is equivalent to the number of
codepoints.

### aws-keys
This [measures a regex][pypi-aws-secrets-regex] for [detecting AWS keys in
source code][aws-key-blog]. In particular, to reduce
false positives, it looks for both an access key and a secret key within a few
lines of one another.

We also measure a "quick" version of the regex that is used to find possible
candidates by searching for things that look like an AWS access key.

The measurements here demonstrate why the [pypi-aws-secrets] project splits
this task into two pieces. First it uses the "quick" version to identify
candidates, and then it uses the "full" version to lower the false positive
rate of the "quick" version. The "quick" version of the regex runs around
an order of magnitude faster than the "full" version across the board. To
understand why, let's look at the "quick" regex:

```
((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16}))
```

Given this regex, every match starts with one of `ASIA`, `AKIA`, `AROA` or
`AIDA`. This makes it quite amenable to prefilter optimizations where a regex
engine can look for matches of one of those 4 literals, and only then use the
regex engine to confirm whether there is a match at that position. Some regex
engines will also notice that every match starts with an `A` and use `memchr`
to look for occurrences of `A` as a fast prefilter.
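To sketch the idea (this is not how any particular engine implements it
internally), one can emulate a prefilter by hand with the `memchr` and `regex`
crates. The key and haystack here are illustrative:

```rust
use memchr::memchr_iter;
use regex::Regex;

fn main() {
    let re = Regex::new(r"(?:ASIA|AKIA|AROA|AIDA)[A-Z0-7]{16}").unwrap();
    let haystack = "key = AKIAIOSFODNN7EXAMPLE;";
    // Prefilter: every match must start with `A`, so only those offsets
    // are candidates worth handing to the full regex engine.
    for candidate in memchr_iter(b'A', haystack.as_bytes()) {
        // Confirmation step. (A real engine would use an anchored search
        // here; `find_at` merely starts the search at `candidate`.)
        if let Some(m) = re.find_at(haystack, candidate) {
            println!("match at {}..{}: {}", m.start(), m.end(), m.as_str());
            break;
        }
    }
}
```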
We also include compilation times to give an idea of how long it takes
to compile a moderately complex regex, and how that might vary with the
compilation time of a much simpler version of the regex.

Note that in all of the measurements for this group, we search the CPython
source code (concatenated into one file). We also lossily convert it to UTF-8
so that regex engines like `regress` can participate in this benchmark. (The
CPython source code contains a very small amount of invalid UTF-8.)

[pypi-aws-secrets]: https://github.com/pypi-data/pypi-aws-secrets
[pypi-aws-secrets-regex]: https://github.com/pypi-data/pypi-aws-secrets/blob/903a7bd35bc8d9963dbbb7ca35e8ecb02e31bed4/src/scanners/mod.rs#L15-L23
[aws-key-blog]: https://tomforb.es/i-scanned-every-package-on-pypi-and-found-57-live-aws-keys/

| Engine | full | quick | compile-full | compile-quick |
| - | - | - | - | - |
| d/ldc/std-regex | 15.1 MB/s | 470.0 MB/s | - | - |
| dotnet/compiled | 817.7 MB/s | 1218.9 MB/s | 96.20us | 38.40us |
| dotnet/nobacktrack | - | 947.1 MB/s | - | 187.50us |
| go/regexp | 115.0 MB/s | 851.2 MB/s | 19.09us | 2.86us |
| hyperscan | - | 1325.7 MB/s | - | 6.74ms |
| icu | 192.9 MB/s | 327.1 MB/s | 11.37us | 2.98us |
| java/hotspot | 40.0 MB/s | 119.3 MB/s | - | - |
| javascript/v8 | 308.4 MB/s | 297.5 MB/s | - | - |
| pcre2 | 939.6 MB/s | 1394.9 MB/s | **3.63us** | **839.00ns** |
| pcre2/jit | 1195.8 MB/s | 1012.4 MB/s | 19.81us | 4.85us |
| perl | 99.4 MB/s | 135.5 MB/s | - | - |
| python/re | 102.7 MB/s | 176.6 MB/s | 168.78us | 39.34us |
| python/regex | 104.6 MB/s | 121.9 MB/s | 471.18us | 95.01us |
| re2 | 553.5 MB/s | 1006.1 MB/s | 68.88us | 8.93us |
| regress | 280.4 MB/s | 749.4 MB/s | 8.63us | 2.09us |
| rust/regex | **1839.2 MB/s** | **1688.9 MB/s** | 88.13us | 15.47us |
| rust/regex/lite | 21.8 MB/s | 34.2 MB/s | 9.47us | 1.60us |
| rust/regexold | 670.5 MB/s | 1288.3 MB/s | 61.23us | 16.62us |

Show individual benchmark parameters.
**full**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/09-aws-keys/full` |
| model | [`grep-captures`](MODELS.md#grep-captures) |
| regex | `````(('\|")((?:ASIA\|AKIA\|AROA\|AIDA)([A-Z0-7]{16}))('\|").*?(\n^.*?){0,4}(('\|")[a-zA-Z0-9+/]{40}('\|"))+\|('\|")[a-zA-Z0-9+/]{40}('\|").*?(\n^.*?){0,3}('\|")((?:ASIA\|AKIA\|AROA\|AIDA)([A-Z0-7]{16}))('\|"))+````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`wild/cpython-226484e4.py`](benchmarks/haystacks/wild/cpython-226484e4.py) |
| count(`.*`) | 0 |

**quick**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/09-aws-keys/quick` |
| model | [`grep`](MODELS.md#grep) |
| regex | `````((?:ASIA\|AKIA\|AROA\|AIDA)([A-Z0-7]{16}))````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`wild/cpython-226484e4.py`](benchmarks/haystacks/wild/cpython-226484e4.py) |
| count(`.*`) | 0 |

**compile-full**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/09-aws-keys/compile-full` |
| model | [`compile`](MODELS.md#compile) |
| regex | `````(('\|")((?:ASIA\|AKIA\|AROA\|AIDA)([A-Z0-7]{16}))('\|").*?(\n^.*?){0,4}(('\|")[a-zA-Z0-9+/]{40}('\|"))+\|('\|")[a-zA-Z0-9+/]{40}('\|").*?(\n^.*?){0,3}('\|")((?:ASIA\|AKIA\|AROA\|AIDA)([A-Z0-7]{16}))('\|"))+````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack | `"AIDAABCDEFGHIJKLMNOP""aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa [.. snip ..]` |
| count(`.*`) | 1 |

**compile-quick**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/09-aws-keys/compile-quick` |
| model | [`compile`](MODELS.md#compile) |
| regex | `````((?:ASIA\|AKIA\|AROA\|AIDA)([A-Z0-7]{16}))````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack | `AIDAABCDEFGHIJKLMNOP` |
| count(`.*`) | 1 |

### bounded-repeat
This group of benchmarks measures how well regex engines do with bounded
repeats. Bounded repeats are sub-expressions that are permitted to match
up to some fixed number of times. For example, `a{3,5}` matches 3, 4 or 5
consecutive `a` characters. Unlike unbounded repetition operators, the regex
engine needs some way to track when the bound has reached its limit. For this
reason, many regex engines will translate `a{3,5}` to `aaaa?a?`. Given that
the bounds may be much higher than `5` and that the sub-expression may be much
more complicated than a single character, bounded repeats can quickly cause the
underlying matcher to balloon in size.
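For example, one can check that the expanded form really is equivalent. This
is a sketch with the regex crate; the translation shown illustrates the
technique and is not a claim about what any particular engine does with this
exact pattern:

```rust
use regex::Regex;

fn main() {
    // a{3,5} and its hand-expanded equivalent report the same matches.
    let bounded = Regex::new(r"a{3,5}").unwrap();
    let expanded = Regex::new(r"aaaa?a?").unwrap();
    for hay in ["aa", "aaa", "aaaa", "aaaaaaaa"] {
        assert_eq!(
            bounded.find(hay).map(|m| m.range()),
            expanded.find(hay).map(|m| m.range()),
        );
    }
}
```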
We measure three different types of bounded repeats:

* A search for a number of consecutive letters, both ASCII only and Unicode
aware.
* A search for certain types of words surrounding a `Result` type in Rust
source code.
* A search for consecutive words, all beginning with a capital letter.

We also include measurements for the compilation time of the last two.
Hyperscan does unusually well here, particularly for an automata oriented
engine. It's plausible that it has some specific optimizations in place for
bounded repeats.

`rust/regex` slows down quite a bit on the `context` regex. Namely, the
`context` regex is quite gnarly and its `(?s:.)` sub-expression coupled with
the bounded repeat causes a large portion of its transition table to get filled
out. This in turn results in more time than usual being spent actually building
the lazy DFA's transition table during a search. Typically, the lazy DFA's
transition table is built pretty quickly and then mostly reused on subsequent
searches. But in this case, the transition table exceeds the lazy DFA's cache
capacity and results in the cache getting cleared. However, the rate at which
new transitions are created is still low enough that the lazy DFA is used
instead of falling back to a slower engine.

| Engine | letters-en | letters-ru | context | capitals | compile-context | compile-capitals |
| - | - | - | - | - | - | - |
| d/ldc/std-regex | 229.1 MB/s | 70.7 MB/s | 156.7 MB/s | 271.7 MB/s | - | - |
| dotnet/compiled | 135.0 MB/s | 179.8 MB/s | 180.0 MB/s | 833.4 MB/s | 31.90us | 25.00us |
| dotnet/nobacktrack | 153.2 MB/s | 145.6 MB/s | 52.7 MB/s | 660.0 MB/s | 175.70us | 43.20us |
| go/regexp | 32.0 MB/s | 27.3 MB/s | 31.9 MB/s | 58.1 MB/s | 22.54us | 17.08us |
| hyperscan | 724.5 MB/s | 268.0 MB/s | **498.1 MB/s** | **2.7 GB/s** | 24.65ms | 651.35us |
| icu | 54.1 MB/s | 73.3 MB/s | 73.9 MB/s | 276.0 MB/s | 5.44us | 2.81us |
| java/hotspot | 83.5 MB/s | 149.3 MB/s | 73.4 MB/s | 127.1 MB/s | - | - |
| javascript/v8 | 157.3 MB/s | 60.3 MB/s | 153.7 MB/s | 742.9 MB/s | - | - |
| pcre2 | 57.8 MB/s | 421.9 KB/s | 77.4 MB/s | 566.6 MB/s | **881.00ns** | 29.18us |
| pcre2/jit | 334.8 MB/s | 288.9 MB/s | 377.6 MB/s | 1558.1 MB/s | 5.38us | 36.73us |
| perl | 69.8 MB/s | 54.0 MB/s | 90.0 MB/s | 207.9 MB/s | - | - |
| python/re | 77.3 MB/s | - | 72.6 MB/s | 57.3 MB/s | 43.16us | 26.65us |
| python/regex | 31.4 MB/s | 77.1 MB/s | 30.4 MB/s | 275.4 MB/s | 102.39us | 56.61us |
| re2 | 506.5 MB/s | 7.7 MB/s | 89.0 MB/s | 987.7 MB/s | 93.10us | 119.84us |
| regress | 167.4 MB/s | 18.3 MB/s | 169.7 MB/s | 414.7 MB/s | - | **1.22us** |
| rust/regex | **733.8 MB/s** | **648.3 MB/s** | 100.6 MB/s | 825.6 MB/s | 56.70us | 57.18us |
| rust/regex/lite | 28.1 MB/s | - | 29.0 MB/s | 56.5 MB/s | 8.58us | 13.57us |
| rust/regexold | 611.4 MB/s | 535.7 MB/s | 20.7 MB/s | 823.7 MB/s | 36.00us | 62.54us |

Show individual benchmark parameters.
**letters-en**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/10-bounded-repeat/letters-en` |
| model | [`count`](MODELS.md#count) |
| regex | `````[A-Za-z]{8,13}````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`opensubtitles/en-sampled.txt`](benchmarks/haystacks/opensubtitles/en-sampled.txt) |
| count(`hyperscan`) | 3724 |
| count(`.*`) | 1833 |

**letters-ru**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/10-bounded-repeat/letters-ru` |
| model | [`count`](MODELS.md#count) |
| regex | `````\p{L}{8,13}````` |
| case-insensitive | `false` |
| unicode | `true` |
| haystack-path | [`opensubtitles/ru-sampled.txt`](benchmarks/haystacks/opensubtitles/ru-sampled.txt) |
| count(`hyperscan`) | 8570 |
| count(`.*`) | 3475 |

**context**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/10-bounded-repeat/context` |
| model | [`count`](MODELS.md#count) |
| regex | `````[A-Za-z]{10}\s+[\s\S]{0,100}Result[\s\S]{0,100}\s+[A-Za-z]{10}````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`rust-src-tools-3b0d4813.txt`](benchmarks/haystacks/rust-src-tools-3b0d4813.txt) |
| count(`hyperscan`) | 109 |
| count(`.*`) | 53 |

**capitals**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/10-bounded-repeat/capitals` |
| model | [`count`](MODELS.md#count) |
| regex | `````(?:[A-Z][a-z]+\s*){10,100}````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`rust-src-tools-3b0d4813.txt`](benchmarks/haystacks/rust-src-tools-3b0d4813.txt) |
| count(`hyperscan`) | 237 |
| count(`.*`) | 11 |

**compile-context**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/10-bounded-repeat/compile-context` |
| model | [`compile`](MODELS.md#compile) |
| regex | `````[A-Za-z]{10}\s+[\s\S]{0,100}Result[\s\S]{0,100}\s+[A-Za-z]{10}````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack | `abcdefghij blah blah blah Result blib blab klmnopqrst` |
| count(`.*`) | 1 |

`d/.*/std-regex` is excluded because it caches regex compilation.
**compile-capitals**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/10-bounded-repeat/compile-capitals` |
| model | [`compile`](MODELS.md#compile) |
| regex | `````(?:[A-Z][a-z]+\s*){10,100}````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack | `Crazy Janey Mission Man Wild Billy Greasy Lake Hazy Davy Kil [.. snip ..]` |
| count(`hyperscan`) | 12 |
| count(`.*`) | 1 |

`d/.*/std-regex` is excluded because it caches regex compilation.
### unstructured-to-json
These benchmarks come from a [task that converts unstructured log data to
structured JSON data][OpatrilPeter description]. It works by iterating over
every line in the log file and parsing various parts of each line into
different sections using capture groups. The regex matches every line, so any
fast logic designed to reject non-matches will generally penalize regex engines
here.

The original regex looks like this:
```
(?x)
^
(?P<...>[^\ ]+\ [^\ ]+)[\ ](?P<...>[DIWEF])[1234]:[\ ]
(?P<...>
    (?:
        (?:
            \[ [^\]]*? \] | \( [^\)]*? \)
        ):[\ ]
    )*
)(?P<...>.*?)
[\ ]\{(?P<...>[^\}]*)\}
$
```

(The actual regex is flattened since not all engines support verbose mode. We
also remove the names from each capture group.)
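To give a flavor of the `grep-captures` workload, here is a sketch with a much
simplified, hypothetical stand-in for the regex above:

```rust
use regex::Regex;

fn main() {
    // Simplified stand-in: pull a timestamp, a severity letter and a
    // message out of a log line via capture groups.
    let re = Regex::new(r"^(\S+ \S+) ([DIWEF])\d: (.*)$").unwrap();
    let line = "2022/06/17 06:25:22 I4: something happened";
    if let Some(caps) = re.captures(line) {
        println!("ts={} level={} msg={}", &caps[1], &caps[2], &caps[3]);
    }
}
```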
`pcre2/jit` does _really_ well here. I'm not personally familiar with how
PCRE2's JIT works, but if I had to guess, I'd say there are some clever
optimizations with respect to the `[^ ]+` (and similar) sub-expressions in this
regex.

Otherwise, the backtracking engines generally outperform the automata engines
in this benchmark. Interestingly, all of `re2`, `go/regexp` and `rust/regex`
principally use their own bounded backtracking algorithms. But it looks like
"proper" backtrackers tend to be better optimized than the ones found in RE2
and its descendants. (Bounded backtracking does have to pay for checking that
no combination of haystack position and NFA state is visited more than once,
but even removing that check does not bring, e.g., `rust/regex` up to speeds
similar to other backtrackers.)

[OpatrilPeter description]: https://github.com/rust-lang/regex/discussions/960#discussioncomment-5106322
| Engine | extract | compile |
| - | - | - |
| dotnet/compiled | 673.8 MB/s | 39.30us |
| dotnet/nobacktrack | 38.3 MB/s | 431.00us |
| go/regexp | 86.3 MB/s | 6.17us |
| icu | 99.9 MB/s | 7.95us |
| java/hotspot | 218.2 MB/s | - |
| javascript/v8 | 997.5 MB/s | - |
| pcre2 | 207.8 MB/s | **1.33us** |
| pcre2/jit | **1561.3 MB/s** | 6.96us |
| perl | 147.8 MB/s | - |
| python/re | 119.5 MB/s | 73.51us |
| python/regex | 128.2 MB/s | 194.30us |
| re2 | 113.3 MB/s | 9.08us |
| regress | 286.2 MB/s | 4.09us |
| rust/regex | 107.2 MB/s | 19.49us |
| rust/regex/lite | 25.5 MB/s | 2.59us |
| rust/regexold | 74.0 MB/s | 12.16us |

Show individual benchmark parameters.
**extract**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/11-unstructured-to-json/extract` |
| model | [`grep-captures`](MODELS.md#grep-captures) |
| regex-path | [`wild/unstructured-to-json.txt`](benchmarks/regexes/wild/unstructured-to-json.txt) |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`wild/unstructured-to-json.log`](benchmarks/haystacks/wild/unstructured-to-json.log) |
| count(`.*`) | 600 |

`d/.*/std-regex` is excluded because its match count, `500`, differs from
everything else.

**compile**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/11-unstructured-to-json/compile` |
| model | [`compile`](MODELS.md#compile) |
| regex-path | [`wild/unstructured-to-json.txt`](benchmarks/regexes/wild/unstructured-to-json.txt) |
| case-insensitive | `false` |
| unicode | `false` |
| haystack | `2022/06/17 06:25:22 I4: [17936:140245395805952:(17998)]: (8f [.. snip ..]` |
| count(`.*`) | 1 |

`d/.*/std-regex` is excluded because its match count in the `extract` benchmark,
`500`, differs from everything else.

### dictionary
This benchmark highlights how well each regex engine does searching for a small
dictionary of words. The dictionary is made up of about 2,500 words, where
every word is at least 15 bytes in length. The number of words was chosen to
be small enough that _most_ regex engines can execute a search in reasonable
time. The bigger minimum length of each word was chosen in order to make this
a throughput benchmark. That is, there is only one match found here, so this
benchmark is measuring the raw speed with which an engine can handle a big
alternation of plain literals.

Most regex engines run quite slowly here. `perl`, `re2` and `rust/regex` lead
the pack with throughput measured in MB/s, while the rest are measured in
KB/s. One might think that this is a benchmark that would manifest as a bright
dividing line between finite automata engines and backtracking engines. Namely,
finite automata engines should handle all of the alternations in "parallel,"
whereas backtrackers will essentially try to match each alternate at each
position in the haystack (owch). Indeed, this seems _mostly_ true, but `perl`
(a backtracker) does quite well while `go/regexp` (a finite automata engine)
does quite poorly. Moreover, what explains the differences between `perl`,
`re2` and `rust/regex`?

There are several knots to untangle here.
First, we'll tackle the reason why `go/regexp` has a poor showing here. The
answer lies in how the Thompson NFA construction works. A Thompson NFA can be
built in worst case linear time (in the size of the pattern), but in exchange,
it has _epsilon transitions_ in its state graph. Epsilon transitions are
transitions in a finite state machine that are followed without consuming
any input. In a case like `foo|bar|quux`, you can think of the corresponding
Thompson NFA (very loosely) as creating a single starting state with three
epsilon transitions to each of `foo`, `bar` and `quux`. In a Thompson NFA
simulation (i.e., a regex search using a Thompson NFA), all of these epsilon
transitions have to be continually followed at every position in the haystack.
With a large number of alternations, the amount of time spent shuffling through
these epsilon transitions can be quite enormous. While the search time remains
linear with respect to the size of the haystack, the "constant" factor here
(i.e., the size of the regex pattern) can become quite large. In other words,
a Thompson NFA scales poorly with respect to the size of the pattern. In this
particular case, a Thompson NFA just doesn't do any better than a backtracker.

The second knot to untangle here is why `perl` does so well despite being a
backtracker. While I'm not an expert on Perl internals, it appears to do well
here because of something called a _trie optimization_. That is, Perl's regex
engine will transform large alternations like this into an equivalent but
much more efficient structure by essentially building a trie and encoding it
into the regex itself. It turns out that `rust/regex` does the same thing,
because the exact same optimization helps a backtracker in the same way it
helps a Thompson NFA simulation. The optimization exploits the fact that the
branches in the alternation are not truly independent and actually share a lot
of overlap. Without the optimization, the branches are treated as completely
independent and one must brute force their way through each one.

So what does this trie optimization look like? Consider a regex like
`zapper|z|zap`. There is quite a bit of redundant structure. With some
care, and making sure to preserve leftmost-first match semantics, it can be
translated to the equivalent pattern `z(apper||ap)`. Notice how in the pattern
we started with, the alternation needs to be dealt with for every byte in the
haystack, because you never know which branch is going to match, if any. But in
the latter case, you now don't even need to consider the alternation until the
byte `z` matches, which is likely to be quite rare.
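As a quick sanity check, the two patterns report identical matches. This is a
sketch with the regex crate; note the empty alternation branch, which
preserves the priority of the bare `z` and requires a version of the crate
that accepts empty branches:

```rust
use regex::Regex;

fn main() {
    let original = Regex::new(r"zapper|z|zap").unwrap();
    let factored = Regex::new(r"z(apper||ap)").unwrap();
    for hay in ["zap them", "zapper!", "fizz"] {
        // Both report the same leftmost-first match.
        assert_eq!(
            original.find(hay).map(|m| m.range()),
            factored.find(hay).map(|m| m.range()),
        );
    }
}
```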
Indeed, the algorithm for constructing such a pattern effectively proceeds by
building a trie from the original alternation, and then converting the trie
back to whatever intermediate representation the regex engine uses.

The last knot to untangle is to explain the differences between `perl`, `re2`
and `rust/regex`. Perl still uses a backtracking strategy, but with the trie
optimization described above, it can try much fewer things for each position
in the haystack. But what's going on with `re2` and `rust/regex`? In this
case, `re2` uses the Thompson NFA simulation, but `re2` does not use the trie
optimization described above, so it gets stuck in a lot of epsilon transition
shuffling. Finally, `rust/regex` does the trie optimization _and_ uses its lazy
DFA internally for this case. `re2` probably could too, but both libraries use
various heuristics for deciding which engine to use. In this case, the regex
might be too big for `re2` to use its lazy DFA.

OK, that wraps up discussion of the `single` benchmark. But what is the `multi`
benchmark? Where `single` represents combining all words in the dictionary into
a single pattern, `multi` represents a strategy where each word is treated as
its own distinct pattern. In the `single` case, Hyperscan actually rejects
the pattern for being too large, but is happy to deal with it if each word
is treated as its own pattern. The main semantic difference between these
strategies is that the `multi` approach permits not only identifying where a
match occurred, but *which* word in the dictionary matched. And this is done
without using capture groups.

Hyperscan does really well here. While its source code is difficult to
penetrate, my understanding is that Hyperscan uses its "FDR" algorithm here,
which is essentially a SIMD-ified variant of multi-substring Shift-Or. This
benchmark represents Hyperscan's bread and butter: multi-pattern search.

`rust/regex` actually does _worse_ in the `multi` case versus the `single`
case. `rust/regex`'s support for multi-pattern search is still young, and in
particular, the multi-pattern case currently inhibits the trie optimization
discussed above.

Finally, we also include compile-time benchmarks for each of the above cases in
order to give an idea of how long it takes to build a regex from a dictionary
like this. I don't have much to say here other than to call out the fact
that the trie optimization does have a meaningful impact on regex compile
times in the `rust/regex` case at least.

| Engine | single | multi | compile-single | compile-multi |
| - | - | - | - | - |
| d/ldc/std-regex | 41.3 MB/s | - | - | - |
| dotnet/compiled | 1436.7 KB/s | - | 10.41ms | - |
| go/regexp | 624.5 KB/s | - | 5.66ms | - |
| hyperscan | - | **8.2 GB/s** | - | 19.99ms |
| icu | 141.6 KB/s | - | **1.56ms** | - |
| java/hotspot | 107.3 KB/s | - | - | - |
| javascript/v8 | 28.4 KB/s | - | - | - |
| perl | 133.7 MB/s | - | - | - |
| python/re | 151.6 KB/s | - | 25.51ms | - |
| python/regex | 144.9 KB/s | - | 69.17ms | - |
| re2 | 5.3 MB/s | - | 4.17ms | - |
| regress | 90.9 KB/s | - | 2.97ms | - |
| rust/regex | **712.2 MB/s** | 196.5 MB/s | 7.47ms | **13.79ms** |
| rust/regex/lite | 51.7 KB/s | - | 1.76ms | - |
| rust/regexold | 29.6 KB/s | - | 6.10ms | - |

Show individual benchmark parameters.
**single**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/12-dictionary/single` |
| model | [`count`](MODELS.md#count) |
| regex-path | [`dictionary/english/length-15.txt`](benchmarks/regexes/dictionary/english/length-15.txt) |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`opensubtitles/en-medium.txt`](benchmarks/haystacks/opensubtitles/en-medium.txt) |
| count(`.*`) | 1 |

`dotnet/nobacktrack` is omitted because the regex is too large.
`hyperscan` is omitted because the regex is too large.
`pcre2/*` are omitted because the regex is too large.
**multi**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/12-dictionary/multi` |
| model | [`count`](MODELS.md#count) |
| regex-path | [`dictionary/english/length-15.txt`](benchmarks/regexes/dictionary/english/length-15.txt) |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`opensubtitles/en-medium.txt`](benchmarks/haystacks/opensubtitles/en-medium.txt) |
| count(`.*`) | 1 |

Only `hyperscan` and `rust/regex` are included because they are the only regex
engines to support multi-pattern regexes. (Note that the `regex` crate API
does not support this. You need to drop down to the `meta::Regex` API in the
`regex-automata` crate.)
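For reference, a minimal sketch of that API; the words here are illustrative,
not taken from the actual dictionary:

```rust
use regex_automata::meta::Regex;

fn main() {
    // Each word is its own pattern; a match reports which pattern hit.
    let re = Regex::new_many(&[
        "uncharacteristically",
        "dissatisfaction",
        "misinterpretation",
    ])
    .unwrap();
    let hay = "their dissatisfaction was uncharacteristically plain";
    for m in re.find_iter(hay) {
        println!("pattern {} matched {:?}", m.pattern().as_usize(), &hay[m.range()]);
    }
}
```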
**compile-single**

| Parameter | Value |
| --------- | ----- |
| full name | `curated/12-dictionary/compile-single` |
| model | [`compile`](MODELS.md#compile) |
| regex-path | [`dictionary/english/length-15.txt`](benchmarks/regexes/dictionary/english/length-15.txt) |
| case-insensitive | `false` |
| unicode | `false` |
| haystack | `Zubeneschamali's` |
| count(`.*`) | 1 |

`d/.*/std-regex` is excluded because it caches regex compilation.
`dotnet/nobacktrack` is omitted because the regex is too large.
`hyperscan` is omitted because the regex is too large.
`java/hotspot` is omitted because we currently don't benchmark Java regex
compilation.
`javascript/v8` is omitted because we currently don't benchmark JavaScript
regex compilation.
`pcre2/*` are omitted because the regex is too large.
`perl` is omitted because we currently don't benchmark Perl regex compilation.
**compile-multi**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/12-dictionary/compile-multi` |
| model | [`compile`](MODELS.md#compile) |
| regex-path | [`dictionary/english/length-15.txt`](benchmarks/regexes/dictionary/english/length-15.txt) |
| case-insensitive | `false` |
| unicode | `false` |
| haystack | `Zubeneschamali's` |
| count(`.*`) | 1 |

Only `hyperscan` and `rust/regex` are included because they are the only regex
engines to support multi-pattern regexes. (Note that the `regex` crate API
does not support this. You need to drop down to the `meta::Regex` API in the
`regex-automata` crate.)

### noseyparker
This benchmark measures how well regex engines do when asked to look for
matches for many different patterns. The patterns come from the [Nosey Parker]
project, which finds secrets and sensitive
information in textual data and source repositories. Nosey Parker operates
principally by defining a number of rules for detecting secrets (for example,
AWS API keys), and then looking for matches of those rules in various corpora.
The rules are, as you might have guessed, defined as regular expressions.

I went through each of its rules and extracted a total of 96 regular
expressions, as of [commit `be8c26e8`][be8c26e8]. These 96 regexes make up the
`single` and `multi` benchmarks below, with `single` corresponding to joining
all of the patterns into one big alternation and `multi` corresponding to treating
each pattern as its own regex. In the latter case, only the `rust/regex` and
`hyperscan` engines are measured, since they are the only ones to support
multi-regex matching.

This is a particularly brutal benchmark. Most regex engines can't deal with it
at all, and will either reject it at compilation time for being too big or
simply take longer than we're willing to wait. (rebar imposes a reasonable
timeout for all benchmarks, and if the timeout is exceeded, no measurements are
collected.)

Hyperscan is in its own class here. Hyperscan was purpose-built to deal with
the multi-pattern use case, and it deals with it *very* well here. The specific
patterns also put this in its wheelhouse because they all have some kind of
literal string in them. Hyperscan uses a [literal searching and finite automata
decomposition strategy][hyperpub] to quickly identify candidate matches and
avoids doing redundant work. Although how it all fits together and avoids
pitfalls such as worst case quadratic search time doesn't appear to be written
down anywhere.

`rust/regex` is just barely serviceable here. It uses its lazy DFA to handle
this regex, but with the default cache sizes, profiling suggests that it is
spending a lot of its time building the DFA. It's plausible that increasing the
cache size for such a big regex would let it execute searches faster.
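For example, the regex crate exposes knobs for this. A sketch, where the
pattern is a stand-in and the sizes are illustrative rather than tuned:

```rust
use regex::RegexBuilder;

fn main() {
    // Imagine the ~96-branch Nosey Parker alternation in place of this
    // stand-in pattern.
    let re = RegexBuilder::new(r"foo|bar|quux")
        .size_limit(100 * (1 << 20))     // budget for the compiled pattern
        .dfa_size_limit(100 * (1 << 20)) // budget for the lazy DFA's cache
        .build()
        .unwrap();
    assert!(re.is_match("some quux here"));
}
```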
`pcre2/jit` doesn't do as well here, but that is perhaps expected, since it is
a backtracking engine. With that said, no other backtracking engine could deal
with this regex at all, so `pcre2/jit` is doing quite well relative to other
backtracking engines.

Finally, we also include compile-time benchmarks for each of the `single` and
`multi` cases to give a general sense of how long this monster regex takes to
build.

[Nosey Parker]: https://github.com/praetorian-inc/noseyparker
[be8c26e8]: https://github.com/praetorian-inc/noseyparker/tree/be8c26e8b2e8550f101ae62c3f374d0226808214
[hyperpub]: https://www.usenix.org/system/files/nsdi19-wang-xiang.pdf

| Engine | single | multi | compile-single | compile-multi |
| - | - | - | - | - |
| hyperscan | **4.3 GB/s** | **4.3 GB/s** | 215.37ms | 133.45ms |
| pcre2/jit | 13.0 MB/s | - | **591.49us** | - |
| rust/regex | 122.9 MB/s | 99.7 MB/s | 2.24ms | **2.61ms** |
| rust/regexold | 9.7 MB/s | - | 3.91ms | - |

Show individual benchmark parameters.
**single**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/13-noseyparker/single` |
| model | [`count`](MODELS.md#count) |
| regex-path | [`wild/noseyparker.txt`](benchmarks/regexes/wild/noseyparker.txt) |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`wild/cpython-226484e4.py`](benchmarks/haystacks/wild/cpython-226484e4.py) |
| count(`hyperscan`) | 241 |
| count(`.*`) | 55 |

* `d/.*/std-regex` is omitted because it times out.
* `dotnet/compiled` is omitted because it times out.
* `dotnet/nobacktrack` is omitted because the regex is too big.
* `go/regexp` is omitted because there are bounded repeats that exceed its
limit.
* `icu` is omitted because it times out.
* `java/hotspot` is omitted because it times out.
* `javascript/v8` is omitted because it doesn't support inline flags.
* `pcre2` is omitted because it times out.
* `perl` is omitted because it times out.
* `python/*` is omitted because it times out.
* `re2` is omitted because it seems to fail and reports a count of `0`.
* `regress` is omitted because it doesn't support inline flags.
* `rust/regex/lite` is omitted because it times out.

**multi**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/13-noseyparker/multi` |
| model | [`count`](MODELS.md#count) |
| regex-path | [`wild/noseyparker.txt`](benchmarks/regexes/wild/noseyparker.txt) |
| case-insensitive | `false` |
| unicode | `false` |
| haystack-path | [`wild/cpython-226484e4.py`](benchmarks/haystacks/wild/cpython-226484e4.py) |
| count(`hyperscan`) | 241 |
| count(`.*`) | 55 |

Only `hyperscan` and `rust/regex` are included because they are the only
regex engines to support multi-pattern regexes. (Note that the `regex` crate
API does not support this. You need to drop down to the `meta::Regex` API in
the `regex-automata` crate.)

**compile-single**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/13-noseyparker/compile-single` |
| model | [`compile`](MODELS.md#compile) |
| regex-path | [`wild/noseyparker.txt`](benchmarks/regexes/wild/noseyparker.txt) |
| case-insensitive | `false` |
| unicode | `false` |
| haystack | `TWITTER_API_KEY = 'UZYoBAfBzNace3mBwPOGYw'` |
| count(`.*`) | 1 |

We only include the engines that are measured in the `single` benchmark.
**compile-multi**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/13-noseyparker/compile-multi` |
| model | [`compile`](MODELS.md#compile) |
| regex-path | [`wild/noseyparker.txt`](benchmarks/regexes/wild/noseyparker.txt) |
| case-insensitive | `false` |
| unicode | `false` |
| haystack | `TWITTER_API_KEY = 'UZYoBAfBzNace3mBwPOGYw'` |
| count(`.*`) | 1 |

We only include the engines that are measured in the `multi` benchmark.
### quadratic
This set of benchmarks is meant to convince you that, even if you use a regex
engine that purports to guarantee worst case linear time searches, it is likely
possible to use it in a way that results in worst case quadratic time!

The regex we use here is `.*[^A-Z]|[A-Z]` and the haystack we search is the
letter `A` repeated `100`, `200` and `1000` times. There are two key insights
to understanding how this results in quadratic behavior:

1. It requires one to iterate over all matches in a haystack. Some regex
engines (e.g., `rust/regex` and `go/regexp`) provide first class APIs for such
an operation. They typically handle the pathological case of an empty match
for you, which would result in an infinite loop in naively written code. Some
regex engines (e.g., `pcre2` and `re2`) do not provide any APIs for iterating
over all matches. Callers have to write that code themselves. The point here
is that a regex search is executed many times for a haystack.
2. Because of how leftmost-first match semantics work, a regex engine might
scan all the way to the end of a haystack before reporting a match that starts
and ends at the *beginning* of the haystack. The reason for this is that most
regex engines will, by default, greedily consume as much as possible.

Quadratic behavior occurs by exploiting both of the insights above: by crafting
a regex and a haystack where every search scans to the end of the haystack, but
also that every search reports a match at the beginning of the search that is
exactly one character long.

Indeed, this is exactly what the regex `.*[^A-Z]|[A-Z]` does on a haystack
like `AAAAA`. Leftmost-first match semantics says that if there are multiple
matches that occur at the same position, then the match generated "first" by
the pattern should be preferred. In this case, `.*[^A-Z]` is preferred over
`[A-Z]`. But since `.*` matches as much as possible, it is not actually known
whether that branch matches until the regex engine reaches the end of the
haystack and realizes that it cannot match. At that point, the match from the
second branch, `[A-Z]` corresponding to the first `A`, is reported. Since we're
iterating over every match, the search advances to immediately after the first
`A` and repeats the same behavior: scanning all the way to the end of the
haystack, only to discover there is no match, and then reporting the second `A`
as the next match. This repeats itself, scanning the entire haystack a number
of times proportional to `n^2`, where `n` is the length of the haystack.
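The effect is easy to observe. A sketch with the regex crate:

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r".*[^A-Z]|[A-Z]").unwrap();
    let hay = "A".repeat(100);
    // Every match is a single `A`, but each one is only reported after
    // the search scans to the end of the haystack, so iterating over all
    // matches does work proportional to n^2.
    let matches: Vec<_> = re.find_iter(&hay).map(|m| m.range()).collect();
    assert_eq!(matches.len(), 100);
    assert!(matches.iter().all(|r| r.len() == 1));
}
```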
It is important to note that in a finite automata oriented regex engine, the
fact that `[A-Z]` matches at the beginning of the haystack is known after the
regex engine scans that part of the haystack. That is, its internal state is
aware of the fact that a match exists. It specifically continues searching
because leftmost-first semantics demand it. Once it reaches the end of the
haystack (or a point at which no other match could occur), then it stops and
returns the most recent match that it found. Unlike a backtracker, it does not
need to go back to the beginning of the haystack and start looking for a match
of the second branch.Given the semantics of leftmost-first matching, there is no way to avoid this.
It is, unfortunately, just how the cookie crumbles.With all of that said, `hyperscan` is the one regex engine that manages to
maintain the same throughput for each of the `1x`, `2x` and `10x` benchmarks.
That is, it does **not** exhibit worst case quadratic behavior here. It
retains its linear search time. How does it do it? The secret lies in the fact
that Hyperscan doesn't implement leftmost-first match semantics. (Indeed,
this is why some of its match counts differ throughout the benchmarks in
rebar.) Instead, Hyperscan reports a match as soon as it is seen. Once a match
is found, it doesn't continue on to try and greedily match the regex. For
example, the regex `\w+` will report 5 matches in the haystack `aaaaa`, whereas
for most other regex engines, only one match will be reported. This means
`hyperscan` can zip through this benchmark in one pass of the haystack.

The `rust/regex` engine can also do this, but requires dropping down to the
`regex-automata` crate and using `Input::new(haystack).earliest(true)` when
running a search. This instructs the regex engine to report matches as they're
seen, just like Hyperscan. Indeed, if the `rust/regex` runner program uses this
approach, then its throughput remains constant for the `1x`, `2x` and `10x`
benchmarks, just like for Hyperscan.
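A sketch of that approach (this is not rebar's actual runner code):

```rust
use regex_automata::{meta::Regex, Input};

fn main() {
    let re = Regex::new(r".*[^A-Z]|[A-Z]").unwrap();
    let hay = "A".repeat(1000);
    // earliest(true) reports a match as soon as one is known, instead of
    // scanning to the end of the haystack for leftmost-first semantics.
    let input = Input::new(hay.as_str()).earliest(true);
    assert_eq!(re.find_iter(input).count(), 1000);
}
```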
Credit goes to [this bug filed against the `go/regexp`
engine][go-regexp-quadratic] for making me aware of this issue.

Note: We use `[A-Z]` in this example instead of `A` in an attempt to subvert
any sort of literal optimizations done by the regex engine.

[go-regexp-quadratic]: https://github.com/golang/go/issues/11181
| Engine | 1x | 2x | 10x |
| - | - | - | - |
| d/ldc/std-regex | 1192.2 KB/s | 615.7 KB/s | 124.7 KB/s |
| dotnet/compiled | 56.1 MB/s | 50.2 MB/s | 26.3 MB/s |
| dotnet/nobacktrack | 8.3 MB/s | 5.4 MB/s | 1255.2 KB/s |
| go/regexp | 1813.8 KB/s | 983.2 KB/s | 204.7 KB/s |
| hyperscan | **174.0 MB/s** | **179.9 MB/s** | **181.7 MB/s** |
| icu | 3.5 MB/s | 1934.9 KB/s | 420.9 KB/s |
| java/hotspot | 9.2 MB/s | 5.8 MB/s | 1038.0 KB/s |
| javascript/v8 | 16.4 MB/s | 10.6 MB/s | 2.9 MB/s |
| pcre2 | 2.1 MB/s | 1162.5 KB/s | 243.5 KB/s |
| pcre2/jit | 18.6 MB/s | 11.4 MB/s | 3.0 MB/s |
| perl | 2.6 MB/s | 1761.8 KB/s | 483.4 KB/s |
| python/re | 3.1 MB/s | 1943.4 KB/s | 460.6 KB/s |
| python/regex | 3.8 MB/s | 2.5 MB/s | 707.7 KB/s |
| re2 | 9.6 MB/s | 6.5 MB/s | 1836.1 KB/s |
| regress | 5.6 MB/s | 3.1 MB/s | 678.2 KB/s |
| rust/regex | 17.9 MB/s | 8.6 MB/s | 1707.2 KB/s |
| rust/regex/lite | 1058.6 KB/s | 574.3 KB/s | 119.2 KB/s |
| rust/regexold | 14.1 MB/s | 7.6 MB/s | 1665.4 KB/s |

Show individual benchmark parameters.
**1x**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/14-quadratic/1x` |
| model | [`count`](MODELS.md#count) |
| regex | `````.*[^A-Z]\|[A-Z]````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack | `AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA [.. snip ..]` |
| count(`.*`) | 100 |

This is our baseline benchmark that searches a haystack with the letter `A`
repeated 100 times.

**2x**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/14-quadratic/2x` |
| model | [`count`](MODELS.md#count) |
| regex | `````.*[^A-Z]\|[A-Z]````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack | `AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA [.. snip ..]` |
| count(`.*`) | 200 |

This is like `1x`, but doubles the haystack length. This
should provide a way to show the quadratic nature of this particular benchmark.

The throughputs reported *should* remain roughly the same if the time
complexity is linear, but in fact, the throughputs decrease by about a factor
of 2. That demonstrates a superlinear relationship between the inputs and the
time taken.

**10x**
| Parameter | Value |
| --------- | ----- |
| full name | `curated/14-quadratic/10x` |
| model | [`count`](MODELS.md#count) |
| regex | `````.*[^A-Z]\|[A-Z]````` |
| case-insensitive | `false` |
| unicode | `false` |
| haystack | `AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA [.. snip ..]` |
| count(`.*`) | 1000 |

This is like `1x`, but increases the haystack length by a factor of 10. This
should provide more evidence that the relationship is quadratic in the same
way that the `2x` benchmark does.