https://github.com/tc39/proposal-regexp-buffer-boundaries

Regular Expression Buffer Boundaries for ECMAScript
https://github.com/tc39/proposal-regexp-buffer-boundaries


Host: GitHub
URL: https://github.com/tc39/proposal-regexp-buffer-boundaries
Owner: tc39
License: bsd-3-clause
Created: 2021-10-15T20:58:47.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2022-05-28T00:58:15.000Z (about 2 years ago)
Last Synced: 2024-05-02T06:14:20.971Z (about 2 months ago)
Language: HTML
Homepage: https://tc39.es/proposal-regexp-buffer-boundaries
Size: 210 KB
Stars: 7
Watchers: 5
Forks: 3
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

Lists

awesome-regex - Buffer boundaries - Anchors `\A` and `\z`, not affected by flag `m`. (JavaScript regex evolution / Regex processors)

README

        
# Regular Expression Buffer Boundaries for ECMAScript

This proposal seeks to introduce `\A` and `\z` character escapes to Unicode-mode regular expressions as synonyms for `^` and `$` that are not affected by the `m` (multiline) flag.

## Status

**Stage:** 2  

**Champion:** Ron Buckton ([@rbuckton](https://github.com/rbuckton))  

_For detailed status of this proposal see [TODO](#todo), below._  

## Authors

* Ron Buckton ([@rbuckton](https://github.com/rbuckton))  

# Motivations

> NOTE: See https://github.com/rbuckton/proposal-regexp-features for an overview of
> how this proposal fits into other possible future features for Regular Expressions.

Buffer Boundaries are a common feature across a wide array of regular expression engines that
allow you to match the start or end of the entire input regardless of whether the `m` (multiline) flag
has been set. Buffer Boundaries also allow you to match the start/end of a line *and* the start/end of
the input in a single RegExp using the `m` flag.

While it's possible to emulate `\A` and `\z` using existing patterns, the alternatives are far harder to
read, and require a more comprehensive working understanding of regular expressions to interpret.

For example, compare the following approaches:

```js
// emulate `m`-mode `^` outside of `m`-mode:
const a = /^foo|(?<=^|[\u000A\u000D\u2028\u2029])bar/u;

// emulate non-`m`-mode `^` inside of `m`-mode using modifiers (proposed):
const b = /(?-m:^)foo|^bar/mu;

// using `\A`:
const c = /\Afoo|^bar/mu;
```

In the example above, it is far less likely that a reader will readily understand the expression in
example (a). Not only is the content of the regular expression much harder to read, but understanding
its purpose requires interpreting how **six** different features of regular expressions interact:
grouping, positive lookbehind, the `^` metacharacter, disjunctions, character classes, and Unicode escapes.

Example (b) is an improvement, but still requires the reader to visually balance the parentheses as
well as to interpret how **four** different regular expression features interact: grouping, modifiers
(proposed), the `m` flag, and the `^` metacharacter.

In comparison, example (c) is far easier to read. It uses a terse escape sequence of only two
characters (`\A`), which makes it far easier to distinguish between special pattern syntax
and plain text segments like `foo` and `bar`.
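
To see why the emulation in example (a) is needed at all, the following sketch compares it with a naive pattern that simply turns on the `m` flag and uses `^` for both alternatives. It runs in current engines that support lookbehind. The `naive` pattern, the `inputs` list, and the `results` table are illustrative and not part of the proposal.

```js
// Example (a) from above: `foo` only at the start of the input,
// `bar` at the start of any line, written without the `m` flag.
const a = /^foo|(?<=^|[\u000A\u000D\u2028\u2029])bar/u;

// Naive attempt (illustrative): enable `m` and use `^` for both alternatives.
// Here `^foo` also matches at the start of every later line.
const naive = /^foo|^bar/mu;

const inputs = ["foo", "x\nfoo", "bar", "x\nbar", "xbar"];

// Record how each pattern treats each input.
const results = inputs.map((input) => ({
  input,
  a: a.test(input),
  naive: naive.test(input),
}));

console.table(results);
```

The two patterns disagree only on `"x\nfoo"`: under the `m` flag `^` can no longer express "start of the input", which is the gap `\A` is meant to fill.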

The `\A` and `\z` escapes have broad support across multiple other languages and regular expression
engines. As a result, they benefit from extensive existing documentation online, including
[Wikipedia](https://en.wikipedia.org/wiki/Regular_expression#Examples), numerous tutorial websites, as
well as the documentation from other languages. This significantly lessens the learning curve for `\A`
and `\z` compared with their alternatives.

# Prior Art 

* [Perl](https://rbuckton.github.io/regexp-features/engines/perl.html#feature-buffer-boundaries)  

* [PCRE](https://rbuckton.github.io/regexp-features/engines/pcre.html#feature-buffer-boundaries)  

* [Boost.Regex](https://rbuckton.github.io/regexp-features/engines/boost.regex.html#feature-buffer-boundaries)  

* [.NET](https://rbuckton.github.io/regexp-features/engines/dotnet.html#feature-buffer-boundaries)  

* [Oniguruma](https://rbuckton.github.io/regexp-features/engines/oniguruma.html#feature-buffer-boundaries)  

* [Hyperscan](https://rbuckton.github.io/regexp-features/engines/hyperscan.html#feature-buffer-boundaries)  

* [ICU](https://rbuckton.github.io/regexp-features/engines/icu.html#feature-buffer-boundaries)  

* [Glib/GRegex](https://rbuckton.github.io/regexp-features/engines/glib-gregex.html#feature-buffer-boundaries)  

See https://rbuckton.github.io/regexp-features/features/buffer-boundaries.html for additional information.

# Syntax

Buffer boundaries are similar to the `^` and `$` anchors, except that they are not affected by the `m` (multiline) flag:

- `\A` — Matches the start of the input.

- `\z` — Matches the end of the input.

- ~~`\Z` — A zero-width assertion consisting of an optional newline at the end of the buffer. Equivalent to `(?=\R?\z)`.~~

> NOTE: Requires the `u` or `v` flag, as `\A`, `\z`, and `\Z` are currently just escapes for `A`, `z` and `Z` without the `u` or `v` flag. 

> NOTE: Not supported inside of a character class.

> NOTE: The `\Z` assertion is no longer being considered as part of this proposal as of December 15th, 2021, but has
> been reserved for possible future use.

For more information about the `v` flag, see https://github.com/tc39/proposal-regexp-set-notation.

~~For more information about the `\R` escape sequence, see https://github.com/tc39/proposal-regexp-r-escape.~~
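
Because `\A` and `\z` are only reinterpreted under the `u` or `v` flag, code that wants to use them may need to check for support first. The following is a minimal sketch of such a check, assuming an unsupporting engine rejects `\A` in Unicode mode with a `SyntaxError`. The helper name `supportsBufferBoundaries` is hypothetical and not part of the proposal.

```js
// Hypothetical helper (not part of the proposal): reports whether this engine
// treats `\A` and `\z` as buffer boundaries in Unicode mode.
function supportsBufferBoundaries() {
  try {
    // Built with the constructor so that an unsupported escape surfaces as a
    // catchable SyntaxError instead of a parse error for the whole script.
    const re = new RegExp(String.raw`\Ab\z`, "um");
    // With `m`, `^b$` would match the middle line below; `\Ab\z` must not.
    return re.test("b") && !re.test("a\nb\nc");
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// Without `u` or `v`, the escapes are identity escapes for plain letters.
const legacyA = /\A/.test("A"); // true: matches a literal "A"
const legacyZ = /\z/.test("z"); // true: matches a literal "z"

console.log({ supported: supportsBufferBoundaries(), legacyA, legacyZ });
```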

# Examples

```js
// NOTE: `\A` and `\z` need an engine that implements this proposal;
// elsewhere, the `u`-mode patterns below are a SyntaxError.

// without buffer boundaries
{
  const pattern = String.raw`^foo$`;

  const re1 = new RegExp(pattern, "u");
  re1.test("foo"); // true
  re1.test("foo\nbar"); // false

  const re2 = new RegExp(pattern, "um");
  re2.test("foo"); // true
  re2.test("foo\nbar"); // true
}

// with buffer boundaries
{
  const pattern = String.raw`\Afoo\z`;

  const re1 = new RegExp(pattern, "u");
  re1.test("foo"); // true
  re1.test("foo\nbar"); // false

  const re2 = new RegExp(pattern, "um");
  re2.test("foo"); // true
  re2.test("foo\nbar"); // false
}

// mixing buffer boundaries and anchors
const re = /\Afoo|^bar$|baz\z/um;
re.test("foo");         // true
re.test("foo\n");       // true
re.test("\nfoo");       // false
re.test("bar");         // true
re.test("bar\n");       // true
re.test("\nbar");       // true
re.test("baz");         // true
re.test("baz\n");       // false
re.test("\nbaz");       // true
```
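
The examples above only run in an engine that implements this proposal. To try the same inputs today, one option is to rewrite the pattern source before compiling it, replacing `\A` and `\z` with negative lookarounds that do not depend on the `m` flag. The following is a rough sketch rather than a polyfill: `rewriteBufferBoundaries` is a hypothetical helper, and it assumes a `u`-mode pattern (no nested `v`-mode character classes) and an engine with lookbehind support.

```js
// `[\s\S]` matches any character, so "not preceded by anything" and
// "not followed by anything" mean start and end of input, with or without `m`.
const START_OF_INPUT = String.raw`(?<![\s\S])`;
const END_OF_INPUT = String.raw`(?![\s\S])`;

// Hypothetical helper: rewrites `\A` and `\z` outside of character classes.
function rewriteBufferBoundaries(source) {
  let out = "";
  let inClass = false;
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\" && i + 1 < source.length) {
      // Consume the escape as a pair so `\\A` (escaped backslash, then a
      // literal "A") is never mistaken for `\A`.
      const next = source[++i];
      if (!inClass && next === "A") out += START_OF_INPUT;
      else if (!inClass && next === "z") out += END_OF_INPUT;
      else out += ch + next; // keep every other escape untouched
      continue;
    }
    if (ch === "[") inClass = true;
    else if (ch === "]") inClass = false;
    out += ch;
  }
  return out;
}

// The "mixing buffer boundaries and anchors" pattern from the examples above.
const source = rewriteBufferBoundaries(String.raw`\Afoo|^bar$|baz\z`);
const re = new RegExp(source, "um");

const cases = ["foo", "foo\n", "\nfoo", "bar", "bar\n", "\nbar", "baz", "baz\n", "\nbaz"];
const results = cases.map((input) => re.test(input));

console.log(source);
console.log(results);
```

The rewritten pattern gives the same results as the comments in the mixed example above: `false` for `"\nfoo"` and `"baz\n"`, and `true` for the other seven inputs.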

# History

- October 28, 2021 — Proposed for Stage 1 ([slides](https://1drv.ms/p/s!AjgWTO11Fk-TkfoSnHYFCoo4mYndTA?e=r0YIxu))

  - Outcome: Advanced to Stage 1

- December 15, 2021 — Proposed for Stage 2 ([slides](https://1drv.ms/p/s!AjgWTO11Fk-Tkfs-sKyEtV6B_S-poQ?e=U7ToKV))

  - Outcome: `\A` and `\z` advanced to Stage 2 (`\Z` did not advance, but will be reserved)

  - Stage 2 Reviewers: Richard Gibson, Waldemar Horwat

# TODO

The following is a high-level list of tasks to progress through each stage of the [TC39 proposal process](https://tc39.github.io/process-document/):

### Stage 1 Entrance Criteria

* [x] Identified a "[champion][Champion]" who will advance the addition.  

* [x] [Prose][Prose] outlining the problem or need and the general shape of a solution.  

* [x] Illustrative [examples][Examples] of usage.  

* [ ] ~~High-level [API][API].~~  

### Stage 2 Entrance Criteria

* [x] [Initial specification text][Specification].  

* [ ] ~~[Transpiler support][Transpiler] (_Optional_).~~  

### Stage 3 Entrance Criteria

* [x] [Complete specification text][Specification].  

* [ ] Designated reviewers have [signed off][Stage3ReviewerSignOff] on the current spec text.  

* [ ] The ECMAScript editor has [signed off][Stage3EditorSignOff] on the current spec text.  

### Stage 4 Entrance Criteria

* [ ] [Test262](https://github.com/tc39/test262) acceptance tests have been written for mainline usage scenarios and [merged][Test262PullRequest].  

* [ ] Two compatible implementations which pass the acceptance tests: [\[1\]][Implementation1], [\[2\]][Implementation2].  

* [ ] A [pull request][Ecma262PullRequest] has been sent to tc39/ecma262 with the integrated spec text.  

* [ ] The ECMAScript editor has signed off on the [pull request][Ecma262PullRequest].  

[Process]: https://tc39.es/process-document/

[Proposals]: https://github.com/tc39/proposals/

[Grammarkdown]: http://github.com/rbuckton/grammarkdown#readme

[Champion]: #status

[Prose]: #motivations

[Examples]: #examples

[API]: #api

[Specification]: https://tc39.es/proposal-regexp-buffer-boundaries

[Transpiler]: #todo

[Stage3ReviewerSignOff]: #todo

[Stage3EditorSignOff]: #todo

[Test262PullRequest]: #todo

[Implementation1]: #todo

[Implementation2]: #todo

[Ecma262PullRequest]: #todo