https://github.com/susji/mre

Toy regular expression library

Host: GitHub
URL: https://github.com/susji/mre
Owner: susji
License: gpl-3.0
Created: 2021-11-20T11:49:16.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2021-11-22T17:31:14.000Z (about 3 years ago)
Last Synced: 2024-06-21T00:05:19.856Z (8 months ago)
Topics: regex, regular-expression
Language: Go
Homepage:
Size: 31.3 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# MRE

`MRE` is a simple regular expression library. It provides basic capabilities
for matching and extracting string contents. We assume that our input is
UTF-8-encoded.

## Technical details

Our regular expressions support only regular languages, that is, we do not
support backreferences. All matchers are greedy, that is, there is no lazy `?`
suffix for quantifiers.
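
To make "greedy" concrete, here is a small sketch that uses Go's standard
`regexp` package rather than `MRE` itself (whose API is not shown here). The
helper name `firstMatch` is hypothetical. The standard library supports both
greedy and lazy quantifiers, so it can show the behavior `MRE` has next to the
one it leaves out:

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatch is a hypothetical helper that returns the leftmost match of
// pattern in input, using the standard library (not MRE).
func firstMatch(pattern, input string) string {
	return regexp.MustCompile(pattern).FindString(input)
}

func main() {
	// Greedy: `.*` consumes as much as possible before the final `b`.
	fmt.Println(firstMatch(`a.*b`, "aXbYb")) // aXbYb
	// Lazy: `.*?` stops at the first `b`. MRE offers no such suffix.
	fmt.Println(firstMatch(`a.*?b`, "aXbYb")) // aXb
}
```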

Subexpressions (`(..)`) imply capturing.

Ranges in set expressions are treated directly with their `uint32` codepoint
values.
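
A minimal sketch of what comparing ranges by codepoint value could look like.
The types `runeRange` and `set` and the method `matches` are hypothetical names
chosen for illustration and are not taken from the `MRE` source:

```go
package main

import "fmt"

// runeRange is a hypothetical inclusive range of codepoints.
// A single rune `x` can be stored as the range x-x.
type runeRange struct{ lo, hi uint32 }

// set is a hypothetical representation of a bracketed expression.
type set struct {
	negated bool
	ranges  []runeRange
}

// matches reports whether r belongs to the set, comparing plain uint32
// codepoint values with no locale or collation rules involved.
func (s set) matches(r rune) bool {
	cp := uint32(r)
	in := false
	for _, rr := range s.ranges {
		if rr.lo <= cp && cp <= rr.hi {
			in = true
			break
		}
	}
	// A negated set inverts the membership result.
	return in != s.negated
}

func main() {
	lower := set{ranges: []runeRange{{'a', 'z'}}}
	fmt.Println(lower.matches('m'), lower.matches('é')) // true false

	notDigit := set{negated: true, ranges: []runeRange{{'0', '9'}}}
	fmt.Println(notDigit.matches('7'), notDigit.matches('x')) // false true
}
```

Because the comparison is purely numeric, `[a-z]` in this sketch does not
include accented letters such as `é`, whose codepoint lies outside the range.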

If a `regexp` does not begin with `^`, it will be evaluated as containing an
implicit `.*?` at the very beginning. Similarly, if a `regexp` does not end with
`$`, it will be understood as containing an implicit `.*?` at the very end.
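
The rule above can be pictured as a rewrite of the pattern string. The sketch
below is only an illustration: `expandImplicit` and `endsWithAnchor` are
hypothetical helpers, and `MRE` may well apply the rule inside its matcher
instead of rewriting text. It assumes that a `$` preceded by an odd number of
backslashes is an escaped literal rather than an anchor.

```go
package main

import (
	"fmt"
	"strings"
)

// endsWithAnchor reports whether pattern ends in an unescaped `$`.
func endsWithAnchor(pattern string) bool {
	if !strings.HasSuffix(pattern, "$") {
		return false
	}
	// Count the backslashes directly before the final `$`. An odd count
	// means the `$` itself is escaped and therefore a literal.
	backslashes := 0
	for i := len(pattern) - 2; i >= 0 && pattern[i] == '\\'; i-- {
		backslashes++
	}
	return backslashes%2 == 0
}

// expandImplicit spells out the implicit `.*?` on each unanchored side.
func expandImplicit(pattern string) string {
	prefix, suffix := "", ""
	if !strings.HasPrefix(pattern, "^") {
		prefix = ".*?"
	}
	if !endsWithAnchor(pattern) {
		suffix = ".*?"
	}
	return prefix + pattern + suffix
}

func main() {
	for _, p := range []string{"abc", "^abc", "abc$", "^abc$", `abc\$`} {
		fmt.Printf("%q -> %q\n", p, expandImplicit(p))
	}
}
```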

By default, if any of the special characters are to be used for matching
literal runes outside bracketed expressions (sets), they must be escaped with
`\`. Runes within set expressions (`[..]`) are treated literally, with the
exception of rune ranges (`-`) and negations (`^`). To match these two
literally, place them accordingly within the bracketed expression.

XXX Add `]` to set matching like POSIX ERE does, i.e. for it to be matched as
a rune, it needs to be placed right after `[` or `[^`.

We want alternation (`|`) to bind very loosely and thus we use the traditional
precedence-via-nonterminal-levels approach. The grammar below only deals with
parsing, and for this reason escapes are not included, as they are treated in
the lexing phase.
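
As a rough illustration of resolving escapes during lexing, the sketch below
turns a pattern into tokens that are flagged as literal or special, so a parser
never sees a backslash. The `token` type, the `lex` function and the exact
contents of `special` are assumptions made for this example and are not taken
from the `MRE` source. For brevity it also ignores the separate rules that
apply inside set expressions.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// token is a hypothetical lexer output: a rune plus a flag telling the
// parser whether to treat it as a literal or as an operator.
type token struct {
	r       rune
	literal bool
}

// special is an assumed list of runes that act as operators when unescaped.
const special = `^$.|()[]*+?{}`

// lex resolves `\` escapes so that the parser only sees tokens.
func lex(pattern string) ([]token, error) {
	var out []token
	escaped := false
	for _, r := range pattern {
		switch {
		case escaped:
			// Whatever follows a backslash is a literal rune.
			out = append(out, token{r, true})
			escaped = false
		case r == '\\':
			escaped = true
		default:
			out = append(out, token{r, !strings.ContainsRune(special, r)})
		}
	}
	if escaped {
		return nil, errors.New("dangling escape at end of pattern")
	}
	return out, nil
}

func main() {
	toks, err := lex(`a\.b.`)
	if err != nil {
		panic(err)
	}
	for _, t := range toks {
		fmt.Printf("%q literal=%v\n", t.r, t.literal)
	}
}
```

In this sketch the escaped `\.` becomes a literal token while the trailing `.`
stays special, which is why the grammar needs no rule for escapes.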

Our grammar is roughly the following:

```ebnf
regexp  = [ "^" ], { or-expr }, [ "$" ]
or-expr = atoms, { "|", atoms }
atoms   = { atom, [ times ] }
atom    = subexpr
        | set
        | "."
        | rune
subexpr = "(", expr, ")"
set     = "[", { "^" }, { rune, [ "-", rune ] }, "]"
times   = "+"
        | "*"
        | "?"
        | "{", posnum, "}"
        | "{", posnum, ",", [ posnum ], "}"
posnum  = "0" | digit, { digit }
digit   = "0" | ... | "9"
rune    = any-unicode-codepoint
```
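
To show how one function per nonterminal level yields the loose binding of
`|`, here is a minimal recursive-descent sketch covering only `or-expr`,
`atoms`, `atom`, `subexpr` and the `*`, `+`, `?` quantifiers. It is not the
`MRE` parser: the names `parser`, `parse`, `orExpr`, `atoms` and `atom` are
hypothetical, sets, anchors and `{n,m}` are left out, error handling is
omitted, and the grammar's `expr` is assumed to mean `or-expr`. The output is
a prefix-notation string that makes the grouping visible.

```go
package main

import (
	"fmt"
	"strings"
)

// parser walks the pattern rune by rune.
type parser struct {
	in  []rune
	pos int
}

func (p *parser) peek() (rune, bool) {
	if p.pos < len(p.in) {
		return p.in[p.pos], true
	}
	return 0, false
}

// orExpr = atoms, { "|", atoms }  (lowest precedence level)
func (p *parser) orExpr() string {
	alts := []string{p.atoms()}
	for {
		r, ok := p.peek()
		if !ok || r != '|' {
			break
		}
		p.pos++ // consume "|"
		alts = append(alts, p.atoms())
	}
	if len(alts) == 1 {
		return alts[0]
	}
	return "(or " + strings.Join(alts, " ") + ")"
}

// atoms = { atom, [ times ] }  (concatenation binds tighter than "|")
func (p *parser) atoms() string {
	var seq []string
	for {
		r, ok := p.peek()
		if !ok || r == '|' || r == ')' {
			break
		}
		a := p.atom()
		// A quantifier applies only to the atom directly before it.
		if q, ok := p.peek(); ok && strings.ContainsRune("*+?", q) {
			p.pos++
			a = "(" + string(q) + " " + a + ")"
		}
		seq = append(seq, a)
	}
	switch len(seq) {
	case 0:
		return "empty"
	case 1:
		return seq[0]
	}
	return "(cat " + strings.Join(seq, " ") + ")"
}

// atom = subexpr | "." | rune  (sets omitted in this sketch)
func (p *parser) atom() string {
	r := p.in[p.pos]
	p.pos++
	if r != '(' {
		return string(r)
	}
	inner := p.orExpr()
	if c, ok := p.peek(); ok && c == ')' {
		p.pos++ // consume ")"
	}
	return "(group " + inner + ")"
}

func parse(pattern string) string {
	p := &parser{in: []rune(pattern)}
	return p.orExpr()
}

func main() {
	fmt.Println(parse("ab|c*"))   // (or (cat a b) (* c))
	fmt.Println(parse("a(b|c)+")) // (cat a (+ (group (or b c))))
}
```

Because `orExpr` calls `atoms`, which in turn calls `atom`, the input `ab|c*`
groups as "`ab` or `c*`" rather than "`a`, then `b|c`, repeated".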