https://github.com/rotemdan/regexp-composer

Easy-to-use regular expression builder, using a composable, function-oriented style. Supports all regular expression patterns accepted by the JavaScript RegExp engine.
Host: GitHub
URL: https://github.com/rotemdan/regexp-composer
Owner: rotemdan
License: mit
Created: 2024-11-27T10:01:29.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-05-13T08:31:30.000Z (about 1 year ago)
Last Synced: 2025-06-15T02:49:33.827Z (about 1 year ago)
Topics: regular-expression
Language: TypeScript
Homepage:
Size: 56.6 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README

# Regular expression composer

An easy-to-use TypeScript / JavaScript regular expression builder library designed to simplify the writing of regular expressions, in a composable, function-oriented style that's significantly more readable and less error-prone than standard regular expression syntax.

* Produces standard JavaScript regular expressions

* Supports all regular expression patterns accepted by the JavaScript engine

* Supports all JavaScript runtimes (browsers, Node.js, Deno, Bun)

* Designed to be Unicode-aware from the ground up. Unicode mode is always enabled and required

* Patterns are created using functions and can be composed and embedded in multiple regular expressions

* Automatically escapes special characters

* Automatically wraps complex patterns with non-capturing groups (`(?:pattern)`)

* Accepts codepoints as integers, in addition to hexadecimal strings (converts as needed)

* Unifies disjunctions (like `hello|world`) and character class patterns (like `[Va-zX]`) to a single `anyOf` pattern, where they can be freely mixed

* Special tokens are expressed as safer constants like `inputStart` (`^`), `inputEnd` (`$`), `anyChar` (`.`) and `lineFeed` (`\n`)

* Ensures character and codepoint ranges are valid. Will error on `charRange('z', 'a')` or `codepointRange('a4', 'a1')`

* Fast and lightweight

* Full TypeScript type checking

* No dependencies
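
The range check mentioned in the list above guards against an error that the native engine only reports when an expression is compiled. The following is a minimal sketch using only the built-in `RegExp` constructor, not regexp-composer itself; the helper name `compilesInUnicodeMode` is hypothetical:

```ts
// Hypothetical helper: returns true if the source compiles in Unicode mode,
// and false if the engine rejects it with an error.
function compilesInUnicodeMode(regExpSource: string): boolean {
	try {
		new RegExp(regExpSource, 'u')

		return true
	} catch {
		return false
	}
}

const validRangeCompiles = compilesInUnicodeMode('[a-z]') // true

// Reversed ranges are rejected by the native engine
const reversedCharRangeCompiles = compilesInUnicodeMode('[z-a]') // false
const reversedCodepointRangeCompiles = compilesInUnicodeMode(String.raw`[\u{a4}-\u{a1}]`) // false
```

Validating ranges when the pattern is built means the mistake is reported next to the call that caused it, rather than as a syntax error for the whole encoded expression.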

## Basic usage

Install package:

```sh

npm install regexp-composer

```

Build and use a simple regular expression:

```ts
import { buildRegExp, possibly, inputStart } from 'regexp-composer'

// Build regExp object
const regExp = buildRegExp([inputStart, 'Hello world.', possibly(' How are you?')])

// Use it
regExp.test('Hello world.') // returns true
regExp.test('Hello world!') // returns false
regExp.test(' Hello world.') // returns false
regExp.test('Hello world. How are you?') // returns true
```

You can also encode a pattern to a RegExp source string, without compiling it to a RegExp object, using `encodePattern`:

```ts
import { encodePattern, possibly, inputStart } from 'regexp-composer'

// Build regexp
const regExpSource = encodePattern([inputStart, 'Hello world.', possibly(' How are you?')])

console.log(regExpSource) // Prints '^Hello world\.(?: How are you\?)?'
```
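
Since the encoded source is a plain string, it can presumably be compiled later with the native `RegExp` constructor. The sketch below uses only built-in JavaScript. Passing the `u` flag is an assumption based on the library's stated Unicode-mode requirement, and the source string is copied from the output printed above:

```ts
// Source string as printed by the `encodePattern` example
const greetingSource = String.raw`^Hello world\.(?: How are you\?)?`

// Compile manually. The 'u' flag is an assumption (Unicode mode)
const greetingRegExp = new RegExp(greetingSource, 'u')

// Same inputs as in the `buildRegExp` example
const greetingResults = [
	'Hello world.',
	'Hello world!',
	' Hello world.',
	'Hello world. How are you?',
].map(text => greetingRegExp.test(text))

console.log(greetingResults) // [ true, false, false, true ]
```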

## Example patterns

Match the string `'Hello world.'`:

```ts

'Hello world.'

```

(note characters like `.` within strings are always taken as literals and will be automatically escaped if needed)

Encodes to:

```

Hello world\.

```
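
To illustrate what automatic escaping involves, here is a minimal sketch of a literal-escaping function. The name `escapeLiteral` and the exact set of escaped characters are assumptions made for illustration; this is not the library's actual implementation:

```ts
// Hypothetical helper: prefixes every RegExp syntax character with a backslash,
// so the text is matched literally.
function escapeLiteral(text: string): string {
	return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

const escapedGreeting = escapeLiteral('Hello world.') // 'Hello world\.'
const escapedPrice = escapeLiteral('$5.00 (approx.)') // '\$5\.00 \(approx\.\)'

// The escaped text only matches the exact original string
const priceRegExp = new RegExp(`^${escapedPrice}$`, 'u')

const matchesExactPrice = priceRegExp.test('$5.00 (approx.)') // true
const matchesSimilarPrice = priceRegExp.test('$5x00 (approx.)') // false
```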

Match the string `'Hello world.'`, optionally followed by `' How are you?'`:

```ts

['Hello world.', possibly(' How are you?')]

```

Encodes to:

```

Hello world\.(?: How are you\?)?

```

(note `(?: )` is a non-capturing group inserted to wrap the optional pattern)

Match a sequence of one or more English characters or digits:

```ts

oneOrMore(anyOf(charRange('a', 'z'), charRange('A', 'Z'), charRange('0', '9')))

```

Encodes to:

```

[a-zA-Z0-9]+

```

Match a phone number, like `+23 (555) 432-1234`:

```ts
// The `digit` pattern is reused several times in `phoneNumberPattern`:
const digit = charRange('0', '9')

const phoneNumberPattern = [
	possibly(['+', captureAs('countryCode', repeated([1, 3], digit)), oneOrMore(' ')]),

	possibly(['(', captureAs('areaCode', repeated(3, digit)), ')', oneOrMore(' ')]),

	captureAs('localNumber', [
		repeated(3, digit),
		possibly(anyOf('-', ' ')),
		repeated(4, digit),
	])
]
```

Encodes to:

```

(?:\+(?<countryCode>(?:[0-9]){1,3}) +)?(?:\((?<areaCode>(?:[0-9]){3})\) +)?(?<localNumber>(?:[0-9]){3}(?:(?:[- ]))?(?:[0-9]){4})

```
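
As a quick check of the phone number pattern, the sketch below compiles the encoded source with the native `RegExp` constructor and reads the named groups. The group names are written out as `(?<name>...)` using the names passed to `captureAs` above; the `u` flag is an assumption based on the library's Unicode-mode requirement:

```ts
// Encoded phone number pattern, with named groups spelled out
const phoneSource = String.raw`(?:\+(?<countryCode>(?:[0-9]){1,3}) +)?(?:\((?<areaCode>(?:[0-9]){3})\) +)?(?<localNumber>(?:[0-9]){3}(?:(?:[- ]))?(?:[0-9]){4})`

const phoneRegExp = new RegExp(phoneSource, 'u')

// Full number: all three groups are captured
const fullMatch = phoneRegExp.exec('+23 (555) 432-1234')

const countryCode = fullMatch?.groups?.countryCode // '23'
const areaCode = fullMatch?.groups?.areaCode // '555'
const localNumber = fullMatch?.groups?.localNumber // '432-1234'

// Both prefixes are optional, so a bare local number also matches
const bareMatch = phoneRegExp.exec('4321234')

const bareCountryCode = bareMatch?.groups?.countryCode // undefined
const bareLocalNumber = bareMatch?.groups?.localNumber // '4321234'
```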

# Pattern reference

## String and character literals

String and character literals are represented as simple strings, like:

```
'Hello'
'Cześć'
'こんにちは'
'X'
'嗨'
```

## Sequence of patterns

A sequence of patterns is written as an array:

```ts

[pattern1, pattern2, pattern3, ...]

```

## Optional

### `possibly(pattern)`

Accept if given pattern is matched, or skip if not.

Encodes to `pattern?` or `(?:pattern)?`.

## Choice

### `anyOf(patterns)`

Accepts the **first pattern** that is matched in the pattern list, or fails if no pattern matches.

Patterns can be either single-character (like `'x'` or `charRange('a', 'z')`) or multi-character (like `oneOrMore('Hello')`).

Encodes to `(?:pattern1|pattern2|pattern3|...)`.

For efficiency, consecutive single-character patterns are grouped when encoded. For example:

```ts

anyOf('V', 'B', 'hello', oneOrMore('bye'), 'good', charRange('a', 'z'), lineFeed, 'world')

```

Encodes to:

```

(?:[VB]|hello|(?:bye)+|good|[a-z\n]|world)

```
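
The "first pattern that is matched" rule has a practical consequence: the order of alternatives matters. The sketch below runs the encoded pattern from above through the native `RegExp` constructor (the `u` flag is an assumption; the helper name `firstMatch` is hypothetical):

```ts
const choiceRegExp = new RegExp(String.raw`(?:[VB]|hello|(?:bye)+|good|[a-z\n]|world)`, 'u')

// Hypothetical helper: returns the text of the first match, if any
function firstMatch(text: string): string | undefined {
	return choiceRegExp.exec(text)?.[0]
}

// 'hello' is listed before [a-z\n], so the whole word is matched
const helloMatch = firstMatch('hello') // 'hello'

// (?:bye)+ is greedy and consumes both repetitions
const byeMatch = firstMatch('byebye') // 'byebye'

// 'world' is listed after [a-z\n], so only the single character 'w' is matched
const worldMatch = firstMatch('world') // 'w'
```

If the longer alternative should win, list it before any single-character pattern that could match its first character.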

### `notAnyOfChars(singleCharPatterns)`

Accepts any character except characters that match the given list of **single character patterns**.

Encodes to `[^singleCharPatterns]`.

For example:

```ts

notAnyOfChars('V', 'B', charRange('a', 'z'), lineFeed, codepointRange(5234, 5312), unicodeProperty('Punctuation'))

```

Encodes to `[^VBa-z\n\u{1472}-\u{14c0}\p{Punctuation}]`.
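
A short sketch of how this encoded character class behaves, using only the native `RegExp` constructor. The surrounding `^` and `$` anchors and the `u` flag are added here for testing and are not part of the library's output:

```ts
const negatedClass = new RegExp(String.raw`^[^VBa-z\n\u{1472}-\u{14c0}\p{Punctuation}]$`, 'u')

const acceptsUpperC = negatedClass.test('C') // true: not in any excluded set
const acceptsUpperV = negatedClass.test('V') // false: listed explicitly
const acceptsLowerQ = negatedClass.test('q') // false: within a-z
const acceptsExclamation = negatedClass.test('!') // false: punctuation

// Codepoint 5250 lies within the excluded range 5234..5312
const acceptsInRange = negatedClass.test(String.fromCodePoint(5250)) // false

// The integer range bounds correspond to the hexadecimal values in the encoding
const startHex = (5234).toString(16) // '1472'
const endHex = (5312).toString(16) // '14c0'
```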

#### Negating a choice of multi-character patterns

`notAnyOfChars` only works on single character patterns. Negating a set of multi-character patterns, like `NOT('cat', 'dog', 'elephant')`, requires knowing the length, or additional criteria, for a successful positive match (otherwise, how would the RegExp engine know what to match?).

To achieve this, you can use a form of conditional matching, like `matches(pattern, { except: excludedPattern })`, described in a later section:

```ts

matches(oneOrMore(unicodeProperty('Letter')), { except: anyOf('cat', 'dog', 'elephant') })

```

This provides enough information for the RegExp engine to know which patterns to accept, and which to exclude.

## Repetition

### `zeroOrMore(pattern)`

Accepts the given pattern, repeated zero or more times.

Encodes to `pattern*` or `(?:pattern)*`.

### `zeroOrMoreNonGreedy(pattern)`

Accepts the given pattern, repeated zero or more times. Non-greedy.

Encodes to `pattern*?` or `(?:pattern)*?`.

### `oneOrMore(pattern)`

Accepts the given pattern, repeated one or more times.

Encodes to `pattern+` or `(?:pattern)+`.

### `oneOrMoreNonGreedy(pattern)`

Accepts the given pattern, repeated one or more times. Non-greedy.

Encodes to `pattern+?` or `(?:pattern)+?`.
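
The difference between the greedy and non-greedy variants is easiest to see on the encoded forms. The sketch below uses hand-written native patterns (with the `u` flag as an assumption), not output produced by the library:

```ts
const quotedText = 'say "one" and "two"'

// Greedy: `.+` consumes as much as possible, up to the last quote
const greedyQuote = new RegExp('".+"', 'u').exec(quotedText)?.[0] // '"one" and "two"'

// Non-greedy: `.+?` consumes as little as possible, up to the nearest quote
const nonGreedyQuote = new RegExp('".+?"', 'u').exec(quotedText)?.[0] // '"one"'

// For zero-or-more, the non-greedy form can match the empty string
const greedyStar = new RegExp('a*', 'u').exec('aaa')?.[0] // 'aaa'
const nonGreedyStar = new RegExp('a*?', 'u').exec('aaa')?.[0] // ''
```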

### `repeated(count, pattern)`

Accepts the given pattern, only if repeated exactly `count` times.

Encodes to `(?:pattern){count}`.

### `repeated([min, max?], pattern)`

Accepts the given pattern, repeated between `min` and `max` times.

When `max` is not given, it defaults to `Infinity`.

Encodes to `(?:pattern){min,max}`, or `(?:pattern){min,}` when `max` is not given or set to `Infinity`.

### `repeatedNonGreedy([min, max?], pattern)`

Accepts the given pattern, repeated between `min` and `max` times. Non-greedy.

When `max` is not given, it defaults to `Infinity`.

Encodes to `(?:pattern){min,max}?`, or `(?:pattern){min,}?` when `max` is not given or set to `Infinity`.
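
The non-capturing group in these encodings matters whenever the repeated pattern is longer than one character. A brief sketch with hand-written native patterns (anchors and the `u` flag are added here for testing):

```ts
// Without a group, the quantifier applies to the last character only
const withoutGroup = new RegExp('^ab{2}$', 'u')
const withGroup = new RegExp('^(?:ab){2}$', 'u')

const ungroupedMatchesAbb = withoutGroup.test('abb') // true
const groupedMatchesAbab = withGroup.test('abab') // true
const groupedMatchesAbb = withGroup.test('abb') // false

// Bounded repetition: greedy takes the maximum, non-greedy takes the minimum
const greedyRange = new RegExp('(?:ab){2,3}', 'u').exec('abababab')?.[0] // 'ababab'
const nonGreedyRange = new RegExp('(?:ab){2,3}?', 'u').exec('abababab')?.[0] // 'abab'

// Open-ended repetition, as when `max` is omitted
const openEndedMatches = new RegExp('^(?:ab){2,}$', 'u').test('ababababab') // true
```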

## Single character patterns

### `codepoint(hexCode)`

Accepts a single character with the given Unicode codepoint, provided as a hexadecimal string.

Encodes to `\u{hexCode}`.

### `codepoint(integerCode)`

Accepts a single character with the given Unicode codepoint, provided as an integer.

`integerCode` is converted to a hexadecimal string when encoded.

Encodes to `\u{hexCode}`.
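
The integer-to-hexadecimal conversion can be pictured with a small sketch. The helper `toCodepointEscape` is hypothetical and only illustrates the idea; it is not the library's implementation:

```ts
// Hypothetical helper: converts an integer codepoint to a `\u{...}` escape
function toCodepointEscape(integerCode: number): string {
	if (!Number.isInteger(integerCode) || integerCode < 0 || integerCode > 0x10ffff) {
		throw new RangeError(`Invalid codepoint: ${integerCode}`)
	}

	return `\\u{${integerCode.toString(16)}}`
}

// 128512 is U+1F600 (grinning face)
const grinningFaceEscape = toCodepointEscape(128512) // '\u{1f600}'

// `\u{...}` escapes require Unicode mode (the 'u' flag)
const grinningFaceRegExp = new RegExp(`^${grinningFaceEscape}$`, 'u')

const matchesGrinningFace = grinningFaceRegExp.test(String.fromCodePoint(128512)) // true
```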

### `charRange(startChar, endChar)`

Accepts a single character within the given character range.

Encodes to `[startChar-endChar]`.

### `codepointRange(startHexCode, endHexCode)`

Accepts a single character within the given Unicode codepoint range.

`startHexCode` and `endHexCode` should be provided as hexadecimal strings.

Encodes to `[\u{startHexCode}-\u{endHexCode}]`.

### `codepointRange(startIntegerCode, endIntegerCode)`

Accepts a single character within the given Unicode codepoint range.

`startIntegerCode` and `endIntegerCode` are converted to hexadecimal strings when encoded.

Encodes to `[\u{startHexCode}-\u{endHexCode}]`.

### `unicodeProperty(propertyName)`

Accepts a character matching the given Unicode property name.

Encodes to `\p{propertyName}`.

### `unicodeProperty(propertyName, value)`

Accepts a character matching the given Unicode property name and value.

Encodes to `\p{propertyName=value}`.

### `notUnicodeProperty(property)`

Accepts any character that doesn't match the given Unicode property.

Encodes to `\P{property}`.

### `notUnicodeProperty(property, value)`

Accepts any character that doesn't match the given Unicode property and value.

Encodes to `\P{property=value}`.
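
The four encoded forms above can be tried directly in the native engine. In this sketch the anchors and the `u` flag are added for testing; `Letter` and `Script=Greek` are standard Unicode property names chosen as examples:

```ts
const isLetter = new RegExp(String.raw`^\p{Letter}$`, 'u')
const isGreek = new RegExp(String.raw`^\p{Script=Greek}$`, 'u')
const isNotLetter = new RegExp(String.raw`^\P{Letter}$`, 'u')
const isNotGreek = new RegExp(String.raw`^\P{Script=Greek}$`, 'u')

const cjkIsLetter = isLetter.test('嗨') // true
const digitIsLetter = isLetter.test('7') // false

const alphaIsGreek = isGreek.test('α') // true
const latinAIsGreek = isGreek.test('a') // false

const digitIsNotLetter = isNotLetter.test('7') // true
const latinAIsNotGreek = isNotGreek.test('a') // true
```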

## Grouping

### `capture(pattern)`

Captures an unnamed group.

Encodes to `(pattern)`.

### `captureAs(name, pattern)`

Captures a named group.

Encodes to `(?<name>pattern)`.

## Backreferences

### `sameAs(groupIndex)`

Matches the same text that was captured by a previous unnamed capturing group.

`groupIndex` is the index of a preceding group. It must be an integer between `1` and `9`.

Encodes to `(?:\groupIndex)`.

### `sameAs(groupName)`

Matches the same text that was captured by a previous named capturing group.

`groupName` is the name of a preceding named group.

Encodes to `\k<groupName>`.
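
A brief sketch of both backreference forms in the native engine. The patterns are hand-written for illustration (with the `u` flag as an assumption) and follow the encodings `(?:\groupIndex)` and `\k<groupName>`:

```ts
// Numbered backreference: a lowercase letter followed by the same letter
const doubledLetter = new RegExp('([a-z])(?:\\1)', 'u')

const doubledInBalloon = doubledLetter.exec('balloon')?.[0] // 'll'

// Named backreference: the closing quote must be the same character as the opening quote
const quotedRegExp = new RegExp(String.raw`(?<quote>["']).*?\k<quote>`, 'u')

const quotedMatch = quotedRegExp.exec(`say "it's" now`)?.[0] // '"it's"'
const mismatchedQuotes = quotedRegExp.test(`"mismatched'`) // false
```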

### Potential issues with backreference indexes greater than 9

`groupIndex` has been limited to the range of `1..9`, because otherwise, in the case there are more than 9 groups that precede the backreference, the encoded RegExp would produce an ambiguity with a backreference followed by one or more digit literals. For example `\10` can be interpreted either as a backreference to the 10th group, or as a backreference to the 1st group, followed by the literal character `0`.

In the official specification, this ambiguity is resolved by greedily interpreting the sequence `\10` as a backreference if there are 10 or more preceding groups. However, this context-sensitive logic breaks the ability to efficiently parse the regular expression using a context-free grammar! For that reason I've decided to disallow those cases. For backreference indexes greater than 9, you can use named backreferences instead.
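
The ambiguity can be observed directly in the native engine. The sketch below uses hand-written patterns; the helper name `compilesWithUnicodeFlag` is hypothetical, and the last case reflects my understanding of how Unicode mode treats a backreference to a group that doesn't exist:

```ts
const tenGroups = '(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)'

// With ten preceding groups, `\10` refers to the 10th group ('j')
const refersToTenth = new RegExp(`^${tenGroups}\\10$`, 'u')

// Wrapping the backreference makes "group 1, then literal 0" unambiguous
const firstThenZero = new RegExp(`^${tenGroups}(?:\\1)0$`, 'u')

const tenthMatchesJ = refersToTenth.test('abcdefghijj') // true
const tenthMatchesA0 = refersToTenth.test('abcdefghija0') // false
const wrappedMatchesA0 = firstThenZero.test('abcdefghija0') // true

// Hypothetical helper: returns false if the engine rejects the source
function compilesWithUnicodeFlag(regExpSource: string): boolean {
	try {
		new RegExp(regExpSource, 'u')

		return true
	} catch {
		return false
	}
}

// With a single group, `\10` is rejected in Unicode mode
const singleGroupCompiles = compilesWithUnicodeFlag('(a)\\10') // false
```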

## Conditional matching

These patterns provide a simplified approach to express various lookahead and lookbehind patterns.

### `matches(pattern, { ifFollowedBy: followingPattern })`

Matches a pattern, with the condition that it is followed by a second pattern.

Encodes to `pattern(?=followingPattern)`.

(positive lookahead positioned after the pattern).

### `matches(pattern, { ifNotFollowedBy: followingPattern })`

Matches a pattern, with the condition that it is not followed by a second pattern.

Encodes to `pattern(?!followingPattern)`.

(negative lookahead positioned after the pattern).
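
A short sketch of both lookahead encodings in the native engine. The patterns are hand-written examples (with the `u` flag as an assumption), not output produced by the library:

```ts
// Positive lookahead: digits, only if followed by ' USD'
const amountInUsd = new RegExp('[0-9]+(?= USD)', 'u')

// The lookahead is not part of the matched text
const usdAmount = amountInUsd.exec('pay 150 USD')?.[0] // '150'
const eurAmountMatches = amountInUsd.test('pay 150 EUR') // false

// Negative lookahead: 'Java', only if not followed by 'Script'
const javaOnly = new RegExp('Java(?!Script)', 'u')

// The first 'Java' is skipped because 'Script' follows it
const javaIndex = javaOnly.exec('JavaScript and Java')?.index // 15
```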

### `matches(pattern, { ifPrecededBy: precedingPattern })`

Matches a pattern, with the condition that it is preceded by a second pattern.

Encodes to `(?<=precedingPattern)pattern`.

(positive lookbehind positioned before the pattern).
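
A short sketch of the positive lookbehind encoding in the native engine. The pattern is a hand-written example (with the `u` flag as an assumption), not output produced by the library:

```ts
// Positive lookbehind: digits, only if preceded by '$'
const amountAfterDollar = new RegExp(String.raw`(?<=\$)[0-9]+`, 'u')

const dollarMatch = amountAfterDollar.exec('cost: $42')

// The preceding '$' is checked but not included in the match
const dollarAmount = dollarMatch?.[0] // '42'
const dollarAmountIndex = dollarMatch?.index // 7

const plainAmountMatches = amountAfterDollar.test('cost: 42') // false
```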

### `matches(pattern, { ifNotPrecededBy: precedingPattern })`

Matches a pattern, with the condition that it is not preceded by a second pattern.

Encodes to `(?<!precedingPattern)pattern`.