https://github.com/rotemdan/regexp-composer
Easy-to-use regular expression builder, using a composable, function-oriented style. Supports all regular expression patterns accepted by the JavaScript RegExp engine.
- Host: GitHub
- URL: https://github.com/rotemdan/regexp-composer
- Owner: rotemdan
- License: mit
- Created: 2024-11-27T10:01:29.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-13T08:31:30.000Z (about 1 year ago)
- Last Synced: 2025-06-15T02:49:33.827Z (about 1 year ago)
- Topics: regular-expression
- Language: TypeScript
- Homepage:
- Size: 56.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
# Regular expression composer
An easy-to-use TypeScript / JavaScript regular expression builder library designed to simplify the writing of regular expressions, in a composable, function-oriented style that's significantly more readable and less error-prone than standard regular expression syntax.
* Produces standard JavaScript regular expressions
* Supports all regular expression patterns accepted by the JavaScript engine
* Supports all JavaScript runtimes (browsers, Node.js, Deno, Bun)
* Designed to be Unicode-aware from the ground up. Unicode mode is always enabled and required
* Patterns are created using functions, and can be composed and embedded in multiple regular expressions
* Automatically escapes special characters
* Automatically wraps complex patterns with non-capturing groups (`(?:pattern)`)
* Accepts codepoints as integers, in addition to hexadecimal strings (converts as needed)
* Unifies disjunctions (like `hello|world`) and character class patterns (like `[Va-zX]`) to a single `anyOf` pattern, where they can be freely mixed
* Special tokens are expressed as safer constants like `inputStart` (`^`), `inputEnd` (`$`), `anyChar` (`.`) and `lineFeed` (`\n`)
* Ensures character and codepoint ranges are valid. Will error on `charRange('z', 'a')` or `codepointRange('a4', 'a1')`
* Fast and lightweight
* Full TypeScript type checking
* No dependencies
## Basic usage
Install package:
```sh
npm install regexp-composer
```
Build and use a simple regular expression:
```ts
import { buildRegExp, possibly, inputStart } from 'regexp-composer'
// Build regExp object
const regExp = buildRegExp([inputStart, 'Hello world.', possibly(' How are you?')])
// Use it
regExp.test('Hello world.') // returns true
regExp.test('Hello world!') // returns false
regExp.test(' Hello world.') // returns false
regExp.test('Hello world. How are you?') // returns true
```
You can also encode a pattern to a RegExp source string, without compiling it to a RegExp object, using `encodePattern`:
```ts
import { encodePattern, possibly, inputStart } from 'regexp-composer'
// Build regexp
const regExpSource = encodePattern([inputStart, 'Hello world.', possibly(' How are you?')])
console.log(regExpSource) // Prints '^Hello world\.(?: How are you\?)?'
```
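As a quick check, the source string printed above can be compiled with the native `RegExp` constructor. This sketch does not import the library. It assumes the `u` flag is the Unicode mode the library requires, which the text above does not spell out.

```ts
// The source string printed by `encodePattern` in the example above.
const helloSource = String.raw`^Hello world\.(?: How are you\?)?`

// Assumption: the `u` flag stands in for the "Unicode mode" the README mentions.
const helloRegExp = new RegExp(helloSource, 'u')

console.log(helloRegExp.test('Hello world.'))              // true
console.log(helloRegExp.test('Hello world!'))              // false
console.log(helloRegExp.test(' Hello world.'))             // false
console.log(helloRegExp.test('Hello world. How are you?')) // true
```

The results agree with the `buildRegExp` example, which suggests `buildRegExp` is roughly `encodePattern` followed by compilation.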
## Example patterns
Match the string `'Hello world.'`:
```ts
'Hello world.'
```
(note characters like `.` within strings are always taken as literals and will be automatically escaped if needed)
Encodes to:
```
Hello world\.
```
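To illustrate what automatic escaping involves, here is a minimal sketch of a literal-escaping helper. `escapeLiteral` is a hypothetical name and is not part of the library's API. The library's actual escaping rules may differ.

```ts
// Hypothetical helper (not from regexp-composer): escapes the RegExp syntax
// characters in a string so that it matches literally.
function escapeLiteral(text: string): string {
	return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

console.log(escapeLiteral('Hello world.'))   // Hello world\.
console.log(escapeLiteral(' How are you?'))  //  How are you\?

// The escaped dot no longer matches arbitrary characters:
const literalDot = new RegExp('^' + escapeLiteral('a.b') + '$', 'u')
console.log(literalDot.test('a.b')) // true
console.log(literalDot.test('axb')) // false
```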
Match the string `'Hello world.'`, optionally followed by `' How are you?'`:
```ts
['Hello world.', possibly(' How are you?')]
```
Encodes to:
```
Hello world\.(?: How are you\?)?
```
(note `(?: )` is a non-capturing group inserted to wrap the optional pattern)
Match a sequence of one or more English characters or digits:
```ts
oneOrMore(anyOf(charRange('a', 'z'), charRange('A', 'Z'), charRange('0', '9')))
```
Encodes to:
```
[a-zA-Z0-9]+
```
Match a phone number, like `+23 (555) 432-1234`:
```ts
// The `digit` pattern is reused several times in `phoneNumberPattern`:
const digit = charRange('0', '9')
const phoneNumberPattern = [
possibly(['+', captureAs('countryCode', repeated([1, 3], digit)), oneOrMore(' ')]),
possibly(['(', captureAs('areaCode', repeated(3, digit)), ')', oneOrMore(' ')]),
captureAs('localNumber', [
repeated(3, digit),
possibly(anyOf('-', ' ')),
repeated(4, digit),
])
]
```
Encodes to:
```
(?:\+(?<countryCode>(?:[0-9]){1,3}) +)?(?:\((?<areaCode>(?:[0-9]){3})\) +)?(?<localNumber>(?:[0-9]){3}(?:(?:[- ]))?(?:[0-9]){4})
```
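The named groups can then be read from a match result. The sketch below uses the native `RegExp` constructor on the encoded form, with group names taken from the `captureAs` calls above. Compiling with the `u` flag is an assumption.

```ts
// Encoded phone number pattern, with the named groups from `captureAs` written out.
const phoneNumberSource = String.raw`(?:\+(?<countryCode>(?:[0-9]){1,3}) +)?(?:\((?<areaCode>(?:[0-9]){3})\) +)?(?<localNumber>(?:[0-9]){3}(?:(?:[- ]))?(?:[0-9]){4})`
const phoneNumberRegExp = new RegExp(phoneNumberSource, 'u')

const fullMatch = phoneNumberRegExp.exec('+23 (555) 432-1234')
console.log(fullMatch?.groups?.countryCode) // '23'
console.log(fullMatch?.groups?.areaCode)    // '555'
console.log(fullMatch?.groups?.localNumber) // '432-1234'

// The country code and area code are optional, so their groups may be undefined.
const localOnly = phoneNumberRegExp.exec('432 1234')
console.log(localOnly?.groups?.countryCode) // undefined
console.log(localOnly?.groups?.localNumber) // '432 1234'
```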
# Pattern reference
## String and character literals
String and character literals are represented as simple strings, like:
```
'Hello'
'Cześć'
'こんにちは'
'X'
'嗨'
```
## Sequence of patterns
A sequence of patterns is written as an array:
```ts
[pattern1, pattern2, pattern3, ...]
```
## Optional
### `possibly(pattern)`
Accept if given pattern is matched, or skip if not.
Encodes to `pattern?` or `(?:pattern)?`.
## Choice
### `anyOf(patterns)`
Accepts the **first pattern** in the list that matches, or fails if no pattern matches.
Patterns can be single-character (like `'x'` or `charRange('a', 'z')`) or multi-character (like `oneOrMore('Hello')`).
Encodes to `(?:pattern1|pattern2|pattern3|...)`.
For efficiency, consecutive single-character patterns are grouped when encoded. For example:
```ts
anyOf('V', 'B', 'hello', oneOrMore('bye'), 'good', charRange('a', 'z'), lineFeed, 'world')
```
Encodes to:
```
(?:[VB]|hello|(?:bye)+|good|[a-z\n]|world)
```
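Because the first matching alternative wins, order matters. This sketch runs the encoded form above through the native engine (the `u` flag is an assumption). It shows that `'world'` is never reached without further context, since `[a-z\n]` comes earlier and already matches `'w'`.

```ts
// Encoded form of the `anyOf` example above.
const anyOfSource = String.raw`(?:[VB]|hello|(?:bye)+|good|[a-z\n]|world)`
const anyOfRegExp = new RegExp(anyOfSource, 'u')

console.log(anyOfRegExp.exec('V')?.[0])      // 'V'
console.log(anyOfRegExp.exec('hello')?.[0])  // 'hello' ('hello' is listed before [a-z\n])
console.log(anyOfRegExp.exec('byebye')?.[0]) // 'byebye'
console.log(anyOfRegExp.exec('world')?.[0])  // 'w' ([a-z\n] is listed before 'world')
```

If longer alternatives should take priority, list them before the shorter ones that overlap with them.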
### `notAnyOfChars(singleCharPatterns)`
Accepts any character except characters that match the given list of **single character patterns**.
Encodes to `[^singleCharPatterns]`.
For example:
```ts
notAnyOfChars('V', 'B', charRange('a', 'z'), lineFeed, codepointRange(5234, 5312), unicodeProperty('Punctuation'))
```
Encodes to `[^VBa-z\n\u{1472}-\u{14c0}\p{Punctuation}]`.
#### Negating a choice of multi-character patterns
`notAnyOfChars` only works on single-character patterns. Negating a set of multi-character patterns, like `NOT('cat', 'dog', 'elephant')`, requires knowing the length, or additional criteria, for a successful positive match (otherwise, how would the RegExp engine know what to match?).
To achieve this, you can use a form of conditional matching, like `matches(pattern, { except: excludedPattern })`, described in a later section:
```ts
matches(oneOrMore(unicodeProperty('Letter')), { except: anyOf('cat', 'dog', 'elephant') })
```
This provides enough information for the RegExp engine to know which patterns to accept, and which to exclude.
## Repetition
### `zeroOrMore(pattern)`
Accepts the given pattern, repeated zero or more times.
Encodes to `pattern*` or `(?:pattern)*`.
### `zeroOrMoreNonGreedy(pattern)`
Accepts the given pattern, repeated zero or more times. Non-greedy.
Encodes to `pattern*?` or `(?:pattern)*?`.
### `oneOrMore(pattern)`
Accepts the given pattern, repeated one or more times.
Encodes to `pattern+` or `(?:pattern)+`.
### `oneOrMoreNonGreedy(pattern)`
Accepts the given pattern, repeated one or more times. Non-greedy.
Encodes to `pattern+?` or `(?:pattern)+?`.
### `repeated(count, pattern)`
Accepts the given pattern, only if repeated exactly `count` times.
Encodes to `(?:pattern){count}`.
### `repeated([min, max?], pattern)`
Accepts the given pattern, repeated between `min` and `max` times.
When `max` is not given, it defaults to `Infinity`.
Encodes to `(?:pattern){min,max}`, or `(?:pattern){min,}` when `max` is not given or set to `Infinity`.
### `repeatedNonGreedy([min, max?], pattern)`
Accepts the given pattern, repeated between `min` and `max` times. Non-greedy.
When `max` is not given, it defaults to `Infinity`.
Encodes to `(?:pattern){min,max}?`, or `(?:pattern){min,}?` when `max` is not given or set to `Infinity`.
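The difference between the greedy and non-greedy variants shows up when the encoded forms are run on the same input. The following sketch uses the native `RegExp` constructor only. The sample patterns and inputs are illustrative.

```ts
// Greedy: takes as many repetitions as allowed.
const greedyRange = new RegExp('(?:a){2,4}', 'u')
console.log(greedyRange.exec('aaaaa')?.[0]) // 'aaaa'

// Non-greedy: takes as few repetitions as allowed.
const lazyRange = new RegExp('(?:a){2,4}?', 'u')
console.log(lazyRange.exec('aaaaa')?.[0]) // 'aa'

// Open-ended range, as produced when `max` is omitted.
const openRange = new RegExp('(?:a){2,}', 'u')
console.log(openRange.exec('aaaaa')?.[0]) // 'aaaaa'

// The same contrast for `+` and `+?`:
console.log(new RegExp('<.+>', 'u').exec('<b>x</b>')?.[0])  // '<b>x</b>'
console.log(new RegExp('<.+?>', 'u').exec('<b>x</b>')?.[0]) // '<b>'
```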
## Single character patterns
### `codepoint(hexCode)`
Accepts a single character with the given Unicode codepoint, provided as a hexadecimal string.
Encodes to `\u{hexCode}`.
### `codepoint(integerCode)`
Accepts a single character with the given Unicode codepoint, provided as an integer.
`integerCode` is converted to a hexadecimal string when encoded.
Encodes to `\u{hexCode}`.
### `charRange(startChar, endChar)`
Accepts a single character within the given character range.
Encodes to `[startChar-endChar]`.
### `codepointRange(startHexCode, endHexCode)`
Accepts a single character within the given Unicode codepoint range.
`startHexCode` and `endHexCode` should be provided as hexadecimal strings.
Encodes to `[\u{startHexCode}-\u{endHexCode}]`.
### `codepointRange(startIntegerCode, endIntegerCode)`
Accepts a single character within the given Unicode codepoint range.
`startIntegerCode` and `endIntegerCode` are converted to hexadecimal strings when encoded.
Encodes to `[\u{startHexCode}-\u{endHexCode}]`.
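The integer-to-hexadecimal conversion can be sketched with `Number.prototype.toString(16)`. `toHexCode` is a hypothetical helper written for this example and is not the library's internal function. The values `5234` and `5312` are the ones used in the `notAnyOfChars` example.

```ts
// Hypothetical helper (not from regexp-composer): integer codepoint to hex string.
function toHexCode(codepoint: number): string {
	return codepoint.toString(16)
}

console.log(toHexCode(5234)) // '1472'
console.log(toHexCode(5312)) // '14c0'

// Build the range source by hand and compile it in Unicode mode,
// which is required for the \u{...} escape syntax.
const rangeSource = `[\\u{${toHexCode(5234)}}-\\u{${toHexCode(5312)}}]`
const rangeRegExp = new RegExp(rangeSource, 'u')

console.log(rangeSource)                                    // [\u{1472}-\u{14c0}]
console.log(rangeRegExp.test(String.fromCodePoint(5234)))  // true
console.log(rangeRegExp.test(String.fromCodePoint(5313)))  // false
```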
### `unicodeProperty(propertyName)`
Accepts a character matching the given Unicode property name.
Encodes to `\p{propertyName}`.
### `unicodeProperty(propertyName, value)`
Accepts a character matching the given Unicode property name and value.
Encodes to `\p{propertyName=value}`.
### `notUnicodeProperty(property)`
Accepts any character that doesn't match the given Unicode property.
Encodes to `\P{property}`.
### `notUnicodeProperty(property, value)`
Accepts any character that doesn't match the given Unicode property and value.
Encodes to `\P{property=value}`.
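For reference, this is how the encoded property escapes behave in the native engine. The property names `Letter` and `Script=Greek` are chosen for illustration. Property escapes need Unicode mode, so the `u` flag is used here.

```ts
// \p{propertyName}
const letter = new RegExp(String.raw`\p{Letter}`, 'u')
console.log(letter.test('ß')) // true
console.log(letter.test('7')) // false

// \p{propertyName=value}
const greek = new RegExp(String.raw`\p{Script=Greek}`, 'u')
console.log(greek.test('λ')) // true
console.log(greek.test('l')) // false

// \P{property} is the negated form
const notLetter = new RegExp(String.raw`\P{Letter}`, 'u')
console.log(notLetter.test('7')) // true
console.log(notLetter.test('ß')) // false
```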
## Grouping
### `capture(pattern)`
Captures an unnamed group.
Encodes to `(pattern)`.
### `captureAs(name, pattern)`
Captures a named group.
Encodes to `(?<name>pattern)`.
## Backreferences
### `sameAs(groupIndex)`
Matches the same text that was matched by a previous unnamed capturing group.
`groupIndex` is the index of a preceding group. It must be an integer between `1` and `9`.
Encodes to `(?:\groupIndex)`.
### `sameAs(groupName)`
Matches the same text that was matched by a previous named capturing group.
`groupName` is the name of a preceding named group.
Encodes to `\k<groupName>`.
### Potential issues with backreference indexes greater than 9
`groupIndex` has been limited to the range of `1..9`, because otherwise, in the case there are more than 9 groups that precede the backreference, the encoded RegExp would produce an ambiguity with a backreference followed by one or more digit literals. For example, `\10` can be interpreted either as a backreference to the 10th group, or as a backreference to the 1st group followed by the literal character `0`.
In the official specification, this ambiguity is resolved by greedily interpreting the sequence `\10` as a backreference if there are 10 or more preceding groups. However, this context-sensitive logic breaks the ability to efficiently parse the regular expression using a context-free grammar! For that reason I've decided to disallow those cases. For backreference indexes greater than 9, you can use named backreferences instead.
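A short sketch of both backreference forms, run on the native engine. It shows how wrapping the index in `(?:\1)` keeps a following digit literal separate, and how a named backreference avoids indexes altogether. The sample patterns are illustrative.

```ts
// Indexed backreference wrapped in a non-capturing group, followed by a literal '0'.
// The wrapping prevents the sequence from being read as `\10`.
const indexedBackref = new RegExp('(a|b)(?:\\1)0', 'u')
console.log(indexedBackref.test('aa0')) // true
console.log(indexedBackref.test('bb0')) // true
console.log(indexedBackref.test('ab0')) // false

// Named backreference: the closing quote must be the same character as the opening one.
const namedBackref = new RegExp('(?<quote>["\'])[^"\']*\\k<quote>', 'u')
console.log(namedBackref.test('"hi"'))  // true
console.log(namedBackref.test('"hi\'')) // false
```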
## Conditional matching
These patterns provide a simplified approach to express various lookahead and lookbehind patterns.
### `matches(pattern, { ifFollowedBy: followingPattern })`
Matches a pattern, with the condition that it is followed by a second pattern.
Encodes to `pattern(?=followingPattern)`.
(positive lookahead positioned after the pattern).
### `matches(pattern, { ifNotFollowedBy: followingPattern })`
Matches a pattern, with the condition that it is not followed by a second pattern.
Encodes to `pattern(?!followingPattern)`.
(negative lookahead positioned after the pattern).
### `matches(pattern, { ifPrecededBy: precedingPattern })`
Matches a pattern, with the condition that it is preceded by a second pattern.
Encodes to `(?<=precedingPattern)pattern`.
(positive lookbehind positioned before the pattern).
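The encoded lookahead and lookbehind forms described so far can be tried directly in the native engine. The patterns below are illustrative and do not come from the library. Note that the lookaround text is checked but never included in the match.

```ts
// pattern(?=followingPattern)
const followedBy = new RegExp('foo(?=bar)', 'u')
console.log(followedBy.exec('foobar')?.[0]) // 'foo' ('bar' is not part of the match)
console.log(followedBy.test('foobaz'))      // false

// pattern(?!followingPattern)
const notFollowedBy = new RegExp('foo(?!bar)', 'u')
console.log(notFollowedBy.test('foobar')) // false
console.log(notFollowedBy.test('foobaz')) // true

// (?<=precedingPattern)pattern
const precededBy = new RegExp(String.raw`(?<=\$)[0-9]+`, 'u')
console.log(precededBy.exec('cost: $42')?.[0]) // '42'
console.log(precededBy.test('cost: 42'))       // false
```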
### `matches(pattern, { ifNotPrecededBy: precedingPattern })`
Matches a pattern, with the condition that it is not preceded by a second pattern.
Encodes to `(?