https://github.com/rotemdan/grammar-composer

Defines and generates parsers from composable grammar definitions. Includes advanced features like lexer-free parsing, selective packrat memoization and static analysis.

context-free-grammar grammar lexer-free-parsing parser-generator parsing-expression-grammar peg

Host: GitHub
URL: https://github.com/rotemdan/grammar-composer
Owner: rotemdan
License: mit
Created: 2024-12-03T10:27:15.000Z (7 months ago)
Default Branch: main
Last Pushed: 2024-12-11T03:57:38.000Z (7 months ago)
Last Synced: 2024-12-11T04:28:57.042Z (7 months ago)
Topics: context-free-grammar, grammar, lexer-free-parsing, parser-generator, parsing-expression-grammar, peg
Language: TypeScript
Homepage:
Size: 34.2 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md

# Grammar composer

A library to define, build and efficiently parse context-free grammars.

* Grammars are defined using TypeScript class declarations

* No separate tokenization step is needed. Tokenization is defined as part of the grammar via embedded `Pattern` objects, which are processed internally by the [`regexp-composer`](https://github.com/rotemdan/regexp-composer) regular expression library

* The generated parser accepts raw characters as input, making it a lexer-free (or hybrid) parser that supports contextual tokenization: low-level character patterns can be specialized to different high-level parser contexts, and sub-patterns captured by the low-level regular expressions are embedded directly in the resulting parse tree

* Top-down parsing (roughly equivalent to PEG parsing), with optional "packrat" caching that can be enabled or disabled for individual productions

* Supports right-recursion, but will currently error when left-recursion is detected

* Uses sophisticated static analysis to automatically identify and annotate optional productions

* Provides useful parse-time error reporting, identifying the exact production involved and most likely alternatives at the failed position
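
The analysis that identifies optional productions is not described in detail here. As a rough illustration only, the textbook way to find productions that can succeed without consuming input is a fixed-point computation over the grammar. The sketch below uses a hypothetical, simplified grammar model (`Expr`, `Grammar` and `findNullable` are illustrative names, not part of grammar-composer's API), and the library's actual analysis may work differently:

```ts
// Hypothetical, simplified grammar model (not grammar-composer's own types).
type Expr =
	| { kind: 'literal', text: string }
	| { kind: 'ref', name: string }
	| { kind: 'sequence', items: Expr[] }
	| { kind: 'anyOf', options: Expr[] }
	| { kind: 'zeroOrMore', item: Expr }
	| { kind: 'possibly', item: Expr }

type Grammar = Record<string, Expr>

// Returns the names of productions that can succeed without consuming input.
function findNullable(grammar: Grammar): Set<string> {
	const nullable = new Set<string>()

	const canBeEmpty = (expr: Expr): boolean => {
		switch (expr.kind) {
			case 'literal': return expr.text.length === 0
			case 'ref': return nullable.has(expr.name)
			case 'sequence': return expr.items.every(canBeEmpty)
			case 'anyOf': return expr.options.some(canBeEmpty)
			case 'zeroOrMore': return true
			case 'possibly': return true
		}
	}

	// Repeat until a full pass discovers no new nullable production (fixed point).
	let changed = true

	while (changed) {
		changed = false

		for (const name of Object.keys(grammar)) {
			if (!nullable.has(name) && canBeEmpty(grammar[name])) {
				nullable.add(name)
				changed = true
			}
		}
	}

	return nullable
}

const toyGrammar: Grammar = {
	document: { kind: 'zeroOrMore', item: { kind: 'ref', name: 'tag' } },
	tag: {
		kind: 'sequence',
		items: [
			{ kind: 'literal', text: '<' },
			{ kind: 'ref', name: 'spaces' },
			{ kind: 'literal', text: '>' },
		]
	},
	spaces: { kind: 'possibly', item: { kind: 'literal', text: ' ' } },
	wrapper: {
		kind: 'sequence',
		items: [{ kind: 'ref', name: 'spaces' }, { kind: 'ref', name: 'document' }]
	},
}

// 'document', 'spaces' and 'wrapper' can match the empty string; 'tag' cannot.
const nullableNames = findNullable(toyGrammar)
```

The same kind of information is also what a left-recursion check typically relies on, since a production is left-recursive if it can reach itself before consuming any input.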

## Installation

```
npm install grammar-composer
```

And also the related regular expression builder package:

```
npm install regexp-composer
```

## Example: XML grammar

The grammar is defined within a container class `XmlGrammar`. It contains a mixture of higher-level, context-free productions and lower-level, regular expression productions.

* Context-free grammar productions are defined by anonymous functions `() => ...`

* Regular expression productions are defined by `pattern(...)`

In this example, context-free operators are prefixed with `G`, and regular expression operators are prefixed with `R`, to avoid confusion between similarly named operators:

```ts
import * as G from 'grammar-composer'
import * as R from 'regexp-composer'

export class XmlGrammar {
	// Starting production: any sequence of text, tags, comments and declarations
	document = () => [
		G.zeroOrMore(
			G.anyOf(
				this.textFragment,
				this.openingTag,
				this.closingTag,
				this.comment,
				this.declarationTag,
			)
		)
	]

	textFragment = G.pattern([
		R.oneOrMore(R.notAnyOfChars('<'))
	])

	openingTag = () => [
		this.openingTagStart,
		G.zeroOrMore(this.attribute),
		this.tagEnd
	]

	openingTagStart = G.pattern([
		'<',
		R.possibly('?'),
		R.captureAs('tagName',
			R.oneOrMore(R.notAnyOfChars(R.whitespace, '"', "'", '?', '!', '/', '>'))
		),
		R.zeroOrMore(R.whitespace),
	])

	tagEnd = G.pattern([
		R.zeroOrMore(R.whitespace),
		R.possibly(R.anyOf('/', '?')),
		'>'
	])

	attribute = G.pattern([
		R.zeroOrMore(R.whitespace),
		R.captureAs('attributeName',
			R.oneOrMore(R.notAnyOfChars(R.whitespace, '=', '"', "'", '?', '/', '>'))
		),
		R.zeroOrMore(R.whitespace),
		R.possibly([
			'=',
			R.zeroOrMore(R.whitespace),
			quotedString,
			R.zeroOrMore(R.whitespace),
		])
	])

	closingTag = G.pattern([
		'</',
		R.zeroOrMore(R.whitespace),
		R.captureAs('tagName',
			R.oneOrMore(R.notAnyOfChars(R.whitespace, '/', '>'))
		),
		R.zeroOrMore(R.whitespace),
		'>'
	])

	declarationTag = () => [
		this.declarationTagOpening,
		G.zeroOrMore(this.declarationTagAttribute),
		this.tagEnd
	]

	// Reconstructed: the start of this pattern was lost in extraction.
	// The '<!' prefix and the excluded character set are assumptions.
	declarationTagOpening = G.pattern([
		'<!',
		R.captureAs('tagName',
			R.oneOrMore(R.notAnyOfChars(R.whitespace, '>'))
		),
		R.zeroOrMore(R.whitespace)
	])

	declarationTagAttribute = G.pattern([
		R.zeroOrMore(R.whitespace),
		R.anyOf(
			R.captureAs('attributeName',
				R.oneOrMore(R.notAnyOfChars(R.whitespace, '"', "'", '/', '!', '?', '>'))
			),
			quotedString,
		),
		R.zeroOrMore(R.whitespace),
	])

	// Reconstructed: the body of this pattern was lost in extraction.
	// Assumed form: '<!--', then any text that contains no '--', then '-->'.
	comment = G.pattern([
		'<!--',
		R.zeroOrMore(R.anyOf(
			R.notAnyOfChars('-'),
			['-', R.notAnyOfChars('-')],
		)),
		'-->'
	])
}

// Shared by `attribute` and `declarationTagAttribute`
const quotedString = R.anyOf(
	[
		'"',
		R.captureAs('doubleQuotedStringContent',
			R.zeroOrMore(R.notAnyOfChars('"'))
		),
		'"'
	],
	[
		"'",
		R.captureAs('singleQuotedStringContent',
			R.zeroOrMore(R.notAnyOfChars("'"))
		),
		"'"
	],
)
```

Building and parsing using the XML grammar:

```ts
import { buildGrammar } from 'grammar-composer'

// `XmlGrammar` is the class defined in the previous listing.

// Sample input. Its XML tags were lost in extraction; only the text content remains.
const xmlString = `
    Adobe SVG Viewer
    Open
    Open New
    
    Zoom In
    Zoom Out
    
    Quality
    Pause
    Mute
    
    Find...
    Find Again
    Copy
`

// Build the grammar. 'document' is the starting production.
//
// Although `XmlGrammar` is defined as a class, there's no need to instantiate it,
// just pass it as it is.
const grammar = buildGrammar(XmlGrammar, 'document')

// Parse the XML string with the built grammar
const parseTree = grammar.parse(xmlString)
```

The resulting parse tree looks like:

```ts
[
    {
        "name": "document",
        "startOffset": 0,
        "endOffset": 644,
        "sourceText": "\n\n\n\n    Adobe SVG Viewer\n    Open\n    Open New\n    \n    Zoom In\n    Zoom Out\n    \n    Quality\n    Pause\n    Mute\n    \n    Find...\n    Find Again\n    Copy\n\n\n",
        "children": [
            {
                "name": "textFragment",
                "startOffset": 0,
                "endOffset": 1,
                "sourceText": "\n",
                "children": []
            },
            {
                "name": "declarationTag",
                "startOffset": 1,
                "endOffset": 19,
                "sourceText": "",
                "children": [
                    {
                        "name": "declarationTagOpening",
                        "startOffset": 1,
                        "endOffset": 11,
                        "sourceText": "",
                        "children": []
                    }
                ]
            },
            {
                "name": "textFragment",
                "startOffset": 19,
                "endOffset": 21,
                "sourceText": "\n\n",
                "children": []
            },
            {
                "name": "openingTag",
                "startOffset": 21,
                "endOffset": 27,
                "sourceText": "",
                "children": [
                    {
                        "name": "openingTagStart",
                        "startOffset": 21,
                        "endOffset": 26,
                        "sourceText": "",
                        "children": []
                    }
                ]
            },
            {
                "name": "textFragment",
                "startOffset": 27,
                "endOffset": 32,
                "sourceText": "\n    ",
                "children": []
            },
            {
                "name": "openingTag",
                "startOffset": 32,
                "endOffset": 40,
                "sourceText": "",
                "children": [
                    {
                        "name": "openingTagStart",
                        "startOffset": 32,
                        "endOffset": 39,
                        "sourceText": "",
                        "children": []
                    }
                ]
            },
...
```
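
Since each node in the printed tree carries `name`, `startOffset`, `endOffset`, `sourceText` and `children`, the tree can be walked with a small recursive helper. The sketch below assumes exactly that node shape; the `ParseTreeNode` interface and the `collectByName` function are illustrative names, not exports of grammar-composer:

```ts
// Assumed node shape, inferred from the printed parse tree above.
interface ParseTreeNode {
	name: string
	startOffset: number
	endOffset: number
	sourceText: string
	children: ParseTreeNode[]
}

// Collects all nodes with the given production name, in document order.
function collectByName(nodes: ParseTreeNode[], name: string): ParseTreeNode[] {
	const results: ParseTreeNode[] = []

	for (const node of nodes) {
		if (node.name === name) {
			results.push(node)
		}

		results.push(...collectByName(node.children, name))
	}

	return results
}

// Small hand-written tree for the input 'hi<b>yo', for illustration only.
const sampleTree: ParseTreeNode[] = [{
	name: 'document', startOffset: 0, endOffset: 7, sourceText: 'hi<b>yo',
	children: [
		{ name: 'textFragment', startOffset: 0, endOffset: 2, sourceText: 'hi', children: [] },
		{
			name: 'openingTag', startOffset: 2, endOffset: 5, sourceText: '<b>',
			children: [
				{ name: 'openingTagStart', startOffset: 2, endOffset: 4, sourceText: '<b', children: [] }
			]
		},
		{ name: 'textFragment', startOffset: 5, endOffset: 7, sourceText: 'yo', children: [] },
	]
}]

// Yields the two text fragments, 'hi' and 'yo'.
const textFragments = collectByName(sampleTree, 'textFragment')
```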

## Operators

Context-free operators are mostly named similarly to the ones in [`regexp-composer`](https://github.com/rotemdan/regexp-composer).

### `zeroOrMore(grammarElement)`

Match the grammar element zero or more times.

### `oneOrMore(grammarElement)`

Match the grammar element one or more times.

### `anyOf(grammarElement1, grammarElement2, grammarElement3, ...)`

Match any of the grammar elements. The first successful match, in order, would be accepted without trying subsequent ones.

### `bestOf(grammarElement1, grammarElement2, grammarElement3, ...)`

Match the best grammar element. All alternatives are tried, and the longest match (in terms of character count) is chosen.
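
The difference between ordered choice (`anyOf`) and longest-match choice (`bestOf`) is easiest to see when an earlier alternative is a prefix of a later one. The following self-contained sketch does not use grammar-composer; `Matcher`, `literal`, `firstOf` and `longestOf` are hypothetical stand-ins that mimic the two behaviors described above:

```ts
// A matcher returns the end offset of a successful match, or undefined on failure.
type Matcher = (text: string, offset: number) => number | undefined

const literal = (expected: string): Matcher =>
	(text, offset) => text.startsWith(expected, offset) ? offset + expected.length : undefined

// Ordered choice: accept the first alternative that matches.
const firstOf = (...options: Matcher[]): Matcher =>
	(text, offset) => {
		for (const option of options) {
			const end = option(text, offset)

			if (end !== undefined) {
				return end
			}
		}

		return undefined
	}

// Longest-match choice: try every alternative and keep the longest match.
const longestOf = (...options: Matcher[]): Matcher =>
	(text, offset) => {
		let best: number | undefined = undefined

		for (const option of options) {
			const end = option(text, offset)

			if (end !== undefined && (best === undefined || end > best)) {
				best = end
			}
		}

		return best
	}

const input = '<!-- note -->'

// Ordered choice stops at '<' (end offset 1).
const orderedEnd = firstOf(literal('<'), literal('<!--'))(input, 0)

// Longest-match choice prefers '<!--' (end offset 4).
const longestEnd = longestOf(literal('<'), literal('<!--'))(input, 0)
```

With ordered choice, more specific alternatives generally need to be listed before more general ones. Longest-match choice removes that ordering concern at the cost of evaluating every alternative.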

### `possibly(grammarElement)`

Optionally accept the grammar element, or skip if it doesn't match.

### `pattern(regexpPattern)`

Accept a regular expression pattern compatible with the `regexp-composer` `Pattern` type (either a simple string, a pattern object, or an array of pattern objects).

### `cached(grammarElement)`

Store the result of parsing using this grammar element and reuse when it's subsequently evaluated **at the same text position**.

### `uncached(grammarElement)`

Don't cache this grammar element.
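
The idea behind `cached` is classic packrat memoization: a result is stored under the text position at which the element was tried. Below is a minimal sketch, independent of grammar-composer. The `ParseFn` type and the `memoize` helper are hypothetical, and the library's internal cache may be organized differently:

```ts
// A parse function returns the end offset of a match, or undefined on failure.
type ParseFn = (text: string, offset: number) => number | undefined

// Wraps a parse function so that each start offset is evaluated at most once.
// Assumes the wrapper is only ever used with a single input text.
function memoize(parse: ParseFn): ParseFn {
	const cache = new Map<number, number | undefined>()

	return (text, offset) => {
		if (cache.has(offset)) {
			return cache.get(offset)
		}

		const result = parse(text, offset)
		cache.set(offset, result)

		return result
	}
}

let evaluationCount = 0

const isDigit = (char: string) => char >= '0' && char <= '9'

// Matches one or more digits, counting how often it actually runs.
const digits: ParseFn = (text, offset) => {
	evaluationCount++

	let end = offset

	while (end < text.length && isDigit(text.charAt(end))) {
		end++
	}

	return end > offset ? end : undefined
}

const cachedDigits = memoize(digits)

const firstResult = cachedDigits('123+45', 0)
const secondResult = cachedDigits('123+45', 0) // Same position: served from the cache
const thirdResult = cachedDigits('123+45', 4) // New position: evaluated again
```

Caching pays off mainly for productions that may be retried at the same position after backtracking. For cheap patterns, the bookkeeping can cost more than simply matching again, which is one plausible reason to make it selectable per production.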

## Future

* Allow raw regular expressions as part of the grammar

* Allow including user-provided parser functions

## License

MIT
