https://github.com/tjvr/a-parser

Host: GitHub
URL: https://github.com/tjvr/a-parser
Owner: tjvr
Created: 2018-09-21T21:33:53.000Z (over 6 years ago)
Default Branch: main
Last Pushed: 2023-04-25T20:44:08.000Z (over 1 year ago)
Last Synced: 2024-11-18T04:23:10.539Z (about 1 month ago)
Language: JavaScript
Size: 502 KB
Stars: 2
Watchers: 3
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md

        
# What is this?

This is a work-in-progress parser generator: a framework for generating parsers from a grammar definition. It takes inspiration from [Nearley](https://github.com/kach/nearley) and other projects.

A _parser_ is used to turn text (such as the source code for a programming language) into a _parse tree_. This tree-like structure captures the hierarchy of the source code, and is classically the input to a _compiler_, which traverses the tree to emit machine code. (This is a rough simplification: most production compilers use several different intermediate tree formats.)

A parser is usually used with a _tokenizer_ (or "lexer"). The tokenizer does the "dumb" job of splitting the text into "words", called "tokens"; the parser does the "smart" job of recognising sequences of such tokens. We recommend [Moo](https://github.com/no-context/moo) as a very fast and friendly tokenizer.

So, as far as this project is concerned: a Parser turns a stream of Tokens into a tree of Nodes.

To construct a Parser, you create a Grammar, and then pick a _parsing algorithm_ to use.

# Parse trees

The `grammar/node` file contains types for representing a parse tree. These are loosely modelled on the Node objects in the [ESTree spec](https://github.com/estree/estree/blob/master/es5.md#node-objects), which is based on a production JavaScript parser.

```js
interface Node {
    type: string;
    region: Region | null;
    ...attrs
}
```

Parse trees are made up of Nodes. Grammars include annotations describing how to build a parse tree from them, which lets you conveniently omit syntactic information (such as whitespace, or operator tokens) from your parse tree, without writing any grammar rule post-processors or tree traversal code.

A `Node` includes a `Region`, which describes the location in the source file where the node was found. This is very useful for generating descriptive semantic error messages from your compiler: for example, you might like to highlight the region in which a type error was found.

```js
interface Region {
    start: Pos;
    end: Pos;
    buffer: string;
}
```

A `Region` consists of a start position (just before the first character that was matched) and an end position (just after the last character that was matched). It also includes a pointer to the entire source text.

A `Pos` is a simple pair of line and column numbers. Like Moo, these start from one. The character offset from the beginning of the source is also included.

```js
interface Pos {
    line: number; // >= 1
    col: number; // >= 1
    offset: number;
}
```
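
As an illustration of how a `Region` might be used for error reporting, here is a minimal sketch. The `formatRegion` helper is hypothetical and not part of this project. It assumes the region starts and ends on the same line, and it relies only on the `start`, `end` and `buffer` fields described above.

```js
// Hypothetical helper (not part of this project): underline a region in its source line.
// Assumes the region starts and ends on the same line.
function formatRegion(region, message) {
  const { start, end, buffer } = region
  // Lines and columns are 1-based, so subtract one before indexing.
  const line = buffer.split("\n")[start.line - 1]
  const width = Math.max(1, end.col - start.col)
  const underline = " ".repeat(start.col - 1) + "^".repeat(width)
  return `${message} at line ${start.line} col ${start.col}:\n${line}\n${underline}`
}

const source = "let x = 1\nlet y = x + true"
const region = {
  start: { line: 2, col: 13, offset: 22 }, // just before "true"
  end: { line: 2, col: 17, offset: 26 }, // just after "true"
  buffer: source,
}
console.log(formatRegion(region, "type error"))
```

Because `buffer` points at the whole source text, the helper can recover the offending line without any extra bookkeeping by the caller.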

A `Node` additionally includes attributes, based on the annotations in the grammar definition. Attribute values come in three kinds:

- another `Node`

- the value of a token

- an Array (a list of Nodes, or token values)
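
To show how the three kinds of attribute value affect code that consumes a parse tree, here is a minimal sketch of a depth-first traversal. The `walk` and `isNode` helpers are hypothetical, and the sketch assumes that token values are primitives (such as strings or numbers) rather than objects.

```js
// Hypothetical helpers (not part of this project): visit every Node, depth first.
// Assumes token values are primitives, so any object with a string `type` is a Node.
function isNode(value) {
  return value !== null && typeof value === "object" && typeof value.type === "string"
}

function walk(node, visit) {
  visit(node)
  for (const [key, value] of Object.entries(node)) {
    if (key === "type" || key === "region") continue
    if (isNode(value)) {
      walk(value, visit) // another Node
    } else if (Array.isArray(value)) {
      // a list of Nodes or token values
      for (const item of value) if (isNode(item)) walk(item, visit)
    }
    // otherwise: a token value (or null), nothing to traverse
  }
}

const tree = {
  type: "List",
  region: null,
  items: [
    { type: "Atom", region: null, name: "add" },
    { type: "Literal", region: null, value: 1 },
  ],
}
const seen = []
walk(tree, (node) => seen.push(node.type))
console.log(seen) // [ 'List', 'Atom', 'Literal' ]
```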

# Grammar definitions

Grammars are defined using a custom syntax, from which a parser can be generated.

(You might wonder how this syntax is defined itself: in fact, it has its own grammar, and can parse itself! We call this "bootstrapping".)

The `grammar` module contains code to interpret this syntax.

TODO explain atoms: what does `"if"` match?

## Post-processing

By default, rules will match input, but won't return anything. Annotations are used to describe how to construct a parse tree from what was matched. Rules without annotations always produce the value `null`.

    

    if_stmt -> "if" expr "then" block
    // null

Rules can be annotated with a **node type**. Any number of children may then be annotated with an **attribute name**. For example, this rule will produce a node of type `IfStatement`, with attributes `cond` and `body`.

    if_stmt IfStatement -> "if" cond:expr "then" body:block
    // {type: "IfStatement", cond: ..., body: ...}

You cannot annotate children if the rule is not annotated first. Otherwise, there is no node to attach the attributes to.

    if_stmt -> "if" cond:expr "then" body:block
    // Not allowed
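
The rules above can be sketched as a small function that builds the value for one matched rule. This is only an illustration: the `buildNode` function and the `{ type, names }` rule representation are hypothetical and are not this project's API. The sketch ignores `region` and the root and list annotations.

```js
// Hypothetical sketch (not this project's API): build the value for one matched rule.
// `rule.type` is the node type annotation, or null if the rule is not annotated.
// `rule.names[i]` is the attribute name on the i-th child, or null.
function buildNode(rule, values) {
  const annotated = rule.names.some((name) => name !== null)
  if (rule.type === null) {
    if (annotated) throw new Error("cannot annotate children of a rule without a node type")
    return null // rules without annotations produce null
  }
  const node = { type: rule.type }
  rule.names.forEach((name, i) => {
    if (name !== null) node[name] = values[i] // unnamed children are dropped
  })
  return node
}

// if_stmt IfStatement -> "if" cond:expr "then" body:block
const ifStmt = { type: "IfStatement", names: [null, "cond", null, "body"] }
const node = buildNode(ifStmt, ["if", { type: "Atom", name: "x" }, "then", { type: "Block" }])
console.log(node.type, Object.keys(node)) // IfStatement [ 'type', 'cond', 'body' ]
```

Note how the `"if"` and `"then"` tokens never reach the node: only annotated children become attributes.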

### Root Annotation

It's often useful to pass through a node unchanged, without wrapping it in another object. For example, you might have a `number` rule that can match a float literal. The **root annotation** (which can be thought of as an empty attribute name) passes the annotated child through unchanged.

    float_literal Literal -> value:"float"
    // {type: "Literal", value: 3.14}

    number -> float_literal
    // null

    number -> :float_literal
    // {type: "Literal", value: 3.14}

Parentheses are another good example. You must define a _syntax_ rule for brackets -- otherwise they couldn't be parsed! -- but usually you don't want them to appear in the final parse tree.

    x -> "(" :x ")"

The root annotation can only be applied to a **single** child of a rule.

    foo -> :bar :quxx
    // Not allowed

The root annotation cannot be combined with other annotations.

    foo Object -> :bar
    // Not allowed

    num Object -> :expr "+" other:expr
    // Not allowed
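
The two restrictions on the root annotation can be sketched as a validation step. This is a hypothetical illustration, not this project's implementation: it models the root annotation as the empty attribute name `""`, and the `applyRoot` function and `{ type, names }` rule representation are assumptions. List rules are not covered here.

```js
// Hypothetical sketch (not this project's API): validate and apply the root annotation.
// The root annotation is modelled as the empty attribute name "".
function applyRoot(rule, values) {
  const roots = rule.names.filter((name) => name === "")
  const others = rule.names.filter((name) => name !== null && name !== "")
  if (roots.length === 0) return undefined // no root annotation: not handled here
  if (roots.length > 1) {
    throw new Error("root annotation can only be applied to a single child")
  }
  if (rule.type !== null || others.length > 0) {
    throw new Error("root annotation cannot be combined with other annotations")
  }
  return values[rule.names.indexOf("")] // pass the child through unchanged
}

// x -> "(" :x ")"
const parens = { type: null, names: [null, "", null] }
const inner = { type: "Literal", value: 3.14 }
console.log(applyRoot(parens, ["(", inner, ")"]) === inner) // true
```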

### List Annotations

It's often useful to parse lists (e.g. statements in a program; comma-separated arguments to a function call).

You could define lists yourself, by inventing a node type for linked lists.

    // A program must have at least one statement.
    program StatementList -> head:statement

    // A program is a statement followed by the rest of the program.
    program StatementList -> head:statement tail:program

Each node in the linked list will have the type `StatementList`.

The rule above is right-recursive; we prefer **left-recursive** rules. In addition, there is a special built-in list type `[]`, so that you don't have to work with linked lists yourself.

The last item in the list is annotated as the root attribute. The special list attribute `[]` is used for the rest of the list.

    // A program must have at least one statement.
    program [] -> :statement

    // A program ends with a statement.
    program [] -> []:program :statement

The rule must be annotated with the special list type `[]`. One of the children may then be annotated as the root attribute; another child may then additionally be annotated with the list attribute.

You cannot have a list just contain itself; and as above, the root annotation can only be applied to a **single** child of a rule.

    program [] -> []:program
    // Not allowed

    program [] -> []:program :statement :statement
    // Not allowed

Apart from these restrictions, you can use these annotations anywhere in the rule.

    // Body is one or more statements separated by semicolons
    body [] -> :statement
    body [] -> []:body ";" :statement

    // Arguments are zero or more expressions separated by commas
    args [] ->
    args [] -> []:args "," :expr
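
To make the list annotations more concrete, here is a minimal sketch of how a left-recursive list rule could produce a flat array. The `buildList` function is hypothetical and only mirrors the behaviour described above; it is not how this project necessarily implements lists.

```js
// Hypothetical sketch (not this project's API): the value built by a list rule.
// `list` is the value of the `[]:`-annotated child (absent in the base case).
// `item` is the value of the root-annotated child (absent for an empty list).
function buildList(list, item) {
  const result = list === undefined ? [] : list.slice()
  if (item !== undefined) result.push(item)
  return result
}

// Parsing "a ; b ; c" matches the base rule once, then the recursive rule twice.
let body = buildList(undefined, "a") // body [] -> :statement
body = buildList(body, "b") // body [] -> []:body ";" :statement
body = buildList(body, "c") // body [] -> []:body ";" :statement
console.log(body) // [ 'a', 'b', 'c' ]

// args [] ->   (an empty list rule gives an empty array)
console.log(buildList(undefined, undefined)) // []
```

Because the rule is left-recursive, items are appended in source order, and the separator token never appears in the result.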

## EBNF Operators

Three regex-like operators for optional tokens and repetition are provided:

### Option `?`

`val?` matches zero or one occurrences of `val`. It expands to the generated rule `val?`:

    val? -> :val
    val? ->

In the expression `key:val?`, if `val` is not present, then `key` will be `null`.

### One or many `+`

`val+` matches one or many occurrences of `val`. It expands to the generated rule `val+`:

    val+ [] -> []:val+ :val
    val+ [] -> :val

In the expression `key:val+`, `key` will always contain a non-empty array.

### Zero or many `*`

`val*` matches zero or many occurrences of `val`. It expands to the generated rule `val*`:

    val* [] -> []:val* :val
    val* [] ->

In the expression `key:val*`, `key` will always contain an array, but it may be empty.
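
The three expansions can be summarised in a short sketch that emits the generated rules as grammar source lines. The `expandEBNF` function is hypothetical; it simply reproduces the expansions listed above and is not taken from this project's code.

```js
// Hypothetical sketch (not this project's API): the rules generated for `name` + `op`,
// written in the grammar notation used in this document.
function expandEBNF(name, op) {
  const generated = name + op
  switch (op) {
    case "?":
      return [`${generated} -> :${name}`, `${generated} ->`]
    case "+":
      return [`${generated} [] -> []:${generated} :${name}`, `${generated} [] -> :${name}`]
    case "*":
      return [`${generated} [] -> []:${generated} :${name}`, `${generated} [] ->`]
    default:
      throw new Error("unknown operator: " + op)
  }
}

console.log(expandEBNF("val", "*").join("\n"))
// val* [] -> []:val* :val
// val* [] ->
```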

# Parsing

Several parsing algorithms may eventually be implemented. Initially, there is a proof-of-concept parser that uses Nearley's implementation of the Earley algorithm.

This project doesn't have a name yet, so the example below calls it `foo`. Using a parser will look something like this:

```js
const moo = require('moo')
const foo = require('foo') // import the core library
const compileNearley = require('foo/compile-nearley') // import the Earley algorithm

// Get our tokenizer ready
const lexer = moo.compile(...)

// Define our Grammar
const myGrammar = foo.grammar(`
  program -> :expr+
  expr Quote -> "'" list:List
  expr List -> "(" items:expr+ ")"
  expr Atom -> name:"identifier"
  expr Literal -> value:"number"
  expr Literal -> value:"string"
`)

// Construct a Parser from our Grammar, using the Earley algorithm
const parser = compileNearley(myGrammar)

// Begin a new Parse
parser.reset()

// Tokenize the source code (`source` is the text to parse)
lexer.reset(source)
for (const tok of lexer) {
  // Feed each token to the parser
  try {
    parser.eat(tok)
  } catch (err) {
    // Moo's formatError takes the offending token, then the message
    throw new Error(lexer.formatError(tok, "syntax error: " + err.message))
  }
}

// Get the final parse tree
let parseTree
try {
  parseTree = parser.result()
} catch (err) {
  // With no token, Moo reports the error at the current (end of input) position
  throw new Error(lexer.formatError(undefined, "unexpected EOF: " + err.message))
}
```