https://github.com/tjvr/a-parser
- Host: GitHub
- URL: https://github.com/tjvr/a-parser
- Owner: tjvr
- Created: 2018-09-21T21:33:53.000Z (over 6 years ago)
- Default Branch: main
- Last Pushed: 2023-04-25T20:44:08.000Z (over 1 year ago)
- Last Synced: 2024-11-18T04:23:10.539Z (about 1 month ago)
- Language: JavaScript
- Size: 502 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# What is this?
This is a work-in-progress parser generator: a framework for generating parsers from a grammar definition. It takes inspiration from [Nearley](https://github.com/kach/nearley) and other projects.
A _parser_ is used to turn text (such as the source code for a programming language) into a _parse tree_. This tree-like structure captures the hierarchy of the source code, and is classically the input to a _compiler_, which traverses the tree to emit machine code. (This is a rough simplification: most production compilers use several different intermediate tree formats.)
A parser is usually used with a _tokenizer_ (or "lexer"). The tokenizer does the "dumb" job of splitting the text into "words", called "tokens"; the parser does the "smart" job of recognising sequences of such tokens. We recommend [Moo](https://github.com/no-context/moo) as a very fast and friendly tokenizer.
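For example, a minimal Moo tokenizer might look like this (a sketch; the token names and patterns here are arbitrary):

```js
const moo = require('moo')

// Illustrative rules only: name each token type and give it a pattern.
const lexer = moo.compile({
  ws:     /[ \t]+/,
  number: /[0-9]+/,
  word:   /[A-Za-z]+/,
  lparen: '(',
  rparen: ')',
})

lexer.reset('(add 1 2)')
for (const tok of lexer) {
  // Each token carries a type, a value, and its source position.
  console.log(tok.type, JSON.stringify(tok.value), `${tok.line}:${tok.col}`)
}
```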
So, as far as this project is concerned: a Parser turns a stream of Tokens into a tree of Nodes.
To construct a Parser, you create a Grammar, and then pick a _parsing algorithm_ to use.
# Parse trees
The `grammar/node` file contains types for representing a parse tree. These are loosely modelled on the Node objects in the [ESTree spec](https://github.com/estree/estree/blob/master/es5.md#node-objects), which is based on a production JavaScript parser.
```js
interface Node {
  type: string;
  region: Region | null;
  ...attrs
}
```

Parse trees are made up of Nodes. Grammars include annotations describing how to build a parse tree from the matched rules, which lets you conveniently omit syntactic information (such as whitespace, or operator tokens) from your parse tree, without writing any grammar rule post-processors or tree traversal code.
A `Node` includes a `Region`, which describes the location in the source file where the node was found. This is very useful for generating descriptive semantic error messages from your compiler: for example, you might like to highlight the region in which a type error was found.
```js
interface Region {
  start: Pos,
  end: Pos,
  buffer: String,
}
```

A `Region` consists of a start position (just before the first character that was matched) and an end position (just after the last character that was matched). It also includes a pointer to the entire source text.
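As a sketch of the kind of error message this enables (the `reportError` helper below is hypothetical, not part of this library; it uses the `Pos` fields described next):

```js
// Hypothetical helper: underline the source text covered by a node's Region.
function reportError(node, message) {
  const { start, end, buffer } = node.region
  const line = buffer.split('\n')[start.line - 1]
  const marker = ' '.repeat(start.col - 1) + '^'.repeat(Math.max(1, end.offset - start.offset))
  return `${message} at line ${start.line}, col ${start.col}:\n${line}\n${marker}`
}
```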
A `Pos` holds a line and column number (like Moo's, these start from one), plus the character offset from the beginning of the source.
```js
interface Pos {
  line: Number, // >= 1
  col: Number, // >= 1
  offset: Number,
}
```

A `Node` additionally includes attributes, based on the annotations in the grammar. Attribute values come in three kinds:
- another `Node`
- the value of a token
- an Array (a list of Nodes, or token values)
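For instance, a node produced by the `IfStatement` rule shown later might combine all three kinds (a sketch; the child values are abbreviated):

```js
// A sketch of a Node whose attributes illustrate all three kinds of value.
const node = {
  type: 'IfStatement',
  region: null,
  cond: { type: 'Literal', region: null, value: 3.14 }, // another Node; its `value` attribute holds a token value
  body: [], // an Array (of statement Nodes)
}
```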
# Grammar definitions
Grammars are defined using a custom syntax, from which a parser can be generated.
(You might wonder how this syntax is defined itself: in fact, it has its own grammar, and can parse itself! We call this "bootstrapping".)
The `grammar` module contains code to interpret this syntax.
TODO explain atoms: what does `"if"` match?
## Post-processing
By default, rules will match input, but won't return anything. Annotations are used to describe how to construct a parse tree from what was matched. Rules without annotations always produce the value `null`.
```
if_stmt -> "if" expr "then" block
// null
```

Rules can be annotated with a **node type**. Any number of children may then be annotated with an **attribute name**. For example, this rule will produce a node of type `IfStatement`, with attributes `cond` and `body`.
```
if_stmt IfStatement -> "if" cond:expr "then" body:block
// {type: "IfStatement", cond: ..., body: ...}
```

You cannot annotate children unless the rule itself is annotated; otherwise, there would be no node to attach the attributes to.
```
if_stmt -> "if" cond:expr "then" body:block
// Not allowed
```

### Root Annotation
It's often useful to pass a node through unchanged, without wrapping it in another object. For example, you might have a `number` rule that can match a `float`. The **root annotation** (which can be thought of as an empty attribute name) does exactly this: it passes the child through unchanged.
```
float_literal Literal -> value:"float"
// {type: "Literal", value: 3.14}

number -> float_literal
// null

number -> :float_literal
// {type: "Literal", value: 3.14}
```

Parentheses are another good example. You must define a _syntax_ rule for brackets -- otherwise they couldn't be parsed! -- but usually you don't want them to appear in the final parse tree.
```
x -> "(" :x ")"
```
The root annotation can only be applied to a **single** child of a rule.
```
foo -> :bar :quxx
// Not allowed
```

The root annotation cannot be combined with other annotations.
```
foo Object -> :bar
// Not allowed

num Object -> :expr "+" other:expr
// Not allowed
```

### List Annotations
It's often useful to parse lists (e.g. statements in a program; comma-separated arguments to a function call).
You could define lists yourself, by inventing a node type for linked lists.
```
// A program must have at least one statement.
program StatementList -> head:statement

// A program is a statement followed by the rest of the program.
program StatementList -> head:statement tail:program
```

Each node in the linked list will have the type `StatementList`.
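For a three-statement program, those rules produce a nested structure along these lines (a sketch; the strings stand in for real statement Nodes):

```js
// The linked-list shape built by the StatementList rules above.
const tree = {
  type: 'StatementList',
  head: 'stmt1',
  tail: {
    type: 'StatementList',
    head: 'stmt2',
    tail: { type: 'StatementList', head: 'stmt3' },
  },
}
```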
The rule above is right-recursive; we prefer **left-recursive** rules (which the Earley algorithm underlying the initial parser handles more efficiently). In addition, there is a special built-in list type `[]`, so you don't have to work with linked lists yourself.
The last item in the list is annotated as the root attribute. The special list attribute `[]` is used for the rest of the list.
```
// A program must have at least one statement.
program [] -> :statement

// A program ends with a statement.
program [] -> []:program :statement
```

The rule must be annotated with the special list type `[]`. One of the children may then be annotated as the root attribute; another child may additionally be annotated with the list attribute.
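Parsed with these rules, a three-statement program yields a flat Array rather than a nested linked list (a sketch, with placeholder values):

```js
// The [] rules above build a plain Array of statement Nodes.
const result = ['stmt1', 'stmt2', 'stmt3']
```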
You cannot have a list just contain itself; and as above, the root annotation can only be applied to a **single** child of a rule.
```
program [] -> []:program
// Not allowed

program [] -> []:program :statement :statement
// Not allowed
```

Apart from these restrictions, you can use these annotations anywhere in the rule.
```
// Body is one or more lines separated by semicolons
body [] -> :statement
body [] -> []:body ";" :statement

// Arguments are zero or more expressions separated by commas
args [] ->
args [] -> []:args "," :expr
```

## EBNF Operators
Three regex-like operators for optional tokens and repetition are provided:
### Option `?`
`val?` matches zero or one occurrences of `val`. It expands to the generated rule `val?`:

```
val? -> :val
val? ->
```

In the expression `key:val?`, if `val` is not present, then `key` will be `null`.
### One or many `+`
`val+` matches one or many occurrences of `val`. It expands to the generated rule `val+`:

```
val+ [] -> []:val+ :val
val+ [] -> :val
```

In the expression `key:val+`, `key` will always contain a non-empty array.
### Zero or many `*`
`val*` matches zero or many occurrences of `val`. It expands to the generated rule `val*`:

```
val* [] -> []:val* :val
val* [] ->
```

In the expression `key:val*`, `key` will always contain an array, but it may be empty.
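For example, a bracketed list rule can use `*` directly (a sketch; the rule and type names are illustrative):

```
list List -> "[" items:expr* "]"
// {type: "List", items: [...]} (items is always an Array, possibly empty)
```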
# Parsing
Various parsing algorithms may eventually be implemented. Initially there is a proof-of-concept parser which uses Nearley's implementation of the Earley algorithm.
This project doesn't have a name yet, so the example below calls it `foo`. Using a parser will look something like this:
```js
const moo = require('moo')
const foo = require('foo') // import the core library
const compileNearley = require('foo/compile-nearley') // import the Earley algorithm

// Get our tokenizer ready
const lexer = moo.compile(...) // token rules omitted here

// Define our Grammar
const myGrammar = foo.grammar(`
program -> :expr+
expr Quote -> "'" list:expr
expr List -> "(" items:expr+ ")"
expr Atom -> name:"identifier"
expr Literal -> value:"number"
expr Literal -> value:"string"
`)

// Construct a Parser from our Grammar, using the Earley algorithm
const parser = compileNearley(myGrammar)

// Begin a new Parse
parser.reset()

// Tokenize the source code
lexer.reset(source)
for (let tok of lexer) {
  // Feed each token to the parser
  try {
    parser.eat(tok)
  } catch (err) {
    throw new Error(lexer.formatError(tok, "syntax error: " + err.message))
  }
}

// Get the final parse tree
let parseTree
try {
  parseTree = parser.result()
} catch (err) {
  throw new Error(lexer.formatError(null, "unexpected EOF: " + err.message))
}
```
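The resulting `parseTree` is a plain tree of `Node` objects, so ordinary traversal code works on it. For example (a sketch, assuming the `Node` and `Region` shapes described above):

```js
// Print each node's type and source location, recursing into its attributes.
function walk(node, depth = 0) {
  if (Array.isArray(node)) {
    for (const item of node) walk(item, depth)
    return
  }
  if (node === null || typeof node !== 'object') return // token values are leaves
  const loc = node.region ? ` @ ${node.region.start.line}:${node.region.start.col}` : ''
  console.log('  '.repeat(depth) + node.type + loc)
  for (const key of Object.keys(node)) {
    if (key !== 'type' && key !== 'region') walk(node[key], depth + 1)
  }
}

walk(parseTree)
```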