{"id":13599481,"url":"https://github.com/DmitrySoshnikov/syntax","last_synced_at":"2025-04-10T12:32:53.949Z","repository":{"id":39838612,"uuid":"46961430","full_name":"DmitrySoshnikov/syntax","owner":"DmitrySoshnikov","description":"Syntactic analysis toolkit, language-agnostic parser generator.","archived":false,"fork":false,"pushed_at":"2024-01-06T11:55:32.000Z","size":1368,"stargazers_count":588,"open_issues_count":48,"forks_count":83,"subscribers_count":15,"default_branch":"master","last_synced_at":"2024-04-14T05:59:34.289Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DmitrySoshnikov.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-11-27T05:30:04.000Z","updated_at":"2024-05-30T07:12:25.088Z","dependencies_parsed_at":"2023-01-24T03:31:09.913Z","dependency_job_id":"965f116a-bf7f-44b0-b04f-04c479461346","html_url":"https://github.com/DmitrySoshnikov/syntax","commit_stats":null,"previous_names":[],"tags_count":127,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DmitrySoshnikov%2Fsyntax","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DmitrySoshnikov%2Fsyntax/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DmitrySoshnikov%2Fsyntax/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DmitrySoshnikov%2Fsyntax/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owner
s/DmitrySoshnikov","download_url":"https://codeload.github.com/DmitrySoshnikov/syntax/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247761052,"owners_count":20991532,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T17:01:04.950Z","updated_at":"2025-04-10T12:32:53.904Z","avatar_url":"https://github.com/DmitrySoshnikov.png","language":"JavaScript","funding_links":[],"categories":["JavaScript"],"sub_categories":[],"readme":"# syntax\n\n[![Build Status](https://travis-ci.org/DmitrySoshnikov/syntax.svg?branch=master)](https://travis-ci.org/DmitrySoshnikov/syntax) [![npm version](https://badge.fury.io/js/syntax-cli.svg)](https://badge.fury.io/js/syntax-cli)\n\nSyntactic analysis toolkit, language-agnostic parser generator.\n\nImplements [LR](https://en.wikipedia.org/wiki/LR_parser) and [LL](https://en.wikipedia.org/wiki/LL_parser) parsing algorithms.\n\nYou can get an introductory overview of the tool in [this article](https://medium.com/@DmitrySoshnikov/syntax-language-agnostic-parser-generator-bd24468d7cfc).\n\n### Table of Contents\n\n- [Installation](#installation)\n- [Development](#development)\n- [CLI usage example](#cli-usage-example)\n- [Parser generation](#parser-generation)\n- [Language agnostic parser generator](#language-agnostic-parser-generator)\n  - [JavaScript default](#javascript-default)\n  - [Python plugin](#python-plugin)\n  - [PHP plugin](#php-plugin)\n  - [Ruby plugin](#ruby-plugin)\n  - [C++ plugin](#c-plugin)\n  - [C# plugin](#c-plugin-1)\n  - [Rust 
plugin](#rust-plugin)\n  - [Java plugin](#java-plugin)\n  - [Julia plugin](#julia-plugin)\n- [Grammar format](#grammar-format)\n  - [JSON-like notation](#json-like-notation)\n  - [Yacc/Bison notation](#yaccbison-notation)\n  - [Grammar properties](#grammar-properties)\n- [Lexical grammar and tokenizer](#lexical-grammar-and-tokenizer)\n  - [Getting list of tokens](#getting-list-of-tokens)\n  - [Using custom tokenizer](#using-custom-tokenizer)\n  - [Start conditions of lex rules, and tokenizer states](#start-conditions-of-lex-rules-and-tokenizer-states)\n  - [Access tokenizer from parser semantic actions](#access-tokenizer-from-parser-semantic-actions)\n  - [Case-insensitive match](#case-insensitive-match)\n- [Working with precedence and associativity](#working-with-precedence-and-associativity)\n  - [Associative precedence](#associative-precedence)\n  - [Non-associative precedence](#non-associative-precedence)\n- [Handler arguments notation](#handler-arguments-notation)\n  - [Positioned notation](#positioned-notation)\n  - [Named notation](#named-notation)\n- [Capturing location objects](#capturing-location-objects)\n- [Parsing modes](#parsing-modes)\n  - [LL parsing](#ll-parsing)\n  - [LR parsing](#lr-parsing)\n  - [LR conflicts](#lr-conflicts)\n  - [Conflicts resolution](#conflicts-resolution)\n- [Validating grammar](#validating-grammar)\n- [Module include, and parser events](#module-include-and-parser-events)\n- [Debug mode](#debug-mode)\n\n\n### Installation\n\nThe tool can be installed as an [npm module](https://www.npmjs.com/package/syntax-cli) (notice, it's called `syntax-cli` there):\n\n```\nnpm install -g syntax-cli\n\nsyntax-cli --help\n```\n\n### Development\n\n1. Fork the https://github.com/DmitrySoshnikov/syntax repo\n2. Make your changes\n3. Make sure `npm test` passes (add new tests if needed)\n4. 
Submit a PR\n\n\u003e NOTE: If you need to implement a Syntax plugin for a new target programming language, address [this instruction](https://github.com/DmitrySoshnikov/syntax/blob/master/src/plugins/README.md).\n\nFor development from the github repository, run `build` command to transpile ES6 code:\n\n```\ngit clone https://github.com/\u003cyour-github-account\u003e/syntax.git\ncd syntax\nnpm install\nnpm run build\n\n./bin/syntax --help\n```\n\nOr for faster development cycle, one can also use `watch` command (notice though, it doesn't copy template files, but only transpiles ES6 code; for templates copying you have to use `build` command):\n\n```\nnpm run watch\n```\n\n### CLI usage example\n\n```\n./bin/syntax --grammar examples/grammar.lr0 --parse \"aabb\" --mode lr0 --table --collection\n```\n\n### Parser generation\n\nTo generate a parser module, specify the `--output` option, which is a path to the output parser file. Once generated, the module can normally be required, and used for parsing strings based on a given grammar.\n\nExample for the [JSON grammar](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/json.ast.js):\n\n```\n./bin/syntax --grammar examples/json.ast.js --mode SLR1 --output json-parser.js\n\n✓ Successfully generated: json-parser.js\n```\n\nLoading as a JS module:\n\n```js\nconst JSONParser = require('./json-parser');\n\nlet value = JSONParser.parse('{\"x\": 10, \"y\": [1, 2]}');\n\nconsole.log(value); // JS object: {x: 10, y: [1, 2]}\n```\n\n### Language agnostic parser generator\n\nSee [this instruction](https://github.com/DmitrySoshnikov/syntax/blob/master/src/plugins/README.md) how to implement a new plugin.\n\n#### JavaScript default\n\nSyntax is language agnostic when it comes to parser generation. The same grammar can be used for parser generation in different languages. Currently Syntax supports _JavaScript_, _Python_, _PHP_, _Ruby_, _C#_, _Rust_, and _Java_. 
The target language is determined by the output file extension.

#### Python plugin

For example, this is how to use the same [calculator grammar](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calc.py.g) example to generate a parser module in Python:

```
./bin/syntax -g examples/calc.py.g -m lalr1 -o calcparser.py
```

The `calcparser` module can then be imported normally in Python for parsing:

```python
>>> import calcparser
>>> calcparser.parse('2 + 2 * 2')
6
```

[Another example](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/module-include.py.g) shows how to use parser hooks (such as `on_parse_begin`, `on_parse_end`, and others) in Python. They are discussed below in the [module include](https://github.com/DmitrySoshnikov/syntax#module-include-and-parser-events) section.

#### PHP plugin

For PHP the procedure is pretty much the same; take a look at the similar [example](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calc.php.g):

```
./bin/syntax -g examples/calc.php.g -m lalr1 -o CalcParser.php
```

The output file contains the class named after the file name:

```php
<?php

require('CalcParser.php');

var_dump(CalcParser::parse('2 + 2 * 2')); // int(6)
```

The parser hooks for PHP can be found in [this example](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/module-include.php.g).

#### Ruby plugin

Ruby is another target language supported by Syntax.
Its [calculator example](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calc.rb.g) is very similar:

```
./bin/syntax -g examples/calc.rb.g -m lalr1 -o CalcParser.rb
```

The output file again contains the class named after the file name:

```ruby
require_relative 'CalcParser'

puts CalcParser.parse('2 + 2 * 2') # 6
```

Ruby's parsing hooks can be found in [the following example](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/module-include.rb.g).

#### C++ plugin

Syntax has support for modern C++ as a target language. See its [calculator example](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calc.cpp.g):

```
./bin/syntax -g examples/calc.cpp.g -m lalr1 -o CalcParser.h
```

Then callers can use the module as:

```cpp
#include "CalcParser.h"

using namespace syntax;

...

CalcParser parser;

std::cout << parser.parse("2 + 2 * 2");   // 6
std::cout << parser.parse("(2 + 2) * 2"); // 8
```

A parsing hooks example in C++ format can be found in [this example](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calc.cpp.ast.g).

#### C# plugin

Syntax also supports C# as a target language. See its [calculator example](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calc.cs.g):

```
./bin/syntax -g examples/calc.cs.g -m lalr1 -o CalcParser.cs
```

Then callers can use the module as:

```cs
using SyntaxParser;

...

var parser = new CalcParser();
Console.WriteLine(parser.parse("2 + 2 * 2")); // 6
```

A parsing hooks example in C# format can be found in [this example](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/module-include.cs.g).

#### Rust plugin

Rust is a systems programming language focusing on efficiency and memory safety. Syntax has support for generating parsers in Rust.
See the [simple example](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calc.rs.g), and an example of [generating an AST](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calc-ast.rs.g) with recursive structures.

```
./bin/syntax -g examples/calc.rs.g -m lalr1 -o lib.rs
```

Callers can create a crate (called `syntax` in the example below) which contains the parser, and use it as:

```rust
extern crate syntax;

use syntax::Parser;

fn main() {
    let mut parser = Parser::new();

    let result = parser.parse("2 + 2 * 2");
    println!("{:?}", result); // 6
}
```

Check out the [README](https://github.com/DmitrySoshnikov/syntax/blob/master/src/plugins/rust/README.md) file from the Rust plugin directory for more information.

#### Java plugin

Syntax has support for generating LR parsers in Java. See the [simple example](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calc.java.g), and an example of [generating an AST](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calc-ast-java.bnf) with recursive structures.

```
./bin/syntax -g examples/calc.java.g -m lalr1 -o com/syntax/CalcParser.java
```

By default Syntax generates parsers in the `com/syntax` package.

```java
import com.syntax.*;
import java.text.ParseException;

public class SyntaxTest {
  public static void main(String[] args) {

    CalcParser calcParser = new CalcParser();

    try {
      System.out.println(calcParser.parse("2 + 2 * 2")); // 6
      System.out.println(calcParser.parse("(2 + 2) * 2")); // 8
    } catch (ParseException e) {
      e.printStackTrace();
    }
  }
}
```

Check out the [README](https://github.com/DmitrySoshnikov/syntax/blob/master/src/plugins/java/README.md) from the Java plugin for more information.

#### Julia plugin

Julia is a general-purpose programming language often used in the scientific computing space.
Julia was designed from the beginning for high performance, and uses LLVM as a compile target. It is a dynamically typed language which relies heavily on the concept of _multiple dispatch_. Syntax has support for generating parsers in Julia. See the [simple example](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calc.jl.g), and an example of [more complex usage generating a Vector-based AST](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/letter.jl.bnf) with recursive structures.

```
./bin/syntax -g examples/calc.jl.g -m lalr1 -o SyntaxParser.jl
```

The resulting parser module depends on the common DataStructures.jl package, for clarity and performance reasons. Callers can include the generated Julia file, which defines a module called `SyntaxParser`, and use it as:

```julia
using Pkg
Pkg.add("DataStructures")

# Load the generated parser module.
include("SyntaxParser.jl")

output = SyntaxParser.parse("5 + 5")
```

For complex Julia parser implementations it is recommended to use the JSON-like notation for the lexical grammar, as it makes it easier to manage and understand the escape sequences that need to travel from the JavaScript of the parser generator to the Julia notation of the resulting parser file.
For example, the following includes proper escape sequences for handling strings that may contain escaped double quotes inside them, as well as numbers in scientific notation:

```js
{
  rules: [
    // Comments
    ["\\/\\/.*",                                    `# skip single line comments`],
    ["\/\\*[\\s\\S]*?\\*\/",                        `# skip multiline comments`],
    [`\\s+`,                                        `# skip whitespace`],

    // Math operators (+, -, *, /, %)
    [`(\\+|\\-)`,                                   `return "ADDITIVE_OPERATOR"`],
    [`(\\*|\\/|%)`,                                 `return "MULTIPLICATIVE_OPERATOR"`],

    // Literals
    [`[-+]?([0-9]*[.])?[0-9]+([eE][-+]?\\d+)?`,     `return "NUMERIC"`],
    [`\\"(?:[^\\"\\\\]|\\\\.)*\\"`,                 `return "STRING"`],
    [`#-?\\d+`,                                     `return "OBJECT"`],
    [`E_[a-zA-Z]+`,                                 `return "ERROR"`],
  ]
}
```

### Grammar format

_Syntax_ supports two main notations for defining grammars: the _JSON-like_ notation, and the _Yacc/Bison-style_ notation.

#### JSON-like notation

It is JSON-"like" because it's an extended JSON notation which may include any JavaScript syntax (e.g. quotes may be omitted for properties, comments can be used, etc.):

```js
/**
 * Basic calculator grammar in JSON notation.
 */

{
  // ---------------------------
  // Lexical grammar.

  lex: {
    rules: [
      [`\\s+`,        `/* skip whitespace */`],
      [`\\d+`,        `return 'NUMBER'`],
      [`\\+`,         `return '+'`],
      [`\\*`,         `return '*'`],
      [`\\(`,         `return '('`],
      [`\\)`,         `return ')'`],
    ],
  },

  // ---------------------------
  // Operators precedence.

  operators: [
    [`left`, `+`],
    [`left`, `*`],
  ],

  // ---------------------------
  // Syntactic grammar.

  bnf: {
    e: [[`e + e`,   `$$ = $1 + $3`],
        [`e * e`,   `$$ = $1 * $3`],
        [`( e )`,   `$$ = $2`],
        [`NUMBER`,  `$$ = Number($1)`]],
  }
}
```

As we can see, `lex` defines the _lexical grammar_, `bnf` provides the _syntactic grammar_, and `operators` defines the [associativity and precedence](#working-with-precedence-and-associativity) of the needed symbols. The list of available [grammar properties](#grammar-properties) is given below.

#### Yacc/Bison notation

And here is the same grammar in the Yacc/Bison format:

```
/**
 * Basic calculator grammar in Yacc/Bison notation.
 */

%lex

%%

\s+             /* skip whitespace */
\d+             return 'NUMBER'

/lex

%left '+'
%left '*'

%%

e
  : e '+' e    { $$ = $1 + $3 }
  | e '*' e    { $$ = $1 * $3 }
  | '(' e ')'  { $$ = $2 }
  | NUMBER     { $$ = Number($1) }
  ;
```

Simple tokens like `'+'` can be defined inline (with quotes), while complex tokens like `NUMBER` have to be defined in the lexical grammar. Lexical and syntactic grammars can also be defined in two separate files.

A grammar in Yacc/Bison format is itself _just parsed_ by _Syntax_, using our [BNF parser](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/bnf.bnf).
The resulting parsed AST corresponds exactly to the JSON-like notation described above.

#### Grammar properties

Below is the list of available grammar properties:

* `lex` -- the [lexical grammar](#lexical-grammar-and-tokenizer).
* `bnf` -- the syntactic grammar in BNF format.
* `operators` -- associativity and precedence of the needed grammar symbols (usually operators, but not necessarily). Can be used to resolve "shift-reduce" conflicts in cases like the _"dangling-else"_ problem, math operators, etc.
* `moduleInclude` -- code which is included "as is" into the generated parser module. Usually used to require or define inline classes for AST nodes, and any additional code.
* `startSymbol` -- the starting symbol (if not specified, it's inferred from the LHS of the first rule).
* `tokens` -- an explicit list of tokens (if not specified, it's automatically inferred from the grammar).

### Lexical grammar and tokenizer

Tokenizers use the formalism of _regular grammars_ to split a string into a list of _tokens_.
One of the convenient implementations of regular grammars is _regular expressions_.

A basic lexical grammar should provide at least the `rules` section:

```js
{
  rules: [
    [`\\d+`,    `return 'NUMBER'`],
    [`"[^"]*"`, `yytext = yytext.slice(1, -1); return 'STRING';`],
    ...
  ],
}
```

The first element of a lexical rule is the _regexp pattern_ to match, and the second element is the corresponding token handler, which should return the _type_ of the matched token.

Handlers may access the matched text via the `yytext` variable, which can also be mutated -- in the example above, for the `STRING` token we strip the surrounding quotes from the matched text.

A handler can be an arbitrarily complex function, and in addition may return _multiple tokens_ using an array (see also [this example](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/indent-explicit.g.js#L136-L147)):

```
// Return 3 tokens for one matched value.
return ['DEDENT', 'DEDENT', 'NL'];
```

A lexical grammar may also define a [macros](https://github.com/DmitrySoshnikov/syntax/blob/ca8f0c86401c6c18cea1885ba602e76d62855d63/examples/json.grammar.js#L25) field -- variables which can be used later in the rules -- and also [start conditions](#start-conditions-of-lex-rules-and-tokenizer-states) for _tokenizer states_, which are discussed below.

```js
{
  macros: {
    id: `[a-zA-Z0-9_]`,
  },

  rules: [
    [`{id}+`,    `return 'IDENTIFIER'`],
    ...
  ],
}
```

#### Getting list of tokens

It is possible to get just the list of tokens, either from the `lex` part of the `--grammar` file, or from a standalone `--lex` file.

Example:

```js
// ~/lang.lex

{
  rules: [
    [`\\s+`,       `/* skip whitespace */`],
    [`\\d+`,       `return 'NUMBER'`],
    [`(\\+|\\-)`,  `return 'ADDITIVE_OPERATOR'`],
  ],
}
```

Extract the tokens:

```
./bin/syntax --lex ~/lang.lex --tokenize -p '2 + 5'
```

The result:

```js
[
  {
    "type": "NUMBER",
    "value": "2",
    "startOffset": 0,
    "endOffset": 1,
    "startLine": 1,
    "endLine": 1,
    "startColumn": 0,
    "endColumn": 1
  },
  {
    "type": "ADDITIVE_OPERATOR",
    "value": "+",
    "startOffset": 2,
    "endOffset": 3,
    "startLine": 1,
    "endLine": 1,
    "startColumn": 2,
    "endColumn": 3
  },
  {
    "type": "NUMBER",
    "value": "5",
    "startOffset": 4,
    "endOffset": 5,
    "startLine": 1,
    "endLine": 1,
    "startColumn": 4,
    "endColumn": 5
  }
]
```

As you can see, along with the type and the value, the tokenizer also captures token locations: absolute offsets, and line and column numbers.

#### Using custom tokenizer

> NOTE: the built-in tokenizer uses the underlying regexp implementation to extract the stream of tokens.

It is possible to provide a custom tokenizer if the built-in one isn't sufficient. For this, pass the `--custom-tokenizer` option, which is a path to a file that implements a tokenizer.
In this case the built-in tokenizer code won't be generated.

```
./bin/syntax --grammar examples/json.ast.js --mode SLR1 --output json-parser.js --custom-tokenizer './my-tokenizer.js'

✓ Successfully generated: json-parser.js
```

In the generated code, the custom tokenizer is just required as a module: `require('./my-tokenizer.js')`.

The tokenizer should implement the following iterator-like interface:

- `initString` -- initializes the parsing string;
- `hasMoreTokens` -- whether the token stream has more tokens to consume;
- `getNextToken` -- returns the next token in the `{type, value}` format.

For example:

```js
// File: ./my-tokenizer.js

const MyTokenizer = {

  initString(string) {
    this._string = string;
    this._cursor = 0;
  },

  hasMoreTokens() {
    return this._cursor < this._string.length;
  },

  getNextToken() {
    // Implement logic here.

    return {
      // Basic data.
      type: <<TOKEN_TYPE>>,
      value: <<TOKEN_VALUE>>,

      // Location data.
      startOffset: <<START_OFFSET>>,
      endOffset: <<END_OFFSET>>,
      startLine: <<START_LINE>>,
      endLine: <<END_LINE>>,
      startColumn: <<START_COLUMN>>,
      endColumn: <<END_COLUMN>>,
    }
  },
};

module.exports = MyTokenizer;
```

#### Start conditions of lex rules, and tokenizer states

The built-in tokenizer supports _stateful tokenization_: the same lex rule can be applied in different states, and result in different tokens. For lex rules this is known as _start conditions_.

Rules with explicit start conditions are executed _only_ when the lexer enters the state corresponding to their names. Start conditions can be _inclusive_ (`%s`, 0), and _exclusive_ (`%x`, 1).
Inclusive conditions also include rules _without_ any start conditions, while exclusive conditions do not include other rules when the tokenizer enters their state. Rules with the `*` start condition are always included.

```js
"lex": {
  "startConditions": {
    "comment": 1, // exclusive
    "string": 1   // exclusive
  },

  "rules": [
    ...

    // On `/*` enter the `comment` state:
    ["\\/\\*", "this.pushState('comment');"],

    // This rule is executed only when the tokenizer enters the `comment` or `string` state:
    [["comment", "string"], "\\n", "lines++; return 'NL';"],

    ...
  ],
}
```

More information on the topic can be found in [this gist](https://gist.github.com/DmitrySoshnikov/f5e2583b37e8f758c789cea9dcdf238a).

As an example, take a look at [this example grammar](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/lexer-start-conditions.g.js), which counts line numbers in a source file, including the lines in comments. The comments themselves are skipped during tokenization; however, the newlines within comments are handled separately, in order to count those line numbers as well.

Another example is the [grammar for BNF](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/bnf.g) itself, which we use to parse BNF grammars represented as strings, rather than in the JSON format. There we have an `action` start condition to correctly parse the `{` and `}` of JS code inside an actual handler of a grammar rule, which is itself surrounded by `{` and `}` braces.

#### Access tokenizer from parser semantic actions

It is also possible to access the tokenizer instance from the parser semantic actions.
It is exposed via the `yy.lexer` object (`yy.tokenizer` is an alias).

Having access to the lexer, it is possible, for example, to change its state, and yield different token types for the same characters.

As an example, `{` and `}` are parsed differently in an _expression_ position vs. a _statement_ position in the ECMAScript language:

```
{x: 1} // BlockStatement
({x: 1}) // ObjectLiteral
```

A simplified example of this can be found in the [parser-lexer-communication.g](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/parser-lexer-communication.g) grammar example.

#### Case-insensitive match

Lexical grammar rules can also be _case-insensitive_. From the command line it's controlled via the `--case-insensitive` (`-i`) option. It can also be specified in the lexical grammar itself -- for the whole grammar, or per rule:

```js
// case-insensitive.lex

{
  "rules": [
    // This rule is by default case-insensitive:
    [`x`, `return "X"`],

    // This rule overrides the global option:
    [`y`, `return "Y"`, {"case-insensitive": false}],
  ],

  // Global options for the whole lexical grammar.
  "options": {
    "case-insensitive": true,
  },
}
```

See [this example](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/case-insensitive-lex.g) for details.

### Working with precedence and associativity

Precedence and associativity operators allow building more readable and elegant grammars, avoiding different kinds of conflicts, such as "shift-reduce" conflicts.

The supported precedence operators are:

* `%left` -- left-associative;
* `%right` -- right-associative;
* `%nonassoc` -- non-associative.

#### Associative precedence

Given the following snippet from the calculator grammar:

```
%%

e
  : e '+' e
  | e '*' e
  | '(' e ')'
  | NUMBER
  ;
```

we get a _"shift-reduce"_ conflict for an input like:

```
2 + 2 + 2
```

Having parsed the first `2 + 2`, and seeing `+` as the lookahead, the parser cannot decide whether it should _reduce_ the parsed `2 + 2` to `e`, _or shift_ further, since `2` itself can be an `e` (by the `NUMBER` rule), and the parser can expect `+` after an `e`.

Defining associativity for the `+` operator solves this easily and elegantly:

```
%left '+'
%left '*'

%%

e
  : e '+' e
  | e '*' e
  | '(' e ')'
  | NUMBER
  ;
```

Here by `%left '+'` we say that `+` is _left-associative_, which means the parser should parse `2 + 2 + 2` as `(2 + 2) + 2`, and not as `2 + (2 + 2)`. In other words, once the parser has parsed `2 + 2`, and still sees `+` as the lookahead, it chooses to _reduce_ instead of shifting, and the "shift-reduce" conflict is resolved.

Another example is an input such as:

```
2 + 2 * 2
```

If the parser reduced in this case, we would get an _invalid mathematical result_, since this expression, without any grouping parentheses, should be parsed as `2 + (2 * 2)`, and not as `(2 + 2) * 2`.

To solve this we use `%left '*'`, which in our grammar definition comes _after_ `%left '+'`, giving the `*` operator a _higher precedence_ than `+`. In this case the parser chooses to _shift_ further instead of reducing.

Note, in the [JSON-like notation](#json-like-notation) these are defined as:

```js
operators: [
  ['left', '+'],
  ['left', '*'],
  // etc.
]
```

#### Non-associative precedence

Sometimes we don't need any associativity, but just want to specify the _precedence_ of some symbols.
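In general, precedence-based resolution compares the precedence of the production's operator with that of the lookahead token, and associativity only acts as a tie-break between equal precedence levels (a non-associative operator turns such a tie into a syntax error). A minimal sketch of this decision in plain JavaScript -- an illustration only, not Syntax's actual implementation:

```javascript
// Illustrative precedence table: a higher `level` binds tighter,
// `assoc` breaks ties between equal levels.
const precedence = {
  '+': {level: 1, assoc: 'left'},
  '*': {level: 2, assoc: 'left'},
};

// Decide between reducing the already-parsed production (whose
// operator is `ruleOp`) and shifting the lookahead token.
function shiftOrReduce(ruleOp, lookahead) {
  const rule = precedence[ruleOp];
  const ahead = precedence[lookahead];
  if (rule.level !== ahead.level) {
    return rule.level > ahead.level ? 'reduce' : 'shift';
  }
  if (rule.assoc === 'nonassoc') {
    throw new SyntaxError('Non-associative operator: ' + ruleOp);
  }
  return rule.assoc === 'left' ? 'reduce' : 'shift';
}

shiftOrReduce('+', '+'); // 'reduce': 2 + 2 + 2 is parsed as (2 + 2) + 2
shiftOrReduce('+', '*'); // 'shift':  2 + 2 * 2 is parsed as 2 + (2 * 2)
```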
As a classic example, we can take the [dangling-else](https://en.wikipedia.org/wiki/Dangling_else) problem, for which we use the `%nonassoc` operator:

```
%nonassoc THEN
%nonassoc 'else'

%%

IfStatement
  : 'if' '(' Expression ')' Statement %prec THEN
  | 'if' '(' Expression ')' Statement 'else' Statement
  ;
```

As we can see, the `'else'` token again has _higher precedence_, since it goes after the ("virtual") `THEN` token, so there is no "shift-reduce" conflict in this case.

Here `%prec` is used in the production to specify which precedence to apply, using the "virtual" `THEN` symbol -- in this case it's not a real token (in contrast with `'else'`), but just a _precedence name_ which the production can refer to.

You can find this problem handled in this [grammar example](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/lang.bnf).

### Handler arguments notation

The following notation is used for semantic action (handler) arguments:

- `yytext` -- the matched token value
- `yyleng` -- the length of the matched token
- Positioned arguments, e.g. `$1`, `$2`, etc.
- Positioned locations, e.g. `@1`, `@2`, etc.
- Named arguments, e.g. `$foo`, `$bar`, etc.
- Named locations, e.g. `@foo`, `@bar`, etc.
- `$$` -- the result value
- `@$` -- the result location

#### Positioned notation

This is the simplest notation -- the semantic action arguments are accessed by their position. For example, in the production:

```
exp : exp '+' term { $$ = $1 + $3 }
```

the `exp` can be accessed as `$1`, `$2` contains `'+'`, and `$3` corresponds to `term`.

#### Named notation

Sometimes using positioned arguments can be less readable, and may cause refactoring issues -- e.g. if some symbol is removed from the production, the handler code has to be updated:

```
exp : '+' exp term { $$ = $2 + $3 }
```

In this case using _named arguments_ might be more suitable:

```
exp : exp '+' term { $$ = $exp + $term }
```

It stays the same even if the production changes:

```
exp : '+' exp term { $$ = $exp + $term }
```

Notice though, that the named notation doesn't work for _duplicated symbols_, since it causes ambiguity:

```
exp : exp '+' exp { $$ = $exp + $exp } /* ERROR! */
```

In this case positioned arguments should be used:

```
exp : exp '+' exp { $$ = $1 + $3 } /* OK! */
```

### Capturing location objects

For some tools (e.g. source-code transformation tools) it is important not only to produce AST nodes, but also to capture all the locations in the original source code. _Syntax_ supports the `--loc` option for this. The default structure of a location object is the same as for a token:

```js
{
  startOffset,
  endOffset,
  startLine,
  endLine,
  startColumn,
  endColumn,
}
```

However, in actual AST node generation it is possible to build custom location information based on this default location object.

The locations are accessed using the `@1` or `@foo` notation; the result location is in `@$`:

```
exp : exp + exp
  {
    $$ = $1 + $3;

    // Default algorithm.
    @$.startLine = @1.startLine;
    @$.endLine = @3.endLine;
    @$.startColumn = @1.startColumn;
    @$.endColumn = @3.endColumn;
    ...
  }
```

By default _Syntax_ automatically calculates the resulting location, taking the _start part_ from the _first symbol_ of a production, and the _end part_ from the _last symbol_ of the production.
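That default calculation can be sketched as a simple merge of the first and the last symbols' locations (an illustrative helper, not Syntax's internal code):

```javascript
// Merge the location of the first symbol of a production with the
// location of its last symbol, producing the default `@$` result
// location (the fields are the same as in a token location object).
function mergeLocations(first, last) {
  return {
    startOffset: first.startOffset,
    endOffset: last.endOffset,
    startLine: first.startLine,
    endLine: last.endLine,
    startColumn: first.startColumn,
    endColumn: last.endColumn,
  };
}
```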
So the example above can actually omit the manual result location calculation, and be just:\n\n\n```\nexp : exp '+' exp { $$ = $1 + $3; }\n```\n\nIt is possible, though, to override the default algorithm by mutating `@$`, and it's also possible to create a custom location:\n\n```\nexp : exp '+' exp { $$ = new AdditionNode($1, $3, Loc(@$)) }\n```\n\nIn this case the `Loc` function can create a custom location format. Here is [another example](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calc-loc.bnf) of a grammar which uses location objects.\n\n### Parsing modes\n\n_Syntax_ supports several _LR_ parsing modes: _LR(0)_, _SLR(1)_, _LALR(1)_, _CLR(1)_, as well as the _LL(1)_ mode. The same grammar can be analyzed in different modes; from the CLI this is controlled via the `--mode` option, e.g. `--mode slr1`.\n\n> Note: the de facto standard for automatically generated parsers is the _LALR(1)_ parser. The _CLR(1)_ parser, being the most powerful and able to parse wider sets of grammars, can have many more states than LALR(1), and is usually suitable only for educational purposes. The same applies to its less powerful counterparts, _LR(0)_ and _SLR(1)_, which are less used in practice (although some production-ready grammars can also be parsed by _SLR(1)_, e.g. the [JSON grammar](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/json.ast.js)).\n\nSome grammars can be handled by one mode, but not by another. In this case a _conflict_ will be shown in the table.\n\n#### LL parsing\n\nCurrently an LL(1) grammar is expected to be already _left-factored_ and _non-left-recursive_. 
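For example, a left-recursive addition rule can be rewritten into an equivalent right-recursive form (shown schematically; the exact empty-alternative syntax may differ in real grammar files):

```
/* Left-recursive -- not suitable for LL(1): */
E : E '+' T | T ;

/* Equivalent non-left-recursive form: */
E  : T E' ;
E' : '+' T E' | /* empty */ ;
```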
See the section on [LL conflicts](https://en.wikipedia.org/wiki/LL_parser#Solutions_to_LL.281.29_Conflicts) for details.\n\n> Note: left-recursion elimination and left-factoring can be automated for most cases (excluding some edge cases, which should be handled manually), implementing a transformation to a non-left-recursive grammar.\n\nA typical LL parsing table is smaller than a corresponding LR table. However, LR grammars cover more languages than LL grammars. In addition, an LL(1) grammar usually looks less elegant, or even less readable, than an LR grammar. As an example, take a look at the calculator grammar in the [non-left-recursive LL mode](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calc.ll1), the [left-recursive LR mode](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calculator.g), and also the [left-recursive, precedence-based LR mode](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calc.slr1).\n\nAt the moment, the LL parser only implements syntax validation, and does not provide semantic actions (e.g. to construct an AST). For semantic handlers and actual AST construction, see LR parsing.\n\n#### LR parsing\n\nLR parsing, and its most practical version, LALR(1), is widely used in automatically generated parsers. LR grammars usually look more readable than corresponding LL grammars, since, in contrast with the latter, LR parser generators allow _left-recursion_ by default, and do automatic conflict resolution. The precedence and associativity operators allow building more elegant grammars with smaller parsing tables.\n\nTake a look at the [example grammar](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calc-eval.g) with a typical _syntax-directed translation (SDT)_, using semantic actions for AST construction, direct evaluation, or any other transformation.\n\nThe default algorithm used for the practical _LALR(1)_ mode is _"LALR(1) by SLR(1)"_. 
It is enabled by the default `--mode lalr1`, or the explicit `--mode lalr1_by_slr1`. In addition, _Syntax_ implements the _"LALR(1) by compressing CLR(1)"_ algorithm (available via `--mode lalr1_by_clr1`), which is slower in parser generation, and is suitable only for educational purposes.\n\n> **NOTE:** prefer `--mode lalr1` as _the most practical_ mode.\n\n#### LR conflicts\n\nIn LR parsing there are two main types of conflicts: the _"shift-reduce" (s/r)_ conflict, and the _"reduce-reduce" (r/r)_ conflict. Taking as an example the grammar from `examples/example1.slr1`, we see that the parsing table is constructed normally in `SLR(1)` mode, but has a "shift-reduce" conflict if run in `LR(0)` mode:\n\n```\n./bin/syntax --grammar examples/example1.slr1 --table\n```\n\n```\n./bin/syntax --grammar examples/example1.slr1 --table --mode lr0\n```\n\n![sl1-grammar](http://dmitrysoshnikov.com/wp-content/uploads/2015/12/imageedit_2_9168334335.png)\n![sl1-grammar-lr0-m](http://dmitrysoshnikov.com/wp-content/uploads/2015/12/imageedit_2_6530197571.png)\n\n#### Conflicts resolution\n\nSometimes changing the parsing mode is not enough to fix conflicts: for some grammars the conflicts may remain both in the _LALR(1)_, and even in the _CLR(1)_ mode. 
LR conflicts can, though, be resolved automatically or semi-automatically by specifying the precedence and associativity of operators.\n\nFor example, the [following grammar](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calculator-assoc-conflict.g) has a _shift-reduce_ conflict:\n\n```\n%token id\n\n%%\n\nE\n  : E '+' E\n  | E '*' E\n  | id\n  ;\n```\n\nTherefore, parsing is not possible using this grammar:\n\n```\n./bin/syntax -g examples/calculator-assoc-conflict.g -m lalr1 -w -p 'id * id + id'\n\nParsing mode: LALR(1).\n\nParsing: id * id + id\n\nRejected: Parse error: Found "shift-reduce" conflict at state 6, terminal '+'.\n```\n\nThis can be fixed, though, using operator associativity and precedence:\n\n```\n%token id\n\n%left '+'\n%left '*'\n\n%%\n\nE\n  : E '+' E\n  | E '*' E\n  | id\n  ;\n```\n\nSee a detailed description of the conflict resolution algorithm in [this example grammar](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calculator-assoc.g), which can be parsed normally:\n\n```\n./bin/syntax -g examples/calculator-assoc.g -m lalr1 -w -p 'id * id + id'\n\nParsing mode: LALR(1).\n\nParsing: id * id + id\n\n✓ Accepted\n```\n\n### Validating grammar\n\nBy using the `--validate` option, it is possible to check whether your grammar is free from different kinds of conflicts, and if it is not, to get the needed information about which grammar rules conflict, and which possible solutions can be applied to resolve them.\n\nFor example, the [calculator-assoc-conflict](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calculator-assoc-conflict.g) grammar discussed above has _"shift-reduce"_ conflicts in two productions:\n\n```\n./bin/syntax -g examples/calculator-assoc-conflict.g -m slr1 --validate\n\nParsing mode: SLR(1).\n\nGrammar has the following unresolved conflicts:\n\n"Shift-reduce" conflicts:\n\n   1. Production: E -> E '+' E, conflicts with symbols '+', '*'.\n   2. 
Production: E -> E '*' E, conflicts with symbols '+', '*'.\n\nPossible solutions:\n\n  1. Conflicts possibly can be resolved by using "operators" section,\n     where you can specify precedence and associativity.\n\n  2. By using different parsing mode, e.g. LALR1 instead of SLR1.\n\n  3. Restructuring grammar.\n```\n\nAs we can see, _Syntax_ showed which rules conflict with which lookahead symbols, and provided possible solutions.\n\nIn this case, the rule `E -> E '+' E` conflicts with the lookaheads `'+'` and `'*'`. This means that if we have `id + id + id`, which would expand to `E '+' E '+' E`, the parser wouldn't know whether to _reduce_ the first `E '+' E` to `E`, or to _shift_ further when it sees the second `'+'` symbol.\n\nBy specifying operator precedence and associativity, we can resolve this conflict, which is done in the [calculator-assoc](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/calculator-assoc.g) grammar:\n\n```\n./bin/syntax -g examples/calculator-assoc.g -m slr1 --validate\n\nParsing mode: SLR(1).\n\n✓ Grammar doesn't have any conflicts!\n```\n\n### Module include, and parser events\n\nThe `moduleInclude` directive allows injecting arbitrary code into the generated parser file. This is usually code to require needed dependencies, or to define them inline. As an example, see [the corresponding example grammar](https://github.com/DmitrySoshnikov/syntax/blob/master/examples/module-include.g.js), which defines all classes for AST nodes inline, and then uses them in the rule handlers.\n\nThe code can also define handlers for some parse events (attaching them to the `yyparse` object), in particular for `onParseBegin` and `onParseEnd`. 
This allows injecting code which is executed when the parsing process starts and ends, respectively.\n\n```js\n"moduleInclude": `\n  class Node {\n    constructor(type) {\n      this.type = type;\n    }\n  }\n\n  class BinaryExpression extends Node {\n    ...\n  }\n\n  // Event handlers.\n\n  yyparse.onParseBegin = (string) => {\n    console.log('Parsing code:', string);\n  };\n\n  yyparse.onParseEnd = (value) => {\n    console.log('Parsed value:', value);\n  };\n`,\n\n...\n\n["E + E",  "$$ = new BinaryExpression($1, $3, $2)"],\n```\n\n### Debug mode\n\nDebug mode allows measuring the timing of certain steps, and analyzing other debug information. From the CLI it's activated using the `--debug` (`-d`) option:\n\n```\n./bin/syntax -g examples/calc-eval.g -m slr1 -p '2 + 2 * 2' --debug\n\nDEBUG mode is: ON\n\n[DEBUG] Grammar (bnf) is in JS format\n[DEBUG] Grammar loaded in: 2.255ms\n\nParsing mode: SLR(1).\n\nParsing:\n\n2 + 2 * 2\n\n[DEBUG] Building canonical collection: 15.219ms\n[DEBUG] Number of states in the collection: 22\n[DEBUG] Building LR parsing table: 4.169ms\n[DEBUG] LR parsing: 2.180ms\n\n✓ Accepted\n\nParsed value:\n\n6\n\n[DEBUG] Total time: 70.284ms\n```\n\n
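The parsed value `6` above is a result of operator precedence: with Yacc-style declarations, an operator declared later (here `'*'`) binds tighter than one declared earlier (`'+'`). Conceptually, each potential "shift-reduce" conflict is resolved by comparing precedence levels; the following is a minimal, tool-agnostic sketch of that idea (not _Syntax_'s internals; `shouldReduce` is a hypothetical helper):

```javascript
// Precedence levels as implied by:
//   %left '+'
//   %left '*'
// (a later declaration means a higher level).
const precedence = {'+': 1, '*': 2};

// With "E op E" on the stack and an operator as the lookahead, the
// parser reduces if the operator on the stack has higher (or, for
// %left, equal) precedence than the lookahead; otherwise it shifts.
function shouldReduce(opOnStack, lookahead) {
  return precedence[opOnStack] >= precedence[lookahead];
}

// Parsing "2 + 2 * 2": with "E + E" on the stack and '*' as the
// lookahead, the parser shifts, so '*' groups first:
console.log(shouldReduce('+', '*')); // → false (shift)

// Later, with "E * E" on the stack and '+' as the lookahead, it
// reduces, yielding 2 + (2 * 2) = 6:
console.log(shouldReduce('*', '+')); // → true (reduce)
```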