{"id":26229591,"url":"https://github.com/wldfngrs/parser-generator","last_synced_at":"2026-05-28T19:02:31.236Z","repository":{"id":276885333,"uuid":"930549312","full_name":"wldfngrs/parser-generator","owner":"wldfngrs","description":"Yet Another Parser Generator takes a grammar specification for an LR(1) grammar as input and generates a C++ header file containing tables and helper structs for parsing the LR(1) grammar.","archived":false,"fork":false,"pushed_at":"2025-03-11T16:25:24.000Z","size":202,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-11T17:39:43.478Z","etag":null,"topics":["c-plus-plus","cpp17","grammar","grammar-parser","lookahead","lr-parser","lr1","lr1-parser","parser-generator"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wldfngrs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-10T20:14:39.000Z","updated_at":"2025-03-11T17:36:20.000Z","dependencies_parsed_at":"2025-03-02T23:23:28.165Z","dependency_job_id":"178ba492-d280-447e-aef9-c76f1dc31de6","html_url":"https://github.com/wldfngrs/parser-generator","commit_stats":null,"previous_names":["wldfngrs/parser-generator"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/wldfngrs/parser-generator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wldfngrs%2Fparser-generator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wldfngrs%2Fparser-generator/tags
","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wldfngrs%2Fparser-generator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wldfngrs%2Fparser-generator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wldfngrs","download_url":"https://codeload.github.com/wldfngrs/parser-generator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wldfngrs%2Fparser-generator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28006375,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-24T02:00:07.193Z","response_time":83,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c-plus-plus","cpp17","grammar","grammar-parser","lookahead","lr-parser","lr1","lr1-parser","parser-generator"],"created_at":"2025-03-12T22:16:54.414Z","updated_at":"2025-12-24T19:03:59.027Z","avatar_url":"https://github.com/wldfngrs.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Yet Another Parser-Generator (YAPG)\nYAPG sounds hilarious, but yes, this is yet another parse-tables generator. 
\n\nYAPG (if you'd call it that) takes a specification for an LR(1) grammar as input and generates a C++ header file containing action and goto tables, alongside helper structs for writing a parser for the grammar from the tables.\n\n### Contents\n* [Why?](#why)\n* [Installation](#installation)\n* [Usage](#usage)\n* [LR(1) Grammar Specification Syntax](#lr1-grammar-specification-syntax)\n\t* [Parentheses grammar](#parentheses-grammar)\n\t* [Action Table, Goto Table](#action-table-goto-table)\n\t* [Parentheses parse function](#parentheses-parse-function)\n\t* [Precedence and associativity](#precedence-and-associativity)\n\t\t* [Terminal precedence](#terminal-precedence)\n\t\t* [Explicit rule/production precedence](#explicit-ruleproduction-precedence)\n* [Conflicts](#conflicts)\n\t* [SHIFT-REDUCE conflicts](#shift-reduce-conflicts)\n\t* [REDUCE-REDUCE conflicts](#reduce-reduce-conflicts)\n* [Testing](#testing)\n\t* [Mathematical Expressions Interpreter](#mathematical-expressions-interpreter)\n\t* [Parentheses Interpreter](#parentheses-interpreter)\n\n## Why?\nWhile working on a compiler to LLVM IR, I had to deal with a pretty common compiler engineering dilemma: hand-written or generated parsers. Arguing for the former, the parsers for a lot of the more successful languages are hand-written (clang, rust, gcc). The flexibility a hand-written parser provides for handling complex grammars is a significant advantage. Conversely? Well, generated parsers are insanely cool to me. I think that's more than enough reason.\n\nNow, I'm a sucker for \"figuring out how things work\" so of course I decided to write my own parser generator. Following three weeks of studying as much theory on parsing as I could find, alongside pretty consistent programming, I was thankfully successful. \n\nI don't expect this to be a big thing (whatever that means). It's just something I had a lot of fun working on. 
Plus, I'd include a less-than-formal description of table-driven LR(1) parsing and table generation in a separate [blog post](https://cruxofthematter.substack.com/). So much of the existing material exploring the topic can be \"tough\" looking.\n\n## Installation\nThe following commands clone the repository and compile the parser-generator:\n```\n$ git clone git@github.com:wldfngrs/parser-generator.git\n$ cd parser-generator\n$ mkdir build\n$ cd build\n$ cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=clang++\n$ ninja parsegen\n```\nIf the build is successful, you should have an executable, `parsegen`, in the `build/` directory.\n\n## Usage\n```\n$ ./parsegen \u003cpath/to/grammar\u003e [OPTIONAL] \u003cpath/to/output/file\u003e\n$ ./parsegen -h or ./parsegen -H for help information\n```\nThe parser generator takes two arguments, the second of which is optional:\n- _**path to the file containing the specification of the LR(1) grammar**_,\n- _**path to the file where the tables should be written**_.\n\nIf an output file is not provided explicitly, the parser-generator defaults to `\"output.h\"` within the `build/` directory.\n\nEither help option, `-h` or `-H`, prints the grammar specification syntax to standard output.\n\n## LR(1) Grammar Specification Syntax\n**Note**: _`RHS` ('Right Hand Side'), `LHS` ('Left Hand Side')_\n\nThe input grammar specification should follow these base rules:\n- **Terminals must be prefixed by 't_'. 
Each terminal definition line must start this way.**\n- **The first line in the terminal definition block should be the goal production lookahead terminal**.\n- **Terminals must be defined first, in a contiguous block without interleaving empty lines between individual definitions.**\n- **The first occurring empty line signifies the end of the terminal definition block, and the start of the rules/productions block.**\n- **The first line in the rules/productions block should be the goal production.**\n- **Terminals (symbols prefixed by 't_') cannot appear on the LHS of a rule/production.**\n- **The RHS and LHS of a production must be delimited by ' \u003e '. Note: _[SPACE] [GREATER_THAN] [SPACE]_**\n\nOf course, you'll receive helpful error and diagnostic messages if any of these rules is ignored.\n### Parentheses grammar\nFor a simple, yet robust example, take a 'parentheses' grammar that expects a right-parenthesis symbol `)` for every left-parenthesis symbol `(` (in that order). For example, `()` is valid but `)(` is not.\n\nHere's a correct grammar specification for such a 'parentheses' grammar:\n\n```\nt_EOF\nt_LP\nt_RP\n\nGoal \u003e List\nList \u003e List Pair\nList \u003e Pair\nPair \u003e t_LP Pair t_RP\nPair \u003e t_LP t_RP\n```\nAnd here's the output file generated by the parser generator, for the parentheses grammar:\n\n**`parentheses/parse-tables.h`**\n```c++\n#pragma once\n\n#include \u003cunordered_map\u003e\n#include \u003cunordered_set\u003e\n#include \u003cstring\u003e\n#include \u003cstring_view\u003e\n#include \u003cvector\u003e\n\nstd::unordered_set\u003cstd::string\u003e strings {\n\t\"Goal\", \"Pair\", \"List\"\n};\n\nenum TokenType {\n\tt_RP, t_EOF, t_LP\n};\n\nstd::vector\u003cstd::pair\u003cstd::string_view, size_t\u003e\u003e reduce_info {\n\t{ *strings.find(\"Goal\"), 1 }, { *strings.find(\"Pair\"), 3 },\n\t{ *strings.find(\"List\"), 2 }, { *strings.find(\"List\"), 1 },\n\t{ *strings.find(\"Pair\"), 2 }\n};\n\nstruct PairHash {\n\tsize_t 
operator()(const std::pair\u003csize_t, std::string_view\u003e\u0026 pair) const {\n\t\treturn std::hash\u003csize_t\u003e{}(pair.first) ^ std::hash\u003cstd::string_view\u003e{}(pair.second);\n\t}\n\n\tsize_t operator()(const std::pair\u003csize_t, TokenType\u003e\u0026 pair) const {\n\t\treturn std::hash\u003csize_t\u003e{}(pair.first) ^ pair.second;\n\t}\n};\n\nenum ActionType {\n\tSHIFT,\n\tREDUCE,\n\tACCEPT\n};\n\nstruct Action {\n\tActionType type;\n\tsize_t value;\n};\n\nstd::unordered_map\u003cstd::pair\u003csize_t, TokenType\u003e, Action, PairHash\u003e actionTable {\n\t{{ 6, t_EOF }, {REDUCE, 4 }}, {{ 0, t_LP }, {SHIFT, 2 }},\n\t{{ 8, t_RP }, {SHIFT, 10 }}, {{ 6, t_LP }, {REDUCE, 4 }},\n\t{{ 11, t_LP }, {REDUCE, 1 }}, {{ 5, t_RP }, {SHIFT, 11 }},\n\t{{ 11, t_EOF }, {REDUCE, 1 }}, {{ 4, t_EOF }, {REDUCE, 2 }},\n\t{{ 10, t_RP }, {REDUCE, 1 }}, {{ 4, t_LP }, {REDUCE, 2 }},\n\t{{ 7, t_LP }, {SHIFT, 7 }}, {{ 9, t_RP }, {REDUCE, 4 }},\n\t{{ 3, t_EOF }, {REDUCE, 3 }}, {{ 3, t_LP }, {REDUCE, 3 }},\n\t{{ 1, t_EOF }, {ACCEPT, 0 }}, {{ 1, t_LP }, {SHIFT, 2 }},\n\t{{ 2, t_RP }, {SHIFT, 6 }}, {{ 2, t_LP }, {SHIFT, 7 }},\n\t{{ 7, t_RP }, {SHIFT, 9 }}\n};\n\nstd::unordered_map\u003cstd::pair\u003csize_t, std::string_view\u003e, size_t, PairHash\u003e gotoTable {\n\t{{ 0, *strings.find(\"List\") }, {1}}, {{ 0, *strings.find(\"Pair\") }, {3}},\n\t{{ 1, *strings.find(\"Pair\") }, {4}}, {{ 2, *strings.find(\"Pair\") }, {5}},\n\t{{ 7, *strings.find(\"Pair\") }, {8}}\n};\n```\n### Action Table, Goto Table\nThe generated _**actionTable**_ is an `std::unordered_map` that takes as 'key' an `std::pair` of the parse function's current state **and** the next token, to return as 'value' an `Action` struct object. \n\nIt defines the action to be taken; `SHIFT`, `REDUCE`, `ACCEPT`, given a current state and a next token.\n\nThe `Action` struct contains two fields: _type_ (`SHIFT`, `REDUCE`, `ACCEPT`) and _value_. 
For each _type_, the _value_ field is interpreted differently:\n\n**Note**: _`PF` ('parse function'), `NT` ('next token')_\n\n- `SHIFT`: Given the `PF` current state, and `NT`, _value_ represents the `PF` next state.\n- `REDUCE`: Given the `PF` current state, and `NT`, _value_ represents an index into _**reduce_info**_.\n\n  _**reduce_info**_ is an `std::vector` of `std::pair` elements. For each `std::pair` element, the first field is the non-terminal symbol to reduce to, and the second field is the number of states to pop off the `PF` state stack.\n\n  This non-terminal symbol information can be used for creating Abstract Syntax Tree nodes during the parse.\n- `ACCEPT`: Given the `PF` current state, and `NT`, _value_ is redundant. This is an accepting state, signifying a validated input.\n\nThe generated _**gotoTable**_ is an `std::unordered_map` that takes as 'key' an `std::pair` of the parse function's current state **and** the `LHS` production symbol to be reduced to, to return as 'value' the `PF` next state. 
\n\nIt defines the next state given the current state (the one left on top of the state stack after popping) and the non-terminal symbol just reduced to.\n\nThese tables and helper structs can then be referenced within the parser's `parse()` function.\n### Parentheses parse function\nHere's the `parse()` function for the above specified parentheses [grammar](#parentheses-grammar):\n\n`parentheses/parentheses.h`\n```c++\nstatic bool parse(std::vector\u003cTokenType\u003e tokens) {\n\tstd::stack\u003csize_t\u003e states;\n\tstates.push(0);\n\tauto state = static_cast\u003csize_t\u003e(0);\n\n\tfor (size_t i = 0; i \u003c tokens.size();) {\n\t\tstate = states.top();\n\t\t// No entry for (state, token) means the input is invalid\n\t\tif (actionTable.find(std::make_pair(state, tokens[i])) == actionTable.end()) {\n\t\t\treturn false;\n\t\t}\n\t\t\n\t\tAction\u0026 action = actionTable[std::make_pair(state, tokens[i])];\n\n\t\tif (action.type == ActionType::REDUCE) {\n\t\t\tauto index = action.value;\n\t\t\tauto pop_count = reduce_info[index].second;\n\t\t\twhile (pop_count--) {\n\t\t\t\tstates.pop();\n\t\t\t}\n\t\t\tstate = states.top();\n\t\t\tauto reduce_symbol = reduce_info[index].first;\n\t\t\tauto next_state = gotoTable[std::make_pair(state, reduce_symbol)];\n\t\t\tstates.push(next_state);\n\t\t}\n\t\telse if (action.type == ActionType::SHIFT) {\n\t\t\tauto next_state = action.value;\n\t\t\tstates.push(next_state);\n\t\t\t// Only SHIFT actions proceed to the next token\n\t\t\t++i;\n\t\t}\n\t\telse if (action.type == ActionType::ACCEPT) {\n\t\t\treturn true;\n\t\t}\n\t}\n\n\t// Tokens ran out before an ACCEPT action was reached\n\treturn false;\n}\n```\nFor those familiar with table-generated parsers, the algorithm is easy to follow. 
For those who aren't, the parse function takes a list of tokens in the order they appear in the input source file, iterating through them one by one.\n\nStarting from an initial state of `0`, the parse function progresses to subsequent states based on the 'action' mapped to the current state **and** the next token in the ordered list of tokens. \n\nIf the pair of current state **and** next token does not exist in the generated _**actionTable**_, the parse function has entered an invalid state, and should return **false**, signifying a failed parse. \n\nOtherwise, the parse function retrieves the 'action' mapped to that pair, and executes the corresponding behaviour tied to that 'action':\n\n- `SHIFT` action: push the next state to the state stack. Move to the next token.\n- `REDUCE` action: pop _x_ states off the state stack, where _x_ is `reduce_info[reduce-action-value].second`. Then push the state that _**gotoTable**_ maps to the newly exposed top state and the reduced-to non-terminal. Refer to parentheses [parse function](#parentheses-parse-function).\n- `ACCEPT` action: return **true**.\n\n### Precedence and associativity\nThe grammar specification syntax allows you to explicitly specify the precedence and associativity behaviour of terminals. 
This is commonly useful in evaluating mathematical expressions.\n\nAs an example, here's the grammar specification of a simple 'mathematical expressions' grammar:\n```\nt_EOF\nt_PLUS 1 l\nt_MINUS 1 l\nt_TIMES 2 l\nt_DIVIDE 2 l\nt_NUMBER 4\nt_LP\nt_RP\n\nStatement \u003e Expression\nExpression \u003e t_NUMBER\nExpression \u003e Grouping\nExpression \u003e Add\nExpression \u003e Sub\nExpression \u003e Mul\nExpression \u003e Div\nExpression \u003e Unary\nGrouping \u003e t_LP Expression t_RP 2\nAdd \u003e Expression t_PLUS Expression\nSub \u003e Expression t_MINUS Expression\nMul \u003e Expression t_TIMES Expression\nDiv \u003e Expression t_DIVIDE Expression\nUnary \u003e t_MINUS Expression 3\n```\n#### Terminal precedence\nIn the terminal definition block,\n```\nt_EOF\nt_PLUS 1 l\nt_MINUS 1 l\nt_TIMES 2 l\nt_DIVIDE 2 l\nt_NUMBER 4\nt_LP\nt_RP\n```\n\nFollowing the terminal name definition, on the same line, the terminal's precedence value and associativity behavior can be explicitly set. \n\nPrecedence values must be integers (positive or negative). If not explicitly set, the terminal precedence value defaults to `0`. The higher the precedence value set, the higher the terminal precedence.\n\nAssociativity behaviour must be one of `l` (left-associative), `r` (right-associative), or `n` (non-associative). If not explicitly set, the associativity behavior defaults to `n` (non-associative). \n\nIn the above terminal definition block, the `t_PLUS` (`+`) and `t_MINUS` (`-`) terminals (binary operators for addition and subtraction respectively) have the same precedence value; `1`, and associativity behavior; `l` (left-associative), specified. \n\nHowever, `t_TIMES` (`*`) and `t_DIVIDE` (`/`) (binary operators for multiplication and division) both have a higher precedence value; `2`. 
Multiplication and division, as a result, would have a higher precedence than addition and subtraction (as expected).\n\nNumbers (`t_NUMBER`), the most important node of any mathematical expression, have the highest precedence value set: `4`.\n\n#### Explicit rule/production precedence\nThe precedence of a rule/production can be set by specifying an integer value at the end of a rule/production. For example,\n```\nUnary \u003e t_MINUS Expression 3\n```\nNote that the `Unary` rule expects the same terminal as a `Sub` rule: `t_MINUS`. However, it's standard mathematical understanding that Unary operations are evaluated before any binary operations. \n\nExplicit rule precedence forces the parser to override the already set precedence value of `t_MINUS` within the `Unary` rule/production, treating it as an implicit Unary terminal with the explicitly set precedence value.\n\n## Conflicts\nWith table-driven parsers, two kinds of conflicts potentially arise in the table-generation process: `SHIFT-REDUCE` and `REDUCE-REDUCE` conflicts.\n\n### SHIFT-REDUCE conflicts\nConsider an expression, `2 + 2 * 4`. This could be evaluated in two different ways. \n\nOn the one hand, the addition can be evaluated first, `2 + 2`, and the result evaluated against the `* 4`, like so: `(2 + 2) * 4`. This would produce `16` as output, which is incorrect. \n\nFortunately, we understand that the precedence of mathematical operators demands that multiplication operations be evaluated before addition operations. That is, `2 + (2 * 4)`, to produce `10` as output. \n\nHow to encode this behavior in the parser-generator? [Precedence and associativity rules](#precedence-and-associativity), of course.\n\nSimply put, a `SHIFT-REDUCE` conflict occurs when, for the same current state and next token, both a `SHIFT` and a `REDUCE` action would produce a valid parse function next state. 
These conflicts are resolved by applying these conditions in order:\n\nLet `rule` be a rule subject to reduce and `term` be a terminal/token that is encountered on input.\n\n  1. If explicit `rule` precedence is higher than `term` precedence, perform a `REDUCE`.\n  2. If precedence of last terminal in `rule` is higher than `term` precedence, perform a `REDUCE`.\n  3. If precedence of last terminal in `rule` is equal to `term` precedence and last terminal in `rule` is left-associative, perform a `REDUCE`.\n  4. Otherwise, perform a `SHIFT`.\n\n### REDUCE-REDUCE conflicts\nIn some cases, the language is ill-formed and the grammar specification on `REDUCE` actions is unclear. That is, for the same current state and next token, more than one rule/production is eligible for a `REDUCE`. \n\nThis is a fatal error that results in the parser generator terminating early with an error message identifying the rules/productions leading to ambiguous `REDUCE` actions.\n\n## Testing\nRather than a bunch of carefully hand-picked test-cases, this repository includes two REPL interpreters: one for the 'mathematical expressions' grammar and one for the 'parentheses' grammar.\n### Mathematical Expressions Interpreter\nRun these commands within the `build/` directory to build and then run the 'mathematical expressions' interpreter:\n```\n$ ninja expressions\n$ ./expressions\nMath Expressions Evaluator ('q' or CTRL-C to exit)\n\u003e\n```\nYou can now verify the correctness of the parser's expression evaluation order of fundamental (`*`, `+`, `-`, `/`, Unary, Brackets) mathematical operations.\n### Parentheses Interpreter\nRun these commands within the `build/` directory to build and then run the 'parentheses' interpreter:\n```\n$ ninja parentheses\n$ ./parentheses\nParentheses Grammar Interpreter (enter 'q' or CTRL-C to exit)\n\u003e\n```\nSimilarly, you can verify the correctness of a string of parentheses; that is, every left parenthesis `(` has a matching right parenthesis `)` in the correct order (i.e. 
balanced and well-formed).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwldfngrs%2Fparser-generator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwldfngrs%2Fparser-generator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwldfngrs%2Fparser-generator/lists"}