{"id":19207607,"url":"https://github.com/danilafe/pegasus","last_synced_at":"2025-04-30T15:24:16.742Z","repository":{"id":45156598,"uuid":"156638388","full_name":"DanilaFe/pegasus","owner":"DanilaFe","description":"A parser generator for C and Crystal.","archived":false,"fork":false,"pushed_at":"2024-11-07T18:42:52.000Z","size":224,"stargazers_count":61,"open_issues_count":4,"forks_count":3,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-30T15:24:10.930Z","etag":null,"topics":["c","compilers","crystal","parser","parser-generator"],"latest_commit_sha":null,"homepage":"","language":"Crystal","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DanilaFe.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-11-08T02:25:40.000Z","updated_at":"2025-01-12T08:16:23.000Z","dependencies_parsed_at":"2024-11-07T19:40:45.535Z","dependency_job_id":null,"html_url":"https://github.com/DanilaFe/pegasus","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanilaFe%2Fpegasus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanilaFe%2Fpegasus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanilaFe%2Fpegasus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanilaFe%2Fpegasus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DanilaFe","download_url":"https://codeload.github.com/DanilaFe/pegasus
/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251730059,"owners_count":21634319,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c","compilers","crystal","parser","parser-generator"],"created_at":"2024-11-09T13:20:57.842Z","updated_at":"2025-04-30T15:24:16.720Z","avatar_url":"https://github.com/DanilaFe.png","language":"Crystal","readme":"# pegasus\nA parser generator based on Crystal and the UNIX philosophy. It is language agnostic, but can\ncurrently generate parsers for the [C](#c-output) and [Crystal](#crystal-output) languages.\n\n_Warning: Pegasus is experimental. 
Its APIs are not yet solidified, and are subject to change at any time._\n\n## Table of Contents\n* [Architecture](#architecture)\n* [Usage](#usage)\n  * [Tokens](#tokens)\n  * [Rules](#rules)\n  * [A Note on Parse Trees](#a-note-on-parse-trees)\n  * [Regular Expressions](#regular-expressions)\n  * [Included Programs](#included-programs)\n  * [Options](#options)\n  * [Semantic Actions](#semantic-actions)\n* [C Output](#c-output)\n* [C Output With Semantic Actions](#c-output-with-semantic-actions)\n* [Crystal Output](#crystal-output)\n* [Crystal Output With Semantic Actions](#crystal-output-with-semantic-actions)\n* [JSON Format](#json-format)\n\n## Architecture\nPegasus is based on the UNIX philosophy of doing one thing, and doing it well.\nThe core pegasus program isn't as much a parser generator as it is a Push Down \nAutomaton generator.\n\nPegasus reads the grammar files, creates a Deterministic Finite Automaton (DFA) that is then used to tokenize (lex) text. Then, it creates an\nLALR(1) Push Down Automaton that is then used to parse text. However, it doesn't actually generate a parser: it outputs the generated tables for both automatons,\nas well as some extra information, as JSON. Another program, specific to each\nlanguage, then reads the JSON and is responsible for code output.\n\nThis is beneficial because this prevents the code generator from being dependent on a language. JSON is a data interchange format, and it is easily readable from almost any programming language. If I, or others, want to add a code generation target, they can just parse the JSON in their preferred language, rather than Crystal. An additional benefit is that the addition of a target doesn't require the pegasus core to be updated or recompiled.\n## Usage\nPegasus parses grammars written in very basic notation. The grammars are separated into two\nsections: the __tokens__ and the __rules__.\n### Tokens\nThe tokens are terminals, and are described using\nregular expressions. 
An example token declaration is as follows:\n```\ntoken hello = /hello/;\n```\nNotice that the token declaration is terminated by a semicolon. Also notice that the regular expression is marked at both ends by a forward slash, `/`. In order to write a regular expression that includes a forward slash, it needs to be escaped, like `\\/`. More information on regular expressions accepted by Pegasus can be found below.\n### Rules\nGrammar rules appear after tokens in the grammar file. An example rule is given as follows:\n```\nrule S = hello;\n```\nThis rule uses the token we declared above, that is, `hello`, which matches the string `hello`.\nIn order to expect multiple tokens, we simply write them one after another:\n```\nrule S = hello hello;\n```\nGrammar rules aren't limited to only tokens. The names of other grammar rules, declared either earlier or later in the file, can also be used. For example:\n```\nrule S = two_hello hello;\nrule two_hello = hello hello;\n```\nHere, we declare a second rule, `two_hello`, and then use it in the `S` rule.\n\nSometimes, it's useful to declare several alternatives for a rule. For example, we may want an \"operand\" rule in a basic calculator, where an operand can be either a variable like \"x\" or a number like \"3\". We can write a rule as follows:\n```\nrule operand = number | variable;\n```\n### A Note on Parse Trees\nEarlier, we saw two rules written as follows:\n```\nrule S = two_hello hello;\nrule two_hello = hello hello;\n```\nWhile this grammar accepts the same language, it is __not__ equivalent to the following:\n```\nrule S = hello hello hello;\n```\nThe reason is that Pegasus, by default, produces parse trees. The first grammar will produce\na parse tree whose root node, `S`, has two children, one being `two_hello` and the other being `hello`. The `two_hello` node will have two child nodes, both `hello`. 
However, the second variant will produce a parse tree whose root node, `S`, has three children, all `hello`.\n### Regular Expressions\nRegular\nexpressions support some basic operators:\n* `hey+` matches `hey`, `heyy`, `heyyy`, and so on.\n* `hey?` matches `hey` or `he`.\n* `hey*` matches `he`, `hey`, `heyy`, and so on.\n\nOperators can also be applied to groups of characters:\n* `(ab)+` matches `ab`, `abab`, `ababab`, and so on.\n\nPlease note, however, that Pegasus's lexer does not capture groups.\n### Options\nPegasus supports an experimental mechanism to aid in parser generation, which involves attaching options\nto tokens or rules. Right now, the only recognized option is \"skip\", which is attached to a token definition.\nOptions are declared as follows:\n```\ntoken space = / +/ [ skip ];\n```\nThe skip option means that the token it's attached to, in this case `space`, will be immediately discarded, and parsing will go on\nas if it wasn't there. For example, if we want a whitespace-insensitive list of digits, we can write it as follows:\n```\ntoken space = / +/ [ skip ];\ntoken digit = /[0-9]/;\ntoken list_start = /\\[/;\ntoken list_end = /\\]/;\ntoken comma = /,/;\n\nrule list = list_start list_recursive list_end;\nrule list_recursive = digit | digit comma list_recursive;\n```\nThis grammar parses the strings \"[3]\", \"[ 3 ]\" and \"[ 3]\" equivalently, because the whitespace token is ignored.\n### Semantic Actions\nIt's certainly convenient to create a parse tree that perfectly mimics the structure of a language's grammar. However, this isn't always desirable - if the user wants to construct an Abstract Syntax Tree, they're left having to walk the structure of the resulting tree _again_, frequently checking what rule created a particular nonterminal, or how many children a root node has. This is less than ideal - we don't want to duplicate the work of specifying the grammar when we walk the trees. 
Furthermore, if the grammar changes, the code that walks the parse trees will certainly need to change.\n\nTo remedy this, I've been toying with the idea of adding _semantic actions_ to Pegasus, in a very similar way to Yacc / Bison. Semantic actions are pieces of code that run when a particular rule in the grammar is matched. However, this would mean that the user has to write these actions in some particular language (Yacc / Bison use C/C++). Since Pegasus aims to be language agnostic, writing code in a particular language in the main grammar file is undesirable. Thus, I chose to move semantic actions into a separate file. The format uses `$$` to delimit code blocks, and contains the following sections:\n\n* Types that various nonterminals are assigned. For instance, a boolean expression can be assigned the C++ type \"bool\".\n* The rules that have each of the types declared above.\n* The init code (placed in a global context before the parsing function).\n* The semantic actions for each rule.\n\nFor a concrete example of this file format, see the example code in the [C Output With Semantic Actions](#c-output-with-semantic-actions) section.\n\n### Included Programs\nBefore you use any of these programs, you should run\n```\nshards build --release\n```\nThis will compile all the Pegasus programs in release mode,\nfor optimal performance.\n#### `pegasus`\nThis program reads grammars from standard input, and generates\nJSON descriptions of LALR automata,\nwhich will be read by the other programs. For example:\n```Bash\necho 'token hello = /Hello, world!/; rule S = hello;' \u003e test.grammar\n./bin/pegasus \u003c test.grammar\n```\nThis prints the JSON to standard output. 
If you'd like to output\nJSON to a file, you can use:\n```Bash\n./bin/pegasus \u003c test.grammar \u003e test.json\n```\n#### `pegasus-dot`\nThis program is used largely for debugging purposes, and generates GraphViz\nDOT output, which can then be converted by the `dot` program into images.\nThis greatly helps with debugging generated automata. `pegasus-dot` simply\nreads the generated JSON file:\n```Bash\n./bin/pegasus-dot \u003c test.json\n```\nTo generate a PNG from the DOT output, you need the `dot` program installed.\nOnce you have that, you can just pipe the output of `pegasus-dot` into `dot`:\n```Bash\n./bin/pegasus-dot \u003c test.json | dot -Tpng -o visual.png\n```\n#### `pegasus-sim`\nThis is another program largely used for debugging. Instead of generating\na parser, it reads a JSON file, then attempts to parse text from STDIN.\nOnce it's done, it prints the result of its attempt. Note that because\nSTDIN carries the text to parse rather than the JSON, the JSON\nfile has to be given as a command-line argument:\n```Bash\necho 'Hello, world!' | ./bin/pegasus-sim -i test.json\n```\n\n#### `pegasus-c`\nFinally, a parser generator! `pegasus-c` takes JSON, and creates C\nheader and source files that can then be integrated into your project.\nTo learn how to use the generated code, please take a look at the\n[C output](#c-output) section.\n```Bash\n./bin/pegasus-c \u003c test.json\n```\n\n#### `pegasus-crystal`\nAnother parser generator. `pegasus-crystal` outputs Crystal code\nwhich can then be integrated into your project.\nTo learn how to use the generated code, please take a look at the\n[Crystal output](#crystal-output) section.\n```Bash\n./bin/pegasus-crystal \u003c test.json\n```\n\n#### `pegasus-csem`\nAnother C parser generator. The difference between this parser generator and `pegasus-c` is that it uses a separate semantic actions file to mimic the functionality of Yacc/Bison. This means you can specify C code that runs when each rule in the grammar is matched. 
To learn how to use this parser generator, see the [C Output With Semantic Actions](#c-output-with-semantic-actions) section.\n```Bash\n./bin/pegasus-csem -l test.json -a test.sem\n```\n\n## C Output\nThe pegasus repository contains the source code of a program that converts the JSON output into C source code. It generates a derivation tree, stored in `pgs_tree`, which is made up of nonterminal parent nodes and terminal leaves. Below is a simple example of using the functions generated for a grammar that describes the language of a binary operation applied to two numbers.\nThe grammar:\n```\ntoken op_add = /\\+/;\ntoken op_sub = /-/;\ntoken op_mul = /\\*/;\ntoken op_div = /\\//;\ntoken number = /[0-9]/;\n\nrule S = expr;\nrule expr = number op number;\nrule op = op_add | op_sub | op_div | op_mul;\n```\n_Note: backslashes are necessary in the regular expressions because `+` and `*` are operators in the regular expression language, and `/` delimits the expression._\n\nThe code for the API:\n```C\n/* Include the generated header file */\n#include \"parser.h\"\n#include \u003cstdio.h\u003e\n#include \u003cstring.h\u003e\n\nint main(int argc, char** argv) {\n    pgs_state state; /* The state is used for reporting error messages.*/\n    pgs_tree* tree; /* The tree that will be initialized */\n    char buffer[256]; /* Buffer for string input */\n\n    /* Read one line; fgets replaces gets, which was removed in C11. */\n    if(!fgets(buffer, sizeof(buffer), stdin)) return 1;\n    buffer[strcspn(buffer, \"\\n\")] = '\\0'; /* Strip the trailing newline */\n    /* pgs_do_all lexes and parses the text from the buffer. */\n    if(pgs_do_all(\u0026state, \u0026tree, buffer)) {\n        /* A nonzero return code indicates error. Print it.*/\n        printf(\"Error: %s\\n\", state.errbuff);\n    } else {\n        /* Do nothing, free the tree. */\n        /* Tree is not initialized unless parse succeeds. */\n        pgs_free_tree(tree);\n    }\n}\n```\nThis example is boring because nothing is done with the tree. Let's walk the tree and print it out:\n```C\nvoid print_tree(pgs_tree* tree, const char* source, int indent) {\n    size_t i;\n    /* Print an indent. 
*/\n    for(i = 0; i \u003c indent; i++) printf(\"  \");\n    /* If the tree is a terminal (actual token) */\n    if(tree-\u003evariant == PGS_TREE_TERMINAL) {\n        printf(\"Terminal: %.*s\\n\", (int) (PGS_TREE_T_TO(*tree) - PGS_TREE_T_FROM(*tree)),\n                source + PGS_TREE_T_FROM(*tree));\n    } else {\n        /* PGS_TREE_NT gives the nonterminal ID from the given tree. */\n        printf(\"Nonterminal: %s\\n\", pgs_nonterminal_name(PGS_TREE_NT(*tree)));\n        /* PGS_TREE_NT_COUNT returns the number of children a nonterminal\n           node has. */\n        for(i = 0; i \u003c PGS_TREE_NT_COUNT(*tree); i++) {\n            /* PGS_TREE_NT_CHILD gets the nth child of a nonterminal tree. */\n            print_tree(PGS_TREE_NT_CHILD(*tree, i), source, indent + 1);\n        }\n    }\n}\n```\nFor the input string `3+3`, the program will output:\n```\nNonterminal: S\n  Nonterminal: expr\n    Nonterminal: number\n      Terminal: 3\n    Nonterminal: op\n      Terminal: +\n    Nonterminal: number\n      Terminal: 3\n```\nMore C macros for accessing the trees can be found in `parser.h`.\n\n## C Output With Semantic Actions\nSay you don't need a parse tree. Instead, you want to construct your own values from Pegasus grammar rules. In this case, you want to use the `pegasus-csem` parser generator. It is best demonstrated using a small example. Let's consider a language of booleans:\n```\ntoken whitespace = /[ \\n\\t]+/ [ skip ];\ntoken true = /true/;\ntoken false = /false/;\ntoken and = /and/;\ntoken or = /or/;\n\nrule S = expr;\nrule expr = tkn | expr and tkn | expr or tkn;\nrule tkn = true | false;\n```\nEasy enough. But why would we want a parse tree from this? Let's operate directly on booleans (which we'll represent as integers in C). We create the semantic actions file step by step. First, we know all our actions will produce integers (which represent booleans). 
So we create a boolean type:\n```\ntype boolean = $$ int $$\n```\nNow, we want to assign this type to the nonterminals in our language. We do this as follows:\n```\ntyperules boolean = [ S, expr, tkn ]\n```\nWe don't need any global variables or functions, so we can just leave the `init` block blank:\n```\ninit = $$ $$\n```\nNext, we write actions that correspond to each grammar rule:\n```\nrule S(0) = $$ $out = $0; $$\n```\n`$out` is the \"output\" variable, and `$0` is the value generated for the first terminal or nonterminal in the rule (in this case, `expr`). This rule just forwards the result of the rules for `expr`. Next, let's write rules for `expr`:\n```\nrule expr(0) = $$ $out = $0; $$\nrule expr(1) = $$ $out = $0 \u0026 $2; $$\nrule expr(2) = $$ $out = $0 | $2; $$\n```\nThe first rule simply forwards the value generated for `tkn`. The other two rules combine the results of their subexpressions using `\u0026` and `|` (we use `\u0026` in the grammar rule that has the `and` token, and `|` in the grammar rule that has the `or` token). Finally, we write the rules for `tkn`:\n```\nrule tkn(0) = $$ $out = 1; $$\nrule tkn(1) = $$ $out = 0; $$\n```\nTime to test this. We need to write a simple program that uses the parser. The main difference from the C output without semantic actions is that we use the `pgs_stack_value` union type, with fields named after the types we registered (`boolean`, in this case). The code:\n```C\n#include \"parser.h\"\n#include \u003cstdio.h\u003e /* For printf */\n\nint main() {\n    pgs_stack_value v; /* Temporary variable into which to store the result */\n    pgs_state s; /* The state used for reporting error messages */\n\n    /* Initialize the state */\n    pgs_state_init(\u0026s);\n    /* Tokenize and parse a hardcoded string, ignoring the error code */\n    pgs_do_all(\u0026s, \u0026v, \"false or false or true\");\n    /* Print the error generated, if any */\n    printf(\"%s\\n\", s.errbuff);\n    /* Print the boolean value as an integer. 
*/\n    printf(\"%d\\n\", v.boolean);\n}\n```\nThe output is an empty line (there is no error message), followed by the result of evaluating our expression: \"true\", or 1:\n```\n\n1\n```\n\n## Crystal Output\nJust like with C, this repository contains a program to output Crystal code when given a JSON file.\nBecause Crystal supports exceptions and garbage collection, there is no need to initialize\nany variables, or call corresponding `free` functions. The most basic example of reading\na line from the standard input and parsing it is below:\n```Crystal\nrequire \"./parser.cr\"\n\nPegasus::Generated.process(STDIN.gets.not_nil!)\n```\nOf course, this isn't particularly interesting. Let's add a basic function to print the tree:\n```Crystal\ndef print_tree(tree, indent = 0)\n  indent.times { STDOUT \u003c\u003c \"  \" }\n  case tree\n  when Pegasus::Generated::TerminalTree\n    STDOUT \u003c\u003c \"Terminal: \"\n    STDOUT.puts tree.string\n  when Pegasus::Generated::NonterminalTree\n    STDOUT \u003c\u003c \"Nonterminal: \" \u003c\u003c tree.name\n    STDOUT.puts\n    tree.children.each { |it| print_tree(it, indent + 1) }\n  end\nend\n```\nFor the input string `3+3`, the program will output:\n```\nNonterminal: S\n  Nonterminal: expr\n    Nonterminal: number\n      Terminal: 3\n    Nonterminal: op\n      Terminal: +\n    Nonterminal: number\n      Terminal: 3\n```\n\n## Crystal Output With Semantic Actions\nThis is just like C semantic actions, but with Crystal. Suppose you don't need\na parse tree. Rather, you want to generate your own values from Pegasus grammar\nrules. You can do this with the `pegasus-crystalsem` parser generator. When\nusing this generator, you specify an additional file, which associates Crystal\ncode (_semantic actions_) with each rule. 
Let's consider a language\nof booleans:\n```\ntoken whitespace = /[ \\n\\t]+/ [ skip ];\ntoken true = /true/;\ntoken false = /false/;\ntoken and = /and/;\ntoken or = /or/;\n\nrule S = expr;\nrule expr = tkn | expr and tkn | expr or tkn;\nrule tkn = true | false;\n```\nNow that we have our grammar, it's time to formulate the additional file\nwe mentioned. The first thing we need to do is figure out what Crystal\ntype each of the nonterminals evaluates to. Our language is that\nof booleans, so we need a boolean type:\n```\ntype boolean = $$ Bool $$\n```\nHere, the text inside the `$$` is Crystal code that is pasted verbatim into the\ngenerated parser. Now, we want to specify which rules evaluate to that type.\nIn our simple language, every rule evaluates to a boolean:\n```\ntyperules boolean = [ S, expr, tkn ]\n```\n`pegasus-crystalsem` also allows you to put some code above the parsing code,\nglobally. We don't use this, so we leave the `init` property blank:\n```\ninit = $$ $$\n```\nIt is now time to assign Crystal semantic actions to each grammar rule. We\nstart with the first rule, `S(0)` (which means the first rule for the\n`S` nonterminal). Since the first rule just matches an `expr`, we\nsimply output the value of that `expr`:\n```\nrule S(0) = $$ $out = $0 $$\n```\nThis means \"set the output to be the value of the first element in the rule's body\".\nWe now implement the actual rules for `expr`. The first rule simply forwards\nthe result of the `tkn`, just like the rule for `S`. The other two rules actually\nimplement the logical operations of `\u0026` and `|`:\n```\nrule expr(0) = $$ $out = $0 $$\nrule expr(1) = $$ $out = $0 \u0026 $2 $$\nrule expr(2) = $$ $out = $0 | $2 $$\n```\nFinally, we use the two rules for `tkn` to actually return a boolean:\n```\nrule tkn(0) = $$ $out = true $$\nrule tkn(1) = $$ $out = false $$\n```\nLet's test this. 
We include the generated parser, and write the following:\n```Crystal\nrequire \"./parser.cr\"\n\nputs Pegasus::Generated.process(gets.not_nil!)\n```\nLet's now run this with the expression `true or false or true`. The output:\n```\ntrue\n```\nThat's indeed our answer!\n\n## JSON Format\nFor the grammar given by:\n```\ntoken hi = /hi/;\nrule A = hi;\n```\nThe corresponding (pretty-printed) JSON output is:\n```\n{\n  \"lex_state_table\":[[..]..],\n  \"lex_final_table\":[..],\n  \"parse_state_table\":[[..]..],\n  \"parse_action_table\":[[..]..],\n  \"terminals\":{\n    \"hi\":{\n      \"terminal_id\":0\n    }\n  },\n  \"nonterminals\":{\n    \"A\":{\n      \"nonterminal_id\":0\n    }\n  },\n  \"items\":[\n    {\n      \"head\":{\n        \"nonterminal_id\":0\n      },\n      \"body\":[\n        {\n          \"terminal_id\":0\n        }\n      ]\n    }\n  ],\n  \"max_terminal\":0\n}\n```\n## Contributors\n\n- [DanilaFe](https://github.com/DanilaFe) Danila Fedorin - creator, maintainer\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanilafe%2Fpegasus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdanilafe%2Fpegasus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanilafe%2Fpegasus/lists"}