{"id":13438778,"url":"https://github.com/goccmack/gogll","last_synced_at":"2025-04-10T03:54:48.430Z","repository":{"id":54304578,"uuid":"206346212","full_name":"goccmack/gogll","owner":"goccmack","description":"Generates generalised LL (GLL) and reduced size LR(1) parsers with matching lexers","archived":false,"fork":false,"pushed_at":"2023-07-31T07:04:09.000Z","size":19504,"stargazers_count":197,"open_issues_count":1,"forks_count":24,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-04-10T03:54:40.117Z","etag":null,"topics":["compiler-construction","compiler-frontend","context-free-grammars","gll","go","golang","lexer-generator","lr-1","parser-generator","rust","rust-lang","rustlang"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/goccmack.png","metadata":{"files":{"readme":"Readme.md","changelog":"ChangeLog.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-09-04T15:00:58.000Z","updated_at":"2025-03-12T03:40:31.000Z","dependencies_parsed_at":"2023-01-29T21:31:17.295Z","dependency_job_id":"acdcf74b-e9a3-4cea-802e-be8cf80ac6f3","html_url":"https://github.com/goccmack/gogll","commit_stats":null,"previous_names":[],"tags_count":41,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/goccmack%2Fgogll","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/goccmack%2Fgogll/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/goccmack%2Fgogll/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/goccmack%2Fgogll/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/goccmack","download_url":"https://codeload.github.com/goccmack/gogll/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248154999,"owners_count":21056542,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compiler-construction","compiler-frontend","context-free-grammars","gll","go","golang","lexer-generator","lr-1","parser-generator","rust","rust-lang","rustlang"],"created_at":"2024-07-31T03:01:08.374Z","updated_at":"2025-04-10T03:54:48.407Z","avatar_url":"https://github.com/goccmack.png","language":"Go","funding_links":[],"categories":["HarmonyOS"],"sub_categories":["Windows Manager"],"readme":"![Apache 2.0 License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)\n[![Build Status](https://github.com/goccmack/gogll/workflows/build/badge.svg)](https://github.com/goccmack/gogll/actions)\n\nCopyright 2019 Marius Ackerman. \n\n# Note\nThis version does not support Rust. Please use v3.2.0 for Rust or log an issue if you need the features of this version in Rust.\n\n# GoGLL\nGogll generates a GLL or LR(1) parser and FSA-based lexer for any context-free grammar. \nThe generated code is Go or Rust.\n\n[Click here](https://goccmack.github.io/posts/2020-05-31_gogll/) for an introduction\nto GLL.\n\nSee the [LR(1) documentation](doc/lr1/Readme.md) for generating LR(1) parsers.\n\nThe generated GLL parser is a clustered nonterminal parser (CNP) following \n[[Scott et al 2019](#Scott-et-al-2019)]. 
\nCNP is a version of generalised LL parsing (GLL) \n[[Scott \u0026 Johnstone 2016](#Scott-et-al-2016)]. \nGLL parsers can parse all context-free (CF) languages.\n\nThe generated LR(1) parser is either Pager's PGM or Knuth's original LR(1) \nmachine [[Pager 1977](#Pager-1977)].\n\nThe generated lexer is a linear-time finite state automaton (FSA) \n[[Grune et al 2012](#Grune-et-al-2012)].\nThe lexer ignores whitespace.\n\nGogll accepts grammars in markdown files, which is very useful for documenting the grammar.\nFor example, see\n[gogll's own grammar](gogll.md).\n\nGLL has worst-case cubic time and space complexity but linear complexity for all \nLL productions [[Scott \u0026 Johnstone 2016](#Scott-et-al-2016)]. \n[See here](https://goccmack.github.io/posts/2020-05-31_gogll/) for space and\nCPU time measurements of an extremely ambiguous example.\nFor comparable grammars tested so far gogll produces faster lexers and parsers \nthan [gocc](https://github.com/goccmack/gocc) (FSA/LR-1).\n\n# News\n## 2022-10-11\nSPPF extraction added to the generated code. See [boolx example](examples/boolx/SPPF.md)\n## 2022-08-09\nGogll is used to build DAU [DASL](https://dau-technology.github.io/dau-blog/post/2022-08-02-dasl/)\n\n## 2020-08-12\nFrom v3.2.0 gogll supports tokens that can be suppressed by the lexer. This is useful, for example, to implement code comments. See [example](examples/comments/comments.md).\n\n## 2020-06-28\n1. Gogll now also generates LR(1) parsers. It supports \n_Pager's Practical General Method, weak compatibility_ as well as \n_Knuth's original LR(1) machine_ for\ncomparison. Pager's PGM generates LR(1) parser tables similar in size to LALR. \nThe option to generate a Knuth LR(1) machine is provided for reference.  \nSee [LR(1) documentation](doc/lr1/Readme.md) for details.\n\n2. 
Please note that the `-t \u003ctarget\u003e` option has been replaced by `-go` and `-rust`.\nSee [usage](#Usage) below.\n\n## 2020-06-01\n[See](https://goccmack.github.io/posts/2020-05-31_gogll/) for an introduction\nto GLL and a performance comparison of the generated Go and Rust parsers.\n\n## 2020-05-22\nGoGLL v3.1 generates Rust as well as Go parsers with similar performance:\n\n| | Lexer | Parser | Build |\n|---|---|---|---|\n| Go | 119 μs | 1324 μs | 0.124 s |\n| Rust | 71 μs | 1297 μs | 2.932 s |\n\n1. The duration was averaged over 1000 repetitions.\n1. Build time was measured with the `time` command.\n    1. For Rust: `time cargo build --release`\n    2. For Go: `time go build`\n\nSee [examples/rust](examples/rust/Readme.md) for the Rust and Go programs used \nfor this comparison.\n\nUse gogll's target option to generate a Rust lexer/parser: `-t rust` (see [usage](#Usage) below). \nGogll generates Go code by default.\n\n\n## 2020-04-24\n1. GoGLL now generates a linear-time FSA lexer matching the CNP parser.\n1. This version of *GoGLL is faster than gocc*. It compiles a sample grammar in  \n0.074 s, which GoCC compiles in 0.118 s. Gogll compiles itself in 0.041 s.\n\n# Benefits and disadvantages of GLL and LR(1)\nGLL is a parsing technique that can handle any context-free (CF) language. GLL has\nworst-case cubic time and space complexity.\n\nLR(1) handles a subset of the context-free languages that can be parsed bottom-up\nwith one token look-ahead. LR(1) has linear time complexity and its table-driven\nparser is very efficient. Pager's _Practical General Method_ (PGM) combines\ncompatible states as they are generated, keeping the state space small.\n\nA GLL parser has more expensive bookkeeping than an LR(1) parser, making the \nLR(1) parser more efficient for parsing very large inputs.\n\n## When to use GLL\n1. When the CF grammar that best expresses the problem is not LR(1).\n2. 
When the LR(1) parser has more than a few conflicts that require additional\nlanguage symbols or complex grammar refactorisation to resolve.\n3. When the inputs to be parsed are not too big. GLL works very well for DSLs or \nprogramming languages.\n\n## When to use LR(1)\n1. When the language can be expressed as an LR(1) grammar. A grammar is LR(1) if \ngogll can generate a conflict-free parser for it.\n2. When the input is very big, for example, log files containing tens of thousands\nof lines.\n\n# Motivation for a separate lexer\nThe following observations were made while using GoGLL v2 on a couple of projects.\n\n* Most of the ambiguity in grammars was generated by the lexical rules.\n* Handling token separation explicitly produces messy, hard-to-maintain grammars.\n* Most of a grammar input file is whitespace, which, together with the additional \nambiguity introduced by the lexical rules, causes most of the parse time in a \nscannerless parser.\n* Writing good markdown with the grammar produced slow compilations.\n\n# Input Symbols, Markdown Files\nGogll and lexers generated by gogll accept UTF-8 input strings, which may be in \na markdown file or a plain text file.\n\nIf the input is a markdown file, gogll and lexers generated by gogll treat all \ntext outside markdown code blocks as whitespace. Markdown code blocks are \ndelimited by triple backticks. See [gogll.md](gogll.md) for an example.\n\n# Gogll Grammar\nGogll v3 has a BNF grammar. See [gogll.md](gogll.md)\n\n# Installation\n1. Install Go from [https://golang.org](https://golang.org)\n1. `go install github.com/goccmack/gogll/v3@latest` or \n1. 
Clone this repository and run `go install` in the root of the directory where\nit is installed.\n\n# Usage\nEnter `gogll -h` or `gogll` for the following help:\n\n```\nuse: gogll -h\n    for help, or\n\nuse: gogll -version\n    to display the version of goggl, or\n\nuse: gogll [-a][-v] [-CPUProf] [-o \u003cout dir\u003e] [-go] [-rust] [-gll] [-pager] [-knuth] [-resolve_conflicts] \u003csource file\u003e\n    to generate a lexer and parser.\n\n    \u003csource file\u003e: Mandatory. Name of the source file to be processed. \n        If the file extension is \".md\" the bnf is extracted from markdown code \n        segments enclosed in triple backticks.\n    \n    -a: Optional. Regenerate all files.\n        WARNING: This may destroy user editing in the LR(1) AST.\n        Default: false\n         \n    -v: Optional. Produce verbose output, including first and follow sets,\n        LR(1) sets and lexer FSA sets.\n    \n    -o \u003cout dir\u003e: Optional. The directory to which code will be generated.\n                  Default: the same directory as \u003csource file\u003e.\n                  \n    -go: Optional. Generate Go code.\n          Default: true, but false if -rust is selected\n\n    -rust: Optional. Generate Rust code.\n           Default: false\n           \n    -gll: Optional. Generate a GLL parser.\n          Default true. False if -knuth or -pager is selected.\n                  \n    -knuth: Optional. Generate a Knuth LR(1) parser\n            Default false\n\n    -pager: Optional. Generate a Pager PGM LR(1) parser.\n            Default false\n\n    -resolve_conflicts: Optional. Automatically resolve LR(1) conflicts.\n            Default: false. Only used when generating LR(1) parsers.\n    \n    -bs: Optional. Print BSR statistics (GLL only).\n    \n    -CPUProf : Optional. Generate a CPU profile. Default false.\n        The generated CPU profile is in \u003ccpu.prof\u003e. 
\n        Use \"go tool pprof cpu.prof\" to analyse the profile.\n```\n\n# Using the generated lexer and parser\n1. Create a lexer:  \nFrom an `[]rune`:\n```\n\tlexer.New(input []rune) *Lexer\n```\n  or from a file. If the file extension is `.md` the lexer will \n  treat all text outside the markdown code blocks as whitespace.\n```\n\tlexer.NewFile(fname string) *Lexer\n```\n2. Parse the lexer:  \n```\n\tif err, errs := parser.Parse(lex); err != nil {...}\n```\n3. Check for ambiguities in the parse forest:\n```\n\tif bsr.IsAmbiguous() {\n\t\tfmt.Println(\"Error: Ambiguous parse forest\")\n\t\tbsr.ReportAmbiguous()\n\t\tos.Exit(1)\n\t}\n```\nAmbiguous BSRs must be resolved by walking the parse forest and ignoring\nunwanted children of ambiguous NTs (see [Complete Example](#Complete-Example)).\n4. Use the disambiguated parse tree for the further stages of compilation. \nFor example, see gogll's [AST builder](ast/build.go).\n\n\u003ca name=\"Complete-Example\"\u003e\u003c/a\u003e\n# Complete Example\nThe code of the following example can be found at [examples/boolx](examples/boolx/boolx.md). \nThe example has the following grammar: [boolx.md](examples/boolx/boolx.md), which generates boolean expressions such as: `a | b \u0026 c | d \u0026 e`:\n\n```\npackage \"github.com/goccmack/gogll/examples/boolx\"\n\nExpr :   var\n     |   Expr Op Expr\n     ;\n\nvar : letter ;\n\nOp : \"\u0026\" | \"|\" ; \n```\nThe second alternate above, `Expr : Expr Op Expr`, is ambiguous and can produce an ambiguous parse forest.\nThe grammar does not enforce operator precedence; \nthis has to be done during semantic analysis.\n\nThe grammar is compiled by the following command:\n```\ngogll examples/boolx/boolx.md\n```\n\nThe test file, [boolx_test.go](examples/boolx/boolx_test.go), shows the steps\nrequired to parse an input string and produce a disambiguated abstract syntax tree:\n\n```\nconst t1Src = `a | b \u0026 c | d \u0026 e`\n\nfunc Test1(t *testing.T) {\n```\n1. 
Create a lexer from the input string and parse. Fail if there are parse errors.\n```\n\tif err, errs := parser.Parse(lexer.New([]rune(t1Src))); err != nil {\n\t\tfail(errs)\n\t}\n\n```\n2. Build an abstract syntax tree for each root of the parse forest and print them.\n```\n\tfor i, r := range bsr.GetRoots() {\n\t\tfmt.Printf(\"%d: %s\\n\", i, buildExpr(r))\n\t}\n}\n```\nThe input string produces an ambiguous parse forest, which is partially \ndisambiguated by applying operator precedence.\nWe get the following output from this test:\n```\n\u003e go test -v ./examples/boolx\n=== RUN   Test1\n0: (a | ((b \u0026 c) | (d \u0026 e)))\n1: \u003cnil\u003e\n2: \u003cnil\u003e\n3: ((a | (b \u0026 c)) | (d \u0026 e))\n--- PASS: Test1 (0.00s)\nPASS\n```\nThe output shows that the parse forest has 4 roots, 2 of which produce valid ASTs \nafter disambiguation. The removed trees are syntactically valid but semantically\ninvalid because they give `|` precedence over `\u0026`. \nBoth remaining ASTs are syntactically and semantically\nvalid. The AST encodes operator precedence as shown by the parentheses. \nThe choice of which valid AST to use for further processing is application-specific.\n\nIn this example disambiguation by operator precedence is applied during the\nAST build. \n\nOur AST has only one type of node: `Expr`.\n```\ntype ExprType int\n\nconst (\n\tExpr_Var ExprType = iota\n\tExpr_Expr\n)\n\ntype Expr struct {\n\tType  ExprType\n\tVar   *token.Token\n\tOp    *token.Token\n\tLeft  *Expr\n\tRight *Expr\n}\n\n```\nA node can represent a variable (`Type` = `Expr_Var`) or an expression (`Type` = `Expr_Expr`).\nIf the node represents a variable the field `Var` contains the variable token. 
\nOtherwise `Op` contains the operator token and `Left` and `Right` contain the nodes\nof the sub-expressions.\n\nThe AST is constructed recursively from each BSR root by the function, `buildExpr`\nin [boolx_test.go](examples/boolx/boolx_test.go).\n\n```\n/*\nExpr :   var\n     |   Expr Op Expr\n     ;\nOp : \"\u0026\" | \"|\" ;\n*/\nfunc buildExpr(b bsr.BSR) *Expr {\n\t/*** Expr :   var ***/\n\tif b.Alternate() == 0 {\n\t\treturn \u0026Expr{\n\t\t\tType: Expr_Var,\n\t\t\tVar:  b.GetTChildI(0),\n\t\t}\n\t}\n\n\t/*** Expr : Expr Op Expr ***/\n\top := b.GetNTChildI(1). // Op is symbol 1 of the Expr rule\n\t\t\t\tGetTChildI(0) // The operator token is symbol 0 for both alternates of the Op rule\n\n\t// Build the left subexpression Node. The subtree for it may be ambiguous.\n\tleft := []*Expr{}\n\t// b.GetNTChildrenI(0) returns all the valid BSRs for symbol 0 of the body of the rule.\n\tfor _, le := range b.GetNTChildrenI(0) {\n\t\t// Add subexpression if it is valid and has precedence over this expression\n\t\tif e := buildExpr(le); e != nil \u0026\u0026 hasPrecedence(e, op) {\n\t\t\tleft = append(left, e)\n\t\t}\n\t}\n\t// No valid subexpressions therefore this whole expression is invalid\n\tif len(left) == 0 {\n\t\treturn nil\n\t}\n\t// Belts and braces\n\tif len(left) \u003e 1 {\n\t\tpanic(fmt.Sprintf(\"%s has %d left children\", b, len(left)))\n\t}\n\t// Do the same for the right subexpression\n\tright := []*Expr{}\n\tfor _, le := range b.GetNTChildrenI(2) {\n\t\tif e := buildExpr(le); e != nil \u0026\u0026 hasPrecedence(e, op) {\n\t\t\tright = append(right, e)\n\t\t}\n\t}\n\tif len(right) == 0 {\n\t\treturn nil\n\t}\n\tif len(right) \u003e 1 {\n\t\tpanic(fmt.Sprintf(\"%s has %d right children\", b, len(right)))\n\t}\n\n\t// return an expression node\n\treturn \u0026Expr{\n\t\tType:  Expr_Expr,\n\t\tOp:    op,\n\t\tLeft:  left[0],\n\t\tRight: right[0],\n\t}\n}\n```\n\n# Status\n* `gogll v3` generates a matching lexer and parser. 
It generates GLL and LR(1) \nparsers. v3 compiles itself.\nv3 is used in a real-world project.\n* `gogll v2` had the last vestiges of the bootstrap compiler grammar removed from\nits input grammar. v2 compiled itself.\n* `gogll v1` was a GLL scannerless parser, which compiled scannerless GLL parsers.\nv1 compiled itself.\n* `gogll v0` was a bootstrap compiler implemented by a [gocc](https://github.com/goccmack/gocc) lexer and parser.\n\n# Features considered for future implementation\n1. Tokens suppressed by the lexer, e.g.: code comments.\n1. Better error reporting.\n1. Better documentation, including how to traverse the binary subtree representation (BSR [Scott et al 2019](#Scott-et-al-2019)) of the parse forest as well as on disambiguating \nparse forests.\n1. Letting the parser direct which tokens to scan [Scott \u0026 Johnstone 2019](#Scott-et-al-2019a)\n\n# Documentation\nAt the moment this document and the [gogll grammar](gogll.md) are the only documentation. Have a look at \n`gogll/examples/ambiguous` for a simple example and also for simple disambiguation.\n\nAlternatively look at `gogll.md` which is the input grammar and also the grammar\nfrom which the `parser` for this version of `gogll` was generated. `gogll/da` disambiguates the parse forest for an input string.\n\n## LR(1)\nSee the [LR(1) documentation](doc/lr1/Readme.md).\n\n# Changelog\n[see](ChangeLog.md)\n\n# Bibliography\n\u003ca name=Pager-1977\u003e\u003c/a\u003e\n* [Pager 1977] David Pager   \nA Practical General Method for Constructing LR(k) Parsers   \nActa Informatica 7, 1977\n\n\u003ca name=Scott-et-al-2019a\u003e\u003c/a\u003e\n* [Scott \u0026 Johnstone 2019] Elizabeth Scott and Adrian Johnstone  \nMultiple lexicalisation (a Java based study)  \nIn: [Proceedings of Software Language Engineering 2019. ACM, 2019. p. 
71-82](https://pure.royalholloway.ac.uk/portal/files/34483813/lcnpSubmitFromEASForPure.pdf)\n\n\u003ca name=\"Scott-et-al-2019\"\u003e\u003c/a\u003e\n* [Scott et al 2019] Elizabeth Scott, Adrian Johnstone and L. Thomas van Binsbergen.  \nDerivation representation using binary subtree sets.  \nIn: Science of Computer Programming (175) 2019\n\n\u003ca name=\"Scott-et-al-2018\"\u003e\u003c/a\u003e\n* [Scott \u0026 Johnstone 2018] Elizabeth Scott and Adrian Johnstone.   \nGLL Syntax Analysers For EBNF Grammars.   \nIn: [Science of Computer Programming\nVolume 166, 15 November 2018](https://pure.royalholloway.ac.uk/portal/en/publications/gll-syntax-analysers-for-ebnf-grammars(58d1ec5e-28df-486a-879e-36d58a9f8abf).html)\n\n\u003ca name=\"Scott-et-al-2016\"\u003e\u003c/a\u003e\n* [Scott \u0026 Johnstone 2016] Elizabeth Scott and Adrian Johnstone.   \nStructuring the GLL parsing algorithm for performance.   \nIn: [Science of Computer Programming\nVolume 125, 1 September 2016](https://pure.royalholloway.ac.uk/portal/en/publications/structuring-the-gll-parsing-algorithm-for-performance(a95fc020-9918-4f17-a87a-845e2aee12b8).html)\n\n\u003ca name=\"Afroozeh-et-al-2013\"\u003e\u003c/a\u003e\n* [Afroozeh et al 2013] Ali Afroozeh, Mark van den Brand, Adrian Johnstone, Elizabeth Scott, Jurgen Vinju.   \nSafe Specification of Operator Precedence Rules.   \nIn: [Erwig M., Paige R.F., Van Wyk E. (eds) Software Language Engineering. SLE 2013. Lecture Notes in Computer Science, vol 8225. Springer, Cham](https://pure.royalholloway.ac.uk/portal/en/publications/safe-specification-of-operator-precedence-rules(0287d70e-92b8-4204-aafb-15a81de84968).html)\n\n\u003ca name=\"Grune-et-al-2012\"\u003e\u003c/a\u003e\n* [Grune et al 2012] Dick Grune, Kees van Reeuwijk, Henri E. Bal, Ceriel J.H. Jacobs and Koen Langendoen.\nModern Compiler Design. Second Edition.\nSpringer 2012\n\n\u003ca name=\"Basten-2012\"\u003e\u003c/a\u003e\n* [Basten \u0026 Vinju 2012] Basten H.J.S., Vinju J.J. 
(2012) Parse Forest Diagnostics with Dr. Ambiguity. In: Sloane A., Aßmann U. (eds) Software Language Engineering. SLE 2011. [Lecture Notes in Computer Science, vol 6940. Springer, Berlin, Heidelberg](https://homepages.cwi.nl/~jurgenv/papers/SLE2011-2.pdf)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoccmack%2Fgogll","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoccmack%2Fgogll","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoccmack%2Fgogll/lists"}