https://github.com/glebec/left-recursion
Quick explanation of eliminating left recursion in Haskell parsers
- Host: GitHub
- URL: https://github.com/glebec/left-recursion
- Owner: glebec
- License: bsd-3-clause
- Created: 2019-10-28T07:22:26.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2021-09-13T02:06:07.000Z (almost 4 years ago)
- Last Synced: 2025-03-17T23:49:22.671Z (4 months ago)
- Topics: cfg, grammars, left-recursion-elimination, parsing, recursive-descent-parser
- Language: Haskell
- Size: 27.3 KB
- Stars: 48
- Watchers: 2
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# Elimination of Left Recursion
> 👉 Note: common parsing libraries include combinators like [`chainl1`](https://hackage.haskell.org/package/parsec-3.1.14.0/docs/Text-Parsec.html#v:chainl1) which explicitly handle left-recursive grammars without needing to refactor the grammar as shown here. I recommend using such combinators where provided / feasible. Leaving this repo up for reference's sake.
Parser combinators are expressive and easy to embed and reuse within an application. However, they implement _recursive descent_ parsing algorithms, which cannot parse _left-recursive_ grammars. Thankfully, there exists a simple technique to _eliminate_ left recursion in most grammars.
These concepts are detailed [here](https://www.csd.uwo.ca/~moreno/CS447/Lectures/Syntax.html/node8.html) and elsewhere, but typically in the academic jargon of context-free grammars and parsing theory. In contrast, this codebase aims to demonstrate the problem and fix for those familiar with Haskell fundamentals.
## The Setup
Imagine we have a [small data structure](src/Expr.hs) representing a potentially recursive tree of subtraction expressions (similar to [Hutton's Razor](http://www.cs.nott.ac.uk/~pszgmh/semantics.pdf)).
```hs
data Expr = Lit Int | Sub Expr Expr

exampleExpr = Sub (Lit 4) (Sub (Lit 3) (Lit 0))
```

The _string language_ we may want to parse _into_ this data structure could consist of digits, parens, subtraction symbols and so on:
```hs
str1 = "1"
str2 = "1-3"
str3 = "(9)"
str4 = "0-(3-8)-(((2))-(2-1))"
```

This is just a toy example, but it demonstrates the idea that our _language_ (the set of legal strings) may include more tokens than are explicitly represented in our _target_ (the result of parsing).
### Grammars
To organize our thoughts we might try to draft a _grammar_, describing the _production rules_ that can generate all legal strings. Grammars are often written in [BNF](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) or [EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form) syntax, but this example is hopefully understandable without prior knowledge:
```ebnf
EXPR = SUB | GROUP | LIT
SUB = EXPR, "-", EXPR
GROUP = "(", EXPR, ")"
LIT = "0" | "1" | "2" | ... | "9"
```

In other words,
- "An `EXPR`ession is either a `SUB`traction, `GROUP`, or `LIT`eral"
- "A `SUB`traction is an `EXPR`ession, followed by '-', followed by an `EXPR`ession"
- "A `GROUP` is '(', followed by an `EXPR`ession, followed by ')'"
- "A `LIT`eral is either '0', or '1', or... (etc.)"

Grammars consist of _terminal_ symbols like "-" and "3", which actually appear in the language strings, and _nonterminal_ placeholders like `LIT`, which do not. To build an arbitrary legal string by hand, start from the `EXPR` placeholder, and replace it with anything on the right side of its corresponding rule (`SUB`, `GROUP`, or `LIT`). Proceed with replacing nonterminal placeholders with permitted substitutions until your string consists solely of terminal symbols. For example:
```
EXPR
LIT
4
```

Or:
```
EXPR
GROUP
(EXPR)
(SUB)
(EXPR-EXPR)
(LIT-EXPR)
(5-EXPR)
(5-LIT)
(5-2)
```

### Grammars and Parser Combinators
Remarkably, defining a valid _grammar_ for a language (that is, the set of rules that can generate any legal string in the language) is almost the same as defining a working set of _parsers_ for the language (that is, the functions which can analyze an existing string for its structure). Even though these activities (generating vs. consuming strings) are in some ways opposite, their forms are comparable.
So, the context-free _production rule_:
```ebnf
GROUP = "(", EXPR, ")"
```

Corresponds directly to the Haskell (via `trifecta`) _parser_:
```hs
group :: Parser Expr
group = char '(' *> expr <* char ')'
```

(Or in monadic style with `do` notation, if you prefer:)

```hs
group :: Parser Expr
group = do
  char '('
  e <- expr
  char ')'
  pure e
```

## The Problem
The grammar shown earlier is in fact a 100% valid grammar for the expression language we wish to parse. That is, the grammar is capable of _producing any arbitrary string_ of the language, including examples like `"0-(3-8)-(((2))-(2-1))"`.
We want to go backwards – analyze an existing string. In [`src/Broken.hs`](src/Broken.hs), we attempt to structure our parser combinator outline according to this grammar. However, if you attempt to use that parser (not recommended!) on a simple string like "1", it will result in an infinite loop. Why? Let's review the first two lines of the grammar, and their corresponding parsers:
```ebnf
EXPR = SUB | GROUP | LIT
SUB = EXPR, "-", EXPR
```

```hs
-- Grammar rule: EXPR = SUB | GROUP | LIT
expr :: Parser Expr
expr = sub <|> group <|> lit -- first try `sub`...

-- Grammar rule: SUB = EXPR, "-", EXPR
sub :: Parser Expr
sub = do
  e1 <- expr -- now do `expr`. WARNING: infinite recursion!
  char '-'
  e2 <- expr
  pure $ Sub e1 e2
```

The sequence of events when parsing a string like "1" via the `expr` parser is as follows:
1. Hm, an `expr` might be a `sub`, let's try that parser.
2. Ok, a `sub` begins with an `expr`, let's try that parser. (GOTO 1)

At this point the issue becomes quite clear! Even though this grammar is a valid one for _producing_ arbitrary strings, it is not a useful one for _parsing_ strings via recursive descent; it immediately enters into an infinite loop. This is because the grammar is _left-recursive_. Informally, a left-recursive grammar:
- features a production rule of the form `A = A ... | ...` which loops on itself immediately, or...
- features a set of production rules `A = B ... | ...`, `B = C ... | ...`, `C = A ... | ...` which loop around eventually.

Parsers are allowed to be recursive, so long as there exists the possibility for the parser to exit the loop. A parser cannot **unconditionally** recurse on itself – that is recursion without a base case, a classic programming error.
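As a contrast to left recursion, here is a minimal sketch of recursion that is safe for recursive descent. It is not taken from the repo: it assumes `Text.ParserCombinators.ReadP` from `base` in place of `trifecta` (so it runs without extra packages), and `depth` and `runDepth` are hypothetical names. The recursive call sits behind `char '('`, so each trip around the loop consumes a character, and `pure 0` supplies the base case. This is the same reason the `GROUP` rule is harmless even though it mentions `EXPR`.

```hs
import Text.ParserCombinators.ReadP (ReadP, char, eof, readP_to_S, (<++))

-- Hypothetical example: count how deeply a run of parentheses is nested.
-- The recursion is guarded by `char '('`, and `pure 0` is the base case.
-- (<++) is ReadP's left-biased choice: try the left side, else the right.
depth :: ReadP Int
depth = nestedCase <++ pure 0
  where
    nestedCase = succ <$> (char '(' *> depth <* char ')')

-- Run the parser, requiring that the whole input is consumed.
-- Each result is paired with the (here always empty) remaining input.
runDepth :: String -> [(Int, String)]
runDepth = readP_to_S (depth <* eof)
```

With these definitions, `runDepth "((()))"` should yield `[(3, "")]`, and unbalanced input such as `"("` yields no results rather than looping.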
### Attempting a Fix
A naive attempt at solving the problem might just change the _order_ of rules without modifying their structure. For example, perhaps we place the `SUB` rule at the end of `EXPR`?
```ebnf
EXPR = GROUP | LIT | SUB
GROUP = "(", EXPR, ")"
LIT = "0" | "1" | "2" | ... | "9"
SUB = EXPR, "-", EXPR
```

```hs
expr :: Parser Expr
expr = group <|> lit <|> sub -- first try `group`...
```

This is again a valid grammar, but does us no good for parsing. A string like `"(1)-2"` would be parsed as the group `"(1)"` yielding `Lit 1`, and then stop, failing to consume the remaining `"-2"` string. Our parser now terminates on this input, but without ever attempting the recursive case! (On input that begins with neither a group nor a literal, it would still fall through to `sub` and loop as before.) We will need a different approach.
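The leftover input can be observed directly. The following is a sketch under assumptions, not the repo's code: it uses `Text.ParserCombinators.ReadP` from `base` rather than `trifecta`, with ReadP's left-biased `<++` standing in for `<|>`, and `leftover` is a hypothetical name. `readP_to_S` returns each result together with the input that was not consumed.

```hs
import Data.Char (digitToInt, isDigit)
import Text.ParserCombinators.ReadP (ReadP, char, readP_to_S, satisfy, (<++))

data Expr = Lit Int | Sub Expr Expr
  deriving (Eq, Show)

-- EXPR = GROUP | LIT | SUB, with left-biased choice:
-- the first alternative that succeeds wins.
expr :: ReadP Expr
expr = group <++ lit <++ sub

group :: ReadP Expr
group = char '(' *> expr <* char ')'

lit :: ReadP Expr
lit = Lit . digitToInt <$> satisfy isDigit

-- Still left-recursive; only reached when `group` and `lit` both fail.
sub :: ReadP Expr
sub = Sub <$> expr <* char '-' <*> expr

-- Parse "(1)-2" and keep whatever input is left over.
leftover :: [(Expr, String)]
leftover = readP_to_S expr "(1)-2"
```

Here `leftover` should be `[(Lit 1, "-2")]`: the group is parsed, and the subtraction is never attempted.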
## The Solution
The technique, which will work in most cases, is to identify the left-recursive path `A => A ...` and split it up into two stages: a "start" and "end" step. The "start" step will be mandatory; the "end" step will be effectively optional, by _allowing empty results_.
Before:
```ebnf
EXPR = SUB | GROUP | LIT
SUB = EXPR, "-", EXPR
...
```

After:
```ebnf
EXPR = START, END
START = GROUP | LIT
END = "-", EXPR | NOTHING
...
```

(This "NOTHING" result is typically written in grammars using the Greek letter epsilon `𝜀`, and it corresponds to the empty string.)
Notice that the misbehaving `SUB` rule disappears entirely! It has instead been _split up_ across the `START` rule (which parses a chunk of information) and the `END` rule (which **might** parse the continuation of a subtraction, with recursive right-hand expression, or might give up).
In Haskell, we can represent this "successful parse of nothing" using the famous `Maybe` datatype.
```hs
-- Grammar rule: EXPR = START, END
expr :: Parser Expr
expr = do
  e1 <- start
  mE2 <- end
  case mE2 of
    Nothing -> pure e1
    Just e2 -> pure $ Sub e1 e2

-- Grammar rule: START = GROUP | LIT
start :: Parser Expr
start = group <|> lit

-- Grammar rule: END = "-", EXPR | NOTHING
end :: Parser (Maybe Expr)
end = getEnd <|> pure Nothing where
  getEnd = do
    char '-'
    e <- expr
    pure $ Just e
```

Because `end` is recursive – the `expr` it parses itself consists of a new `start` and `end` – you can keep parsing an indefinite chain of subtractions, exactly analogous to a cons list. And just like the famous cons list, that chain of nested parses ends when you hit an empty case (`<|> pure Nothing`, when no `-` symbol is encountered).
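To make the cons-list shape visible, here is a self-contained sketch of the same three rules that can be loaded into GHCi without extra packages. It assumes `Text.ParserCombinators.ReadP` from `base` instead of `trifecta`, uses ReadP's left-biased `<++` where the code above uses `<|>`, and adds a hypothetical helper `parseExpr` that insists on consuming the whole input.

```hs
import Data.Char (digitToInt, isDigit)
import Text.ParserCombinators.ReadP (ReadP, char, eof, readP_to_S, satisfy, (<++))

data Expr = Lit Int | Sub Expr Expr
  deriving (Eq, Show)

-- EXPR = START, END
expr :: ReadP Expr
expr = do
  e1 <- start
  mE2 <- end
  case mE2 of
    Nothing -> pure e1
    Just e2 -> pure $ Sub e1 e2

-- START = GROUP | LIT
start :: ReadP Expr
start = group <++ lit

-- END = "-", EXPR | NOTHING
end :: ReadP (Maybe Expr)
end = (Just <$> (char '-' *> expr)) <++ pure Nothing

group :: ReadP Expr
group = char '(' *> expr <* char ')'

lit :: ReadP Expr
lit = Lit . digitToInt <$> satisfy isDigit

-- Hypothetical helper: succeed only if the whole input is one expression.
parseExpr :: String -> Maybe Expr
parseExpr s = case readP_to_S (expr <* eof) s of
  [(e, "")] -> Just e
  _         -> Nothing
```

With this sketch, `parseExpr "1-2-3"` should give `Just (Sub (Lit 1) (Sub (Lit 2) (Lit 3)))`. The chain nests to the right, like a cons list, so the result reads as `1-(2-3)`. Arithmetic convention reads `1-2-3` as `(1-2)-3`; when that matters, a left-associating combinator such as the `chainl1` mentioned in the note at the top is the usual tool.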
Bubbling the information back up, our `expr` parser has to now react to both possibilities:
- Either no `end` was encountered (i.e., no `-` symbol), meaning this is NOT a subtraction expression; or,
- An `end` was built, in case this WAS a subtraction expression.

### Step-by-Step
Let's trace through parsing the string "1" again:
1. Hm, an `expr` begins with `start`
2. The `start` is either a `group` (nope) or a `lit` (yep!)
3. Continuing where we left off, the `expr` ends with `end`
4. `end` either begins with "-" (nope) or it's nothing (yep!)
5. So we have a successful `e1` expression, and `Nothing` for `e2`; guess we just return `e1` (which is `Lit 1`).

What about parsing a subtraction like "1-1"?
1. Hm, an `expr` begins with `start`
2. The `start` is either a `group` (nope) or a `lit` (yep!)
3. Continuing where we left off, the `expr` ends with `end`
4. `end` either begins with "-" (yep!) or it's nothing (nope)
5. Since we matched "-", `end` now continues on with a new `expr`
6. RECURSE: The new `expr` follows the same path as "1" above
7. We have a successful `e1` expression, and also a successful `e2` expression; time to return a `Sub e1 e2`.

## Conclusion
This is meant as a Haskeller-approachable introduction to the _elimination of left recursion_ for recursive descent parsers. The full set of techniques as explained [here](https://www.csd.uwo.ca/~moreno/CS447/Lectures/Syntax.html/node8.html) includes additional examples and variations. I hope you find it helpful, and please let me know if I've made any mistakes.
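For comparison with the note at the top of this README, here is a minimal sketch of the `chainl1` approach. It is an illustration under assumptions rather than code from this repo: it uses the `chainl1` exported by `Text.ParserCombinators.ReadP` in `base`, which has the same shape as the Parsec combinator linked above, and `operand` and `parseExpr` are hypothetical names. `chainl1` parses one or more operands separated by an operator and folds them to the left.

```hs
import Data.Char (digitToInt, isDigit)
import Text.ParserCombinators.ReadP
  (ReadP, chainl1, char, eof, readP_to_S, satisfy, (<++))

data Expr = Lit Int | Sub Expr Expr
  deriving (Eq, Show)

-- One or more operands separated by '-', combined left to right with Sub.
expr :: ReadP Expr
expr = operand `chainl1` (Sub <$ char '-')

-- An operand is a parenthesized expression or a single digit.
operand :: ReadP Expr
operand = group <++ lit

group :: ReadP Expr
group = char '(' *> expr <* char ')'

lit :: ReadP Expr
lit = Lit . digitToInt <$> satisfy isDigit

-- Hypothetical helper: succeed only if the whole input is one expression.
parseExpr :: String -> Maybe Expr
parseExpr s = case readP_to_S (expr <* eof) s of
  [(e, "")] -> Just e
  _         -> Nothing
```

Here `parseExpr "1-2-3"` should give `Just (Sub (Sub (Lit 1) (Lit 2)) (Lit 3))`, the left-nested tree, with no grammar refactoring required.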