# `lexmachine` - Lexical Analysis Framework for Golang

By Tim Henderson

Copyright 2014-2017, All Rights Reserved. Made available for public use under
the terms of a BSD 3-Clause license.

[![GoDoc](https://godoc.org/github.com/timtadh/lexmachine?status.svg)](https://godoc.org/github.com/timtadh/lexmachine)
[![ReportCard](https://goreportcard.com/badge/github.com/timtadh/lexmachine)](https://goreportcard.com/report/github.com/timtadh/lexmachine)

## What?

`lexmachine` is a full lexical analysis framework for the Go programming
language. It supports a restricted but usable set of regular expressions
appropriate for writing lexers for complex programming languages. The framework
also supports sub-lexers and non-regular lexing through an "escape hatch" which
allows the user to consume any number of further bytes after a match. So if you
want to support nested C-style comments or other paired structures, you can do
so at the lexical analysis stage.

Subscribe to the [mailing
list](https://groups.google.com/forum/#!forum/lexmachine-users) to get
announcements of major changes, new versions, and important patches.

## Goal

`lexmachine` intends to be the best, fastest, and easiest to use lexical
analysis system for Go.

1. [Documentation Links](#documentation)
1. [Narrative Documentation](#narrative-documentation)
1. [Regular Expressions in `lexmachine`](#regular-expressions)
1. [History](#history)
1. [Complete Example](#complete-example)

## Documentation

-   [Tutorial](http://hackthology.com/writing-a-lexer-in-go-with-lexmachine.html)
-   [How It Works](http://hackthology.com/faster-tokenization-with-a-dfa-backend-for-lexmachine.html)
-   [Narrative Documentation](#narrative-documentation)
-   [Using `lexmachine` with `goyacc`](https://github.com/timtadh/lexmachine/tree/master/examples/sensors-parser)
    Required reading if you want to use `lexmachine` with the standard yacc
    implementation for Go (or its derivatives).
-   [![GoDoc](https://godoc.org/github.com/timtadh/lexmachine?status.svg)](https://godoc.org/github.com/timtadh/lexmachine)

### What is in the Box

`lexmachine` includes the following components:

1.  A parser for a restricted set of regular expressions.
2.  An abstract syntax tree (AST) for regular expressions.
3.  A backpatching code generator which compiles the AST to (NFA) machine code.
4.  Both DFA (Deterministic Finite Automata) and NFA (Non-deterministic Finite
    Automata) simulation based lexical analysis engines. Lexical analysis
    engines work in a slightly different way from a normal regular expression
    engine as they tokenize a stream rather than matching one string.
5.  Match objects which include start and end column and line numbers of the
    lexemes as well as their associated token name.
6.  A declarative "DSL" for specifying the lexers.
7.  An "escape hatch" which allows one to match non-regular tokens by consuming
    any number of further bytes after the match.

## Narrative Documentation

`lexmachine` splits strings into substrings and categorizes each substring. In
compiler design, the substrings are referred to as *lexemes* and the
categories are referred to as *token types* or just *tokens*. The categories are
defined by *patterns* which are specified using [regular
expressions](#regular-expressions). The process of splitting up a string is
sometimes called *tokenization*, *lexical analysis*, or *lexing*.

### Defining a Lexer

The set of patterns (regular expressions) used to *tokenize* (split up and
categorize) is called a *lexer*. Lexers are first-class objects in
`lexmachine`. They can be defined once and re-used over and over again to
tokenize multiple strings. After the lexer has been defined it will be compiled
(either explicitly or implicitly) into either a Non-deterministic Finite
Automaton (NFA) or Deterministic Finite Automaton (DFA). The automaton is then
used (and re-used) to tokenize strings.

#### Creating a new Lexer

```go
lexer := lexmachine.NewLexer()
```

#### Adding a pattern

Let's pretend we want a lexer which only recognizes one category: strings which
match the word "wild" capitalized or not (e.g. Wild, wild, WILD, ...). That
expression is denoted: `[Ww][Ii][Ll][Dd]`. Patterns are added using the `Add`
function:

```go
lexer.Add([]byte(`[Ww][Ii][Ll][Dd]`), func(s *lexmachine.Scanner, m *machines.Match) (interface{}, error) {
    return 0, nil
})
```

`Add` takes two arguments: the pattern and a callback function called a *lexing
action*. The action allows you, the programmer, to transform the low level
`machines.Match` object (from `github.com/timtadh/lexmachine/machines`) into an
object meaningful for your program. As an example, let's define a few token
types and a token object. Then we will construct appropriate action functions.

```go
Tokens := []string{
    "WILD",
    "SPACE",
    "BANG",
}
TokenIds := make(map[string]int)
for i, tok := range Tokens {
    TokenIds[tok] = i
}
```

Now that we have defined a set of three tokens (WILD, SPACE, BANG), let's create
a token object:

```go
type Token struct {
    TokenType int
    Lexeme    string
    Match     *machines.Match
}
```

Now let's make a helper function which takes a `Match` and a token type and
creates a Token.

```go
func NewToken(tokenType string, m *machines.Match) *Token {
    return &Token{
        TokenType: TokenIds[tokenType], // defined above
        Lexeme:    string(m.Bytes),
        Match:     m,
    }
}
```

Now we write an action for the previous pattern:

```go
lexer.Add([]byte(`[Ww][Ii][Ll][Dd]`), func(s *lexmachine.Scanner, m *machines.Match) (interface{}, error) {
    return NewToken("WILD", m), nil
})
```

Writing the action functions can get tedious; a good idea is to create a helper
function which produces these action functions:

```go
func token(tokenType string) func(*lexmachine.Scanner, *machines.Match) (interface{}, error) {
    return func(s *lexmachine.Scanner, m *machines.Match) (interface{}, error) {
        return NewToken(tokenType, m), nil
    }
}
```

Then adding patterns for our three tokens is concise:

```go
lexer.Add([]byte(`[Ww][Ii][Ll][Dd]`), token("WILD"))
lexer.Add([]byte(` `), token("SPACE"))
lexer.Add([]byte(`!`), token("BANG"))
```

#### Built-in Token Type

Many programs use similar representations for tokens. `lexmachine` provides a
completely optional `Token` object you can use in lieu of writing your own.

```go
type Token struct {
    Type        int
    Value       interface{}
    Lexeme      []byte
    TC          int
    StartLine   int
    StartColumn int
    EndLine     int
    EndColumn   int
}
```

Here is an example of constructing a lexer `Action` which turns a
`machines.Match` struct into a token using the scanner's `Token` helper
function.

```go
func token(name string, tokenIds map[string]int) lex.Action {
    return func(s *lex.Scanner, m *machines.Match) (interface{}, error) {
        return s.Token(tokenIds[name], string(m.Bytes), m), nil
    }
}
```

#### Adding Multiple Patterns

When constructing a lexer for a complex computer language, tokens often have
patterns which overlap -- multiple patterns could match the same strings. To
address this problem lexical analysis engines follow two rules when choosing
which pattern to match:

1. Pick the pattern which matches the longest prefix of unmatched text.
2. Break ties by picking the pattern which appears earlier in the user supplied
   list.

For example, let's pretend we are writing a lexer for Python. Python has a bunch
of keywords in it such as `class` and `def`. However, it also has identifiers
which match the pattern `[A-Za-z_][A-Za-z0-9_]*`. That pattern also matches the
keywords. If we were to define the lexer as:

```go
lexer.Add([]byte(`[A-Za-z_][A-Za-z0-9_]*`), token("ID"))
lexer.Add([]byte(`class`), token("CLASS"))
lexer.Add([]byte(`def`), token("DEF"))
```

then the keywords `class` and `def` would never be found because the ID token
would take precedence. The correct way to solve this problem is by putting the
keywords first:

```go
lexer.Add([]byte(`class`), token("CLASS"))
lexer.Add([]byte(`def`), token("DEF"))
lexer.Add([]byte(`[A-Za-z_][A-Za-z0-9_]*`), token("ID"))
```

#### Skipping Patterns

Sometimes it is advantageous to not emit tokens for certain patterns and to
instead skip them. Commonly this occurs for whitespace and comments. To skip a
pattern simply have the action `return nil, nil`:

```go
lexer.Add(
    []byte("( |\t|\n)"),
    func(scan *Scanner, match *machines.Match) (interface{}, error) {
        // skip white space
        return nil, nil
    },
)
lexer.Add(
    []byte("//[^\n]*\n"),
    func(scan *Scanner, match *machines.Match) (interface{}, error) {
        // skip comments
        return nil, nil
    },
)
```

#### Compiling the Lexer

`lexmachine` uses the theory of [finite state
machines](http://hackthology.com/faster-tokenization-with-a-dfa-backend-for-lexmachine.html)
to efficiently tokenize text. So what is a finite state machine? A finite state
machine is a mathematical construct made up of a set of states, a labeled
starting state, and a set of accepting states. A transition function moves from
one state to another state based on an input character. In lexing, two kinds of
state machine are commonly used: non-deterministic and deterministic.

Before a lexer (like the ones described above) can be used it must be compiled
into either a Non-deterministic Finite Automaton (NFA) or a [Deterministic
Finite Automaton
(DFA)](http://hackthology.com/faster-tokenization-with-a-dfa-backend-for-lexmachine.html).
The difference between the two (from a practical perspective) is *construction
time* and *match efficiency*.

Construction time is the amount of time it takes to turn a set of regular
expressions into a state machine (also called a finite state automaton). For an
NFA it is O(`r`) where `r` is the length of the regular expression. However, for
a DFA it could be as bad as O(`2^r`), but in practical terms it is rarely worse
than O(`r^3`). The DFAs in `lexmachine` are also automatically *minimized*,
which reduces the amount of memory they consume and takes O(`r*log(log(r))`)
steps.

However, construction time is an upfront cost. If your program is tokenizing
multiple strings it is less important than match efficiency. Let's say a string
has length `n`. An NFA can tokenize such a string in O(`n*r`) steps while a DFA
can tokenize the string in O(`n`). For larger languages `r` becomes a
significant overhead.

By default, `lexmachine` uses a DFA. To explicitly invoke compilation call
`Compile`:

```go
err := lexer.Compile()
if err != nil {
    // handle err
}
```

To explicitly compile a DFA (in case of changes to the default behavior of
`Compile`):

```go
err := lexer.CompileDFA()
if err != nil {
    // handle err
}
```

To explicitly compile an NFA:

```go
err := lexer.CompileNFA()
if err != nil {
    // handle err
}
```

### Tokenizing a String

To tokenize (lex) a string, construct a `Scanner` object using the lexer. This
will compile the lexer if it has not already been compiled.

```go
scanner, err := lexer.Scanner([]byte("some text to lex"))
if err != nil {
    // handle err
}
```

The scanner object is an iterator which yields the next token (or error) by
calling the `Next()` method:

```go
for tok, err, eos := scanner.Next(); !eos; tok, err, eos = scanner.Next() {
    if _, is := err.(*machines.UnconsumedInput); is {
        // to skip the error instead:
        // ui := err.(*machines.UnconsumedInput)
        // scanner.TC = ui.FailTC
        // continue
        return err
    } else if err != nil {
        return err
    }
    fmt.Println(tok)
}
```

Let's break down that first line:

```go
for tok, err, eos := scanner.Next();
```

The `Next()` method returns three things: the token (`tok`) if there is one, an
error (`err`) if there is one, and `eos`, a boolean which indicates whether
the End Of String (EOS) has been reached.

```go
; !eos;
```

Iteration proceeds until the EOS has been reached.

```go
; tok, err, eos = scanner.Next() {
```

The update block calls `Next()` again to get the next token. In each iteration
of the loop the first thing a client **must** do is check for an error.

```go
    if err != nil {
        return err
    }
```

This prevents an infinite loop on an unexpected character or other bad token. To
skip bad tokens check to see if the `err` is a `*machines.UnconsumedInput`
object and reset the scanner's text counter (`scanner.TC`) to point to the end
of the failed token.

```go
    if ui, is := err.(*machines.UnconsumedInput); is {
        scanner.TC = ui.FailTC
        continue
    }
```

Finally, a client can make use of the token produced by the scanner (if there
was no error):

```go
    fmt.Println(tok)
```

### Dealing with Non-regular Tokens

`lexmachine`, like most lexical analysis frameworks, primarily deals with
patterns which are represented by regular expressions. However, sometimes a
language has a token which is "non-regular." A pattern is non-regular if there
is no regular expression (or finite automaton) which can express the pattern.
For instance, if you wanted to define a pattern which matches only consecutive
balanced parentheses: `()`, `()()()`, `((()()))()()`, ... you would quickly find
there is no regular expression which can express this language. The reason is
simple: a finite automaton cannot "count" or keep track of how many opening
parentheses it has seen.

This problem arises in many programming languages when dealing with nested
"C-style" comments. Supporting the nesting means solving the "balanced
parenthesis" problem. Luckily, `lexmachine` provides an "escape hatch" to deal
with these situations in the `Action` functions. All actions receive a pointer
to the `Scanner`. The scanner (as discussed above) has a public modifiable field
called `TC` which stands for text counter. Any action can *modify* the text
counter to point at the desired position it would like the scanner to resume
scanning from.

An example of using this feature for tokenizing nested "C-style" comments is
below:

```go
lexer.Add(
    []byte("/\\*"),
    func(scan *Scanner, match *machines.Match) (interface{}, error) {
        for tc := scan.TC; tc < len(scan.Text); tc++ {
            if scan.Text[tc] == '\\' {
                // the next character is skipped
                tc++
            } else if scan.Text[tc] == '*' && tc+1 < len(scan.Text) {
                if scan.Text[tc+1] == '/' {
                    // set the text counter to point to after the
                    // end of the comment. This will cause the
                    // scanner to resume after the comment instead
                    // of picking up in the middle.
                    scan.TC = tc + 2
                    // don't return a token to skip the comment
                    return nil, nil
                }
            }
        }
        return nil,
            fmt.Errorf("unclosed comment starting at %d, (%d, %d)",
                match.TC, match.StartLine, match.StartColumn)
    },
)
```

## Regular Expressions

Lexmachine (like most lexical analysis frameworks) uses [Regular
Expressions](https://en.wikipedia.org/wiki/Regular_expression) to specify the
*patterns* to match when splitting the string up into categorized *tokens.*
For a more advanced introduction to regular expression engines see Russ Cox's
[articles](https://swtch.com/~rsc/regexp/). To learn more about how regular
expressions are used to *tokenize* strings take a look at Alex Aiken's [video
lectures](https://youtu.be/SRhkfvqeA1M) on the subject. Finally, Aho *et al.*
give a thorough treatment of the subject in the [Dragon
Book](http://www.worldcat.org/oclc/951336275) Chapter 3.

A regular expression is a *pattern* which *matches* a set of strings. It is made
up of *characters* such as `a` or `b`, characters with special meanings (such as
`.` which matches any character), and operators. The regular expression `abc`
matches exactly one string `abc`.

### Character Expressions

In lexmachine most characters (e.g. `a`, `b` or `#`) represent themselves. Some
have special meanings (as detailed below in operators). However, all characters
can be matched literally by prefixing the character with a `\`.

#### Any Character

`.` matches any character.

#### Special Characters

1. `\` use `\\` to match
2. newline use `\n` to match
3. carriage return use `\r` to match
4. tab use `\t` to match
5. `.` use `\.` to match
6. operators: {`|`, `+`, `*`, `?`, `(`, `)`, `[`, `]`, `^`} prefix with a `\` to
   match.

#### Character Classes

Sometimes it is advantageous to match a variety of characters. For instance, if
you want to ignore capitalization for the word `Capitol` you could write the
expression `[Cc]apitol`, which would match both `Capitol` and `capitol`. There
are two forms of character classes:

1. `[abcd]` matches all the letters inside the `[]` (e.g. that pattern matches
   the strings `a`, `b`, `c`, `d`).
2. `[a-d]` matches the range of characters between the character before the dash
   (`a`) and the character after the dash (`d`) (e.g. that pattern matches
   the strings `a`, `b`, `c`, `d`).

These two forms may be combined:

For instance, `[a-zA-Z123]` matches the strings {`a`, `b`, ..., `z`, `A`, `B`,
... `Z`, `1`, `2`, `3`}

#### Inverted Character Classes

Sometimes it is easier to specify the characters you don't want to match than
the characters you do. For instance, suppose you want to match any character but
a lower case one. This can be achieved using an inverted class: `[^a-z]`. An
inverted class is specified by putting a `^` just after the opening bracket.

#### Built-in Character Classes

1. `\d` = `[0-9]` (the digit class)
2. `\D` = `[^0-9]` (the not a digit class)
3. `\s` = `[ \t\n\r\f]` (the space class), where `\f` is a form feed (note:
   `\f` is not a special sequence in lexmachine; if you want to specify the form
   feed character (ASCII 0x0c) use `[]byte{12}`)
4. `\S` = `[^ \t\n\r\f]` (the not a space class)
5. `\w` = `[0-9a-zA-Z_]` (the word class)
6. `\W` = `[^0-9a-zA-Z_]` (the not a word class)

### Operators

1. The pipe operator `|` indicates alternative choices. For instance the
   expression `a|b` matches either the string `a` or the string `b` but not `ab`
   or `ba` or the empty string.

2. The parenthesis operator `()` groups a subexpression together. For instance
   the expression `a(b|c)d` matches `abd` or `acd` but not `abcd`.

3. The star operator `*` indicates the "starred" subexpression should match zero
   or more times. For instance, `a*` matches the empty string, `a`, `aa`, `aaa`
   and so on.

4. The plus operator `+` indicates the "plussed" subexpression should match one
   or more times. For instance, `a+` matches `a`, `aa`, `aaa` and so on.

5. The maybe operator `?` indicates the "questioned" subexpression should match
   zero or one times. For instance, `a?` matches the empty string and `a`.

### Grammar

The canonical grammar is found in the handwritten recursive descent
[parser](https://github.com/timtadh/lexmachine/blob/master/frontend/parser.go).
This section should be considered documentation, not specification.

Note: e stands for the empty string

```
Regex -> Alternation

Alternation -> AtomicOps Alternation'

Alternation' -> `|` AtomicOps Alternation'
              | e

AtomicOps -> AtomicOp AtomicOps
           | e

AtomicOp -> Atomic
          | Atomic Ops

Ops -> Op Ops
     | e

Op -> `+`
    | `*`
    | `?`

Atomic -> Char
        | Group

Group -> `(` Alternation `)`

Char -> CHAR
      | CharClass

CharClass -> `[` Range `]`
           | `[` `^` Range `]`

Range -> CharClassItem Range'

Range' -> CharClassItem Range'
        | e

CharClassItem -> BYTE
               | BYTE `-` BYTE

CHAR -> matches any character except '|', '+', '*', '?', '(', ')', '[', ']', '^'
        unless escaped. Additionally '.' is returned as the wildcard character
        which matches any character. Built-in character classes are also handled
        here.

BYTE -> matches any byte
```

## History

This library was started when I was teaching EECS 337 *Compiler Design and
Implementation* at Case Western Reserve University in the Fall of 2014. I wrote
two compilers: one was "hidden" from the students, as the language it
implemented was their project language. The other was
[tcel](https://github.com/timtadh/tcel), which was written initially as an
example of how to do type checking. That compiler was later expanded to explain
AST interpretation, intermediate code generation, and x86 code generation.

## Complete Example

### Using the Lexer

```go
package main

import (
    "fmt"
    "log"
)

import (
    "github.com/timtadh/lexmachine"
    "github.com/timtadh/lexmachine/machines"
)

func main() {
    s, err := Lexer.Scanner([]byte(`digraph {
  rankdir=LR;
  a [label="a" shape=box];
  c [<label>=<<u>C</u>>];
  b [label="bb"];
  a -> c;
  c -> b;
  d -> c;
  b -> a;
  b -> e;
  e -> f;
}`))
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("Type    | Lexeme     | Position")
    fmt.Println("--------+------------+------------")
    for tok, err, eof := s.Next(); !eof; tok, err, eof = s.Next() {
        if _, is := err.(*machines.UnconsumedInput); is {
            // to skip the bad token do:
            // ui := err.(*machines.UnconsumedInput)
            // s.TC = ui.FailTC
            log.Fatal(err) // however, we will just fail the program
        } else if err != nil {
            log.Fatal(err)
        }
        token := tok.(*lexmachine.Token)
        fmt.Printf("%-7v | %-10v | %v:%v-%v:%v\n",
            Tokens[token.Type],
            string(token.Lexeme),
            token.StartLine,
            token.StartColumn,
            token.EndLine,
            token.EndColumn)
    }
}
```

### Lexer Definition

```go
package main

import (
    "fmt"
    "strings"
)

import (
    "github.com/timtadh/lexmachine"
    "github.com/timtadh/lexmachine/machines"
)

var Literals []string       // The tokens representing literal strings
var Keywords []string       // The keyword tokens
var Tokens []string         // All of the tokens (including literals and keywords)
var TokenIds map[string]int // A map from the token names to their int ids
var Lexer *lexmachine.Lexer // The lexer object. Use this to construct a Scanner

// Called at package initialization. Creates the lexer and populates token lists.
func init() {
    initTokens()
    var err error
    Lexer, err = initLexer()
    if err != nil {
        panic(err)
    }
}

func initTokens() {
    Literals = []string{
        "[",
        "]",
        "{",
        "}",
        "=",
        ",",
        ";",
        ":",
        "->",
        "--",
    }
    Keywords = []string{
        "NODE",
        "EDGE",
        "GRAPH",
        "DIGRAPH",
        "SUBGRAPH",
        "STRICT",
    }
    Tokens = []string{
        "COMMENT",
        "ID",
    }
    Tokens = append(Tokens, Keywords...)
    Tokens = append(Tokens, Literals...)
    TokenIds = make(map[string]int)
    for i, tok := range Tokens {
        TokenIds[tok] = i
    }
}

// Creates the lexer object and compiles the NFA.
func initLexer() (*lexmachine.Lexer, error) {
    lexer := lexmachine.NewLexer()

    for _, lit := range Literals {
        r := "\\" + strings.Join(strings.Split(lit, ""), "\\")
        lexer.Add([]byte(r), token(lit))
    }
    for _, name := range Keywords {
        lexer.Add([]byte(strings.ToLower(name)), token(name))
    }

    lexer.Add([]byte(`//[^\n]*\n?`), token("COMMENT"))
    lexer.Add([]byte(`/\*([^*]|\r|\n|(\*+([^*/]|\r|\n)))*\*+/`), token("COMMENT"))
    lexer.Add([]byte(`([a-z]|[A-Z]|[0-9]|_)+`), token("ID"))
    lexer.Add([]byte(`[0-9]*\.[0-9]+`), token("ID"))
    lexer.Add([]byte(`"([^\\"]|(\\.))*"`),
        func(scan *lexmachine.Scanner, match *machines.Match) (interface{}, error) {
            x, _ := token("ID")(scan, match)
            t := x.(*lexmachine.Token)
            v := t.Value.(string)
            t.Value = v[1 : len(v)-1]
            return t, nil
        })
    lexer.Add([]byte("( |\t|\n|\r)+"), skip)
    lexer.Add([]byte(`\<`),
        func(scan *lexmachine.Scanner, match *machines.Match) (interface{}, error) {
            str := make([]byte, 0, 10)
            str = append(str, match.Bytes...)
            brackets := 1
            match.EndLine = match.StartLine
            match.EndColumn = match.StartColumn
            for tc := scan.TC; tc < len(scan.Text); tc++ {
                str = append(str, scan.Text[tc])
                match.EndColumn += 1
                if scan.Text[tc] == '\n' {
                    match.EndLine += 1
                }
                if scan.Text[tc] == '<' {
                    brackets += 1
                } else if scan.Text[tc] == '>' {
                    brackets -= 1
                }
                if brackets == 0 {
                    match.TC = scan.TC
                    scan.TC = tc + 1
                    match.Bytes = str
                    x, _ := token("ID")(scan, match)
                    t := x.(*lexmachine.Token)
                    v := t.Value.(string)
                    t.Value = v[1 : len(v)-1]
                    return t, nil
                }
            }
            return nil,
                fmt.Errorf("unclosed HTML literal starting at %d, (%d, %d)",
                    match.TC, match.StartLine, match.StartColumn)
        },
    )

    err := lexer.Compile()
    if err != nil {
        return nil, err
    }
    return lexer, nil
}

// a lexmachine.Action function which skips the match.
func skip(*lexmachine.Scanner, *machines.Match) (interface{}, error) {
    return nil, nil
}

// a lexmachine.Action function which constructs a Token of the given token type
// by the token type's name.
func token(name string) lexmachine.Action {
    return func(s *lexmachine.Scanner, m *machines.Match) (interface{}, error) {
        return s.Token(TokenIds[name], string(m.Bytes), m), nil
    }
}
```