# lexgen: A fully-featured lexer generator, implemented as a proc macro

```rust
use lexgen::lexer;
use lexgen_util::Loc;

lexer! {
    // The first line specifies the name of the lexer and the token type
    // returned by semantic actions
    Lexer -> Token;

    // Regular expressions can be named with `let` syntax
    let init = ['a'-'z'];
    let subseq = $init | ['A'-'Z' '0'-'9' '-' '_'];

    // Rule sets have names. Each rule set is compiled to a separate DFA.
    // Switching between rule sets is done explicitly in semantic actions.
    rule Init {
        // Rules without a right-hand side, for skipping whitespace,
        // comments, etc.
        [' ' '\t' '\n']+,

        // Rule for matching identifiers
        $init $subseq* => |lexer| {
            let token = Token::Id(lexer.match_().to_owned());
            lexer.return_(token)
        },
    }
}

// The token type
#[derive(Debug, PartialEq, Eq)]
enum Token {
    // An identifier
    Id(String),
}

// Generated lexers are initialized with a `&str` for the input
let mut lexer = Lexer::new(" abc123Q-t  z9_9");

// Lexers implement `Iterator<Item = Result<(Loc, T, Loc), LexerError>>`,
// where `T` is the token type specified in the lexer definition (`Token` in
// this case), and the `Loc`s give the line, column, and byte indices of the
// beginning and end of each lexeme.
assert_eq!(
    lexer.next(),
    Some(Ok((
        Loc { line: 0, col: 1, byte_idx: 1 },
        Token::Id("abc123Q-t".to_owned()),
        Loc { line: 0, col: 10, byte_idx: 10 }
    )))
);
assert_eq!(
    lexer.next(),
    Some(Ok((
        Loc { line: 0, col: 12, byte_idx: 12 },
        Token::Id("z9_9".to_owned()),
        Loc { line: 0, col: 16, byte_idx: 16 }
    )))
);
assert_eq!(lexer.next(), None);
```

See also:

- [Simple lexer definitions in tests][1]
- [A full Lua 5.1 lexer][2]
- [An example that uses lexgen with LALRPOP][3]
- [A lexer for a simpler version of OCaml][4]
- [A Rust lexer][5]
- [A parse event generator][6]

## Motivation

Implementing lexing is often (along with parsing) the most tedious part of
implementing a language. Lexer generators make this much easier, but the
existing lexer generators for Rust lack essential features for practical use
and/or require a pre-processing step when building.

My goal with lexgen is to have a feature-complete and easy-to-use lexer
generator.

## Usage

lexgen doesn't require a build step. Add the same versions of `lexgen` and
`lexgen_util` as dependencies to your `Cargo.toml`.
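For example (the version number below is illustrative; use the latest
published version, and keep the two crates at the same version):

```toml
[dependencies]
lexgen = "0.16"
lexgen_util = "0.16"
```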
## Lexer syntax

lexgen lexers start with the name of the generated lexer struct, an optional
user state part, and the token type (the type of the values returned by
semantic actions). Example:

```rust
lexer! {
    Lexer(LexerState) -> Token;
    ...
}
```

Here the generated lexer type will be named `Lexer`. The user state type is
`LexerState` (this type should be defined by the user). The token type is
`Token`.

After the lexer name, user state type, and token type, we define the rule
sets:

```rust
rule Init {
    ...
}

rule SomeOtherRule {
    ...
}
```

The first rule set defines the initial state of the lexer and needs to be
named `Init`.

In the body of a `rule` block we define the rules for that lexer state. The
syntax for a rule is `<regex> => <semantic action>,`. Regex syntax is
described below. A semantic action is any Rust code with the type `fn(&mut
Lexer) -> SemanticActionResult<Token>`, where `Lexer` is the generated lexer
struct. More on these types below.

Regular expressions can be named with the `let <name> = <regex>;` syntax.
Example:

```rust
let init = ['a'-'z'];
let subseq = $init | ['A'-'Z' '0'-'9' '-' '_'];

// Named regexes can be used with the `$` prefix
$init $subseq* => |lexer| { ... }
```

You can omit the `rule Init { ... }` part and have all of your rules at the
top level if you don't need rule sets.

In summary:

- The first line is of the form `<lexer name>(<user state type>) -> <token
  type name>`. The `(<user state type>)` part can be omitted for stateless
  lexers.

- Next come the rule sets. There should be at least one rule set with the
  name `Init`, which is the initial state of the lexer.

- `let` bindings can be added at the top level or inside `rule`s.

## Regex syntax

Regex syntax can be used on the right-hand side of `let` bindings and the
left-hand side of rules. The syntax is:

- `$var` for variables defined in the `let` binding section. Variables need
  to be defined before they are used.
- `$$var` for built-in regexes (see the "Built-in regular expressions"
  section below).
- Rust character syntax for characters, e.g. `'a'`.
- Rust string syntax for strings, e.g. `"abc"`.
- `[...]` for character sets. Inside the brackets you can have one or more of:

  - Characters
  - Character ranges: e.g. `'a'-'z'`

  Here's an example character set for ASCII alphanumerics: `['a'-'z' 'A'-'Z'
  '0'-'9']`
- `_` for matching any character
- `$` for matching end-of-input
- `<regex>*` for zero or more repetitions of `<regex>`
- `<regex>+` for one or more repetitions of `<regex>`
- `<regex>?` for zero or one repetitions of `<regex>`
- `<regex> <regex>` for concatenation
- `<regex> | <regex>` for alternation: match the first one or the second one.
- `<regex> # <regex>` for difference: match characters in the first regex
  that are not in the second regex. Note that the regexes on the left and
  right of `#` must be "character sets", i.e. `*`, `+`, `?`, `"..."`, `$`,
  and concatenation are not allowed.
  Variables that are bound to character sets are allowed.

Binding powers (precedences), from highest to lowest:

- `*`, `+`, `?`
- `#`
- Concatenation
- `|`

You can use parentheses for grouping, e.g. `('a' | 'b')*`.

Example: `'a' 'b' | 'c'+` is the same as `(('a' 'b') | ('c'+))`.

## Right context (lookahead)

A rule in a rule set can be followed by another regex using the `> <regex>`
syntax, for right context. Right context is a limited form of lookahead: it
can only appear after a top-level regex for a rule, and it cannot be nested
inside a regex.

For example, the rule left-hand side `'a' > (_ # 'b')` matches `'a'` as long
as it's not followed by `'b'`.

See also the [right context tests] for more examples.

[right context tests]: https://github.com/osa1/lexgen/blob/main/crates/lexgen/tests/right_ctx.rs
## Built-in regular expressions

lexgen comes with a set of built-in regular expressions. The regular
expressions listed below match the same sets of characters as their Rust
counterparts. For example, `$$alphabetic` matches the same set of characters
as Rust's [`char::is_alphabetic`]:

- `$$alphabetic`
- `$$alphanumeric`
- `$$ascii`
- `$$ascii_alphabetic`
- `$$ascii_alphanumeric`
- `$$ascii_control`
- `$$ascii_digit`
- `$$ascii_graphic`
- `$$ascii_hexdigit`
- `$$ascii_lowercase`
- `$$ascii_punctuation`
- `$$ascii_uppercase`
- `$$ascii_whitespace`
- `$$control`
- `$$lowercase`
- `$$numeric`
- `$$uppercase`
- `$$whitespace`

(Note that the generated code doesn't use Rust's `char` methods. For simple
cases like `$$ascii` we generate simple range checks. For more complicated
cases like `$$lowercase` we generate a binary search table and run a binary
search when checking a character.)

In addition, these two built-in regular expressions match the Unicode
[XID_Start and XID_Continue] character classes:

- `$$XID_Start`
- `$$XID_Continue`

[`char::is_alphabetic`]: https://doc.rust-lang.org/std/primitive.char.html#method.is_alphabetic
[XID_Start and XID_Continue]: http://www.unicode.org/reports/tr31/

## Rule syntax

- `<regex> => <semantic action>,`: `<regex>` syntax is as described above.
  `<semantic action>` is any Rust code with the type `fn(&mut Lexer) ->
  SemanticActionResult<Token>`. More on the `SemanticActionResult` type in
  the next section.

- `<regex> =? <semantic action>,`: fallible actions. This syntax is similar
  to the syntax above, except `<semantic action>` has the type `fn(&mut
  Lexer) -> SemanticActionResult<Result<Token, UserError>>`. When using rules
  of this kind, the error type needs to be declared at the beginning of the
  lexer with the `type Error = UserError;` syntax.

  When a rule of this kind returns an error, the error is returned to the
  caller of the lexer's `next` method.

- `<regex>,`: syntactic sugar for `<regex> => |lexer| { lexer.reset_match();
  lexer.continue_() },`. Useful for skipping characters (e.g. whitespace).

- `<regex> = <token>,`: syntactic sugar for `<regex> => |lexer|
  lexer.return_(<token>),`. Useful for matching keywords, punctuation
  (operators), and delimiters (parens, brackets).
## End-of-input handling in rule sets

The `Init` rule set terminates lexing successfully on end-of-input (i.e.
`lexer.next()` returns `None`). Other rule sets fail on end-of-input (i.e.
return `Some(Err(...))`). This is because the states other than the initial
one are generally for complicated tokens (strings, raw strings, multi-line
comments) that need to be terminated and handled, and end-of-input in those
states usually means the token was not terminated properly.

(To handle end-of-input in a rule set, you can use `$` as described in the
"Regex syntax" section above.)

## Handle, rule, error, and action types

The `lexer` macro generates a struct with the name specified by the user in
the first line of the lexer definition. In the example at the beginning
(`Lexer -> Token;`), the name of the struct is `Lexer`.

A mutable reference to this struct is passed to semantic action functions. In
the implementation of a semantic action, you should use one of the methods
below to drive the lexer and return tokens:

- `fn match_(&self) -> &str`: returns the current match. Note that when the
  lexer is constructed with `new_from_iter` or `new_from_iter_with_state`,
  this method panics. It should only be called when the lexer is initialized
  with `new` or `new_with_state`.
- `fn match_loc(&self) -> (lexgen_util::Loc, lexgen_util::Loc)`: returns the
  bounds of the current match.
- `fn peek(&mut self) -> Option<char>`: looks ahead one character.
- `fn state(&mut self) -> &mut <user state type>`: returns a mutable
  reference to the user state.
- `fn return_(&self, token: <user token type>) -> SemanticActionResult`:
  returns the passed token as a match.
- `fn continue_(&self) -> SemanticActionResult`: ignores the current match
  and continues lexing in the same lexer state. Useful for skipping
  characters.
- `fn switch(&mut self, rule: LexerRule) -> SemanticActionResult`: switches
  between lexer states. `LexerRule` (where the `Lexer` part is the name of
  the lexer as specified by the user) is an enum with a variant for each rule
  set name, for example `LexerRule::Init`. See the stateful lexer example
  below.
- `fn switch_and_return(&mut self, rule: LexerRule, token: <user token
  type>) -> SemanticActionResult`: switches to the given lexer state and
  returns the given token.
- `fn reset_match(&mut self)`: resets the current match. E.g. if you call
  `match_()` right after `reset_match()`, it will return an empty string.

Semantic action functions should return a `SemanticActionResult` value
obtained from one of the methods listed above.

## Initializing lexers

lexgen generates four constructors:

- `fn new(input: &str) -> Self`: used when the lexer does not have user
  state, or the user state implements `Default`.

- `fn new_with_state(input: &str, user_state: S) -> Self`: used when the
  lexer has user state that does not implement `Default`, or when you want to
  initialize the state with something other than the default. `S` is the user
  state type specified in the lexer definition. See the stateful lexer
  example below.

- `fn new_from_iter<I: Iterator<Item = char> + Clone>(iter: I) -> Self`: used
  when the input isn't a flat string, but something like a rope or a zipper.
  Note that the `match_` method panics when this constructor is used; use
  `match_loc` instead to get the location of the current match.

- `fn new_from_iter_with_state<I: Iterator<Item = char> + Clone, S>(iter: I,
  user_state: S) -> Self`: same as above, but doesn't require the user state
  to implement `Default`.
## Stateful lexer example

Here's an example lexer that counts the number of `=`s that appear between
two `[`s:

```rust
lexer! {
    // The `usize` in parentheses is the user state type; the `usize` after
    // the arrow is the token type
    Lexer(usize) -> usize;

    rule Init {
        $$ascii_whitespace,                             // line 7

        '[' => |lexer| {
            *lexer.state() = 0;                         // line 10
            lexer.switch(LexerRule::Count)              // line 11
        },
    }

    rule Count {
        '=' => |lexer| {
            *lexer.state() += 1;                        // line 17
            lexer.continue_()                           // line 18
        },

        '[' => |lexer| {
            let n = *lexer.state();
            lexer.switch_and_return(LexerRule::Init, n) // line 23
        },
    }
}

let mut lexer = Lexer::new("[[ [=[ [==[");
assert_eq!(
    lexer.next(),
    Some(Ok((
        Loc { line: 0, col: 0, byte_idx: 0 },
        0,
        Loc { line: 0, col: 2, byte_idx: 2 },
    )))
);
assert_eq!(
    lexer.next(),
    Some(Ok((
        Loc { line: 0, col: 3, byte_idx: 3 },
        1,
        Loc { line: 0, col: 6, byte_idx: 6 },
    )))
);
assert_eq!(
    lexer.next(),
    Some(Ok((
        Loc { line: 0, col: 7, byte_idx: 7 },
        2,
        Loc { line: 0, col: 11, byte_idx: 11 },
    )))
);
assert_eq!(lexer.next(), None);
```

Initially (in the `Init` rule set) we skip spaces (line 7). When we see a `[`
we initialize the user state (line 10) and switch to the `Count` state (line
11). In `Count`, each `=` increments the user state by one (line 17) and
skips the match (line 18). A `[` in the `Count` state returns the current
count and switches to the `Init` state (line 23).

[1]: https://github.com/osa1/lexgen/blob/main/crates/lexgen/tests/tests.rs
[2]: https://github.com/osa1/lexgen/blob/main/crates/lexgen/tests/lua_5_1.rs
[3]: https://github.com/osa1/lexgen/tree/main/crates/lexgen_lalrpop_example
[4]: https://github.com/osa1/mincaml/blob/master/src/lexer.rs
[5]: https://github.com/osa1/lexgen_rust/blob/main/crates/lexgen_rust/src/lib.rs
[6]: https://github.com/osa1/how-to-parse/blob/4f40236b1f9eca5b67d2193ef0f55fffdc06bffb/src/lexgen_event_parser.rs