https://github.com/bsfishy/zig-lexer
lexer generator written in zig
- Host: GitHub
- URL: https://github.com/bsfishy/zig-lexer
- Owner: BSFishy
- License: other
- Created: 2025-01-27T23:54:58.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-03-19T03:31:45.000Z (3 months ago)
- Last Synced: 2025-03-19T04:20:01.897Z (3 months ago)
- Language: Zig
- Size: 204 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# zig lexer
> [!CAUTION]
> I don't recommend that you use this project. I have built it for myself, and
> as such there are certainly holes in the implementation that I am aware of
> but that will be very difficult to reason about in the spaghetti that I have built.
>
> I have open sourced the code for the culture. If you ARE interested in
> investing some blood, sweat, and tears, you are free to fork the project and
> patch the holes that I have left. This is and will remain a project that I
> have built for myself, and as such I won't be paying attention to issues and
> PRs in this repo.

This is the home of my lexer. It uses Zig's comptime to generate jump tables at
compile time for super easy maintenance and pretty decent performance.

I primarily made this project because I want to use lexers but I am bad at
writing them and I don't really care about the implementation. I kinda just care
about the tokens and their patterns. And this is exactly that.

```zig
const lexer = @import("lexer");

const token_patterns = [_]lexer.TokenPattern{
.Comment = .{ .pattern = "//([^\n])*", .skip = true },
.Division = .{ .pattern = "/" },
.Func = .{ .pattern = "func" },
.String = .{ .pattern = "\"([^\"]|\\\\\")*\"" },
.Newline = .{ .pattern = "(\n|\r\n)", .skip = true },
.Ident = .{ .pattern = "\\w\\W*" },
};

// `allocator` is any std.mem.Allocator
const l = lexer.Lexer(&token_patterns).init(allocator);
const tokens = try l.lex(input, .{});
for (tokens) |token| {
_ = token.token_type; // enum of all the variant names
_ = token.source; // slice into the input
}
```

Features:
- Generates jump tables at compile time
- Supports UTF-8 out of the box
- Allows matching specific token variants into exhaustive enums
- Regular expression-esque patterns for matching
- Linear time complexity lexing
- Token variants that can be skipped and won't appear in the token enum

## Implementation details
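Before the details, here is a rough sketch in Python (not the project's Zig, and not its actual API) of the kind of jump-table-driven lexing this section describes, with direct, sequence, and fallthrough entries; every name below is an illustrative assumption:

```python
# Illustrative sketch (not this project's API): each state maps characters to
# either another state or to a finished token ("leaf"). All names here are
# assumptions for illustration only.
TABLES = {
    # state 0: the start table
    0: {"direct": {"/": 1}, "sequence": {str.isalpha: 2}},
    # state 1: saw "/"; another "/" begins a comment, otherwise Division
    1: {"direct": {"/": 3}, "leaf": "Division"},
    # state 2: identifier body (alphanumeric keeps looping)
    2: {"sequence": {str.isalnum: 2}, "leaf": "Ident"},
    # state 3: comment body; everything except newline falls through to 3
    3: {"direct": {"\n": None}, "fallthrough": 3, "leaf": "Comment"},
}

def lex(src):
    tokens, i = [], 0
    while i < len(src):
        if src[i].isspace():  # toy-level whitespace handling
            i += 1
            continue
        state, start = 0, i
        while i < len(src):
            table, c = TABLES[state], src[i]
            if c in table.get("direct", {}):
                nxt = table["direct"][c]  # direct match (None ends the token)
            else:
                # sequence matches (character classes), then fallthrough
                nxt = next((dest for cls, dest in table.get("sequence", {}).items()
                            if cls(c)), table.get("fallthrough"))
            if nxt is None:
                break
            state, i = nxt, i + 1
        leaf = TABLES[state].get("leaf")
        if leaf is None:
            raise ValueError(f"no token matches at offset {start}")
        tokens.append((leaf, src[start:i]))  # token type + slice of the input
    return tokens

print(lex("foo / bar"))  # [('Ident', 'foo'), ('Division', '/'), ('Ident', 'bar')]
```

The real implementation builds equivalent tables at comptime from the patterns; the runtime loop is just the inner walk above.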
The first step is parsing the regular expressions. This is a small pattern
language taken after regular expressions, that only has the functionality that I
find useful in making token patterns. You can see every feature that it supports
in the example above.

Once tokens are parsed into a tree that represents the structure of the pattern,
it is passed into the parser. The parser creates a jump table, which is a list
of tables. Each table is a few maps of key-value pairs. These match a specific
character to either a next table or a leaf token. There are separate maps for
direct matches, sequence matches (something like a `\w`), and fallthroughs (if
the character doesn't match anything else).

These are the available sequences:
- **`\w`** - any alphabetic character
- **`\W`** - any alphanumeric character
- **`\0`** - any numeric character

This structure must be recursively constructed because the regular expression
grammar returns a hierarchical structure. All that code is super complicated,
and I quite frankly only have a loose grasp on what is actually going on. But
it works, and that's all I really care about.

Once the jump table has been generated, we convert it into static tables. This is
super important at the comptime -> runtime boundary. After that, once we have
UTF-8 input to lex, we just jump around the jump tables, reading through the
input.

## Visualizing the tables
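Since the tables are plain data, exporting a graph amounts to a walk over their entries. As a hedged illustration (Python, and not this project's API or table layout), the idea looks roughly like:

```python
# Hypothetical sketch (not this project's API): walk a toy table structure
# and emit Graphviz DOT text. Table layout and names are assumptions.
def to_dot(tables):
    lines = ["digraph {"]
    for state, table in tables.items():
        for ch, dest in table.get("direct", {}).items():
            if dest is not None:
                lines.append(f'  {state} -> {dest} [label="{ch}"];')
        if "leaf" in table:
            # leaf edges point at the token variant this state can finish as
            lines.append(f'  {state} -> {table["leaf"]} [color=blue];')
    lines.append("}")
    return "\n".join(lines)

toy = {
    0: {"direct": {"/": 1}},
    1: {"direct": {"/": 2}, "leaf": "Division"},
    2: {"leaf": "Comment"},
}
print(to_dot(toy))
```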
Because the tables are just data in memory, we also have the ability to export
graphs representing the tables. This is available through the following API:

```zig
const writer = std.io.getStdOut().writer();
try l.to_graph(writer);
```

I found this tremendously useful in building this project. I implemented a lexer
for a small example language to test out its features. Here is the graphviz of
the lexer: [example graph][graphviz]
## License
This project is licensed under either:
- MIT License ([LICENSE-MIT](./LICENSE-MIT))
- Apache License 2.0 ([LICENSE-APACHE](./LICENSE-APACHE))

You may choose either license.
[graphviz]: https://edotor.net/?engine=dot#digraph%20%7B%0A%20%20start%20-%3E%201%20%5Blabel%3D%22%2F%22%5D%3B%0A%20%201%20-%3E%20Division%20%5Blabel%3D%22super%20%2F%20leaf%22%20color%3Dblue%5D%3B%0A%20%20start%20-%3E%203%20%5Blabel%3D%22f%22%5D%3B%0A%20%203%20-%3E%20Ident%20%5Blabel%3D%22super%20f%20leaf%22%20color%3Dblue%5D%3B%0A%20%20start%20-%3E%206%20%5Blabel%3D%22%5C%22%22%5D%3B%0A%20%20start%20-%3E%20Newline%20%5Blabel%3D%22%5C%5Cn%20leaf%22%20color%3Dblue%5D%3B%0A%20%20start%20-%3E%208%20%5Blabel%3D%22%5C%5Cr%22%5D%3B%0A%20%20start%20-%3E%209%20%5Blabel%3D%22%5C%5Cw%22%20color%3Dorange%5D%3B%0A%20%209%20-%3E%20Ident%20%5Blabel%3D%22super%20%5C%5Cw%20leaf%22%20color%3Dgreen%5D%3B%0A%20%201%20-%3E%202%20%5Blabel%3D%22%2F%22%5D%3B%0A%20%202%20-%3E%20Comment%20%5Blabel%3D%22super%20%2F%20leaf%22%20color%3Dblue%5D%3B%0A%20%202%20-%3E%202%20%5Blabel%3D%22%5C%5Cn%22%20color%3Dred%5D%3B%0A%20%202%20-%3E%20Comment%20%5Blabel%3D%22super%20%5C%5Cn%20leaf%22%20color%3Dpurple%5D%3B%0A%20%203%20-%3E%204%20%5Blabel%3D%22u%22%5D%3B%0A%20%204%20-%3E%20Ident%20%5Blabel%3D%22super%20u%20leaf%22%20color%3Dblue%5D%3B%0A%20%203%20-%3E%2010%20%5Blabel%3D%22%5C%5CW%22%20color%3Dorange%5D%3B%0A%20%2010%20-%3E%20Ident%20%5Blabel%3D%22super%20%5C%5CW%20leaf%22%20color%3Dgreen%5D%3B%0A%20%204%20-%3E%205%20%5Blabel%3D%22n%22%5D%3B%0A%20%205%20-%3E%20Ident%20%5Blabel%3D%22super%20n%20leaf%22%20color%3Dblue%5D%3B%0A%20%204%20-%3E%2011%20%5Blabel%3D%22%5C%5CW%22%20color%3Dorange%5D%3B%0A%20%2011%20-%3E%20Ident%20%5Blabel%3D%22super%20%5C%5CW%20leaf%22%20color%3Dgreen%5D%3B%0A%20%205%20-%3E%2013%20%5Blabel%3D%22c%22%5D%3B%0A%20%2013%20-%3E%20Func%20%5Blabel%3D%22super%20c%20leaf%22%20color%3Dblue%5D%3B%0A%20%205%20-%3E%2012%20%5Blabel%3D%22%5C%5CW%22%20color%3Dorange%5D%3B%0A%20%2012%20-%3E%20Ident%20%5Blabel%3D%22super%20%5C%5CW%20leaf%22%20color%3Dgreen%5D%3B%0A%20%206%20-%3E%207%20%5Blabel%3D%22%5C%5C%22%5D%3B%0A%20%206%20-%3E%20%22String%22%20%5Blabel%3D%22%5C%22%20leaf%22%20color%3Dblue%5D%3B%0A%20%206%20-%3E%206%20%5Blabel%3D%22%5C%22%22%20color%3Dred%5D%3B%0A%20%207%20-%3E%206%20%5Blabel%3D%22%5C%22%22%5D%3B%0A%20%207%20-%3E%206%20%5Blabel%3D%22%5C%22%22%20color%3Dred%5D%3B%0A%20%208%20-%3E%20Newline%20%5Blabel%3D%22%5C%5Cn%20leaf%22%20color%3Dblue%5D%3B%0A%20%209%20-%3E%209%20%5Blabel%3D%22%5C%5CW%22%20color%3Dorange%5D%3B%0A%20%209%20-%3E%20Ident%20%5Blabel%3D%22super%20%5C%5CW%20leaf%22%20color%3Dgreen%5D%3B%0A%20%2010%20-%3E%2010%20%5Blabel%3D%22%5C%5CW%22%20color%3Dorange%5D%3B%0A%20%2010%20-%3E%20Ident%20%5Blabel%3D%22super%20%5C%5CW%20leaf%22%20color%3Dgreen%5D%3B%0A%20%2011%20-%3E%2011%20%5Blabel%3D%22%5C%5CW%22%20color%3Dorange%5D%3B%0A%20%2011%20-%3E%20Ident%20%5Blabel%3D%22super%20%5C%5CW%20leaf%22%20color%3Dgreen%5D%3B%0A%20%2012%20-%3E%2012%20%5Blabel%3D%22%5C%5CW%22%20color%3Dorange%5D%3B%0A%20%2012%20-%3E%20Ident%20%5Blabel%3D%22super%20%5C%5CW%20leaf%22%20color%3Dgreen%5D%3B%0A%20%2013%20-%3E%2014%20%5Blabel%3D%22%5C%5CW%22%20color%3Dorange%5D%3B%0A%20%2014%20-%3E%20Ident%20%5Blabel%3D%22super%20%5C%5CW%20leaf%22%20color%3Dgreen%5D%3B%0A%20%2014%20-%3E%2014%20%5Blabel%3D%22%5C%5CW%22%20color%3Dorange%5D%3B%0A%20%2014%20-%3E%20Ident%20%5Blabel%3D%22super%20%5C%5CW%20leaf%22%20color%3Dgreen%5D%3B%0A%7D%0A