https://github.com/ariaandika/tokenizer
tokenizer, lexer, parser, or whatever in rust
- Host: GitHub
- URL: https://github.com/ariaandika/tokenizer
- Owner: ariaandika
- Created: 2024-10-11T11:11:04.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-10-20T02:52:28.000Z (over 1 year ago)
- Last Synced: 2025-03-24T12:32:45.892Z (over 1 year ago)
- Topics: parser, rust, tokenizer
- Language: Rust
- Homepage:
- Size: 40 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
# Basic Tokenizer, Lexer, Parser, or whatever
Inspired by Rust's `syn` and `proc_macro`.
## Workspace
- `tokenizer`, converts bytes to tokens
- `parser`, a more extensible parser
- `buf-iter`, a byte-oriented parser that works on raw bytes instead of tokens
- `html-parser`, the first attempt at a parser
## Tokenizer
Tokenizes a stream of bytes into a collection of token trees.
A token does not contain its actual value; instead, it holds a `Span`. A span contains a 'pointer' to
the actual value in the source code. To get the actual value, we can `evaluate` the span against the source code. This requires
the caller to hold the source reference themselves. In exchange, we only allocate numbers when tokenizing.
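The span idea can be sketched as follows. This is a minimal illustration, not the crate's actual code: the field names `start` and `end` and the exact signature of `evaluate` are assumptions made for the example.

```rust
/// Hypothetical span: a byte range into the source, not the text itself.
/// The field names are assumptions; the real `Span` may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

impl Span {
    /// Resolve the span against the source that the caller kept around.
    /// Panics if the range is out of bounds for `source`.
    pub fn evaluate<'a>(&self, source: &'a [u8]) -> &'a [u8] {
        &source[self.start..self.end]
    }
}
```

Because a span like this is just two integers, it is `Copy` and carries no lifetime; the cost is that it is only meaningful together with the source it was created from.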
This is not a general tokenizer, because other kinds of tokens can have rules that cannot coexist with the ones here,
and it is not worth creating another abstraction layer. Instead, a specialized tokenizer is usually written
on its own, and it can also be derived from this tokenizer. That also makes this tokenizer infallible.
### `TokenTree`
Possible token types:
- `Ident`
- `Punct`
- `Whitespace`
For more detail, see the generated documentation:
```bash
cargo doc -p tokenizer --open
```
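To show why a tokenizer with only these three token types can be infallible, here is a rough sketch. The classification rules (ASCII alphanumerics and `_` form an `Ident`, ASCII whitespace forms `Whitespace`, any other single byte is a `Punct`), the `tokenize` function, and the `Span` fields are all assumptions for illustration; the crate's real rules may differ.

```rust
/// Hypothetical span: a byte range into the source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// The three token kinds named in the README; each only carries a span.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenTree {
    Ident(Span),
    Punct(Span),
    Whitespace(Span),
}

/// Assumed identifier rule: ASCII letters, digits, and underscore.
fn is_ident_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Every byte falls into exactly one category, so there is no error case.
pub fn tokenize(source: &[u8]) -> Vec<TokenTree> {
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < source.len() {
        let start = i;
        if is_ident_byte(source[i]) {
            // Group a run of identifier bytes into one token.
            while i < source.len() && is_ident_byte(source[i]) {
                i += 1;
            }
            tokens.push(TokenTree::Ident(Span { start, end: i }));
        } else if source[i].is_ascii_whitespace() {
            // Group a run of whitespace bytes into one token.
            while i < source.len() && source[i].is_ascii_whitespace() {
                i += 1;
            }
            tokens.push(TokenTree::Whitespace(Span { start, end: i }));
        } else {
            // Anything else (including non-ASCII bytes) is a one-byte Punct.
            i += 1;
            tokens.push(TokenTree::Punct(Span { start, end: i }));
        }
    }
    tokens
}
```

In this sketch the only allocation is the vector of tokens, and each token holds nothing but two numbers, which matches the trade-off described above.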
## Parser
A more extensible parser that moves away from Rust's `Iterator` trait and makes the API more like `syn`.
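As a rough idea of what a `syn`-like API looks like compared with an `Iterator`, the sketch below lets each syntax node parse itself from a shared cursor. The names `ParseStream` and `Parse` are borrowed from `syn` for illustration only; `Ident`, `Equals`, and `Assign` are hypothetical nodes, and none of this is taken from the `parser` crate itself.

```rust
/// Hypothetical cursor over the source bytes.
pub struct ParseStream<'a> {
    source: &'a [u8],
    pos: usize,
}

/// Hypothetical trait: each syntax node knows how to parse itself.
pub trait Parse: Sized {
    fn parse(input: &mut ParseStream<'_>) -> Option<Self>;
}

impl<'a> ParseStream<'a> {
    pub fn new(source: &'a [u8]) -> Self {
        Self { source, pos: 0 }
    }

    pub fn position(&self) -> usize {
        self.pos
    }

    /// Parse a `T`; on failure, rewind so the caller can try something else.
    pub fn parse<T: Parse>(&mut self) -> Option<T> {
        let checkpoint = self.pos;
        let node = T::parse(self);
        if node.is_none() {
            self.pos = checkpoint;
        }
        node
    }
}

/// An identifier, stored as a byte range.
pub struct Ident {
    pub start: usize,
    pub end: usize,
}

impl Parse for Ident {
    fn parse(input: &mut ParseStream<'_>) -> Option<Self> {
        let start = input.pos;
        while let Some(&b) = input.source.get(input.pos) {
            if !(b.is_ascii_alphanumeric() || b == b'_') {
                break;
            }
            input.pos += 1;
        }
        if input.pos > start {
            Some(Ident { start, end: input.pos })
        } else {
            None
        }
    }
}

/// The `=` punctuation.
pub struct Equals;

impl Parse for Equals {
    fn parse(input: &mut ParseStream<'_>) -> Option<Self> {
        if input.source.get(input.pos) == Some(&b'=') {
            input.pos += 1;
            Some(Equals)
        } else {
            None
        }
    }
}

/// A composite node built from smaller ones: `name=value`.
pub struct Assign {
    pub name: Ident,
    pub value: Ident,
}

impl Parse for Assign {
    fn parse(input: &mut ParseStream<'_>) -> Option<Self> {
        let name = input.parse::<Ident>()?;
        input.parse::<Equals>()?;
        let value = input.parse::<Ident>()?;
        Some(Assign { name, value })
    }
}
```

The point of this shape is composition: new node types are added by implementing the trait, and the caller decides what to parse next with `input.parse::<T>()` instead of matching on whatever an iterator yields.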
## BufIter
A byte-oriented parser, good for piping a buffer through without abstracting it into tokens.
See the examples in `buf-iter/examples`; the tests in `buf-iter/tests` also serve as examples.
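For readers without the repository at hand, a byte-oriented cursor might look roughly like this. The type name follows the section title, but the methods `peek` and `take_until` are hypothetical and are not taken from the `buf-iter` crate; the real API is in the examples mentioned above.

```rust
/// Hypothetical byte cursor: walks a buffer and hands out sub-slices.
pub struct BufIter<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> BufIter<'a> {
    pub fn new(buf: &'a [u8]) -> Self {
        Self { buf, pos: 0 }
    }

    /// Look at the next byte without consuming it.
    pub fn peek(&self) -> Option<u8> {
        self.buf.get(self.pos).copied()
    }

    /// Return the bytes up to `delim` and skip past the delimiter.
    /// If `delim` never occurs, return everything that is left.
    pub fn take_until(&mut self, delim: u8) -> &'a [u8] {
        let buf: &'a [u8] = self.buf;
        let rest = &buf[self.pos..];
        let len = rest.iter().position(|&b| b == delim).unwrap_or(rest.len());
        // Step over the delimiter, but never past the end of the buffer.
        self.pos = (self.pos + len + 1).min(buf.len());
        &rest[..len]
    }
}
```

Every returned slice borrows directly from the input buffer, so nothing is copied and no token type sits between the caller and the bytes.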
## HTML Parser
The first attempt at a parser, derived from `tokenizer`. HTML tokens themselves are pretty simple, so this package is not
really designed for extensibility; most of it is hard-coded.
Here, we parse open or close elements, not a whole element with its children. This avoids allocating
a new vector when iterating, so the result is a one-dimensional sequence of tokens. Attributes are also not parsed, only validated,
for the same reason: to avoid allocating a new vector. We can iterate over the attributes on their own if needed.
### `SyntaxTree`
Possible token types:
- `DOCTYPE`, HTML doctype, e.g. `<!DOCTYPE html>`
- `Comment`, HTML comment, e.g. `<!-- ... -->`
- `Element`, an open or close HTML element; attributes are only validated
- `Text`, everything else
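The flat, allocation-free design can be sketched as an iterator that yields one node per open tag, close tag, comment, doctype, or text run. This is a simplified illustration under several assumptions: it works on `&str` rather than on the tokenizer's output, the `Nodes` iterator and the `Element` fields are hypothetical, the variant is spelled `Doctype` to follow Rust naming, attributes are kept as a raw slice without the validation the real crate performs, and a `>` inside an attribute value is not handled.

```rust
/// Flat HTML nodes; every variant borrows from the source string.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyntaxTree<'a> {
    Doctype(&'a str),
    Comment(&'a str),
    /// An open or close tag. `attrs` is the raw, unparsed attribute text.
    Element { name: &'a str, attrs: &'a str, closing: bool },
    Text(&'a str),
}

/// Hypothetical iterator over the flat nodes of an HTML string.
pub struct Nodes<'a> {
    rest: &'a str,
}

impl<'a> Nodes<'a> {
    pub fn new(source: &'a str) -> Self {
        Self { rest: source }
    }
}

impl<'a> Iterator for Nodes<'a> {
    type Item = SyntaxTree<'a>;

    fn next(&mut self) -> Option<Self::Item> {
        let src = self.rest;
        if src.is_empty() {
            return None;
        }
        // Anything before the next `<` is text.
        if !src.starts_with('<') {
            let end = src.find('<').unwrap_or(src.len());
            self.rest = &src[end..];
            return Some(SyntaxTree::Text(&src[..end]));
        }
        // Comments run until `-->`, or to the end if unterminated.
        if let Some(body) = src.strip_prefix("<!--") {
            let end = body.find("-->").unwrap_or(body.len());
            self.rest = body.get(end + 3..).unwrap_or("");
            return Some(SyntaxTree::Comment(&body[..end]));
        }
        // Otherwise this is a tag or a doctype, ending at the next `>`.
        let end = src.find('>').unwrap_or(src.len());
        let inner = &src[1..end];
        self.rest = src.get(end + 1..).unwrap_or("");
        if let Some(decl) = inner.strip_prefix('!') {
            return Some(SyntaxTree::Doctype(decl.trim()));
        }
        let closing = inner.starts_with('/');
        let body = inner.strip_prefix('/').unwrap_or(inner);
        // Split the tag name from the raw attribute text.
        let (name, attrs) = match body.find(char::is_whitespace) {
            Some(i) => (&body[..i], body[i..].trim()),
            None => (body, ""),
        };
        Some(SyntaxTree::Element { name, attrs, closing })
    }
}
```

Since an element's children are never collected, a caller that needs nesting has to track open and close tags itself; in exchange, iterating costs no allocation, and the raw `attrs` slice can be walked separately when attributes are actually needed.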