{"id":22147646,"url":"https://github.com/ariaandika/tokenizer","last_synced_at":"2025-06-26T12:31:39.370Z","repository":{"id":258474235,"uuid":"871145727","full_name":"ariaandika/tokenizer","owner":"ariaandika","description":"tokenizer, lexer, parser, or whatever in rust","archived":false,"fork":false,"pushed_at":"2024-10-20T02:52:28.000Z","size":41,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-24T12:32:45.892Z","etag":null,"topics":["parser","rust","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ariaandika.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-11T11:11:04.000Z","updated_at":"2024-10-20T02:52:25.000Z","dependencies_parsed_at":"2025-01-29T17:41:24.672Z","dependency_job_id":"43814c69-50eb-4d17-81d6-5a68cfcdb569","html_url":"https://github.com/ariaandika/tokenizer","commit_stats":null,"previous_names":["ariaandika/tokenizer"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ariaandika/tokenizer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ariaandika%2Ftokenizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ariaandika%2Ftokenizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ariaandika%2Ftokenizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ariaandika%2Ftokenizer/manifests","owner_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/owners/ariaandika","download_url":"https://codeload.github.com/ariaandika/tokenizer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ariaandika%2Ftokenizer/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262067912,"owners_count":23253698,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["parser","rust","tokenizer"],"created_at":"2024-12-01T23:19:42.562Z","updated_at":"2025-06-26T12:31:39.289Z","avatar_url":"https://github.com/ariaandika.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Basic Tokenizer, Lexer, Parser, or whatever\n\nInspired by Rust's `syn` and `proc_macro`.\n\n## Workspace\n\n- `tokenizer`, converts bytes to tokens\n- `parser`, a more extensible parser\n- `buf-iter`, a parser that works on bytes instead of tokens\n- `html-parser`, the first attempt at a parser\n\n## Tokenizer\n\nTokenizes a stream of bytes into a collection of token trees.\n\nA token does not contain the actual value; instead it holds a `Span`. A `Span` contains a 'pointer' to\nthe actual value in the source code. To get the actual value, we can `evaluate` the span against the source code. This requires\nthe caller to hold the source reference themselves. In exchange, we only allocate numbers when tokenizing.\n\nThis is not a general tokenizer, because other kinds of tokens can have other rules that cannot overlap,\nand it is not worth creating another abstraction layer. 
Instead, specialized tokenizers are usually written\non their own, and they can also be derived from this tokenizer. That also makes this tokenizer infallible.\n\n### `TokenTree`\n\nPossible types of token:\n\n- `Ident`\n- `Punct`\n- `Whitespace`\n\nFor more detail, see the generated documentation:\n\n```bash\ncargo doc -p tokenizer --open\n```\n\n## Parser\n\nA more extensible parser that moves away from Rust's `Iterator` trait and makes the API more like `syn`.\n\n## BufIter\n\nA byte-oriented parser, good for piping a buffer without abstracting it into tokens.\n\nSee the examples in `buf-iter/examples`; the tests in `buf-iter/tests` also serve as examples.\n\n## HTML Parser\n\nThe first attempt at a parser, derived from `tokenizer`. HTML tokens themselves are pretty simple, so this package is not\nreally designed for extensibility; most of it is hard coded.\n\nHere, we parse an open or close element, not the whole element with its children. This avoids allocating\na new vector when iterating, so the result is a one-dimensional sequence of tokens. Attributes are also not parsed, only validated,\nfor the same reason as above: to avoid allocating a new vector. We can iterate over the attributes on their own if needed.\n\n### `SyntaxTree`\n\nPossible types of token:\n\n- `DOCTYPE`, HTML doctype `\u003c!DOCTYPE html\u003e`\n- `Comment`, HTML comment, `\u003c!-- any value --\u003e`\n- `Element`, open or close HTML element, attributes are only validated\n- `Text`, anything else\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fariaandika%2Ftokenizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fariaandika%2Ftokenizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fariaandika%2Ftokenizer/lists"}