{"id":26279033,"url":"https://github.com/joshday/tokeniterators.jl","last_synced_at":"2025-07-19T10:39:34.232Z","repository":{"id":277429776,"uuid":"932408153","full_name":"joshday/TokenIterators.jl","owner":"joshday","description":"Easy syntax for writing fast tokenizers/lexers","archived":false,"fork":false,"pushed_at":"2025-03-14T00:18:37.000Z","size":90,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-14T00:49:01.814Z","etag":null,"topics":["julia","lexical-analysis","tokenization"],"latest_commit_sha":null,"homepage":"","language":"Julia","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/joshday.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-13T21:34:06.000Z","updated_at":"2025-03-13T14:53:12.000Z","dependencies_parsed_at":"2025-03-07T00:28:55.247Z","dependency_job_id":null,"html_url":"https://github.com/joshday/TokenIterators.jl","commit_stats":null,"previous_names":["joshday/tokenizers.jl","joshday/tokeniterators.jl"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joshday%2FTokenIterators.jl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joshday%2FTokenIterators.jl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joshday%2FTokenIterators.jl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joshday%2FTokenIterators.jl/manifests","owner_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/owners/joshday","download_url":"https://codeload.github.com/joshday/TokenIterators.jl/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243581059,"owners_count":20314167,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["julia","lexical-analysis","tokenization"],"created_at":"2025-03-14T13:18:52.132Z","updated_at":"2025-03-14T13:18:52.582Z","avatar_url":"https://github.com/joshday.png","language":"Julia","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TokenIterators\n\n[![Build Status](https://github.com/joshday/TokenIterators.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/joshday/TokenIterators.jl/actions/workflows/CI.yml?query=branch%3Amain)\n\n\nTokenIterators.jl provides easy syntax for writing lexers/tokenizers, with a few built-ins.  
It is fast (see the Performance section) and easy to use.\n\n\u003e [!IMPORTANT]\n\u003e - This package is not designed for validating syntax.\n\u003e - There's no guarantee that a token is the \"smallest meaningful unit\".\n\n\n## Usage\n\n```julia\nusing TokenIterators\n\nt = JSONTokens(b\"\"\"{ \"key\": \"value\", \"key2\": -1e-7}\"\"\")\n\ncollect(t)\n```\n\n\u003cimg src=\"https://github.com/user-attachments/assets/2f225a34-35ea-4c2b-8389-53c00ae4de5d\" alt=\"TokenIterators.jl example\" height=\"300px\"\u003e\n\n\n## `TokenIterator` and `Token`\n\nA `TokenIterator` (abstract type) iterates over `Token`s (smallest meaningful unit of text/data) from any input `T::AbstractVector{UInt8}`.\n\nBoth `TokenIterator{T,K,S}` and `Token{T,K,S}` are parameterized by the type of the input data `T`, the type of the token kind `K`, and the type of the state `S`.\n\n```julia\nstruct Token{T \u003c: AbstractVector{UInt8}, K, S} \u003c: AbstractVector{UInt8}\n    data::T     # Input Data\n    kind::K     # Kind of Token (e.g. Symbol or an Enum type)\n    i::Int      # first index of token\n    j::Int      # last index of token\n    state::S    # Any additional state we wish to store\nend\n```\n\n\u003e [!TIP]\n\u003e [StringViews.jl](https://github.com/JuliaStrings/StringViews.jl) can be used to provide lightweight AbstractString views of the token.\n\n## Rules\n\nA `Token` can be created by a `Rule`, which is created with the syntax:\n\n```julia\nstarting_pattern --\u003e ending_pattern\n```\n\nwhere:\n\n1. `starting_pattern` is used to identify the beginning of a token.\n2. 
`ending_pattern` is used to identify the end of a token.\n\n\u003e [!NOTE]\n\u003e A `Token` with indexes `i` and `j` will be created if `starting_pattern == data[i]` and `j == findnext(ending_pattern, data, i + 1)`.\n\u003e We use `TokenIterators.isfirst(starting_pattern, dataview)` and `TokenIterators._findnext(ending_pattern, dataview, i + 1)` (to avoid piracy with `Base.findnext`) to determine `i` and `j` of a `Token` where `dataview` is a view of the data *after* the previous token, e.g. `@view data[prev_token.j + 1:end]`.\n\n## An Example: JSONTokens\n\n- Here is the full implementation of `TokenIterators.JSONTokens`:\n\n```julia\nstruct JSONTokens{T \u003c: Data} \u003c: TokenIterator{T, Symbol, Nothing}\n    data::T\nend\n\nnext(o::JSONTokens, n::Token) = @tryrules begin\n    curly_open      = '{' --\u003e 1\n    curly_close     = '}' --\u003e 1\n    square_open     = '[' --\u003e 1\n    square_close    = ']' --\u003e 1\n    comma           = ',' --\u003e 1\n    colon           = ':' --\u003e 1\n    var\"true\"       = 't' --\u003e 4\n    var\"false\"      = 'f' --\u003e 5\n    var\"null\"       = 'n' --\u003e 4\n    string          = STRING            # '\"'                --\u003e  Unescaped('\"')\n    number          = NUMBER            # ∈(b\"-0123456789\")  --\u003e  ≺(∉(b\"-+eE.0123456789\"))\n    whitespace      = ASCII_WHITESPACE  #  ∈(b\" \\t\\r\\n\")     --\u003e  ≺(∉(b\" \\t\\r\\n\"))\nend\n```\n\n- `next` is the core part of `Base.iterate(::TokenIterator)`.\n- Here `n::Token` is a view into the data after the previous token, e.g. `Token(prev.data, prev.kind, prev.j + 1, length(prev.data), prev.state)`.\n- The `@tryrules` macro will attempt to match the rules in order.  
If a rule matches, the token is created and the function returns.\n\n## Mini-DSL\n\nPatterns in the DSL determine how the token is identified in `data::AbstractVector{UInt8}`.\n\n### Starting Patterns\n\n| Type | Example |Symbol | Tab-Completion | Description |\n|------|---------|-------|----------------|-------------|\n`UInt8` | `0x7b` |   |  | `x == data[i]`\n`Char` | `'{'` |   |  | `UInt8(x) == data[i]`\n`Function` | `∈(b\" \\t\\r\\n\") ` |  |  | `f(data[1]) == true`\n`Tuple` | `(a,b)` | | | `data` starts with sequence of patterns\n`AbstractVector{UInt8}` | `b\"null\"` | | | `x == data[1:length(x)]`\n`UseStringView` | `𝑠('😄')` | `𝑠` | `\\its` | Use pattern with `data::StringView`\n\n### Ending Patterns\n\n| Type | Example |Symbol | Tab-Completion | Description |\n|------|---------|-------|----------------|-------------|\n`UInt8` | `0x7b` |   |  | `findnext(==(x), data, i + 1)`\n`Char` | `'{'` |   |  | `findnext(==(UInt8(x)), data, i + 1)`\n`Function` | `∈(b\" \\t\\r\\n\") ` |  |  | `findnext(f, data, i + 1)`\n`Int` |  `1` |   |  | `return x` (fixed length)\n`Tuple` | `(a,b)` | | | return range in which indexes of `data` match sequence of patterns\n`Last` | `Last((a,b))` | `→` | `\\rightarrow` | return last index of match indexes\n`First` | `First((a,b))` | `←` | `\\leftarrow` | return first index of match indexes\n`Before` | `Before(x -\u003e x == a)` | `≺` | `\\prec` | return index before the match index\n`UseStringView` | `𝑠(isascii)` | `𝑠` | `\\its` | Use pattern with `data::StringView`\n`Not` | `¬('\\\\')` | `¬` | `\\neg` | Matching anything but the given pattern\n\n### Mini-DSL Examples\n\nThe `STRING`, `NUMBER`, and `ASCII_WHITESPACE` patterns from the JSONTokens example are interpreted as:\n\n```julia\n# STRING: Token begins with UInt8('\"') and ends at UInt8('\"') (excluding escaped '\"')\n'\"' --\u003e Unescaped('\"')\n\n# NUMBER: Token begins with any byte in b\"-0123456789\" and ends at the index before the first byte that is not in 
b\"-+eE.0123456789\"\n∈(b\"-0123456789\")  --\u003e  ≺(∉(b\"-+eE.0123456789\"))\n\n# ASCII_WHITESPACE: Token begins with any byte in b\" \\t\\r\\n\" and ends at the index before the first byte that is not in b\" \\t\\r\\n\"\n∈(b\" \\t\\r\\n\")  --\u003e  ≺(∉(b\" \\t\\r\\n\"))\n```\n\n## Performance\n\nTokenIterators is very fast with minimal allocations:\n\n```julia\nusing TokenIterators, BenchmarkTools\n\nversioninfo()\n# Julia Version 1.11.3\n# Commit d63adeda50d (2025-01-21 19:42 UTC)\n# Build Info:\n#   Official https://julialang.org/ release\n# Platform Info:\n#   OS: macOS (arm64-apple-darwin24.0.0)\n#   CPU: 10 × Apple M1 Pro\n#   WORD_SIZE: 64\n#   LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)\n# Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)\n\ndata = read(download(\"https://github.com/plotly/plotly.js/raw/v3.0.1/dist/plot-schema.json\"));\n\nBase.format_bytes(length(data))\n# \"3.648 MiB\"\n\nt = JSONTokens(data)\n# JSONTokens (3824728-element Vector{UInt8})\n\n@benchmark sum(t.kind == :string for t in $t)\n# BenchmarkTools.Trial: 497 samples with 1 evaluation per sample.\n#  Range (min … max):   9.834 ms …  31.878 ms  ┊ GC (min … max): 0.00% … 0.00%\n#  Time  (median):     10.001 ms               ┊ GC (median):    0.00%\n#  Time  (mean ± σ):   10.059 ms ± 993.690 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%\n\n#     ▁▁▄▄▂▂▄▁    ▃▄ ▂▄▅█▁▄▃ ▄▃▃▆▂▁▂▂\n#   ▆▃████████▆▆▅▆██████████▆█████████▆▅▆▆▆▆▅▆▃▃▄▃▃▅▄▃▅▁▃▃▃▃▃▁▃▃ ▅\n#   9.83 ms         Histogram: frequency by time         10.3 ms \u003c\n\n#  Memory estimate: 0 bytes, allocs estimate: 0.\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoshday%2Ftokeniterators.jl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjoshday%2Ftokeniterators.jl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoshday%2Ftokeniterators.jl/lists"}