https://github.com/joshday/tokeniterators.jl
Easy syntax for writing fast tokenizers/lexers
julia lexical-analysis tokenization
- Host: GitHub
- URL: https://github.com/joshday/tokeniterators.jl
- Owner: joshday
- License: mit
- Created: 2025-02-13T21:34:06.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-14T00:18:37.000Z (over 1 year ago)
- Last Synced: 2025-03-14T00:49:01.814Z (over 1 year ago)
- Topics: julia, lexical-analysis, tokenization
- Language: Julia
- Homepage:
- Size: 87.9 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# TokenIterators
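[CI](https://github.com/joshday/TokenIterators.jl/actions/workflows/CI.yml?query=branch%3Amain)
TokenIterators.jl provides easy syntax for writing lexers/tokenizers, along with a few built-in tokenizers such as `JSONTokens`. In the benchmark in the Performance section, tokenizing a 3.6 MiB JSON file takes about 10 ms with zero allocations.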
> [!IMPORTANT]
> - This package is not designed for validating syntax.
> - There's no guarantee that a token is the "smallest meaningful unit".
## Usage
```julia
using TokenIterators
t = JSONTokens(b"""{ "key": "value", "key2": -1e-7}""")
collect(t)
```

## `TokenIterator` and `Token`
A `TokenIterator` (abstract type) iterates over `Token`s (units of text/data, ideally the smallest meaningful ones) from any input `data::T` where `T <: AbstractVector{UInt8}`.
Both `TokenIterator{T,K,S}` and `Token{T,K,S}` are parameterized by the type of the input data `T`, the type of the token kind `K`, and the type of the state `S`.
```julia
struct Token{T <: AbstractVector{UInt8}, K, S} <: AbstractVector{UInt8}
    data::T     # Input data
    kind::K     # Kind of token (e.g. Symbol or an Enum type)
    i::Int      # First index of token
    j::Int      # Last index of token
    state::S    # Any additional state we wish to store
end
```
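Because `Token` subtypes `AbstractVector{UInt8}`, it can act as a lightweight view of `data[i:j]`. The following is a minimal sketch of that idea in plain Julia. `MiniToken` is a hypothetical stand-in with the same fields, not the package's type, and the package's actual method definitions may differ.

```julia
# Hypothetical stand-in for `Token` (same fields as the struct above).
struct MiniToken{T <: AbstractVector{UInt8}, K, S} <: AbstractVector{UInt8}
    data::T
    kind::K
    i::Int
    j::Int
    state::S
end

# Minimal AbstractVector interface: expose the bytes data[i:j] without copying.
Base.IndexStyle(::Type{<:MiniToken}) = IndexLinear()
Base.size(t::MiniToken) = (t.j - t.i + 1,)
Base.getindex(t::MiniToken, k::Int) = t.data[t.i + k - 1]

data = b"""{"key": 1}"""
tok = MiniToken(data, :string, 2, 6, nothing)   # spans the bytes of "key", quotes included

n = length(tok)               # number of bytes covered by the token
s = String(collect(tok))      # copies the bytes into a String for display
```

Only `size` and `getindex` are needed for iteration, `length`, and `collect` to work, which is why a token can be passed to any function that accepts a byte vector.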
> [!TIP]
> [StringViews.jl](https://github.com/JuliaStrings/StringViews.jl) can be used to provide lightweight AbstractString views of the token.
## Rules
A `Token` can be created by a `Rule`, which is created with the syntax:
```julia
starting_pattern --> ending_pattern
```
where:
1. `starting_pattern` is used to identify the beginning of a token.
2. `ending_pattern` is used to identify the end of a token.
> [!NOTE]
> Conceptually, a `Token` with indexes `i` and `j` is created if `starting_pattern == data[i]` and `j == findnext(ending_pattern, data, i + 1)`.
> In practice, `i` and `j` are determined by `TokenIterators.isfirst(starting_pattern, dataview)` and `TokenIterators._findnext(ending_pattern, dataview, i + 1)` (a separate function, to avoid type piracy on `Base.findnext`). Here `dataview` is a view of the data *after* the previous token, e.g. `@view data[prev_token.j + 1:end]`.
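
The matching logic in the note can be sketched in plain Julia for the simplest case, where both patterns are single bytes. `match_rule` is a hypothetical helper written for illustration. It is not part of TokenIterators and only approximates what `isfirst` and `_findnext` do.

```julia
# Hypothetical helper: try to match `start_byte --> end_byte` at the front of `data`.
# Returns the index range of the token, or `nothing` if the rule does not apply.
function match_rule(data::AbstractVector{UInt8}, start_byte::UInt8, end_byte::UInt8)
    isempty(data) && return nothing
    first(data) == start_byte || return nothing   # starting pattern must match the first byte
    j = findnext(==(end_byte), data, 2)            # search for the end *after* the first byte
    isnothing(j) && return nothing
    return 1:j
end

data = b"<tag> text"
rng = match_rule(data, UInt8('<'), UInt8('>'))   # range covering "<tag>"
rest = @view data[last(rng) + 1:end]             # view after the token, used for the next match
```

Taking a view of the remainder rather than copying keeps each step allocation-free in this sketch, which matches the view-based description in the note.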
## An Example: JSONTokens
- Here is the full implementation of `TokenIterators.JSONTokens`:
```julia
struct JSONTokens{T <: Data} <: TokenIterator{T, Symbol, Nothing}
    data::T
end
next(o::JSONTokens, n::Token) = @tryrules begin
    curly_open = '{' --> 1
    curly_close = '}' --> 1
    square_open = '[' --> 1
    square_close = ']' --> 1
    comma = ',' --> 1
    colon = ':' --> 1
    var"true" = 't' --> 4
    var"false" = 'f' --> 5
    var"null" = 'n' --> 4
    string = STRING  # '"' --> Unescaped('"')
    number = NUMBER  # ∈(b"-0123456789") --> ≺(∉(b"-+eE.0123456789"))
    whitespace = ASCII_WHITESPACE  # ∈(b" \t\r\n") --> ≺(∉(b" \t\r\n"))
end
```
- `next` is the core part of `Base.iterate(::TokenIterator)`.
- Here `n::Token` is a view into the data after the previous token, e.g. `Token(prev.data, prev.kind, prev.j + 1, length(prev.data), prev.state)`.
- The `@tryrules` macro will attempt to match the rules in order. If a rule matches, the token is created and the function returns.
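
To illustrate what "attempt to match the rules in order" means, here is a minimal hand-written sketch in plain Julia. The names `RULES`, `next_token`, and `tokenize` are hypothetical and are not what `@tryrules` generates. The sketch only mirrors the first-match-wins behavior described above for three of the rules.

```julia
# Hypothetical rule table: (kind, starting predicate, function returning the token length).
const WS = b" \t\r\n"
const RULES = (
    (:curly_open,  b -> b == UInt8('{'), v -> 1),
    (:curly_close, b -> b == UInt8('}'), v -> 1),
    (:whitespace,  b -> b in WS,
        # length of the leading run of whitespace bytes
        v -> something(findfirst(b -> !(b in WS), v), length(v) + 1) - 1),
)

# Try each rule in order; the first rule whose starting predicate matches wins.
function next_token(v::AbstractVector{UInt8})
    for (kind, starts, len) in RULES
        starts(first(v)) && return (kind, len(v))
    end
    return (:unknown, 1)   # fallback so the caller always advances
end

# Repeatedly match against a view of the data after the previous token.
function tokenize(data::AbstractVector{UInt8})
    out = Tuple{Symbol, UnitRange{Int}}[]
    i = 1
    while i <= length(data)
        kind, n = next_token(@view data[i:end])
        push!(out, (kind, i:i + n - 1))
        i += n
    end
    return out
end

toks = tokenize(b"{  }")
```

Because rules are tried top to bottom, more specific rules should come before more general ones in a table like this.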
## Mini-DSL
Patterns in the DSL determine how the token is identified in `data::AbstractVector{UInt8}`.
### Starting Patterns
| Type | Example | Symbol | Tab-Completion | Description |
|------|---------|--------|----------------|-------------|
| `UInt8` | `0x7b` | | | `x == data[i]` |
| `Char` | `'{'` | | | `UInt8(x) == data[i]` |
| `Function` | `∈(b" \t\r\n")` | | | `f(data[1]) == true` |
| `Tuple` | `(a,b)` | | | `data` starts with sequence of patterns |
| `AbstractVector{UInt8}` | `b"null"` | | | `x == data[1:length(x)]` |
| `UseStringView` | `𝑠('😄')` | `𝑠` | `\its` | Use pattern with `data::StringView` |
### Ending Patterns
| Type | Example | Symbol | Tab-Completion | Description |
|------|---------|--------|----------------|-------------|
| `UInt8` | `0x7b` | | | `findnext(==(x), data, i + 1)` |
| `Char` | `'{'` | | | `findnext(==(UInt8(x)), data, i + 1)` |
| `Function` | `∈(b" \t\r\n")` | | | `findnext(f, data, i + 1)` |
| `Int` | `1` | | | `return x` (fixed length) |
| `Tuple` | `(a,b)` | | | return range in which indexes of `data` match sequence of patterns |
| `Last` | `Last((a,b))` | `→` | `\rightarrow` | return last index of match indexes |
| `First` | `First((a,b))` | `←` | `\leftarrow` | return first index of match indexes |
| `Before` | `Before(x -> x == a)` | `≺` | `\prec` | return index before the match index |
| `UseStringView` | `𝑠(isascii)` | `𝑠` | `\its` | Use pattern with `data::StringView` |
| `Not` | `¬('\\')` | `¬` | `\neg` | Matching anything but the given pattern |
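
As a rough guide to three rows of the table (`Int`, `Function`, and `Before`), the sketch below writes them as plain Julia closures that return the last index of a token starting at index `i`. The helper names are hypothetical, and the end-of-data behavior of `before_match` is an assumption rather than documented package behavior.

```julia
# Hypothetical helpers mirroring three ending patterns; not part of TokenIterators.
fixed_length(n::Int) = (data, i) -> n                       # `Int`: token has a fixed length
first_match(f) = (data, i) -> findnext(f, data, i + 1)      # `Function`: index of first match after i
before_match(f) = function (data, i)                        # `Before`: index just before the match
    j = findnext(f, data, i + 1)
    # Assumption: if nothing matches, the token runs to the end of the data.
    return isnothing(j) ? lastindex(data) : j - 1
end

not_number_byte(b) = b ∉ b"-+eE.0123456789"

data = b"-1e-7}"
j = before_match(not_number_byte)(data, 1)       # last index of the number token
num = String(data[1:j])

k = first_match(==(UInt8(']')))(b"[1,2]", 1)     # index of the closing bracket
m = fixed_length(4)(b"true,", 1)                 # fixed-length token such as `true`
```

The difference between `first_match` and `before_match` is whether the matched byte belongs to the token: a closing bracket does, while the delimiter that follows a number does not.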
### Mini-DSL Examples
The `STRING`, `NUMBER`, and `ASCII_WHITESPACE` patterns from the JSONTokens example are interpreted as:
```julia
# STRING: Token begins with UInt8('"') and ends at UInt8('"') (excluding escaped '"')
'"' --> Unescaped('"')
# NUMBER: Token begins with any byte in b"-0123456789" and ends at the index before the first byte that is not in b"-+eE.0123456789"
∈(b"-0123456789") --> ≺(∉(b"-+eE.0123456789"))
# ASCII_WHITESPACE: Token begins with any byte in b" \t\r\n" and ends at the index before the first byte that is not in b" \t\r\n"
∈(b" \t\r\n") --> ≺(∉(b" \t\r\n"))
```
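The `Unescaped('"')` ending pattern is the least obvious of the three. One way to read it, sketched here in plain Julia, is "find the next quote that is not consumed by a preceding backslash". `find_unescaped_quote` is a hypothetical function written for illustration, and the package's implementation may handle escapes differently.

```julia
# Hypothetical equivalent of `'"' --> Unescaped('"')`: return the index of the closing
# quote, skipping any quote that is escaped by a backslash.
function find_unescaped_quote(data::AbstractVector{UInt8}, start::Int = 2)
    escaped = false
    for k in start:lastindex(data)
        b = data[k]
        if escaped
            escaped = false            # this byte is consumed by the preceding backslash
        elseif b == UInt8('\\')
            escaped = true
        elseif b == UInt8('"')
            return k
        end
    end
    return nothing                     # no closing quote found
end

# Bytes of the text: "a\"b" rest
data = codeunits("\"a\\\"b\" rest")
j = find_unescaped_quote(data)
str = String(data[1:j])                # the full string token, quotes included
```

Tracking a single `escaped` flag handles runs of backslashes correctly: in `\\"` the second backslash is consumed by the first, so the quote still closes the string.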
## Performance
In the benchmark below, tokenizing a 3.6 MiB JSON file takes about 10 ms (median) with zero allocations:
```julia
using TokenIterators, BenchmarkTools
versioninfo()
# Julia Version 1.11.3
# Commit d63adeda50d (2025-01-21 19:42 UTC)
# Build Info:
# Official https://julialang.org/ release
# Platform Info:
# OS: macOS (arm64-apple-darwin24.0.0)
# CPU: 10 × Apple M1 Pro
# WORD_SIZE: 64
# LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)
# Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
data = read(download("https://github.com/plotly/plotly.js/raw/v3.0.1/dist/plot-schema.json"));
Base.format_bytes(length(data))
# "3.648 MiB"
t = JSONTokens(data)
# JSONTokens (3824728-element Vector{UInt8})
@benchmark sum(t.kind == :string for t in $t)
# BenchmarkTools.Trial: 497 samples with 1 evaluation per sample.
# Range (min … max): 9.834 ms … 31.878 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 10.001 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 10.059 ms ± 993.690 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▁▁▄▄▂▂▄▁ ▃▄ ▂▄▅█▁▄▃ ▄▃▃▆▂▁▂▂
# ▆▃████████▆▆▅▆██████████▆█████████▆▅▆▆▆▆▅▆▃▃▄▃▃▅▄▃▅▁▃▃▃▃▃▁▃▃ ▅
# 9.83 ms Histogram: frequency by time 10.3 ms <
# Memory estimate: 0 bytes, allocs estimate: 0.
```