https://github.com/jubnzv/tsgen
Tree-sitter based fuzzing corpus generator.
https://github.com/jubnzv/tsgen
fuzzing fuzzing-compilers structure-aware-fuzzing tree-sitter
Last synced: about 2 months ago
JSON representation
Tree-sitter based fuzzing corpus generator.
- Host: GitHub
- URL: https://github.com/jubnzv/tsgen
- Owner: jubnzv
- License: apache-2.0
- Created: 2026-04-05T12:56:02.000Z (2 months ago)
- Default Branch: master
- Last Pushed: 2026-04-17T15:37:19.000Z (about 2 months ago)
- Last Synced: 2026-04-17T17:34:16.968Z (about 2 months ago)
- Topics: fuzzing, fuzzing-compilers, structure-aware-fuzzing, tree-sitter
- Language: Rust
- Homepage:
- Size: 41 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# tsgen
A grammar-based program generator driven by tree-sitter grammars. Point it at
a `grammar.json` from any tree-sitter parser and it produces a corpus of
syntactically-structured programs in that language.
It is useful for setting up fuzzing or differential-testing campaigns and building fuzzing corpora.
The tool only follows the tree-sitter grammar, so output often has syntax or semantic errors the target compiler rejects — useful for fuzzing, not for compilable code.
## What it does
Given a tree-sitter grammar, `tsgen`:
1. Walks the grammar rules recursively from the root.
2. At each `CHOICE` node, picks one alternative
3. At each `REPEAT`/`REPEAT1`, picks a random count up to `--max-repeat`.
4. At each `PATTERN` (a terminal regex), samples from a pre-built dictionary
of candidates — or, if you give it one, from your own identifier dict
and/or harvested values from real source code.
5. Optionally validates each candidate program with the compiled tree-sitter
parser and drops anything that doesn't parse.
6. Tracks per-choice coverage and keeps going until either `--count` programs
are collected and `--coverage-target` is met, or `--max-attempts` runs out.
The output is a directory of files, one program per file.
## Quick start
Minimum inputs: a tree-sitter `grammar.json` and optionally the compiled
parser `.so` for validation.
```
cargo run --release -- \
--grammar path/to/grammar.json \
--parser path/to/parser.so \
--count 200 \
--output-dir corpus \
--ext .lang
```
`--parser` argument is optional, but without it the tool will generate more garbage.
## How generation actually works
### Min-depth pre-pass
Before generation, tsgen computes the minimum syntax-tree depth required to
finish expanding every rule. During generation, when the current depth
approaches `--max-depth`, the rule-expander avoids `CHOICE` alternatives and
`REPEAT` expansions that would blow past the budget. This is what keeps
generation from infinitely recursing into `expression → binary_op →
expression → ...`.
### Terminal dictionary
tsgen scans the grammar for every `PATTERN` regex, classifies it as one of
`{Identifier, DecimalNumber, HexNumber, StringLit, Whitespace, Unknown}`, and
pre-builds a candidate list using a small set of built-in defaults plus
`rand_regex` for anything weird. Identifier candidates are filtered against
the grammar's keyword set so you don't get `let = let`.
### Harvest pool and dict
For realistic output you can feed tsgen real material:
- **`--dict `** (repeatable): newline-delimited identifier list. Blank
lines and `#` comments are skipped. Loaded directly into the
`Identifier` pool, no regex filtering, no length minimum.
- **`--harvest-dir `** (repeatable): recursively scans files and scrapes
identifiers (≥3 chars), decimals, hex, and string literals into their
respective pools. Use `--harvest-ext .cairo` or a glob like
`--harvest-ext "generated_*.move"` to filter files.
Both sources merge into one pool. At every terminal-expansion site, tsgen
flips a coin weighted by `--harvest-weight` (default 0.5):
- **heads** → classify the pattern, look up that kind in the pool, return a
random value.
- **tails** (or pool miss for that kind) → fall back to the pre-built
candidate list.
Per-slot, independent. Higher weight = more realistic-looking output, less
regex-weird gibberish.
### Validation loop
If `--parser` is provided, every generated program is run through the
tree-sitter parser. Programs with parse errors are discarded and do **not**
count toward `--count` or `valid_coverage`. Unvalidated attempts still
contribute to *exploration* coverage.
### Coverage
Two counters are tracked:
- **exploration coverage** — which CHOICE alternatives have been *attempted*
across all generated programs (valid or not).
- **valid coverage** — which alternatives have been reached inside programs
that actually parsed.
The loop stops when both `programs.len() >= --count` **and** `valid_coverage
>= --coverage-target` (default 0.95). Set `--coverage-target 0.0` to make
`--count` a hard ceiling and stop the moment you have enough programs.
## CLI flags
### Core
| flag | default | notes |
|---|---|---|
| `--grammar ` | required | path to `grammar.json` |
| `--parser ` | *(none)* | compiled tree-sitter `.so`; without it, no validation |
| `--count ` | 100 | minimum valid programs to collect |
| `--output-dir ` | `corpus` | where files get written |
| `--ext <.ext>` | `.txt` | file extension for generated files |
| `--seed ` | 0 | RNG seed for reproducibility |
| `--dry-run` | off | print programs to stdout, don't write files |
| `--dump-grammar` | off | dump rule/min-depth/terminal debug info and exit |
### Shape of the generated tree
| flag | default | notes |
|---|---|---|
| `--max-depth ` | 15 | upper bound on syntax-tree depth |
| `--max-repeat ` | 5 | upper bound for `REPEAT` expansions |
| `--complexity-bias ` | 0.0 | 0 = uniform CHOICE, 1 = strongly prefer complex alternatives when there's depth budget |
| `--top-level-rule ` | *(none)* | repeatable; forces the top-level expansion to pick only from these rules. e.g. skip expression-statements at file scope in C/Rust-like grammars |
### Terminal content
| flag | default | notes |
|---|---|---|
| `--dict ` | *(none)* | repeatable; newline-delimited identifier list |
| `--harvest-dir ` | *(none)* | repeatable; scrapes ids/numbers/strings from real source |
| `--harvest-ext ` | *(any)* | extension (`.cairo`) or glob (`"generated_*.move"`) |
| `--harvest-weight ` | 0.5 | probability of pulling from dict/harvest vs. built-ins |
| `--unicode` | off | allow non-ASCII in output (default replaces non-ASCII with `z`) |
### Stopping conditions
| flag | default | notes |
|---|---|---|
| `--coverage-target ` | 0.95 | valid-coverage ratio that triggers early stop once `--count` is also met |
| `--max-attempts ` | 10000 | hard cap on generation attempts (valid + discarded + dupes) |
| `--no-cleanup` | off | disable whitespace post-processing |
## Limitations
- **External tokens (`externals: [...]`) are opaque.** Their matching logic
lives in a hand-written `scanner.c`; we only see the symbol name and emit
`` at those slots. Affects Python, Ruby, Haskell. Cairo,
Yul, most Rust-family DSLs have empty externals and are unaffected.
- **Parser-only grammar fields are ignored:** `conflicts`, `precedences`,
`inline`, `supertypes`, `word`, `reserved`. The compiled `.so` still
enforces them during validation.
- **Regex is JS-flavoured.** We strip lookarounds and backrefs before
handing patterns to `rand_regex`. Unicode-property escapes and weirder
JS-only constructs may fall through to the `"UNKNOWN"` fallback.
- **Syntactic, not semantic.** Output breaks type rules, scoping, and
references — intentional, exercises later compiler stages.
- **`--count` is a floor, not a ceiling.** Loop keeps going until
`coverage-target` is also met. Pass `--coverage-target 0.0` for hard stop.