{"id":49101054,"url":"https://github.com/jubnzv/tsgen","last_synced_at":"2026-04-20T23:33:22.864Z","repository":{"id":352046365,"uuid":"1201990820","full_name":"jubnzv/tsgen","owner":"jubnzv","description":"Tree-sitter based fuzzing corpus generator.","archived":false,"fork":false,"pushed_at":"2026-04-17T15:37:19.000Z","size":42,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-04-17T17:34:16.968Z","etag":null,"topics":["fuzzing","fuzzing-compilers","structure-aware-fuzzing","tree-sitter"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jubnzv.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-05T12:56:02.000Z","updated_at":"2026-04-17T15:37:23.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/jubnzv/tsgen","commit_stats":null,"previous_names":["jubnzv/tsgen"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/jubnzv/tsgen","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jubnzv%2Ftsgen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jubnzv%2Ftsgen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jubnzv%2Ftsgen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jubnzv%2Ftsgen/manifests","owner_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jubnzv","download_url":"https://codeload.github.com/jubnzv/tsgen/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jubnzv%2Ftsgen/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32070656,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T21:26:33.338Z","status":"ssl_error","status_checked_at":"2026-04-20T21:26:22.081Z","response_time":94,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fuzzing","fuzzing-compilers","structure-aware-fuzzing","tree-sitter"],"created_at":"2026-04-20T23:33:22.317Z","updated_at":"2026-04-20T23:33:22.856Z","avatar_url":"https://github.com/jubnzv.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# tsgen\n\nA grammar-based program generator driven by tree-sitter grammars. 
Point it at\na `grammar.json` from any tree-sitter parser and it produces a corpus of\nsyntactically-structured programs in that language.\n\nIt is useful for setting up fuzzing or differential-testing campaigns and building fuzzing corpora.\n\nThe tool only follows the tree-sitter grammar, so output often has syntax or semantic errors the target compiler rejects — useful for fuzzing, not for compilable code.\n\n## What it does\n\nGiven a tree-sitter grammar, `tsgen`:\n\n1. Walks the grammar rules recursively from the root.\n2. At each `CHOICE` node, picks one alternative.\n3. At each `REPEAT`/`REPEAT1`, picks a random count up to `--max-repeat`.\n4. At each `PATTERN` (a terminal regex), samples from a pre-built dictionary\n   of candidates — or, if you give it one, from your own identifier dict\n   and/or harvested values from real source code.\n5. Optionally validates each candidate program with the compiled tree-sitter\n   parser and drops anything that doesn't parse.\n6. Tracks per-choice coverage and keeps going until either `--count` programs\n   are collected and `--coverage-target` is met, or `--max-attempts` runs out.\n\nThe output is a directory of files, one program per file.\n\n## Quick start\n\nMinimum inputs: a tree-sitter `grammar.json` and optionally the compiled\nparser `.so` for validation.\n\n```\ncargo run --release -- \\\n  --grammar path/to/grammar.json \\\n  --parser  path/to/parser.so \\\n  --count 200 \\\n  --output-dir corpus \\\n  --ext .lang\n```\n\nThe `--parser` argument is optional, but without it nothing is validated, so the corpus will contain more programs that do not parse.\n\n\n## How generation actually works\n\n### Min-depth pre-pass\nBefore generation, tsgen computes the minimum syntax-tree depth required to\nfinish expanding every rule. During generation, when the current depth\napproaches `--max-depth`, the rule-expander avoids `CHOICE` alternatives and\n`REPEAT` expansions that would blow past the budget. 
This is what keeps\ngeneration from infinitely recursing into `expression → binary_op →\nexpression → ...`.\n\n### Terminal dictionary\ntsgen scans the grammar for every `PATTERN` regex, classifies it as one of\n`{Identifier, DecimalNumber, HexNumber, StringLit, Whitespace, Unknown}`, and\npre-builds a candidate list using a small set of built-in defaults plus\n`rand_regex` for anything weird. Identifier candidates are filtered against\nthe grammar's keyword set so you don't get `let = let`.\n\n### Harvest pool and dict\nFor realistic output you can feed tsgen real material:\n\n- **`--dict \u003cfile\u003e`** (repeatable): newline-delimited identifier list. Blank\n  lines and `#` comments are skipped. Loaded directly into the\n  `Identifier` pool, no regex filtering, no length minimum.\n- **`--harvest-dir \u003cdir\u003e`** (repeatable): recursively scans files and scrapes\n  identifiers (≥3 chars), decimals, hex, and string literals into their\n  respective pools. Use `--harvest-ext .cairo` or a glob like\n  `--harvest-ext \"generated_*.move\"` to filter files.\n\nBoth sources merge into one pool. At every terminal-expansion site, tsgen\nflips a coin weighted by `--harvest-weight` (default 0.5):\n\n- **heads** → classify the pattern, look up that kind in the pool, return a\n  random value.\n- **tails** (or pool miss for that kind) → fall back to the pre-built\n  candidate list.\n\nPer-slot, independent. Higher weight = more realistic-looking output, less\nregex-weird gibberish.\n\n### Validation loop\nIf `--parser` is provided, every generated program is run through the\ntree-sitter parser. Programs with parse errors are discarded and do **not**\ncount toward `--count` or `valid_coverage`. 
Unvalidated attempts still\ncontribute to *exploration* coverage.\n\n### Coverage\nTwo counters are tracked:\n\n- **exploration coverage** — which CHOICE alternatives have been *attempted*\n  across all generated programs (valid or not).\n- **valid coverage** — which alternatives have been reached inside programs\n  that actually parsed.\n\nThe loop stops when both `programs.len() \u003e= --count` **and** `valid_coverage\n\u003e= --coverage-target` (default 0.95). Set `--coverage-target 0.0` to make\n`--count` a hard ceiling and stop the moment you have enough programs.\n\n## CLI flags\n\n### Core\n| flag | default | notes |\n|---|---|---|\n| `--grammar \u003cfile\u003e` | required | path to `grammar.json` |\n| `--parser \u003cfile\u003e` | *(none)* | compiled tree-sitter `.so`; without it, no validation |\n| `--count \u003cN\u003e` | 100 | minimum valid programs to collect |\n| `--output-dir \u003cdir\u003e` | `corpus` | where files get written |\n| `--ext \u003c.ext\u003e` | `.txt` | file extension for generated files |\n| `--seed \u003cN\u003e` | 0 | RNG seed for reproducibility |\n| `--dry-run` | off | print programs to stdout, don't write files |\n| `--dump-grammar` | off | dump rule/min-depth/terminal debug info and exit |\n\n### Shape of the generated tree\n| flag | default | notes |\n|---|---|---|\n| `--max-depth \u003cN\u003e` | 15 | upper bound on syntax-tree depth |\n| `--max-repeat \u003cN\u003e` | 5 | upper bound for `REPEAT` expansions |\n| `--complexity-bias \u003cf\u003e` | 0.0 | 0 = uniform CHOICE, 1 = strongly prefer complex alternatives when there's depth budget |\n| `--top-level-rule \u003cname\u003e` | *(none)* | repeatable; forces the top-level expansion to pick only from these rules. e.g. 
skip expression-statements at file scope in C/Rust-like grammars |\n\n### Terminal content\n| flag | default | notes |\n|---|---|---|\n| `--dict \u003cfile\u003e` | *(none)* | repeatable; newline-delimited identifier list |\n| `--harvest-dir \u003cdir\u003e` | *(none)* | repeatable; scrapes ids/numbers/strings from real source |\n| `--harvest-ext \u003cfilter\u003e` | *(any)* | extension (`.cairo`) or glob (`\"generated_*.move\"`) |\n| `--harvest-weight \u003cf\u003e` | 0.5 | probability of pulling from dict/harvest vs. built-ins |\n| `--unicode` | off | allow non-ASCII in output (default replaces non-ASCII with `z`) |\n\n### Stopping conditions\n| flag | default | notes |\n|---|---|---|\n| `--coverage-target \u003cf\u003e` | 0.95 | valid-coverage ratio that triggers early stop once `--count` is also met |\n| `--max-attempts \u003cN\u003e` | 10000 | hard cap on generation attempts (valid + discarded + dupes) |\n| `--no-cleanup` | off | disable whitespace post-processing |\n\n## Limitations\n\n- **External tokens (`externals: [...]`) are opaque.** Their matching logic\n  lives in a hand-written `scanner.c`; we only see the symbol name and emit\n  `\u003cMISSING:name\u003e` at those slots. Affects Python, Ruby, Haskell. Cairo,\n  Yul, most Rust-family DSLs have empty externals and are unaffected.\n- **Parser-only grammar fields are ignored:** `conflicts`, `precedences`,\n  `inline`, `supertypes`, `word`, `reserved`. The compiled `.so` still\n  enforces them during validation.\n- **Regex is JS-flavoured.** We strip lookarounds and backrefs before\n  handing patterns to `rand_regex`. Unicode-property escapes and weirder\n  JS-only constructs may fall through to the `\"UNKNOWN\"` fallback.\n- **Syntactic, not semantic.** Output breaks type rules, scoping, and\n  references — intentional, exercises later compiler stages.\n- **`--count` is a floor, not a ceiling.** Loop keeps going until\n  `coverage-target` is also met. 
Pass `--coverage-target 0.0` for a hard stop.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjubnzv%2Ftsgen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjubnzv%2Ftsgen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjubnzv%2Ftsgen/lists"}