{"id":48796958,"url":"https://github.com/aallan/vera-bench","last_synced_at":"2026-06-08T01:01:23.482Z","repository":{"id":347856562,"uuid":"1195419502","full_name":"aallan/vera-bench","owner":"aallan","description":"VeraBench: a benchmark suite for LLM code generation in Vera","archived":false,"fork":false,"pushed_at":"2026-06-03T10:16:44.000Z","size":2799,"stargazers_count":14,"open_issues_count":15,"forks_count":3,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-03T10:24:05.046Z","etag":null,"topics":["benchmark","code-generation","contracts","formal-verification","llm","programming-language","vera","verification","verified-programming"],"latest_commit_sha":null,"homepage":"https://veralang.dev","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aallan.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-29T16:48:00.000Z","updated_at":"2026-06-03T09:01:15.000Z","dependencies_parsed_at":null,"dependency_job_id":"34dcf2d0-46bf-4da8-8a45-d897e5d597aa","html_url":"https://github.com/aallan/vera-bench","commit_stats":null,"previous_names":["aallan/vera-bench"],"tags_count":13,"template":false,"template_full_name":null,"purl":"pkg:github/aallan/vera-bench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aallan%2Fvera-bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aallan%2Fvera-bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aallan%2Fvera-bench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aallan%2Fvera-bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aallan","download_url":"https://codeload.github.com/aallan/vera-bench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aallan%2Fvera-bench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34043822,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-07T02:00:07.652Z","response_time":124,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","code-generation","contracts","formal-verification","llm","programming-language","vera","verification","verified-programming"],"created_at":"2026-04-14T00:01:19.712Z","updated_at":"2026-06-08T01:01:23.434Z","avatar_url":"https://github.com/aallan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# VeraBench\n\n[![VeraBench — Benchmarks for code the machines write](assets/vera-bench-social-preview.png)](https://veralang.dev)\n\n[![CI](https://github.com/aallan/vera-bench/actions/workflows/ci.yml/badge.svg)](https://github.com/aallan/vera-bench/actions/workflows/ci.yml)\n[![codecov](https://codecov.io/gh/aallan/vera-bench/graph/badge.svg)](https://codecov.io/gh/aallan/vera-bench)\n\nA benchmark for evaluating LLM code generation in [Vera](https://github.com/aallan/vera), a programming language designed for large language models (LLMs) to write.\n\n## Results\n\n![VeraBench v0.0.7 Results](assets/results-graph.png)\n\nResults from [VeraBench v0.0.7](https://github.com/aallan/vera-bench/releases/tag/v0.0.7) against [Vera v0.0.108](https://github.com/aallan/vera/releases/tag/v0.0.108) across 50 problems, 6 models, and 4 modes per model.\n\n### run_correct by model (Vera full-spec vs Python vs TypeScript)\n\n**Flagship tier:**\n\n| Model | Vera | Python | TypeScript |\n|-------|------|--------|------------|\n| **Kimi K2.5** | **100%** | 86% | 91% |\n| GPT-4.1 | 91% | 96% | 96% |\n| Claude Opus 4 | 88% | 96% | 96% |\n\n**Sonnet tier:**\n\n| Model | Vera | Python | TypeScript |\n|-------|------|--------|------------|\n| **Kimi K2 Turbo** | **83%** | 88% | 79% |\n| Claude Sonnet 4 | 79% | 96% | 88% |\n| GPT-4o | 78% | 93% | 83% |\n\n### Key findings\n\n**Kimi K2.5 writes perfect Vera code** — 100% run_correct on both full-spec and spec-from-NL modes, beating Python (86%) and TypeScript (91%). This is the first model where Vera is the best language across the board.\n\n**Three models beat TypeScript on Vera.** Kimi K2.5 (+9pp), Kimi K2 Turbo (+4pp), and in our [initial v0.0.4 benchmark](https://github.com/aallan/vera-bench/releases/tag/v0.0.4) Claude Sonnet 4 also beat TypeScript (83% vs 79%). The pattern is consistent across providers: Vera's mandatory contracts and typed slot references provide enough structure to compensate for zero training data.\n\n**Python remains the strongest target for most models.** Claude, OpenAI, and Moonshot all hit 86–96% run_correct on Python. The gap between Python and Vera varies from 0pp (Kimi K2.5 spec-from-NL: both 100%) to 17pp (Claude Sonnet 4: 96% vs 79%).\n\n**Spec-from-NL is the harder test.** When models must infer their own contracts from natural language, performance drops for most models — GPT-4.1 falls from 91% to 50%. But Kimi K2.5 holds at 100%, suggesting it has internalised Vera's type system well enough to author specifications from scratch.\n\n\u003e **Note:** These are single-run results. LLM outputs are non-deterministic — individual problems can flip between pass and fail across runs. The v0.0.4 Claude Sonnet 4 result (83% Vera, 79% TypeScript) shifted to 79%/88% in the v0.0.7 re-run, illustrating this variance. Stable rates will require [pass@k](https://arxiv.org/abs/2107.03374) evaluation with multiple trials. This is early days — 50 problems, one run per model.\n\n### Why this matters: zero training data\n\nNo LLM has ever been trained on Vera. There are no Vera examples on GitHub, no Stack Overflow answers, no tutorials — the language was created after these models' training cutoffs. Every token of Vera code in these results was written by a model that learned the language entirely from a single document ([SKILL.md](https://veralang.dev/SKILL.md)) provided in the prompt at evaluation time.\n\nPython and TypeScript, by contrast, are among the most heavily represented languages in LLM training data — billions of lines of code, documentation, and Q\u0026A. The fact that multiple models write *better* Vera than TypeScript despite this asymmetry suggests that language design matters more than training data volume. Vera's mandatory contracts, typed slot references, and explicit effect annotations give models enough structural guardrails that in-context instruction alone is sufficient — no pre-training required.\n\n## Overview\n\nVeraBench measures whether LLMs write better code in a language designed for them. Vera uses typed slot references instead of variable names, mandatory contracts, and explicit algebraic effects — all features that should make LLM-generated code more verifiable.\n\nThe benchmark covers five difficulty tiers:\n\n| Tier | Focus | What it tests |\n|------|-------|--------------|\n| 1 | Pure arithmetic | Basic syntax, `@T.n` slot references, simple contracts |\n| 2 | String \u0026 array ops | Built-in function discovery (`domain_verb` naming) |\n| 3 | ADTs \u0026 match | Data type definition, De Bruijn indices in match arms |\n| 4 | Recursion \u0026 termination | `decreases` clauses, Z3 verification |\n| 5 | Multi-function \u0026 effects | IO, State, Exn, effect propagation across functions |\n\nFor each problem, we measure:\n\n- **check@1** — Does the code pass `vera check` on first attempt?\n- **verify@1** — Does it pass `vera verify` (Z3 contract verification)?\n- **fix@1** — Given the error message, can the model fix it in one turn?\n- **run_correct** — Does execution produce the correct output?\n\nThe same problems are also run in Python, TypeScript, [Aver](https://github.com/jasisz/aver), and [AILANG](https://ailang.sunholo.com/) as baselines. AILANG and Aver are zero-training-data languages, providing additional data points alongside Vera for the language-design-vs-training-data thesis.\n\n\u003e **Cross-language comparison:** For cross-language headline rates, use the T1–T4 aggregate. Tier 5 tests Vera's algebraic effect handlers, which other languages solve with fundamentally different native idioms. See [#50](https://github.com/aallan/vera-bench/issues/50).\n\n## Prerequisites\n\n* Python 3.11+\n* Git\n* Node.js 22+ *(optional, for TypeScript baselines and generation)*\n* [Aver](https://github.com/jasisz/aver) *(optional, for Aver baselines and generation)*\n* [AILANG](https://ailang.sunholo.com/) *(optional, for AILANG baselines and generation)*\n\n## Installation\n\n```bash\ngit clone https://github.com/aallan/vera-bench.git\ncd vera-bench\npython -m venv .venv\nsource .venv/bin/activate\npip install -e \".[llm]\"\n```\n\nThe `[llm]` extra installs the Anthropic and OpenAI SDKs. Use `pip install -e .` if you only need validation (no model evaluation).\n\n### Install the Vera compiler\n\nThe `vera` command must be available on `$PATH`. Install it anywhere into the same environment, either from a local clone,\n\n```bash\npip install -e /path/to/vera          \n```\n\nor directly from GitHub.\n\n```bash\npip install git+https://github.com/aallan/vera.git   \n```\nAfterwards you should be able to print the Vera version from the terminal,\n\n```bash\nvera version   \n```\n\nthis should return v0.0.108 or later.\n\n## Quick start\n\nOnce Vera is installed you can run the benchmark from the terminal,\n\n```bash\n# Validate all 60 problems and canonical solutions\nvera-bench validate\n\n# Run benchmark against a model\nexport ANTHROPIC_API_KEY=sk-ant-...\nvera-bench run --model claude-sonnet-4-6\n\n# Run a single tier\nvera-bench run --model claude-sonnet-4-6 --tier 1\n\n# Run a single problem\nvera-bench run --model claude-sonnet-4-6 --problem VB-T1-001\n\n# Spec-from-NL mode (agent writes its own contracts)\nvera-bench run --model claude-sonnet-4-6 --mode spec-from-nl\n\n# Ask the same model to write Python, TypeScript, Aver, or AILANG for comparison\nvera-bench run --model claude-sonnet-4-6 --language python\nvera-bench run --model claude-sonnet-4-6 --language typescript\nvera-bench run --model claude-sonnet-4-6 --language aver\nvera-bench run --model claude-sonnet-4-6 --language ailang\n\n# Slow model? Dispatch problems concurrently (default is sequential)\nvera-bench run --model kimi-k2.5 --parallel 10\n\n# Run canonical baselines as a reference\nvera-bench baselines\nvera-bench baselines --language typescript\nvera-bench baselines --language aver\nvera-bench baselines --language ailang\n\n# Generate a combined report\nvera-bench report results/\n\n# Or run the full benchmark suite (all 8 targets) with one command\npython scripts/run_full_benchmark.py\n```\n\nSupported providers: [Anthropic](https://anthropic.com) (Claude), [OpenAI](https://openai.com) (GPT), [Kimi](https://platform.kimi.ai) (Moonshot), and [OpenRouter](https://openrouter.ai/) (used for AILANG-capable models). Set the appropriate API key environment variable (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `MOONSHOT_API_KEY`, or `OPENROUTER_API_KEY`).\n\nThe Vera language reference ([SKILL.md](https://veralang.dev/SKILL.md)) is fetched automatically from veralang.dev when running Vera benchmarks. To use a local copy instead (e.g., for testing unreleased language features):\n\n```bash\nvera-bench run --model claude-sonnet-4-6 --skill-md /path/to/SKILL.md\n```\n\n## Report generation\n\nRunning `vera-bench report results/` generates `results/summary.md` with a summary table, per-tier breakdowns, and per-problem detail. Each `vera-bench run` writes incremental JSONL results (one line per problem attempt), so partial runs are resumable and always reportable. Results files are in `.gitignore` — they are generated artifacts, not checked in.\n\n## Prior art\n\nVeraBench is inspired by:\n\n- [HumanEval](https://github.com/openai/human-eval) — 164 Python function completion problems\n- [MBPP](https://github.com/google-research/google-research/tree/master/mbpp) — 974 Python problems from natural language\n- [DafnyBench](https://github.com/sun-wendy/DafnyBench) — 782 Dafny verification annotation problems\n\nDafnyBench demonstrated that tracking verification success rates over time attracts genuine research attention — success rates went from 68% to 96% across model generations in under two years. VeraBench aims to create the same longitudinal story for a language designed from scratch for LLM code generation.\n\n## Citation\n\n```bibtex\n@software{verabench2026,\n  author = {Allan, Alasdair},\n  title = {VeraBench: a benchmark suite for LLM code generation in Vera},\n  year = {2026},\n  url = {https://github.com/aallan/vera-bench}\n}\n```\n\n## License\n\nVeraBench is licensed under the [MIT License](LICENSE).\n\nCopyright © 2026 Alasdair Allan\n\nPermission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faallan%2Fvera-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faallan%2Fvera-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faallan%2Fvera-bench/lists"}