{"id":50708497,"url":"https://github.com/fraware/cta-benchmark","last_synced_at":"2026-06-09T13:31:55.546Z","repository":{"id":353050125,"uuid":"1217361126","full_name":"fraware/cta-benchmark","owner":"fraware","description":"CTA-Bench: research benchmark and toolkit for studying how well systems turn problem descriptions and reference code into Lean 4 proof obligations, and how faithful those obligations are to the intended algorithm.","archived":false,"fork":false,"pushed_at":"2026-05-27T16:41:05.000Z","size":2298,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-27T18:23:28.770Z","etag":null,"topics":["ai-evaluation","autoformalization","benchmark","evaluation-methodology","formal-verification","lean","program-verification","semantic-faithfulness","theorem-proving"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fraware.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":"docs/maintainers.md","copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-21T20:08:08.000Z","updated_at":"2026-05-27T16:41:09.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/fraware/cta-benchmark","commit_stats":null,"previous_names":["fraware/cta-benchmark"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/fraware/cta-benchmark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fraware%2Fcta-benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fraware%2Fcta-benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fraware%2Fcta-benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fraware%2Fcta-benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fraware","download_url":"https://codeload.github.com/fraware/cta-benchmark/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fraware%2Fcta-benchmark/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34110011,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-09T02:00:06.510Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-evaluation","autoformalization","benchmark","evaluation-methodology","formal-verification","lean","program-verification","semantic-faithfulness","theorem-proving"],"created_at":"2026-06-09T13:31:54.636Z","updated_at":"2026-06-09T13:31:55.541Z","avatar_url":"https://github.com/fraware.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n\u003cpre\u003e\n###############################################################################################\n#                                                                                             #\n#                      ____ _____  _        ____                  _                           #\n#                     / ___|_   _|/ \\      | __ )  ___ _ __   ___| |__                        #\n#                    | |     | | / _ \\     |  _ \\ / _ \\ '_ \\ / __| '_ \\                       #\n#                    | |___  | |/ ___ \\    | |_) |  __/ | | | (__| | | |                      #\n#                     \\____| |_/_/   \\_\\   |____/ \\___|_| |_|\\___|_| |_|                      #\n#                                                                                             #\n#                                                                                             #\n###############################################################################################\n\u003c/pre\u003e\n\n**Classical Algorithm Tasks (CTA)** — a research benchmark and toolkit for studying how well systems turn problem descriptions and reference code into **Lean 4** proof obligations, and how faithful those obligations are to the intended algorithm.\n\n\u003c/div\u003e\n\n---\n\n## In one minute\n\n| | |\n|--|--|\n| **What** | Curated programming tasks (sorting, graphs, dynamic programming, …), reference implementations, checks, and evaluation outputs. |\n| **Why** | Measure and compare **semantic faithfulness** and related properties—not a general theorem prover or a full verification pipeline. |\n| **How** | Rust tooling for runs and validation, Lean for checking generated obligations, Python scripts for tables and paper artifacts. |\n\n**Scale (current headline release):** 84 task instances · 12 algorithm families · evaluation artifacts shipped in-repo for reproducibility.\n\n---\n\n## Who this is for\n\n- **Researchers** reproducing or extending benchmark numbers or submitting follow-up work.  \n- **Reviewers** checking claims against artifacts (see [Reproducibility](docs/reproducibility.md), [Reviewer map](docs/reviewer_map.md), and [REPRODUCE.md](REPRODUCE.md)).  \n- **Contributors** improving tasks, tooling, or docs ([Contributing](CONTRIBUTING.md), [Code of conduct](CODE_OF_CONDUCT.md)).\n\n---\n\n## Quick start\n\n**Requirements:** Rust **1.88.0** ([`rust-toolchain.toml`](rust-toolchain.toml)) · Lean **4.12.0** ([`lean/lean-toolchain`](lean/lean-toolchain)), e.g. via [elan](https://github.com/leanprover/elan).\n\n```bash\ncargo build --workspace\ncargo test --workspace --all-targets\n```\n\n**Validate the benchmark and schemas** (after build):\n\n```bash\ncargo run -p cta_cli -- validate schemas\ncargo run -p cta_cli -- validate benchmark --version v0.3 --release\n```\n\n**Check Lean scaffolds:**\n\n```bash\ncd lean \u0026\u0026 lake build\n```\n\nFor a full command reference (generate, experiments, metrics, reports), see [**Contributing**](CONTRIBUTING.md) and [**Architecture**](docs/architecture.md).\n\n---\n\n## Reproducing paper-ready results\n\nHeadline tables and checks are produced by an **ordered** script pipeline. Do not skip steps or mix partial recipes—counts and strictness guarantees depend on the full sequence.\n\n**Start here:** [`REPRODUCE.md`](REPRODUCE.md) · one-page env pins: [`docs/reproducibility.md`](docs/reproducibility.md) · artifact meanings: [`docs/reviewer_map.md`](docs/reviewer_map.md)\n\n**Gate scripts** (run from repository root; order matters—see `REPRODUCE.md` and `scripts/run_paper_readiness_gate.*`):\n\n```bash\npython scripts/materialize_v03_adjudication_artifacts.py\npython scripts/materialize_repair_hotspot_artifacts.py\npython scripts/reproduce_agreement_report.py\npython scripts/implement_evidence_hardening.py\npython scripts/repair_counterfactual_metrics.py\npython scripts/ci_reviewer_readiness.py\npython scripts/check_paper_claim_sources.py\n```\n\n**Shell helpers:** `scripts/run_paper_readiness_gate.ps1` or `scripts/run_paper_readiness_gate.sh` · submission checks: `scripts/verify_submission_readiness.ps1` / `scripts/verify_submission_readiness.sh`\n\n**NeurIPS 2026 E\u0026D (Hugging Face):** public dataset card at\n[`fraware/cta-bench`](https://huggingface.co/datasets/fraware/cta-bench). Build and upload\nfrom a frozen branch with `pip install -r requirements-hf.txt`, `hf auth login` (or `HF_TOKEN`), then\n`make hf-release` (see [`docs/reproducibility.md`](docs/reproducibility.md) for the ordered steps).\n\n---\n\n## What this project does *not* claim\n\n- Full formal verification of arbitrary Rust programs  \n- Solving all of interactive theorem proving  \n- Ranking commercial models as a product leaderboard  \n- Guaranteeing semantic correctness from Lean checks alone  \n\nFor precise limits and threats to validity, read [**Architecture**](docs/architecture.md) (non-goals and scope) and [**Evaluation contract**](docs/evaluation_contract.md).\n\n---\n\n## Repository layout\n\n```\ncta-benchmark/\n├── benchmark/          # Versioned tasks (v0.1 … v0.3), instances, annotations\n├── configs/            # Experiments, prompts, provider settings\n├── schemas/            # JSON schemas for artifacts\n├── crates/             # Rust library and CLI (`cta` binary)\n├── lean/               # Lean 4 project tied to tasks\n├── scripts/            # Python automation for tables and gates\n├── docs/               # Specifications, evaluation contract, paper maps\n├── tests/              # Integration and fixtures\n├── runs/               # Local experiment outputs (gitignored by default)\n└── reports/            # Generated reports (gitignored by default)\n```\n\n**Rust packages** (workspace):\n\n| Package | Role |\n|---------|------|\n| `cta_core` | Shared types, IDs, versions |\n| `cta_schema` | Load and validate JSON against schemas |\n| `cta_benchmark` | Load tasks, lint, build manifests |\n| `cta_rust_extract` | Signals from reference Rust code |\n| `cta_generate` | Build candidate obligations from configs |\n| `cta_lean` | Write Lean files, run checks |\n| `cta_behavior` | Behavioral tests against specs |\n| `cta_annotations` | Human and machine annotation flows |\n| `cta_metrics` | Deterministic metrics |\n| `cta_reports` | Tables and exports |\n| `cta_cli` | Command-line entrypoint `cta` |\n\n---\n\n## Versions of the benchmark\n\n- **v0.1** — Small pilot (immutable once released).  \n- **v0.2** — Paper-oriented track with richer annotation and review packets.  \n- **v0.3** — Larger grid (84 instances); primary surface for current headline numbers.\n\nReleased task definitions are **not rewritten in place**; new work adds a new version folder. Details: [`docs/architecture.md`](docs/architecture.md), [`CONTRIBUTING.md`](CONTRIBUTING.md).\n\n---\n\n## Documentation index\n\n| Document | Contents |\n|----------|----------|\n| [Architecture](docs/architecture.md) | Components, data flow, and non-goals |\n| [Evaluation contract](docs/evaluation_contract.md) | Metrics and definitions |\n| [Reviewer map](docs/reviewer_map.md) | Paper sections ↔ files ↔ commands |\n| [Annotation manual](docs/annotation_manual.md) | Rubric and review workflow |\n| [Reproducibility](docs/reproducibility.md) | Toolchain pins and regen index |\n| [REPRODUCE.md](REPRODUCE.md) | Ordered regeneration checklist |\n| [Maintainers](docs/maintainers.md) | Security contact |\n| [Citation](CITATION.cff) | Cite this repository |\n| [Security](SECURITY.md) | Reporting issues, scope, scans |\n| [Contributing](CONTRIBUTING.md) | PR expectations and deep CLI |\n| [Code of conduct](CODE_OF_CONDUCT.md) | Community norms |\n\n---\n\n## License\n\nMIT — see [`LICENSE`](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffraware%2Fcta-benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffraware%2Fcta-benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffraware%2Fcta-benchmark/lists"}