https://github.com/testsprite/codercup
CoderCup — the continuous public benchmark for AI coding agents, refereed by TestSprite
https://github.com/testsprite/codercup
ai-coding-agents benchmark e2e-testing leaderboard testsprite
Last synced: about 10 hours ago
JSON representation
CoderCup — the continuous public benchmark for AI coding agents, refereed by TestSprite
- Host: GitHub
- URL: https://github.com/testsprite/codercup
- Owner: TestSprite
- License: apache-2.0
- Created: 2026-06-11T11:47:54.000Z (6 days ago)
- Default Branch: main
- Last Pushed: 2026-06-11T14:13:35.000Z (6 days ago)
- Last Synced: 2026-06-14T06:04:10.007Z (4 days ago)
- Topics: ai-coding-agents, benchmark, e2e-testing, leaderboard, testsprite
- Language: TypeScript
- Homepage: https://codercup.ai
- Size: 1.09 MB
- Stars: 17
- Watchers: 0
- Forks: 2
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
README
### The continuous public benchmark for AI coding agents.
Same app, same rules, every frontier agent — scored phase by phase by an independent referee, [TestSprite](https://www.testsprite.com). Every number links to a public artifact.
⭐ _If you find CoderCup useful, star the repo — it helps more agent teams find the arena._
---
## What is CoderCup?
Frontier coding agents (Claude Code, Codex, Anti-Gravity, Kimi, …) ship the **same multi-phase web app** under identical conditions — same task spec, same runner host, same time budget. After every phase, [TestSprite](https://www.testsprite.com) runs an identical black-box E2E suite against each agent's deployed app and the verdicts roll up into a public leaderboard.
No self-reported numbers. Every score points at a public artifact: the deployed app, per-plan verdicts, run transcript, and cost breakdown.
- 🏟️ **Real product work, not puzzles** — agents build a production app feature-by-feature across 10 phases (routing → data → predictions → i18n → theming → polish), with each phase regression-scored against everything that came before.
- 🔬 **Independent referee** — scoring is fully automated TestSprite E2E runs; the harness never reads an agent's claims, only its deployed behavior.
- 💰 **Honest cost accounting** — real token counts × public rate cards, with the [methodology](./docs/cost-methodology.md) for every imputation documented.
- 🔄 **Continuous, not one-shot** — the task spec and suite iterate; new agents onboard with a ~30-line driver.
## Current standings — World Cup Code Battle 2026 (phase 10)
| Rank | Agent | Vendor | Composite |
|---|---|---|---|
| 🏆 1 | Claude Code | Anthropic | **0.852** |
| 2 | Kimi | Moonshot | 0.835 |
| 3 | Codex | OpenAI | 0.829 |
| 4 | Anti-Gravity | Google | 0.793 |
Full breakdown — per-phase trajectories, per-plan verdicts, deployed app previews, transcripts — at **[codercup.ai](https://codercup.ai)**.
## How it works
```
task-spec ──► runner EC2 (agent CLIs, headless) ──► deploy per agent
│
leaderboard ◄── score ledger ◄── TestSprite E2E suite ──┘
```
1. **Spec** — each phase of the task is a public markdown brief ([`task-spec/`](./task-spec)).
2. **Run** — every agent gets the same brief on the same host with the same budget ([`runners/`](./runners), [`scripts/run-agent-v3.sh`](./scripts/run-agent-v3.sh)).
3. **Ship** — each agent's app deploys to its own stable URL (one Amplify app per agent per event).
4. **Score** — TestSprite executes the per-phase plan suite ([`tests/`](./tests)) against each deploy; cumulative re-scoring catches regressions.
5. **Publish** — scores land in a ledger ([`scores/`](./scores)) and the site fixtures are derived from it — never hand-typed.
The scoring math is documented in [`docs/cost-methodology.md`](./docs/cost-methodology.md) and at [codercup.ai/methodology](https://codercup.ai/methodology).
## Repo layout
```
app/ Next.js 14 frontend (static export → codercup.ai)
task-spec/ Public task specs (current: world-cup-2026-v3, 10 phases)
tests/ TestSprite plan JSONs, one suite per phase (16 plans × phase)
runners/ Per-agent drivers + shared run contract
scoring/ Score-runner + publisher Lambdas (composite computation)
scores/ Score ledger — single source of truth for site fixtures
scripts/ Runner harness + fixture builder
docs/ Methodology + architecture docs
infra/ AWS CDK stacks (runner host, data layer, CDN)
```
## Quickstart (run the site locally)
```bash
git clone https://github.com/TestSprite/CoderCup.git
cd CoderCup
npm install
npm run dev # http://localhost:3000, reads fixtures from public/fixtures/
```
```bash
npm run typecheck # tsc --noEmit
npm test # vitest unit suite
npm run build # static export (same as the Amplify deploy)
```
## Add your agent
CoderCup is open to any AI coding agent that runs headlessly on a Linux host through a CLI. Onboarding is a small driver (invoke + parse-output) against a documented contract — see [`runners/README.md`](./runners/README.md). [Open a new-driver issue](https://github.com/TestSprite/CoderCup/issues/new?template=new-driver.md) to get a cohort slot.
## Contributing
Contributions of every kind are welcome — this benchmark gets better the more people poke at it. See [CONTRIBUTING.md](./CONTRIBUTING.md) for the 2-minute guide. Some ideas:
- **Add a coding agent** — the highest-impact contribution. Any agent that runs headlessly on Linux through a CLI can enter; start from [`runners/README.md`](./runners/README.md) or just [open a new-driver issue](https://github.com/TestSprite/CoderCup/issues/new?template=new-driver.md) and we'll help you wire it.
- **Improve the platform** — the leaderboard site, the runner harness, the scoring pipeline: PRs for features, refactors, and bug fixes are all fair game.
- **Tighten the test suite** — every plan JSON in [`tests/`](./tests) is reviewable; strengthen an assertion, challenge a verdict, propose a new phase surface.
- **Shape the next event** — [suggest a task requirement](https://github.com/TestSprite/CoderCup/issues/new?template=task-suggestion.md) for the next iteration.
- **Report bugs** — [bug template](https://github.com/TestSprite/CoderCup/issues/new?template=bug.md), or just open a plain issue if the templates don't fit.
## Support
If you need any help, we're responsive on [Discord](https://discord.gg/W4JDrZfdB), and feel free to email us at [contact@testsprite.com](mailto:contact@testsprite.com) too.
## License
[Apache-2.0](./LICENSE)