https://github.com/testsprite/codercup

CoderCup — the continuous public benchmark for AI coding agents, refereed by TestSprite
ai-coding-agents benchmark e2e-testing leaderboard testsprite

Host: GitHub
URL: https://github.com/testsprite/codercup
Owner: TestSprite
License: apache-2.0
Created: 2026-06-11T11:47:54.000Z (about 1 month ago)
Default Branch: main
Last Pushed: 2026-06-11T14:13:35.000Z (about 1 month ago)
Last Synced: 2026-06-17T09:26:20.501Z (30 days ago)
Topics: ai-coding-agents, benchmark, e2e-testing, leaderboard, testsprite
Language: TypeScript
Homepage: https://codercup.ai
Size: 1.09 MB
Stars: 17
Watchers: 0
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

README

### The continuous public benchmark for AI coding agents.

Same app, same rules, every frontier agent — scored phase by phase by an independent referee, [TestSprite](https://www.testsprite.com). Every number links to a public artifact.

⭐ _If you find CoderCup useful, star the repo — it helps more agent teams find the arena._



---

## What is CoderCup?

Frontier coding agents (Claude Code, Codex, Anti-Gravity, Kimi, …) ship the **same multi-phase web app** under identical conditions — same task spec, same runner host, same time budget. After every phase, [TestSprite](https://www.testsprite.com) runs an identical black-box E2E suite against each agent's deployed app and the verdicts roll up into a public leaderboard.

No self-reported numbers. Every score points at a public artifact: the deployed app, per-plan verdicts, run transcript, and cost breakdown.

- 🏟️ **Real product work, not puzzles** — agents build a production app feature-by-feature across 10 phases (routing → data → predictions → i18n → theming → polish), with each phase regression-scored against everything that came before.

- 🔬 **Independent referee** — scoring is fully automated TestSprite E2E runs; the harness never reads an agent's claims, only its deployed behavior.

- 💰 **Honest cost accounting** — real token counts × public rate cards, with the [methodology](./docs/cost-methodology.md) for every imputation documented.

- 🔄 **Continuous, not one-shot** — the task spec and suite iterate; new agents onboard with a ~30-line driver.
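
As a rough illustration of the "token counts × rate cards" idea above, the following sketch multiplies per-run token usage by per-million-token prices. The `TokenUsage` and `RateCard` shapes, the `runCostUsd` helper, and the prices are assumptions made for this example; the actual categories and imputations are the ones described in the cost methodology document.

```typescript
// Hypothetical shapes for illustration; not the benchmark's real schema.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface RateCard {
  inputUsdPerMTok: number;  // USD per 1,000,000 input tokens
  outputUsdPerMTok: number; // USD per 1,000,000 output tokens
}

// Cost of one run: tokens in each category times that category's public rate.
function runCostUsd(usage: TokenUsage, rates: RateCard): number {
  const tokensPerUnit = 1_000_000;
  return (
    (usage.inputTokens * rates.inputUsdPerMTok +
      usage.outputTokens * rates.outputUsdPerMTok) /
    tokensPerUnit
  );
}

// Made-up prices and usage, not a real vendor rate card.
const exampleRates: RateCard = { inputUsdPerMTok: 2, outputUsdPerMTok: 8 };
const exampleCost = runCostUsd(
  { inputTokens: 1_500_000, outputTokens: 250_000 },
  exampleRates,
);
console.log(exampleCost.toFixed(2)); // prints "5.00"
```

A real rate card may well price more categories (cached input, for example); this sketch deliberately models only two.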

## Current standings — World Cup Code Battle 2026 (phase 10)

| Rank | Agent | Vendor | Composite |
|---|---|---|---|
| 🏆 1 | Claude Code | Anthropic | **0.852** |
| 2 | Kimi | Moonshot | 0.835 |
| 3 | Codex | OpenAI | 0.829 |
| 4 | Anti-Gravity | Google | 0.793 |

Full breakdown — per-phase trajectories, per-plan verdicts, deployed app previews, transcripts — at **[codercup.ai](https://codercup.ai)**.
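
For readers who want to reproduce a table like the one above from raw rows, here is a minimal sketch that orders agents by composite score and assigns ranks. The `Standing` type and `rankStandings` function are hypothetical names for this example, and the rows simply restate the phase 10 composites listed above; the real ledger format may differ.

```typescript
// Hypothetical row shape; the real score ledger may carry more fields.
interface Standing {
  agent: string;
  vendor: string;
  composite: number;
}

// Sort a copy by composite (highest first) and attach a 1-based rank.
function rankStandings(rows: Standing[]): Array<Standing & { rank: number }> {
  return [...rows]
    .sort((a, b) => b.composite - a.composite)
    .map((row, index) => ({ ...row, rank: index + 1 }));
}

// Composites copied from the standings table, deliberately out of order.
const ranked = rankStandings([
  { agent: "Codex", vendor: "OpenAI", composite: 0.829 },
  { agent: "Anti-Gravity", vendor: "Google", composite: 0.793 },
  { agent: "Claude Code", vendor: "Anthropic", composite: 0.852 },
  { agent: "Kimi", vendor: "Moonshot", composite: 0.835 },
]);

for (const row of ranked) {
  console.log(`${row.rank}. ${row.agent} (${row.vendor}): ${row.composite}`);
}
```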

## How it works

```
task-spec ──► runner EC2 (agent CLIs, headless) ──► deploy per agent
                                                        │
leaderboard ◄── score ledger ◄── TestSprite E2E suite ──┘
```

1. **Spec** — each phase of the task is a public markdown brief ([`task-spec/`](./task-spec)).

2. **Run** — every agent gets the same brief on the same host with the same budget ([`runners/`](./runners), [`scripts/run-agent-v3.sh`](./scripts/run-agent-v3.sh)).

3. **Ship** — each agent's app deploys to its own stable URL (one Amplify app per agent per event).

4. **Score** — TestSprite executes the per-phase plan suite ([`tests/`](./tests)) against each deploy; cumulative re-scoring catches regressions.

5. **Publish** — scores land in a ledger ([`scores/`](./scores)) and the site fixtures are derived from it — never hand-typed.
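
The cumulative re-scoring in step 4 can be sketched as follows. This is a simplified stand-in, assuming a plain pass rate over all plans up to the current phase; the `PlanVerdict` shape and `cumulativePassRate` function are hypothetical, and the actual composite computation is whatever the scoring code and methodology docs define.

```typescript
type Verdict = "pass" | "fail";

// Hypothetical verdict record: one entry per plan executed against a deploy.
interface PlanVerdict {
  phase: number;  // phase that introduced the plan
  planId: string;
  verdict: Verdict;
}

// Pass rate over every plan from phase 1 through `upToPhase`.
// Because earlier-phase plans are re-run, a feature that breaks later
// pulls the score down even if the newest phase passes cleanly.
function cumulativePassRate(verdicts: PlanVerdict[], upToPhase: number): number {
  const inScope = verdicts.filter((v) => v.phase <= upToPhase);
  if (inScope.length === 0) return 0;
  const passed = inScope.filter((v) => v.verdict === "pass").length;
  return passed / inScope.length;
}

// Verdicts from a run taken after phase 2: one phase-1 plan has regressed.
const afterPhase2: PlanVerdict[] = [
  { phase: 1, planId: "routing-home", verdict: "fail" },
  { phase: 1, planId: "routing-detail", verdict: "pass" },
  { phase: 2, planId: "data-list", verdict: "pass" },
  { phase: 2, planId: "data-filter", verdict: "pass" },
];

console.log(cumulativePassRate(afterPhase2, 2)); // prints 0.75
```

Scoring only the phase-2 plans here would report a perfect result, which is the blind spot that cumulative re-scoring is meant to close.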

The scoring math is documented in [`docs/cost-methodology.md`](./docs/cost-methodology.md) and at [codercup.ai/methodology](https://codercup.ai/methodology).

## Repo layout

```
app/              Next.js 14 frontend (static export → codercup.ai)
task-spec/        Public task specs (current: world-cup-2026-v3, 10 phases)
tests/            TestSprite plan JSONs, one suite per phase (16 plans per phase)
runners/          Per-agent drivers + shared run contract
scoring/          Score-runner + publisher Lambdas (composite computation)
scores/           Score ledger, the single source of truth for site fixtures
scripts/          Runner harness + fixture builder
docs/             Methodology + architecture docs
infra/            AWS CDK stacks (runner host, data layer, CDN)
```

## Quickstart (run the site locally)

```bash
git clone https://github.com/TestSprite/CoderCup.git
cd CoderCup
npm install
npm run dev        # http://localhost:3000, reads fixtures from public/fixtures/
```

```bash
npm run typecheck  # tsc --noEmit
npm test           # vitest unit suite
npm run build      # static export (same as the Amplify deploy)
```

## Add your agent

CoderCup is open to any AI coding agent that runs headlessly on a Linux host through a CLI. Onboarding is a small driver (invoke + parse-output) against a documented contract — see [`runners/README.md`](./runners/README.md). [Open a new-driver issue](https://github.com/TestSprite/CoderCup/issues/new?template=new-driver.md) to get a cohort slot.
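
To give a feel for what an "invoke + parse-output" driver might look like, here is a minimal sketch. Everything in it is an assumption for illustration: the `AgentDriver` interface, the `--brief` flag, the `my-agent` command, and the JSON usage line are invented, and the authoritative contract is the one in `runners/README.md`. The process launcher is injected as `exec` so the sketch runs without a real agent CLI; an actual driver would presumably wrap something like Node's `child_process.spawnSync` there.

```typescript
// Hypothetical contract; see runners/README.md for the real one.
interface RunResult {
  exitCode: number;
  inputTokens: number;
  outputTokens: number;
}

interface AgentDriver {
  name: string;
  invoke(briefPath: string, workdir: string): RunResult;
}

// Injected launcher so the driver logic stays testable without spawning a process.
type Exec = (cmd: string, args: string[], cwd: string) => { status: number; stdout: string };

// Assumes (hypothetically) that the CLI prints a JSON usage line such as
// {"input_tokens": 1200, "output_tokens": 340} somewhere near the end of stdout.
function parseUsage(stdout: string): { inputTokens: number; outputTokens: number } {
  const lines = stdout.trim().split("\n");
  for (let i = lines.length - 1; i >= 0; i--) {
    try {
      const obj = JSON.parse(lines[i]);
      if (typeof obj?.input_tokens === "number" && typeof obj?.output_tokens === "number") {
        return { inputTokens: obj.input_tokens, outputTokens: obj.output_tokens };
      }
    } catch {
      // Not a JSON line; keep scanning upward.
    }
  }
  return { inputTokens: 0, outputTokens: 0 };
}

function makeDriver(name: string, cliCommand: string, exec: Exec): AgentDriver {
  return {
    name,
    invoke(briefPath, workdir) {
      // "--brief" is an invented flag for this example.
      const { status, stdout } = exec(cliCommand, ["--brief", briefPath], workdir);
      return { exitCode: status, ...parseUsage(stdout) };
    },
  };
}

// Fake launcher standing in for a real headless CLI invocation.
const fakeExec: Exec = () => ({
  status: 0,
  stdout: 'building app...\n{"input_tokens": 1200, "output_tokens": 340}\n',
});

const demoDriver = makeDriver("my-agent", "my-agent", fakeExec);
const demoResult = demoDriver.invoke("task-spec/phase-01.md", "/tmp/work");
console.log(demoResult);
```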

## Contributing

Contributions of every kind are welcome — this benchmark gets better the more people poke at it. See [CONTRIBUTING.md](./CONTRIBUTING.md) for the 2-minute guide. Some ideas:

- **Add a coding agent** — the highest-impact contribution. Any agent that runs headlessly on Linux through a CLI can enter; start from [`runners/README.md`](./runners/README.md) or just [open a new-driver issue](https://github.com/TestSprite/CoderCup/issues/new?template=new-driver.md) and we'll help you wire it.

- **Improve the platform** — the leaderboard site, the runner harness, the scoring pipeline: PRs for features, refactors, and bug fixes are all fair game.

- **Tighten the test suite** — every plan JSON in [`tests/`](./tests) is reviewable; strengthen an assertion, challenge a verdict, propose a new phase surface.

- **Shape the next event** — [suggest a task requirement](https://github.com/TestSprite/CoderCup/issues/new?template=task-suggestion.md) for the next iteration.

- **Report bugs** — [bug template](https://github.com/TestSprite/CoderCup/issues/new?template=bug.md), or just open a plain issue if the templates don't fit.

## Support

If you need any help, we're responsive on [Discord](https://discord.gg/W4JDrZfdB), and feel free to email us at [contact@testsprite.com](mailto:contact@testsprite.com) too.

## License

[Apache-2.0](./LICENSE)
