https://github.com/sst/opencode-bench
- Host: GitHub
- URL: https://github.com/sst/opencode-bench
- Owner: sst
- Created: 2025-10-11T09:25:10.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-10-31T18:53:35.000Z (2 months ago)
- Last Synced: 2025-10-31T20:10:17.892Z (2 months ago)
- Language: TypeScript
- Size: 264 KB
- Stars: 14
- Watchers: 0
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Agents: AGENTS.md
README
> opencode bench
A benchmarking framework for evaluating opencode's AI coding agents across real-world GitHub repositories. The framework runs agents against target repositories and scores their outputs using multiple LLM judges, measuring code quality across dimensions like readability, functionality, adherence to best practices, and efficiency.
```bash
orvl opencode --model opencode/gpt-5-codex --eval noworneverev/graphrag-visualizer
orvl opencode --model opencode/claude-sonnet-4-5 --eval prismicio-community/course-fizzi-next --output results.json
```
Both `--model` and `--eval` are required; the CLI now runs a single agent/model/eval pairing at a time. Each invocation executes three isolated `[episode X/3]` runs (fresh clones) and aggregates the judge scores before exporting results.
## Setup
```bash
bun install
bun run build
```
During development the CLI can be executed directly with Bun:
```bash
bun run dev -- --model <model> --eval <owner/repo>
```
## Continuous Releases
Install the [pkg.pr.new GitHub App](https://github.com/apps/pkg-pr-new) on your repository to enable preview packages for every push or pull request. The workflow in `.github/workflows/pkg-pr-new.yml` installs dependencies with Bun, builds the project, and runs `bunx pkg-pr-new publish` to publish previews automatically.
## Scores
A score is a function that returns a value between 0 and 1, together with a short rationale.
`scores/ui.ts`
```typescript
export default createScore(() => {
// here's where the judge would operate and give a score
// ...
return {
score: 0.43,
rationale: "Baseline UI rationale"
}
})
```
`scores/code-quality.ts`
```typescript
export default createScore(() => {
// ...
return {
score: 0.12,
rationale: "Baseline code quality rationale"
}
})
```
A worked example of how the judges' scores are aggregated (weighted means plus a disagreement penalty; see "Scoring Methodology" below):
```js
// --- setup --------------------------------------------------
// Assessors and their weights
const assessors = ["Claude", "GPT", "Kimi"];
const w = [0.5, 0.3, 0.2]; // must sum to 1
// Score types and their weights
const scoreTypes = ["readability", "cases", "bugs"];
const v = [0.4, 0.3, 0.3]; // must sum to 1
// Scores matrix S[i][j] = score from assessor i on score type j
const S = [
[0.80, 0.60, 0.70], // Claude
[0.90, 0.70, 0.60], // GPT
[0.70, 0.50, 0.80], // Kimi
];
// --- functions ---------------------------------------------
// weighted mean for a single score type j
function meanForScoreType(j) {
return S.reduce((acc, row, i) => acc + w[i] * row[j], 0);
}
// weighted variance for a single score type j
function varianceForScoreType(j) {
const mean = meanForScoreType(j);
return S.reduce((acc, row, i) => acc + w[i] * (row[j] - mean) ** 2, 0);
}
// --- compute ------------------------------------------------
const means = scoreTypes.map((_, j) => meanForScoreType(j));
const R = scoreTypes.reduce((acc, _, j) => acc + v[j] * means[j], 0);
// disagreement penalty
const variances = scoreTypes.map((_, j) => varianceForScoreType(j));
const lambda = 0.5;
const R_pen = R - lambda * variances.reduce((acc, varj, j) => acc + v[j] * varj, 0);
// --- output -------------------------------------------------
console.log("Per-score-type means:", means);
console.log("Overall R:", R.toFixed(3));
console.log("Per-score-type variances:", variances);
console.log("Penalized R_pen:", R_pen.toFixed(3));
```
```
Per-score-type means: [ 0.81, 0.61, 0.69 ]
Overall R: 0.714
Per-score-type variances: [ 0.005, 0.005, 0.005 ]
Penalized R_pen: 0.712
```
#### Judges
Potential score types to evaluate across the three judges:
- UI
  - functionality (computer-use models? Playwright access?)
  - UX (similar to functionality)
- code readability
- adherence to best practices and project configs
  - respecting AGENTS.md, CLAUDE.md, ...
  - `.eslintrc` / `.prettierrc` / ...
- token consumption, speed, number of tool calls
  - do we incentivize everyone to make fewer tool calls, or more? Maybe we should remove it; just a thought.
  - the fewer tokens and the faster the agent is, the better.
  - this score does not need an LLM judge (a sketch follows this list).
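A minimal sketch of a deterministic efficiency score, reusing the `createScore` shape from the examples above. The `run` argument and its `tokens`, `durationMs`, and `toolCalls` fields are assumptions for illustration; the real framework may expose run metadata differently (or not at all yet).
```typescript
// scores/efficiency.ts (sketch) -- no LLM judge required.
// createScore is assumed to be in scope as in the other score examples;
// the RunStats shape below is hypothetical.
interface RunStats {
  tokens: number      // total tokens consumed by the agent
  durationMs: number  // wall-clock time for the episode
  toolCalls: number   // number of tool invocations
}

// Map usage onto [0, 1]: at or under `budget` scores 1, 2x the budget or worse scores 0.
const withinBudget = (used: number, budget: number) =>
  Math.max(0, Math.min(1, 2 - used / budget))

export default createScore((run: RunStats) => {
  // Hypothetical budgets; these would need tuning per eval repository.
  const score =
    0.5 * withinBudget(run.tokens, 150_000) +
    0.3 * withinBudget(run.durationMs, 10 * 60_000) +
    0.2 * withinBudget(run.toolCalls, 50)
  return {
    score,
    rationale: `tokens=${run.tokens}, duration=${run.durationMs}ms, toolCalls=${run.toolCalls}`,
  }
})
```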
## TODO
- [ ] Stabilize scoring by replacing flaky LLM judges for logic-equivalence, integration-points, test-coverage, and checks with deterministic analysis (see `benchmark-observations.md` for details).
## Agents
`agents/opencode.ts`
```typescript
export const models = ["openai/gpt-4o", "anthropic/claude-sonnet-4"] // useful for assertions and matrix testing
export default createAgent((model, prompt) => {
void prompt
return `opencode run -m ${model}`
})
```
### Dummy agents
To test the benchmark itself, we can use dummy agents and measure how the judges behave on their outputs.
`agents/dummy-bad.ts`
```typescript
export const models = ["openai/gpt-4o", "anthropic/claude-sonnet-4"] // useful for assertions and matrix testing
export default createAgent((model, prompt) => {
// fs.writeFile to write dummy files
return `echo ...`
})
```
The gap between this agent's scores and those of `agents/dummy-good.ts` should be large, validating that the judges produce _fair_ scores; a sketch of such a check follows.
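As a sketch, assuming a hypothetical `runAgentThroughJudges` helper that clones an eval repo, runs the given agent, and returns the penalized aggregate score, the check could look like this:
```typescript
// tests/dummy-agents.test.ts (sketch)
import { expect, test } from "bun:test"
// Hypothetical helper: clone the eval repo, run the agent, return the aggregate score.
import { runAgentThroughJudges } from "./helpers"

test("judges separate dummy-good from dummy-bad", async () => {
  const repo = "noworneverev/graphrag-visualizer"
  const bad = await runAgentThroughJudges("agents/dummy-bad.ts", repo)
  const good = await runAgentThroughJudges("agents/dummy-good.ts", repo)

  // The exact threshold is arbitrary; the point is a clearly visible gap.
  expect(good - bad).toBeGreaterThan(0.3)
})
```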
## Scoring Methodology
All current scores are produced by LLM judges (`claude-4.5`, `gpt-5-codex`, `kimi`). For each assignment we gather their outputs into a matrix \(S \in [0,1]^{m \times k}\), where rows index judges and columns index score types. Given judge weights \(w \in \Delta^{m-1}\) (currently uniform) and assignment weights \(v \in \Delta^{k-1}\), the base score is
\[
R = v^\top S^\top w = \sum_{j=1}^k v_j \left( \sum_{i=1}^m w_i s_{ij} \right).
\]
To discourage disagreement we subtract a variance penalty (see `lib/utils/scoreAggregation.ts`):
\[
R_{\text{pen}} = R - \lambda \sum_{j=1}^k v_j \operatorname{Var}_j, \qquad \operatorname{Var}_j = \sum_{i=1}^m w_i (s_{ij} - \bar{s}_j)^2, \quad \bar{s}_j = \sum_{i=1}^m w_i s_{ij}.
\]
The tests in `tests/scoreAggregation.test.ts` exercise this aggregation. The TODO above tracks the plan to replace noisy LLM scorers with deterministic checks while keeping the same aggregation pipeline.
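A compact sketch of this aggregation in TypeScript (not necessarily the exact shape of `lib/utils/scoreAggregation.ts`):
```typescript
// Penalized aggregation: R_pen = R - lambda * sum_j v_j * Var_j
// S[i][j] = score from judge i on score type j; w = judge weights; v = score-type weights.
export function aggregateScores(
  S: number[][],
  w: number[],
  v: number[],
  lambda = 0.5,
): { R: number; R_pen: number } {
  const k = v.length
  // Weighted mean of each score type across judges.
  const means = Array.from({ length: k }, (_, j) =>
    S.reduce((acc, row, i) => acc + w[i] * row[j], 0),
  )
  // Weighted variance per score type (judge disagreement).
  const variances = Array.from({ length: k }, (_, j) =>
    S.reduce((acc, row, i) => acc + w[i] * (row[j] - means[j]) ** 2, 0),
  )
  const R = v.reduce((acc, vj, j) => acc + vj * means[j], 0)
  const R_pen = R - lambda * v.reduce((acc, vj, j) => acc + vj * variances[j], 0)
  return { R, R_pen }
}
```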
Stars Leaderboard

| Rank | Repo | Stars | Forks |
|------|------|-------|-------|
| 1 | noworneverev/graphrag-visualizer | 375 | 46 |
| 2 | KwokKwok/Silo | 240 | 25 |
| 3 | prismicio-community/course-fizzi-next | 180 | 77 |
| 4 | mylofi/local-vault | 118 | 3 |
| 5 | Rasalas/msg-reader | 74 | 14 |
| 6 | halitsever/nest-cloudflare-turnstile | 62 | 16 |
| 7 | psyko-gh/overcrawlrr | 60 | 1 |
| 8 | googleworkspace/drive-picker-element | 46 | 6 |
| 9 | pbstar/fitview | 37 | 0 |
| 10 | ekoln/nextdaily | 33 | 20 |

Forks Leaderboard

| Rank | Repo | Stars | Forks |
|------|------|-------|-------|
| 1 | prismicio-community/course-fizzi-next | 180 | 77 |
| 2 | noworneverev/graphrag-visualizer | 375 | 46 |
| 3 | KwokKwok/Silo | 240 | 25 |
| 4 | Cefalo/quick-meet | 32 | 22 |
| 5 | ekoln/nextdaily | 33 | 20 |
| 6 | halitsever/nest-cloudflare-turnstile | 62 | 16 |
| 7 | BhuwanSKumar/refrain-addiction-main | 11 | 16 |
| 8 | Rasalas/msg-reader | 74 | 14 |
| 9 | AlaminPu1007/algorithm-visualizer | 22 | 7 |
| 10 | mohitchandel/AI-APP-Template | 12 | 7 |