{"id":32789551,"url":"https://github.com/sst/opencode-bench","last_synced_at":"2025-11-05T11:01:41.182Z","repository":{"id":320267713,"uuid":"1074172137","full_name":"sst/opencode-bench","owner":"sst","description":null,"archived":false,"fork":false,"pushed_at":"2025-10-31T18:53:35.000Z","size":270,"stargazers_count":14,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-31T20:10:17.892Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sst.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2025-10-11T09:25:10.000Z","updated_at":"2025-10-31T18:53:37.000Z","dependencies_parsed_at":"2025-10-22T22:18:15.264Z","dependency_job_id":"dcf45687-9c3e-4332-afb4-593efa03898c","html_url":"https://github.com/sst/opencode-bench","commit_stats":null,"previous_names":["sst/opencode-bench"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sst/opencode-bench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sst%2Fopencode-bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sst%2Fopencode-bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sst%2Fopencode-bench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sst%2Fopencode-bench/manifests","owner_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sst","download_url":"https://codeload.github.com/sst/opencode-bench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sst%2Fopencode-bench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":282807270,"owners_count":26730414,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-05T02:00:05.946Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-11-05T11:00:41.943Z","updated_at":"2025-11-05T11:01:41.176Z","avatar_url":"https://github.com/sst.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003e opencode bench\n\nA benchmarking framework for evaluating opencode's AI coding agents across real-world GitHub repositories. The framework runs agents against target repositories and scores their outputs using multiple LLM judges, measuring code quality across dimensions like readability, functionality, adherence to best practices, and efficiency.\n\n```bash\norvl opencode --model opencode/gpt-5-codex --eval noworneverev/graphrag-visualizer\norvl opencode --model opencode/claude-sonnet-4-5 --eval prismicio-community/course-fizzi-next --output results.json\n```\n\nBoth `--model` and `--eval` are required; the CLI now runs a single agent/model/eval pairing at a time. 
Each invocation executes three isolated `[episode X/3]` runs (fresh clones) and aggregates the judge scores before exporting results.\n\n## Setup\n```bash\nbun install\nbun run build\n```\n\nDuring development the CLI can be executed directly with Bun:\n\n```bash\nbun run dev -- \u003cagent\u003e --model \u003cmodel\u003e --eval \u003cowner/name\u003e\n```\n\n## Continuous Releases\nInstall the [pkg.pr.new GitHub App](https://github.com/apps/pkg-pr-new) on your repository to enable preview packages for every push or pull request. The workflow in `.github/workflows/pkg-pr-new.yml` installs dependencies with Bun, builds the project, and runs `bunx pkg-pr-new publish` to publish previews automatically.\n\n## Scores\nA score is a function that returns a value between 0 and 1, along with a rationale.\n\n`scores/ui.ts`\n```typescript\nexport default createScore(() =\u003e {\n\t// here's where the judge would operate and give a score\n\t// ...\n\treturn {\n\t\tscore: 0.43,\n\t\trationale: \"Baseline UI rationale\"\n\t}\n})\n```\n\n## TODO\n- [ ] Stabilize scoring by replacing flaky LLM judges for logic-equivalence, integration-points, test-coverage, and checks with deterministic analysis (see `benchmark-observations.md` for details).\n\n`scores/code-quality.ts`\n```typescript\nexport default createScore(() =\u003e {\n\t// ...\n\treturn {\n\t\tscore: 0.12,\n\t\trationale: \"Baseline code quality rationale\"\n\t}\n})\n```\n\n```typescript\n// --- setup --------------------------------------------------\n\n// Assessors and their weights\nconst assessors = [\"Claude\", \"GPT\", \"Kimi\"];\nconst w = [0.5, 0.3, 0.2]; // must sum to 1\n\n// Score types and their weights\nconst scoreTypes = [\"readability\", \"cases\", \"bugs\"];\nconst v = [0.4, 0.3, 0.3]; // must sum to 1\n\n// Scores matrix S[i][j] = score from assessor i on score type j\nconst S = [\n  [0.80, 0.60, 0.70], // Claude\n  [0.90, 0.70, 0.60], // GPT\n  [0.70, 0.50, 0.80], // Kimi\n];\n\n// --- functions ---------------------------------------------\n\n// 
weighted mean for a single score type j\nfunction meanForScoreType(j) {\n  return S.reduce((acc, row, i) =\u003e acc + w[i] * row[j], 0);\n}\n\n// weighted variance for a single score type j\nfunction varianceForScoreType(j) {\n  const mean = meanForScoreType(j);\n  return S.reduce((acc, row, i) =\u003e acc + w[i] * (row[j] - mean) ** 2, 0);\n}\n\n// --- compute ------------------------------------------------\n\nconst means = scoreTypes.map((_, j) =\u003e meanForScoreType(j));\nconst R = scoreTypes.reduce((acc, _, j) =\u003e acc + v[j] * means[j], 0);\n\n// disagreement penalty\nconst variances = scoreTypes.map((_, j) =\u003e varianceForScoreType(j));\nconst lambda = 0.5;\nconst R_pen = R - lambda * variances.reduce((acc, varj, j) =\u003e acc + v[j] * varj, 0);\n\n// --- output -------------------------------------------------\nconsole.log(\"Per-score-type means:\", means);\nconsole.log(\"Overall R:\", R.toFixed(3));\nconsole.log(\"Per-score-type variances:\", variances);\nconsole.log(\"Penalized R_pen:\", R_pen.toFixed(3));\n```\n\n```\nPer-score-type means: [ 0.81, 0.61, 0.69 ]\nOverall R: 0.714\nPer-score-type variances: [ 0.005, 0.005, 0.005 ]\nPenalized R_pen: 0.712\n```\n\n\n#### Judges\nPotential scores across three judges.\n\n- UI\n- functionality (computer-use models? playwright access?)\n- UX (similar to functionality)\n- code readability\n- adherence to best practices and project configs\n\t- respecting AGENTS.md, CLAUDE.md, ...\n\t- `.eslintrc` / `.prettierrc` / ...\n- token consumption, speed, tool calls number\n\t- do we incentivize everyone to do less tool calls? or more? 
maybe we should remove it, just a thought.\n\t- the fewer tokens and the faster the agent is, the better.\n\t- this score does not need an LLM judge.\n\n## Agents\n\n`agents/opencode.ts`\n```typescript\nexport const models = [\"openai/gpt-4o\", \"anthropic/claude-sonnet-4\"] // useful for assertions and matrix testing\n\nexport default createAgent((model, prompt) =\u003e {\n\tvoid prompt\n\treturn `opencode run -m ${model}`\n})\n```\n\n### Dummy agents\n\nTo test the benchmark itself, we can use a dummy agent and measure how the judges behave on its outputs.\n\n`agents/dummy-bad.ts`\n```typescript\nexport const models = [\"openai/gpt-4o\", \"anthropic/claude-sonnet-4\"] // useful for assertions and matrix testing\n\nexport default createAgent((model, prompt) =\u003e {\n\t// fs.writeFile to write dummy files\n\treturn `echo ...`\n})\n```\n\nthe variance between this dummy and `agents/dummy-good.ts` should be high to validate that the judges produce _fair_ scores.\n\n## Scoring Methodology\n\nAll current scores are produced by LLM judges (`claude-4.5`, `gpt-5-codex`, `kimi`). For each assignment we gather their outputs into a matrix \\(S \\in [0,1]^{m \\times k}\\), where rows index judges and columns index score types. Given judge weights \\(w \\in \\Delta^{m-1}\\) (currently uniform) and assignment weights \\(v \\in \\Delta^{k-1}\\), the base score is\n\n\\[\nR = v^\\top S^\\top w = \\sum_{j=1}^k v_j \\left( \\sum_{i=1}^m w_i s_{ij} \\right).\n\\]\n\nTo discourage disagreement we subtract a variance penalty (see `lib/utils/scoreAggregation.ts`):\n\n\\[\nR_{\\text{pen}} = R - \\lambda \\sum_{j=1}^k v_j \\operatorname{Var}_j, \\qquad \\operatorname{Var}_j = \\sum_{i=1}^m w_i (s_{ij} - \\bar{s}_j)^2, \\quad \\bar{s}_j = \\sum_{i=1}^m w_i s_{ij}.\n\\]\n\nThe tests in `tests/scoreAggregation.test.ts` exercise this aggregation. 
The TODO above tracks the plan to replace noisy LLM scorers with deterministic checks while keeping the same aggregation pipeline.\n\n\n  rank  repo                                      stars  forks\n  1     noworneverev/graphrag-visualizer           375     46\n  2     KwokKwok/Silo                              240     25\n  3     prismicio-community/course-fizzi-next      180     77\n  4     mylofi/local-vault                         118      3\n  5     Rasalas/msg-reader                          74     14\n  6     halitsever/nest-cloudflare-turnstile        62     16\n  7     psyko-gh/overcrawlrr                        60      1\n  8     googleworkspace/drive-picker-element        46      6\n  9     pbstar/fitview                              37      0\n  10    ekoln/nextdaily                             33     20\n\n  Forks Leaderboard\n\n  rank  repo                                      stars  forks\n  1     prismicio-community/course-fizzi-next      180     77\n  2     noworneverev/graphrag-visualizer           375     46\n  3     KwokKwok/Silo                              240     25\n  4     Cefalo/quick-meet                           32     22\n  5     ekoln/nextdaily                             33     20\n  6     halitsever/nest-cloudflare-turnstile        62     16\n  7     BhuwanSKumar/refrain-addiction-main         11     16\n  8     Rasalas/msg-reader                          74     14\n  9     AlaminPu1007/algorithm-visualizer           22      7\n  10    mohitchandel/AI-APP-Template                12      7\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsst%2Fopencode-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsst%2Fopencode-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsst%2Fopencode-bench/lists"}