{"id":49281837,"url":"https://github.com/beowolve/clawbattle","last_synced_at":"2026-04-25T19:01:28.132Z","repository":{"id":353613674,"uuid":"1204304997","full_name":"Beowolve/ClawBattle","owner":"Beowolve","description":"AI CSS Battle Benchmark that measures how well LLMs can reproduce pixel-perfect CSS targets.","archived":false,"fork":false,"pushed_at":"2026-04-24T16:45:12.000Z","size":4052,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-24T18:23:44.143Z","etag":null,"topics":["benchmark","cssbattle","llm"],"latest_commit_sha":null,"homepage":"https://beowolve.github.io/ClawBattle/","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Beowolve.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-04-07T22:16:19.000Z","updated_at":"2026-04-24T16:45:15.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/Beowolve/ClawBattle","commit_stats":null,"previous_names":["beowolve/clawbattle"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/Beowolve/ClawBattle","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Beowolve%2FClawBattle","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Beowolve%2FClawBattle/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Beowolve%2FClawBattle/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Beowolve%2FClawBattle/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Beowolve","download_url":"https://codeload.github.com/Beowolve/ClawBattle/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Beowolve%2FClawBattle/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32273223,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-25T18:29:39.964Z","status":"ssl_error","status_checked_at":"2026-04-25T18:29:32.149Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","cssbattle","llm"],"created_at":"2026-04-25T19:01:27.216Z","updated_at":"2026-04-25T19:01:28.117Z","avatar_url":"https://github.com/Beowolve.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ClawBattle\n**AI CSS Battle Benchmark**\n\nMeasures how well LLMs can reproduce pixel-perfect CSS targets from [CSS Battle](https://cssbattle.dev). Run multiple models against the same targets and compare scores, match rates, and cost on the dashboard.\n\n## Prerequisites\n\n- [Docker Desktop](https://www.docker.com/products/docker-desktop) (running, Linux containers mode)\n- API key for at least one provider (OpenRouter, OpenAI, or Ollama)\n\n## Quick Start\n\n```bash\ncp .env.example .env\n# Add your API key(s) to .env\n\nnpm run dev\n```\n\nOpen `http://localhost:5173` for the dashboard.\n\n## Running a Benchmark\n\nThe easiest way is the **+ Run** tab in the dashboard — pick a model, provider, and hit Start. You can launch multiple runs in parallel, and the model field autocompletes previously used models filtered by the selected provider.\n\nAlternatively via CLI:\n\n```bash\ndocker compose run runner \\\n  --model openai/gpt-4o \\\n  --provider openrouter \\\n  --attempts 3\n```\n\nCLI options:\n\n| Flag | Default | Description |\n|------|---------|-------------|\n| `--model` | — | Model ID (required), e.g. `openai/gpt-4o` |\n| `--provider` | `openrouter` | `openrouter` \\| `openai` \\| `ollama` |\n| `--targets` | `battle` | `battle` \\| `daily` |\n| `--target-id` | — | Run a single target by ID |\n| `--attempts` | `3` | Attempts per target (best score counts) |\n| `--prompt` | `v1`* | Prompt version (`v1`, `v2`, …) |\n| `--concurrency` | `1` | Run N targets in parallel |\n| `--retries` | `0` | Retry a target if all attempts error |\n| `--reasoning` | `default` | Provider/model-specific reasoning effort from `config/model-reasoning.json`; `default` sends no reasoning parameter |\n\n*Set `PROMPT_VERSION=v2` in `.env` to change the default.\n\nResume and target-range controls are available in the dashboard (+ Run tab).\n\n## OpenRouter Provider Forcing\n\nYou can force OpenRouter provider routing per model via a local config file:\n\n```bash\ncp config/openrouter.providers.example.json config/openrouter.providers.json\n```\n\nDefault lookup path: `./config/openrouter.providers.json`  \nOptional override: `OPENROUTER_PROVIDER_CONFIG_PATH=...`\n\nConfig shape:\n\n```json\n{\n  \"modelProviderOverrides\": {\n    \"openai/gpt-5-mini\": \"openai\",\n    \"moonshotai/*\": \"io.net\",\n    \"anthropic/claude-3.7-sonnet\": {\n      \"order\": [\"anthropic\"],\n      \"allow_fallbacks\": false\n    }\n  }\n}\n```\n\nRules:\n- `\"\u003cprovider\u003e\"` forces a single provider (`allow_fallbacks: false`).\n- `[\"a\", \"b\"]` sets provider order (`allow_fallbacks: false`).\n- `{ ... }` passes a raw OpenRouter `provider` object through unchanged.\n- `\"vendor/*\"` applies to all models with that prefix (for example `moonshotai/*`).\n- Matching priority is: exact model \u003e longest `vendor/*` prefix.\n\n## How it Works\n\n1. The model receives the target image + canvas size + colors as context\n2. It generates an HTML/CSS solution (no JS, SVG, or external resources)\n3. The solution is rendered in headless Chromium at the exact canvas size (Quirks Mode)\n4. The render is pixel-diffed against the target using pixelmatch (threshold 0.01)\n5. A score is calculated from pixel match rate and code length\n\n## Scoring\n\nScore formula (CSS Battle): `399.99725 × 0.9905144^charCount + 599.9987`\n\nFor imperfect matches the score is multiplied by `match³`:\n\n| Match | Multiplier |\n|-------|-----------|\n| 100 % | 1.000× — full score |\n| 99 %  | 0.970× |\n| 95 %  | 0.857× |\n| 80 %  | 0.512× |\n| 50 %  | 0.125× |\n\nColor accuracy matters far more than code length. Only 100 % pixel matches count as perfect.\n\n## Project Structure\n\n```\npackages/\n  core/        Renderer (Puppeteer) + Scorer (pixelmatch) + LLM adapters\n  runner/      CLI benchmark orchestrator\n  api/         Express REST API + SSE progress stream\n  dashboard/   React + Vite dashboard (local + public build)\n  db/          SQLite adapter (built-in node:sqlite) + Supabase sync\ntargets/\n  images/      PNG reference images (battle + daily)\n  definitions/ Target metadata (colors, dimensions)\nbaselines/\n  human.json   Human expert top scores (reference baseline)\n  human_stats.json  Enriched per-target human leaderboard stats\nprompts/\n  v1/          Original benchmark prompt\n  v2/          Improved prompt (better color accuracy guidance)\nscripts/\n  upload-results.js       Upload local SQLite results → Supabase (done rows only)\n  download-results.js     Download results Supabase → local SQLite\n  audit-reasoning-runs.js Report invalid legacy reasoning_effort groups; --apply fixes safe cases\n  upload-targets.js       Seed battle/daily targets in Supabase\n  sync-targets.js         Sync target definitions + images from Supabase\n  export-human-stats.js   Export compact human baseline stats from Supabase leaderboard rows\n  recalculate-scores.js   Recompute match% + scores for all stored runs\n```\n\n## Supabase Sync\n\nResults can be synced bidirectionally between local SQLite and Supabase via the **⇅ Sync** tab or CLI scripts:\n\n```bash\nnpm run upload          # local SQLite → Supabase (only rows with status='done')\nnpm run download        # Supabase → local SQLite\nnpm run upload-targets  # seed battle_targets / daily_targets in Supabase\nnpm run sync            # sync targets + images from Supabase locally\nnpm run export-human-stats  # export baselines/human_stats.json from Supabase leaderboard rows\n```\n\n### Export Human Baseline Stats\n\nGenerate `baselines/human_stats.json` from a Supabase leaderboard relation:\n\n```bash\nnpm run export-human-stats\n```\n\nOptional overrides:\n\n```bash\nnode --env-file=.env scripts/export-human-stats.js \\\n  --source=battle_target_leaderboard_current_entries \\\n  --output=baselines/human_stats.json \\\n  --max-per-target=100\n```\n\nThe export stores `top1`, `top10Avg`, `rank100`, `p50`, and `p90` as paired\n`score + charCount` values per target. The local leaderboard can optionally\nshow synthetic human comparison rows (`human/top1`, `human/top10`,\n`human/rank100`, `human/expert-player`, `human/avg-player`) via the\n**Human Scores** checkbox. These rows are aggregated over battle targets\n`1..MAX(target_id)` from the current local benchmark scope, so partial runs\ncompare against the same target range.\n\nQueue state (pending / running / waiting / paused / error attempts) never\nleaves the local process — only completed `done` rows are synced.\n\nConfigure `SUPABASE_RESULTS_URL` and `SUPABASE_RESULTS_KEY` in `.env`. Run `packages/db/schema.sql` once in your Supabase project to set up the schema.\n\n## Public Dashboard\n\nA read-only public variant (Leaderboard, Targets, Insights, About) is automatically built and deployed to **GitHub Pages** on every version tag (`v*.*.*`).\n\nTo trigger a deployment, push a tag:\n\n```bash\ngit tag v1.0.0\ngit push --tags\n```\n\nRequired GitHub Secrets: `VITE_SUPABASE_URL`, `VITE_SUPABASE_ANON_KEY`.\nGitHub Pages source must be set to **GitHub Actions** (repo Settings → Pages).\n\nTo build locally:\n\n```bash\ncd packages/dashboard\n# Add to .env.public.local:\n#   VITE_SUPABASE_URL=https://xxx.supabase.co\n#   VITE_SUPABASE_ANON_KEY=eyJ...\nnpm run build:public   # output → dist-public/\n```\n\n## Run System\n\nThe benchmark runner is built around a single table (`runs`) that doubles as\na persistent attempt queue. Every `(run_id, target_id, attempt)` combination\nis pre-inserted before any work starts and moves through these statuses:\n\n`waiting` → `pending` → `running` → `done` | `error` | `paused`\n\n- **`waiting`** — follow-up attempt (n ≥ 2), blocked until the previous\n  attempt for the same target finishes `done`.\n- **`pending`** — claim-ready. A worker may pick it up.\n- **`running`** — claimed by a worker; protected with a `claim_token` so a\n  pause or re-claim can't be overwritten by a stale worker.\n- **`done`** — complete. Only `done` rows appear in the leaderboard\n  (grouped by `model + reasoning_effort`), model-level insights, the History\n  view and the Supabase upload.\n- **`error`** — non-terminal. The row stays visible in the Queue view with\n  a Retry button per attempt, plus Reset-all-errors per run.\n- **`paused`** — set by Cancel. The original status is saved in\n  `paused_from` so Resume restores the row exactly.\n\nA `runs_summary` view aggregates per-run status with priority\n`paused \u003e running \u003e error \u003e queued \u003e done` and powers the Queue / History\nsplit in the dashboard.\n\nWorkers claim the next `pending` row atomically with `BEGIN IMMEDIATE` +\n`UPDATE ... RETURNING`. Ordering is FIFO over `(enqueued_at, id)` across\nall runs, so a resumed run re-enters at the back of the queue. Each active\nworker pool is run-scoped and only claims rows for its own `run_id`.\n\nOn server startup, any leftover `running` rows (from a crashed process)\nare flipped back to `pending` and their claim tokens are cleared.\n\n**API surface**\n\n- `POST /api/runs/start` — new run or fill-run. Pre-enqueues all attempts.\n- `POST /api/runs/:runId/cancel` — pauses the run (abort + `paused_from`).\n- `POST /api/runs/:runId/resume` — restores the pre-pause state and bumps\n  `enqueued_at` to now; accepts optional JSON body `{ \"concurrency\": \u003cn\u003e }`.\n- `POST /api/runs/attempts/:id/retry` — single `error` → `pending`.\n- `POST /api/runs/:runId/reset-errors` — bulk `error` → `pending`.\n- `GET /api/runs/queue` — everything not-yet-done, with attempts nested.\n- `GET /api/runs/history` — done-only runs, newest finish first.\n- `GET /api/runs/:runId/progress` — SSE, used by the Run tab.\n\nQueue state is local to each runner process — only `done` rows are ever\nsynced to Supabase.\n\nReasoning options are configured in `config/model-reasoning.json`. New runs\nstore the explicit `default` value when no reasoning parameter is sent, so\nleaderboard groups do not depend on ambiguous `NULL` defaults. Run\n`npm run audit-reasoning` to inspect legacy or invalid reasoning groups; add\n`-- --apply` to apply the script's safe corrections.\n\n## Running Tests\n\n```bash\n# Single file\nnode --test packages/db/adapters/sqlite/runs.test.js\n\n# All tests\nnode --test packages/**/*.test.js\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbeowolve%2Fclawbattle","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbeowolve%2Fclawbattle","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbeowolve%2Fclawbattle/lists"}