{"id":50873645,"url":"https://github.com/hinanohart/scorewright","last_synced_at":"2026-06-15T07:31:09.233Z","repository":{"id":360060382,"uuid":"1247758590","full_name":"hinanohart/scorewright","owner":"hinanohart","description":"Sandboxed, multi-signal, cross-framework fitness scoring for evolution / RSI / agent loops, with an inline anti-gaming integrity layer. Apache-2.0.","archived":false,"fork":false,"pushed_at":"2026-06-10T13:37:07.000Z","size":426,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-11T06:31:32.389Z","etag":null,"topics":["agents","evolutionary-computation","fitness-function","llm-evaluation","reward-hacking","sandbox","scoring"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hinanohart.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-23T18:35:01.000Z","updated_at":"2026-06-10T13:38:28.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/hinanohart/scorewright","commit_stats":null,"previous_names":["hinanohart/scorewright"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/hinanohart/scorewright","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fscorewright","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fscorewright/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fscorewright/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fscorewright/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hinanohart","download_url":"https://codeload.github.com/hinanohart/scorewright/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fscorewright/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34353189,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-15T02:00:07.085Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agents","evolutionary-computation","fitness-function","llm-evaluation","reward-hacking","sandbox","scoring"],"created_at":"2026-06-15T07:31:08.392Z","updated_at":"2026-06-15T07:31:09.225Z","avatar_url":"https://github.com/hinanohart.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# scorewright\n\n**Sandboxed, multi-signal, cross-framework fitness scoring for evolution / RSI / agent loops — with an inline anti-gaming integrity layer.**\n\n[![CI](https://github.com/hinanohart/scorewright/actions/workflows/ci.yml/badge.svg)](https://github.com/hinanohart/scorewright/actions/workflows/ci.yml)\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)\n[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](pyproject.toml)\n\nA reusable fitness-scoring layer for programs that evolve or improve code automatically — assemble scorers, let scorewright handle sandboxing, measurement, and honest failure reporting.\n\n\u003e **Status: pre-alpha (v0.1.0a2).** APIs may change. Core is dependency-free (standard library only). Measured numbers in this README are produced by `benchmarks/run_bench.py` on the hardware/date noted next to each figure; figures that require an API key are shown as `N/A` when no key is present.\n\n## Why\n\nInstead of writing a one-off `evaluate.py` in every project, you assemble scorers (correctness, performance, cost, LLM judge, integrity) and let scorewright handle the rest. Combining scores into a single number stays your responsibility — scorewright only measures.\n\nscorewright is:\n\n- **Multi-signal** — correctness, performance, cost, and LLM-judge quality are separate, composable scorers, each emitting *measured* values only.\n- **Sandboxed** — candidate code runs under a subprocess sandbox with CPU/memory/file-descriptor limits, wall-clock timeout, a temp working directory, and an environment allow-list (no ambient secrets leak in), plus optional best-effort network isolation. This is OS-level hardening, **not a hard security boundary**: it raises the bar against accidental damage and resource abuse, but for genuinely untrusted code use a VM/container backend (the `microsandbox` extra) or a disposable, network-isolated VM.\n- **Cross-framework** — adapters convert scorewright's intermediate representation into the shape a host framework expects. v0.1.0a2 ships an **OpenEvolve** adapter; a `verifiers` adapter is planned for v0.2.\n- **Gaming-aware** — an `AntiGamingScorer` adds *integrity* signals as a first-class part of fitness: held-out divergence, performance self-consistency, and structured-output anchoring. **Warn-only by default**; fail-closed is opt-in.\n\n### What scorewright deliberately is **not**\n\n- Not an evolution engine, not a search algorithm — it scores; your loop searches.\n- Not a reward-hacking *detector* with completeness guarantees. The integrity layer is a **best-effort, multi-signal, opt-in-fail-closed** set of heuristics that flags suspicious candidates and biases toward the safe side on uncertainty. It does not, and does not claim to, catch all gaming.\n\n## Install\n\n```bash\npip install scorewright                     # core (standard library only)\npip install \"scorewright[openevolve]\"       # + run the OpenEvolve example/bench\npip install \"scorewright[microsandbox]\"     # + VM-isolated backend (stub in 0.1.0a2)\n```\n\n## Quickstart\n\n```python\nfrom pathlib import Path\nfrom scorewright import Candidate, CompositeScorer\nfrom scorewright.sandbox import SubprocessSandbox\nfrom scorewright.scorers import CorrectnessScorer, PerfScorer\n\nsandbox = SubprocessSandbox(cpu_seconds=10, memory_mb=512, timeout_s=30)\n# fs isolation and the memory limit are on by default; pass allow_network=False\n# for best-effort network isolation (requires Python 3.12+ / a permitting kernel).\n\nscorer = CompositeScorer([\n    CorrectnessScorer(sandbox, test_command=[\"python\", \"-m\", \"pytest\", \"-q\"]),\n    PerfScorer(sandbox, command=[\"python\", \"solution.py\"], repeats=5),\n])\n\ncandidate = Candidate(path=Path(\"./candidate_program\"))\nfor result in scorer.score_all(candidate):\n    print(result.scorer, result.ok, [(s.name, s.value, s.unit) for s in result.signals])\n```\n\n`scorewright` measures; it does not silently aggregate. Combining signals into a single fitness number is the caller's (or the adapter's) responsibility, so the weighting stays explicit and auditable.\n\n### Plugging into OpenEvolve\n\n```python\nfrom scorewright.adapters.openevolve import to_openevolve_evaluator\n\nevaluate = to_openevolve_evaluator(scorer)   # -\u003e Callable[[str], dict[str, float]]\n# pass `evaluate` where OpenEvolve expects an evaluation function\n```\n\n### Catching scorer gaming\n\n```python\nfrom scorewright.scorers import AntiGamingScorer, is_flagged\n\nintegrity = AntiGamingScorer(\n    sandbox,\n    visible_test_command=[\"python\", \"-m\", \"pytest\", \"test_visible.py\", \"-q\"],\n    heldout_test_command=[\"python\", \"-m\", \"pytest\", \"test_heldout.py\", \"-q\"],\n    perf_command=[\"python\", \"solution.py\"],\n)\nresult = integrity.score(candidate)   # warn-only: always measures, never rejects\nprint(is_flagged(result))             # True if any integrity signal tripped\nprint(result.signal(\"integrity_flagged\").raw[\"reasons\"])\n```\n\nThe scorer only *measures* (warn-only) — the reject decision is opt-in at the judgment layer. Wire it through the adapter to fail closed:\n\n```python\nevaluate = to_openevolve_evaluator(\n    CompositeScorer([correctness, perf, integrity]),\n    aggregate=my_aggregate,\n    reject_on_gaming=True,   # flagged candidates get reject_score\n)\n```\n\n## How It Works\n\nscorewright is built around three core concepts:\n\n- **Signal** — a single measured quantity with a name, numeric value, unit, and `higher_is_better` direction. Scorers emit signals; nothing inside scorewright normalizes or weights them.\n- **ScoreResult** — the outcome of one scorer on one candidate. Either `ok=True` with one or more signals, or `ok=False` with an error message and no signals. A scorer that cannot measure (missing API key, execution error) reports failure honestly rather than fabricating a value.\n- **Candidate** — a path to the candidate program directory. The sandbox runs subprocesses under that path with OS-level resource limits applied.\n\n**Measurement vs. judgment.** Scorers *measure* and return `Signal`s with units and a `higher_is_better` flag. They never normalize or weight. Aggregation is a separate, explicit step in the adapter or your loop.\n\n**Honest failure.** A scorer that cannot run (missing API key, missing pricing, execution error) returns `ScoreResult(ok=False, error=...)` with no signals. It never invents a value.\n\n### Architecture\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"docs/architecture.png\" alt=\"scorewright architecture\" width=\"840\"\u003e\n\u003c/div\u003e\n\n### Source layout\n\n```\nsrc/scorewright/\n  types.py        # ScoreResult / Signal — the intermediate representation (IR)\n  scorer.py       # Scorer protocol + CompositeScorer (runs scorers, never aggregates)\n  _pricing.py     # ModelPrice + a clearly-dated EXAMPLE pricing snapshot\n  sandbox/        # SubprocessSandbox (default) + microsandbox stub (extra)\n  scorers/        # correctness, perf, cost, llm_judge, anti_gaming\n  adapters/       # openevolve (IR -\u003e native, pure conversion)\n```\n\n## Benchmark\n\n`benchmarks/run_bench.py` scores a fixed suite of candidate programs through the scorers and the OpenEvolve-adapter interface (no live LLM or evolution run is needed), and records correctness, performance, cost, and the anti-gaming **caught-rate** (fraction of deliberately-planted gaming candidates that the integrity layer flags). Each run is stamped with its environment (`os, machine, python, date, scorewright version, perf_repeats`) and written to `benchmarks/results/`.\n\nMeasured on `Linux-6.6 WSL2 x86_64`, Python 3.12.3, 2026-05-24 (UTC), 5-task suite, `perf_repeats=4` — reproduce with `python benchmarks/run_bench.py` (raw output in [`benchmarks/results/`](benchmarks/results/)):\n\n| Signal | Value | Notes |\n|---|---|---|\n| Anti-gaming **caught-rate** | **1.0** (3/3) | held-out \u0026 judge-injection catches are exact; the perf-variance catch fires on a large, machine-robust timing margin (CV 0.90 vs 0.5 threshold) |\n| Anti-gaming **false-positive-rate** | **0.0** (0/2) | honest candidates not flagged |\n| Correctness (honest pass-rate) | **1.0** | deterministic |\n| Perf (median wall-time, honest) | ~0.018 s | machine dependent; compare only within the same environment |\n| Cost (per honest candidate) | $0.00024 | computed from recorded token usage × the **example** pricing snapshot (not authoritative) |\n\n\u003e The caught-rate is over a small, hand-built suite of *known* strategies — a demonstration that the checks fire on what they target, not a coverage claim against reward-hacking in the wild. Cost figures require a pricing table and recorded token usage; with neither present the cost signal reports `ok=False` rather than a fabricated number. See [benchmarks/README.md](benchmarks/README.md) for methodology and caveats.\n\n## Audit-trail integration (memcanon)\n\n[`memcanon`](https://github.com/hinanohart/memcanon) v0.2+ accepts events from this repo via a thin in-process shim and content-hashes them into a local audit store:\n\n\u003e memcanon is not on PyPI yet. Install it from the tagged release:\n\u003e\n\u003e ```bash\n\u003e pip install \"git+https://github.com/hinanohart/memcanon@v0.2.0a2\"\n\u003e ```\n\n```python\nfrom memcanon.emit import emit\nfrom memcanon.store.local import LocalStore\n\nwith LocalStore(\"audit\") as store:\n    emit(\"scorewright\", {\"kind\": \"...\", \"decision\": \"...\"}, store=store)\n```\n\nEach record is tagged `source:scorewright` + `schema:memcanon-emit/1`. Memcanon's `memcanon export --format eu-ai-act-12 --to OUT.json` can then build an Article 12(2) paragraph-mapped audit-log artefact (SHAPE only, NOT a conformity assessment).\n\n## License\n\nMIT. See [LICENSE](LICENSE).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhinanohart%2Fscorewright","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhinanohart%2Fscorewright","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhinanohart%2Fscorewright/lists"}