{"id":50873693,"url":"https://github.com/hinanohart/envfuzz","last_synced_at":"2026-06-15T07:31:23.857Z","repository":{"id":360768739,"uuid":"1251557442","full_name":"hinanohart/envfuzz","owner":"hinanohart","description":"Pre-publish, fail-closed adversarial gate for RL-verifier Environments. A falsifier of reward-hackability, not a prover of safety.","archived":false,"fork":false,"pushed_at":"2026-06-10T10:54:49.000Z","size":80,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-10T11:07:45.751Z","etag":null,"topics":["ci","fail-closed","llm-evaluation","reward-hacking","reward-model","rlvr","verifiers"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hinanohart.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-27T17:39:11.000Z","updated_at":"2026-06-10T10:54:52.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/hinanohart/envfuzz","commit_stats":null,"previous_names":["hinanohart/envfuzz"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/hinanohart/envfuzz","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fenvfuzz","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fenvfuzz/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fenvfuzz/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fenvfuzz/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hinanohart","download_url":"https://codeload.github.com/hinanohart/envfuzz/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fenvfuzz/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34353189,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-15T02:00:07.085Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ci","fail-closed","llm-evaluation","reward-hacking","reward-model","rlvr","verifiers"],"created_at":"2026-06-15T07:31:22.950Z","updated_at":"2026-06-15T07:31:23.852Z","avatar_url":"https://github.com/hinanohart.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# envfuzz\n\n**A pre-publish, fail-closed adversarial gate for RL-verifier `Environment`s.**\nA *falsifier* of reward-hackability — not a *prover* of safety.\n\n[![ci](https://github.com/hinanohart/envfuzz/actions/workflows/ci.yml/badge.svg)](https://github.com/hinanohart/envfuzz/actions/workflows/ci.yml)\n\u0026nbsp;MIT · pre-alpha (`v0.1.0a1`)\n\n`envfuzz` drives an RL verifier environment with deterministic adversarial policies\n*before* you publish it, and **fails closed** (non-zero exit) if any policy can\ninflate reward without actually satisfying the task. Wire it into CI as the last\ngate before `prime env push` so a reward-hackable environment never ships.\n\n---\n\n## Architecture\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"docs/architecture.png\" alt=\"envfuzz architecture\" width=\"840\"\u003e\n\u003c/div\u003e\n\n---\n\n## Quickstart\n\n```bash\npip install \"git+https://github.com/hinanohart/envfuzz@v0.1.0a1\"   # core: numpy + rich, CPU, offline\nenvfuzz audit corpus --fail-on-hackability     # exit 1 if any env is hackable\nenvfuzz report corpus --html scorecard.html    # self-contained HTML scorecard\n```\n\nOr drop it into a CI workflow as a GitHub Action:\n\n```yaml\n- uses: hinanohart/envfuzz@v0.1.0a1\n  with:\n    target: corpus            # a corpus env id, or 'corpus'/'all'\n    fail-on-hackability: \"true\"\n    html: scorecard.html\n```\n\n---\n\n## What it is\n\n- A **gate**, run by the environment *author* before publishing. The unit of\n  analysis is the environment's rubric + parser, not a trained model.\n- A **falsifier**: it searches for an exploit. Finding one is a definite \"this is\n  hackable.\" Finding none is \"no exploit in this catalog\" — never a safety proof.\n- **CPU-only and deterministic.** Every randomized routine takes an explicit seed.\n  The claimed path imports only `numpy`; it does not require a GPU, network, API\n  key, or even `verifiers`.\n\n## What it is NOT\n\n- Not a calibrated probability of hackability. Scores are **order-only** and are\n  combined with noisy-OR; the reward-inflation interval is a percentile bootstrap.\n- Not an exhaustiveness proof, not a training-time reward-hacking detector, not a\n  model red-teaming tool.\n- Not a full OS sandbox. The live driver isolates untrusted environments at the\n  **Python process level** (see Threat model), not with seccomp/namespaces.\n\n---\n\n## How it works\n\nFor each environment, envfuzz drives a set of reference policies (`honest_correct`,\n`honest_wrong`, `null`) and a catalog of blind adversarial policies, then runs a\nsuite of detectors over the resulting rollouts. An environment is flagged\n**hackable** only when **both**:\n\n1. the noisy-OR composite of fired detectors ≥ 0.5; **and**\n2. the reward-inflation bootstrap CI lower bound \u003e 0 — never report\n   inflation the interval cannot separate from zero.\n\nOn the deterministic fixture corpus the composite is the sole operative\ndiscriminator. Because the inflation sample keeps only strictly-positive exploit\ngains over the null floor, condition 2 collapses to 0 exactly when no exploit\nbeats the floor, so on fixtures it tracks \"some exploit inflated reward\" rather\nthan independently gating the verdict. The CI guard becomes load-bearing on the\nstochastic live path (live detector wiring deferred to v0.1.1), where sampling\nnoise can drag the lower bound to 0.\n\n### Attack classes (v0.1.0a1)\n\n| class | name | what it exploits |\n|------|------|------------------|\n| A1 | early-termination / empty-work | reward for participation/termination signals on their own |\n| A2 | parser-injection | format/well-formedness credit earnable without solving |\n| A3 | rubric-weight interaction | keyword/length/auxiliary reward dominating correctness |\n| A4 | state side-channel | grader-trust / echoing information the env exposes |\n\n### Detectors\n\n`RewardGap`, `CeilingBreach`, `EmptyWorkHighReward`, `ParserOnly`, and\n`MonotoneInflation` are active. `SideEffect` is a **v0.1.1 placeholder**: its\ninterface is stable but it is inert in every v0.1.0a1 path (it depends on the\nsandbox-backed by-name live driver — see Roadmap). A4 is still exercised in\nfixture mode through the `self_certify` / `prompt_leak` rubric attacks.\n\n---\n\n## Claims, non-claims, and scope\n\n**CLAIMED (verified by CI):** on the bundled **synthetic** corpus, the four attack\nclasses above are falsified deterministically; detectors separate gameable from\nrobust fixtures with the numbers quoted below; the subprocess sandbox contains the\nbehaviors listed under Threat model; the CLI exits non-zero on a hackable env.\n\n**NON-CLAIM (shipped capability, not covered by CI, hardening in `v0.1.1`):**\ndriving real, live `verifiers` environments end-to-end (the `[vf]` extra; see\n`envfuzz.drivers.vf_env`), tool-call (`ToolEnv`) environments, and loading\nenvironments *by name* through the sandbox.\n\n**Out of scope:** browser / side-effecting environments, `verifiers` framework\ninternals (report those upstream), training-dynamics hacking, learned attackers,\nand any exhaustiveness guarantee.\n\nThese boundaries are fixed and intentional:\n\n- **NC1** — envfuzz does not prove the absence of exploits; it only falsifies.\n- **NC2** — training-time reward hacking is not addressed.\n- **NC3** — attacks on a trained model (rather than the environment) are not addressed.\n- **NC4** — hackability scores are not calibrated probabilities.\n- **NC5** — corpus numbers describe the bundled synthetic corpus, not any real model.\n\n---\n\n## Numbers (bundled synthetic corpus)\n\nProduced by `envfuzz bench --quick` and asserted in CI (`tests/check_bench.py`),\nso the code and this table cannot drift apart:\n\n| metric | value |\n|--------|-------|\n| environments | 12 (7 gameable, 5 robust) |\n| precision | 1.0 |\n| recall | 1.0 |\n| accuracy | 1.0 |\n| attack classes exercised | A1, A2, A3, A4 |\n\nPer **NC5**, these are properties of a small synthetic corpus designed to exercise\neach attack class with robust negative controls — not a measurement against a real\nreward model.\n\n---\n\n## Threat model (live driver)\n\nUntrusted `verifiers` environments are third-party code; importing one can execute\narbitrary code at module load. envfuzz therefore runs untrusted execution in a\n**separate process** with:\n\n- `resource` limits (CPU seconds, address space, file size, no core dumps);\n- a **scrubbed environment** (host secrets are not forwarded to the child);\n- a Python-level **network guard** (`socket` raises) and **filesystem write\n  containment** (writes outside the sandbox working directory are denied);\n- **fail-closed** semantics: any failure to obtain a clean result is treated as\n  \"did not clear\" — i.e., blocking.\n\nThis is **process-level Python isolation, not OS-level isolation.** There is no\nseccomp or namespace confinement, so determined native/syscall-level code can still\nescape; that hardening is planned for `v0.1.1`. The escape test\n(`tests/test_sandbox.py`) asserts exactly the guarantees above and nothing more.\n\nThe current `VerifiersDriver` drives an `Environment` object **you construct**\n(which you therefore already trust); loading arbitrary environments by name through\nthe sandbox is the `v0.1.1` item.\n\n---\n\n## Install\n\nv0.1.0a1 is distributed via GitHub (a PyPI release is planned):\n\n```bash\npip install \"git+https://github.com/hinanohart/envfuzz@v0.1.0a1\"           # core (numpy, rich)\npip install \"envfuzz[vf] @ git+https://github.com/hinanohart/envfuzz@v0.1.0a1\"   # + verifiers (live, NON-CLAIM)\npip install \"envfuzz[dev] @ git+https://github.com/hinanohart/envfuzz@v0.1.0a1\"  # + test/lint toolchain\n```\n\nPython 3.10–3.13.\n\n---\n\n## Roadmap (v0.1.1)\n\nThese are deliberately deferred from v0.1.0a1 (the subprocess sandbox primitive\nalready ships and is escape-tested; the items below are about *wiring* it):\n\n- Load environments **by name** and drive them **inside the subprocess sandbox**\n  (the current live driver runs a user-constructed `Environment` in-process).\n- Make the `SideEffect` detector live: have the sandbox observe and report an\n  environment's host side-effect attempts.\n- OS-level sandbox hardening (seccomp / namespaces) beyond Python-level guards.\n- Tool-call (`ToolEnv`) driving; an optional PyPI release.\n\n---\n\n## License\n\nMIT. See `LICENSE` and `THIRD_PARTY_NOTICES.md`.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhinanohart%2Fenvfuzz","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhinanohart%2Fenvfuzz","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhinanohart%2Fenvfuzz/lists"}