https://github.com/hinanohart/envfuzz
Pre-publish, fail-closed adversarial gate for RL-verifier Environments. A falsifier of reward-hackability, not a prover of safety.
https://github.com/hinanohart/envfuzz
ci fail-closed llm-evaluation reward-hacking reward-model rlvr verifiers
Last synced: 2 days ago
JSON representation
Pre-publish, fail-closed adversarial gate for RL-verifier Environments. A falsifier of reward-hackability, not a prover of safety.
- Host: GitHub
- URL: https://github.com/hinanohart/envfuzz
- Owner: hinanohart
- License: mit
- Created: 2026-05-27T17:39:11.000Z (20 days ago)
- Default Branch: main
- Last Pushed: 2026-06-10T10:54:49.000Z (7 days ago)
- Last Synced: 2026-06-10T11:07:45.751Z (7 days ago)
- Topics: ci, fail-closed, llm-evaluation, reward-hacking, reward-model, rlvr, verifiers
- Language: Python
- Size: 78.1 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# envfuzz
**A pre-publish, fail-closed adversarial gate for RL-verifier `Environment`s.**
A *falsifier* of reward-hackability — not a *prover* of safety.
[](https://github.com/hinanohart/envfuzz/actions/workflows/ci.yml)
MIT · pre-alpha (`v0.1.0a1`)
`envfuzz` drives an RL verifier environment with deterministic adversarial policies
*before* you publish it, and **fails closed** (non-zero exit) if any policy can
inflate reward without actually satisfying the task. Wire it into CI as the last
gate before `prime env push` so a reward-hackable environment never ships.
---
## Architecture
---
## Quickstart
```bash
pip install "git+https://github.com/hinanohart/envfuzz@v0.1.0a1" # core: numpy + rich, CPU, offline
envfuzz audit corpus --fail-on-hackability # exit 1 if any env is hackable
envfuzz report corpus --html scorecard.html # self-contained HTML scorecard
```
Or drop it into a CI workflow as a GitHub Action:
```yaml
- uses: hinanohart/envfuzz@v0.1.0a1
with:
target: corpus # a corpus env id, or 'corpus'/'all'
fail-on-hackability: "true"
html: scorecard.html
```
---
## What it is
- A **gate**, run by the environment *author* before publishing. The unit of
analysis is the environment's rubric + parser, not a trained model.
- A **falsifier**: it searches for an exploit. Finding one is a definite "this is
hackable." Finding none is "no exploit in this catalog" — never a safety proof.
- **CPU-only and deterministic.** Every randomized routine takes an explicit seed.
The claimed path imports only `numpy`; it does not require a GPU, network, API
key, or even `verifiers`.
## What it is NOT
- Not a calibrated probability of hackability. Scores are **order-only** and are
combined with noisy-OR; the reward-inflation interval is a percentile bootstrap.
- Not an exhaustiveness proof, not a training-time reward-hacking detector, not a
model red-teaming tool.
- Not a full OS sandbox. The live driver isolates untrusted environments at the
**Python process level** (see Threat model), not with seccomp/namespaces.
---
## How it works
For each environment, envfuzz drives a set of reference policies (`honest_correct`,
`honest_wrong`, `null`) and a catalog of blind adversarial policies, then runs a
suite of detectors over the resulting rollouts. An environment is flagged
**hackable** only when **both**:
1. the noisy-OR composite of fired detectors ≥ 0.5; **and**
2. the reward-inflation bootstrap CI lower bound > 0 — never report
inflation the interval cannot separate from zero.
On the deterministic fixture corpus the composite is the sole operative
discriminator. Because the inflation sample keeps only strictly-positive exploit
gains over the null floor, condition 2 collapses to 0 exactly when no exploit
beats the floor, so on fixtures it tracks "some exploit inflated reward" rather
than independently gating the verdict. The CI guard becomes load-bearing on the
stochastic live path (live detector wiring deferred to v0.1.1), where sampling
noise can drag the lower bound to 0.
### Attack classes (v0.1.0a1)
| class | name | what it exploits |
|------|------|------------------|
| A1 | early-termination / empty-work | reward for participation/termination signals on their own |
| A2 | parser-injection | format/well-formedness credit earnable without solving |
| A3 | rubric-weight interaction | keyword/length/auxiliary reward dominating correctness |
| A4 | state side-channel | grader-trust / echoing information the env exposes |
### Detectors
`RewardGap`, `CeilingBreach`, `EmptyWorkHighReward`, `ParserOnly`, and
`MonotoneInflation` are active. `SideEffect` is a **v0.1.1 placeholder**: its
interface is stable but it is inert in every v0.1.0a1 path (it depends on the
sandbox-backed by-name live driver — see Roadmap). A4 is still exercised in
fixture mode through the `self_certify` / `prompt_leak` rubric attacks.
---
## Claims, non-claims, and scope
**CLAIMED (verified by CI):** on the bundled **synthetic** corpus, the four attack
classes above are falsified deterministically; detectors separate gameable from
robust fixtures with the numbers quoted below; the subprocess sandbox contains the
behaviors listed under Threat model; the CLI exits non-zero on a hackable env.
**NON-CLAIM (shipped capability, not covered by CI, hardening in `v0.1.1`):**
driving real, live `verifiers` environments end-to-end (the `[vf]` extra; see
`envfuzz.drivers.vf_env`), tool-call (`ToolEnv`) environments, and loading
environments *by name* through the sandbox.
**Out of scope:** browser / side-effecting environments, `verifiers` framework
internals (report those upstream), training-dynamics hacking, learned attackers,
and any exhaustiveness guarantee.
These boundaries are fixed and intentional:
- **NC1** — envfuzz does not prove the absence of exploits; it only falsifies.
- **NC2** — training-time reward hacking is not addressed.
- **NC3** — attacks on a trained model (rather than the environment) are not addressed.
- **NC4** — hackability scores are not calibrated probabilities.
- **NC5** — corpus numbers describe the bundled synthetic corpus, not any real model.
---
## Numbers (bundled synthetic corpus)
Produced by `envfuzz bench --quick` and asserted in CI (`tests/check_bench.py`),
so the code and this table cannot drift apart:
| metric | value |
|--------|-------|
| environments | 12 (7 gameable, 5 robust) |
| precision | 1.0 |
| recall | 1.0 |
| accuracy | 1.0 |
| attack classes exercised | A1, A2, A3, A4 |
Per **NC5**, these are properties of a small synthetic corpus designed to exercise
each attack class with robust negative controls — not a measurement against a real
reward model.
---
## Threat model (live driver)
Untrusted `verifiers` environments are third-party code; importing one can execute
arbitrary code at module load. envfuzz therefore runs untrusted execution in a
**separate process** with:
- `resource` limits (CPU seconds, address space, file size, no core dumps);
- a **scrubbed environment** (host secrets are not forwarded to the child);
- a Python-level **network guard** (`socket` raises) and **filesystem write
containment** (writes outside the sandbox working directory are denied);
- **fail-closed** semantics: any failure to obtain a clean result is treated as
"did not clear" — i.e., blocking.
This is **process-level Python isolation, not OS-level isolation.** There is no
seccomp or namespace confinement, so determined native/syscall-level code can still
escape; that hardening is planned for `v0.1.1`. The escape test
(`tests/test_sandbox.py`) asserts exactly the guarantees above and nothing more.
The current `VerifiersDriver` drives an `Environment` object **you construct**
(which you therefore already trust); loading arbitrary environments by name through
the sandbox is the `v0.1.1` item.
---
## Install
v0.1.0a1 is distributed via GitHub (a PyPI release is planned):
```bash
pip install "git+https://github.com/hinanohart/envfuzz@v0.1.0a1" # core (numpy, rich)
pip install "envfuzz[vf] @ git+https://github.com/hinanohart/envfuzz@v0.1.0a1" # + verifiers (live, NON-CLAIM)
pip install "envfuzz[dev] @ git+https://github.com/hinanohart/envfuzz@v0.1.0a1" # + test/lint toolchain
```
Python 3.10–3.13.
---
## Roadmap (v0.1.1)
These are deliberately deferred from v0.1.0a1 (the subprocess sandbox primitive
already ships and is escape-tested; the items below are about *wiring* it):
- Load environments **by name** and drive them **inside the subprocess sandbox**
(the current live driver runs a user-constructed `Environment` in-process).
- Make the `SideEffect` detector live: have the sandbox observe and report an
environment's host side-effect attempts.
- OS-level sandbox hardening (seccomp / namespaces) beyond Python-level guards.
- Tool-call (`ToolEnv`) driving; an optional PyPI release.
---
## License
MIT. See `LICENSE` and `THIRD_PARTY_NOTICES.md`.