https://github.com/hinanohart/hazardloop
Censoring-aware competing-risk survival analysis for long-horizon LLM-agent trajectories (Kaplan-Meier / Nelson-Aalen / Aalen-Johansen CIF / Weibull AFT + offline-replay). CPU-only.
- Host: GitHub
- URL: https://github.com/hinanohart/hazardloop
- Owner: hinanohart
- License: MIT
- Created: 2026-05-29T16:06:40.000Z (30 days ago)
- Default Branch: main
- Last Pushed: 2026-06-10T09:26:11.000Z (18 days ago)
- Last Synced: 2026-06-10T11:07:50.400Z (18 days ago)
- Topics: aalen-johansen, agent-evaluation, competing-risks, kaplan-meier, llm-agents, survival-analysis, time-to-event
- Language: Python
- Homepage: https://github.com/hinanohart/hazardloop
- Size: 134 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Notice: NOTICE
# hazardloop
**Censoring-aware competing-risk survival analysis for long-horizon LLM-agent trajectories.**
Survival statistics for agent runs: handles **censored** timeouts correctly and separates **competing failure modes** (wrong patch, tool error, infinite loop, budget exhaustion) instead of collapsing them into one binary label.
> **Status: 0.1.0a3 (pre-alpha).** hazardloop is an *estimator plus a fail-closed policy
> interface*; it is **not a trained controller**. Any benefit to a live agent is
> **UNVERIFIED** in this release — hazardloop never runs, retrains, or re-executes agents.
> See [`docs/CLAIMS.md`](docs/CLAIMS.md) for the exact CLAIM / NON-CLAIM boundary.
## Why
A METR-style time horizon fits a logistic curve to binary success against task length.
Two runs that both "failed" are treated identically even if one hit a wall-clock timeout
(we never observed whether it would have succeeded — that is *censoring*) and the other
submitted a wrong patch (an observed terminal event). And a single failure label hides
*which* terminal event occurred. Biostatistics solved exactly this shape — time-to-event
with right-censoring and competing risks — decades ago. hazardloop applies it to agent
runs, and quantifies how much the censoring-blind and censoring-aware pictures disagree.
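The gap between the two pictures can be seen on a toy dataset. The following is a minimal numpy sketch, not hazardloop's implementation: `km_survival` is a hypothetical helper and the eight runs are invented for illustration. None of the runs succeeded; three ended in an observed failure and five were cut off by a timeout.

```python
import numpy as np

def km_survival(steps, is_event):
    """Kaplan-Meier S(t) at each distinct observed-failure time."""
    steps = np.asarray(steps, dtype=float)
    is_event = np.asarray(is_event, dtype=bool)
    times = np.unique(steps[is_event])
    surv, s = [], 1.0
    for t in times:
        at_risk = np.sum(steps >= t)               # runs still under observation at t
        failed = np.sum((steps == t) & is_event)   # observed failures exactly at t
        s *= 1.0 - failed / at_risk
        surv.append(s)
    return times, np.array(surv)

# Hypothetical toy runs: step at which each run ended, and whether the ending
# was an observed failure (True) or a timeout, i.e. censored (False).
steps = [3, 5, 5, 8, 10, 10, 10, 10]
is_event = [True, True, False, True, False, False, False, False]

times, surv = km_survival(steps, is_event)

blind_failure = 1.0              # binary label: all 8 runs count as "failed"
aware_failure = 1.0 - surv[-1]   # KM failure probability by the last event time
print(blind_failure, round(aware_failure, 3))
```

The binary label reports a failure rate of 1.0, while the Kaplan-Meier estimate only supports a failure probability of about 0.4 by step 8, because the timed-out runs leave the risk set without ever being counted as failures.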
## Install
Not yet on PyPI (pre-alpha; the PyPI upload is deferred). Install from source:
```bash
pip install "git+https://github.com/hinanohart/hazardloop" # core: numpy, scipy, typer
pip install "hazardloop[data] @ git+https://github.com/hinanohart/hazardloop" # + HF dataset adapter
pip install "hazardloop[test] @ git+https://github.com/hinanohart/hazardloop" # + pytest/hypothesis/lifelines
```
## Quickstart
```python
from hazardloop import fit_survival, ReplayEvaluator, EventModel
from hazardloop.adapters.mock import synthetic_competing_risks
records = synthetic_competing_risks(1500, seed=0) # deterministic synthetic runs
report = fit_survival(records) # KM + Nelson-Aalen + Aalen-Johansen CIF + Weibull
print(report.weibull.shape) # β > 1 => wear-out, β < 1 => early-mortality
print(sorted(report.cif.cif_by_cause)) # competing causes
replay = ReplayEvaluator(EventModel.failure_as_event(), seed=0).evaluate(records)
print(replay.test_metrics.recall, replay.test_metrics.saved_compute_fraction_all)
# values above are illustrative (synthetic, seed=0); real measured numbers are below.
```
```bash
hazardloop fit --backend mock # fit and summarise
hazardloop replay --backend mock # offline-replay decision quality
hazardloop fit --backend swe-smith --limit 400 # real SWE-smith-trajectories (needs [data])
hazardloop --help
```
## How it works
1. **Ingest** — agent run logs are normalised into `SurvivalRecord` objects. Each record carries:
- a step count
- a terminal mode (resolved, wrong-patch, tool-error, infinite-loop, budget-exhausted, or timeout/censored)
- an optional cluster identifier (model / harness / repo)
2. **Risk-table** — `core/_risksets.py` builds the at-risk set at each observed event time, handling right-censoring correctly. This is the shared foundation for all estimators.
3. **Estimators** — four estimators run on the same risk table:
- **Kaplan-Meier** (`core/kaplan_meier.py`) — survival function S(t) with Greenwood variance and complementary-log-log confidence intervals.
- **Nelson-Aalen** (`core/nelson_aalen.py`) — cumulative hazard H(t); the per-step hazard increment is what the abort policy reads.
- **Aalen-Johansen** (`core/aalen_johansen.py`) — competing-risk cumulative incidence function (CIF) per cause; avoids the naive `1 − KM_c` over-estimation.
- **Weibull AFT** (`core/weibull_aft.py`) — shape parameter β; β < 1 indicates early-mortality (most failures happen early), β > 1 wear-out (hazard rises with step count).
4. **Bootstrap** (`core/bootstrap.py`) — BCa cluster bootstrap; clusters are defined by model / harness / repository so that correlated runs don't inflate precision.
5. **Offline-replay** (`replay.py`) — a `HazardThresholdPolicy` is fitted on a train split and evaluated on a disjoint test split. Reported metrics: recall, premature-abort rate, saved-compute fraction, and median lead-time to failure. The abort threshold is never chosen on the test set (hindsight violation R5).
6. **CLI** (`cli.py`) — `hazardloop fit` and `hazardloop replay` wrap the above for command-line use with optional backend selection (`mock`, `swe-smith`).
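Steps 2, 3, and 5 can be sketched in plain numpy. This is an illustration under stated assumptions, not the code in `core/` or `replay.py`: `nelson_aalen` and `first_step_over` are hypothetical names, the toy data is invented, and the actual decision rule of `HazardThresholdPolicy` is not documented here, so the threshold rule below is only one plausible reading of "the per-step hazard increment is what the abort policy reads".

```python
import numpy as np

def nelson_aalen(steps, is_event):
    """Return event times, per-step hazard increments d_t / n_t, and cumulative hazard H(t)."""
    steps = np.asarray(steps, dtype=float)
    is_event = np.asarray(is_event, dtype=bool)
    times = np.unique(steps[is_event])
    # Risk table: number still at risk and number of events at each event time.
    at_risk = np.array([np.sum(steps >= t) for t in times])
    events = np.array([np.sum((steps == t) & is_event) for t in times])
    increments = events / at_risk
    return times, increments, np.cumsum(increments)

def first_step_over(times, increments, threshold):
    """Hypothetical abort rule: first event time whose hazard increment exceeds the threshold."""
    over = np.nonzero(increments > threshold)[0]
    return float(times[over[0]]) if over.size else None

# Hypothetical training runs (True = observed failure, False = censored).
steps = [3, 5, 5, 8, 10, 10, 10, 10]
is_event = [True, True, False, True, False, False, False, False]

times, increments, cumulative = nelson_aalen(steps, is_event)
abort_step = first_step_over(times, increments, threshold=0.15)
print(increments, cumulative[-1], abort_step)
```

In an offline-replay setting the increments and the threshold would both come from the train split, and the resulting abort step would then be scored against runs from the disjoint test split.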
## What it computes
| Estimator | Module | Notes |
|---|---|---|
| Kaplan-Meier survival + Greenwood + cloglog CI | `core.kaplan_meier` | from scratch (numpy) |
| Nelson-Aalen cumulative hazard | `core.nelson_aalen` | per-step hazard increment, read by the abort policy |
| Aalen-Johansen competing-risk CIF | `core.aalen_johansen` | avoids the naive `1 - KM_c` over-estimation |
| Weibull AFT shape β | `core.weibull_aft` | β<1 early-mortality, β>1 wear-out |
| Cluster bootstrap CI (BCa) | `core.bootstrap` | default; cluster = model / harness / repo |
| Offline-replay decision quality | `replay` | lead-time, premature-abort, saved-compute, per-cause PR |
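To make the Aalen-Johansen row concrete, here is a small sketch comparing the cumulative incidence function with the naive `1 - KM_c` estimate, which treats the other causes as censoring. It is a textbook-style illustration on invented data, not the contents of `core.aalen_johansen`; the function names and the integer cause coding (0 = censored) are assumptions.

```python
import numpy as np

def aalen_johansen(steps, cause):
    """Final CIF per cause and overall survival. cause: 0 = censored, 1..K = terminal cause."""
    steps = np.asarray(steps, dtype=float)
    cause = np.asarray(cause)
    times = np.unique(steps[cause > 0])
    causes = sorted(set(cause[cause > 0].tolist()))
    cif = {c: 0.0 for c in causes}
    s_prev = 1.0  # overall survival just before the current event time
    for t in times:
        n = np.sum(steps >= t)
        for c in causes:
            d_c = np.sum((steps == t) & (cause == c))
            cif[c] += s_prev * d_c / n
        d_all = np.sum((steps == t) & (cause > 0))
        s_prev *= 1.0 - d_all / n
    return cif, s_prev

def naive_one_minus_km(steps, cause, c):
    """Naive estimate: Kaplan-Meier for cause c with all other causes treated as censored."""
    steps = np.asarray(steps, dtype=float)
    cause = np.asarray(cause)
    s = 1.0
    for t in np.unique(steps[cause == c]):
        n = np.sum(steps >= t)
        d = np.sum((steps == t) & (cause == c))
        s *= 1.0 - d / n
    return 1.0 - s

# Hypothetical runs with two failure causes (1, 2) and two censored runs (0).
steps = [2, 3, 3, 4, 5, 6, 6, 7]
cause = [1, 2, 1, 0, 2, 1, 0, 2]

cif, surv = aalen_johansen(steps, cause)
naive = {c: naive_one_minus_km(steps, cause, c) for c in cif}
print(cif, surv, naive)
```

On this toy data the per-cause CIFs sum to `1 - S`, while the naive estimates sum to more than 1, which is the over-estimation the table refers to.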
## Measured results
All numbers below are generated by `scripts/run_bench.py` into
[`bench_results/`](bench_results/) and are the single source of truth (no hand-written
numbers). Reproduce with:
```bash
python scripts/run_bench.py
```
Note: the script exits with a non-zero code because the `datasets` streaming loader
segfaults at interpreter teardown (a known pyarrow/GIL issue) *after* both JSON files are
written. Verify success by checking that `bench_results/real_survival.json` and
`bench_results/synthetic_cif.json` exist and are non-empty, not by the exit code.
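The check described in the note can be scripted. The helper below is a sketch, and `bench_outputs_ok` is a hypothetical name that is not part of hazardloop. It tests that both files exist and are non-empty; parsing each file as JSON is an extra safeguard added here against truncated writes, not something the note requires.

```python
import json
from pathlib import Path

# File names taken from the note above.
EXPECTED_FILES = ("real_survival.json", "synthetic_cif.json")

def bench_outputs_ok(results_dir="bench_results"):
    """Return True if every expected file exists, is non-empty, and parses as JSON."""
    for name in EXPECTED_FILES:
        path = Path(results_dir) / name
        if not path.is_file() or path.stat().st_size == 0:
            return False
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            return False
    return True
```

Calling `bench_outputs_ok()` from the repository root after `python scripts/run_bench.py` gives a pass/fail signal that does not depend on the script's exit code.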
**Real trajectories** (`bench_results/real_survival.json`) — SWE-bench/SWE-smith-trajectories,
`tool` split, n=400 (301 resolved, 99 unresolved), step axis, mode-A (failure-as-event):
- Kaplan-Meier final survival ≈ **0.049**; median survival ≈ **47 steps**; 48 event times.
- **Censoring-blind logistic vs censoring-aware KM** (the failure-CDF divergence): mean
absolute difference ≈ **0.090**, maximum absolute difference ≈ **0.155**. This is a
difference between two estimators on one dataset; it is **not** a statement that the
METR methodology is incorrect.
- **Offline-replay** (abort threshold selected on a train split, evaluated on a disjoint
test split of 200 runs; cluster bootstrap over the 58 held-out test repositories, of 108
total): recall ≈ **0.75**, premature-abort rate ≈ **0.43** (95% CI ≈ [0.35, 0.51]),
saved-compute fraction ≈ **0.31**, median lead-time ≈ **17 steps**.
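The interval on the premature-abort rate comes from resampling whole repositories rather than individual runs. The sketch below shows that idea with a plain percentile interval, which is simpler than the BCa interval hazardloop reports; `cluster_bootstrap_ci` is a hypothetical helper and the flags and repository ids are invented.

```python
import numpy as np

def cluster_bootstrap_ci(values, clusters, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean, resampling whole clusters with replacement."""
    values = np.asarray(values, dtype=float)
    clusters = np.asarray(clusters)
    # Group the per-run values by cluster so each cluster is drawn as a unit.
    groups = [values[clusters == g] for g in np.unique(clusters)]
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        picked = rng.integers(0, len(groups), size=len(groups))
        stats[b] = np.concatenate([groups[i] for i in picked]).mean()
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Hypothetical per-run premature-abort flags and the repository each run belongs to.
flags = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
repos = ["a", "a", "a", "b", "b", "c", "c", "c", "d", "d", "e", "e"]

lo, hi = cluster_bootstrap_ci(flags, repos)
print(lo, hi)
```

Because runs from the same repository tend to succeed or fail together, drawing repositories as units usually gives a wider, more honest interval than resampling the runs independently.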
**Synthetic competing risks** (`bench_results/synthetic_cif.json`) — typed multi-cause
Aalen-Johansen CIF on the deterministic generator (n=1500): final CIF by cause ≈
wrong_patch 0.44, tool_error 0.26, infinite_loop 0.16, budget_exhausted 0.14; additivity
residual `|ΣCIF − (1−S)|` ≈ 1e-16; Weibull shape β ≈ 1.22.
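For intuition about the reported shape, the standard Weibull hazard is h(t) = (β/λ)(t/λ)^(β−1), which rises with t when β > 1 and falls when β < 1. The sketch below evaluates it for the synthetic β ≈ 1.22; the scale λ = 40 is an arbitrary illustrative value, not a number from `bench_results/`, and `weibull_hazard` is a hypothetical helper.

```python
import numpy as np

def weibull_hazard(t, shape, scale):
    """Weibull hazard h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    t = np.asarray(t, dtype=float)
    return (shape / scale) * (t / scale) ** (shape - 1.0)

t = np.arange(1, 51)   # step axis
scale = 40.0           # illustrative assumption, not a fitted value

wear_out = weibull_hazard(t, shape=1.22, scale=scale)        # beta > 1: hazard rises
early_mortality = weibull_hazard(t, shape=0.8, scale=scale)  # beta < 1: hazard falls
constant = weibull_hazard(t, shape=1.0, scale=scale)         # beta = 1: exponential case

print(wear_out[0] < wear_out[-1], early_mortality[0] > early_mortality[-1])
```

With β ≈ 1.22 the per-step failure hazard increases slowly as a run gets longer, which is the wear-out regime described in the estimator list.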
> **Honesty note:** the SWE-smith-trajectories source carries only a binary outcome and no
> per-cause failure labels. Therefore the typed competing-risk CIF in this release is synthetic-validation-only;
> the real-trajectory typed CIF is deferred. The real Kaplan-Meier / Nelson-Aalen /
> divergence / replay numbers above are measured on real data.
## Scope and honesty
hazardloop deliberately does **not**:
- claim any agent capability, task-completion, or pass@k gain (offline-replay numbers are
counterfactual estimates on logged runs, not live trials);
- extrapolate tails / return levels (no extreme-value theory — the i.i.d. assumption
breaks for sequential agent failures, arXiv:2511.02927);
- act as a verifier, reward auditor, or formal safety guarantee.
The `fork` decision's rescue rate is unobserved and is **not reported**.
### Relation to prior work
- arXiv:2509.02360 (offline PRM-score replay) — scalar-reward replay without censoring or
typed competing risks; hazardloop's replay consumes the censoring-aware survival core.
- arXiv:2512.03109 (E-valuator) — anytime-valid testing for binary correctness; deferred
and cited here, not re-implemented (`hazardloop compare` is a v0.2 stub).
## License
MIT. See [`LICENSE`](LICENSE) and [`NOTICE`](NOTICE). The GPL survival package named
in NOTICE is **not** a dependency and is excluded by a CI check; KM / Nelson-Aalen /
Aalen-Johansen / Weibull-AFT are re-implemented on numpy.