https://github.com/hinanohart/hazardloop

Censoring-aware competing-risk survival analysis for long-horizon LLM-agent trajectories (Kaplan-Meier / Nelson-Aalen / Aalen-Johansen CIF / Weibull AFT + offline-replay). CPU-only.

# hazardloop

**Censoring-aware competing-risk survival analysis for long-horizon LLM-agent trajectories.**

Survival statistics for agent runs: handles **censored** timeouts correctly and separates **competing failure modes** (wrong patch, tool error, infinite loop, budget exhaustion) instead of collapsing them into one binary label.

> **Status: 0.1.0a3 (pre-alpha).** hazardloop is an *estimator plus a fail-closed policy
> interface*; it is **not a trained controller**. Any benefit to a live agent is
> **UNVERIFIED** in this release — hazardloop never runs, retrains, or re-executes agents.
> See [`docs/CLAIMS.md`](docs/CLAIMS.md) for the exact CLAIM / NON-CLAIM boundary.

## Why

A METR-style time horizon fits a logistic curve to binary success against task length.
Two runs that both "failed" are treated identically even if one hit a wall-clock timeout
(we never observed whether it would have succeeded — that is *censoring*) and the other
submitted a wrong patch (an observed terminal event). A single failure label also hides
*which* terminal event occurred. Biostatistics solved exactly this problem, time-to-event
analysis with right-censoring and competing risks, decades ago. hazardloop applies it to
agent runs and quantifies how much the censoring-blind and censoring-aware pictures disagree.

## Install

Not yet on PyPI (this is a pre-alpha release and the PyPI upload is deferred); install from source:

```bash
pip install "git+https://github.com/hinanohart/hazardloop" # core: numpy, scipy, typer
pip install "hazardloop[data] @ git+https://github.com/hinanohart/hazardloop" # + HF dataset adapter
pip install "hazardloop[test] @ git+https://github.com/hinanohart/hazardloop" # + pytest/hypothesis/lifelines
```

## Quickstart

```python
from hazardloop import fit_survival, ReplayEvaluator, EventModel
from hazardloop.adapters.mock import synthetic_competing_risks

records = synthetic_competing_risks(1500, seed=0) # deterministic synthetic runs
report = fit_survival(records) # KM + Nelson-Aalen + Aalen-Johansen CIF + Weibull
print(report.weibull.shape) # β > 1 => wear-out, β < 1 => early-mortality
print(sorted(report.cif.cif_by_cause)) # competing causes

replay = ReplayEvaluator(EventModel.failure_as_event(), seed=0).evaluate(records)
print(replay.test_metrics.recall, replay.test_metrics.saved_compute_fraction_all)
# values above are illustrative (synthetic, seed=0); real measured numbers are below.
```

```bash
hazardloop fit --backend mock # fit and summarise
hazardloop replay --backend mock # offline-replay decision quality
hazardloop fit --backend swe-smith --limit 400 # real SWE-smith-trajectories (needs [data])
hazardloop --help
```

## How it works

1. **Ingest** — agent run logs are normalised into `SurvivalRecord` objects. Each record carries:
   - a step count
   - a terminal mode (resolved, wrong-patch, tool-error, infinite-loop, budget-exhausted, or timeout/censored)
   - an optional cluster identifier (model / harness / repo)

2. **Risk-table** — `core/_risksets.py` builds the at-risk set at each observed event time, handling right-censoring correctly. This is the shared foundation for all estimators.

3. **Estimators** — four estimators run on the same risk table:
   - **Kaplan-Meier** (`core/kaplan_meier.py`) — survival function S(t) with Greenwood variance and complementary-log-log confidence intervals.
   - **Nelson-Aalen** (`core/nelson_aalen.py`) — cumulative hazard H(t); the per-step hazard increment is what the abort policy reads.
   - **Aalen-Johansen** (`core/aalen_johansen.py`) — competing-risk cumulative incidence function (CIF) per cause; avoids the naive `1 − KM_c` over-estimation.
   - **Weibull AFT** (`core/weibull_aft.py`) — shape parameter β; β < 1 indicates early-mortality (most failures happen early), β > 1 wear-out (hazard rises with step count).

4. **Bootstrap** (`core/bootstrap.py`) — BCa cluster bootstrap; clusters are defined by model / harness / repository so that correlated runs don't inflate precision.

5. **Offline-replay** (`replay.py`) — a `HazardThresholdPolicy` is fitted on a train split and evaluated on a disjoint test split. Reported metrics: recall, premature-abort rate, saved-compute fraction, and median lead-time to failure. The abort threshold is never chosen on the test set, because doing so would be a hindsight violation (R5).

6. **CLI** (`cli.py`) — `hazardloop fit` and `hazardloop replay` wrap the above for command-line use with optional backend selection (`mock`, `swe-smith`).
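
To make steps 2 and 3 concrete, here is a minimal numpy sketch of a right-censored risk table with the Kaplan-Meier and Nelson-Aalen estimators built on top of it. It is an illustration of the textbook estimators, not hazardloop's actual code: the function names `risk_table`, `kaplan_meier`, and `nelson_aalen` and the toy data are assumptions made for this example, and variance / confidence intervals are omitted.

```python
import numpy as np

def risk_table(steps, is_event):
    """Hypothetical helper: at-risk and event counts at each observed event time."""
    steps = np.asarray(steps, dtype=float)
    is_event = np.asarray(is_event, dtype=bool)
    event_times = np.unique(steps[is_event])
    # A run is at risk at time t if it has not yet ended (event or censoring) before t.
    n_at_risk = np.array([(steps >= t).sum() for t in event_times])
    n_events = np.array([((steps == t) & is_event).sum() for t in event_times])
    return event_times, n_at_risk, n_events

def kaplan_meier(n_at_risk, n_events):
    """Product-limit estimate of S(t) at each event time."""
    return np.cumprod(1.0 - n_events / n_at_risk)

def nelson_aalen(n_at_risk, n_events):
    """Cumulative hazard H(t): running sum of the per-step increments d/n."""
    return np.cumsum(n_events / n_at_risk)

# Toy data (made up): six runs; is_event=False marks a censored run, e.g. a timeout.
steps = [3, 5, 5, 8, 10, 12]
is_event = [True, True, False, True, False, True]

times, at_risk, events = risk_table(steps, is_event)
survival = kaplan_meier(at_risk, events)
cum_hazard = nelson_aalen(at_risk, events)

for t, n, s, h in zip(times, at_risk, survival, cum_hazard):
    print(f"step={t:.0f} at_risk={n} S={s:.4f} H={h:.4f}")
```

Censored runs never count as events, but they stay in the at-risk denominator until the step at which they were censored. In the toy data the run censored at step 5 still counts toward the 5 runs at risk at step 5 and is gone from the 3 at risk at step 8.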

## Architecture


[Figure: hazardloop architecture]

## What it computes

| Estimator | Module | Notes |
|---|---|---|
| Kaplan-Meier survival + Greenwood + cloglog CI | `core.kaplan_meier` | from scratch (numpy) |
| Nelson-Aalen cumulative hazard | `core.nelson_aalen` | per-step hazard increment, read by the abort policy |
| Aalen-Johansen competing-risk CIF | `core.aalen_johansen` | avoids the naive `1 - KM_c` over-estimation |
| Weibull AFT shape β | `core.weibull_aft` | β<1 early-mortality, β>1 wear-out |
| Cluster bootstrap CI (BCa) | `core.bootstrap` | default; cluster = model / harness / repo |
| Offline-replay decision quality | `replay` | lead-time, premature-abort, saved-compute, per-cause PR |
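
The "naive `1 - KM_c`" remark in the table can be seen on a small example. The sketch below implements the textbook Aalen-Johansen CIF for competing causes next to the naive estimator that treats every other cause as censoring. The function names, the integer cause coding (0 = censored), and the toy data are assumptions for illustration only; hazardloop's own module may be organised differently.

```python
import numpy as np

def aalen_johansen(steps, causes):
    """Textbook CIF per cause. causes: 0 = censored, 1..K = terminal cause (assumed coding)."""
    steps = np.asarray(steps, dtype=float)
    causes = np.asarray(causes, dtype=int)
    times = np.unique(steps[causes > 0])
    cause_ids = [int(c) for c in np.unique(causes[causes > 0])]
    surv_prev = 1.0                          # all-cause survival just before time t
    running = {c: 0.0 for c in cause_ids}
    cif = {c: [] for c in cause_ids}
    surv = []
    for t in times:
        n_at_risk = (steps >= t).sum()
        d_all = 0
        for c in cause_ids:
            d_c = ((steps == t) & (causes == c)).sum()
            running[c] += surv_prev * d_c / n_at_risk   # S(t-) * cause-specific hazard
            cif[c].append(running[c])
            d_all += d_c
        surv_prev *= 1.0 - d_all / n_at_risk
        surv.append(surv_prev)
    return times, {c: np.array(v) for c, v in cif.items()}, np.array(surv)

def naive_one_minus_km(steps, causes, cause):
    """1 - KM for one cause, wrongly treating all other causes as censoring."""
    steps = np.asarray(steps, dtype=float)
    causes = np.asarray(causes, dtype=int)
    s, out = 1.0, []
    for t in np.unique(steps[causes > 0]):
        n_at_risk = (steps >= t).sum()
        d_c = ((steps == t) & (causes == cause)).sum()
        s *= 1.0 - d_c / n_at_risk
        out.append(1.0 - s)
    return np.array(out)

# Toy data (made up): eight runs, two failure causes, two censored runs.
steps = [2, 3, 3, 5, 6, 8, 9, 10]
causes = [1, 2, 0, 1, 2, 0, 1, 2]

times, cif, surv = aalen_johansen(steps, causes)
naive_1 = naive_one_minus_km(steps, causes, cause=1)
naive_2 = naive_one_minus_km(steps, causes, cause=2)

print("final CIF by cause:", {c: float(v[-1]) for c, v in cif.items()})
print("final naive 1-KM_c:", float(naive_1[-1]), float(naive_2[-1]))
print("max additivity residual:", float(np.abs(cif[1] + cif[2] - (1.0 - surv)).max()))
```

On this toy data the two Aalen-Johansen CIFs end at 0.5 each and sum to `1 − S`, while the naive estimates end at 0.65 and 1.0, which together exceed 1. This is the same additivity identity that the synthetic benchmark reports as a residual.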

## Measured results

All numbers below are generated by `scripts/run_bench.py` into
[`bench_results/`](bench_results/) and are the single source of truth (no hand-written
numbers). Reproduce with:

```bash
python scripts/run_bench.py
```

Note: the script exits with a non-zero code because the `datasets` streaming loader
segfaults at interpreter teardown (a known pyarrow/GIL issue) *after* both JSON files are
written. Verify success by checking that `bench_results/real_survival.json` and
`bench_results/synthetic_cif.json` exist and are non-empty, not by the exit code.
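
A small check along these lines can stand in for the exit code in scripts or CI. This is a sketch, not part of hazardloop: the helper name `bench_outputs_ok` is made up, and the extra "parses as JSON" test goes slightly beyond the existence / non-empty check described above.

```python
import json
from pathlib import Path

def bench_outputs_ok(results_dir="bench_results",
                     names=("real_survival.json", "synthetic_cif.json")):
    """Hypothetical helper: True if each expected file exists, is non-empty, and is valid JSON."""
    for name in names:
        path = Path(results_dir) / name
        if not path.is_file() or path.stat().st_size == 0:
            return False
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            return False
    return True

if __name__ == "__main__":
    # Run after `python scripts/run_bench.py`, ignoring that command's exit code.
    print("bench outputs ok:", bench_outputs_ok())
```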

**Real trajectories** (`bench_results/real_survival.json`) — SWE-bench/SWE-smith-trajectories,
`tool` split, n=400 (301 resolved, 99 unresolved), step axis, mode-A (failure-as-event):

- Kaplan-Meier final survival ≈ **0.049**; median survival ≈ **47 steps**; 48 event times.
- **Censoring-blind logistic vs censoring-aware KM** (the failure-CDF divergence): mean
absolute difference ≈ **0.090**, max absolute difference ≈ **0.155**. This is a difference
between two estimators on one dataset; it is **not** a statement that the METR methodology
is incorrect.
- **Offline-replay** (abort threshold selected on a train split, evaluated on a disjoint
test split of 200 runs; cluster bootstrap over the 58 held-out test repositories, of 108
total): recall ≈ **0.75**, premature-abort rate ≈ **0.43** (95% CI ≈ [0.35, 0.51]),
saved-compute fraction ≈ **0.31**, median lead-time ≈ **17 steps**.
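
The cluster bootstrap behind the interval above resamples whole repositories rather than individual runs, so correlated runs from one repository move together. The sketch below shows that idea with a plain percentile interval, which is a simpler substitute for the BCa interval hazardloop reports. The function name, defaults, and toy data are assumptions for illustration and do not reproduce the numbers above.

```python
import numpy as np

def cluster_bootstrap_ci(values, clusters, stat=np.mean,
                         n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI (not BCa) for stat(values), resampling whole clusters with replacement."""
    values = np.asarray(values, dtype=float)
    clusters = np.asarray(clusters)
    # Group the per-run values by cluster once, up front.
    groups = [values[clusters == c] for c in np.unique(clusters)]
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Draw as many clusters as we have, with replacement, and pool their runs.
        picked = rng.integers(0, len(groups), size=len(groups))
        stats[b] = stat(np.concatenate([groups[i] for i in picked]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Toy data (made up): 1 = the policy aborted a run that would have resolved.
premature = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
repo = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "d", "d", "d"]

ci_low, ci_high = cluster_bootstrap_ci(premature, repo)
print(f"premature-abort rate = {np.mean(premature):.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

Resampling individual runs instead would treat the 12 toy runs as independent and would typically give a narrower interval than the 4 repositories justify.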

**Synthetic competing risks** (`bench_results/synthetic_cif.json`) — typed multi-cause
Aalen-Johansen CIF on the deterministic generator (n=1500): final CIF by cause ≈
wrong_patch 0.44, tool_error 0.26, infinite_loop 0.16, budget_exhausted 0.14; additivity
residual `|ΣCIF − (1−S)|` ≈ 1e-16; Weibull shape β ≈ 1.22.
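
To read the shape value: for a standard two-parameter Weibull with shape β and scale λ, the hazard is `h(t) = (β/λ) · (t/λ)^(β−1)`, which rises with t when β > 1 and falls when β < 1. The sketch below evaluates that formula. The scale of 40 steps and the comparison shape of 0.8 are arbitrary values chosen for illustration, and hazardloop's internal AFT parametrisation is not shown in this README and may differ.

```python
import numpy as np

def weibull_hazard(t, shape, scale=1.0):
    """Hazard of a standard two-parameter Weibull: (shape/scale) * (t/scale)**(shape - 1)."""
    t = np.asarray(t, dtype=float)
    return (shape / scale) * (t / scale) ** (shape - 1.0)

steps = np.array([1.0, 10.0, 50.0])

# shape = 1.22 matches the synthetic benchmark; scale = 40.0 is a made-up value.
rising = weibull_hazard(steps, shape=1.22, scale=40.0)
# shape = 0.8 is a made-up early-mortality case for contrast.
falling = weibull_hazard(steps, shape=0.8, scale=40.0)

print("beta=1.22 hazard increasing:", bool(np.all(np.diff(rising) > 0)))
print("beta=0.80 hazard decreasing:", bool(np.all(np.diff(falling) < 0)))
```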

> **Honesty note:** the SWE-smith-trajectories source carries only a binary outcome and no
> per-cause failure labels. Therefore the typed competing-risk CIF in this release is synthetic-validation-only;
> the real-trajectory typed CIF is deferred. The real Kaplan-Meier / Nelson-Aalen /
> divergence / replay numbers above are measured on real data.

## Scope and honesty

hazardloop deliberately does **not**:

- claim any agent capability, task-completion, or pass@k gain (offline-replay numbers are
counterfactual estimates on logged runs, not live trials);
- extrapolate tails / return levels (no extreme-value theory — the i.i.d. assumption
breaks for sequential agent failures, arXiv:2511.02927);
- act as a verifier, reward auditor, or formal safety guarantee.

The `fork` decision's rescue rate is unobserved and is **not reported**.

### Relation to prior work

- arXiv:2509.02360 (offline PRM-score replay) — scalar-reward replay without censoring or
typed competing risks; hazardloop's replay consumes the censoring-aware survival core.
- arXiv:2512.03109 (E-valuator) — anytime-valid testing for binary correctness; deferred
and cited here, not re-implemented (`hazardloop compare` is a v0.2 stub).

## License

MIT. See [`LICENSE`](LICENSE) and [`NOTICE`](NOTICE). The GPL survival package named
in NOTICE is **not** a dependency and is excluded by a CI check; KM / Nelson-Aalen /
Aalen-Johansen / Weibull-AFT are re-implemented on numpy.