{"id":50873690,"url":"https://github.com/hinanohart/hazardloop","last_synced_at":"2026-06-15T07:31:23.009Z","repository":{"id":361216747,"uuid":"1253588668","full_name":"hinanohart/hazardloop","owner":"hinanohart","description":"Censoring-aware competing-risk survival analysis for long-horizon LLM-agent trajectories (Kaplan-Meier / Nelson-Aalen / Aalen-Johansen CIF / Weibull AFT + offline-replay). CPU-only.","archived":false,"fork":false,"pushed_at":"2026-06-10T09:26:11.000Z","size":137,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-10T11:07:50.400Z","etag":null,"topics":["aalen-johansen","agent-evaluation","competing-risks","kaplan-meier","llm-agents","survival-analysis","time-to-event"],"latest_commit_sha":null,"homepage":"https://github.com/hinanohart/hazardloop","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hinanohart.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-29T16:06:40.000Z","updated_at":"2026-06-10T09:26:28.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/hinanohart/hazardloop","commit_stats":null,"previous_names":["hinanohart/hazardloop"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/hinanohart/hazardloop","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fhazardloop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fhazardloop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fhazardloop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fhazardloop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hinanohart","download_url":"https://codeload.github.com/hinanohart/hazardloop/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fhazardloop/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34353189,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-15T02:00:07.085Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aalen-johansen","agent-evaluation","competing-risks","kaplan-meier","llm-agents","survival-analysis","time-to-event"],"created_at":"2026-06-15T07:31:20.995Z","updated_at":"2026-06-15T07:31:23.004Z","avatar_url":"https://github.com/hinanohart.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# hazardloop\n\n**Censoring-aware competing-risk survival analysis for long-horizon LLM-agent trajectories.**\n\nSurvival statistics for agent runs: handles **censored** timeouts correctly and separates **competing failure modes** (wrong patch, tool error, infinite loop, budget exhaustion) instead of collapsing them into one binary label.\n\n\u003e **Status: 0.1.0a3 (pre-alpha).** hazardloop is an *estimator plus a fail-closed policy\n\u003e interface*; it is **not a trained controller**. Any benefit to a live agent is\n\u003e **UNVERIFIED** in this release — hazardloop never runs, retrains, or re-executes agents.\n\u003e See [`docs/CLAIMS.md`](docs/CLAIMS.md) for the exact CLAIM / NON-CLAIM boundary.\n\n## Why\n\nA METR-style time horizon fits a logistic curve to binary success against task length.\nTwo runs that both \"failed\" are treated identically even if one hit a wall-clock timeout\n(we never observed whether it would have succeeded — that is *censoring*) and the other\nsubmitted a wrong patch (an observed terminal event). And a single failure label hides\n*which* terminal event occurred. Biostatistics solved exactly this shape — time-to-event\nwith right-censoring and competing risks — decades ago. hazardloop applies it to agent\nruns, and quantifies how much the censoring-blind and censoring-aware pictures disagree.\n\n## Install\n\nNot yet on PyPI (a pre-alpha; PyPI upload deferred) — install from source:\n\n```bash\npip install \"git+https://github.com/hinanohart/hazardloop\"          # core: numpy, scipy, typer\npip install \"hazardloop[data] @ git+https://github.com/hinanohart/hazardloop\"  # + HF dataset adapter\npip install \"hazardloop[test] @ git+https://github.com/hinanohart/hazardloop\"  # + pytest/hypothesis/lifelines\n```\n\n## Quickstart\n\n```python\nfrom hazardloop import fit_survival, ReplayEvaluator, EventModel\nfrom hazardloop.adapters.mock import synthetic_competing_risks\n\nrecords = synthetic_competing_risks(1500, seed=0)        # deterministic synthetic runs\nreport = fit_survival(records)                            # KM + Nelson-Aalen + Aalen-Johansen CIF + Weibull\nprint(report.weibull.shape)                              # β \u003e 1 =\u003e wear-out, β \u003c 1 =\u003e early-mortality\nprint(sorted(report.cif.cif_by_cause))                   # competing causes\n\nreplay = ReplayEvaluator(EventModel.failure_as_event(), seed=0).evaluate(records)\nprint(replay.test_metrics.recall, replay.test_metrics.saved_compute_fraction_all)\n# values above are illustrative (synthetic, seed=0); real measured numbers are below.\n```\n\n```bash\nhazardloop fit    --backend mock                 # fit and summarise\nhazardloop replay --backend mock                 # offline-replay decision quality\nhazardloop fit    --backend swe-smith --limit 400  # real SWE-smith-trajectories (needs [data])\nhazardloop --help\n```\n\n## How it works\n\n1. **Ingest** — agent run logs are normalised into `SurvivalRecord` objects. Each record carries:\n   - a step count\n   - a terminal mode (resolved, wrong-patch, tool-error, infinite-loop, budget-exhausted, or timeout/censored)\n   - an optional cluster identifier (model / harness / repo)\n\n2. **Risk-table** — `core/_risksets.py` builds the at-risk set at each observed event time, handling right-censoring correctly. This is the shared foundation for all estimators.\n\n3. **Estimators** — four estimators run on the same risk table:\n   - **Kaplan-Meier** (`core/kaplan_meier.py`) — survival function S(t) with Greenwood variance and complementary-log-log confidence intervals.\n   - **Nelson-Aalen** (`core/nelson_aalen.py`) — cumulative hazard H(t); the per-step hazard increment is what the abort policy reads.\n   - **Aalen-Johansen** (`core/aalen_johansen.py`) — competing-risk cumulative incidence function (CIF) per cause; avoids the naive `1 − KM_c` over-estimation.\n   - **Weibull AFT** (`core/weibull_aft.py`) — shape parameter β; β \u003c 1 indicates early-mortality (most failures happen early), β \u003e 1 wear-out (hazard rises with step count).\n\n4. **Bootstrap** (`core/bootstrap.py`) — BCa cluster bootstrap; clusters are defined by model / harness / repository so that correlated runs don't inflate precision.\n\n5. **Offline-replay** (`replay.py`) — a `HazardThresholdPolicy` is fitted on a train split and evaluated on a disjoint test split. Reported metrics: recall, premature-abort rate, saved-compute fraction, and median lead-time to failure. The abort threshold is never chosen on the test set (hindsight violation R5).\n\n6. **CLI** (`cli.py`) — `hazardloop fit` and `hazardloop replay` wrap the above for command-line use with optional backend selection (`mock`, `swe-smith`).\n\n## Architecture\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"docs/architecture.png\" alt=\"hazardloop architecture\" width=\"840\"\u003e\n\u003c/div\u003e\n\n## What it computes\n\n| Estimator | Module | Notes |\n|---|---|---|\n| Kaplan-Meier survival + Greenwood + cloglog CI | `core.kaplan_meier` | from scratch (numpy) |\n| Nelson-Aalen cumulative hazard | `core.nelson_aalen` | per-step instantaneous hazard for control |\n| Aalen-Johansen competing-risk CIF | `core.aalen_johansen` | avoids the naive `1 - KM_c` over-estimation |\n| Weibull AFT shape β | `core.weibull_aft` | β\u003c1 early-mortality, β\u003e1 wear-out |\n| Cluster bootstrap CI (BCa) | `core.bootstrap` | default; cluster = model / harness / repo |\n| Offline-replay decision quality | `replay` | lead-time, premature-abort, saved-compute, per-cause PR |\n\n## Measured results\n\nAll numbers below are generated by `scripts/run_bench.py` into\n[`bench_results/`](bench_results/) and are the single source of truth (no hand-written\nnumbers). Reproduce with:\n\n```bash\npython scripts/run_bench.py\n```\n\nNote: the script exits with a non-zero code because the `datasets` streaming loader\nsegfaults at interpreter teardown (a known pyarrow/GIL issue) *after* both JSON files are\nwritten. Verify success by checking that `bench_results/real_survival.json` and\n`bench_results/synthetic_cif.json` exist and are non-empty, not by the exit code.\n\n**Real trajectories** (`bench_results/real_survival.json`) — SWE-bench/SWE-smith-trajectories,\n`tool` split, n=400 (301 resolved, 99 unresolved), step axis, mode-A (failure-as-event):\n\n- Kaplan-Meier final survival ≈ **0.049**; median survival ≈ **47 steps**; 48 event times.\n- **Censoring-blind logistic vs censoring-aware KM** (the failure-CDF divergence): mean\n  absolute ≈ **0.090**, max absolute ≈ **0.155**. This is a difference between two\n  estimators on one dataset; it is **not** a statement that the METR methodology is\n  incorrect.\n- **Offline-replay** (abort threshold selected on a train split, evaluated on a disjoint\n  test split of 200 runs; cluster bootstrap over the 58 held-out test repositories, of 108\n  total): recall ≈ **0.75**, premature-abort rate ≈ **0.43** (95% CI ≈ [0.35, 0.51]),\n  saved-compute fraction ≈ **0.31**, median lead-time ≈ **17 steps**.\n\n**Synthetic competing risks** (`bench_results/synthetic_cif.json`) — typed multi-cause\nAalen-Johansen CIF on the deterministic generator (n=1500): final CIF by cause ≈\nwrong_patch 0.44, tool_error 0.26, infinite_loop 0.16, budget_exhausted 0.14; additivity\nresidual `|ΣCIF − (1−S)|` ≈ 1e-16; Weibull shape β ≈ 1.22.\n\n\u003e **Honesty note:** the SWE-smith-trajectories source carries only a binary outcome and no\n\u003e per-cause failure labels. Therefore the typed competing-risk CIF in this release is synthetic-validation-only;\n\u003e the real-trajectory typed CIF is deferred. The real Kaplan-Meier / Nelson-Aalen /\n\u003e divergence / replay numbers above are measured on real data.\n\n## Scope and honesty\n\nhazardloop deliberately does **not**:\n\n- claim any agent capability, task-completion, or pass@k gain (offline-replay numbers are\n  counterfactual estimates on logged runs, not live trials);\n- extrapolate tails / return levels (no extreme-value theory — the i.i.d. assumption\n  breaks for sequential agent failures, arXiv:2511.02927);\n- act as a verifier, reward auditor, or formal safety guarantee.\n\nThe `fork` decision's rescue rate is unobserved and is **not reported**.\n\n### Relation to prior work\n\n- arXiv:2509.02360 (offline PRM-score replay) — scalar-reward replay without censoring or\n  typed competing risks; hazardloop's replay consumes the censoring-aware survival core.\n- arXiv:2512.03109 (E-valuator) — anytime-valid testing for binary correctness; deferred\n  and cited here, not re-implemented (`hazardloop compare` is a v0.2 stub).\n\n## License\n\nMIT. See [`LICENSE`](LICENSE) and [`NOTICE`](NOTICE). The GPL survival package named\nin NOTICE is **not** a dependency and is excluded by a CI check; KM / Nelson-Aalen /\nAalen-Johansen / Weibull-AFT are re-implemented on numpy.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhinanohart%2Fhazardloop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhinanohart%2Fhazardloop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhinanohart%2Fhazardloop/lists"}