{"id":51317594,"url":"https://github.com/wbopan/retro-harness","last_synced_at":"2026-07-01T09:01:27.256Z","repository":{"id":363756532,"uuid":"1261698932","full_name":"wbopan/retro-harness","owner":"wbopan","description":"RHO: Evolving Agents in the Dark — Retrospective Harness Optimization via Self-Preference. Improving LLM agents from unlabeled past trajectories (arXiv:2606.05922).","archived":false,"fork":false,"pushed_at":"2026-06-10T07:09:22.000Z","size":2733,"stargazers_count":14,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-10T09:07:16.770Z","etag":null,"topics":["agent-optimization","llm","llm-agents","prompt-optimization","research","self-supervised","swe-bench"],"latest_commit_sha":null,"homepage":"https://paper-rho.wenbo.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wbopan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-07T03:22:10.000Z","updated_at":"2026-06-10T07:09:25.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/wbopan/retro-harness","commit_stats":null,"previous_names":["wbopan/retro-harness"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/wbopan/retro-harness","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wbopan%2Fretro-harness","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wbopan%2Fretro-harness/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wbopan%2Fretro-harness/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wbopan%2Fretro-harness/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wbopan","download_url":"https://codeload.github.com/wbopan/retro-harness/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wbopan%2Fretro-harness/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34999792,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-07-01T02:00:05.325Z","response_time":130,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-optimization","llm","llm-agents","prompt-optimization","research","self-supervised","swe-bench"],"created_at":"2026-07-01T09:01:26.283Z","updated_at":"2026-07-01T09:01:27.222Z","avatar_url":"https://github.com/wbopan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Evolving Agents in the Dark\n\n**Retrospective Harness Optimization (RHO) via Self-Preference**\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://arxiv.org/abs/2606.05922\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2606.05922-b31b1b.svg?style=for-the-badge\" alt=\"arXiv\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://paper-rho.wenbo.io\"\u003e\u003cimg src=\"https://img.shields.io/badge/Project-Page-2F4F6F.svg?style=for-the-badge\" alt=\"Project page\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.wenbo.io/blog/rho-dynamic-workflow/\"\u003e\u003cimg src=\"https://img.shields.io/badge/Blog-Post-4A6B8A.svg?style=for-the-badge\" alt=\"Blog post\"\u003e\u003c/a\u003e\n  \u003ca href=\"LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-MIT-green.svg?style=for-the-badge\" alt=\"License: MIT\"\u003e\u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/badge/python-3.11%2B-blue.svg?style=for-the-badge\" alt=\"Python 3.11+\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"site/static/fig2-pipeline.png\" width=\"100%\" alt=\"The RHO pipeline\"\u003e\n\u003c/p\u003e\n\n\u003e **TL;DR** — AI agents rely on a *harness* of skills, tools, and workflows to solve complex tasks.\n\u003e **RHO** improves that harness **without any ground-truth labels or validation set** — it learns purely\n\u003e from the agent's own past trajectories. A single retrospective pass lifts SWE-Bench Pro pass rate\n\u003e from **59% → 78%**.\n\u003e\n\u003e Read the story behind RHO and dynamic workflows in the [blog post](https://www.wenbo.io/blog/rho-dynamic-workflow/) ([中文版](https://www.wenbo.io/blog/zh/rho-dynamic-workflow/)).\n\n---\n\n## ⚡ Run it on your own projects\n\n| Agent | One line to try | |\n| :-- | :-- | :-- |\n| \u003cimg src=\"https://cdn.simpleicons.org/claude\" width=\"18\" alt=\"Claude\"\u003e \u0026nbsp;**Claude Code** | Paste as a prompt:\u003cbr\u003e`Run the workflow at https://raw.githubusercontent.com/wbopan/retro-harness/main/.claude/workflows/retrospection.js on this project` | **Dynamic workflow, plug-and-play on the project you're in. Recommended.** |\n| \u003cimg src=\"https://avatars.githubusercontent.com/u/14957082?s=36\" width=\"18\" alt=\"OpenAI\"\u003e \u0026nbsp;**Codex CLI** | `curl -fsSLO https://raw.githubusercontent.com/wbopan/retro-harness/main/codex/retrospection.py \u0026\u0026 python3 retrospection.py` | Stdlib-only orchestrator over `codex exec` — the same cycle on your `AGENTS.md` + skills. |\n| \u003cimg src=\"https://cdn.simpleicons.org/gnubash\" width=\"18\" alt=\"CLI\"\u003e \u0026nbsp;**This repo** | `git clone https://github.com/wbopan/retro-harness \u0026\u0026 cd retro-harness \u0026\u0026 uv sync \u0026\u0026 uv run rho evolve --dataset locomo:data/locomo10.json --rounds 1` | Used to reproduce our results, for research purposes. |\n\nBoth one-liners mine the sessions you have **already accumulated** in that project, diagnose recurring\nfailures, and evolve the agent's persistent harness (`CLAUDE.md` / auto-memory / scripts, or\n`AGENTS.md` / skills) — applying an update only when the agent's own pairwise self-preference favors\nit. Details: [Retrospection on Claude Code](#retrospection-try-rho-on-your-own-claude-code-projects)\n· [`codex/retrospection.py`](codex/retrospection.py).\n\n## What is RHO?\n\nMost harness-optimization methods (prompt optimization, skill/tool synthesis, agent search) iterate\nagainst a *labeled validation set*. In real deployments such labels are expensive or impossible to\ncollect — but a deployed agent continuously produces a rich stream of **unlabeled trajectories**.\n\nRHO turns those trajectories into harness improvements with **no external grading**, in three stages:\n\n1. **Coreset Selection** — pick a small, difficulty-diverse subset of past tasks with a determinantal\n   point process (DPP).\n2. **Group Rollout** — re-solve each coreset task *G* times in parallel, then extract two\n   label-free diagnostic signals: **self-validation** (within a trajectory) and **self-consistency**\n   (across parallel trajectories).\n3. **Harness Proposal** — sample *N* candidate harness edits and keep the one whose rollouts are most\n   preferred by the agent's own **pairwise self-preference**.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"site/static/fig1-rho-comparison.png\" width=\"55%\" alt=\"Validation-based vs retrospective optimization\"\u003e\n\u003c/p\u003e\n\n## Results\n\nHeld-out pass rate after a single optimization round (Codex + GPT-5.5), versus feedback-free baselines\nthat operate under the same agent-call budget:\n\n| Method | Harness surface | SWE-Bench Pro | Terminal-Bench 2 | GAIA-2 |\n| :-- | :-- | :--: | :--: | :--: |\n| Vanilla Codex | — | 0.59 | 0.71 | 0.29 |\n| Dynamic Cheatsheet | Skills | 0.62\u0026nbsp;(+0.03) | 0.73\u0026nbsp;(+0.02) | 0.30\u0026nbsp;(+0.01) |\n| ReasoningBank | Memory | 0.61\u0026nbsp;(+0.02) | 0.73\u0026nbsp;(+0.02) | 0.28\u0026nbsp;(−0.01) |\n| Sleep-time Compute | Memory | 0.64\u0026nbsp;(+0.05) | 0.73\u0026nbsp;(+0.02) | 0.32\u0026nbsp;(+0.03) |\n| **RHO (ours)** | **Skills + Tools** | **0.78\u0026nbsp;(+0.19)** | **0.76\u0026nbsp;(+0.05)** | **0.37\u0026nbsp;(+0.08)** |\n\nRHO also surpasses **Meta-Harness**, a *validation-feedback* optimizer, at a matched single-round budget\n(0.78 vs 0.62 on SWE-Bench Pro) — without ever touching ground-truth labels.\n\n## Install\n\nThe project uses [`uv`](https://docs.astral.sh/uv/).\n\n```bash\ngit clone https://github.com/wbopan/retro-harness.git\ncd retro-harness\nuv sync                       # core dependencies\nuv sync --extra swebench-pro  # + a dataset extra you want to run\n```\n\nRHO drives the [Codex CLI](https://github.com/openai/codex) as its base agent. Point it at a model\nbackend by copying a config from [`configs/`](configs/) (e.g. `configs/codex.chatgpt-default.toml`)\nand passing it via `--codex-config`.\n\n## Quickstart\n\n```bash\n# Run one retrospective optimization round on a dataset's trajectory split,\n# then grade the winning harness on the held-out split.\nuv run rho evolve \\\n  --dataset locomo:data/locomo10.json \\\n  --rounds 1 \\\n  --codex-config configs/codex.chatgpt-default.toml\n\n# Solve a single task with a given harness\nuv run rho solve --dataset \u003cds\u003e --task \u003cid\u003e --harness \u003cdir\u003e --run-dir runs/demo\n\n# Browse runs (prompts, completions, trajectories, harness diffs) in a web UI\nuv run rho ui\n```\n\nEvery run persists prompts, completions, trajectories, diagnoses, candidate harnesses, harness diffs,\nconfigs, scores, and held-out reports under `runs/\u003ctimestamp\u003e-\u003cdataset\u003e/`. See the full command\nreference in [`docs/cli-help.md`](docs/cli-help.md).\n\n## Repository layout\n\n```\nsrc/rho/\n├── loop.py            # the RHO evolution loop (select → rollout → propose)\n├── protocols.py       # typing.Protocol interfaces (Dataset, Harness, Task, TrajectoryStore, …)\n├── selection/         # coreset selection (DPP, coverage, difficulty)\n├── strategies/        # harness-proposal strategies + feedback-free baselines\n├── orchestrators/     # solve / group-rollout orchestration\n├── datasets/          # SWE-Bench Pro, Terminal-Bench 2, GAIA-2, LOCOMO loaders\n├── reasoningbank/     # ReasoningBank baseline\n├── meta_harness/      # Meta-Harness (validation-feedback) baseline\n└── stores/            # trajectory + harness stores\nconfigs/               # Codex CLI backend configs\nscripts/               # figure-building \u0026 analysis scripts\nwebui/                 # run-browser frontend\ntests/                 # hermetic + real-agent end-to-end tests\n.claude/workflows/\n└── retrospection.js   # RHO as a Claude Code dynamic workflow (see below)\ncodex/\n└── retrospection.py   # RHO over `codex exec` for Codex CLI users (stdlib-only)\n```\n\nImplementations are decoupled behind `typing.Protocol` so components (selectors, strategies, datasets,\nagents) can be swapped for ablations.\n\n## Retrospection: try RHO on your own Claude Code projects\n\n[`.claude/workflows/retrospection.js`](.claude/workflows/retrospection.js) packages the paper's method\nas a single [Claude Code dynamic workflow](https://code.claude.com/docs/en/workflows) that evolves the\nharness Claude Code natively exposes — your project's `CLAUDE.md`, its auto-memory directory, and\nhelper scripts — using only the session transcripts you have already accumulated. No labels, no\nvalidation set, no benchmark: the trajectories are your own past sessions.\n\nOne run is one retrospection cycle (≈40 agents, well under the 1,000-agent cap):\n\n1. **Bootstrap** — locate the project's transcripts (`~/.claude/projects/\u003cslug\u003e/*.jsonl`, including\n   worktree sessions) and snapshot the current harness *h₀*.\n2. **Digest** — parallel agents summarize past sessions into difficulty scores + task fingerprints\n   (the paper's LLM judge).\n3. **Coreset** — plain-JS greedy MAP on the paper's DPP kernel `L = diag(r)·S·diag(r)` (Jaccard\n   fingerprint kernel, same `θ` trade-off). Similar sessions are grouped so the diagnoser can recover\n   **self-consistency** across them; singletons fall back to validation-only diagnosis.\n4. **Diagnose** — **self-validation** + cross-session **self-consistency**, producing severity-weighted,\n   task-agnostic improvement directions.\n5. **Optimize** — *N* independent candidate harnesses, staged outside the working tree.\n6. **Probe \u0026 select** — replayable past tasks are re-attempted under each candidate in isolated\n   worktrees; **pairwise self-preference** scores the fresh trajectory against the original session.\n   The winner is applied only if its mean score is positive, with a full backup first.\n\nUsage — copy the file into a project's `.claude/workflows/` (or `~/.claude/workflows/` for all\nprojects), then in Claude Code:\n\n```\n/retrospection\n```\n\nor target another project / override knobs via args:\n\n```js\n{ projectDir: \"/path/to/project\",  // default: current project\n  model: \"opus\",                   // default: session model\n  k: 8,                            // coreset size (paper: 10)\n  n: 2,                            // candidate harnesses (paper: 3)\n  probes: 4,                       // self-preference probe tasks\n  maxSessions: 36, theta: 0.7,     // DPP difficulty/diversity trade-off\n  apply: true }                    // false = stage the winner, don't touch live files\n```\n\nEvery cycle persists its artifacts (digests, diagnoses, candidates, probe trajectories, scores,\n`report.md`, and a `backup/` of the pre-apply harness) under `~/.claude/rho-runs/\u003ctimestamp\u003e-\u003cproject\u003e/`.\nRe-running the command later is the next evolution round — the harness keeps learning from whatever\nreal sessions you accumulate in between.\n\n### Codex CLI variant\n\n[`codex/retrospection.py`](codex/retrospection.py) runs the same cycle for [Codex CLI](https://github.com/openai/codex)\nusers — a single stdlib-only Python file orchestrating parallel `codex exec` subprocesses. The mapping\ndiffers only in what the native harness is:\n\n- **Trajectories** come from Codex's rollout store (`~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl`),\n  filtered to the target project via each rollout's `session_meta.cwd`.\n- **Harness** = `AGENTS.md` (kept lean — Codex caps combined project docs at 32 KB) +\n  `.agents/skills/*/SKILL.md` (Codex's persistent knowledge units, the analog of auto-memory) +\n  helper scripts.\n- **Structured stages** (digest / diagnose / score) use `codex exec --output-schema`; probes run in\n  git worktrees with the candidate harness materialized inside, so Codex loads it natively.\n- All orchestration calls run `--ephemeral` (they never enter the session store, so a later cycle\n  can't mine its own machinery) with the experimental memories feature disabled.\n\n```bash\npython3 codex/retrospection.py --dry-run            # list the sessions it would mine\npython3 codex/retrospection.py                      # one cycle on the current project\npython3 codex/retrospection.py --project ~/my/app \\\n    --model gpt-5.5 --n 2 --probes 4 --no-apply     # stage the winner without touching live files\n```\n\n## Citation\n\n```bibtex\n@article{pan2026rho,\n  title   = {Evolving Agents in the Dark: Retrospective Harness Optimization via Self-Preference},\n  author  = {Pan, Wenbo and Liu, Shujie and Lin, Chin-Yew and Zeng, Jingying and Tang, Xianfeng and Zhou, Xiangyang and Lu, Yan and Jia, Xiaohua},\n  journal = {arXiv preprint arXiv:2606.05922},\n  year    = {2026}\n}\n```\n\n## License\n\nReleased under the [MIT License](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwbopan%2Fretro-harness","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwbopan%2Fretro-harness","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwbopan%2Fretro-harness/lists"}