https://github.com/raphaelchristi/harness-evolver
# Harness Evolver


npm · License: MIT · Paper · Built by Raphael Valdetaro

**LangSmith-native autonomous agent optimization.** Point at any LLM agent codebase, and Harness Evolver will evolve it — prompts, routing, tools, architecture — using multi-agent evolution with LangSmith as the evaluation backend.

Inspired by [Meta-Harness](https://yoonholee.com/meta-harness/) (Lee et al., 2026). The scaffolding around an LLM can produce a [6x performance gap](https://arxiv.org/abs/2603.28052) on the same benchmark. This plugin automates the search for better scaffolding.

---

## Install

### Claude Code Plugin (recommended)

```
/plugin marketplace add raphaelchristi/harness-evolver-marketplace
/plugin install harness-evolver
```

Updates are automatic. Python dependencies (`langsmith`, `langsmith-cli`) are installed on first session start via a SessionStart hook.

### npx (first-time setup or non-Claude Code runtimes)

```bash
npx harness-evolver@latest
```

An interactive installer that configures your LangSmith API key, creates a Python venv, and installs all dependencies. Works with Claude Code, Cursor, Codex, and Windsurf.

> **Both install paths work together.** Use npx for initial setup (API key, venv), then the plugin marketplace handles updates automatically.

---

## Quick Start

```bash
cd my-llm-project
export LANGSMITH_API_KEY="lsv2_pt_..."
claude

/evolver:setup # explores project, configures LangSmith
/evolver:evolve # runs the optimization loop
/evolver:status # check progress
/evolver:deploy # tag, push, finalize
```

---

## How It Works

### LangSmith-Native
No custom eval scripts or task files. Uses LangSmith Datasets for test inputs, Experiments for results, and an agent-based LLM-as-judge for scoring via langsmith-cli. No external API keys needed. Everything is visible in the LangSmith UI.

### Real Code Evolution
Proposers modify your actual agent code — not a wrapper. Each candidate works in an isolated git worktree. Winners are merged automatically.
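The worktree-per-candidate flow can be sketched as follows. This is a hypothetical illustration, assuming a `.evolver/` worktree directory and `evolver/<name>` branch naming — not the plugin's actual layout:

```python
import pathlib
import subprocess

def _git(*args, cwd):
    subprocess.run(("git", *args), cwd=cwd, check=True, capture_output=True)

def spawn_candidate(repo, name):
    """Create an isolated worktree (and branch) for one candidate proposer."""
    wt = pathlib.Path(repo) / ".evolver" / name
    _git("worktree", "add", str(wt), "-b", f"evolver/{name}", cwd=repo)
    return wt

def merge_winner(repo, name):
    """Merge the winning candidate's branch back and clean up its worktree."""
    wt = pathlib.Path(repo) / ".evolver" / name
    _git("merge", f"evolver/{name}", cwd=repo)
    _git("worktree", "remove", "--force", str(wt), cwd=repo)
```

Worktrees give each proposer a full checkout without cloning, so candidates can be edited and evaluated in parallel against the same repository.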

### Self-Organizing Proposers
Each iteration generates dynamic investigation lenses from failure data, architecture analysis, production traces, and evolution memory. Proposers self-organize their approach — no fixed strategies. They can self-abstain when their contribution would be redundant. Inspired by Dochkina (2026).

### Agent-Based Evaluation
The evaluator agent reads experiment outputs via langsmith-cli, judges correctness using the same Claude model powering the other agents, and writes scores back. No OpenAI API key or openevals dependency needed.

### Production Traces
Auto-discovers existing LangSmith production projects. Uses real user inputs for test generation and real error patterns for targeted optimization.

### Active Critic
Auto-triggers when scores jump suspiciously fast. Detects evaluator gaming AND implements stricter evaluators to close loopholes.
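The "suspiciously fast" trigger could be as simple as a per-iteration delta check. A minimal sketch, where the 0.15 threshold is an illustrative assumption rather than the plugin's actual value:

```python
def suspicious_jumps(scores, max_delta=0.15):
    """Return iteration indices where the score rose faster than plausible.

    A jump larger than max_delta between consecutive iterations suggests the
    candidate may be gaming the evaluator rather than genuinely improving.
    """
    return [i for i in range(1, len(scores))
            if scores[i] - scores[i - 1] > max_delta]
```

Any flagged index would then hand off to the critic agent for a closer look at the winning candidate's diff and evaluator outputs.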

### ULTRAPLAN Architect
Auto-triggers on stagnation. Runs with Opus model for deep architectural analysis. Recommends topology changes (single-call to RAG, chain to ReAct, etc.).

### Evolution Memory
Cross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which approaches win, which failures recur, and promotes insights after 2+ occurrences.
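The "promote after 2+ occurrences" rule amounts to counting recurring patterns across iterations. A minimal sketch, with illustrative tag names and threshold (not the plugin's actual schema):

```python
from collections import Counter

def promote_insights(failure_tags, promote_after=2):
    """Promote failure patterns seen promote_after+ times into persistent insights.

    failure_tags is a flat list of tags observed across iterations, e.g.
    ["timeout", "bad_parse", "timeout"]; recurring tags become insights.
    """
    counts = Counter(failure_tags)
    return [tag for tag, n in counts.items() if n >= promote_after]
```

Promoted insights would then be fed into the shared proposer context so later iterations stop rediscovering the same failure mode.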

### Smart Gating
Three-gate iteration triggers (score plateau, cost budget, convergence detection) replace blind N-iteration loops. State validation ensures config hasn't diverged from LangSmith.
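A three-gate check of this shape might look like the following sketch. All thresholds and return values are illustrative assumptions, not the plugin's implementation:

```python
def should_continue(scores, cost_so_far, budget,
                    plateau_window=3, min_gain=0.01, ceiling=0.99):
    """Decide whether to run another evolution iteration.

    Gate 1: cost budget exhausted.
    Gate 2: score plateau (no meaningful gain over the last window).
    Gate 3: convergence (score effectively at ceiling).
    """
    if cost_so_far >= budget:
        return False, "cost budget exhausted"
    if len(scores) > plateau_window:
        gain = scores[-1] - scores[-1 - plateau_window]
        if gain < min_gain:
            return False, "score plateau"
    if scores and scores[-1] >= ceiling:
        return False, "converged"
    return True, "continue"
```

Compared with a blind N-iteration loop, each gate stops the run for a distinct, reportable reason.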

### Background Mode
Run all iterations in background while you continue working. Get notified on completion or significant improvements.

---

## Commands

| Command | What it does |
|---|---|
| `/evolver:setup` | Explore project, configure LangSmith (dataset, evaluators), run baseline |
| `/evolver:evolve` | Run the optimization loop (dynamic self-organizing proposers in worktrees) |
| `/evolver:status` | Show progress, scores, history |
| `/evolver:deploy` | Tag, push, clean up temporary files |

---

## Agents

| Agent | Role | Color |
|---|---|---|
| **Proposer** | Self-organizing — investigates a data-driven lens, decides own approach, may abstain | Green |
| **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
| **Architect** | ULTRAPLAN mode — deep topology analysis with Opus model | Blue |
| **Critic** | Active — detects gaming AND implements stricter evaluators | Red |
| **Consolidator** | Cross-iteration memory consolidation (autoDream-inspired) | Cyan |
| **TestGen** | Generates test inputs + adversarial injection mode | Cyan |

---

## Evolution Loop

```
/evolver:evolve
|
+- 0.5 Validate state (skeptical memory — check .evolver.json vs LangSmith)
+- 1. Read state (.evolver.json + LangSmith experiments)
+- 1.5 Gather trace insights (cluster errors, tokens, latency)
+- 1.8 Analyze per-task failures
+- 1.8a Synthesize strategy document + dynamic lenses (investigation questions)
+- 1.9 Prepare shared proposer context (KV cache-optimized prefix)
+- 2. Spawn N self-organizing proposers in parallel (each in a git worktree)
+- 3. Run target for each candidate (code-based evaluators)
+- 3.5 Spawn evaluator agent (LLM-as-judge via langsmith-cli)
+- 4. Compare experiments -> select winner + per-task champion
+- 5. Merge winning worktree into main branch
+- 5.5 Regression tracking (auto-add guard examples to dataset)
+- 6. Report results
+- 6.2 Consolidate evolution memory (orient/gather/consolidate/prune)
+- 6.5 Auto-trigger Active Critic (detect + fix evaluator gaming)
+- 7. Auto-trigger ULTRAPLAN Architect (opus model, deep analysis)
+- 8. Three-gate check (score plateau, cost budget, convergence)
```

---

## Architecture

```
Plugin hook (SessionStart)
└→ Creates venv, installs langsmith + langsmith-cli, exports env vars

Skills (markdown)
├── /evolver:setup → explores project, runs setup.py
├── /evolver:evolve → orchestrates the evolution loop
├── /evolver:status → reads .evolver.json + LangSmith
└── /evolver:deploy → tags and pushes

Agents (markdown)
├── Proposer (xN) → self-organizing, lens-driven, isolated git worktrees
├── Evaluator → LLM-as-judge via langsmith-cli
├── Critic → detects gaming + implements stricter evaluators
├── Architect → ULTRAPLAN deep analysis (opus model)
├── Consolidator → cross-iteration memory (autoDream-inspired)
└── TestGen → generates test inputs + adversarial injection

Tools (Python + langsmith SDK)
├── setup.py → creates datasets, configures evaluators
├── run_eval.py → runs target against dataset
├── read_results.py → compares experiments
├── trace_insights.py → clusters errors from traces
├── seed_from_traces.py → imports production traces
├── validate_state.py → validates config vs LangSmith state
├── iteration_gate.py → three-gate iteration triggers
├── regression_tracker.py → tracks regressions, adds guard examples
├── consolidate.py → cross-iteration memory consolidation
├── synthesize_strategy.py → generates strategy document + investigation lenses
├── add_evaluator.py → programmatically adds evaluators
└── adversarial_inject.py → detects memorization, injects adversarial tests
```

---

## Requirements

- **LangSmith account** + `LANGSMITH_API_KEY`
- **Python 3.10+**
- **Git** (for worktree-based isolation)
- **Claude Code** (or Cursor/Codex/Windsurf)

Dependencies (`langsmith`, `langsmith-cli`) are installed automatically by the plugin hook or the npx installer.

---

## Framework Support

LangSmith traces **any** AI framework. The evolver works with all of them:

| Framework | LangSmith Tracing |
|---|---|
| LangChain / LangGraph | Auto (env vars only) |
| OpenAI SDK | `wrap_openai()` (2 lines) |
| Anthropic SDK | `wrap_anthropic()` (2 lines) |
| CrewAI / AutoGen | OpenTelemetry (~10 lines) |
| Any Python code | `@traceable` decorator |

---

## References

- [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
- [Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures](https://arxiv.org/abs/2603.28990) — Dochkina, 2026
- [Darwin Gödel Machine](https://sakana.ai/dgm/) — Sakana AI
- [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind
- [LangSmith Evaluation](https://docs.smith.langchain.com/evaluation) — LangChain
- [Traces Start the Agent Improvement Loop](https://www.langchain.com/conceptual-guides/traces-start-agent-improvement-loop) — LangChain

---

## License

MIT