{"id":47725992,"url":"https://github.com/raphaelchristi/harness-evolver","last_synced_at":"2026-04-07T01:00:34.930Z","repository":{"id":348198533,"uuid":"1196848284","full_name":"raphaelchristi/harness-evolver","owner":"raphaelchristi","description":"Automated harness evolution for AI agents. A Claude Code plugin that iteratively optimizes system prompts, routing, retrieval, and orchestration code using full-trace counterfactual diagnosis. Based on Meta-Harness (Lee et al., 2026).","archived":false,"fork":false,"pushed_at":"2026-04-04T01:32:57.000Z","size":1809,"stargazers_count":4,"open_issues_count":0,"forks_count":2,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-04T23:18:41.931Z","etag":null,"topics":["agent-evolution","claude-code-plugin","codex-skills","harness-engineering","meta-harness"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/raphaelchristi.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-31T05:16:24.000Z","updated_at":"2026-04-04T01:33:01.000Z","dependencies_parsed_at":null,"dependency_job_id":"6c1b0856-330f-4fc4-84c0-11d4464630d7","html_url":"https://github.com/raphaelchristi/harness-evolver","commit_stats":null,"previous_names":["raphaelchristi/harness-evolver"],"tags_count":74,"template":false,"template_full_name":null,"purl":"pkg:github/raphaelchristi/harness-evolver","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelchristi%2Fharness-evolver","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelchristi%2Fharness-evolver/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelchristi%2Fharness-evolver/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelchristi%2Fharness-evolver/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/raphaelchristi","download_url":"https://codeload.github.com/raphaelchristi/harness-evolver/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelchristi%2Fharness-evolver/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31454200,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-05T21:22:52.476Z","status":"ssl_error","status_checked_at":"2026-04-05T21:22:51.943Z","response_time":75,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-evolution","claude-code-plugin","codex-skills","harness-engineering","meta-harness"],"created_at":"2026-04-02T20:28:18.919Z","updated_at":"2026-04-06T00:01:00.950Z","avatar_url":"https://github.com/raphaelchristi.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/banner.jpg\" alt=\"Harness Evolver\" width=\"100%\"\u003e\n\u003c/p\u003e\n\n# Harness Evolver\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://www.npmjs.com/package/harness-evolver\"\u003e\u003cimg src=\"https://img.shields.io/npm/v/harness-evolver?style=for-the-badge\u0026color=blueviolet\" alt=\"npm\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/raphaelchristi/harness-evolver/blob/main/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-MIT-green?style=for-the-badge\" alt=\"License: MIT\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://arxiv.org/abs/2603.28052\"\u003e\u003cimg src=\"https://img.shields.io/badge/Paper-Meta--Harness-FFD700?style=for-the-badge\" alt=\"Paper\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/raphaelchristi/harness-evolver\"\u003e\u003cimg src=\"https://img.shields.io/badge/Built%20by-Raphael%20Valdetaro-ff69b4?style=for-the-badge\" alt=\"Built by Raphael Valdetaro\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n**LangSmith-native autonomous agent optimization.** Point at any LLM agent codebase, and Harness Evolver will evolve it — prompts, routing, tools, architecture — using multi-agent evolution with LangSmith as the evaluation backend.\n\nInspired by [Meta-Harness](https://yoonholee.com/meta-harness/) (Lee et al., 2026). The scaffolding around your LLM produces a [6x performance gap](https://arxiv.org/abs/2603.28052) on the same benchmark. This plugin automates the search for better scaffolding.\n\n---\n\n## Install\n\n### Claude Code Plugin (recommended)\n\n```\n/plugin marketplace add raphaelchristi/harness-evolver-marketplace\n/plugin install harness-evolver\n```\n\nUpdates are automatic. Python dependencies (langsmith, langsmith-cli) are installed on first session start via hook.\n\n### npx (first-time setup or non-Claude Code runtimes)\n\n```bash\nnpx harness-evolver@latest\n```\n\nInteractive installer that configures LangSmith API key, creates Python venv, and installs all dependencies. Works with Claude Code, Cursor, Codex, and Windsurf.\n\n\u003e **Both install paths work together.** Use npx for initial setup (API key, venv), then the plugin marketplace handles updates automatically.\n\n---\n\n## Quick Start\n\n```bash\ncd my-llm-project\nexport LANGSMITH_API_KEY=\"lsv2_pt_...\"\nclaude\n\n/evolver:setup      # explores project, configures LangSmith\n/evolver:evolve     # runs the optimization loop\n/evolver:status     # check progress\n/evolver:deploy     # tag, push, finalize\n```\n\n---\n\n## How It Works\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eLangSmith-Native\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003eNo custom eval scripts or task files. Uses LangSmith Datasets for test inputs, Experiments for results, and an agent-based LLM-as-judge for scoring via langsmith-cli. No external API keys needed. Everything is visible in the LangSmith UI.\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eReal Code Evolution\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003eProposers modify your actual agent code — not a wrapper. Each candidate works in an isolated git worktree. Winners are merged automatically.\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eSelf-Organizing Proposers\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003eEach iteration generates dynamic investigation lenses from failure data, architecture analysis, production traces, and evolution memory. Proposers self-organize their approach — no fixed strategies. They can self-abstain when their contribution would be redundant. Inspired by \u003ca href=\"https://arxiv.org/abs/2603.28990\"\u003eDochkina (2026)\u003c/a\u003e.\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eAgent-Based Evaluation\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003eThe evaluator agent reads experiment outputs via langsmith-cli, judges correctness using the same Claude model powering the other agents, and writes scores back. No OpenAI API key or openevals dependency needed.\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eProduction Traces\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003eAuto-discovers existing LangSmith production projects. Uses real user inputs for test generation and real error patterns for targeted optimization.\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eActive Critic\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003eAuto-triggers when scores jump suspiciously fast. Detects evaluator gaming AND implements stricter evaluators to close loopholes.\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eULTRAPLAN Architect\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003eAuto-triggers on stagnation. Runs with Opus model for deep architectural analysis. Recommends topology changes (single-call to RAG, chain to ReAct, etc.).\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eEvolution Memory\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003eCross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which approaches win, which failures recur, and promotes insights after 2+ occurrences.\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eSmart Gating\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003eThree-gate iteration triggers (score plateau, cost budget, convergence detection) replace blind N-iteration loops. State validation ensures config hasn't diverged from LangSmith.\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cb\u003eBackground Mode\u003c/b\u003e\u003c/td\u003e\n\u003ctd\u003eRun all iterations in background while you continue working. Get notified on completion or significant improvements.\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n---\n\n## Commands\n\n| Command | What it does |\n|---|---|\n| `/evolver:setup` | Explore project, configure LangSmith (dataset, evaluators), run baseline |\n| `/evolver:evolve` | Run the optimization loop (dynamic self-organizing proposers in worktrees) |\n| `/evolver:status` | Show progress, scores, history |\n| `/evolver:deploy` | Tag, push, clean up temporary files |\n\n---\n\n## Agents\n\n| Agent | Role | Color |\n|---|---|---|\n| **Proposer** | Self-organizing — investigates a data-driven lens, decides own approach, may abstain | Green |\n| **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |\n| **Architect** | ULTRAPLAN mode — deep topology analysis with Opus model | Blue |\n| **Critic** | Active — detects gaming AND implements stricter evaluators | Red |\n| **Consolidator** | Cross-iteration memory consolidation (autoDream-inspired) | Cyan |\n| **TestGen** | Generates test inputs + adversarial injection mode | Cyan |\n\n---\n\n## Evolution Loop\n\n```\n/evolver:evolve\n  |\n  +- 0.5  Validate state (skeptical memory — check .evolver.json vs LangSmith)\n  +- 1.   Read state (.evolver.json + LangSmith experiments)\n  +- 1.5  Gather trace insights (cluster errors, tokens, latency)\n  +- 1.8  Analyze per-task failures\n  +- 1.8a Synthesize strategy document + dynamic lenses (investigation questions)\n  +- 1.9  Prepare shared proposer context (KV cache-optimized prefix)\n  +- 2.   Spawn N self-organizing proposers in parallel (each in a git worktree)\n  +- 3.   Run target for each candidate (code-based evaluators)\n  +- 3.5  Spawn evaluator agent (LLM-as-judge via langsmith-cli)\n  +- 4.   Compare experiments -\u003e select winner + per-task champion\n  +- 5.   Merge winning worktree into main branch\n  +- 5.5  Regression tracking (auto-add guard examples to dataset)\n  +- 6.   Report results\n  +- 6.2  Consolidate evolution memory (orient/gather/consolidate/prune)\n  +- 6.5  Auto-trigger Active Critic (detect + fix evaluator gaming)\n  +- 7.   Auto-trigger ULTRAPLAN Architect (opus model, deep analysis)\n  +- 8.   Three-gate check (score plateau, cost budget, convergence)\n```\n\n---\n\n## Architecture\n\n```\nPlugin hook (SessionStart)\n  └→ Creates venv, installs langsmith + langsmith-cli, exports env vars\n\nSkills (markdown)\n  ├── /evolver:setup    → explores project, runs setup.py\n  ├── /evolver:evolve   → orchestrates the evolution loop\n  ├── /evolver:status   → reads .evolver.json + LangSmith\n  └── /evolver:deploy   → tags and pushes\n\nAgents (markdown)\n  ├── Proposer (xN)     → self-organizing, lens-driven, isolated git worktrees\n  ├── Evaluator          → LLM-as-judge via langsmith-cli\n  ├── Critic             → detects gaming + implements stricter evaluators\n  ├── Architect          → ULTRAPLAN deep analysis (opus model)\n  ├── Consolidator       → cross-iteration memory (autoDream-inspired)\n  └── TestGen            → generates test inputs + adversarial injection\n\nTools (Python + langsmith SDK)\n  ├── setup.py              → creates datasets, configures evaluators\n  ├── run_eval.py           → runs target against dataset\n  ├── read_results.py       → compares experiments\n  ├── trace_insights.py     → clusters errors from traces\n  ├── seed_from_traces.py   → imports production traces\n  ├── validate_state.py     → validates config vs LangSmith state\n  ├── iteration_gate.py     → three-gate iteration triggers\n  ├── regression_tracker.py → tracks regressions, adds guard examples\n  ├── consolidate.py        → cross-iteration memory consolidation\n  ├── synthesize_strategy.py→ generates strategy document + investigation lenses\n  ├── add_evaluator.py      → programmatically adds evaluators\n  └── adversarial_inject.py → detects memorization, injects adversarial tests\n```\n\n---\n\n## Requirements\n\n- **LangSmith account** + `LANGSMITH_API_KEY`\n- **Python 3.10+**\n- **Git** (for worktree-based isolation)\n- **Claude Code** (or Cursor/Codex/Windsurf)\n\nDependencies (`langsmith`, `langsmith-cli`) are installed automatically by the plugin hook or the npx installer.\n\n---\n\n## Framework Support\n\nLangSmith traces **any** AI framework. The evolver works with all of them:\n\n| Framework | LangSmith Tracing |\n|---|---|\n| LangChain / LangGraph | Auto (env vars only) |\n| OpenAI SDK | `wrap_openai()` (2 lines) |\n| Anthropic SDK | `wrap_anthropic()` (2 lines) |\n| CrewAI / AutoGen | OpenTelemetry (~10 lines) |\n| Any Python code | `@traceable` decorator |\n\n---\n\n## References\n\n- [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) — Lee et al., 2026\n- [Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures](https://arxiv.org/abs/2603.28990) — Dochkina, 2026\n- [Darwin Godel Machine](https://sakana.ai/dgm/) — Sakana AI\n- [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind\n- [LangSmith Evaluation](https://docs.smith.langchain.com/evaluation) — LangChain\n- [Traces Start the Agent Improvement Loop](https://www.langchain.com/conceptual-guides/traces-start-agent-improvement-loop) — LangChain\n\n---\n\n## License\n\nMIT\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraphaelchristi%2Fharness-evolver","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fraphaelchristi%2Fharness-evolver","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraphaelchristi%2Fharness-evolver/lists"}