{"id":49058644,"url":"https://github.com/2389-research/simmer","last_synced_at":"2026-06-28T19:32:19.286Z","repository":{"id":356337205,"uuid":"1199970632","full_name":"2389-research/simmer","owner":"2389-research","description":"Iterative artifact refinement with investigation-first judge board - constructs problem-specific judges that read the code, understand the problem, and propose evidence-based improvements","archived":false,"fork":false,"pushed_at":"2026-05-07T15:24:11.000Z","size":200,"stargazers_count":10,"open_issues_count":1,"forks_count":3,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-07T17:29:59.116Z","etag":null,"topics":["ai-agents","ai-coding","anthropic","artifact","automation","claude","claude-code","claude-code-skills","criteria","evaluator","generator","improvement","iteration","judge","llm","optimization","pipeline","refinement","scoring","workspace"],"latest_commit_sha":null,"homepage":"https://2389.ai/posts/simmer-skill/","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/2389-research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-02T22:41:10.000Z","updated_at":"2026-05-07T15:41:42.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/2389-research/simmer","commit_stats":null,"previous_names":["2389-research/simmer"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/2389-research/simmer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2389-research%2Fsimmer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2389-research%2Fsimmer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2389-research%2Fsimmer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2389-research%2Fsimmer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/2389-research","download_url":"https://codeload.github.com/2389-research/simmer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2389-research%2Fsimmer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34901959,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-28T02:00:05.809Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agents","ai-coding","anthropic","artifact","automation","claude","claude-code","claude-code-skills","criteria","evaluator","generator","improvement","iteration","judge","llm","optimization","pipeline","refinement","scoring","workspace"],"created_at":"2026-04-20T01:00:25.348Z","updated_at":"2026-06-28T19:32:19.280Z","avatar_url":"https://github.com/2389-research.png","language":null,"funding_links":[],"categories":["Plugins"],"sub_categories":["All Plugins"],"readme":"# Simmer\n\nYou wrote a prompt. It works. But is it *good*? Simmer runs your artifact through multiple rounds of criteria-driven refinement — each round, a panel of judges reads your code, understands the problem, and proposes specific improvements.\n\n[Read the story behind Simmer →](https://2389.ai/posts/simmer-skill/)\n\nIterative artifact refinement — take any artifact or workspace and hone it over multiple rounds using criteria-driven feedback.\n\n## Installation\n\n```bash\n/plugin marketplace add 2389-research/claude-plugins\n/plugin install simmer@2389-research\n```\n\n## What This Plugin Provides\n\nOne skill (`simmer`) with four subskills that run the refinement loop:\n\n- **Setup** — identify the artifact (file or workspace), elicit 2-3 quality criteria, determine evaluation method\n- **Generator** — produce an improved version based on the judge's ASI (Actionable Side Information — the single highest-leverage fix)\n- **Judge** — score the candidate 1-10 per criterion, produce the ASI\n- **Reflect** — record the trajectory, track the best candidate across iterations\n\n## Trigger Phrases\n\nSay any of these to start a simmer loop:\n\n\u003e \"simmer this\", \"refine this\", \"hone this\", \"iterate on this\", \"make this better\", \"improve this over a few rounds\", \"polish this\", \"tighten this up\", \"optimize this pipeline\", \"find the best model for this task\", \"tune this configuration\"\n\nAny request to **iteratively improve** an artifact triggers simmer.\n\n## When to Use Simmer vs. Cookoff vs. Omakase-Off\n\n| Situation | Use | Why |\n|-----------|-----|-----|\n| You have an artifact and want it better | **simmer** | Iterative refinement against criteria |\n| You want parallel code implementations competing | **cookoff** | Parallel generation + fixed criteria + pick winner |\n| You don't know what you want yet | **omakase-off** | Parallel designs + react + discover preference |\n\n**Rule of thumb:** Simmer = serial iteration. Cookoff/omakase-off = parallel exploration.\n\n## Quick Example: Single File\n\n```\nUser: \"Simmer this\" [pastes a pitch email]\n\nClaude: I'm using the simmer skill to set up iterative refinement.\n\nWhat does \"better\" mean for this? I'd suggest:\n1. Value prop clarity\n2. Tone match\n3. Call to action strength\n\nUser: \"Those work\"\n\n[Iteration 0: Judge scores seed — 4.0/10]\n[Iteration 1: Generator fixes value prop → 5.3/10]\n[Iteration 2: Generator fixes CTA → 6.3/10]\n[Iteration 3: Generator fixes tone → 7.3/10]\n\nTrajectory:\n| Iter | Value Prop | Tone | CTA | Composite | Key Change |\n|------|-----------|------|-----|-----------|------------|\n| 0    | 4         | 5    | 3   | 4.0       | seed       |\n| 1    | 7         | 5    | 4   | 5.3       | specific problem statement |\n| 2    | 7         | 6    | 6   | 6.3       | low-friction CTA |\n| 3    | 7         | 7    | 8   | 7.3       | peer-sharing tone |\n\nBest candidate: iteration 3 (7.3/10)\n3 iterations complete. Run 3 more?\n```\n\n## Quick Example: Workspace with Evaluator\n\n```\nUser: \"Simmer this pipeline — find the best setup for this task\"\n\nClaude inspects the workspace, finds evaluate.sh and config.json,\nand proposes criteria + evaluation method:\n\n  Evaluator: ./evaluate.sh\n  Criteria: accuracy, cost efficiency, latency\n  Search space: models, prompt text, pipeline topology\n\nUser: \"Looks good, coverage is the priority. 5 iterations.\"\n\n[Iteration 0: Run evaluator on seed, judge scores — 3.7/10]\n[Iteration 1: Generator swaps to cheaper model → 5.3/10]\n[Iteration 2: Generator splits into 2-step chain → 7.0/10]\n[Iteration 3: Generator adds few-shot examples → 7.7/10]\n...\n\nBest candidate: iteration 4 (8.1/10)\n```\n\n## Works On Anything\n\n| Artifact type | Suggested criteria |\n|---|---|\n| Document / spec | clarity, completeness, actionability |\n| Creative writing | narrative tension, specificity, voice consistency |\n| Email / comms | value prop clarity, tone match, call to action strength |\n| Prompt / instructions | instruction precision, output predictability, edge case coverage |\n| API design | contract completeness, developer ergonomics, consistency |\n| Pipeline / workflow | coverage, efficiency, noise |\n| Configuration / infra | correctness, resource efficiency, maintainability |\n\n## Evaluation Modes\n\n| Mode | When to use |\n|------|------------|\n| **Judge-only** (default) | Text artifacts — judge scores against criteria |\n| **Runnable** | Code/pipelines — judge interprets script output |\n| **Hybrid** | Both — run script AND judge results against criteria |\n\nNo format contract on evaluator output. The judge reads whatever your script produces — test results, metrics, error logs, anything.\n\n## Judge Board\n\nSimmer auto-selects between a single judge and a multi-judge board based on complexity:\n\n- **Simple** (short email, tweet, ≤2 criteria) → single judge, fast\n- **Complex** (3 criteria, long artifact, code, pipelines) → judge board with deliberation\n\nThe board constructs three judges tailored to your specific problem — not from a fixed menu, but by reading your artifact, criteria, and constraints and designing judges with diverse perspectives. An extraction prompt gets different judges than a DND adventure hook.\n\nJudges investigate before scoring — they read the evaluator script, ground truth, prior candidates, and config files to understand the problem deeply. A judge who reads the evaluator discovers scoring mechanics on iteration 0 instead of learning them through 3 iterations of trial and error.\n\nIf a single-judge run hits a plateau (3 iterations without improvement), simmer offers to upgrade to the board mid-run with 2 extra iterations.\n\n\n## Defaults and Safety\n\n**Default iteration count:** 3 rounds per batch. After each batch, simmer asks whether to continue. You can request a specific count (\"simmer this for 10 rounds\") or stop early at any prompt.\n\n**Regression safety:** The reflect subskill tracks the best candidate seen so far. If a new iteration scores lower than the current best, the best-so-far is preserved — the loop never loses progress. At the end, `result.md` always contains the highest-scoring candidate, not just the latest one.\n\n## Advanced Features\n\n| Feature | When you need it |\n|---------|-----------------|\n| **Workspace targets** | Refining a multi-file directory — iterations tracked as git commits so you can diff any two rounds |\n| **Runnable evaluators** | Your artifact has a test script — point simmer at it (`python evaluate.py`) and the judge interprets output |\n| **Background constraints** | The generator needs to know what's available (models, budget, latency targets) to make realistic choices |\n| **Output contracts** | Valid output has a defined shape (e.g., JSON schema) — violations score 1/10, forcing format fixes first |\n| **Validation commands** | A cheap pre-check (`./validate.sh`) catches broken pipelines in seconds before the full evaluator runs |\n| **Search space tracking** | Explicit bounds on what to explore — reflect tracks tried vs. untried regions so the judge steers toward gaps |\n\nSee the [v2 design spec](./docs/specs/2026-03-16-simmer-v2-design.md) for full details on each feature.\n\n## Output Directory Structure\n\n**Single-file mode** (default output dir: `docs/simmer`):\n```\ndocs/simmer/\n  iteration-0-candidate.md     # Seed (original artifact)\n  iteration-1-candidate.md     # Each improved candidate\n  iteration-2-candidate.md\n  iteration-3-candidate.md\n  trajectory.md                # Running score table\n  result.md                    # Final best candidate (highest score, not necessarily latest)\n```\n\n**Workspace mode:**\n```\n./pipeline/                    # Target directory (modified in place)\n  [project files]              # Tracked via git commits per iteration\n\ndocs/simmer/                   # Tracking files (separate from workspace)\n  trajectory.md                # Running score table\n```\n\nWorkspace iterations are tracked as git commits rather than separate files.\n\n## How It Works\n\n- **Focused improvement** — each iteration targets one direction (the ASI), not everything at once. Compound gains over scattered edits.\n- **Context isolation** — generator doesn't see scores, judge doesn't see previous scores. Each role gets only the context it needs to avoid bias.\n- **The generator is the search strategy** — in workspace mode, the generator decides what to change (swap a model, restructure a pipeline, tune a prompt). The ASI guides direction, the generator executes.\n\nSee the [design spec](./docs/specs/2026-03-16-simmer-v2-design.md) for the full architecture.\n\n## Related Skills\n\nPart of the test-kitchen family, but independently installable:\n- `test-kitchen:omakase-off` — parallel design exploration\n- `test-kitchen:cookoff` — parallel implementation competition\n- `simmer` — iterative refinement\n\n## Documentation\n\n- [CLAUDE.md](./CLAUDE.md) — full plugin instructions\n- [Simmer skill](./skills/simmer/SKILL.md) — orchestrator\n- [v2 Design](./docs/specs/2026-03-16-simmer-v2-design.md) — design spec\n- [Integration tests](./tests/integration/simmer-scenario.md) — test scenarios\n\n---\n\nIf Simmer helped you ship something better than your first draft, a ⭐ helps us know it's landing.\n\nBuilt by [2389](https://2389.ai) · Part of the [Claude Code plugin marketplace](https://github.com/2389-research/claude-plugins)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F2389-research%2Fsimmer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F2389-research%2Fsimmer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F2389-research%2Fsimmer/lists"}