{"id":37809738,"url":"https://github.com/shinpr/rashomon","last_synced_at":"2026-05-18T15:00:52.305Z","repository":{"id":332703848,"uuid":"1134323949","full_name":"shinpr/rashomon","owner":"shinpr","description":"Compare, improve, and verify prompt changes with evidence — not vibes.","archived":false,"fork":false,"pushed_at":"2026-01-15T08:22:43.000Z","size":86,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-15T11:33:22.065Z","etag":null,"topics":["ai-tools","claude-code","developer-tools","llm","prompt-engineering","prompt-evaluation","prompt-optimization"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shinpr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-14T15:06:50.000Z","updated_at":"2026-01-15T08:22:45.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/shinpr/rashomon","commit_stats":null,"previous_names":["shinpr/rashomon"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/shinpr/rashomon","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shinpr%2Frashomon","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shinpr%2Frashomon/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shinpr%2Frashomon/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shinpr%2Frashomon/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shinpr","download_url":"https://codeload.github.com/shinpr/rashomon/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shinpr%2Frashomon/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28479409,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T11:59:17.896Z","status":"ssl_error","status_checked_at":"2026-01-16T11:55:55.838Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-tools","claude-code","developer-tools","llm","prompt-engineering","prompt-evaluation","prompt-optimization"],"created_at":"2026-01-16T15:33:44.667Z","updated_at":"2026-05-18T15:00:52.275Z","avatar_url":"https://github.com/shinpr.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/rashomon-banner.jpg\" width=\"600\" alt=\"Rashomon\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://claude.ai/code\"\u003e\u003cimg src=\"https://img.shields.io/badge/Claude%20Code-Plugin-purple\" alt=\"Claude Code\"\u003e\u003c/a\u003e\n  \u003ca href=\"LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-MIT-blue\" alt=\"License\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n**Know whether your skills actually improve agent behavior — not just look different.**\n\n## Why rashomon?\n\n\u003e Inspired by the *Rashomon effect* — the idea that the same event can produce different outcomes depending on perspective.\n\u003e rashomon makes those differences explicit and comparable.\n\n- Built a skill but unsure if it actually changes agent behavior?\n- Iterating on skills and prompts by gut feel instead of evidence?\n- Want proof that your changes made things better, not just different?\n\n**rashomon** evaluates skills and prompts through blind comparison — running tasks with and without your changes in isolated environments, then comparing real outputs without knowing which version produced which.\n\n### Who Is This For?\n\nrashomon is designed for:\n- Skill authors who want evidence-based validation\n- Developers using Claude Code daily\n- Teams iterating on complex prompts (coding, analysis, writing)\n- Anyone who wants **evidence**, not vibes, when improving skills and prompts\n\nNot ideal if:\n- You want one-shot prompt rewriting without comparison\n\n## Quick Example\n\n### Skill Evaluation\n\n```\n/recipe-eval-skill create\n```\n\nCreates a skill through interactive dialog, then evaluates effectiveness:\n1. Collects domain knowledge, project-specific rules, and trigger phrases\n2. Generates optimized skill content (graded A/B/C)\n3. Runs a test task with and without the skill in isolated environments, using blind A/B comparison\n\n**What the evaluation report looks like:**\n\n```\nSkill Quality: Grade A\n- Project-specific rules clearly encoded, no critical issues\n\nTrigger Check: pass (discovered + invoked)\n\nExecution Effectiveness:\n- Winner: with-skill\n- Assessment: structural improvement\n- Key difference: 3-stage catch ordering and retry constraints\n  applied correctly (attributed to skill Rules 3 and 6)\n\nRecommendation: ship\n```\n\n```\n/recipe-eval-skill api-error-handling skill's scope needs adjustment\n```\n\nUpdates an existing skill, then evaluates old vs new version side by side.\n\nSee a real-world example: [I Built a Skill Reviewer. Then I Ran It on Itself.](https://dev.to/shinpr/i-built-a-skill-reviewer-then-i-ran-it-on-itself-4m4j)\n\n### Prompt Evaluation\n\n```\n/recipe-eval-prompt Write a function to sort an array\n```\n\nAnalyzes prompt issues, generates an improved version, runs both in isolated environments, and shows what actually changed.\n\n\u003cdetails\u003e\n\u003csummary\u003ePrompt Evaluation Details\u003c/summary\u003e\n\n#### What You Get\n\n**1. Detected Issues**\n```\n- BP-002 (Vague Instructions): Sort order, language, and error handling not specified\n- BP-003 (Missing Output Format): No expected output structure defined\n```\n\n**2. Improved Prompt**\n```\nWrite a TypeScript function that sorts a number array in ascending order.\n- Return empty array for empty input\n- Include JSDoc comments\n- Output: function code with example usage\n```\n\n**3. Comparison Report**\n\n| Aspect | Original | Improved |\n|--------|----------|----------|\n| Type definitions | None | Included |\n| Edge case handling | None | Included |\n| Documentation | None | JSDoc added |\n\n**Result: Structural Improvement** - The optimization made a meaningful difference.\n\n#### Example: When rashomon finds no real improvement\n\n```\n/recipe-eval-prompt Summarize this article in 3 bullet points\n```\n\n**Result: Variance** - Prompt was already well-scoped; differences were stylistic only.\n\n\u003c/details\u003e\n\n## Installation\n\n\u003e Requires [Claude Code](https://claude.ai/code) (this is a Claude Code plugin)\n\n```bash\n# 1. Start Claude Code\nclaude\n\n# 2. Install the marketplace\n/plugin marketplace add shinpr/rashomon\n\n# 3. Install plugin\n/plugin install rashomon@rashomon\n\n# 4. Restart session (required)\n# Exit and restart Claude Code\n```\n\n## Usage\n\n### Skill Evaluation\n\n```\n/recipe-eval-skill create\n```\n\nCreate a new skill and evaluate its effectiveness.\n\n```\n/recipe-eval-skill my-skill-name what to change\n```\n\nUpdate an existing skill and compare old vs new.\n\n### Prompt Evaluation\n\n```\n/recipe-eval-prompt Your prompt here\n```\n\nFrom a file:\n```\n/recipe-eval-prompt Generate code following this skill: ./prompts/my-skill.md\n```\n\nFor complex tasks that need more time, just mention it in natural language:\n```\n/recipe-eval-prompt Refactor the entire authentication module. This might take a while.\n```\n\n## How It Works\n\n```\nSkill Evaluation (/recipe-eval-skill)\n    ├── skill-creator (generates/modifies skills)\n    ├── skill-reviewer (grades quality A/B/C)\n    ├── eval-executor ×2 (isolated worktrees)\n    └── skill-eval-reporter (blind A/B comparison)\n\nPrompt Evaluation (/recipe-eval-prompt)\n    ├── prompt-analyzer (analyzes and optimizes)\n    ├── prompt-executor ×2 (isolated worktrees)\n    └── report-generator (compares results)\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eTechnical Details\u003c/summary\u003e\n\n### Isolated Execution\n\nrashomon uses **git worktrees** to run both versions in completely separate environments. A worktree is a Git feature that creates independent working directories from the same repository—this ensures the two executions don't interfere with each other.\n\n\u003c/details\u003e\n\n## Improvement Classification\n\nNot all differences are improvements. rashomon classifies results into four categories:\n\n| Classification | Meaning | Recommendation |\n|---------------|---------|----------------|\n| **Structural** | Real improvement in accuracy, completeness, or quality | Use the new version |\n| **Context Addition** | One version had more project-specific knowledge | Useful if the context is accurate |\n| **Expressive** | Different wording, same substance | Either version is fine |\n| **Variance** | Just normal LLM randomness | Original was already good |\n\nClassification is based on:\n- Whether detected issues were resolved\n- Output completeness and constraint adherence\n- Agreement between blind quality assessment and observable output differences\n\n\u003cdetails\u003e\n\u003csummary\u003eQuality Patterns (BP-001 through BP-008)\u003c/summary\u003e\n\nBoth skill review and prompt analysis check against 8 common patterns:\n\n| Priority | Issues |\n|----------|--------|\n| **Critical** | Negative instructions (\"don't do X\"), vague instructions, missing output format |\n| **High Impact** | Unstructured prompts, missing context, complex tasks without breakdown |\n| **Enhancement** | Biased examples, no permission for uncertainty |\n\n### P1: Critical (Must Fix)\n\n| ID | Pattern | Problem | Fix |\n|----|---------|---------|-----|\n| BP-001 | Negative Instructions | \"Don't do X\" often backfires—LLMs focus on what's mentioned | Reframe positively: \"Don't include opinions\" → \"Include only factual information\" |\n| BP-002 | Vague Instructions | Missing specifics cause high output variance | Add explicit constraints: format, length, scope, tone |\n| BP-003 | Missing Output Format | No format spec leads to inconsistent outputs | Define expected structure: JSON schema, section headers, etc. |\n\n### P2: High Impact (Should Fix)\n\n| ID | Pattern | Problem | Fix |\n|----|---------|---------|-----|\n| BP-004 | Unstructured Prompt | Wall of text obscures priorities | Apply 4-block pattern: Context / Task / Constraints / Output Format |\n| BP-005 | Missing Context | No background leads to wrong assumptions | Add purpose, audience, relevant constraints |\n| BP-006 | Complex Task | Undivided complex tasks have higher error rates | Break into steps with quality checkpoints |\n\n### P3: Enhancement (Could Fix)\n\n| ID | Pattern | Problem | Fix |\n|----|---------|---------|-----|\n| BP-007 | Biased Examples | Homogeneous examples cause overfitting | Diversify: include edge cases, different formats |\n| BP-008 | No Uncertainty Permission | No \"I don't know\" option causes hallucination | Add: \"If unsure, say so\" |\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eAbout Knowledge Base\u003c/summary\u003e\n\n## Knowledge Base\n\nrashomon learns from your project over time.\n\n**Location**: `.claude/.rashomon/prompt-knowledge.yaml`\n\n**How it works**:\n- Automatically enabled when the file exists\n- Stores project-specific patterns (not generic best practices)\n- Referenced during analysis, updated after comparisons\n- Max 20 entries, lowest-confidence ones removed first\n\n**Key principle**: Old knowledge isn't automatically removed. Patterns that have worked for a long time are often the most valuable.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eTroubleshooting\u003c/summary\u003e\n\n## Troubleshooting\n\n### Leftover worktrees\n\nIf rashomon exits unexpectedly, temporary directories might remain:\n\n```bash\n# Worktrees are stored in system temp directory\n# Clean up manually if needed:\nrm -rf ${TMPDIR:-/tmp}/worktree-rashomon-*\n```\n\n### Timeout issues\n\nFor complex prompts that need more time, mention it when invoking:\n\n```\n/recipe-eval-prompt Complex task here. This might take longer than usual.\n```\n\n### \"Not a git repository\" error\n\nrashomon requires a git repository. Initialize one with:\n\n```bash\ngit init\n```\n\n\u003c/details\u003e\n\n## Requirements\n\n- Git 2.5+\n- Python 3.9+\n- Claude Code\n- Must run inside a git repository\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshinpr%2Frashomon","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshinpr%2Frashomon","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshinpr%2Frashomon/lists"}