https://github.com/shinpr/rashomon
Compare, improve, and verify prompt changes with evidence — not vibes.
ai-tools claude-code developer-tools llm prompt-engineering prompt-evaluation prompt-optimization
- Host: GitHub
- URL: https://github.com/shinpr/rashomon
- Owner: shinpr
- Created: 2026-01-14T15:06:50.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-01-15T08:22:43.000Z (about 1 month ago)
- Last Synced: 2026-01-15T11:33:22.065Z (about 1 month ago)
- Topics: ai-tools, claude-code, developer-tools, llm, prompt-engineering, prompt-evaluation, prompt-optimization
- Language: Shell
- Homepage:
- Size: 84 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
**See what actually changes when you improve your prompts — not just different wording.**
## Why rashomon?
> Inspired by the *Rashomon effect*: the idea that the same event can be described in different, even contradictory, ways depending on who observes it.
> rashomon makes those differences explicit and comparable.
- Spending too much time on trial-and-error with prompts?
- Read best practices but not sure how they apply to your case?
- Want proof that your changes actually made things better?
**rashomon** analyzes, improves, and compares prompts—so you can see what *actually* changed, and whether it matters.
### Who Is This For?
rashomon is designed for:
- Developers using Claude Code daily
- Teams iterating on complex prompts (coding, analysis, writing)
- Anyone who wants **evidence**, not vibes, when improving prompts
Not ideal if:
- You don't use git
- You want one-shot prompt rewriting without comparison
## Quick Example
```
/rashomon Write a function to sort an array
```
### What You Get
**1. Detected Issues**
```
- BP-002 (Vague Instructions): Sort order, language, and error handling not specified
- BP-003 (Missing Output Format): No expected output structure defined
```
**2. Improved Prompt**
```
Write a TypeScript function that sorts a number array in ascending order.
- Return empty array for empty input
- Include JSDoc comments
- Output: function code with example usage
```
**3. Comparison Report**
| Aspect | Original | Improved |
|--------|----------|----------|
| Type definitions | None | Included |
| Edge case handling | None | Included |
| Documentation | None | JSDoc added |
**Result: Structural Improvement** - The optimization made a meaningful difference.
### Example: When rashomon finds no real improvement
```
/rashomon Summarize this article in 3 bullet points
```
**Result: Variance** - Prompt was already well-scoped; differences were stylistic only.
## Installation
> Requires [Claude Code](https://claude.ai/code) (this is a Claude Code plugin)
```bash
# 1. Start Claude Code (run this in your terminal)
claude
# 2. Add the marketplace (this and the next command are typed inside Claude Code)
/plugin marketplace add shinpr/rashomon
# 3. Install the plugin
/plugin install rashomon@rashomon
# 4. Restart the session (required): exit and start Claude Code again
```
## Usage
```
/rashomon Your prompt here
```
From a file:
```
/rashomon Generate code following this skill: ./prompts/my-skill.md
```
For complex tasks that need more time, just mention it in natural language:
```
/rashomon Refactor the entire authentication module. This might take a while.
```
## How It Works
```
Your Prompt
↓
1. Analyze → Detect common issues automatically
↓
2. Improve → Generate optimized version
↓
3. Run Both → Execute in isolated environments
↓
4. Compare → Show what actually changed
↓
Comparison Report
```
Technical Details
### Isolated Execution
rashomon uses **git worktrees** to run the two prompts in separate working directories. A worktree is a Git feature that checks out an additional working directory from the same repository, so files written by one execution never appear in the other.
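To make the isolation concrete, here is a minimal sketch of two worktrees created from one commit. It is an illustration under stated assumptions, not the plugin's actual code: the throwaway repository, the directory names `original` and `improved`, and the `result.txt` file are all hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository so the sketch is self-contained (hypothetical layout).
work="$(mktemp -d)"
repo="$work/repo"
git init --quiet "$repo"
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
  commit --quiet --allow-empty -m "initial commit"

# One detached worktree per prompt variant, both checked out from the same commit.
git -C "$repo" worktree add --detach "$work/original" HEAD >/dev/null 2>&1
git -C "$repo" worktree add --detach "$work/improved" HEAD >/dev/null 2>&1

# A file written while "running" one variant is invisible to the other.
echo "output of the original prompt" > "$work/original/result.txt"
if [ -f "$work/original/result.txt" ] && [ ! -e "$work/improved/result.txt" ]; then
  isolated=yes
else
  isolated=no
fi
echo "isolated: $isolated"

# Clean up the directories, then drop git's now-stale worktree records.
rm -rf "$work/original" "$work/improved"
git -C "$repo" worktree prune
rm -rf "$work"
```

Both worktrees share the repository's object store, so creating them is cheap; only the working files are duplicated.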
### Parallel Execution
Both prompts run simultaneously via Claude Code subagents, so comparison time is roughly the same as a single execution, not double.
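The timing claim can be illustrated with plain shell background jobs. This is only an analogy for the idea: rashomon itself delegates the work to Claude Code subagents, and `run_variant` below is a hypothetical stand-in that simulates one execution with `sleep`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for one prompt execution.
run_variant() {
  local name="$1" out_dir="$2"
  sleep 1                                   # simulate work
  echo "result for $name" > "$out_dir/$name.txt"
}

out_dir="$(mktemp -d)"

# Launch both variants as background jobs so they run at the same time.
run_variant original "$out_dir" &
pid_original=$!
run_variant improved "$out_dir" &
pid_improved=$!

# Wait for each job; a non-zero exit from either one aborts the script.
wait "$pid_original"
wait "$pid_improved"

cat "$out_dir/original.txt" "$out_dir/improved.txt"
```

Because the two jobs overlap, the total wall-clock time is close to that of the slower job rather than the sum of both.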
### Architecture
```
Main Orchestrator
├── prompt-analyzer (analyzes and optimizes)
├── prompt-executor ×2 (runs in parallel)
└── report-generator (compares results)
```
## What It Detects
rashomon checks for 8 common prompt issues:
| Priority | Issues |
|----------|--------|
| **Critical** | Negative instructions ("don't do X"), vague instructions, missing output format |
| **High Impact** | Unstructured prompts, missing context, complex tasks without breakdown |
| **Enhancement** | Biased examples, no permission for uncertainty |
Pattern Details (BP-001 through BP-008)
### P1: Critical (Must Fix)
| ID | Pattern | Problem | Fix |
|----|---------|---------|-----|
| BP-001 | Negative Instructions | "Don't do X" often backfires—LLMs focus on what's mentioned | Reframe positively: "Don't include opinions" → "Include only factual information" |
| BP-002 | Vague Instructions | Missing specifics cause high output variance | Add explicit constraints: format, length, scope, tone |
| BP-003 | Missing Output Format | No format spec leads to inconsistent outputs | Define expected structure: JSON schema, section headers, etc. |
### P2: High Impact (Should Fix)
| ID | Pattern | Problem | Fix |
|----|---------|---------|-----|
| BP-004 | Unstructured Prompt | Wall of text obscures priorities | Apply 4-block pattern: Context / Task / Constraints / Output Format |
| BP-005 | Missing Context | No background leads to wrong assumptions | Add purpose, audience, relevant constraints |
| BP-006 | Complex Task | Undivided complex tasks have higher error rates | Break into steps with quality checkpoints |
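As an illustration of the 4-block pattern named in BP-004, the sketch below assembles a prompt from its four parts. The helper `build_prompt`, the Markdown-style block headings, and the sample context sentence are assumptions made for this example; the README does not prescribe a specific layout.

```bash
#!/usr/bin/env bash
set -euo pipefail

# build_prompt is a hypothetical helper: it prints a prompt laid out as
# Context / Task / Constraints / Output Format.
build_prompt() {
  local context="$1" task="$2" constraints="$3" output_format="$4"
  printf '## Context\n%s\n\n' "$context"
  printf '## Task\n%s\n\n' "$task"
  printf '## Constraints\n%s\n\n' "$constraints"
  printf '## Output Format\n%s\n' "$output_format"
}

# Sample values; the context line is invented for illustration.
prompt="$(build_prompt \
  "Utility module in a TypeScript codebase." \
  "Write a function that sorts a number array in ascending order." \
  "Return an empty array for empty input. Include JSDoc comments." \
  "Function code followed by one usage example.")"

printf '%s\n' "$prompt"
```

Keeping each block under its own heading makes it easy to see at a glance which of the four parts a prompt is missing.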
### P3: Enhancement (Could Fix)
| ID | Pattern | Problem | Fix |
|----|---------|---------|-----|
| BP-007 | Biased Examples | Homogeneous examples cause overfitting | Diversify: include edge cases, different formats |
| BP-008 | No Uncertainty Permission | No "I don't know" option causes hallucination | Add: "If unsure, say so" |
## Improvement Classification
Not all differences are improvements. rashomon classifies results into three categories:
| Classification | Meaning | Recommendation |
|---------------|---------|----------------|
| **Structural** | Real improvement in accuracy, completeness, or quality | Use the optimized prompt |
| **Expressive** | Different wording, same substance | Either version is fine |
| **Variance** | Just normal LLM randomness | Original prompt was already good |
Classification is based on:
- Whether detected issues (BP patterns) were resolved
- Output completeness and constraint adherence
- Consistency across multiple evaluation signals
## Knowledge Base
rashomon learns from your project over time.
**Location**: `.claude/.rashomon/prompt-knowledge.yaml`
**How it works**:
- Automatically enabled when the file exists
- Stores project-specific patterns (not generic best practices)
- Referenced during analysis, updated after comparisons
- Max 20 entries, lowest-confidence ones removed first
**Key principle**: Entries are never removed just for being old. Eviction is driven by confidence, because patterns that have worked for a long time are often the most valuable.
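The size limit described above can be sketched as a confidence-ordered cut. This is a hedged illustration only: the real file is YAML and its schema is not shown in this README, so the flat `confidence<TAB>pattern` lines, the `prune_entries` helper, and the sample entries are all assumptions.

```bash
#!/usr/bin/env bash
set -euo pipefail

max_entries=20

# prune_entries (hypothetical) reads "confidence<TAB>pattern" lines on stdin
# and keeps the max_entries highest-confidence ones. Age plays no role.
prune_entries() {
  LC_ALL=C sort -t "$(printf '\t')" -k1,1nr | awk -v n="$max_entries" 'NR <= n'
}

# 22 made-up entries with confidence 0.01 through 0.22.
entries="$(for i in {1..22}; do printf '0.%02d\tpattern-%d\n' "$i" "$i"; done)"

kept="$(printf '%s\n' "$entries" | prune_entries)"
printf '%s\n' "$kept"
```

With 22 entries and a limit of 20, only the two lowest-confidence entries are dropped, regardless of when they were added.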
## Troubleshooting
### Leftover worktrees
If rashomon exits unexpectedly, temporary directories might remain:
```bash
# Worktrees are stored in the system temp directory.
# Clean up manually if needed:
rm -rf "${TMPDIR:-/tmp}"/worktree-rashomon-*
# Then, from inside your repository, drop git's stale worktree records:
git worktree prune
```
### Timeout issues
For complex prompts that need more time, mention it when invoking:
```
/rashomon Complex task here. This might take longer than usual.
```
### "Not a git repository" error
rashomon requires a git repository. Initialize one with:
```bash
git init
```
## Requirements
- Git 2.5+ (the first release that includes `git worktree`)
- Claude Code
- Must run inside a git repository
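The requirements above can be checked up front with a short script. This is a sketch, not something shipped with rashomon: `version_at_least` and `check_requirements` are hypothetical helpers, and the check assumes the Claude Code CLI is available on `PATH` as `claude`, as in the installation steps.

```bash
#!/usr/bin/env bash
set -euo pipefail

# version_at_least HAVE WANT: succeeds when HAVE >= WANT (major.minor only).
version_at_least() {
  local have_major have_minor want_major want_minor
  IFS=. read -r have_major have_minor _ <<< "$1"
  IFS=. read -r want_major want_minor _ <<< "$2"
  if [ "$have_major" -ne "$want_major" ]; then
    [ "$have_major" -gt "$want_major" ]
  else
    [ "${have_minor:-0}" -ge "${want_minor:-0}" ]
  fi
}

# check_requirements reports the first unmet requirement, if any.
check_requirements() {
  local git_version
  command -v git >/dev/null || { echo "git not found"; return 1; }
  git_version="$(git --version | awk '{print $3}')"
  version_at_least "$git_version" "2.5" \
    || { echo "git $git_version is older than 2.5"; return 1; }
  [ "$(git rev-parse --is-inside-work-tree 2>/dev/null)" = "true" ] \
    || { echo "not inside a git repository"; return 1; }
  command -v claude >/dev/null || { echo "claude CLI not found on PATH"; return 1; }
  echo "all requirements met"
}

check_requirements || true
```

Run it from the directory where you plan to invoke `/rashomon`; the message names the first requirement that is not satisfied.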
## License
MIT