https://github.com/ckailash/review-orchestra
Multi-model code review orchestration for Claude Code — runs Claude + Codex reviewers in parallel, consolidates findings, and returns results for supervised fixing.
https://github.com/ckailash/review-orchestra
ai-code-review claude claude-code code-review codex typescript
Last synced: 6 days ago
JSON representation
Multi-model code review orchestration for Claude Code — runs Claude + Codex reviewers in parallel, consolidates findings, and returns results for supervised fixing.
- Host: GitHub
- URL: https://github.com/ckailash/review-orchestra
- Owner: ckailash
- License: mit
- Created: 2026-03-13T13:16:10.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-04-13T15:47:45.000Z (2 months ago)
- Last Synced: 2026-04-13T16:14:34.080Z (about 2 months ago)
- Topics: ai-code-review, claude, claude-code, code-review, codex, typescript
- Language: TypeScript
- Homepage: https://github.com/ckailash/review-orchestra
- Size: 302 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
README
# review-orchestra
Multi-model code review orchestration for [Claude Code](https://docs.anthropic.com/en/docs/claude-code). Runs multiple AI reviewers (Claude + Codex by default) in parallel, consolidates findings, and presents them to the user. The orchestrator Claude fixes code directly with user guidance in a supervised loop.
## How it works
```
+-------------------+
| Detect Scope | git diff / branch diff / PR diff
+--------+----------+
|
+--------v----------+
| Parallel Review | Claude + Codex (headless, concurrent)
+--------+----------+
|
+--------v----------+
| Consolidate | Dedup, classify confidence × impact, compute P-level
+--------+----------+
|
+--------v----------+
| Return findings | ReviewResult JSON → skill → user
+-------------------+
```
1. **Scope detection** — auto-detects uncommitted changes, branch diff vs main, or open PR diff.
2. **Parallel review** — launches all configured reviewers as headless CLI processes concurrently.
3. **Consolidation** — deduplicates findings across reviewers, classifies each on two axes (confidence × impact), computes a P-level (P0–P3), tags pre-existing issues outside the diff, and compares findings against previous rounds.
4. **Return findings** — the CLI returns a `ReviewResult` JSON on stdout. The skill presents findings to the user, who decides what to fix.
After receiving findings, the orchestrator Claude (who wrote the code and has full context) fixes issues directly with user guidance. The user controls what gets fixed and when to re-review.
## Installation
### Prerequisites
- Node.js >= 22
- [Claude Code CLI](https://docs.anthropic.com/en/docs/claude-code) (`claude`)
- [Codex CLI](https://github.com/openai/codex) (`codex`) — optional, can be disabled
### From npm (recommended)
```bash
npm install -g review-orchestra
review-orchestra setup
```
### From source (contributors)
```bash
git clone https://github.com/ckailash/review-orchestra.git
cd review-orchestra
npm install
npm run build
review-orchestra setup
```
`review-orchestra setup` validates your environment (Node version, required CLIs), creates the skill symlink for Claude Code, and configures `.gitignore`. Run `review-orchestra doctor` anytime to diagnose issues.
## Usage
Invoke via the `/review-orchestra` skill in Claude Code. Arguments are natural language — no flags.
```
/review-orchestra # auto-detect scope, all defaults
/review-orchestra src/auth/ src/api/ # only review these paths
/review-orchestra fix quality issues too # extend threshold to P2
/review-orchestra only use claude # single reviewer
/review-orchestra skip codex # disable a specific reviewer
```
**Defaults** (when no arguments are provided):
- Auto-detect diff scope
- Both Claude + Codex reviewers
- Stop-at threshold: P1 (all critical + functional issues recommended for fixing)
### CLI subcommands
```bash
review-orchestra review # run reviewers + consolidate, return findings (default)
review-orchestra review src/services/ # with scope args
review-orchestra review only claude # natural language filtering
review-orchestra stale # check if worktree changed since last review (exit 0=fresh, 1=stale, 2=no session)
review-orchestra reset # clear session state
review-orchestra setup # first-time install + repair broken install
review-orchestra doctor # diagnose issues without modifying anything
```
Running with no subcommand (just scope args) is equivalent to `review-orchestra review`.
### Supervised workflow
The CLI runs review + consolidation and returns a `ReviewResult` JSON. The skill then enters the supervised loop:
1. **Review** — run `review-orchestra review` (spawns reviewers, consolidates, returns findings)
2. **Present** — skill shows findings grouped by severity with progressive disclosure
3. **User decides** — user selects which findings to fix (by ID, severity band, or "all")
4. **Confirm** — skill echoes back planned actions, waits for confirmation
5. **Fix** — orchestrator Claude fixes code directly using Edit/Write tools
6. **Re-review** — if requested, run `review-orchestra review` again (back to step 1)
7. **Done** — summarize session, suggest next action (commit, push, PR)
The user controls the loop at every step. Escalation is implicit: the user sees all findings and decides.
### Session artifacts
All session data lives in `.review-orchestra/` in the project root. Reviewer outputs are persisted as flat files (no per-round subdirectories) so debug evidence survives parse failures and crashes:
```
.review-orchestra/
├── session.json # Session state (sessionId, status, scope, rounds[], reviews, consolidated, reviewerErrors)
├── state.lock # PID file for concurrent-run prevention
├── progress.json # Live reviewer status during a run (deleted on round complete)
├── round-1-claude-raw.txt # Raw stdout from each reviewer, written before parsing
├── round-1-codex-raw.txt # On parse failure renamed to *.raw.txt; on codex failure renamed *.failed
└── round-2-claude-raw.txt
```
Everything is kept. Never auto-deleted. Users can `rm -rf .review-orchestra/` or `review-orchestra reset` when done. `setup` adds `.review-orchestra/` to `.gitignore` automatically.
Concurrent runs of `review-orchestra review` in the same project are blocked by `state.lock` (atomic-rename release protocol with PID re-check) so two terminals don't trample each other's session.
#### Session state (`session.json`)
```json
{
"sessionId": "20260315-143022",
"status": "active",
"scope": { "type": "branch", "baseBranch": "main", "description": "branch feat/auth vs main" },
"currentRound": 2,
"worktreeHash": "abc123",
"rounds": [
{
"number": 1,
"phase": "complete",
"worktreeHash": "def456",
"reviews": { "claude": { "findings": [], "metadata": { } }, "codex": { } },
"consolidated": [],
"reviewerErrors": [],
"findingsPersisted": true,
"startedAt": "2026-03-15T14:30:22Z",
"completedAt": "2026-03-15T14:32:11Z"
}
],
"startedAt": "2026-03-15T14:30:22Z",
"completedAt": null
}
```
Key fields:
- `sessionId` — timestamp-based unique ID (e.g. `20260315-143022`)
- `status` — `active`, `expired` (scope base changed, requires reset), or `completed`
- `scope` — diff scope detected at session creation (with `pathFilters` if any)
- `currentRound` — most recent round number
- `worktreeHash` — SHA-256 over HEAD + staged + unstaged + untracked files; per-round snapshot for stale detection
- `rounds[]` — per-round artifacts: `phase`, `reviews` (per-reviewer findings + metadata), `consolidated`, `reviewerErrors` (preserved across crash recovery), `findingsPersisted`, timestamps
- `startedAt` — session creation timestamp
Session lifecycle:
- `review-orchestra review` → creates or continues session, runs reviewers + consolidation, returns findings
- `review-orchestra reset` → clears the session (equivalent to `rm -rf .review-orchestra/`)
- Session auto-expires if the scope base changes (e.g., new commits on main) — stale session error, user must reset and start fresh
- Detached HEAD: `baseBranch` becomes `detached@` (not the literal `"HEAD"`) so cross-round comparison stays stable
Crash recovery: if a round was interrupted mid-`reviewing` or mid-`consolidating`, the next `review` invocation resumes from the persisted phase. Saved reviewer findings and reviewer errors carry forward.
## Configuration
Configuration lives in `config/default.json`:
```json
{
"reviewers": {
"claude": {
"enabled": true,
"command": "claude -p - --allowedTools \"Read,Grep,Glob,Bash\" --output-format json",
"outputFormat": "json"
},
"codex": {
"enabled": true,
"command": "codex exec - --output-last-message {outputFile} --json",
"outputFormat": "json"
}
},
"thresholds": {
"stopAt": "p1"
}
}
```
### Thresholds
| Setting | Default | Description |
|---|---|---|
| `stopAt` | `p1` | Suggests which findings to fix. `p0` = critical only, `p1` = critical + functional, `p2` = + quality, `p3` = fix everything. The user controls the loop and decides when to stop. |
### Finding comparison
Cross-round finding comparison uses LLM-based semantic matching by default (via Claude Haiku). This handles renamed files, shifted line numbers, and reworded descriptions across review rounds. On LLM failure or timeout, it falls back to heuristic matching (`file + title.toLowerCase()`). Configure via `findingComparison` in `config/default.json`:
| Setting | Default | Description |
|---|---|---|
| `method` | `"llm"` | Comparison method: `"llm"` for semantic matching, `"heuristic"` for file+title matching |
| `model` | `"claude-haiku-4-5"` | Model to use for LLM comparison |
| `timeoutMs` | `60000` | Timeout for LLM comparison call (ms) |
| `fallback` | `"heuristic"` | Fallback method when LLM fails |
### Severity model
Findings are classified on two independent axes, then a P-level is derived:
| | Critical | Functional | Quality | Nitpick |
|---|---|---|---|---|
| **Verified** | P0 | P1 | P2 | P3 |
| **Likely** | P0 | P1 | P2 | P3 |
| **Possible** | P1 | P2 | P3 | P3 |
| **Speculative** | P2 | P3 | P3 | P3 |
Pre-existing findings (outside the diff hunks) are tagged and excluded from recommendations. In supervised mode, the user can explicitly request fixing a pre-existing finding.
## Component overview
```
review-orchestra/
├── src/
│ ├── orchestrator.ts # Main orchestration: preflight → reviewers → consolidate → return ReviewResult
│ ├── reviewers/
│ │ ├── types.ts # Reviewer interface (+ ReviewerCallContext)
│ │ ├── claude.ts # Claude headless reviewer
│ │ ├── codex.ts # Codex headless reviewer
│ │ ├── command.ts # Command template parsing (handles \" and \\ escapes)
│ │ ├── prompt.ts # Review prompt builder
│ │ ├── raw-output.ts # persistRawOutput helper — saves reviewer stdout before parsing
│ │ └── index.ts # Registry / factory (GenericReviewer for custom reviewers)
│ ├── consolidator.ts # Dedup, classify, merge findings
│ ├── findings-store.ts # Persistent cross-session finding storage (~/.review-orchestra/findings.jsonl)
│ ├── fuzzy-match.ts # Fuzzy matching for cross-reviewer dedup (tokenize, Jaccard similarity, isFuzzyMatch)
│ ├── scope.ts # Diff scope auto-detection
│ ├── config.ts # Configuration loading & defaults
│ ├── types.ts # Shared types (Finding, Round, SessionState, ReviewResult, etc.)
│ ├── state.ts # SessionManager: session-based state tracking
│ ├── worktree-hash.ts # Worktree hash computation and stale detection
│ ├── finding-comparison.ts # Cross-round finding comparison (new/persisting/resolved)
│ ├── reviewer-parser.ts # Parse/normalize reviewer output
│ ├── parse-args.ts # Natural language CLI argument parsing
│ ├── process.ts # Process spawning with streaming
│ ├── toolchain.ts # Project tech stack detection
│ ├── progress.ts # Progress file (reviewer status during review, progress.json)
│ ├── preflight.ts # Validates required binaries
│ ├── checks.ts # Shared check functions for setup/doctor
│ ├── setup.ts # Setup command (runs checks + fixes)
│ ├── doctor.ts # Doctor command (runs checks + reports)
│ ├── json-utils.ts # JSON extraction & envelope unwrapping
│ └── log.ts # Logging utilities
├── config/
│ └── default.json # Default configuration (reviewers, thresholds)
├── skill/
│ └── SKILL.md # Claude Code skill entry point (supervised flow)
├── prompts/
│ └── review.md # Template for reviewer agents
├── schemas/
│ └── findings.schema.json # JSON schema for structured findings output
├── test/ # Unit/integration tests (Vitest)
└── evals/ # LLM eval harness
```
The orchestrator (`src/orchestrator.ts`) runs a single pass: preflight → parallel reviewers → consolidation → finding comparison → return `ReviewResult`. There is no loop — the user controls re-review decisions via the skill.
State is managed by `SessionManager` (`src/state.ts`), which tracks sessions across multiple CLI invocations. Each invocation creates or continues a session, persisting round artifacts and worktree hashes for stale-detection and finding comparison.
## Adding custom reviewers
Any CLI tool that accepts a prompt and produces output can be a reviewer. Implement the `Reviewer` interface:
```typescript
// src/reviewers/types.ts
interface Reviewer {
name: string;
review(
prompt: string,
scope: DiffScope,
context: ReviewerCallContext, // { roundNumber } — used to write round-N--raw.txt before parsing
): Promise<{ findings: Finding[]; rawOutput: string; elapsedMs?: number }>;
}
```
Or add a generic reviewer via config — any command with a `{prompt}` placeholder works:
```json
{
"reviewers": {
"my-linter": {
"enabled": true,
"command": "my-tool review \"{prompt}\"",
"outputFormat": "json"
}
}
}
```
The consolidator normalizes different output formats into the standard findings schema. Custom reviewers registered in config are handled by the `GenericReviewer` class, which executes the command and parses the output.
## Tests and evals
```bash
# Lint (type-check)
npm run lint
# Unit/integration tests (Vitest)
npm test
# Run all evals (LLM-as-judge against synthetic repos)
npm run eval
# Run a single eval fixture
npm run eval -- sql-injection
# Override the judge model
npm run eval -- --judge-model claude-sonnet-4-6
```
Tests cover deterministic components (scope detection, consolidator, config, session management, reviewer parser, finding comparison, worktree hashing, CLI subcommands) using TDD. LLM-facing components (reviewer adapters, orchestrator wiring) have integration tests. The eval harness uses synthetic repos with planted bugs and LLM-as-judge scoring for precision, recall, and severity accuracy.
## License
MIT