
https://github.com/tznthou/claude-prism

Cross-provider AI code review for Claude Code — evidence-based confidence scoring with Codex, Gemini & Claude

ai-code-review bash claude-code code-review codex confidence-scoring cross-provider fact-checking gemini github-actions multi-provider


# claude-prism



[![npm](https://img.shields.io/npm/v/claud-prism-aireview.svg)](https://www.npmjs.com/package/claud-prism-aireview)
[![Homebrew](https://img.shields.io/badge/Homebrew-tap-FBB040.svg)](https://github.com/tznthou/homebrew-claude-prism)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
[![Bash](https://img.shields.io/badge/Bash-3.2+-4EAA25.svg)](https://www.gnu.org/software/bash/)
[![Claude Code](https://img.shields.io/badge/Claude_Code-Plugin-7C3AED.svg)](https://claude.com/claude-code)
[![ShellCheck](https://img.shields.io/badge/ShellCheck-Passing-4EAA25.svg)](https://www.shellcheck.net/)
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](CODE_OF_CONDUCT.md)

[繁體中文](README.zh-TW.md)

Cross-provider AI orchestration for Claude Code — eliminate same-source blind spots.

---

## Core Concept

### The Problem

AI code review is noisy. Even the best tools on the market reach an F1 score of only about 64%, so a significant fraction of reported findings are false positives and real issues still get missed. Most tools try to fix this by having the AI self-assess its own confidence ("I'm 90% sure this is a real bug"), but that's grading your own exam.

There's a deeper structural issue: when Claude Code writes your code **and** reviews it, you get **same-source blind spots**. Certain classes of bugs, design flaws, and security issues consistently slip through because the same model has the same knowledge gaps. Even multi-agent review within a single provider doesn't help — four Claude agents still share the same training data and biases. More agents ≠ more perspectives.

### The Solution

**1. Cross-provider triangulation** — Claude orchestrates, but dispatches review tasks to **Gemini** and **Codex** via their CLIs. Three providers, three training datasets, three sets of blind spots that cancel each other out.

**2. Evidence-based confidence scoring** — Instead of asking AI "how confident are you?", each finding is scored by verifiable evidence: specific line numbers (+25), rule citations (+20), reproducible scenarios (+15), hallucinated references (-50). The core formula is deterministic — same evidence, same score. [Open spec](spec/confidence-scoring-v1.md), no black box.

**3. Local-first** — Your code goes directly to each provider's API. No intermediary server, no relay, no third party touching your code.

### Why claude-prism?

| | claude-prism | Single-provider multi-agent review |
|---|---|---|
| **Provider diversity** | Codex + Gemini + Claude (3 independent models) | Multiple agents, same underlying model |
| **Blind spot coverage** | Cross-training-data: each model catches what others miss | Same training data bias amplified across agents |
| **Cost** | Near-zero (Codex CLI + Gemini CLI free tiers) | $15–25 per PR (official tools, Team/Enterprise plans) |
| **Speed** | 1–2 minutes | ~20 minutes |
| **Availability** | Anyone with CLI access | Paid team plans only |
| **Scoring method** | Evidence-based formula ([open spec](spec/confidence-scoring-v1.md)) — deterministic core, same evidence = same score | LLM self-assessment — AI grades its own confidence |
| **Data path** | Local-first: direct to provider APIs, no intermediary server | Varies by service |

### Compared to `/ultrareview` (released 2026-04-16)

Anthropic shipped [`/ultrareview`](https://code.claude.com/docs/en/ultrareview.md) alongside Claude Opus 4.7 on 2026-04-16 — a cloud-hosted review that spins up a fleet of Claude agents in a remote sandbox, with each finding independently verified. It's a genuine upgrade over single-pass `/review`, but it plays a different game than claude-prism:

| | claude-prism `/pi-multi-review` | `/ultrareview` |
|---|---|---|
| **Model diversity** | Codex + Gemini + Claude (3 independent models) | Many Claude agents, one underlying model |
| **Blind-spot strategy** | Heterogeneous training data cancels shared gaps | Homogeneous — more passes, same gaps |
| **Review stance** | Adversarial, divided attack surfaces | Verification-focused (lower false-positive rate) |
| **Free tier** | Unlimited (uses existing CLI subscriptions) | 3 runs per account, one-time — does not refresh |
| **Team / Enterprise free runs** | Unlimited | 0 |
| **After free tier** | Still unlimited | ~$5–$20 per run as extra usage |
| **API-key-only Claude Code** | Works | Blocked (Claude.ai login required) |
| **Bedrock / Vertex / Foundry** | Works (provider-agnostic) | Not supported |
| **Zero Data Retention orgs** | Works | Blocked |
| **CI/CD** | `ci-review.sh` in GitHub Actions | Interactive session only |

The two tools optimize for different scenarios. `/ultrareview` is a solid pre-merge confidence boost for Pro/Max users working inside the Claude ecosystem. `/pi-multi-review` exists for everything outside that box — CI pipelines, regulated / ZDR environments, teams without Pro/Max seats, and anyone who wants perspectives that don't share Claude's training data.

### Why Trust the Findings?

Most AI review tools use a similar approach: have the AI score its own findings on a 0–100 confidence scale, then drop anything below a threshold. [Anthropic's official code review plugin](https://github.com/anthropics/claude-code/blob/main/plugins/code-review/README.md), for example, uses a ≥80 threshold. The problem is that LLM-generated confidence scores are non-deterministic: run the same review twice and you may get different scores.

claude-prism's [Confidence Scoring Framework](spec/confidence-scoring-v1.md) works differently: it scores findings by **verifiable evidence**, not AI self-assessment. Does the comment cite a specific line number? +25. Does it reference an OWASP rule? +20. Does it mention a file that doesn't exist? -50. The formula is deterministic, model-agnostic, and fully auditable — you can open the spec and verify exactly how every point was calculated.

---

## Quick Start

### Prerequisites

| Tool | Required | Install |
|------|----------|---------|
| [Claude Code](https://claude.com/claude-code) | Yes | `npm install -g @anthropic-ai/claude-code` |
| [Gemini CLI](https://github.com/google-gemini/gemini-cli) | For Gemini commands | `npm install -g @google/gemini-cli` |
| [Codex CLI](https://github.com/openai/codex) | For Codex commands | `npm install -g @openai/codex` |

> **⚠️ Gemini CLI Service Update (effective March 25, 2026)**
>
> Google is [changing how Gemini CLI routes traffic](https://github.com/google-gemini/gemini-cli/discussions/22970). Free tier users will be limited to Flash models only (no Pro), and all users may experience rate limiting during peak hours. If you encounter timeouts:
>
> ```bash
> # Option 1: Use Flash (faster, higher coding benchmark scores than Pro)
> export GEMINI_MODEL="gemini-3-flash-preview"
>
> # Option 2: Use your own API key for self-managed quota
> export GEMINI_API_KEY="your-key-from-ai-studio"
> ```
>
> See [Environment Variables](#environment-variables) for details.

### Install

**Quick install (recommended)**

```bash
npx claud-prism-aireview
```

This downloads the package and runs the installer in one step — it deploys commands to `~/.claude/commands/` and scripts to `~/.claude/scripts/`. No `postinstall` scripts are used; you explicitly run this command yourself.

**npm global install**

```bash
npm install -g claud-prism-aireview
claud-prism-aireview # ← Required: deploys commands and scripts to ~/.claude/
```

> **Important:** `npm install -g` only places the binary in your PATH. You must run `claud-prism-aireview` once to actually deploy the commands. This is by design — we intentionally avoid `postinstall` scripts for [supply chain security](https://socket.dev/blog/pnpm-10-0-0-blocks-lifecycle-scripts-by-default).

**Homebrew (macOS)**

```bash
brew tap tznthou/claude-prism
brew install claud-prism-aireview
```

**Manual**

```bash
git clone https://github.com/tznthou/claude-prism.git
cd claude-prism
./install.sh
```

The installer:
- Checks for prerequisites and reports what's available
- Verifies file integrity via SHA256 checksums (if `checksums.sha256` is present)
- Backs up any existing files before overwriting
- Copies commands to `~/.claude/commands/` and scripts to `~/.claude/scripts/`
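
The integrity check in the list above can be approximated as follows. This is a sketch under stated assumptions, not the code in `install.sh`: the function name `verify_checksums` is hypothetical, and it assumes `checksums.sha256` uses the standard `<hash>  <path>` layout that `sha256sum -c` and `shasum -a 256 -c` both accept.

```bash
# Hypothetical helper, not taken from install.sh.
# Assumes checksums.sha256 lists "<sha256>  <relative path>" per line.
verify_checksums() {
  local dir="$1"
  # Mirror the "if present" behaviour: no checksum file, nothing to verify.
  [ -f "$dir/checksums.sha256" ] || return 0
  if command -v sha256sum >/dev/null 2>&1; then
    # GNU coreutils (most Linux distributions)
    (cd "$dir" && sha256sum -c checksums.sha256 >/dev/null 2>&1)
  elif command -v shasum >/dev/null 2>&1; then
    # Perl shasum (ships with macOS)
    (cd "$dir" && shasum -a 256 -c checksums.sha256 >/dev/null 2>&1)
  else
    echo "no SHA-256 tool found" >&2
    return 2
  fi
}

# Usage: verify_checksums /path/to/claude-prism || echo "integrity check failed"
```

A non-zero return means at least one listed file is missing or does not match its recorded hash.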

### Verify

```bash
./tests/smoke-test.sh
```

### Uninstall

```bash
npx claud-prism-aireview --uninstall
# or manually:
./uninstall.sh
```

---

## Commands

| Command | Provider | Description |
|---------|----------|-------------|
| `/pi-ask-codex` | Codex | Direct Q&A — get OpenAI's perspective |
| `/pi-ask-gemini` | Gemini | Direct Q&A — get Google's perspective |
| `/pi-askall` | Codex + Gemini + Claude | Ask all providers the same question — three perspectives with synthesis |
| `/pi-code-review` | Codex | Adversarial code review — break confidence in changes, not validate them (with confidence scoring) |
| `/pi-fact-check` | Gemini + WebSearch + Claude | Fact-check content — dual-track search (Gemini + WebSearch), Claude validates with convergence scoring |
| `/pi-ui-design` | Gemini | HTML mockup from design spec |
| `/pi-ui-review` | Gemini | UI/UX accessibility & design audit (with confidence scoring) |
| `/pi-research` | Gemini + WebSearch + Claude | Structured technical research — dual-track search (Gemini + WebSearch parallel) |
| `/pi-multi-review` | Codex + Gemini + Claude | Triple-provider adversarial review with divided attack surfaces (smart routing + confidence scoring) |
| `/pi-plan` | Codex + Gemini + Claude | Multi-provider implementation planning for architectural decisions |

All commands include **graceful degradation** — if a provider is unavailable, Claude continues with the remaining providers instead of failing. Every failure includes a **structured error diagnostic** (TIMEOUT, RATE_LIMIT, AUTH_ERROR, PERMISSION, SANDBOX, NETWORK, CLI_ERROR, or CLI_NOT_FOUND) and suggests an alternative command.
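
To make the error classes concrete, here is a minimal sketch of how a wrapper might map a failed CLI call to one of them. The class names come from this README; everything else is an assumption for illustration. The function name `classify_error` is hypothetical and the stderr patterns are guesses, not the project's actual matching rules. Only two mappings rest on general conventions: `timeout(1)` exits with 124 and the shell uses 127 for "command not found".

```bash
# Hypothetical classifier: the matching rules below are illustrative guesses.
classify_error() {
  local exit_code="$1" stderr_text="$2" lower
  case "$exit_code" in
    124) echo "TIMEOUT"; return ;;        # exit status used by timeout(1)
    127) echo "CLI_NOT_FOUND"; return ;;  # shell: command not found
  esac
  # Lowercase once so the patterns below stay simple.
  lower=$(printf '%s' "$stderr_text" | tr '[:upper:]' '[:lower:]')
  case "$lower" in
    *"rate limit"*|*429*)                         echo "RATE_LIMIT" ;;
    *unauthorized*|*401*|*"api key"*)             echo "AUTH_ERROR" ;;
    *"permission denied"*)                        echo "PERMISSION" ;;
    *sandbox*)                                    echo "SANDBOX" ;;
    *"could not resolve"*|*"connection refused"*) echo "NETWORK" ;;
    *)                                            echo "CLI_ERROR" ;;
  esac
}

# Usage: classify_error "$?" "$(cat stderr.log)"
```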

### `/pi-ask-codex` — Ask OpenAI

Direct Q&A with Codex. Good for getting a second opinion on any technical question.

```
/pi-ask-codex What's the best way to handle optimistic updates in React Query v5?
```

### `/pi-ask-gemini` — Ask Google

Direct Q&A with Gemini. Leverages Google's broad ecosystem knowledge.

```
/pi-ask-gemini Compare Bun vs Deno vs Node.js for a new backend project in 2026
```

### `/pi-askall` — Ask All Providers

Ask Codex and Gemini the same question in parallel, then Claude synthesizes all three perspectives. Works with **any topic** — not limited to code. Each provider responds independently (no cross-contamination), then Claude compares consensus, flags divergence, and delivers an integrated take.

```
/pi-askall Should we use a monorepo or polyrepo for the new microservices?
/pi-askall What's the best approach for real-time sync in a collaborative editor?
```

### `/pi-fact-check` — Cross-Provider Fact Verification

Dual-track search: Gemini (search grounding) and WebSearch run simultaneously — no dead-wait time. Claude cross-verifies with independent judgment, validates URLs, and scores confidence using source convergence (how many independent sources agree). Includes source tier ranking (L1 official records through L6 community), editorial chain deduplication, and adversarial evidence checking.

```
/pi-fact-check article.md
/pi-fact-check https://example.com/blog-post
/pi-fact-check "Tesla delivered 1.8M vehicles in 2024 and holds over 50% market share"
```

### `/pi-code-review` — Adversarial Code Review

Codex reviews code that Claude wrote with an **adversarial stance** — the reviewer's job is to break confidence in the change, not validate it. The core use case is still **different AI, different blind spots**, but the prompt has been upgraded from a neutral "Senior Reviewer" to an adversarial role with 9 attack surface categories (auth, data loss, race conditions, rollback safety, etc.), a finding bar (every finding must answer: what can go wrong, why vulnerable, likely impact, concrete fix), and calibration rules ("prefer one strong finding over several weak ones").

Each issue is scored 0–100 using the [Confidence Scoring Framework](spec/confidence-scoring-v1.md) — evidence-based, deterministic noise filtering. Only issues scoring ≥ 80 are shown. Reviews also check inline annotation compliance (`IMPORTANT`/`WARNING`/`TODO` comments) and, in `--pr` mode, surface recurring issues from historical PR comments on the same files.

```
/pi-code-review # review staged changes
/pi-code-review src/auth.ts # review specific file
/pi-code-review --diff # review unstaged changes
/pi-code-review --pr # review entire PR
```

### `/pi-ui-design` — HTML Mockup from Design Spec

Gemini reads a design specification and generates a self-contained HTML mockup (Tailwind CDN) you can preview in a browser. Confirm the design visually, then let Claude Code implement it into your project.

```
/pi-ui-design design-spec.md # generate HTML mockup from design spec
/pi-ui-design "a SaaS dashboard" # no spec → Gemini drafts spec first, then mockup
```

### `/pi-ui-review` — UI/UX Audit

Gemini reviews frontend code for accessibility, responsive design, component structure, and UX patterns. Issues are confidence-scored with UI-specific factors (WCAG citations, user impact descriptions). Guideline compliance is checked if `CLAUDE.md` or `Agents.md` exists.

```
/pi-ui-review src/components/Header.tsx
/pi-ui-review src/app/(public)/
/pi-ui-review --screenshot ./screenshot.png # uses Claude's vision instead
```

### `/pi-research` — Technical Research

Dual-track search: Gemini (search grounding) and WebSearch run in parallel for structured technical research with comparison tables, recommendations, and source URLs. If one track fails, the other covers the gap — same resilience architecture as `/pi-fact-check`. If the topic relates to the current project, relevant context (dependencies, existing patterns) is automatically included. Results can optionally be saved to `.claude/pi-research/` for future reference.

```
/pi-research Best authentication libraries for Next.js App Router
/pi-research Monorepo tooling: Turborepo vs Nx vs Moon
```

### `/pi-multi-review` — Triple-Provider Adversarial Review (Divided Attack Surfaces)

The flagship command. Sends the same code to **both** Codex and Gemini for **adversarial review with divided attack surfaces** — Codex attacks security & data integrity (auth, data loss, race conditions, schema drift, etc.), Gemini attacks design, UX & maintainability (edge-case UX, accessibility, abstraction leaks, test coverage, etc.). Claude synthesizes:

1. **Consensus** — issues both providers flagged (high confidence, fix first)
2. **Divergence** — issues only one found (Claude judges validity)
3. **Guideline compliance** — violations of `CLAUDE.md` / `Agents.md` project rules
4. **Claude supplement** — issues neither caught

**Confidence Scoring** ([spec](spec/confidence-scoring-v1.md)): Every issue from all providers is scored 0–100 based on evidence quality — not Claude's opinion. Only issues ≥ 80 make it to the output. Base score 40, with evidence-based factors: specific line numbers (+25), code introduced in this diff (+25), cited rules (+20), reproducible scenario (+15), multi-provider consensus (+20). Noise signals reduce the score: subjective style (-25), linter-catchable (-25), hallucinated references (-50). The framework is deterministic, model-agnostic, and [independently reusable](spec/confidence-scoring-v1.md).
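
The arithmetic in the paragraph above can be sketched in a few lines of Bash. The weights and the ≥ 80 threshold are the ones listed here; the function names, the evidence flag names, and the clamping to 0–100 are assumptions for illustration. The authoritative definition is the spec.

```bash
# Hypothetical helper, not part of claude-prism.
# Each argument is one piece of evidence attached to a finding.
score_finding() {
  local score=40 flag   # base score
  for flag in "$@"; do
    case "$flag" in
      line-number)  score=$((score + 25)) ;;  # specific line numbers
      new-in-diff)  score=$((score + 25)) ;;  # code introduced in this diff
      cited-rule)   score=$((score + 20)) ;;  # cited rules
      reproducible) score=$((score + 15)) ;;  # reproducible scenario
      consensus)    score=$((score + 20)) ;;  # multi-provider consensus
      subjective)   score=$((score - 25)) ;;  # subjective style
      linter)       score=$((score - 25)) ;;  # linter-catchable
      hallucinated) score=$((score - 50)) ;;  # hallucinated references
      *) echo "unknown evidence flag: $flag" >&2; return 2 ;;
    esac
  done
  # Clamp to the 0-100 range (assumed, since scores are reported as 0-100).
  if [ "$score" -lt 0 ]; then score=0; fi
  if [ "$score" -gt 100 ]; then score=100; fi
  echo "$score"
}

# A finding is shown only when it reaches the threshold.
passes_threshold() {
  [ "$1" -ge 80 ]
}

# score_finding line-number cited-rule     -> 40 + 25 + 20 = 85 (shown)
# score_finding line-number hallucinated   -> 40 + 25 - 50 = 15 (filtered)
```

Because the inputs are evidence flags rather than model opinions, the same evidence always yields the same score.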

**Smart Routing** (v0.7.0): Automatically detects the domain of changes (frontend/backend/fullstack) from file extensions and paths. During synthesis, the domain-authoritative provider gets higher weight — Gemini for frontend (UI/UX expertise), Codex for backend (security/algorithm expertise). Both providers are always called; weighting only affects how Claude resolves disagreements.
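
As a rough illustration of the routing idea, the sketch below classifies a list of changed paths and names the provider that would receive extra weight. It is not the shipped `detect-domain.sh`: the function names, the extension and directory lists, the `unknown` fallback, and the choice to give no extra weight for fullstack changes are all assumptions.

```bash
# Hypothetical sketch: extension and path lists are illustrative only.
detect_domain() {
  local frontend=0 backend=0 path
  for path in "$@"; do
    case "$path" in
      *.tsx|*.jsx|*.vue|*.svelte|*.css|*.scss|*.html) frontend=1 ;;
      *.py|*.go|*.rs|*.java|*.rb|*.sql|*.sh)          backend=1 ;;
      */components/*|*/pages/*)                       frontend=1 ;;
      */api/*|*/server/*|*/migrations/*)              backend=1 ;;
    esac
  done
  if [ "$frontend" -eq 1 ] && [ "$backend" -eq 1 ]; then
    echo "fullstack"
  elif [ "$frontend" -eq 1 ]; then
    echo "frontend"
  elif [ "$backend" -eq 1 ]; then
    echo "backend"
  else
    echo "unknown"
  fi
}

# Which provider's findings carry more weight when the two disagree.
authoritative_provider() {
  case "$1" in
    frontend) echo "gemini" ;;
    backend)  echo "codex" ;;
    *)        echo "none" ;;   # assumption: no extra weight otherwise
  esac
}

# Usage: authoritative_provider "$(detect_domain $(git diff --cached --name-only))"
```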

```
/pi-multi-review # review staged changes
/pi-multi-review --pr # review entire PR
```

### `/pi-plan` — Multi-Provider Implementation Planning

For tasks with **multiple viable approaches** — architectural decisions, tech stack selection, migration strategies. Consults Codex and Gemini for independent architectural analysis, then Claude synthesizes into a structured plan file.

Plans are saved to `.claude/pi-plans/` and include: context, multi-provider analysis, step-by-step implementation, key files, risks, and verification criteria. Plans persist across sessions.

Best for complex decisions; for simple task breakdown, use Claude Code's built-in plan mode instead.

```
/pi-plan Migrate from REST to GraphQL — evaluate Apollo vs Relay vs urql
/pi-plan Refactor the payment module to support multiple providers
```

---

## Architecture

```mermaid
flowchart LR
User["👤 You"] <--> Claude["🟣 Claude Code\n(Orchestrator)"]
Claude -->|"/pi-ask-codex\n/pi-askall\n/pi-code-review\n/pi-multi-review\n/pi-plan"| Codex["🟢 Codex CLI"]
Claude -->|"/pi-ask-gemini\n/pi-askall\n/pi-fact-check\n/pi-ui-design\n/pi-ui-review\n/pi-research\n/pi-multi-review\n/pi-plan"| Gemini["🔵 Gemini CLI"]
CI["⚙️ GitHub Actions"] -->|"ci-review.sh"| GeminiAPI["🔵 Gemini API"]
CI -->|"ci-review.sh"| OpenAIAPI["🟢 OpenAI API"]
CI -->|"synthesis"| ClaudeAPI["🟣 Claude API"]
```

### How It Works

1. User types a slash command in Claude Code (e.g., `/pi-code-review src/auth.ts`)
2. Claude Code reads the command definition (Markdown with instructions)
3. Claude reads the relevant code, builds a prompt
4. Claude calls the shell script via Bash tool → script invokes the external CLI
5. External AI processes the request and returns results
6. Claude presents the results, adding its own perspective where relevant
7. For review commands, Claude interprets the providers' output (severity, category, source) and appends a structured record to `review-insights.jsonl` for later trend analysis

For details on what data crosses trust boundaries, see [Privacy & Data Flow](#privacy--data-flow).

---

## Tech Stack

| Technology | Purpose | Notes |
|------------|---------|-------|
| Bash | CLI wrapper scripts | Handles binary detection, logging, stdin piping |
| Markdown | Slash command definitions | Claude Code reads these as instructions |
| Claude Code | Orchestrator | Reads commands, dispatches to external CLIs |
| Codex CLI | OpenAI access | Code review and Q&A (model configurable) |
| Gemini CLI | Google access | Research, UI review, Q&A (model configurable) |
| GitHub Actions | CI/CD integration | Automated PR review via REST APIs |

## Project Structure

```
claude-prism/
├── .github/workflows/
│   ├── ai-review.yml             # GitHub Actions workflow for CI review
│   └── shellcheck.yml            # ShellCheck static analysis for shell scripts
├── spec/                         # Standalone specifications
│   └── confidence-scoring-v1.md  # Evidence-based noise filtering framework
├── commands/                     # Slash command definitions (Markdown)
│   ├── pi-ask-codex.md
│   ├── pi-ask-gemini.md
│   ├── pi-askall.md
│   ├── pi-code-review.md
│   ├── pi-fact-check.md
│   ├── pi-multi-review.md
│   ├── pi-plan.md
│   ├── pi-research.md
│   ├── pi-ui-design.md
│   └── pi-ui-review.md
├── scripts/                      # CLI wrappers & utilities (Bash)
│   ├── call-codex.sh             # Codex CLI wrapper
│   ├── call-gemini.sh            # Gemini CLI wrapper
│   ├── detect-domain.sh          # Domain detection for smart routing
│   ├── ci-review.sh              # CI/CD review orchestrator (curl APIs)
│   ├── usage-summary.sh          # API usage statistics
│   └── review-insights.sh        # Review pattern analysis
├── tests/
│   └── smoke-test.sh
├── checksums.sha256              # SHA256 checksums for integrity verification
├── install.sh
├── uninstall.sh
├── README.md
└── README.zh-TW.md
```

Installed to:

```
~/.claude/
├── commands/                  # ← command definitions copied here
├── scripts/                   # ← wrapper scripts copied here
└── logs/
    ├── multi-ai.log           # Call logs (timestamps, prompt/response lengths)
    └── review-insights.jsonl  # Structured review history (auto-recorded)

# Created at runtime by /pi-plan and /pi-research:
.claude/pi-plans/ # ← plan files (project-local, cross-session)
.claude/pi-research/ # ← saved research results (optional)
```

---

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `GEMINI_MODEL` | (CLI default) | Override Gemini model (e.g. `gemini-3-flash-preview`) |
| `GEMINI_MODEL_DEEP` | (falls back to `GEMINI_MODEL`, then CLI default) | Model for heavy-reasoning commands (`/pi-fact-check`, `/pi-research`, `/pi-multi-review`, `/pi-plan`) |
| `CODEX_MODEL` | (CLI default) | Override Codex model (e.g. `gpt-5.3-codex`) |
| `GEMINI_API_KEY` | (none) | Use your own AI Studio API key instead of OAuth ([get one](https://aistudio.google.com/apikey)) |
| `GEMINI_BIN` | (auto-detect) | Path to gemini binary |
| `CODEX_BIN` | (auto-detect) | Path to codex binary |
| `MULTI_AI_LOG_DIR` | `~/.claude/logs` | Log directory |

By default, scripts defer to each CLI's built-in default model — no configuration needed. As CLIs update, you automatically get the latest model. To pin a specific model:

```bash
# Shell profile (~/.zshrc or ~/.bashrc)
export GEMINI_MODEL="gemini-3-flash-preview" # Default: fast, stable
export GEMINI_MODEL_DEEP="gemini-3-pro-preview" # Heavy reasoning: fact-check, research, review, planning
export CODEX_MODEL="gpt-5.3-codex"

# Optional: use your own API key for direct quota control
export GEMINI_API_KEY="your-key-from-ai-studio"

# Or per-invocation via the -m flag
~/.claude/scripts/call-gemini.sh -m gemini-3-flash-preview "your prompt"
```
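
The fallback order in the table (`GEMINI_MODEL_DEEP`, then `GEMINI_MODEL`, then the CLI default) maps directly onto nested parameter expansion. A minimal sketch, assuming a hypothetical helper named `resolve_gemini_model` that is not part of the shipped scripts:

```bash
# Hypothetical helper illustrating the documented fallback order.
# An empty result means "pass no -m flag and let the CLI use its default".
resolve_gemini_model() {
  local mode="$1"   # "deep" for heavy-reasoning commands, anything else otherwise
  if [ "$mode" = "deep" ]; then
    echo "${GEMINI_MODEL_DEEP:-${GEMINI_MODEL:-}}"
  else
    echo "${GEMINI_MODEL:-}"
  fi
}

# Usage:
#   model=$(resolve_gemini_model deep)
#   if [ -n "$model" ]; then echo "would pass: -m $model"; fi
```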

### Script Features

Both wrapper scripts support:

| Feature | Description |
|---------|-------------|
| **Streaming output** | CLI responses stream directly to stdout via `tee` — no buffering. Callers that background the script (e.g. Claude Code auto-backgrounding) can capture output in real time |
| **SIGHUP survival** | Scripts ignore `SIGHUP`, so they survive terminal detach when backgrounded |
| **Background fallback** | Output persisted to `~/.claude/logs/pi-{codex,gemini}-last.out`. All commands include a fallback directive — if the Bash tool was backgrounded or returned empty output, Claude reads the result from the log file |
| **Binary detection** | Searches multiple paths for the CLI binary |
| **Logging** | Every call logged to `~/.claude/logs/multi-ai.log` with timestamps |
| **`--dry-run`** | Test without calling the API (no tokens consumed) |
| **Stdin piping** | `echo "code" \| call-gemini.sh "prompt"` for long inputs |
| **Model override** | `-m model-name` to use a different model |
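
The streaming, SIGHUP, and fallback rows combine into a small pattern. The sketch below shows one way to get that behavior; it is not the code in `call-codex.sh` or `call-gemini.sh`, and the function name `run_and_persist` is hypothetical.

```bash
# Hypothetical sketch of the stream-and-persist pattern.
run_and_persist() {
  local out_file="$1"
  shift
  # Ignore SIGHUP so a terminal detach does not kill the run.
  trap '' HUP
  mkdir -p "$(dirname "$out_file")"
  # tee streams to stdout while keeping a copy for a later fallback read.
  # Note: the pipeline's exit status is tee's, not the command's.
  "$@" 2>&1 | tee "$out_file"
}

# Usage: run_and_persist "$HOME/.claude/logs/example-last.out" some-cli --flag
```

In Bash, `${PIPESTATUS[0]}` read immediately after the pipeline gives the wrapped command's own exit status if the caller needs it.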

### Customization

**Adding a new provider:**

1. Create `scripts/call-newprovider.sh` following the pattern of existing scripts
2. Create `commands/pi-ask-newprovider.md` with the command definition (matching the `pi-` prefix used by the existing commands)
3. Run `./install.sh` to deploy

**Changing the review prompt:**

Edit the command `.md` files in `commands/`. The prompt templates are inline and easy to modify.

---

## Observability

### Usage Summary

Track API call volume and estimated token consumption:

```bash
~/.claude/scripts/usage-summary.sh # today
~/.claude/scripts/usage-summary.sh --week # last 7 days
~/.claude/scripts/usage-summary.sh --all # all time
~/.claude/scripts/usage-summary.sh --date 2026-02-24 # specific date
```

Output includes per-provider call counts, success/error/dry-run breakdown, and a rough token estimate (~4 chars/token).
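
The ~4 chars/token heuristic is simple enough to reproduce by hand. A minimal sketch, assuming a hypothetical `estimate_tokens` helper that reads text on stdin; it counts bytes rather than characters and rounds up, both of which are assumptions rather than documented behavior of `usage-summary.sh`:

```bash
# Rough estimate only: bytes / 4, rounded up.
# Real tokenizers differ per provider, so treat the result as order-of-magnitude.
estimate_tokens() {
  local bytes
  bytes=$(wc -c | tr -d '[:space:]')
  echo $(( (bytes + 3) / 4 ))
}

# Usage: git diff --cached | estimate_tokens
```

By this estimate a 32,000-byte diff comes to roughly 8,000 tokens.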

### Review Insights

After each `/pi-code-review` or `/pi-multi-review`, Claude interprets the providers' output — mapping emoji severity to strings, inferring discovery source, filtering by the confidence threshold — and appends a structured record to `~/.claude/logs/review-insights.jsonl`. The script below reads that file with `jq` for raw counts; ask Claude to layer interpretation on top of the numbers:

```bash
~/.claude/scripts/review-insights.sh # raw counts (no AI interpretation)
~/.claude/scripts/review-insights.sh --recent 10 # last 10 reviews
~/.claude/scripts/review-insights.sh --project my-app # filter by project
```

Output includes:
- **Category distribution** — security, performance, design, logic, etc. (with bar chart)
- **Severity breakdown** — critical / medium / suggestion
- **Discovery source** — consensus vs. single-provider findings
- **Most frequent issues** — recurring patterns highlighted
- **Recent review timeline** — last 5 reviews with issue counts

Each review record follows this schema:

```json
{
  "date": "2026-02-24T10:30:00Z",
  "project": "my-app",
  "scope": "pr",
  "domain": "backend",
  "providers": ["codex", "gemini", "claude"],
  "issues": [
    {
      "category": "security",
      "severity": "critical",
      "confidence": 95,
      "title": "SQL injection in user input handler",
      "source": "consensus"
    }
  ]
}
```

Categories: `security`, `performance`, `design`, `logic`, `maintainability`, `guideline`, `accessibility`, `other`. The `guideline` category tracks violations of project-specific rules (`CLAUDE.md` / `Agents.md`).
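
Because each record is one JSON object per line, ad-hoc queries with `jq` are straightforward. The sketch below counts issues per category, optionally keeping only those at or above a confidence floor. It relies only on the `issues[].category` and `issues[].confidence` fields shown in the schema; the function name `category_counts` is hypothetical and this is not the query used by `review-insights.sh`.

```bash
# Hypothetical query helper; requires jq.
# Prints "<category> <count>" for issues with confidence >= min (default 0).
category_counts() {
  local file="$1" min="${2:-0}"
  jq -r --argjson min "$min" \
    '.issues[] | select(.confidence >= $min) | .category' "$file" \
    | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage:
#   category_counts ~/.claude/logs/review-insights.jsonl
#   category_counts ~/.claude/logs/review-insights.jsonl 90
```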

### Invocation Diagnostics

When something goes wrong — a `/pi-*` command hangs, returns empty output, or a log file is unexpectedly 0 bytes — `analyze-log.sh` groups `multi-ai.log` events by pid and classifies each invocation's outcome:

```bash
~/.claude/scripts/analyze-log.sh # analyze the default log
~/.claude/scripts/analyze-log.sh /path/to/log # inspect a specific log file
```

Each invocation falls into one of four categories:

- **SUCCESS** — completed normally
- **ERROR** — the CLI returned non-zero (error class shown: `TIMEOUT`, `RATE_LIMIT`, `AUTH_ERROR`, `PERMISSION`, `SANDBOX`, `NETWORK`, `CLI_ERROR`, `CLI_NOT_FOUND`)
- **SIGNAL** — the script caught `HUP` / `INT` / `TERM` mid-run (the stage the script died at is recorded)
- **SILENT** — invoked but produced no completion event — the signature of `SIGKILL`, commonly caused by Claude Code's Bash tool auto-backgrounding a call and killing its child

Silent deaths are the most actionable diagnostic signal: if you see one, your command was terminated before it could finish. The `pi-*` commands shipped in v0.12.3+ include a "Bash invocation rules" preamble that tells Claude to call scripts in foreground synchronous mode, bypassing this failure mode. See [CHANGELOG.md](CHANGELOG.md) for the full regression writeup.

### Cache TTL Behavior

Claude Code currently uses a 5-minute prompt cache TTL for all subscribers — Pro and Max alike. When a `/pi-*` command takes longer than 5 minutes to return (Codex or Gemini occasionally runs past this window on complex tasks), the next turn in Claude Code pays full input price instead of reading from cache at the usual 10x discount.

This isn't a claude-prism bug — it reflects a Claude Code-wide shift from a 1-hour default back to 5 minutes around 2026-03-08 (see [GitHub issue #46829](https://github.com/anthropics/claude-code/issues/46829)). Anthropic's official [prompt caching documentation](https://platform.claude.com/docs/en/build-with-claude/prompt-caching) does not gate TTL by subscription tier; community claims that Max subscribers automatically receive a 1-hour TTL remain unverified.

In practice, this overhead hasn't produced noticeable cost spikes for claude-prism users so far. If your usage pattern changes that, file an issue.

---

## CI/CD Integration

Automate multi-provider reviews on every PR via GitHub Actions. The CI path uses REST APIs directly (no CLI installation needed on runners).

### Quick Setup

1. Copy the workflow file to your project:

```bash
mkdir -p .github/workflows
cp path/to/claude-prism/.github/workflows/ai-review.yml .github/workflows/
cp path/to/claude-prism/scripts/ci-review.sh scripts/
```

2. Add API keys as GitHub Secrets (at least one required):

| Secret | Provider | Required? |
|--------|----------|-----------|
| `GEMINI_API_KEY` | Gemini review | Optional |
| `OPENAI_API_KEY` | OpenAI review | Optional |
| `ANTHROPIC_API_KEY` | Claude synthesis | Optional |

3. Add the `ai-review` label to a PR to trigger the review.

### Trigger Modes

**Label trigger (default):** Add `ai-review` label to a PR → workflow runs. Best for cost control.

**Auto trigger:** Uncomment the `pull_request: [opened, synchronize]` block in the workflow file → runs on every PR update.

### How It Works (CI)

1. GitHub Actions checks out the PR and fetches the diff
2. `ci-review.sh` auto-discovers `CLAUDE.md` / `Agents.md` for guideline context
3. Historical PR comments on the same files are queried via GraphQL (single API call) and included as recurring-issue context
4. Diff is sent to available providers (Gemini API, OpenAI API) in parallel, with inline annotation compliance checks and false-positive exclusion rules
5. If `ANTHROPIC_API_KEY` is set, Claude synthesizes with confidence scoring (only issues ≥ 80 posted)
6. If not, results are concatenated directly
7. If the review includes concrete code fixes, they are posted as **inline PR review comments** with GitHub suggestion blocks (one-click accept). Remaining output is posted as the review body. Falls back to a regular PR comment if the Reviews API is unavailable
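
The branching in steps 4 to 6 depends only on which secrets are set. A minimal sketch of that decision, using hypothetical function names; the real `ci-review.sh` may structure this differently:

```bash
# Hypothetical sketch of the provider and output-mode selection.
review_providers() {
  local providers=""
  if [ -n "${GEMINI_API_KEY:-}" ]; then providers="$providers gemini"; fi
  if [ -n "${OPENAI_API_KEY:-}" ]; then providers="$providers openai"; fi
  echo "${providers# }"   # strip the leading space
}

output_mode() {
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    echo "synthesize"     # Claude merges results and applies confidence scoring
  else
    echo "concatenate"    # provider results are joined as-is
  fi
}

# Usage: echo "reviewing with: $(review_providers), mode: $(output_mode)"
```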

### CI Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `GEMINI_MODEL` | `gemini-2.0-flash` | Gemini model for CI review |
| `OPENAI_MODEL` | `gpt-4o` | OpenAI model for CI review |
| `ANTHROPIC_MODEL` | `claude-sonnet-4-20250514` | Claude model for synthesis |
| `MAX_DIFF_CHARS` | `32000` | Diff truncation limit |
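
For a sense of what `MAX_DIFF_CHARS` does, here is a minimal truncation sketch. The function name `truncate_diff` is hypothetical, and `head -c` counts bytes, so the actual script may cut at a slightly different point or add a truncation notice.

```bash
# Hypothetical sketch: cap a diff read from stdin at MAX_DIFF_CHARS bytes.
truncate_diff() {
  head -c "${MAX_DIFF_CHARS:-32000}"
}

# Usage: git diff origin/main...HEAD | truncate_diff
```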

### Security Notes

- **Fork PRs**: The workflow uses `pull_request` (not `pull_request_target`), so fork PRs cannot access your secrets. This is intentional — fork PRs are skipped.
- **API keys**: Use GitHub repository secrets. Never commit API keys to the repo.
- **Concurrency**: Only one review runs per PR at a time; new pushes cancel in-progress reviews.
- **Checksums**: `checksums.sha256` verifies file integrity against transit corruption or accidental modification. It does **not** protect against a compromised repository — for that, verify against the GitHub Release artifacts page.

### CLI Version Compatibility

The wrapper scripts depend on specific CLI behaviors that are not part of official stable APIs:

| CLI | Behavior Used | Verified Range |
|-----|---------------|----------------|
| Gemini CLI | `-p " "` headless mode (stdin + prompt) | v0.1.x – v0.3.x |
| Codex CLI | `codex exec - ` stdin mode | v0.100.x – v0.106.x |

If a CLI update breaks functionality, pin the working version or open an issue. For the March 25, 2026 Gemini service update, see the [notice in Prerequisites](#prerequisites).

---

## Cost Estimation

claude-prism is a local wrapper — it does not process or bill tokens itself. Each command may trigger one or more API calls to external providers (Codex, Gemini) via their CLIs. Claude Code's own orchestration tokens (reading files, building prompts, synthesizing results) are separate and covered by your Claude subscription or API plan.

### Token Consumption by Command

| Command | External Calls | Typical Input Tokens | Typical Output Tokens | Notes |
|---------|---------------|---------------------|----------------------|-------|
| `/pi-ask-codex` | 1 (Codex) | 500–2K | 500–2K | Scales with question complexity |
| `/pi-ask-gemini` | 1 (Gemini) | 500–2K | 500–2K | Scales with question complexity |
| `/pi-askall` | 2 (Codex + Gemini) | 500–2K each | 500–2K each | Both providers called in parallel |
| `/pi-fact-check` | 1 (Gemini) + N (WebSearch) | 1K–5K | 2K–8K | Scales with number of claims; WebSearch runs in parallel |
| `/pi-code-review` | 1 (Codex) | 2K–10K | 1K–4K | Scales with diff size |
| `/pi-ui-review` | 1 (Gemini) | 2K–10K | 1K–4K | Scales with file count |
| `/pi-ui-design` | 1 (Gemini) | 1K–3K | 3K–8K | Output-heavy (HTML generation) |
| `/pi-research` | 1 (Gemini) + 2–4 (WebSearch) | 1K–5K | 2K–8K | Dual-track search; scales with topic complexity |
| `/pi-multi-review` | 2 (Codex + Gemini) | Above ×2 | Above ×2 | Both providers called in parallel |
| `/pi-plan` | 0–2 (optional) | 1K–5K each | 1K–4K each | Providers consulted only if available |

Token ranges are approximate and vary with input size (diff length, file count, question complexity). Different providers use different tokenization methods; these figures are order-of-magnitude estimates, not billing-accurate counts.
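
To turn a token range into a rough dollar figure, multiply each token count by the provider's per-million-token rate and add the two results. The sketch below is illustrative: `estimate_cost_usd` is a hypothetical helper, and the rates in the example are placeholders, not real prices.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of claude-prism).
# Usage: estimate_cost_usd INPUT_TOKENS OUTPUT_TOKENS INPUT_RATE OUTPUT_RATE
# Rates are USD per one million tokens; look them up on the provider's pricing page.
estimate_cost_usd() {
  awk -v in_tok="$1" -v out_tok="$2" -v in_rate="$3" -v out_rate="$4" \
    'BEGIN { printf "%.4f\n", (in_tok * in_rate + out_tok * out_rate) / 1000000 }'
}

# Upper end of the /pi-code-review row: 10K input tokens, 4K output tokens.
# The rates 1.25 and 10.00 are placeholders, NOT actual provider prices.
estimate_cost_usd 10000 4000 1.25 10.00   # prints 0.0525
```

Because the token counts themselves are estimates, treat the result as an order-of-magnitude figure as well.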

### Controlling Costs

- **`--dry-run`** — test the request path without calling the provider (no tokens consumed)
- **`usage-summary.sh`** — review historical call counts and rough token volume:
  ```bash
  ~/.claude/scripts/usage-summary.sh --week
  ```
- **Provider pricing** — check current rates at your provider's pricing page:
  - [OpenAI API Pricing](https://openai.com/api/pricing/)
  - [Google AI Pricing](https://ai.google.dev/pricing)

---

## Supply Chain Security

claude-prism takes supply chain security seriously. In 2025–2026, `postinstall` scripts became the [#1 attack vector](https://snyk.io/articles/npm-security-best-practices-shai-hulud-attack/) for npm supply chain attacks, from the Shai-Hulud worm (800+ packages infected) to the [Axios RAT compromise](https://www.microsoft.com/en-us/security/blog/2026/04/01/mitigating-the-axios-npm-supply-chain-compromise/) (North Korean state actor). We designed our install pipeline so that nothing executes automatically at install time.

### Defense Layers

| Layer | Protects Against | How |
|-------|-----------------|-----|
| **No `postinstall` scripts** | Silent execution on install | Nothing runs at install time, so pnpm/Bun (which block lifecycle scripts by default) install cleanly and Socket.dev raises no flags |
| **Explicit user execution** | Unauthorized operations | `npx` or `claud-prism-aireview` requires deliberate action |
| **SHA256 checksums** | File tampering in transit | `install.sh` verifies every file before deploying; aborts on mismatch |
| **npm OIDC provenance** | Account hijacking | Packages can only be published from GitHub Actions CI, not manually |
| **Pre-install backup** | Accidental overwrites | Existing files backed up to `~/.claude/.multi-ai-backup-*` |

### Verify Integrity Yourself

```bash
# After cloning, verify all files match their checksums:
cd claude-prism
shasum -a 256 -c checksums.sha256

# Check npm package provenance:
npm audit signatures
```

---

## Privacy & Data Flow

claude-prism is a local Bash wrapper, not a hosted proxy or relay service. There is no intermediary server between your machine and the AI providers.

### Data Flow

```mermaid
sequenceDiagram
participant L as Your Machine
participant C as Claude Code
participant S as claude-prism scripts
participant P as Provider API (Google / OpenAI)

L->>C: User runs /pi-command
C->>C: Reads files, builds prompt
C->>S: Passes prompt + code context
Note over S: Logs metadata locally (timestamps, lengths only)
S->>P: HTTPS via provider CLI
P-->>S: AI response
S-->>C: Returns output
C-->>L: Presents results
```

### What Gets Sent to External Providers

- Code snippets, diffs, or file contents relevant to your command
- The prompt assembled by Claude Code (review instructions, context)
- Model selection metadata (model name, flags)

### What Stays Local

- **Logs**: `~/.claude/logs/multi-ai.log` records metadata only (timestamps, prompt/response byte lengths) — no code content
- **Review history**: `~/.claude/logs/review-insights.jsonl` — one structured JSON line per review. Each record is Claude's interpretation of the providers' output (category, severity, confidence, source, and an issue title derived from the AI response), not a raw transcript
- **Plans and research**: `.claude/pi-plans/` and `.claude/pi-research/` files stay on your machine
- **No telemetry**: claude-prism has no analytics, no phone-home, no intermediary server
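
Metadata-only logging can be pictured as follows. This is a sketch of the idea, not the project's actual logger: the function name `log_call_metadata` and the field layout are assumptions. Only the principle (record lengths, never content) comes from the description above.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; the real multi-ai.log format may differ.
# Usage: log_call_metadata LOG_FILE PROVIDER MODEL PROMPT RESPONSE
log_call_metadata() {
  local log_file="$1" provider="$2" model="$3" prompt="$4" response="$5"
  local prompt_bytes response_bytes

  # wc -c counts bytes; tr strips the padding some platforms add.
  prompt_bytes=$(printf '%s' "$prompt" | wc -c | tr -d ' ')
  response_bytes=$(printf '%s' "$response" | wc -c | tr -d ' ')

  # Only sizes and a UTC timestamp are written; prompt and response text are not.
  printf '%s provider=%s model=%s prompt_bytes=%s response_bytes=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$provider" "$model" \
    "$prompt_bytes" "$response_bytes" >> "$log_file"
}
```

A quick way to confirm the same property on a real installation is to search the log for a distinctive string from a recent prompt; it should not appear.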

### What We Don't Control

Each provider's data handling is governed by their own API/business terms, not by claude-prism:

- **Data retention** — whether and how long providers store your prompts/responses
- **Model training** — whether your data is used to improve their models (API terms typically exclude this, but verify your specific tier)
- **Sub-processors** — the cloud infrastructure providers use (AWS, Google Cloud, Azure)

Provider terms:
- [Anthropic Commercial Terms](https://www.anthropic.com/policies/commercial-terms)
- [OpenAI API Terms](https://openai.com/policies/row-terms-of-use/)
- [Google AI Terms](https://ai.google.dev/gemini-api/terms)

> **For regulated or confidential projects**: If your codebase is subject to HIPAA, SOC 2, NDA, or similar compliance requirements, verify the full chain — Claude Code terms, provider API terms, data retention settings, and your organization's internal approval process — before sending code to external APIs.

---

## FAQ

**Q: Does Claude actually call the external CLIs, or does it fake the results?**

You can verify this yourself. With logging enabled (the default), every external call is recorded in `~/.claude/logs/multi-ai.log` with a timestamp, the model name, and the prompt/response lengths.

**Q: What if I only have Gemini CLI installed?**

That's fine. All commands degrade gracefully with **structured error diagnostics**: if a provider is unavailable, the failure message includes the specific reason (TIMEOUT, RATE_LIMIT, AUTH_ERROR, PERMISSION, SANDBOX, NETWORK, CLI_ERROR, or CLI_NOT_FOUND) and suggests an alternative command. `/pi-multi-review` and `/pi-askall` will use Claude plus the available provider (two perspectives instead of three). `/pi-code-review` will fall back to a Claude-only review with a caveat note and suggest `/pi-multi-review`. `/pi-fact-check` and `/pi-research` both run dual-track (Gemini and WebSearch simultaneously); if both tracks fail, they fall back to Claude's training data.
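
A failure classifier along these lines could map a CLI's exit code and stderr onto the reason codes listed above. The stderr patterns are illustrative guesses, not the project's actual matching rules, and `classify_failure` is a hypothetical name. Exit codes 124 (GNU `timeout` expiry) and 127 (command not found) are standard shell conventions.

```bash
#!/usr/bin/env bash
# Illustrative only: the stderr patterns below are assumptions, not claude-prism's rules.
# Usage: classify_failure EXIT_CODE STDERR_TEXT
classify_failure() {
  local code="$1" text
  # Lowercase via tr because ${var,,} is unavailable in Bash 3.2.
  text=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')

  case "$code" in
    127) echo "CLI_NOT_FOUND"; return ;;  # shell could not find the command
    124) echo "TIMEOUT"; return ;;        # exit status used by GNU timeout
  esac

  case "$text" in
    *"rate limit"*|*429*)                         echo "RATE_LIMIT" ;;
    *unauthorized*|*"api key"*|*401*)             echo "AUTH_ERROR" ;;
    *"permission denied"*)                        echo "PERMISSION" ;;
    *sandbox*)                                    echo "SANDBOX" ;;
    *"could not resolve"*|*"connection refused"*) echo "NETWORK" ;;
    *)                                            echo "CLI_ERROR" ;;
  esac
}
```

Checking the exit code before the message text keeps the two unambiguous cases (missing binary, timeout) from being misread when stderr is empty.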

**Q: What if a provider returns an unexpected format?**

Claude handles it. If Codex or Gemini doesn't follow the requested emoji/score format, Claude extracts actionable insights from the raw text using semantic matching rather than format parsing. Scores show "—" in the comparison table when not provided.

**Q: How much does this cost?**

See the [Cost Estimation](#cost-estimation) section for per-command token consumption ranges and cost control tools.

**Q: What if Gemini CLI keeps timing out or responds very slowly?**

This is most likely caused by rate limiting on the Pro model. There are two fixes: (1) Set `GEMINI_MODEL="gemini-3-flash-preview"`. Flash is faster and scores higher on coding benchmarks (SWE-bench: Flash 78% vs Pro 76.2%). (2) Set `GEMINI_API_KEY` to your own [AI Studio](https://aistudio.google.com/apikey) key for self-managed quota. See [Gemini CLI Service Update](#prerequisites).
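
Both settings can be applied as environment variables, for example in a shell profile. This assumes the wrapper scripts and Gemini CLI pick the values up from the environment; the key shown is a placeholder.

```bash
# Switch to the Flash model named above.
export GEMINI_MODEL="gemini-3-flash-preview"

# Optional: use your own AI Studio key for self-managed quota.
# Placeholder value; keep real keys out of any file you commit.
# export GEMINI_API_KEY="your-ai-studio-key"
```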

**Q: Can I use this with other Claude Code setups?**

Yes. The commands and scripts are standalone — they only depend on `~/.claude/` directory conventions that Claude Code uses.

---

## Reflections

*First entry — 2026-03*

In the age of AI-assisted coding, most developers have access to the "big three" CLIs: Claude, Codex, and Gemini. After subscribing to Claude Code, I kept thinking: since I already have a powerful orchestrator at hand, why not leverage other providers' CLIs at the same time? Whether it's code review, technical research, or UI/UX design, having different AIs approach the same problem from different angles yields more comprehensive results than any single source.

I looked around, but the existing tools I found were either too heavy or didn't integrate well with Claude Code's workflow. So I decided to build my own.

It started as a few simple wrapper scripts to handle everyday review tasks. But as I kept building, more possibilities emerged: triple-provider adversarial review, review trend analysis, CI/CD automation... None of these were in the original plan, yet each one felt genuinely useful.

So here we are. I hope this tool helps you too.

*Update — 2026-04-03*

OpenAI released [codex-plugin-cc](https://github.com/openai/codex-plugin-cc). Honestly, it felt like a gut punch at first: wait, they're officially building a Codex plugin for Claude Code themselves?

But after studying it closely, I calmed down. The architectures are fundamentally different: theirs is a heavy-duty Node.js + JSONRPC + Unix socket broker setup; ours is a lightweight Bash shell script approach. The positioning is different too — they're deep single-provider integration, we're cross-provider orchestration. There's no replacement story here.

That said, one thing in their repo really caught my eye: `adversarial-review`. Instead of casting the AI as a neutral "Senior Reviewer," they explicitly tell it: "your job is to break confidence in this change, not to validate it." Seven attack surface categories, every finding must answer four questions, calibration rules demanding "prefer one strong finding over several weak ones" — this design philosophy gave me a lot of inspiration.

After all, we're just Vibe Coders — standing on the shoulders of giants to see a little further is nothing but a good thing. So I went ahead and reworked the prompt architecture for `pi-code-review` and `pi-multi-review`: adversarial stance, divided attack surfaces (Codex attacks security, Gemini attacks design/UX), finding bars, and calibration rules. The concepts were borrowed, but once they were fused with our existing confidence scoring framework and domain-aware weighting, they became something of our own.

That's the beauty of open source, I suppose.

*Update — 2026-04-03 (afternoon)*

Today I ran `npm install -g` on claude-prism myself, fully confident it was done — only to find that none of the commands had actually been deployed to `~/.claude/`. Turns out `npm install -g` only places the binary in your PATH; you still need to run `claud-prism-aireview` once to actually deploy everything.

This is probably the easiest pitfall for a Vibe Coder to stumble into: assuming `npm install` equals "fully installed." I even considered adding a `postinstall` script to auto-trigger `install.sh` and skip that extra step. But after digging into it, I learned that `postinstall` is now the #1 entry point for npm supply chain attacks — the March 2026 Axios incident (a North Korean state actor planted a RAT via postinstall; it was live for under three hours but reached millions of environments) is a sobering example. pnpm v10 and Bun now block all lifecycle scripts by default, and the entire ecosystem is systematically moving away from them.

So the final approach: no `postinstall`. Instead, make the README crystal clear that the two-step install is a deliberate security design, not laziness. In the process, I also realized we already had SHA256 checksum verification and npm OIDC provenance in place — the full security chain was more robust than I thought.

The upside of being a beginner is that every pitfall forces you to actually understand *why* things are designed a certain way, rather than just copying patterns without knowing the reason.

---

## Contributing

We welcome contributions! Whether it's bug fixes, new provider integrations, prompt improvements, or documentation — check out our [Contributing Guide](CONTRIBUTING.md) to get started.

This project follows the [Contributor Covenant Code of Conduct](CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code.

**Quick links:**

- [Development setup](CONTRIBUTING.md#getting-started)
- [Code standards](CONTRIBUTING.md#code-standards) (Bash 3.2+, ShellCheck, conventional commits)
- [Submitting a PR](CONTRIBUTING.md#submitting-changes)
- [Reporting issues](CONTRIBUTING.md#reporting-issues)

---

## Acknowledgments

`/pi-fact-check` started by referencing the `super-fact-checker` methodology from [newtype-os](https://github.com/newtype-01/newtype-os) (whose author is a YouTuber I really enjoy). We later realized borrowing someone else's framework didn't feel right, so we rewrote the whole thing from scratch with our own approach: cross-provider verification, adversarial evidence, and a simplified verdict system. Huge thanks for the starting point.

---

## Changelog

See [CHANGELOG.md](CHANGELOG.md) for the full version history.

---

## License

This project is licensed under [MIT](LICENSE).

---

## Author

**tznthou** — [tznthou.com](https://tznthou.com) · [service@tznthou.com](mailto:service@tznthou.com)