https://github.com/geminimir/promptproof

Deterministic LLM testing for CI. Record→Replay + policy-as-code to block risky merges (PII, schema drift, regressions, budget creep).
https://github.com/geminimir/promptproof
anthropic ci github-actions guardrails llm openai prompt rag security testing
Last synced: 11 months ago
JSON representation
Deterministic LLM testing for CI. Record→Replay + policy-as-code to block risky merges (PII, schema drift, regressions, budget creep).
Host: GitHub
URL: https://github.com/geminimir/promptproof
Owner: geminimir
License: mit
Created: 2025-08-09T15:52:23.000Z (12 months ago)
Default Branch: main
Last Pushed: 2025-09-02T03:55:38.000Z (11 months ago)
Last Synced: 2025-09-02T05:36:24.580Z (11 months ago)
Topics: anthropic, ci, github-actions, guardrails, llm, openai, prompt, rag, security, testing
Language: TypeScript
Homepage: https://github.com/geminimir/promptproof?tab=readme-ov-file#-quick-start
Size: 382 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 15
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
- Roadmap: ROADMAP.md
Awesome Lists containing this project

README

          # PromptProof

Deterministic LLM testing for production reliability.  

**Record→Replay + policy‑as‑code** to catch PII leaks, schema drift, and behavioral regressions **before merge**.



  

    

  

   

  

    

  



[![CI](https://img.shields.io/github/actions/workflow/status/geminimir/promptproof/promptproof.yml?branch=main)](https://github.com/geminimir/promptproof/actions)

[![Action](https://img.shields.io/badge/Marketplace-promptproof--action-blue?logo=github)](https://github.com/marketplace/actions/promptproof-eval)

[![npm (CLI)](https://img.shields.io/npm/v/promptproof-cli?label=promptproof-cli)](https://www.npmjs.com/package/promptproof-cli)

[![npm (SDK)](https://img.shields.io/npm/v/promptproof-sdk-node?label=sdk--node)](https://www.npmjs.com/package/promptproof-sdk-node)

![node](https://img.shields.io/badge/node-%3E%3D18-brightgreen)

![license](https://img.shields.io/badge/license-MIT-green)

[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](#contributing)

## Try it in 60s 🚀

```bash

# Clone and try the example (expect a failure)

git clone https://github.com/geminimir/promptproof.git

cd promptproof

corepack enable && corepack prepare pnpm@9 --activate

pnpm i

pnpm run try:example

```

This runs PromptProof against a deliberately failing JSON output. **No API calls, no setup** - just pure validation.

Then fix it:

```bash

pnpm run fix:example  # Now it passes!

```

## Quickstart

### Run locally

```bash

npx promptproof-cli@latest eval -c promptproof.yaml --out report

```

### GitHub Action

```yaml

# .github/workflows/promptproof.yml

name: PromptProof

on: [pull_request]

permissions: { contents: read, pull-requests: write, security-events: write }

jobs:

  eval:

    runs-on: ubuntu-latest

    steps:

      - uses: actions/checkout@v4

      - uses: geminimir/promptproof-action@v0

        with:

          config: promptproof.yaml

          format: sarif  # optional: upload to Code Scanning

```

### One‑line recording (Node)

```ts

import OpenAI from 'openai'

import { withPromptProofOpenAI } from 'promptproof-sdk-node/openai'

const ai = withPromptProofOpenAI(new OpenAI({ apiKey: process.env.OPENAI_API_KEY }), { suite: 'support-replies' })

```

This writes sanitized JSONL lines to `fixtures//outputs*.jsonl` for deterministic CI replay. No network calls during CI.

## Guarantees

- **Deterministic CI:** replay fixtures offline; **zero** network calls in CI.

- **Safety by default:** PII redaction on (emails/phones). The SDK never blocks your app if recording fails.

- **Provider‑agnostic:** we evaluate outputs, not vendors.

## Why not just JSON schema at runtime?

- We enforce **pre‑merge** gates (block risky PRs), not best‑effort runtime checks.

- **Replay** of real outputs removes flakiness (no live model calls in CI).

- **Budgets** catch cost/latency creep alongside quality rules.

## Examples

- Demo app: `examples/node-support-bot/`

- Fixtures: `fixtures/` (support replies, RAG, tool calls)

- Failure Zoo: `zoo/` — real cases with copy‑pasteable rules

## What it looks like

![Red→Green PR demo](./docs/assets/red-green.gif)

## 🎯 Key Features

- **Deterministic Replay**: Test against recorded LLM outputs with zero network calls in CI

- **Comprehensive Assertions**: JSON schemas, regex patterns, numeric bounds, string operations, list/set equality, file diffs, and custom checks

- **Regression Testing**: Snapshot baselines and automatic comparison to catch new failures and performance degradation

- **Cost Controls**: Budget gates for total cost, per-test cost, and latency with regression tracking

- **Flake Management**: Seed control and multiple runs with stability scoring for non-deterministic checks

- **CI/CD Integration**: GitHub Action that fails PRs on violations with detailed reporting

- **Provider Agnostic**: Works with OpenAI, Anthropic, and any HTTP-based LLM API

- **Privacy First**: Built-in PII redaction and offline evaluation

## 📋 Requirements

- Node.js >= 18.0.0

- npm >= 8.0.0

## 🚀 Quick Start

### Install Packages

```bash

# Install CLI for evaluation

npm install -g promptproof-cli@beta

# Install SDK for recording (in your project)

npm install promptproof-sdk-node@beta

```

### Initialize in Your Project

```bash

# Initialize project structure

promptproof init --suite support-replies

```

This creates:

- `promptproof.yaml` - Policy configuration

- `fixtures/` - Directory for recorded outputs

- `.github/workflows/promptproof.yml` - GitHub Action workflow

## 📝 Record → Replay Workflow

### Step 1: Record LLM Outputs (One Line Change!)

#### OpenAI Integration

```javascript

import OpenAI from 'openai'

import { withPromptProofOpenAI } from 'promptproof-sdk-node/openai'

const base = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

export const ai = withPromptProofOpenAI(base, { suite: 'support-replies' })

// Use normally - fixtures are recorded automatically

const response = await ai.chat.completions.create({

  model: 'gpt-4',

  messages: [{ role: 'user', content: 'Hello!' }]

})

```

#### Anthropic Integration

```javascript

import Anthropic from '@anthropic-ai/sdk'

import { withPromptProofAnthropic } from 'promptproof-sdk-node/anthropic'

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })

export const claude = withPromptProofAnthropic(anthropic, { suite: 'rag-answers' })

// Use normally - fixtures are recorded automatically

const response = await claude.messages.create({

  model: 'claude-3-sonnet-20240229',

  max_tokens: 1000,

  messages: [{ role: 'user', content: 'Hello!' }]

})

```

#### Generic HTTP Integration

```javascript

import { wrapFetch } from 'promptproof-sdk-node/http'

// Wrap global fetch to record any LLM API calls

globalThis.fetch = wrapFetch(globalThis.fetch, { suite: 'generic-llm' })

// All fetch calls to LLM APIs are automatically recorded

const response = await fetch('https://api.openai.com/v1/chat/completions', {

  method: 'POST',

  headers: { 'Authorization': `Bearer ${apiKey}` },

  body: JSON.stringify({ model: 'gpt-4', messages: [...] })

})

```

### Step 2: Fixtures Are Created Automatically

Each LLM call creates a sanitized record in `fixtures//outputs.jsonl`:

```json

{

  "schema_version": "pp.v1",

  "id": "auto-generated",

  "timestamp": "2024-08-10T12:34:56Z",

  "source": "dev",

  "input": {

    "prompt": "user: Hello!\nassistant: Hi there!",

    "params": { "model": "gpt-4", "temperature": 0.7 }

  },

  "output": { "text": "Hello! How can I help you today?" },

  "metrics": { "latency_ms": 812, "cost_usd": 0.0012 },

  "redaction": { "status": "sanitized" }

}

```

### Step 3: Define Your Contracts

```yaml

# promptproof.yaml

schema_version: pp.v1

fixtures: fixtures/support-replies/outputs.jsonl

checks:

  - id: no_pii

    type: regex_forbidden

    target: output.text

    patterns:

      - "[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}"  # No emails

      - "\\b\\+?\\d[\\d\\s().-]{7,}\\b"           # No phone numbers

  

  - id: response_schema

    type: json_schema

    target: output.json

    schema:

      type: object

      required: [status, message]

      properties:

        status: { type: string, enum: [success, error] }

        message: { type: string }

  

  - id: contains_disclaimer

    type: string_contains

    target: output.text

    expected: "We cannot guarantee"

    ignore_case: true

  

  - id: response_list_exact

    type: list_equality

    target: output.json.items

    expected: ["step1", "step2", "step3"]

    order_sensitive: true

budgets:

  cost_usd_per_run_max: 0.50

  cost_usd_total_max: 10.00  # Total cost gate

  latency_ms_p95_max: 2000

  cost_usd_total_pct_increase_max: 10  # Max 10% cost increase vs baseline

mode: warn  # Start with 'warn', switch to 'fail' after validation

```

### Step 4: Evaluate Against Contracts

```bash

# Local evaluation

promptproof eval -c promptproof.yaml

# With regression comparison against baseline

promptproof eval -c promptproof.yaml --regress

# With flake controls for non-deterministic checks

promptproof eval -c promptproof.yaml --seed 42 --runs 3

# Create a snapshot after successful run

promptproof snapshot promptproof.yaml --promote

# In CI (automatic via GitHub Action)

# Runs on every PR and blocks merge on violations

```

## ⚙️ Environment Variables

Control recording behavior with environment variables:

| Variable | Default | Description |

|----------|---------|-------------|

| `PP_RECORD` | `1` (dev), `0` (prod) | Master on/off switch |

| `PP_SAMPLE_RATE` | `1.0` | Record 0-100% of calls |

| `PP_SUITE` | from options | Override suite name |

| `PP_OUT` | `fixtures` | Custom output directory |

| `PP_SOURCE` | `NODE_ENV` | Environment label |

| `PP_SHARD_BY_PID` | `0` | Write to `outputs..jsonl` |

## 🔒 Safety & Privacy

- ✅ **Redaction ON by default** - emails, phones, SSNs masked

- ✅ **Never blocks your app** - recording failures are logged, not thrown

- ✅ **No secrets recorded** - API keys and auth headers excluded

- ✅ **Deterministic** - same input = same output (ignoring timestamp/id)

- ✅ **Production ready** - sampling controls, PID sharding for concurrency

## 📊 Example Output

```

✓ Evaluated 142 fixtures

✗ 3 violations found:

  [no_pii] Record #47: Found forbidden pattern (email) at output.text

  [response_schema] Record #89: Missing required field 'status'

  [latency_budget] P95 latency: 2341ms exceeds limit of 2000ms

📊 Regression Comparison

Baseline: 2024-01-15-stable

⚠ 2 new failures:

  • [string_contains] test-102: Expected string "disclaimer" not found

  • [cost_budget] Total cost $12.50 exceeds budget $10.00

✓ 1 fixed failures

↔ 0 unchanged failures

Cost & Performance:

Cost: ↑ $2.50 (25.0%)

P95 Latency: ↑ 341ms

Exit code: 1

```

## 🏗️ Architecture

```

App/Service → SDK Wrapper → fixtures/*.jsonl

                    ↓

Developer → PR → GitHub Action → CLI eval → Report → Pass/Fail Gate

```

## 🔧 CLI Commands

```bash

promptproof eval        # Run contract checks on fixtures

  --regress             # Compare against baseline snapshot

  --seed             # Set seed for non-deterministic checks

  --runs             # Run non-deterministic checks k times

promptproof snapshot   # Create evaluation snapshot

  --promote             # Promote to baseline

  --tag           # Custom snapshot tag

promptproof init        # Initialize project with templates

promptproof promote     # Convert logs to fixture format

promptproof redact      # Remove PII from fixtures

promptproof validate    # Validate fixture schema

```

> **Note**: Use `npx promptproof-cli@beta` or install globally with `npm install -g promptproof-cli@beta`

> 

> **SDK**: Install `promptproof-sdk-node@beta` in your project for automatic fixture recording

## 📦 Packages

### Core Packages

- **`promptproof-cli@beta`**: Command-line interface for evaluation

- **`promptproof-sdk-node@beta`**: SDK wrappers for OpenAI, Anthropic, HTTP

### Architecture

- **CLI**: Evaluates pre-recorded fixtures against contracts

- **SDK**: Automatically records LLM interactions to fixtures

- **Workflow**: SDK records → CLI evaluates → CI gates

### Coming Soon

- **`@promptproof/action`**: GitHub Action for CI integration

- **`@promptproof/evaluator`**: Core evaluation engine (bundled in CLI)

## 🎪 Failure Zoo

Browse real-world LLM failure cases in our [Failure Zoo](./zoo) - anonymized production incidents with patterns and mitigations.

## 🎭 Demo Project

See our [demo project](./promptproof-demo-project) for a complete working example:

- **Realistic LLM application** with support & RAG endpoints

- **SDK integration** with automatic fixture recording

- **CLI validation** with intentional failure modes

- **CI/CD integration** via GitHub Actions

- **Red → Green demonstrations** showing PromptProof in action

## Support & Community

- Issues: https://github.com/geminimir/promptproof/issues

- Discussions: [GitHub Discussions](https://github.com/geminimir/promptproof/discussions)

## 🤝 Contributing

We welcome contributions! See [CONTRIBUTING.md](./CONTRIBUTING.md) for guidelines.

## 📄 License

MIT License - see [LICENSE](./LICENSE) for details.

## 🔗 Links

- [Website](https://promptproof.io)

- [GitHub Repository](https://github.com/geminimir/promptproof)

- [CLI Package](https://www.npmjs.com/package/promptproof-cli) (`@beta`)

- [SDK Package](https://www.npmjs.com/package/promptproof-sdk-node) (`@beta`)

- [GitHub Action](https://github.com/geminimir/promptproof-action)
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/geminimir/promptproof

Awesome Lists containing this project

README