https://github.com/piyushgupta344/llm-test-harness

Deterministic testing framework for LLM-powered apps — record/replay cassettes, eval scoring, regression testing
https://github.com/piyushgupta344/llm-test-harness

anthropic cassette deterministic eval llm openai prompt-testing testing vcr

Last synced: about 2 months ago
JSON representation

Deterministic testing framework for LLM-powered apps — record/replay cassettes, eval scoring, regression testing

Host: GitHub
URL: https://github.com/piyushgupta344/llm-test-harness
Owner: piyushgupta344
License: mit
Created: 2026-03-16T07:10:48.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-03-16T15:37:52.000Z (3 months ago)
Last Synced: 2026-03-16T20:48:36.704Z (3 months ago)
Topics: anthropic, cassette, deterministic, eval, llm, openai, prompt-testing, testing, vcr
Language: TypeScript
Size: 112 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md

Awesome Lists containing this project

README

          # llm-test-harness

Deterministic testing framework for LLM-powered apps. Record real API responses once, replay them forever — no flaky tests, no wasted API calls.

## Packages

| Package | Language | Registry |

|---------|----------|----------|

| [`packages/llm-test-harness`](./packages/llm-test-harness) | TypeScript | npm |

| [`packages/llm-test-harness-python`](./packages/llm-test-harness-python) | Python | PyPI |

Both packages use an **identical cassette format** (YAML) and **identical SHA-256 hashing algorithm**, so cassettes recorded in one language can be replayed in the other.

---

## Core Features

- **Record/Replay** — Wrap an Anthropic or OpenAI client. In `record` mode, calls are made and saved to a YAML cassette. In `replay` mode, cassette responses are returned without any network call.

- **Eval Scoring** — Score LLM output against metrics: `ExactMatch`, `Contains`, `ContainsAll`, `Regex`, `JSONSchema`, `Similarity`, `LLMJudge`, `Custom`.

- **Regression Baseline** — Save a snapshot of metric scores; future runs detect score degradation.

- **Cassette Modes** — `record`, `replay`, `passthrough`, `hybrid`.

---

## TypeScript Quick Start

```bash

npm install llm-test-harness

```

```typescript

import { Harness, Metrics } from 'llm-test-harness'

import Anthropic from '@anthropic-ai/sdk'

const harness = new Harness({ cassettesDir: './cassettes', mode: 'replay' })

const client = harness.wrap(new Anthropic())

const response = await client.messages.create({

  model: 'claude-haiku-4-5-20251001',

  max_tokens: 100,

  messages: [{ role: 'user', content: 'Say hello.' }],

})

const result = await harness.evaluate(response.content[0].text, [

  Metrics.contains('hello'),

  Metrics.regex(/^(hello|hi)/i),

])

expect(result.pass).toBe(true)

// Regression testing

harness.saveBaseline('chat-greeting', result)

const regression = harness.compareBaseline('chat-greeting', result)

expect(regression.hasRegression).toBe(false)

```

---

## Python Quick Start

```bash

pip install "llm-test-harness[anthropic]"

```

```python

from llm_test_harness import Harness, Metrics

import anthropic

harness = Harness(cassettes_dir='./cassettes', mode='replay')

client = harness.wrap(anthropic.Anthropic())

response = client.messages.create(

    model='claude-haiku-4-5-20251001',

    max_tokens=100,

    messages=[{'role': 'user', 'content': 'Say hello.'}]

)

result = harness.evaluate(response.content[0].text, [

    Metrics.contains('hello'),

    Metrics.regex(r'^(hello|hi)', flags=re.IGNORECASE),

])

assert result.passed

# Regression testing

harness.save_baseline('chat-greeting', result)

regression = harness.compare_baseline('chat-greeting', result)

assert not regression.has_regression

```

---

## Cassette Format

Cassettes are plain YAML files, human-readable and diffable:

```yaml

version: 1

interactions:

  - id: "sha256:a3f1c2d4..."

    request:

      provider: anthropic

      model: claude-haiku-4-5-20251001

      messages:

        - role: user

          content: Say hello.

      params:

        max_tokens: 100

        temperature: 0.0

    response:

      type: message

      content:

        - type: text

          text: Hello!

      usage:

        input_tokens: 14

        output_tokens: 2

      stop_reason: end_turn

    metadata:

      recorded_at: "2026-03-16T11:09:00.000Z"

      duration_ms: 423

```

---

## Metrics

| Metric | Pass condition | Score |

|--------|---------------|-------|

| `ExactMatch(expected)` | `text === expected` | 1 or 0 |

| `Contains(substr)` | substring present | 1 or 0 |

| `ContainsAll(substrs[])` | all substrings present | found/total |

| `Regex(pattern)` | pattern matches | 1 or 0 |

| `JSONSchema(schema)` | valid JSON + schema passes | 1 or 0 |

| `Similarity(ref, threshold?)` | score ≥ threshold (default 0.8) | normalized Levenshtein |

| `LLMJudge(rubric, client)` | score ≥ threshold (default 0.7) | 0–1 from judge LLM |

| `Custom(name, fn)` | user-defined | user-defined |

---

## Development

```bash

# TypeScript

cd packages/llm-test-harness

pnpm install

pnpm test

pnpm build

# Python

cd packages/llm-test-harness-python

python3 -m venv .venv && source .venv/bin/activate

pip install -e ".[dev]"

pytest tests/ -v

```

---

## License

MIT

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/piyushgupta344/llm-test-harness

Awesome Lists containing this project

README