https://github.com/zer0contextlost/llmgate

Pre-deploy LLM regression testing for CI pipelines

# llmgate

**Pre-deploy LLM regression testing for CI pipelines.** Trace your LLM calls, then `llmgate diff baseline current` fails your PR if output quality dropped. No server, no account, SQLite only.

```python
import llmgate

@llmgate.trace
def answer(question: str) -> str:
    return my_llm_call(question)
```

That's it. Every call is logged locally. When you change your prompt or swap models, run:

```bash
llmgate diff main feature-branch
```

If outputs degraded, the command exits with status 1 and your PR fails. No server, no account, no config.

---

## Install

```bash
pip install llmgate
```

## Usage

### 1. Trace your LLM calls

```python
import llmgate
import os

os.environ["LLMGATE_RUN_ID"] = "v1.0"  # set per run, or use git SHA in CI

# retrieve() and llm stand in for your own retrieval step and LLM client
@llmgate.trace
def my_pipeline(query: str) -> str:
    context = retrieve(query)
    return llm.complete(f"{context}\n\n{query}")
```
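
The comment above suggests using the git SHA as the run ID. A minimal sketch of one way to do that outside CI follows. `resolve_run_id` and its `"local"` fallback are hypothetical helpers, not part of llmgate, and the sketch assumes llmgate reads `LLMGATE_RUN_ID` when traced functions run, so the variable should be set before they are called.

```python
import os
import subprocess


def resolve_run_id(default: str = "local") -> str:
    """Pick a run ID: explicit LLMGATE_RUN_ID first, then the git SHA, then a default.

    Hypothetical helper for illustration; not part of llmgate.
    """
    explicit = os.environ.get("LLMGATE_RUN_ID")
    if explicit:
        return explicit
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or we are not inside a repository
        return default
    return result.stdout.strip() or default


# Set the variable before any traced function is called.
os.environ["LLMGATE_RUN_ID"] = resolve_run_id()
```

In GitHub Actions the workflow can set `LLMGATE_RUN_ID` directly, in which case the helper simply returns that value.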

### 2. Assert output quality

```python
output = my_pipeline("What is the capital of France?")

llmgate.assert_contains(output, "Paris")
llmgate.assert_output(output, lambda s: len(s) < 500, "response too long")
# `baseline` is a reference output string, e.g. one saved from an earlier run
llmgate.assert_similarity(output, baseline, threshold=0.85)
```

### 3. Diff runs in CI

```bash
# See all recorded runs
llmgate runs

# Compare two runs — exits 1 if regressions found
llmgate diff v1.0 v1.1

# Inspect a specific run
llmgate show abc123
```

### 4. GitHub Actions

```yaml
- name: Run LLM eval suite
  env:
    LLMGATE_RUN_ID: ${{ github.sha }}
  run: python examples/eval_suite.py

- name: Check for regressions
  run: llmgate diff ${{ github.base_ref }} ${{ github.sha }}
```

---

## How it works

- All traces are stored in `.llmgate.db`, a SQLite file you can commit or cache as a CI artifact
- `@llmgate.trace` works with any function that returns a string or an OpenAI/Anthropic response object
- `llmgate diff` computes token-level similarity between baseline and current outputs
- Nothing leaves your machine unless you choose to push the `.db` file
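
To make the tracing idea concrete, here is a rough sketch of a decorator that logs calls to a local SQLite file. It is not llmgate's implementation: the `calls` table, its columns, and the `.trace_sketch.db` path are assumptions chosen for illustration, and the real library may store different fields.

```python
import functools
import os
import sqlite3

DB_PATH = ".trace_sketch.db"  # hypothetical path; llmgate itself uses .llmgate.db


def trace(func):
    """Record each call's run ID, function name, input, and output in SQLite."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        output = func(*args, **kwargs)
        conn = sqlite3.connect(DB_PATH)
        try:
            with conn:  # commits the transaction on success
                conn.execute(
                    "CREATE TABLE IF NOT EXISTS calls ("
                    "run_id TEXT, func TEXT, input TEXT, output TEXT)"
                )
                conn.execute(
                    "INSERT INTO calls VALUES (?, ?, ?, ?)",
                    (
                        os.environ.get("LLMGATE_RUN_ID", "default"),
                        func.__name__,
                        repr((args, kwargs)),
                        str(output),
                    ),
                )
        finally:
            conn.close()
        return output

    return wrapper


@trace
def answer(question: str) -> str:
    return f"echo: {question}"  # stand-in for a real LLM call
```

Because everything is written to one local file, comparing two runs reduces to querying rows by `run_id`, which is why no server is needed.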

## CLI reference

```
llmgate runs                      # list all runs with stats
llmgate show RUN_ID               # inspect calls in a run
llmgate diff BASELINE CURRENT     # compare runs, exit 1 on regression
  --threshold FLOAT               # similarity threshold (default: 0.8)
  --no-fail                       # report only, don't exit 1
```
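
The interaction between `--threshold` and `--no-fail` can be sketched in a few lines of Python. The README only says the diff uses token-level similarity, so the metric here (`difflib.SequenceMatcher` over whitespace-split tokens) is a stand-in, and `token_similarity` and `gate` are hypothetical names rather than llmgate functions.

```python
from difflib import SequenceMatcher


def token_similarity(baseline: str, current: str) -> float:
    """Similarity in [0, 1] over whitespace-separated tokens (assumed metric)."""
    return SequenceMatcher(None, baseline.split(), current.split()).ratio()


def gate(pairs, threshold: float = 0.8, no_fail: bool = False) -> int:
    """Return an exit code: 1 if any (baseline, current) pair scores below threshold."""
    regressions = [
        (b, c) for b, c in pairs if token_similarity(b, c) < threshold
    ]
    for b, c in regressions:
        print(f"REGRESSION: {b!r} -> {c!r}")
    # With no_fail set, regressions are reported but never fail the build.
    return 0 if (no_fail or not regressions) else 1
```

A stricter threshold catches smaller wording changes at the cost of more false alarms, so it may be worth starting with `--no-fail` to see how much your outputs vary between identical runs.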