https://github.com/javierdejesusda/checkllm
The pytest of LLM testing. Test LLM-powered applications with the same rigor as traditional software.
ai-compliance ai-safety ai-testing anthropic hallucination llm llm-evaluation openai prompt-engineering pytest rag red-teaming
- Host: GitHub
- URL: https://github.com/javierdejesusda/checkllm
- Owner: javierdejesusda
- License: mit
- Created: 2026-03-28T20:07:20.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-04-18T11:07:44.000Z (3 months ago)
- Last Synced: 2026-04-18T12:25:16.351Z (3 months ago)
- Topics: ai-compliance, ai-safety, ai-testing, anthropic, hallucination, llm, llm-evaluation, openai, prompt-engineering, pytest, rag, red-teaming
- Language: Python
- Homepage: https://javierdejesusda.github.io/checkllm/
- Size: 1.82 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
# CheckLLM: Reproducible Agent-Trajectory Evaluation
A deterministic, judge-free metric for scoring agent tool-call trajectories. It reaches AUROC 0.93 against synthetic ground truth and runs ~1500x faster than DeepEval's `ToolCorrectnessMetric`.
[PyPI](https://pypi.org/project/checkllm/) | [License](https://github.com/javierdejesusda/checkllm/blob/main/LICENSE) | [CI](https://github.com/javierdejesusda/checkllm/actions/workflows/ci.yml) | [arXiv](https://arxiv.org/abs/XXXX.YYYYY) | [DOI](https://doi.org/10.5281/zenodo.PLACEHOLDER) | [Benchmarks](docs/benchmarks/competitor-comparison.md)
```bash
pip install checkllm
```
```python
from checkllm.metrics.trajectory_metric import TrajectoryMetric
# Expected plan vs. what the agent actually did
expected = ["search", "fetch", "parse", "respond"]
actual = ["search", "fetch", "parse", "fetch", "respond"]
metric = TrajectoryMetric(expected_trajectory=expected)
sub = metric.compute_subscores(actual)
print(f"ordering {sub.ordering:.2f}") # 0.80
print(f"loops {sub.loops:.2f}") # 1.00
print(f"coverage {sub.coverage:.2f}") # 1.00
print(f"unexpected {sub.unexpected:.2f}") # 1.00
print(f"overall {sub.overall:.2f}") # 0.92
```
No judge LLM. No API key. Bit-identical scores across runs. See [the 10-minute tutorial](docs/tutorials/evaluating-agents-in-10-minutes.md) for the full walkthrough.
## Why CheckLLM?
- **Deterministic** -- no judge LLM, no API cost, bit-identical scores across runs and machines.
- **Composite** -- 4-axis trajectory scoring (ordering, loops, coverage, unexpected), AUROC 0.93 [0.91, 0.94] on 500 trajectories.
- **OTel-compatible** -- ingest traces from any agent framework via OpenTelemetry GenAI semantic conventions.
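To make the four axes concrete, here is a minimal standard-library sketch of how such sub-scores could be computed from two tool-call lists. The formulas (LCS-based ordering, set-based coverage, immediate-repeat loop penalty) and the names `lcs_length` and `sketch_subscores` are illustrative assumptions, not CheckLLM's actual implementation; use `TrajectoryMetric` for real scoring.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def sketch_subscores(expected, actual):
    """Toy versions of the four axes; every formula here is an assumption."""
    expected_set = set(expected)
    # coverage: share of expected tools that appear at least once
    coverage = len(expected_set & set(actual)) / len(expected_set) if expected_set else 1.0
    # unexpected: 1 minus the share of calls to tools outside the plan
    off_plan = sum(1 for step in actual if step not in expected_set)
    unexpected = 1.0 - off_plan / len(actual) if actual else 1.0
    # ordering: in-order overlap relative to the longer of the two sequences
    ordering = lcs_length(expected, actual) / max(len(expected), len(actual), 1)
    # loops: penalise a tool being called twice in a row
    repeats = sum(1 for first, second in zip(actual, actual[1:]) if first == second)
    loops = 1.0 - repeats / max(len(actual) - 1, 1)
    return {"ordering": ordering, "loops": loops,
            "coverage": coverage, "unexpected": unexpected}


expected = ["search", "fetch", "parse", "respond"]
actual = ["search", "fetch", "parse", "fetch", "respond"]
scores = sketch_subscores(expected, actual)
print(scores)
```

Because every step is plain arithmetic over the two lists, repeated runs give identical results, which is the property the "Deterministic" bullet refers to. How the library weights the axes into `overall` is not shown here.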
Beyond trajectory evaluation, CheckLLM also ships the broader testing suite the project has always provided:
- **Zero learning curve** -- if you know pytest, you know checkllm. Just add a `check` parameter.
- **39 free deterministic checks** run instantly with zero API calls. No API key needed to start.
- **72 LLM-as-judge metrics** -- hallucination, faithfulness, trajectory, per-turn, dual-judge, and more.
- **151 red team vulnerability types** with 25 attack strategies -- a broad adversarial testing suite (the feature comparison below shows how these counts compare with other tools).
- **17 compliance frameworks** -- OWASP LLM/API/Agentic Top 10, MITRE ATLAS, EU AI Act, ISO 42001, HIPAA, GDPR, and more.
- **Same checks everywhere** -- use them in tests, CI, and production guardrails.
## Quickstart
### Install
```bash
pip install checkllm
checkllm init --use-case rag # generates a tailored test file
```
### 1. Deterministic checks (free, instant)
```python
def test_basic_quality(check):
    # my_llm, start_time and end_time are placeholders supplied by your own test code
    output = my_llm("Summarize this article.")
    check.contains(output, "key finding")
    check.max_tokens(output, limit=200)
    check.no_pii(output)
    check.is_json(output)
    check.gleu(output, reference="Expected summary text.", threshold=0.5)
    check.chrf(output, reference="Expected summary text.", threshold=0.4)
    check.latency_check(start_time, end_time, max_ms=3000)
    check.cost_check(input_tokens=500, output_tokens=200, model="gpt-4o", max_cost=0.05)
```
### 2. LLM-as-judge (deeper evaluation)
```python
def test_rag_quality(check):
    output = my_rag("What causes climate change?")
    context = retrieve_context("climate change")
    check.hallucination(output, context=context)
    check.faithfulness(output, context=context)
    check.relevance(output, query="What causes climate change?")
    check.toxicity(output)
```
### 3. Fluent chaining
```python
def test_with_chaining(check):
    output = my_llm("Explain quantum physics simply.")
    check.that(output) \
        .contains("quantum") \
        .max_tokens(200) \
        .has_no_pii() \
        .scores_above("relevance", 0.8, query="quantum physics")
```
### 4. Production guardrails
```python
from checkllm import Guard, CheckSpec

guard = Guard(checks=[
    CheckSpec(check_type="no_pii"),
    CheckSpec(check_type="max_tokens", params={"limit": 500}),
    CheckSpec(check_type="toxicity"),
])
result = guard.validate(llm_output)
if not result.valid:
    result.raise_on_failure()
```
### 5. YAML-based evaluation
```yaml
# checkllm.yaml
description: "Customer support chatbot evaluation"
judge:
  backend: openai
  model: gpt-4o
prompts:
  - "You are a helpful support agent. Answer: {{query}}"
tests:
  - vars:
      query: "How do I return an item?"
    assert:
      - type: contains
        value: "return policy"
      - type: relevance
        threshold: 0.8
      - type: no_pii
      - type: max_tokens
        value: 500
settings:
  budget: 5.0
```
```bash
checkllm eval-yaml checkllm.yaml
```
## How checkllm compares
> **Independent benchmark, not just feature counts.** On the public competitor leaderboard
> ([docs/benchmarks/competitor-comparison.md](docs/benchmarks/competitor-comparison.md))
> checkllm holds **rank 1 on every published row** against DeepEval and promptfoo:
> halubench/hallucination 0.783, ragtruth/hallucination 0.663,
> ragtruth/faithfulness 0.754, ragtruth/context_relevance 0.565, and
> truthfulqa/answer_relevancy 0.546 (ROC-AUC, gpt-4o-mini judge,
> 200 source rows per slice). Methodology is in
> [docs/benchmarks/methodology.md](docs/benchmarks/methodology.md);
> raw scores ship in `benchmarks/competitor_comparison/`.
### Feature comparison
| Feature | checkllm | DeepEval | Ragas | promptfoo |
|---------|----------|----------|-------|-----------|
| pytest native | Yes | Wrapper | No | No |
| Free deterministic checks | **39** | Limited | Limited | Yes |
| LLM-as-judge metrics | **72** | ~50 | ~40 | Custom |
| Red team vulnerability types | **151** | 40+ | 0 | 100+ |
| Attack strategies | **25** | 10+ | 0 | 30+ |
| Compliance frameworks | **17** | 3 | 0 | 10+ |
| Multi-provider judges | **15+ backends** | 13+ | ~6 | 50+ |
| Consensus judging | **7 strategies** | No | Dual-judge | No |
| Production guardrails | **Built-in** | No | No | API |
| Cost control & budgets | **Built-in** | No | No | Caching |
| Knowledge Graph synthesis | **Full pipeline** | No | Yes | No |
| Multilingual prompts | **20 languages** | No | Yes | No |
| Prompt optimization | **4 algorithms** | 4 | 2 | No |
| YAML config evaluation | **Yes** | No | No | Yes |
| Streaming evaluation | **Token-by-token** | No | No | No |
| Regression detection | **Statistical (p-values)** | No | No | No |
| DPO export | **Yes** | No | No | No |
| Telemetry / phoning home | **None** | PostHog + Sentry | None | Telemetry |
| Independence | **Fully independent** | YC-backed | YC-backed | OpenAI-owned |
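The "Statistical (p-values)" entry in the regression-detection row can be illustrated with a permutation test on metric scores from a baseline run and a current run. This is a minimal sketch of one common technique, assuming per-test scores in the range 0 to 1; the function name `regression_p_value` and the sample scores are hypothetical, and the README does not state which statistical test checkllm actually applies.

```python
import random
from statistics import mean


def regression_p_value(baseline, current, n_permutations=2000, seed=0):
    """One-sided permutation test: is `current` worse than `baseline`?"""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    observed = mean(baseline) - mean(current)
    pooled = list(baseline) + list(current)
    n = len(baseline)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n]) - mean(pooled[n:])
        if diff >= observed:
            hits += 1
    # add-one smoothing so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)


baseline_scores = [0.9] * 10   # hypothetical snapshot scores
current_scores = [0.5] * 10    # hypothetical scores after a prompt change
p = regression_p_value(baseline_scores, current_scores)
print(f"p-value: {p:.4f}")
```

A small p-value means a drop this large would rarely arise from shuffling the labels alone, so it is reasonable to flag the change as a regression rather than noise.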
## All metrics by category
### RAG Evaluation (14 metrics)
`hallucination` `faithfulness` `faithfulness_hhem` `context_relevance` `context_entity_recall` `contextual_precision` `contextual_recall` `answer_completeness` `groundedness` `nonllm_context_precision` `nonllm_context_recall` `quoted_spans_alignment` `nv_context_relevance` `nv_response_groundedness`
### General Quality (12 metrics)
`relevance` `coherence` `fluency` `consistency` `correctness` `factual_correctness` `sentiment` `toxicity` `bias` `summarization` `nv_answer_accuracy` `prompt_alignment`
### Completeness & Instruction Following (5 metrics)
`response_completeness` `instruction_following` `instruction_completeness` `conversation_completeness` `topic_adherence`
### Agent & Tool Evaluation (12 metrics)
`task_completion` `tool_accuracy` `tool_call_f1` `plan_adherence` `plan_quality` `step_efficiency` `knowledge_retention` `goal_accuracy` `trajectory_goal_success` `trajectory_tool_sequence` `trajectory_step_count` `trajectory_tool_args_match`
### Per-Turn Conversation (3 metrics)
`turn_relevancy` `turn_faithfulness` `turn_coherence`
### Multimodal (6 metrics)
`image_relevance` `image_helpfulness` `image_coherence` `text_to_image` `image_editing` `image_reference`
### Structured Output (4 metrics)
`code_correctness` `sql_equivalence` `comparative_quality` `datacompy_score`
### Role & Safety (3 metrics)
`role_adherence` `role_violation` `non_advice`
### MCP & Tool-Specific (3 metrics)
`mcp_use` `mcp_task_completion` `multi_turn_mcp_use`
### Specialized (3 metrics)
`g_eval` `noise_sensitivity` `rubric`
### Deterministic Checks (39, zero API cost)
`contains` `not_contains` `starts_with` `ends_with` `regex` `exact_match` `exact_match_strict` `min_tokens` `max_tokens` `min_words` `max_words` `min_chars` `max_chars` `min_sentences` `max_sentences` `is_json` `json_schema` `is_xml` `is_yaml` `is_html` `no_pii` `language` `readability` `similarity` `bleu` `rouge_l` `meteor` `gleu` `chrf` `latency_check` `cost_check` `string_distance` `perplexity` `is_valid_python` `is_url` `has_url` `word_count` `char_count` `sentence_count`
## Red teaming & adversarial testing
```python
from checkllm.redteam import RedTeamer, VulnerabilityType
from checkllm.redteam_strategies import StrategyType

red = RedTeamer()
report = await red.scan(
    target=my_llm_function,
    vulnerability_types=[
        VulnerabilityType.PROMPT_INJECTION,
        VulnerabilityType.JAILBREAK,
        VulnerabilityType.PII_LEAKAGE,
        VulnerabilityType.DATA_EXFILTRATION,
    ],
    strategies=[StrategyType.BASE64, StrategyType.CRESCENDO, StrategyType.PERSONA],
    attacks_per_type=5,
)
print(report.summary())
print(report.risk_summary())  # CVSS severity breakdown
```
**151 vulnerability types** across 12 categories: prompt injection, jailbreak, PII leakage, harmful content, encoding attacks, privilege escalation, agentic AI attacks, brand & reputation, industry compliance, and more.
**25 attack strategies**: BASE64, ROT13, HEX, LEETSPEAK, MORSE, HOMOGLYPH, CRESCENDO (multi-turn escalation), JAILBREAK_TREE, JAILBREAK_META, JAILBREAK_COMPOSITE, BEST_OF_N, PERSONA, HYPOTHETICAL, ROLEPLAY, LAYER (composable chaining), and more.
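The simpler encoding strategies named above (BASE64, ROT13, HEX, LEETSPEAK) amount to text transformations applied to a probe prompt before it reaches the target. The sketch below shows those transformations with the standard library only. The function `encode_probe`, the leetspeak mapping, and the lowercase strategy strings are assumptions for illustration, not checkllm's implementation of `StrategyType`.

```python
import base64
import codecs

# Assumed character substitutions for the leetspeak variant
LEET_MAP = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})


def encode_probe(text, strategy):
    """Return `text` rewritten with one of four simple encodings."""
    if strategy == "base64":
        return base64.b64encode(text.encode("utf-8")).decode("ascii")
    if strategy == "rot13":
        return codecs.encode(text, "rot13")
    if strategy == "hex":
        return text.encode("utf-8").hex()
    if strategy == "leetspeak":
        return text.lower().translate(LEET_MAP)
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("base64", "rot13", "hex", "leetspeak"):
    print(name, encode_probe("hello", name))
```

A red-team run then checks whether the target treats the encoded probe differently from the plain one, which is what makes encodings useful as a test dimension.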
### Coding agent security
```python
from checkllm.redteam_coding_agents import CodingAgentScanner
scanner = CodingAgentScanner(judge=judge)
report = await scanner.scan(target=my_coding_agent)
# Tests: repo prompt injection, sandbox escape, secret leakage, verifier sabotage
```
## Compliance frameworks
```python
from checkllm.compliance_frameworks import ComplianceScanner, ComplianceFramework

scanner = ComplianceScanner(judge=judge)
report = await scanner.scan(
    target=my_llm,
    frameworks=[
        ComplianceFramework.OWASP_LLM_TOP10,
        ComplianceFramework.OWASP_AGENTIC_TOP10,
        ComplianceFramework.EU_AI_ACT,
        ComplianceFramework.HIPAA,
    ],
)
print(report.summary())
```
**17 frameworks**: OWASP LLM Top 10, OWASP API Top 10, OWASP Agentic Top 10, MITRE ATLAS, EU AI Act, ISO 42001, NIST AI RMF, NIST CSF, HIPAA, GDPR, PCI-DSS, SOC2, ISO 27001, COPPA, FERPA, CCPA, DoD AI Ethics.
## Knowledge Graph test generation
```python
from checkllm.knowledge_graph import KGTestGenerator, EntityExtractor, SimilarityBuilder

gen = KGTestGenerator(judge=judge)
samples = await gen.generate(
    documents=["doc1 text...", "doc2 text..."],
    num_samples=50,
    synthesizers={"single_hop": 0.4, "multi_hop_abstract": 0.3, "multi_hop_specific": 0.3},
    personas=5,
)
cases = gen.to_cases(samples)
```
Build a knowledge graph from your documents, then generate diverse test cases with single-hop, multi-hop abstract, and multi-hop specific queries. Supports persona variation, query styles (web search, misspelled, conversational), and configurable complexity.
## Multilingual evaluation
```python
from checkllm.multilingual import PromptAdapter, detect_language
adapter = PromptAdapter(judge=judge)
translated = await adapter.adapt(template=my_prompt, target_language="ja")
adapter.save_translations("translations/ja.json")
lang = detect_language("Esto es un texto en espanol.") # "es"
```
Supports 20+ languages with automatic prompt adaptation. Language detection uses Unicode character-range analysis with LLM fallback.
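Unicode character-range analysis can be sketched as counting how many characters fall into the code-point block of each script. The ranges are standard Unicode blocks, but the mapping from script to language code (for example Cyrillic to `"ru"`) and the function name `detect_by_script` are simplifying assumptions, not checkllm's `detect_language`. Latin-script text returns `None` in this sketch, which is the kind of case where a fallback such as the LLM step mentioned above would take over.

```python
from collections import Counter

# (language code, first code point, last code point); the mapping is an assumption
SCRIPT_RANGES = [
    ("ja", 0x3040, 0x30FF),  # Hiragana and Katakana
    ("ko", 0xAC00, 0xD7AF),  # Hangul syllables
    ("zh", 0x4E00, 0x9FFF),  # CJK unified ideographs
    ("ru", 0x0400, 0x04FF),  # Cyrillic
    ("ar", 0x0600, 0x06FF),  # Arabic
]


def detect_by_script(text):
    """Guess a language from script usage; None means 'undecided'."""
    counts = Counter()
    for ch in text:
        code_point = ord(ch)
        for lang, low, high in SCRIPT_RANGES:
            if low <= code_point <= high:
                counts[lang] += 1
                break
    if not counts:
        return None
    # Japanese mixes kana with ideographs, so any kana decides for "ja"
    if counts["ja"]:
        return "ja"
    return counts.most_common(1)[0][0]


print(detect_by_script("これは日本語のテキストです"))
print(detect_by_script("Привет, мир"))
print(detect_by_script("Esto es un texto"))
```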
## Prompt optimization
```python
from checkllm.optimize import create_optimizer

optimizer = create_optimizer("miprov2", judge=judge)  # or "genetic", "copro", "simba"
result = await optimizer.optimize(
    prompt="Summarize this document.",
    test_cases=my_test_cases,
    metric_fn=my_metric,
    num_candidates=10,
)
print(f"Improved from {result.initial_score:.2f} to {result.best_score:.2f}")
```
Four optimization algorithms: Genetic (evolutionary), MIPROv2 (instruction + demonstration), COPRO (failure-driven iterative), SIMBA (similarity-based adaptation).
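As a rough intuition for the evolutionary approach, the sketch below runs a simplified mutate-and-select loop over prompt strings. It omits crossover and uses a toy keyword-counting score in place of a real metric that would run each prompt against test cases. `evolve_prompt`, `toy_score`, and the mutation phrases are hypothetical and do not reflect how checkllm's optimizers are implemented.

```python
import random


def evolve_prompt(seed_prompt, score_fn, mutations, generations=5, population=6, seed=0):
    """Keep the best-scoring prompts after each round of random mutations."""
    rng = random.Random(seed)
    pool = [seed_prompt]
    for _ in range(generations):
        # mutate: append a random instruction to a random survivor
        children = [f"{rng.choice(pool)} {rng.choice(mutations)}" for _ in range(population)]
        # de-duplicate while preserving order, then select the top candidates
        unique = list(dict.fromkeys(pool + children))
        pool = sorted(unique, key=score_fn, reverse=True)[:population]
    return pool[0]


def toy_score(prompt):
    """Stand-in metric: counts which desired instructions are present."""
    text = prompt.lower()
    return sum(keyword in text for keyword in ("bullet", "concise", "cite"))


MUTATIONS = ["Use bullet points.", "Be concise.", "Cite the source section."]
best = evolve_prompt("Summarize this document.", toy_score, MUTATIONS)
print(toy_score(best), best)
```

Because survivors from the previous round stay in the pool, the best score can never decrease from one generation to the next.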
## Multi-provider judges
```python
from checkllm import create_judge
judge = create_judge("openai", model="gpt-4o")
judge = create_judge("anthropic", model="claude-sonnet-4-6")
judge = create_judge("gemini", model="gemini-2.0-flash")
judge = create_judge("ollama", model="llama3.1") # Free, local
judge = create_judge("litellm", model="any-model") # 100+ models
judge = create_judge("deepseek")
judge = create_judge("groq")
judge = create_judge("fireworks")
```
Auto-detection: set `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, or have Ollama running -- checkllm picks the best judge automatically.
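Conceptually, auto-detection is a priority check over the environment. The sketch below assumes the order OpenAI, then Anthropic, then a local Ollama server. That order, the function name `pick_backend`, and the `ollama_running` flag are assumptions for illustration; the README does not document how checkllm ranks backends.

```python
import os


def pick_backend(env=None, ollama_running=False):
    """Return the first available backend name, or None if nothing is configured."""
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if ollama_running:
        return "ollama"
    return None


print(pick_backend({"ANTHROPIC_API_KEY": "example-key"}))
print(pick_backend({}, ollama_running=True))
```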
## Consensus judging
```python
from checkllm import ConsensusJudge
judges = [("gpt4", gpt4_judge), ("claude", claude_judge), ("gemini", gemini_judge)]
consensus = ConsensusJudge(judges, strategy="majority") # or mean, median, unanimous, min, max, weighted
```
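The strategy names listed in the comment above can be read as different ways of combining per-judge scores into one verdict. The following sketch shows one plausible interpretation of each, assuming scores between 0 and 1 and a shared pass threshold. The function `aggregate` and its exact semantics are assumptions, not the behaviour of `ConsensusJudge`.

```python
from statistics import mean, median


def aggregate(scores, strategy="majority", threshold=0.8, weights=None):
    """Combine {judge_name: score} into (combined_score, passed)."""
    values = list(scores.values())
    votes = [value >= threshold for value in values]
    if strategy == "majority":
        share = sum(votes) / len(votes)
        return share, share > 0.5
    if strategy == "unanimous":
        return min(values), all(votes)
    if strategy == "weighted":
        total = sum(weights[name] for name in scores)
        combined = sum(weights[name] * score for name, score in scores.items()) / total
    elif strategy in ("mean", "median", "min", "max"):
        combined = {"mean": mean, "median": median, "min": min, "max": max}[strategy](values)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return combined, combined >= threshold


judge_scores = {"gpt4": 0.9, "claude": 0.85, "gemini": 0.6}
for name in ("majority", "mean", "median", "unanimous", "min", "max"):
    print(name, aggregate(judge_scores, name))
print("weighted", aggregate(judge_scores, "weighted", weights={"gpt4": 2, "claude": 1, "gemini": 1}))
```

With these sample scores, `majority` and `median` pass while `mean`, `min`, and `unanimous` fail, which shows how much the choice of strategy affects strictness.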
## Cost control
```bash
checkllm estimate tests/ # See costs before running
checkllm run tests/ --budget 5.0 # Cap spend at $5
checkllm run tests/ --dry-run # Estimate without executing
```
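The idea behind a budget cap is simple bookkeeping: estimate the cost of each judge call from its token counts and refuse to proceed once the running total would exceed the cap. The sketch below illustrates that logic. The class names and the per-1k-token prices are placeholders chosen for the example, not real provider pricing and not checkllm's internals.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push spending over the budget."""


class BudgetTracker:
    def __init__(self, budget_usd, price_per_1k_input, price_per_1k_output):
        self.budget = budget_usd
        self.price_in = price_per_1k_input
        self.price_out = price_per_1k_output
        self.spent = 0.0

    def estimate(self, input_tokens, output_tokens):
        """Cost of one call in USD, computed without recording it."""
        return input_tokens / 1000 * self.price_in + output_tokens / 1000 * self.price_out

    def charge(self, input_tokens, output_tokens):
        """Record a call, or raise if it would exceed the budget."""
        cost = self.estimate(input_tokens, output_tokens)
        if self.spent + cost > self.budget:
            raise BudgetExceeded(f"would spend {self.spent + cost:.4f} of {self.budget:.4f}")
        self.spent += cost
        return cost


# Placeholder prices, for illustration only
tracker = BudgetTracker(budget_usd=0.005, price_per_1k_input=0.001, price_per_1k_output=0.002)
tracker.charge(1000, 500)
tracker.charge(1000, 500)
print(f"spent so far: {tracker.spent:.4f}")
```

Checking before the charge is recorded means the cap is never exceeded, at the price of possibly leaving a small part of the budget unused.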
## Configuration
```toml
# pyproject.toml
[tool.checkllm]
judge_backend = "auto"
judge_model = "gpt-4o"
default_threshold = 0.8
budget = 10.0
cache_enabled = true
engine = "auto"
```
## CLI
| Command | Description |
|---------|-------------|
| `checkllm init` | Scaffold a project (`--use-case`, `--ci`) |
| `checkllm run` | Run tests (`--budget`, `--dry-run`, `--snapshot`) |
| `checkllm eval-yaml` | Run YAML-based evaluation |
| `checkllm estimate` | Estimate costs before running |
| `checkllm watch` | Re-run on file changes |
| `checkllm report` | Generate HTML report |
| `checkllm snapshot` | Save baseline for regression detection |
| `checkllm diff` | Compare snapshots |
| `checkllm history` | View run history and trends |
| `checkllm list-metrics` | Show all available checks and metrics |
| `checkllm cache` | Manage judge response cache |
| `checkllm dashboard` | Launch web dashboard |
## Framework integrations
```python
# LangChain
from checkllm.integrations.langchain import CheckllmCallbackHandler
chain.invoke(input, config={"callbacks": [CheckllmCallbackHandler(checks=["no_pii"])]})
# CrewAI
from checkllm.integrations.crewai import CheckllmCrewCallback
# OpenAI Agents SDK
from checkllm.integrations.openai_agents import CheckllmRunHandler
# Claude Agent SDK
from checkllm.integrations.claude_agents import CheckllmAgentHandler
# PydanticAI
from checkllm.integrations.pydantic_ai import CheckllmResultValidator
# LlamaIndex
from checkllm.integrations.llama_index import CheckllmCallbackHandler
```
## Custom metrics
```python
from checkllm import metric, CheckResult

@metric("brevity")
def brevity_check(output: str, max_words: int = 50, **kwargs) -> CheckResult:
    words = len(output.split())
    return CheckResult(
        passed=words <= max_words,
        score=min(1.0, max_words / max(words, 1)),
        reasoning=f"{words} words (limit: {max_words})",
        cost=0.0, latency_ms=0, metric_name="brevity",
    )
```
## Citing CheckLLM
If you use CheckLLM's trajectory metric in academic work, please cite the companion paper:
```bibtex
@article{dejesus2026checkllm,
  title   = {{CheckLLM}: Reproducible Agent-Trajectory Evaluation at Scale},
  author  = {de Jesus, Javier},
  journal = {arXiv preprint arXiv:XXXX.YYYYY},
  year    = {2026},
  doi     = {10.5281/zenodo.PLACEHOLDER},
  url     = {https://github.com/javierdejesusda/checkllm}
}
```
The arXiv ID and Zenodo DOI placeholders will be replaced once the paper-v1 tag is cut. See [`CITATION.cff`](CITATION.cff) for the canonical citation metadata.
## License
MIT