https://github.com/gojiplus/understudy
Scenario Testing for AI Agents
agent-eval agent-evaluation agentic evaluation google-adk simulation
- Host: GitHub
- URL: https://github.com/gojiplus/understudy
- Owner: gojiplus
- License: mit
- Created: 2026-03-08T02:10:40.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-03-31T01:11:43.000Z (3 months ago)
- Last Synced: 2026-03-31T04:36:34.819Z (3 months ago)
- Topics: agent-eval, agent-evaluation, agentic, evaluation, google-adk, simulation
- Language: Python
- Homepage: https://gojiplus.github.io/understudy/
- Size: 2.08 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
## understudy: Scenario Testing for AI Agents
Understudy is a scenario-driven testing framework for AI agents. It simulates realistic multi-turn users, runs those scenes against an agent through a simple app adapter, and records a structured execution trace of messages, tool calls, and handoffs. It then evaluates behavior with deterministic checks, optional LLM judges, and run reports.
## How It Works
Testing with understudy takes **4 steps**:
1. **Wrap your agent** — Adapt your agent (ADK, LangGraph, HTTP) to understudy's interface
2. **Mock your tools** — Register handlers that return test data instead of calling real services
3. **Write scenes** — YAML files defining what the simulated user wants and what you expect
4. **Run and assert** — Execute simulations, check traces, generate reports
The key insight: **assert against the trace, not the prose**. Don't check what the agent said; check what it did (its tool calls).
**See real examples:**
- [Example scene](https://github.com/gojiplus/understudy/blob/main/example/scenes/return_eligible_backpack.yaml) — YAML defining a test scenario
- [ADK test file](https://github.com/gojiplus/understudy/blob/main/example/adk/test_returns.py) — pytest assertions against traces
- [LangGraph test file](https://github.com/gojiplus/understudy/blob/main/example/langgraph/test_returns.py) — same tests, different framework
- [Example report](https://htmlpreview.github.io/?https://github.com/gojiplus/understudy/blob/main/example/langgraph/report/index.html) — HTML report with metrics and transcripts
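To make the "assert against the trace, not the prose" idea concrete, here is a minimal sketch that does not use understudy at all. `ToolCall` and `MiniTrace` are hypothetical stand-ins for whatever understudy records; only the `called(...)` method name mirrors the Quick Start below.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    # Hypothetical record of one tool invocation.
    name: str
    arguments: dict


@dataclass
class MiniTrace:
    # Hypothetical stand-in for a recorded run: what was said and what was done.
    messages: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)

    def called(self, tool_name: str) -> bool:
        """Return True if any recorded tool call has this name."""
        return any(call.name == tool_name for call in self.tool_calls)


trace = MiniTrace(
    messages=["Your return has been processed!"],
    tool_calls=[ToolCall("lookup_order", {"order_id": "ORD-10031"})],
)

# Prose check: passes because the agent *claimed* success.
prose_says_done = "processed" in trace.messages[-1]

# Trace check: fails because create_return was never actually called.
trace_says_done = trace.called("create_return")
```

In this made-up run the agent's message sounds successful, yet no return was created. A string match on the reply would miss that; a check on the recorded tool calls catches it.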
## Installation
```bash
pip install understudy[all]
```
## Quick Start
### 1. Wrap your agent
```python
from understudy.adk import ADKApp
from my_agent import agent
app = ADKApp(agent=agent)
```
### 2. Mock your tools
Your agent has tools that call external services. Mock them for testing:
```python
from understudy.mocks import MockToolkit

mocks = MockToolkit()


@mocks.handle("lookup_order")
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "items": [...], "status": "delivered"}


@mocks.handle("create_return")
def create_return(order_id: str, item_sku: str, reason: str) -> dict:
    return {"return_id": "RET-001", "status": "created"}
```
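For intuition, a decorator-based registry like the one above could work roughly as follows. This is a sketch of the general pattern, not understudy's implementation: `TinyToolkit` and its `dispatch` method are hypothetical names.

```python
from typing import Any, Callable, Dict


class TinyToolkit:
    """Hypothetical registry mapping tool names to test handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., Any]] = {}

    def handle(self, tool_name: str) -> Callable:
        # Returns a decorator that stores the function under tool_name.
        def register(func: Callable[..., Any]) -> Callable[..., Any]:
            self._handlers[tool_name] = func
            return func
        return register

    def dispatch(self, tool_name: str, **kwargs: Any) -> Any:
        # Route a tool call to its handler; fail loudly if none is registered.
        if tool_name not in self._handlers:
            raise KeyError(f"no mock registered for tool {tool_name!r}")
        return self._handlers[tool_name](**kwargs)


toolkit = TinyToolkit()


@toolkit.handle("lookup_order")
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "delivered"}


result = toolkit.dispatch("lookup_order", order_id="ORD-10031")
```

Raising on an unregistered tool is a deliberate choice in this sketch: a test that silently falls through to a real service would defeat the purpose of mocking.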
### 3. Write a scene
Create `scenes/return_backpack.yaml`:
```yaml
id: return_eligible_backpack
description: Customer wants to return a backpack
starting_prompt: "I'd like to return an item please."
conversation_plan: |
  Goal: Return the hiking backpack from order ORD-10031.
  - Provide order ID when asked
  - Return reason: too small
persona: cooperative
max_turns: 15
expectations:
  required_tools:
    - lookup_order
    - create_return
  forbidden_tools:
    - issue_refund
```
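The `expectations` block reads as two set constraints on the tools that were called. As a rough illustration, assuming the scene has already been parsed into a dict, such a check could look like this. `check_expectations` is a hypothetical helper, not part of understudy.

```python
def check_expectations(expectations: dict, called_tools: list) -> list:
    """Return a list of human-readable failures; an empty list means the run passed."""
    called = set(called_tools)
    failures = []
    for tool in expectations.get("required_tools", []):
        if tool not in called:
            failures.append(f"required tool not called: {tool}")
    for tool in expectations.get("forbidden_tools", []):
        if tool in called:
            failures.append(f"forbidden tool called: {tool}")
    return failures


# Mirrors the expectations section of the scene above.
expectations = {
    "required_tools": ["lookup_order", "create_return"],
    "forbidden_tools": ["issue_refund"],
}

ok = check_expectations(expectations, ["lookup_order", "create_return"])
bad = check_expectations(expectations, ["lookup_order", "issue_refund"])
```

Here `ok` is empty, while `bad` reports both the missing `create_return` call and the forbidden `issue_refund` call.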
### 4. Run simulation
```python
from understudy import Scene, run

# `app` and `mocks` are the objects created in steps 1 and 2.
scene = Scene.from_file("scenes/return_backpack.yaml")
trace = run(app, scene, mocks=mocks)

# Assert on what the agent did, not on what it said.
assert trace.called("lookup_order")
assert trace.called("create_return")
assert not trace.called("issue_refund")
```
Or run it with pytest (define `app` and `mocks` as fixtures in `conftest.py`):
```bash
pytest test_returns.py -v
```
## Suites and Batch Runs
Run multiple scenes with multiple simulations per scene:
```python
from understudy import Suite, RunStorage

suite = Suite.from_directory("scenes/")
storage = RunStorage()

# Run each scene 3 times and tag for comparison
results = suite.run(
    app,
    mocks=mocks,
    storage=storage,
    n_sims=3,
    tags={"version": "v1"},
)
print(f"{results.pass_count}/{len(results.results)} passed")
```
## Simulation and Evaluation
Understudy separates simulation (generating traces) from evaluation (checking traces). Use them together or separately:
### Combined (most common)
```bash
understudy run \
  --app mymodule:agent_app \
  --scene ./scenes/ \
  --n-sims 3 \
  --junit results.xml
```
### Separate workflows
Generate traces only:
```bash
understudy simulate \
  --app mymodule:agent_app \
  --scenes ./scenes/ \
  --output ./traces/ \
  --n-sims 3
```
Evaluate existing traces:
```bash
understudy evaluate \
  --traces ./traces/ \
  --output ./results/ \
  --junit results.xml
```
Python API:
```python
from understudy import simulate_batch, evaluate_batch

# Generate traces
traces = simulate_batch(
    app=agent_app,
    scenes="./scenes/",
    n_sims=3,
    output="./traces/",
)

# Evaluate later
results = evaluate_batch(
    traces="./traces/",
    output="./results/",
)
```
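The value of the split is that traces become plain artifacts that can be re-evaluated without re-running the agent. The sketch below illustrates that idea with JSON files in a temporary directory. The file layout, the `tool_calls` field, and both helper functions are assumptions for illustration, not understudy's on-disk format or API.

```python
import json
import tempfile
from pathlib import Path


def save_trace(directory: Path, run_id: str, tool_calls: list) -> Path:
    """Write one (hypothetical) trace to <directory>/<run_id>.json."""
    path = directory / f"{run_id}.json"
    path.write_text(json.dumps({"run_id": run_id, "tool_calls": tool_calls}))
    return path


def evaluate_traces(directory: Path, required: set) -> dict:
    """Re-read saved traces and check that every required tool was called."""
    results = {}
    for path in sorted(directory.glob("*.json")):
        trace = json.loads(path.read_text())
        results[trace["run_id"]] = required <= set(trace["tool_calls"])
    return results


with tempfile.TemporaryDirectory() as tmp:
    traces_dir = Path(tmp)
    # "Simulate" step: persist traces (made-up data standing in for real runs).
    save_trace(traces_dir, "run-1", ["lookup_order", "create_return"])
    save_trace(traces_dir, "run-2", ["lookup_order"])
    # "Evaluate" step: can happen later, with different or stricter checks.
    results = evaluate_traces(traces_dir, {"lookup_order", "create_return"})
```

Because evaluation only reads files, the same traces can be checked again after tightening expectations, without paying for another round of simulated conversations.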
## CLI Commands
```bash
# Run simulations
understudy run --app mymodule:app --scene ./scenes/
understudy simulate --app mymodule:app --scenes ./scenes/
understudy evaluate --traces ./traces/
# View results
understudy list
understudy show
understudy summary
# Compare runs by tag
understudy compare --tag version --before v1 --after v2
# Generate reports
understudy report -o report.html
understudy compare --tag version --before v1 --after v2 --html comparison.html
# Interactive browser
understudy serve --port 8080
# HTTP simulator server (for browser/UI testing)
understudy serve-api --port 8000
# Cleanup
understudy delete
understudy clear
```
## LLM Judges
For qualities that can't be checked deterministically:
```python
from understudy.judges import Judge

empathy_judge = Judge(
    rubric="The agent acknowledged frustration and was empathetic while enforcing policy.",
    samples=5,
)
result = empathy_judge.evaluate(trace)
assert result.score == 1
```
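The `samples=5` argument suggests the judge is queried several times per trace. How understudy combines those samples is not stated here; one common approach is a majority vote over binary verdicts, sketched below with a hypothetical `majority_vote` helper and made-up verdicts.

```python
from collections import Counter


def majority_vote(verdicts: list) -> int:
    """Return the most common verdict (0 or 1); ties resolve to 0 (fail)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return 1 if counts[1] > counts[0] else 0


# Five made-up judge verdicts for one trace: 1 = rubric satisfied, 0 = not.
verdicts = [1, 1, 0, 1, 1]
score = majority_vote(verdicts)
agreement = verdicts.count(score) / len(verdicts)
```

An odd number of samples avoids ties under this scheme, and the agreement ratio gives a rough sense of how stable the verdict is.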
Built-in rubrics:
```python
from understudy.judges import (
    TOOL_USAGE_CORRECTNESS,
    POLICY_COMPLIANCE,
    TONE_EMPATHY,
    ADVERSARIAL_ROBUSTNESS,
    TASK_COMPLETION,
)
```
## Report Contents
The `understudy summary` command shows:
- **Pass rate** — percentage of scenes that passed all expectations
- **Avg turns** — average conversation length
- **Tool usage** — distribution of tool calls across runs
- **Agents** — which agents were invoked
The HTML report (`understudy report`) includes:
- All metrics above
- Full conversation transcripts
- Tool call details with arguments
- Expectation check results
- Judge evaluation results (when used)
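The summary metrics are simple aggregates over run records. As a sketch, assume each run is a dict with hypothetical `passed`, `turns`, and `tools` fields (the actual storage format is not described here), and note that the pass rate below is computed per run rather than per scene:

```python
from collections import Counter

# Made-up run records for illustration only.
runs = [
    {"scene": "return_eligible_backpack", "passed": True, "turns": 6,
     "tools": ["lookup_order", "create_return"]},
    {"scene": "return_eligible_backpack", "passed": True, "turns": 8,
     "tools": ["lookup_order", "create_return"]},
    {"scene": "return_eligible_backpack", "passed": False, "turns": 10,
     "tools": ["lookup_order", "issue_refund"]},
    {"scene": "return_eligible_backpack", "passed": True, "turns": 4,
     "tools": ["lookup_order", "create_return"]},
]

# Fraction of runs that passed all expectations.
pass_rate = sum(r["passed"] for r in runs) / len(runs)
# Mean conversation length in turns.
avg_turns = sum(r["turns"] for r in runs) / len(runs)
# How often each tool was called across all runs.
tool_usage = Counter(tool for r in runs for tool in r["tools"])

print(f"Pass rate: {pass_rate:.0%}")
print(f"Avg turns: {avg_turns:.1f}")
print(f"Tool usage: {dict(tool_usage)}")
```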
## Documentation
See the [full documentation](https://gojiplus.github.io/understudy) for:
- [Installation guide](https://gojiplus.github.io/understudy/installation.html)
- [Writing scenes](https://gojiplus.github.io/understudy/tutorial/scenes.html)
- [ADK integration](https://gojiplus.github.io/understudy/adk-integration.html)
- [LangGraph integration](https://gojiplus.github.io/understudy/langgraph-integration.html)
- [HTTP client for deployed agents](https://gojiplus.github.io/understudy/tutorial/http.html)
- [API reference](https://gojiplus.github.io/understudy/api/index.html)
## License
MIT