{"id":47720647,"url":"https://github.com/gojiplus/understudy","last_synced_at":"2026-04-02T19:24:39.102Z","repository":{"id":343070646,"uuid":"1175665029","full_name":"gojiplus/understudy","owner":"gojiplus","description":"Scenario Testing for AI Agents","archived":false,"fork":false,"pushed_at":"2026-03-31T01:11:43.000Z","size":2176,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-03-31T04:36:34.819Z","etag":null,"topics":["agent-eval","agent-evaluation","agentic","evaluation","google-adk","simulation"],"latest_commit_sha":null,"homepage":"https://gojiplus.github.io/understudy/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gojiplus.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-08T02:10:40.000Z","updated_at":"2026-03-31T01:11:47.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/gojiplus/understudy","commit_stats":null,"previous_names":["gojiplus/understudy"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/gojiplus/understudy","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gojiplus%2Funderstudy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gojiplus%2Funderstudy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gojiplus%2Funderstudy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gojiplus%2Funderstudy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gojiplus","download_url":"https://codeload.github.com/gojiplus/understudy/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gojiplus%2Funderstudy/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31314375,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-02T12:59:32.332Z","status":"ssl_error","status_checked_at":"2026-04-02T12:54:48.875Z","response_time":89,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-eval","agent-evaluation","agentic","evaluation","google-adk","simulation"],"created_at":"2026-04-02T19:24:38.339Z","updated_at":"2026-04-02T19:24:39.093Z","avatar_url":"https://github.com/gojiplus.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## understudy: Scenario Testing for AI Agents\n\n[![PyPI version](https://badge.fury.io/py/understudy.svg)](https://badge.fury.io/py/understudy)\n[![PyPI Downloads](https://static.pepy.tech/personalized-badge/understudy?period=total\u0026units=INTERNATIONAL_SYSTEM\u0026left_color=BLACK\u0026right_color=GREEN\u0026left_text=downloads)](https://pepy.tech/project/understudy)\n[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)\n[![Documentation](https://github.com/gojiplus/understudy/actions/workflows/docs.yml/badge.svg)](https://gojiplus.github.io/understudy/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n\nUnderstudy is a scenario-driven testing framework for AI agents that simulates realistic multi-turn users, runs those scenes against an agent through a simple app adapter, records a structured execution trace of messages, tool calls, and handoffs, and then evaluates behavior with deterministic checks, optional LLM judges, and run reports.\n\n## How It Works\n\nTesting with understudy is **4 steps**:\n\n1. **Wrap your agent** — Adapt your agent (ADK, LangGraph, HTTP) to understudy's interface\n2. **Mock your tools** — Register handlers that return test data instead of calling real services\n3. **Write scenes** — YAML files defining what the simulated user wants and what you expect\n4. **Run and assert** — Execute simulations, check traces, generate reports\n\nThe key insight: **assert against the trace, not the prose**. Don't check what the agent said—check what it did (tool calls).\n\n**See real examples:**\n- [Example scene](https://github.com/gojiplus/understudy/blob/main/example/scenes/return_eligible_backpack.yaml) — YAML defining a test scenario\n- [ADK test file](https://github.com/gojiplus/understudy/blob/main/example/adk/test_returns.py) — pytest assertions against traces\n- [LangGraph test file](https://github.com/gojiplus/understudy/blob/main/example/langgraph/test_returns.py) — same tests, different framework\n- [Example report](https://htmlpreview.github.io/?https://github.com/gojiplus/understudy/blob/main/example/langgraph/report/index.html) — HTML report with metrics and transcripts\n\n## Installation\n\n```bash\npip install understudy[all]\n```\n\n## Quick Start\n\n### 1. Wrap your agent\n\n```python\nfrom understudy.adk import ADKApp\nfrom my_agent import agent\n\napp = ADKApp(agent=agent)\n```\n\n### 2. Mock your tools\n\nYour agent has tools that call external services. Mock them for testing:\n\n```python\nfrom understudy.mocks import MockToolkit\n\nmocks = MockToolkit()\n\n@mocks.handle(\"lookup_order\")\ndef lookup_order(order_id: str) -\u003e dict:\n    return {\"order_id\": order_id, \"items\": [...], \"status\": \"delivered\"}\n\n@mocks.handle(\"create_return\")\ndef create_return(order_id: str, item_sku: str, reason: str) -\u003e dict:\n    return {\"return_id\": \"RET-001\", \"status\": \"created\"}\n```\n\n### 3. Write a scene\n\nCreate `scenes/return_backpack.yaml`:\n\n```yaml\nid: return_eligible_backpack\ndescription: Customer wants to return a backpack\n\nstarting_prompt: \"I'd like to return an item please.\"\nconversation_plan: |\n  Goal: Return the hiking backpack from order ORD-10031.\n  - Provide order ID when asked\n  - Return reason: too small\n\npersona: cooperative\nmax_turns: 15\n\nexpectations:\n  required_tools:\n    - lookup_order\n    - create_return\n  forbidden_tools:\n    - issue_refund\n```\n\n### 4. Run simulation\n\n```python\nfrom understudy import Scene, run\n\nscene = Scene.from_file(\"scenes/return_backpack.yaml\")\ntrace = run(app, scene, mocks=mocks)\n\nassert trace.called(\"lookup_order\")\nassert trace.called(\"create_return\")\nassert not trace.called(\"issue_refund\")\n```\n\nOr with pytest (define `app` and `mocks` fixtures in conftest.py):\n\n```bash\npytest test_returns.py -v\n```\n\n## Suites and Batch Runs\n\nRun multiple scenes with multiple simulations per scene:\n\n```python\nfrom understudy import Suite, RunStorage\n\nsuite = Suite.from_directory(\"scenes/\")\nstorage = RunStorage()\n\n# Run each scene 3 times and tag for comparison\nresults = suite.run(\n    app,\n    mocks=mocks,\n    storage=storage,\n    n_sims=3,\n    tags={\"version\": \"v1\"},\n)\nprint(f\"{results.pass_count}/{len(results.results)} passed\")\n```\n\n## Simulation and Evaluation\n\nUnderstudy separates simulation (generating traces) from evaluation (checking traces). Use together or separately:\n\n### Combined (most common)\n\n```bash\nunderstudy run \\\n  --app mymodule:agent_app \\\n  --scene ./scenes/ \\\n  --n-sims 3 \\\n  --junit results.xml\n```\n\n### Separate workflows\n\nGenerate traces only:\n\n```bash\nunderstudy simulate \\\n  --app mymodule:agent_app \\\n  --scenes ./scenes/ \\\n  --output ./traces/ \\\n  --n-sims 3\n```\n\nEvaluate existing traces:\n\n```bash\nunderstudy evaluate \\\n  --traces ./traces/ \\\n  --output ./results/ \\\n  --junit results.xml\n```\n\nPython API:\n\n```python\nfrom understudy import simulate_batch, evaluate_batch\n\n# Generate traces\ntraces = simulate_batch(\n    app=agent_app,\n    scenes=\"./scenes/\",\n    n_sims=3,\n    output=\"./traces/\",\n)\n\n# Evaluate later\nresults = evaluate_batch(\n    traces=\"./traces/\",\n    output=\"./results/\",\n)\n```\n\n## CLI Commands\n\n```bash\n# Run simulations\nunderstudy run --app mymodule:app --scene ./scenes/\nunderstudy simulate --app mymodule:app --scenes ./scenes/\nunderstudy evaluate --traces ./traces/\n\n# View results\nunderstudy list\nunderstudy show \u003crun_id\u003e\nunderstudy summary\n\n# Compare runs by tag\nunderstudy compare --tag version --before v1 --after v2\n\n# Generate reports\nunderstudy report -o report.html\nunderstudy compare --tag version --before v1 --after v2 --html comparison.html\n\n# Interactive browser\nunderstudy serve --port 8080\n\n# HTTP simulator server (for browser/UI testing)\nunderstudy serve-api --port 8000\n\n# Cleanup\nunderstudy delete \u003crun_id\u003e\nunderstudy clear\n```\n\n## LLM Judges\n\nFor qualities that can't be checked deterministically:\n\n```python\nfrom understudy.judges import Judge\n\nempathy_judge = Judge(\n    rubric=\"The agent acknowledged frustration and was empathetic while enforcing policy.\",\n    samples=5,\n)\n\nresult = empathy_judge.evaluate(trace)\nassert result.score == 1\n```\n\nBuilt-in rubrics:\n\n```python\nfrom understudy.judges import (\n    TOOL_USAGE_CORRECTNESS,\n    POLICY_COMPLIANCE,\n    TONE_EMPATHY,\n    ADVERSARIAL_ROBUSTNESS,\n    TASK_COMPLETION,\n)\n```\n\n## Report Contents\n\nThe `understudy summary` command shows:\n- **Pass rate** — percentage of scenes that passed all expectations\n- **Avg turns** — average conversation length\n- **Tool usage** — distribution of tool calls across runs\n- **Agents** — which agents were invoked\n\nThe HTML report (`understudy report`) includes:\n- All metrics above\n- Full conversation transcripts\n- Tool call details with arguments\n- Expectation check results\n- Judge evaluation results (when used)\n\n## Documentation\n\nSee the [full documentation](https://gojiplus.github.io/understudy) for:\n- [Installation guide](https://gojiplus.github.io/understudy/installation.html)\n- [Writing scenes](https://gojiplus.github.io/understudy/tutorial/scenes.html)\n- [ADK integration](https://gojiplus.github.io/understudy/adk-integration.html)\n- [LangGraph integration](https://gojiplus.github.io/understudy/langgraph-integration.html)\n- [HTTP client for deployed agents](https://gojiplus.github.io/understudy/tutorial/http.html)\n- [API reference](https://gojiplus.github.io/understudy/api/index.html)\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgojiplus%2Funderstudy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgojiplus%2Funderstudy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgojiplus%2Funderstudy/lists"}