{"id":43892725,"url":"https://github.com/alepot55/agentrial","last_synced_at":"2026-02-11T22:00:51.289Z","repository":{"id":336711812,"uuid":"1150739123","full_name":"alepot55/agentrial","owner":"alepot55","description":"Statistical evaluation framework for AI agents","archived":false,"fork":false,"pushed_at":"2026-02-06T19:49:26.000Z","size":3588,"stargazers_count":8,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-09T21:58:05.640Z","etag":null,"topics":["agent-evaluation","ai-agents","ai-testing","ci-cd","confidence-intervals","llm","llm-evaluation","mlops","non-deterministic","pytest","python","quality-assurance","statistical-testing","testing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alepot55.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-05T16:25:29.000Z","updated_at":"2026-02-08T10:43:14.000Z","dependencies_parsed_at":"2026-02-07T18:00:34.022Z","dependency_job_id":null,"html_url":"https://github.com/alepot55/agentrial","commit_stats":null,"previous_names":["alepot55/agentrial"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/alepot55/agentrial","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alepot55%2Fagentrial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alepot55%2Fagentrial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alepot55%2Fagentrial/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alepot55%2Fagentrial/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alepot55","download_url":"https://codeload.github.com/alepot55/agentrial/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alepot55%2Fagentrial/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29316357,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-10T20:44:44.282Z","status":"ssl_error","status_checked_at":"2026-02-10T20:44:43.393Z","response_time":65,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-evaluation","ai-agents","ai-testing","ci-cd","confidence-intervals","llm","llm-evaluation","mlops","non-deterministic","pytest","python","quality-assurance","statistical-testing","testing"],"created_at":"2026-02-06T17:06:40.235Z","updated_at":"2026-02-10T21:01:03.025Z","avatar_url":"https://github.com/alepot55.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003ch1 align=\"center\"\u003eagentrial\u003c/h1\u003e\n  \u003cp align=\"center\"\u003e\n    \u003cstrong\u003eThe pytest for AI agents. Run your agent 100 times, get confidence intervals instead of anecdotes.\u003c/strong\u003e\n  \u003c/p\u003e\n  \u003cp align=\"center\"\u003e\n    \u003ca href=\"https://pypi.org/project/agentrial/\"\u003e\u003cimg alt=\"PyPI\" src=\"https://img.shields.io/pypi/v/agentrial?color=blue\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://opensource.org/licenses/MIT\"\u003e\u003cimg alt=\"License: MIT\" src=\"https://img.shields.io/badge/License-MIT-yellow.svg\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://www.python.org/downloads/\"\u003e\u003cimg alt=\"Python 3.11+\" src=\"https://img.shields.io/badge/python-3.11+-blue.svg\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/alepot55/agentrial/actions\"\u003e\u003cimg alt=\"Tests\" src=\"https://img.shields.io/badge/tests-450%20passed-brightgreen\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://marketplace.visualstudio.com/items?itemName=alepot55.agentrial-vscode\"\u003e\u003cimg alt=\"VS Code\" src=\"https://img.shields.io/badge/VS%20Code-marketplace-blue\"\u003e\u003c/a\u003e\n  \u003c/p\u003e\n\u003c/p\u003e\n\nYour agent passes Monday, fails Wednesday. Same prompt, same model. LLMs show up to [72% variance across runs](https://arxiv.org/abs/2407.02100) even at temperature=0.\n\n**agentrial** runs your agent N times and gives you statistics, not luck.\n\n```bash\npip install agentrial\nagentrial init\nagentrial run\n```\n\n```\n╭──────────────────────────────────────────────────────────────────────────╮\n│ my-agent - FAILED                                                        │\n╰───────────────────────────────────────────────────────── Threshold: 85% ─╯\n┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┓\n┃ Test Case            ┃ Pass Rate ┃ 95% CI           ┃ Avg Cost ┃ Avg Latency ┃\n┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━┩\n│ easy-multiply        │    100.0% │ (72.2%-100.0%)   │  $0.0005 │       320ms │\n│ tool-selection       │     90.0% │ (59.6%-98.2%)    │  $0.0006 │       450ms │\n│ multi-step-task      │     70.0% │ (39.7%-89.2%)    │  $0.0011 │       890ms │\n│ ambiguous-query      │     50.0% │ (23.7%-76.3%)    │  $0.0008 │       670ms │\n└──────────────────────┴───────────┴──────────────────┴──────────┴─────────────┘\n\nFailure Attribution:\n  tool-selection: Step 0 — called 'calculate' instead of 'lookup_country_info' (p=0.003)\n  multi-step-task: Step 2 — missing second tool call 'calculate' after lookup (p=0.01)\n  ambiguous-query: Step 0 — tool selection inconsistent across runs (p\u003c0.001)\n\nOverall Pass Rate: 77.5% (68.4%-84.5%) — BELOW THRESHOLD\nTotal Cost: $0.0600\n```\n\nThat 100% on `easy-multiply`? Wilson CI says it's actually 72-100% with 10 trials. That `multi-step-task` at 70%? Step 2 is the bottleneck. Now you know what to fix.\n\n---\n\n## Why this exists\n\nEvery agent framework demos 90%+ accuracy. Run those agents 100 times on the same task, pass rates drop to 60-80% with wide variance. Benchmarks measure one run; production sees thousands.\n\nNo existing tool combines trajectory evaluation, multi-trial statistics, and CI/CD integration in a single open-source package. LangSmith requires paid accounts and LangChain lock-in. Promptfoo doesn't do multi-trial with confidence intervals. DeepEval and Arize don't do step-level failure attribution.\n\nagentrial fills that gap: open-source, free, local-first, works with any framework.\n\n---\n\n## Core features\n\n**Statistical rigor by default.** Every evaluation runs N trials with Wilson confidence intervals. Bootstrap resampling for cost/latency. Benjamini-Hochberg correction for multiple comparisons. No single-run pass/fail.\n\n**Step-level failure attribution.** When tests fail, agentrial compares trajectories from passing and failing runs. Fisher exact test identifies the specific step where behavior diverges. You see \"Step 2 tool selection is the problem\" instead of \"test failed.\"\n\n**Real cost tracking.** Token usage from API response metadata, not estimates. 45+ models across Anthropic, OpenAI, Google, Mistral, Meta, DeepSeek. Cost-per-correct-answer as a first-class metric — the number that actually matters for production.\n\n**Regression detection.** Fisher exact test on pass rates, Mann-Whitney U on cost/latency. Catches statistically significant drops between versions. Exit code 1 blocks your PR in CI.\n\n**Agent Reliability Score.** A single 0-100 composite metric that combines accuracy (40%), consistency (20%), cost efficiency (10%), latency (10%), trajectory quality (10%), and recovery (10%). One number to track across releases — like Lighthouse for agents.\n\n**Production monitoring.** Deploy `agentrial monitor` as a cron job or sidecar. CUSUM and Page-Hinkley detectors catch drift in pass rate, cost, and latency. Kolmogorov-Smirnov test detects distribution shifts. Alerts before users notice.\n\n**Local-first.** Data never leaves your machine. No accounts, no SaaS, no telemetry.\n\n---\n\n## Writing tests\n\n```yaml\nsuite: my-agent-tests\nagent: my_module.agent       # Python import path\ntrials: 10\nthreshold: 0.85\n\ncases:\n  - name: basic-math\n    input:\n      query: \"What is 15 * 37?\"\n    expected:\n      output_contains: [\"555\"]\n      tool_calls:\n        - tool: calculate\n\n  - name: capital-lookup\n    input:\n      query: \"What is the capital of Japan?\"\n    expected:\n      output_contains: [\"Tokyo\"]\n\n  - name: error-handling\n    input:\n      query: \"Divide 10 by 0\"\n    expected:\n      output_contains_any: [\"undefined\", \"cannot\", \"error\"]\n    max_cost: 0.05\n    max_latency_ms: 5000\n```\n\nFor complex assertions, use the fluent Python API:\n\n```python\nfrom agentrial import expect, AgentInput\n\nresult = agent(AgentInput(query=\"Book a flight to Rome\"))\n\nexpect(result).succeeded() \\\n    .tool_called(\"search_flights\", params_contain={\"destination\": \"FCO\"}) \\\n    .cost_below(0.15) \\\n    .latency_below(5000)\n```\n\nAll assertion types: `output_contains`, `output_contains_any`, `exact_match`, `regex`, `tool_calls` with `params_contain`, per-step expectations via `step_expectations`. See [full docs](https://github.com/alepot55/agentrial/wiki).\n\n---\n\n## Wrapping your agent\n\nagentrial needs a callable: `AgentInput -\u003e AgentOutput`. Native adapters handle the wiring.\n\n```python\n# LangGraph\nfrom agentrial.runner.adapters import wrap_langgraph_agent\nagent = wrap_langgraph_agent(your_compiled_graph)\n\n# CrewAI\nfrom agentrial.runner.adapters import wrap_crewai_agent\nagent = wrap_crewai_agent(crew)\n\n# Custom — implement the protocol directly\nfrom agentrial.types import AgentInput, AgentOutput, AgentMetadata\n\ndef agent(input: AgentInput) -\u003e AgentOutput:\n    return AgentOutput(\n        output=\"result\", steps=[],\n        metadata=AgentMetadata(total_tokens=100, cost=0.001, duration_ms=500.0),\n        success=True,\n    )\n```\n\n| Framework | Adapter | What it captures |\n|---|---|---|\n| **LangGraph** | `wrap_langgraph_agent` | Callbacks, trajectory, tokens, cost |\n| **CrewAI** | `wrap_crewai_agent` | Task-level trajectory, crew cost |\n| **AutoGen** | `wrap_autogen_agent` | v0.4+ and legacy pyautogen |\n| **Pydantic AI** | `wrap_pydantic_ai_agent` | Tool calls, response parts, tokens |\n| **OpenAI Agents SDK** | `wrap_openai_agents_agent` | Runner integration, tool calls |\n| **smolagents (HF)** | `wrap_smolagents_agent` | Dict and object log formats |\n| **Any OTel agent** | Automatic | Span capture via OTel SDK |\n| **Custom** | `AgentInput -\u003e AgentOutput` | Whatever you return |\n\n---\n\n## CI/CD\n\n```yaml\nname: Agent Evaluation\non: [push, pull_request]\n\njobs:\n  eval:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n      - uses: actions/setup-python@v5\n        with:\n          python-version: \"3.11\"\n      - run: pip install agentrial \u0026\u0026 pip install -e .\n      - run: agentrial run --trials 10 --threshold 0.85 -o results.json\n        env:\n          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}\n```\n\nRegression detection between runs:\n\n```bash\nagentrial compare results.json --baseline baseline.json\n```\n\nFisher exact test (p \u003c 0.05) detects statistically significant regressions. Exit code 1 blocks the PR.\n\n---\n\n## Advanced features\n\n### Trajectory flame graphs\n\n```bash\nagentrial run --flamegraph                         # Terminal\nagentrial run --flamegraph --html flamegraph.html   # Interactive HTML\n```\n\nVisualize agent execution paths across trials. See where passing and failing runs diverge, step by step.\n\n### LLM-as-Judge\n\n```bash\nagentrial run --judge\n```\n\nA second LLM evaluates response quality with calibrated scoring. Krippendorff's alpha for inter-rater reliability, t-distribution CI for score estimates. Calibration protocol runs before scoring to ensure consistency.\n\n### Snapshot testing\n\n```bash\nagentrial snapshot update     # Save current behavior as baseline\nagentrial snapshot check      # Compare against baseline\n```\n\nFisher exact test on pass rates, Mann-Whitney U on cost/latency, Benjamini-Hochberg correction across all comparisons.\n\n### MCP security scanner\n\n```bash\nagentrial security scan --mcp-config servers.json\n```\n\nAudits MCP server configurations for 6 vulnerability classes: prompt injection, tool shadowing, data exfiltration, permission escalation, rug pull, configuration weakness.\n\n### Pareto frontier\n\n```bash\nagentrial pareto --models claude-3-haiku,gpt-4o-mini,gemini-flash\n```\n\nFind the optimal cost-accuracy trade-off across models. ASCII plot in terminal.\n\n### Prompt version control\n\n```bash\nagentrial prompt track prompts/v2.txt\nagentrial prompt diff v1 v2\n```\n\nTrack, diff, and compare prompt versions with statistical significance testing between them.\n\n### Benchmark registry\n\n```bash\nagentrial publish results.json --agent-name my-agent --agent-version 1.0.0\nagentrial verify --agent-name my-agent --agent-version 1.0.0 --suite-name my-suite\n```\n\nPublish evaluation results as verifiable benchmark files with SHA-256 integrity checksums.\n\n### Multi-agent evaluation\n\nDelegation accuracy, handoff fidelity, redundancy rate, cascade failure depth, communication efficiency — five metrics for multi-agent systems.\n\n### Dashboard\n\n```bash\nagentrial dashboard\n```\n\nLocal FastAPI dashboard for browsing results, comparing runs, and tracking trends.\n\n### Eval packs\n\n```bash\nagentrial packs list\n```\n\nDomain-specific evaluation packages via Python entry points. Install a pack, get specialized test templates and evaluators.\n\n---\n\n## VS Code extension\n\nBrowse test suites, run evaluations, view flame graphs, and compare snapshots from your editor. Install from the [VS Code Marketplace](https://marketplace.visualstudio.com/items?itemName=alepot55.agentrial-vscode) or search \"agentrial\" in extensions.\n\n---\n\n## Statistical methods\n\n| Method | Purpose |\n|---|---|\n| **Wilson score interval** | Pass rate CI — accurate at 0%, 100%, and small N |\n| **Bootstrap resampling** | Cost/latency CI — non-parametric, 500 iterations |\n| **Fisher exact test** | Regression detection and failure attribution (p \u003c 0.05) |\n| **Mann-Whitney U** | Cost/latency comparison between versions |\n| **Benjamini-Hochberg** | False discovery rate control for multiple comparisons |\n| **CUSUM / Page-Hinkley** | Sequential change-point detection for production monitoring |\n| **Kolmogorov-Smirnov** | Distribution shift detection |\n| **Krippendorff's alpha** | Inter-rater reliability for LLM-as-Judge |\n\nFailure attribution works by grouping trials into pass/fail, comparing tool call distributions at each step, and identifying the step with the lowest p-value as the most likely divergence point.\n\n---\n\n## CLI reference\n\n```bash\nagentrial init                              # Scaffold sample project\nagentrial run                               # Run all tests\nagentrial run tests/ --trials 20            # Custom trials\nagentrial run -o results.json               # JSON export\nagentrial run --flamegraph                  # Trajectory flame graphs\nagentrial run --judge                       # LLM-as-Judge evaluation\nagentrial compare results.json -b base.json # Regression detection\nagentrial baseline results.json             # Save baseline\nagentrial snapshot update / check           # Snapshot testing\nagentrial security scan --mcp-config c.json # MCP security scan\nagentrial pareto --models m1,m2,m3          # Cost-accuracy Pareto frontier\nagentrial prompt track/diff/list            # Prompt version control\nagentrial monitor --baseline snap.json      # Production drift detection\nagentrial ars results.json                  # Agent Reliability Score\nagentrial publish results.json --agent-name X --agent-version Y\nagentrial verify --agent-name X --agent-version Y --suite-name Z\nagentrial packs list                        # Installed eval packs\nagentrial dashboard                         # Local dashboard\n```\n\n---\n\n## How it compares\n\n| | agentrial | Promptfoo | LangSmith | DeepEval | Arize Phoenix |\n|---|---|---|---|---|---|\n| Multi-trial with CI | **Free** | — | $39/mo | — | — |\n| Confidence intervals | Wilson CI | — | — | — | — |\n| Step-level failure attribution | Fisher exact | — | — | — | Partial |\n| Framework-agnostic | 6 adapters + OTel | Yes | LangChain only | Yes | Yes |\n| Cost-per-correct-answer | Yes | — | — | — | — |\n| LLM-as-Judge with calibration | Krippendorff α | — | Yes | Yes | — |\n| Composite reliability score | ARS (0-100) | — | — | — | — |\n| MCP security scanning | 6 vuln classes | — | — | — | — |\n| Production drift detection | CUSUM + PH + KS | — | — | — | Partial |\n| VS Code extension | Yes | — | — | — | — |\n| Local-first | Yes | Yes | No | No | Self-host option |\n\n---\n\n## Contributing\n\n```bash\ngit clone https://github.com/alepot55/agentrial.git\ncd agentrial\npip install -e \".[dev]\"\npytest                    # 450 tests\nruff check .\nmypy agentrial/\n```\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for details.\n\n---\n\n## License\n\n[MIT](LICENSE)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falepot55%2Fagentrial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falepot55%2Fagentrial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falepot55%2Fagentrial/lists"}