https://github.com/sbroenne/pytest-aitest
A pytest plugin for validating whether language models can actually understand and operate your interfaces: MCP servers, system prompts, agent skills and tools.
- Host: GitHub
- URL: https://github.com/sbroenne/pytest-aitest
- Owner: sbroenne
- License: MIT
- Created: 2026-02-01T08:05:34.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-02-11T21:31:32.000Z (5 months ago)
- Last Synced: 2026-02-12T04:46:33.606Z (5 months ago)
- Topics: agents, ai, llm, mcp, model-context-protocol, pytest, python, testing
- Language: Python
- Size: 1.84 MB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md
# pytest-aitest
**Test your AI interfaces. AI analyzes your results.**
A pytest plugin for validating whether language models can understand and operate your MCP servers, tools, prompts, and skills.
## Why?
Your MCP server passes all unit tests. Then an LLM tries to use it and picks the wrong tool, passes garbage parameters, or ignores your system prompt.
**Because you tested the code, not the AI interface.** For LLMs, your API is tool descriptions, schemas, and prompts — not functions and types. Traditional tests can't validate them.
## How It Works
Write tests as natural language prompts. An **Agent** bundles an LLM with your tools — you assert on what happened:
```python
from pytest_aitest import Agent, Provider, MCPServer


async def test_weather_query(aitest_run):
    # Bundle a model with the MCP server under test.
    agent = Agent(
        provider=Provider(model="azure/gpt-5-mini"),
        mcp_servers=[MCPServer(command=["python", "-m", "my_weather_server"])],
    )

    # Send a natural-language prompt, then assert on what the agent did.
    result = await aitest_run(agent, "What's the weather in Paris?")
    assert result.success
    assert result.tool_was_called("get_weather")
```
If the test fails, your tool descriptions need work — not your code.
## AI-Powered Reports
AI analyzes your results and tells you **what to fix**: which model to deploy, how to improve tool descriptions, where to cut costs. [See a sample report →](https://sbroenne.github.io/pytest-aitest/reports/05_hero.html)
> **Deploy: gpt-5-mini** — Highest pass rate at ~4–6x lower cost than gpt-4.1. gpt-4.1 disqualified due to failed core transfer test and session-planning failure.
## Quick Start
Install:
```bash
uv add pytest-aitest
```
Configure in `pyproject.toml`:
```toml
[tool.pytest.ini_options]
addopts = """
--aitest-summary-model=azure/gpt-5.2-chat
"""
```
Set credentials and run:
```bash
export AZURE_API_BASE=https://your-resource.openai.azure.com/
az login
pytest tests/
```
## Features
- **MCP Server Testing** — Real models against real tool interfaces
- **CLI Server Testing** — Wrap CLIs as testable tool servers
- **Agent Comparison** — Compare models, prompts, skills, and server versions
- **Agent Leaderboard** — Auto-ranked by pass rate and cost
- **Multi-Turn Sessions** — Test conversations that build on context
- **AI Analysis** — Actionable feedback on tool descriptions, prompts, and costs
- **100+ LLM Providers** — Any model via [LiteLLM](https://docs.litellm.ai/docs/providers) (Azure, OpenAI, Anthropic, Google, and more)
- **Semantic Assertions** — AI judge via [pytest-llm-assert](https://github.com/sbroenne/pytest-llm-assert)
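
To make "auto-ranked by pass rate and cost" concrete, here is a minimal sketch of one plausible ordering: highest pass rate first, with ties broken by lowest cost. This README does not describe the plugin's actual ranking logic, so `AgentStats`, `rank_agents`, and the sample numbers are all hypothetical and are not part of the pytest-aitest API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentStats:
    """Hypothetical per-agent summary; not a pytest-aitest type."""

    name: str
    passed: int
    total: int
    cost_usd: float

    @property
    def pass_rate(self) -> float:
        # Treat an agent with no recorded tests as a 0% pass rate.
        return self.passed / self.total if self.total else 0.0


def rank_agents(stats: list[AgentStats]) -> list[AgentStats]:
    """Order agents by pass rate (descending), then by cost (ascending)."""
    return sorted(stats, key=lambda s: (-s.pass_rate, s.cost_usd))


if __name__ == "__main__":
    # Made-up illustration data, not real benchmark results.
    board = rank_agents(
        [
            AgentStats("model-a", passed=9, total=10, cost_usd=0.50),
            AgentStats("model-b", passed=10, total=10, cost_usd=0.40),
            AgentStats("model-c", passed=10, total=10, cost_usd=0.10),
        ]
    )
    for position, s in enumerate(board, start=1):
        print(f"{position}. {s.name}: {s.pass_rate:.0%} at ${s.cost_usd:.2f}")
```

Under this ordering, a cheaper model only outranks a more expensive one when their pass rates are equal; a real leaderboard might weigh the two differently.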
## Documentation
📚 **[Full Documentation](https://sbroenne.github.io/pytest-aitest/)**
## Requirements
- Python 3.11+
- pytest 9.0+
- An LLM provider (Azure, OpenAI, Anthropic, etc.)
## Acknowledgments
Inspired by [agent-benchmark](https://github.com/mykhaliev/agent-benchmark).
## License
MIT