{"id":50916735,"url":"https://github.com/matheusrf96/agentspec","last_synced_at":"2026-06-16T16:01:52.304Z","repository":{"id":362170269,"uuid":"1257705166","full_name":"matheusrf96/agentspec","owner":"matheusrf96","description":"Spec-driven evaluation framework for AI agents","archived":false,"fork":false,"pushed_at":"2026-06-03T00:32:18.000Z","size":66,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-03T02:09:49.952Z","etag":null,"topics":["agent-evaluation","ai-agents","benchmark","deepseek","evaluation-framework","llm-agents","llm-evaluation","mcp","openai","pytest","spec-driven","testing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/matheusrf96.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"docs/contributing.md","funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"docs/roadmap.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-06-02T23:52:43.000Z","updated_at":"2026-06-03T00:32:22.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/matheusrf96/agentspec","commit_stats":null,"previous_names":["matheusrf96/agentspec"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/matheusrf96/agentspec","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matheusrf96%2Fagentspec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matheusrf96%2Fagentspec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matheusrf96%2Fagentspec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matheusrf96%2Fagentspec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/matheusrf96","download_url":"https://codeload.github.com/matheusrf96/agentspec/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matheusrf96%2Fagentspec/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34412795,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-16T02:00:06.860Z","response_time":126,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-evaluation","ai-agents","benchmark","deepseek","evaluation-framework","llm-agents","llm-evaluation","mcp","openai","pytest","spec-driven","testing"],"created_at":"2026-06-16T16:01:50.301Z","updated_at":"2026-06-16T16:01:52.297Z","avatar_url":"https://github.com/matheusrf96.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# agentspec\n\n**Spec-driven evaluation framework for AI agents.** Define agent behaviors as YAML specs, run agents against test cases, and get scored results. Built for DeepSeek V4 and any OpenAI-compatible API.\n\n```bash\npip install -e .\nagentspec run examples/tool-calling.yaml\n```\n\n## Why\n\nTesting AI agents is hard. Agent behavior is probabilistic, tool selection is unpredictable, and output format varies. `agentspec` brings software engineering rigor to agent development with **spec-driven evaluation** — the same philosophy as spec-driven development (SDD), applied to agents.\n\n## How it works\n\n```\nspec.yaml → SpecParser → TestRunner → Agent (OpenAI-compatible API) → Assertions → Scored report\n```\n\n1. Write a YAML spec describing expected agent behavior\n2. `agentspec` runs each test case against the agent\n3. Assertions evaluate tool usage, output content, latency, and more\n4. A scored report shows pass/fail per test with details\n\n## Quick start\n\n```bash\n# Install\npip install -e .\n\n# Set your API key\nexport DEEPSEEK_API_KEY=sk-...\n\n# Run the built-in example spec\nagentspec run examples/tool-calling.yaml\n\n# Validate a spec without running\nagentspec validate examples/tool-calling.yaml\n\n# Generate a template spec\nagentspec init --name \"My Agent\" \u003e my-agent.yaml\n```\n\n## CLI reference\n\n```\nagentspec run \u003cfile\u003e            Run agent evaluation against a spec\n  --model                       Override model (default: deepseek-v4-pro)\n  --base-url                    Override API base URL\n  --verbose / -v                Show per-assertion details\n  --output terminal|json        Output format (default: terminal)\n\nagentspec validate \u003cfile\u003e       Validate spec YAML without running\n\nagentspec init                  Generate a template spec.yaml\n  --name                        Spec name (default: \"My Agent Eval\")\n  --model                       Default model (default: deepseek-v4-pro)\n```\n\n## Spec format\n\n```yaml\nname: \"Financial Agent Eval\"\nmodel: deepseek-v4-pro\nsystem_prompt: \"You are a financial assistant.\"\n\ntests:\n  - name: \"fetches stock price\"\n    prompt: \"What is AAPL at?\"\n    assertions:\n      - type: tool_called\n        tool_name: get_stock_price\n      - type: output_matches\n        pattern: '\\$\\d+\\.?\\d*'\n      - type: latency_under\n        max_seconds: 30\n\n  - name: \"handles unknown gracefully\"\n    prompt: \"What is FAKE123?\"\n    assertions:\n      - type: output_contains_any\n        values: [\"not found\", \"unknown\", \"invalid\"]\n        match: any\n```\n\n### Assertion types\n\n| Type | Fields | Description |\n|------|--------|-------------|\n| `tool_called` | `tool_name`, `args?` | Assert a specific tool was called (optionally with matching args) |\n| `output_contains` | `value`, `case_sensitive?` | Assert output contains a substring |\n| `output_contains_any` | `values`, `match: any\\|all?` | Assert output contains at least one (or all) substrings |\n| `output_matches` | `pattern` | Assert output matches a regex pattern |\n| `latency_under` | `max_seconds` | Assert response time under threshold |\n| `output_json_schema` | `schema` | Assert output is valid JSON matching a JSON Schema |\n\n## Configuration\n\n| Environment variable | Default | Description |\n|---------------------|---------|-------------|\n| `DEEPSEEK_API_KEY` | — | API key for DeepSeek (or other provider) |\n| `LLM_BASE_URL` | `https://api.deepseek.com` | API base URL for any OpenAI-compatible provider |\n\nOverride per-run with `--model` and `--base-url` flags.\n\n## Architecture\n\n```\n┌─────────────────────────────────────────────────────┐\n│ spec.yaml                                            │\n│  ├── name, model, system_prompt                       │\n│  └── tests[]                                         │\n│       ├── name, prompt                                │\n│       └── assertions[]                                │\n└─────────┬───────────────────────────────────────────┘\n          │ spec.parse()\n          ▼\n┌─────────────────────┐\n│  TestRunner          │\n│  ├── iterates tests  │\n│  └── calls adapter   │\n└─────────┬───────────┘\n          │ adapter.run()\n          ▼\n┌──────────────────────┐\n│  Agent (DeepSeek V4)  │\n│  └── returns response │\n└─────────┬────────────┘\n          │ evaluate_assertion()\n          ▼\n┌──────────────────────┐\n│  Scorer + Reporter    │\n│  └── terminal / JSON  │\n└──────────────────────┘\n```\n\n### Adding new assertion types\n\n1. Add the enum value to `AssertionType` in `agentspec/spec.py`\n2. Create a Pydantic model for the assertion with `type: Literal[AssertionType.NEW_TYPE]`\n3. Add it to the `Assertion` union type\n4. Add an `_eval_new_type()` function in `agentspec/assertions.py`\n5. Add the match case to `evaluate_assertion()`\n\n### Adding new adapters\n\n1. Create a new file in `agentspec/adapters/` that inherits `AgentAdapter`\n2. Implement the `run()` async method\n3. Add `--adapter` CLI option to `agentspec/cli.py`\n\n## Development\n\n```bash\n# Install with dev dependencies\npip install -e \".[dev]\"\n\n# Run tests\npython -m pytest tests/ -v\n\n# Run tests with coverage\npython -m pytest tests/ --cov=agentspec\n```\n\n## Roadmap\n\nSee the full [roadmap](docs/roadmap.md) in the docs for planned features across all tiers.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmatheusrf96%2Fagentspec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmatheusrf96%2Fagentspec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmatheusrf96%2Fagentspec/lists"}