{"id":50614909,"url":"https://github.com/deevus/dokimasia","last_synced_at":"2026-06-06T07:04:11.070Z","repository":{"id":358048005,"uuid":"1239564470","full_name":"deevus/dokimasia","owner":"deevus","description":"Generic agent end-to-end evaluation harness","archived":false,"fork":false,"pushed_at":"2026-05-15T11:42:01.000Z","size":88,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-15T13:32:33.152Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/deevus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-15T08:04:19.000Z","updated_at":"2026-05-15T11:42:05.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/deevus/dokimasia","commit_stats":null,"previous_names":["deevus/dokimasia"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/deevus/dokimasia","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deevus%2Fdokimasia","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deevus%2Fdokimasia/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deevus%2Fdokimasia/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deevus%2Fdokimasia/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/deevus","download_url":"https://codeload.github.com/deevus/dokimasia/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deevus%2Fdokimasia/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33972420,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-06T02:00:07.033Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-06T07:04:10.450Z","updated_at":"2026-06-06T07:04:11.056Z","avatar_url":"https://github.com/deevus.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Dokimasia\n\nDokimasia is a pytest-first acceptance testing harness for agent capabilities.\n\nUse it to prove that an agent can safely use an MCP server, CLI, skill, or workflow before you roll that capability out to developers or employees.\n\nA passing Dokimasia test gives you three kinds of evidence:\n\n1. **Capability evidence** — the agent discovered and used the intended capability, skill, MCP server, tool, or workflow.\n2. **Audited operation evidence** — the external operation happened through an approved or audited path.\n3. **Independent domain oracle** — the final business or system state was verified outside the agent's trace or claims.\n\nDokimasia is not a model benchmark. It is CI-style acceptance testing for agent capabilities that mutate real systems.\n\n## Why Dokimasia?\n\nTeams are giving agents access to MCP servers, internal CLIs, reusable skills, and workflow automations. Before broad rollout, they need evidence that agents use those capabilities correctly, safely, and observably.\n\nA trace alone is not enough. A passing CLI or API integration test is not enough. A good final answer from the model is not enough. Dokimasia combines capability evidence, audited operations, and independent state verification in ordinary pytest suites.\n\n## What Dokimasia is not\n\n- Not a model benchmark for comparing model quality.\n- Not just an observability or trace viewer.\n- Not just a CLI/API integration test.\n- Not a generic eval harness trying to grade every agent behavior.\n\nDokimasia tests whether an agent can use a specific capability safely and produce the expected external state.\n\n## Good fit\n\nUse Dokimasia when you need to test whether an agent can:\n\n- use an internal MCP server before enabling it for employees;\n- use a coding-agent skill before rolling it out to developers;\n- create or modify records in systems such as GitHub, Jira, Linear, Slack, AWS, or ServiceNow;\n- follow approved CLIs, tools, or workflows instead of ad hoc shell or API paths.\n\n## Domain-neutral by design\n\nDokimasia provides generic mechanics: agent turns, artifacts, traces, command spies, layout helpers, environment helpers, and cleanup safety checks.\n\nYour project provides provisioning, audit normalization, state verification, and domain-specific fixtures. This keeps the core independent of GitHub, Forgejo, Jira, Slack, or any other product while letting each suite encode its own business oracle.\n\n## Python usage\n\nAuthor Dokimasia suites as ordinary pytest modules. Installing Dokimasia registers its pytest plugin, so test modules can request the `doki_factory` and `doki` fixtures directly. Use plain Python setup code, project-owned fixtures, pytest marks, and normal pytest assertions instead of loading declarative scenario files:\n\n```python\nfrom pathlib import Path\n\nimport pytest\n\nfrom dokimasia.agents.pi import PiAdapter\nfrom dokimasia.pytest import assert_invoked, cmd\n\nISSUE_CREATE = cmd.match(\"tea\", pattern=[(\"issues\", \"issue\"), \"create\"])\n\n\n@pytest.mark.agent_e2e\ndef test_agent_creates_issue(doki_factory, prepared_repo):\n    doki = doki_factory(\n        agent=PiAdapter(skills_dir=Path(\"skills\")),\n        workspace=prepared_repo,\n    )\n\n    result = doki.run(\"Create the requested issue\")\n\n    assert result.ok, result.failure_summary\n    assert result.has_skill_loaded(\"create-issue\")\n    assert_invoked(result, ISSUE_CREATE)\n```\n\nProject suites provide provisioning, audit normalization, independent state verification, and fixtures for their own domain objects.\n\nFor Pi skill tests, `skills_dir` points at the skills under test. Dokimasia passes that directory to Pi and uses reads of `SKILL.md` files under it as skill-loaded evidence for assertions such as `result.has_skill_loaded(...)`.\n\nConfigure named built-in agent CLI options at `doki_factory` creation time. `model` and `extra_args` work with `claude-code`:\n\n```python\ndoki = doki_factory(\n    agent=\"claude-code\",\n    model=\"claude-sonnet-4\",\n    extra_args=[\"--allowedTools\", \"Read\"],\n)\n```\n\nFor Pi skill tests, configure Pi-specific CLI options on the explicit adapter:\n\n```python\ndoki = doki_factory(\n    agent=PiAdapter(\n        skills_dir=Path(\"skills\"),\n        provider=\"anthropic\",\n        model=\"claude-sonnet-4\",\n        thinking=\"high\",\n        extra_args=[\"--models\", \"claude-*\"],\n    ),\n)\n```\n\nThe same options can be set with environment variables. `doki_factory(env={...})` overrides the pytest process environment, and explicit `doki_factory` or adapter arguments override both:\n\n```bash\nDOKIMASIA_MODEL=deepseek/deepseek-v4-flash uv run pytest\n```\n\nSupported variables are `DOKIMASIA_PROVIDER`, `DOKIMASIA_MODEL`, `DOKIMASIA_THINKING`, and `DOKIMASIA_EXTRA_ARGS`. `DOKIMASIA_EXTRA_ARGS` is shell-split, so quoted values are preserved.\n\nFor custom adapters, instantiate and configure the adapter directly, then pass it as `agent=`.\n\n## Pytest invocation matchers\n\nUse `dokimasia.pytest.cmd` to define static matchers for observed executable invocations. An invocation can be a PATH-spied host command such as `tea`, or a repo-relative action script such as `actions/issues/lock.py`. Matchers are safe to create at module import time:\n\n```python\nfrom dokimasia.pytest import assert_invoked, cmd\n\nISSUE_CREATE = cmd.match(\n    \"tea\",\n    pattern=[(\"issues\", \"issue\", \"i\"), (\"create\", \"c\")],\n)\nLOCK_ACTION = cmd.match(\n    \"actions/issues/lock.py\",\n    pattern=[\"1\", \"spam\"],\n    mode=\"exact\",\n)\n\nassert ISSUE_CREATE.matches({\"executable\": \"tea\", \"argv\": [\"--repo\", \"org/repo\", \"issue\", \"create\"]})\nassert LOCK_ACTION.matches({\"action\": \"actions/issues/lock.py\", \"argv\": [\"1\", \"spam\"]})\n```\n\n`pattern=` accepts token groups; `patterns=` accepts explicit alternatives. Matching modes are `ordered` (default, gaps allowed), `contains` (unordered containment), `span` (contiguous span), `prefix`, and `exact`. Use `where=` for custom predicates and `label=` to override generated labels such as `tea.issues.create`.\n\nUse `assert_invoked(result, matcher)` as the preferred general assertion against observed `result.commands` in pytest tests. By default it requires at least one successful matching invocation. Use keyword-only `times=`, `min=`, `max=`, and `exit=\"success\" | \"failure\" | \"any\"` for count and exit-status constraints:\n\n```python\nassert_invoked(result, ISSUE_CREATE)\nassert_invoked(result, LOCK_ACTION, times=1)\nassert_invoked(result, LOCK_ACTION, max=0, exit=\"any\")  # did not run\n```\n\n`assert_command_ran(result, matcher)` remains available for compatibility with suites that only assert command-shaped evidence. New suites should prefer `assert_invoked(...)` because PATH spies and file spies both produce normalized invocation evidence on `result.commands`.\n\nStatic spy specs declare wrappers for pytest suites that need audited host commands earlier in `PATH`:\n\n```python\nfrom dokimasia.pytest import cmd\n\nTEA = cmd.spy(\"tea\")\nISSUE_CREATE = TEA.match(pattern=[(\"issues\", \"issue\", \"i\"), (\"create\", \"c\")])\n\ndef test_issue_flow(doki_factory):\n    doki = doki_factory(spies=[TEA])\n```\n\n`doki_factory(spies=[...])` resolves the real executable before adding wrapper directories to `PATH`, materializes wrappers under the fixture artifact area, and only prepends the spy `bin` directory when spies are explicitly registered. `cmd.spy(\"name\")` records audit events with `source=\"name\"` by default; pass `source=` when the audit source should differ from the executable name.\n\nInvocation evidence proves that an approved executable path was used. It is not a substitute for independent domain-state verification. A good end-to-end test still asserts the resulting issue, ticket, cloud resource, or other business state through an oracle outside the agent's trace and stdout.\n\n\n## Agent tool-call assertions\n\nUse generic tool-call evidence when an acceptance suite needs to prove that an\nagent selected the intended agent-side tool. Dokimasia adapters expose these\ncalls as normalized `TraceEvent(kind=\"tool.call\")` entries on\n`result.trace_events`, and pytest helpers make the common assertions concise:\n\n```python\nfrom dokimasia.pytest import assert_tool_called, assert_tool_not_called, tool_calls\n\nresult = doki.run(\"Inspect checkout flow before editing\")\n\nassert_tool_called(result, tool=\"get_file_skeleton\")\nassert_tool_called(\n    result,\n    tool=\"get_function\",\n    where=lambda event: \"finalizeCheckout\" in event.raw.get(\"args\", {}).get(\"function_names\", []),\n)\nassert_tool_not_called(result, tool=\"read\")\n```\n\n`tool_calls(result, tool=..., where=...)` returns the matching trace events\nunchanged, so suites can inspect raw adapter arguments or write ordering checks.\nThe `raw` shape is adapter-specific; use it for suite-local checks rather than\nportable Dokimasia contracts.\n`assert_tool_called(...)` supports `times=`, `min=`, and `max=` count\nconstraints. `assert_tool_not_called(...)` is shorthand for requiring zero\nmatching calls.\n\nTool-call evidence proves agent tool selection. It is not a substitute for\ncommand invocation evidence, MCP operation evidence, or independent domain-state\nverification.\n\n## MCP acceptance testing\n\nUse normalized MCP evidence when an acceptance suite needs to prove that an agent\ncalled the intended MCP server and tool. Dokimasia exposes adapter-independent\nMCP calls on `result.mcp_calls` and assertion helpers such as\n`assert_mcp_called(...)` and `assert_mcp_not_called(...)`.\n\nSee `docs/mcp-acceptance.md` for the full MCP acceptance testing workflow,\nincluding Claude Code `mcp__server__tool` traces, Pi MCP extension support,\n`nicobailon/pi-mcp-adapter` proxy and direct-tool modes, the optional role of MCP\nproxies, `doki-ledger` oracle tests, and opt-in live-agent E2E configuration.\n\n## File spies for action scripts\n\nUse file spies when the executable under test is a repo-relative action script rather than a host command discovered through `PATH`. The suite replaces the action file in the disposable test workspace with a wrapper. The wrapper forwards to the real project-owned script, records a JSONL event through `DOKIMASIA_COMMAND_LOG`, and preserves the real script's exit code. Production action scripts do not import Dokimasia; only the test workspace wrapper depends on Dokimasia's generated instrumentation.\n\nPython action example:\n\n```python\nfrom dokimasia.pytest import assert_invoked, cmd\nfrom dokimasia.suite import create_file_spy\n\nLOCK_ACTION = cmd.match(\"actions/issues/lock.py\", pattern=[\"1\", \"spam\"], mode=\"exact\")\n\n\ndef test_agent_locks_issue(doki_factory, workspace_repo, real_repo):\n    create_file_spy(\n        wrapper_path=workspace_repo / \"actions/issues/lock.py\",\n        real_executable=real_repo / \"actions/issues/lock.py\",\n        invocation_name=\"actions/issues/lock.py\",\n        source=\"issue-action\",\n    )\n\n    result = doki_factory(workspace=workspace_repo).run(\"Lock issue 1 as spam\")\n\n    assert result.ok, result.failure_summary\n    assert_invoked(result, LOCK_ACTION)\n```\n\nNode action example:\n\n```python\nfrom dokimasia.pytest import assert_invoked, cmd\nfrom dokimasia.suite import create_node_file_spy\n\nLOCK_ACTION = cmd.match(\"actions/issues/lock.js\", pattern=[\"1\", \"spam\"], mode=\"exact\")\n\n\ndef test_agent_locks_issue_with_node_action(doki_factory, workspace_repo, real_repo):\n    create_node_file_spy(\n        wrapper_path=workspace_repo / \"actions/issues/lock.js\",\n        real_script=real_repo / \"actions/issues/lock.js\",\n        invocation_name=\"actions/issues/lock.js\",\n        source=\"issue-action\",\n        node_runner=\"node\",\n    )\n\n    result = doki_factory(workspace=workspace_repo).run(\"Lock issue 1 as spam\")\n\n    assert result.ok, result.failure_summary\n    assert_invoked(result, LOCK_ACTION)\n```\n\nShell action example:\n\n```python\nfrom dokimasia.pytest import assert_invoked, cmd\nfrom dokimasia.suite import create_shell_file_spy\n\nLOCK_ACTION = cmd.match(\"actions/issues/lock.sh\", pattern=[\"1\", \"spam\"], mode=\"exact\")\n\n\ndef test_agent_locks_issue_with_shell_action(doki_factory, workspace_repo, real_repo):\n    create_shell_file_spy(\n        wrapper_path=workspace_repo / \"actions/issues/lock.sh\",\n        real_script=real_repo / \"actions/issues/lock.sh\",\n        invocation_name=\"actions/issues/lock.sh\",\n        source=\"issue-action\",\n        shell_runner=\"sh\",\n    )\n\n    result = doki_factory(workspace=workspace_repo).run(\"Lock issue 1 as spam\")\n\n    assert result.ok, result.failure_summary\n    assert_invoked(result, LOCK_ACTION)\n```\n\nThe recorded event includes normalized fields such as `action`, `argv`, `cwd`, `pid`, `phase`, `source`, `exit_code`, and `timestamp`. `Doki.run(...)` loads those events into the returned result, so assertions use `assert_invoked(result, LOCK_ACTION)` directly rather than reading audit files in the test.\n\n## Examples\n\n### Real-world suites\n\n- [`tea-skills`](https://github.com/deevus/tea-skills) uses Dokimasia to test Pi skills for Forgejo/Gitea workflows. The suite runs Pi against real skill files in a mocked repository, wraps the `tea` CLI and bundled action scripts as audited invocations, and verifies the final issue/dependency/lock state with project-owned oracles. See [`tests/e2e/test_agent_e2e.py`](https://github.com/deevus/tea-skills/blob/main/tests/e2e/test_agent_e2e.py).\n\n  ```python\n  result = doki.run('Create a Forgejo issue titled \"...\"')\n\n  assert result.ok, result.failure_summary\n  assert result.has_skill_loaded(\"create-issue\")\n  assert_invoked(result, ISSUE_CREATE, times=1)\n  assert_single_issue_matches(mock_tea.load_state()[\"issues\"], title=title, body=body)\n  ```\n\n- [`pi-wayfinder`](https://github.com/deevus/pi-wayfinder) uses Dokimasia to test agent tool choice. The suite gives Pi a small source file and asserts that it uses Wayfinder's structured navigation tools before falling back to broad file reads. See [`tests/agent/test_wayfinder_tool_choice.py`](https://github.com/deevus/pi-wayfinder/blob/main/tests/agent/test_wayfinder_tool_choice.py).\n\n  ```python\n  result = doki.run(\n      \"Explore src/checkout.ts and explain how finalizeCheckout computes its result.\"\n  )\n\n  assert result.ok, result.failure_summary\n  assert_tool_called(result, tool=\"get_file_skeleton\", where=_targets_checkout_source)\n  assert_tool_called(result, tool=\"get_function\", where=_targets_finalize_checkout)\n  ```\n\n### Deterministic MCP fixture\n\n`doki-ledger` is a local stateful MCP server example for acceptance suites that need a deterministic MCP mutation target. It exposes a `record_transaction` tool, persists ledger entries to a pytest-controlled JSON file, and provides Python oracle helpers such as `balance_cents()` and `read_entries()` so tests can verify final state without trusting an agent trace. See `examples/doki-ledger/README.md`.\n\n\n## Suite authoring helpers\n\nThe `dokimasia.suite` namespace contains generic suite assembly helpers. These helpers cover common mechanics that many end-to-end suites need while staying independent of any product, service, CLI, issue tracker, or project workflow.\n\nAvailable helper modules:\n\n- `dokimasia.suite.layout` creates run ids and artifact directories.\n- `dokimasia.suite.spy` creates audited command wrappers that can be prepended to `PATH`.\n- `dokimasia.suite.safety` checks caller-supplied cleanup policies before deleting disposable resources.\n- `dokimasia.suite.env` composes `PATH` values and discovers required host executables.\n\nProjects provide provisioning, audit normalization, and state verification. Project-specific resource names, executable choices, audit roots, and state assertions stay in the project suite. Dokimasia only provides the generic helper boundary that suite authors compose around those project-specific functions.\n\nA typical suite composes the helpers in this order:\n\n```python\nfrom pathlib import Path\n\nfrom dokimasia.suite.env import env_with_path_prepend, require_executable\nfrom dokimasia.suite.layout import create_run_id, prepare_run_root, prepare_scenario_dir\nfrom dokimasia.suite.safety import assert_scoped_disposable_name\nfrom dokimasia.suite.spy import create_spy\n\nrun_id = create_run_id()\nrun_root = prepare_run_root(Path(\".e2e-artifacts\"), run_id)\nscenario_dir = prepare_scenario_dir(run_root / \"artifacts\", \"Create record\")\n\nresource_name = f\"suite-{run_id}\"\nassert_scoped_disposable_name(resource_name, required_prefix=\"suite-\", run_id=run_id)\n\nreal_cli = require_executable(\"example-cli\")\nspy = create_spy(\n    root=run_root / \"spy\",\n    executable_name=\"example-cli\",\n    real_executable=real_cli,\n    audit_log=scenario_dir / \"audit.jsonl\",\n    source=\"example-cli\",\n)\nenv = env_with_path_prepend(spy.path_prefix)\n```\n\nThe example uses placeholder names only. Real suites should keep domain-specific provisioning, command normalization, and state assertions outside Dokimasia.\n\n## Suite layout helpers\n\nUse layout helpers for domain-neutral run ids and artifact directories:\n\n```python\nfrom pathlib import Path\n\nfrom dokimasia.suite.layout import create_run_id, prepare_run_root, prepare_scenario_dir\n\nrun_id = create_run_id()\nrun_root = prepare_run_root(Path(\".e2e-artifacts\"), run_id)\nscenario_dir = prepare_scenario_dir(run_root / \"artifacts\", \"Create record\")\n```\n\n`prepare_run_root` creates `\u003cbase\u003e/\u003crun-id\u003e`. `prepare_scenario_dir` creates a safe hyphenated directory name such as `Create-record`.\n\n\n## Suite safety helpers\n\nUse safety helpers before destructive cleanup of disposable resources. The caller supplies the suite policy so Dokimasia stays domain-neutral:\n\n```python\nfrom dokimasia.suite.safety import assert_scoped_disposable_name\n\nassert_scoped_disposable_name(\n    \"suite-abc123\",\n    required_prefix=\"suite-\",\n    run_id=\"abc123\",\n)\n```\n\nThe helper raises `ValueError` when the name is missing the required prefix or run id, and the error includes the refused resource name.\n\n\n## Suite environment helpers\n\nUse environment helpers when a suite needs to compose PATH values or require a host executable:\n\n```python\nfrom dokimasia.suite.env import env_with_path_prepend, require_executable\n\nreal_cli = require_executable(\"example-cli\")\nenv = env_with_path_prepend(\".e2e-artifacts/run/spy/bin\", {\"PATH\": \"/usr/bin\"})\n```\n\n`env_with_path_prepend` preserves existing `PATH` content and handles empty or absent values. `require_executable` raises `FileNotFoundError` with the missing executable name when lookup fails.\n\n## Suite command spy\n\nUse `create_spy` when a suite needs to put an audited wrapper earlier in `PATH` while forwarding to the real executable:\n\n```python\nfrom pathlib import Path\nimport os\n\nfrom dokimasia.suite.spy import create_spy\n\nspy = create_spy(\n    root=Path(\".e2e-artifacts/run/spy\"),\n    executable_name=\"example-cli\",\n    real_executable=Path(\"/usr/local/bin/example-cli\"),\n    audit_log=Path(\".e2e-artifacts/run/audit.jsonl\"),\n    source=\"example-cli\",\n)\n\nenv = spy.env_with_path(os.environ)\n```\n\nThe wrapper records JSONL invocation events with `source`, `argv`, `cwd`, `pid`, `phase`, `exit_code`, and `timestamp`, then exits with the real executable's status.\n\n## Development setup\n\n```bash\nuv sync\n```\n\nRun tests:\n\n```bash\nuv run pytest\n```\n\nRun package commands inside the uv-managed environment:\n\n```bash\nuv run python -c \"import dokimasia; print(dokimasia.__name__)\"\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeevus%2Fdokimasia","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeevus%2Fdokimasia","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeevus%2Fdokimasia/lists"}