An open API service indexing awesome lists of open source software.

https://github.com/launchapp-dev/animus-plugin-testkit

Conformance test harness + benchmarks for Animus provider plugins
https://github.com/launchapp-dev/animus-plugin-testkit

Last synced: 10 days ago
JSON representation

Conformance test harness + benchmarks for Animus provider plugins

Awesome Lists containing this project

README

          

# animus-plugin-testkit

Conformance test harness, mock CLIs, and benchmark suite for
[Animus](https://github.com/launchapp-dev) plugins
(v0.4.x stdio JSON-RPC protocol).

This repo lets a plugin author run the same set of scenarios the official
Animus CI runs, locally, without any network, API keys, or a live LLM
account. Every scenario drives the plugin through the same handshake the real
daemon uses and validates the streaming notification + response shape against
[`animus-protocol`](https://github.com/launchapp-dev/animus-protocol).

Status: **v0.3.0** — provider, subject, transport, trigger, and log-storage
plugin conformance suites; concurrent-cancel dispatcher; oai-style scenario
variants.

## Crates

| Crate | Description |
| ------------------------------------------- | ------------------------------------------------------------------- |
| `testkit-core` | Shared types: `ScenarioFile`, `TestResult`, `MatrixReport`. |
| `plugin-harness` | Bin `animus-plugin-harness` — runs scenarios against a plugin. |
| `plugin-bench` | Bin `animus-plugin-bench` — TTFT, throughput, end-to-end duration. |
| `provider-conformance` | Baseline scenarios for provider plugins (10 scenarios). |
| `subject-conformance` | Baseline scenarios for subject backend plugins (5 scenarios). |
| `transport-conformance` | Baseline scenarios for transport backend plugins (4 scenarios). |
| `trigger-conformance` | Baseline scenarios for trigger backend plugins (3 scenarios). |
| `log-storage-conformance` | Baseline scenarios for log storage backend plugins (4 scenarios). |
| `mock-cli-claude` / `-codex` / `-gemini` / | Fake LLM CLIs that emit canonical streams for each scenario. |
| `mock-cli-opencode` | |
| `mock-cli-oai` | Mock OpenAI-compatible HTTP server for `animus-provider-oai`. |

## Quickstart

```bash
# 1. Build the workspace (mocks + harness + bench).
cargo build --release --workspace

# 2. Build the plugin you want to test (here: animus-provider-claude).
cd ../animus-provider-claude
cargo build --release
cd -

# 3. Run conformance — provider (default), subject, transport, trigger, or log-storage.
./target/release/animus-plugin-harness conformance \
--plugin ../animus-provider-claude/target/release/animus-provider-claude

./target/release/animus-plugin-harness conformance --kind subject \
--plugin ../animus-subject-default/target/release/animus-subject-default

./target/release/animus-plugin-harness conformance --kind transport \
--plugin ../animus-transport-http/target/release/animus-transport-http

./target/release/animus-plugin-harness conformance --kind trigger \
--plugin ../animus-trigger-webhook/target/release/animus-trigger-webhook

./target/release/animus-plugin-harness conformance --kind log-storage \
--plugin ../animus-log-storage-file/target/release/animus-log-storage-file

# 4. (Optional) save a machine-readable report.
./target/release/animus-plugin-harness conformance \
--plugin ../animus-provider-claude/target/release/animus-provider-claude \
--report-json ./report-claude.json
```

The harness injects `CLAUDE_BIN`, `CODEX_BIN`, `GEMINI_BIN`,
`OPENCODE_BIN`, and `MOCK_SCENARIO` into the plugin's environment before
spawning, so the provider plugin transparently uses our mock CLIs instead
of the real binaries on `$PATH`. **No network, no API keys.**

## Provider scenarios

The 10 baseline scenarios live in [`scenarios/`](./scenarios/):

| Name | What it exercises |
| -------------------------- | -------------------------------------------------------------------------- |
| `streaming-short` | 3-delta short completion, final aggregated text. |
| `streaming-medium` | ~40 deltas — sanity check buffering. |
| `streaming-long` | ~300 deltas — back-pressure, large output assembly. |
| `tool-call-single` | One `tool_use` + `tool_result` round-trip surrounded by output. |
| `tool-call-parallel` | Two parallel `tool_use` blocks resolved in one envelope. |
| `tool-call-single-oai` | Stateless OpenAI-style: `ToolCall` only, host owns tool execution. |
| `tool-call-parallel-oai` | Same as above, parallel. |
| `error-recovery` | Mid-stream garbled line that the provider parser must ignore. |
| `cancellation` | Concurrent dispatcher issues `agent/cancel` mid-flight (see below). |
| `resume-session` | `agent/resume` against a prior session id. |

Plugins that don't advertise the relevant `$harness/*` capability for a
scenario are SKIPPED (not failed). Plugins opt in by adding to their
`initialize` capabilities:

- `$harness/cancellation-loop-v2` — opt-in to the v0.3.0 concurrent-cancel test
- `$harness/oai-style` — opt-in to the stateless OpenAI tool-call scenarios

## Cancellation: concurrent dispatcher (v0.3.0)

The harness now spawns a side-task per scenario when `cancel_after_ms` is
set. It watches for the first notification to learn the session id, waits
the configured delay, then issues `agent/cancel { session_id }` via the
same stdio pipe. The plugin should terminate the run with
`BackendError::Cancelled` (`REQUEST_CANCELLED`, `-32002`) or emit a
non-recoverable `error` notification within the scenario timeout.

A `fake-cancellable-plugin` test fixture lives at
`crates/plugin-harness/src/bin/fake_cancellable_plugin.rs` and is
exercised by `crates/plugin-harness/tests/cancellation.rs` to verify the
wire dance end-to-end without depending on any real provider.

## Subject / transport / trigger / log-storage conformance (v0.3.0)

Each is a separate crate that exports `pub fn baseline_scenarios() ->
Vec` plus a `pub async fn run_conformance(plugin_path:
&Path) -> Result`. External CI pipelines can depend on the
crate directly:

```toml
[dev-dependencies]
subject-conformance = { git = "https://github.com/launchapp-dev/animus-plugin-testkit", tag = "v0.3.0" }
```

| Suite | Scenarios |
| --------- | ---------------------------------------------------------------------- |
| Subject | `handshake`, `advertise-kinds`, `subject-list`, `subject-crud-round-trip`, `subject-watch-stream` |
| Transport | `handshake`, `start-shutdown`, `schema-health`, `serve-and-accept` |
| Trigger | `handshake`, `watch-fires-event`, `event-payload-shape` |
| Log storage | `handshake`, `schema-health`, `store-query-round-trip`, `tail-replay` |

Trigger backends that need an external stimulus (a webhook POST, a Slack
message, a cron tick) will SKIP `watch-fires-event` and
`event-payload-shape`. The handshake still PASSes.

Log-storage conformance runs each scenario in an isolated temporary
directory and sets `ANIMUS_LOG_FILE_PATH` to a scenario-local JSONL file
before spawning the plugin.

## Smoke tests (proof the harness works)

Captured 2026-05-24 against v0.3.0:

```text
$ animus-plugin-harness conformance --kind subject \
--plugin animus-subject-default/target/release/animus-subject-default

==> conformance report: animus-subject-default v0.1.1
kind: subject_backend protocol: 1.0.0
[PASS] handshake 0ms
[PASS] advertise-kinds 0ms
[PASS] subject-list 8ms
[PASS] subject-crud-round-trip 8ms
[PASS] subject-watch-stream 8ms
summary: total 5 passed 5 failed 0 skipped 0
OVERALL: PASS
```

```text
$ animus-plugin-harness conformance --kind transport \
--plugin animus-transport-http/target/release/animus-transport-http

==> conformance report: animus-transport-http v0.1.0
kind: transport_backend protocol: 1.0.0
[PASS] handshake 0ms
[PASS] start-shutdown 6ms
[PASS] schema-health 5ms
[PASS] serve-and-accept 83ms
summary: total 4 passed 4 failed 0 skipped 0
OVERALL: PASS
```

```text
$ animus-plugin-harness conformance --kind trigger \
--plugin animus-trigger-webhook/target/release/animus-trigger-webhook

==> conformance report: animus-trigger-webhook v0.1.1
kind: trigger_backend protocol: 1.0.0
[PASS] handshake 0ms
[SKIP] watch-fires-event 0ms
[SKIP] event-payload-shape 0ms
summary: total 3 passed 1 failed 0 skipped 2
OVERALL: PASS
```

```text
$ animus-plugin-harness conformance \
--plugin animus-provider-claude/target/release/animus-provider-claude

==> conformance report: animus-provider-claude v0.2.1
kind: provider protocol: 1.0.0
[SKIP] cancellation 34ms
[PASS] error-recovery 409ms
[PASS] resume-session 395ms
[PASS] streaming-long 402ms
[PASS] streaming-medium 391ms
[PASS] streaming-short 406ms
[PASS] tool-call-parallel 399ms
[SKIP] tool-call-parallel-oai 34ms
[PASS] tool-call-single 374ms
[SKIP] tool-call-single-oai 33ms
summary: total 10 passed 7 failed 0 skipped 3
OVERALL: PASS
```

The 3 SKIPs are expected: animus-provider-claude does not advertise the
opt-in capabilities `$harness/cancellation-loop-v2` or
`$harness/oai-style`. Provider plugins that want those tests to run
should advertise those capabilities in their `initialize` capabilities
list once their backend supports the relevant semantics.

## Adding scenarios

Drop a YAML file into `scenarios/` (or your own directory and pass
`--scenarios `):

```yaml
name: my-scenario
description: ...
timeout_ms: 8000
method: run # or `resume`
requires_capabilities: ["agent/resume"] # optional gate
cancel_after_ms: 100 # optional — triggers the concurrent-cancel dispatcher
request:
prompt: "hello"
model: claude-sonnet-4-6
expected_notifications:
- kind: output
contains: "Hello"
expected_response:
output_contains: "Hello"
exit_code: 0
mock:
tool: claude
mock_scenario: streaming-short
```

See [`docs/writing-scenarios.md`](./docs/writing-scenarios.md) for the full
matcher surface.

## Benchmarks

```bash
./target/release/animus-plugin-bench \
--plugin ../animus-provider-claude/target/release/animus-provider-claude \
--iterations 10 \
--scenario streaming-medium
```

The benchmark runner can also compare provider plugins, mock scenarios, and
model ids as a matrix:

```bash
./target/release/animus-plugin-bench \
--plugin ../animus-provider-claude/target/release/animus-provider-claude \
--plugin ../animus-provider-oai/target/release/animus-provider-oai \
--suite full \
--model claude-sonnet-4-6,gpt-5-mini \
--iterations 20 \
--warmup 2 \
--report-json ./bench-report.json \
--report-csv ./bench-summary.csv
```

Suites:

| Suite | Scenarios |
| ----- | --------- |
| `smoke` | `streaming-short` |
| `streaming` | `streaming-short`, `streaming-medium`, `streaming-long` |
| `tools` | `tool-call-single`, `tool-call-parallel` |
| `full` | `streaming`, `tools`, `error-recovery` |

Reports include TTFT (time-to-first-token), end-to-end duration, p95 latency,
notification count, output bytes, and throughput per
plugin/scenario/model cell. `--mock-scenario` remains as a backward-compatible
alias for `--scenario`. There's also a `criterion` micro-bench for the scenario
loader (`cargo bench -p plugin-bench`).

## CI Integration

See [`docs/ci-integration.md`](./docs/ci-integration.md). The shipped
[`provider-matrix.yml`](.github/workflows/matrix.yml) workflow runs the
harness against every published `launchapp-dev/animus-provider-*` repo every
Monday morning.

## Known Limitations

- **Trigger conformance is shallow without an external stimulus.** The
`webhook` backend (and any backend that needs an inbound HTTP POST,
Slack event, cron tick, ...) will SKIP `watch-fires-event` and
`event-payload-shape`. A future revision could spin up a stimulus
injector per backend kind.
- **Subject CRUD requires create+get to share state.** The harness keeps a
single plugin process alive across the round-trip so backends with
in-process state (the default task store) can be exercised. Backends
that persist to a global path may surface cross-test contamination.
- **Provider plugins must advertise opt-in capabilities** for the new
cancellation + oai-style tests to run. Plugins that don't are SKIPPED
(not failed). Update each plugin's `initialize` capabilities to opt in.
- The harness depends only on published `animus-protocol` crates — it
intentionally has **no dependency** on the in-tree `animus-cli`.

## Versioning

This release is pinned to `animus-protocol v0.1.9`. Protocol bumps in the
patch range remain compatible; minor bumps may require harness changes.

## License

MIT — see [LICENSE](./LICENSE).