https://github.com/sergiparpal/hermes-mutation-runner
Hermes Agent plugin that runs mutation testing on Python code by wrapping mutmut. Exposes the mutation_test tool and returns a structured JSON with mutation score, breakdown by category (killed/survived/timeout/...) and a bounded sample of surviving mutants.
ai-agents hermes-agent hermes-plugin mutation-testing mutmut qa test-quality
- Host: GitHub
- URL: https://github.com/sergiparpal/hermes-mutation-runner
- Owner: sergiparpal
- License: gpl-3.0
- Created: 2026-05-25T19:11:59.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-05-25T20:49:35.000Z (about 1 month ago)
- Last Synced: 2026-05-25T21:25:25.967Z (about 1 month ago)
- Topics: ai-agents, hermes-agent, hermes-plugin, mutation-testing, mutmut, qa, test-quality
- Language: Python
- Homepage:
- Size: 27.3 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# hermes-mutation-runner
A [Hermes Agent](https://github.com/hermes-org/hermes) plugin that runs
[mutmut](https://mutmut.readthedocs.io/) against a Python module or package and
returns a structured JSON payload with the mutation score, the per-category
counter breakdown, a bounded sample of surviving-mutant identifiers, and the
tail of the mutmut stdout.
The plugin audits the **quality** of a test suite, not just its coverage. It
introduces small mutations (operator flips, removed `return` statements, constant
changes) and reports how many the existing tests catch. Each survivor is a
counter-example: a behaviour change that no assertion in the test suite detects.
The plugin is delivered as an external Hermes plugin (no changes to Hermes
core), provides exactly one tool — `mutation_test` — and registers no hooks,
slash commands, skills, or memory providers.
---
## Installation
1. **Drop the plugin into your Hermes user-plugins directory.** The
on-disk directory must be named `hermes-mutation-runner` (with dashes)
even though the Python package in this repo is `hermes_mutation_runner`
(with underscores). See [Directory naming](#directory-naming) below for
the rationale.
```bash
cp -r hermes_mutation_runner ~/.hermes/plugins/hermes-mutation-runner
```
2. **Enable the plugin in `~/.hermes/config.yaml`:**
```yaml
plugins:
  enabled:
    - hermes-mutation-runner
```
The identifier in `plugins.enabled` is the `name` field from
`plugin.yaml` (`hermes-mutation-runner`), not the directory name on disk.
3. **Install mutmut in the same Python interpreter that Hermes uses:**
```bash
pip install "mutmut>=3"
```
If mutmut is missing, the plugin's `check_fn` returns `False` and the
`mutation_test` tool is hidden from the LLM rather than exposed in a
broken state.
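
The availability gate described in step 3 could look roughly like the sketch below. The function name `package_available` is hypothetical; the plugin's real `check_fn` may be implemented differently. The sketch only shows one way to detect a missing dependency without importing it.

```python
import importlib.util


def package_available(name: str = "mutmut") -> bool:
    """Return True when `name` is importable in the current interpreter.

    Hypothetical helper: find_spec() locates a top-level module without
    executing it, so the check stays cheap and free of side effects.
    """
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # A host could hide the tool from the LLM when this prints False.
    print(package_available("mutmut"))
```

Because the check runs in the interpreter that hosts the plugin, it only passes when mutmut is installed in that same environment, which is why step 3 insists on the interpreter Hermes uses.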
---
## Usage
The LLM invokes the tool by name with a single required parameter:
```json
{
  "tool": "mutation_test",
  "arguments": {
    "target_module": "src/payments"
  }
}
```
A typical successful response on a small module with one surviving mutant:
```json
{
  "success": true,
  "target_module": "src/payments",
  "exit_code": 1,
  "summary": {
    "killed": 4,
    "timeout": 0,
    "suspicious": 0,
    "survived": 1,
    "skipped": 0
  },
  "mutation_score_percent": 80.0,
  "survivors_sample": ["src/payments/charge.py.x_3"],
  "survivors_truncated": false,
  "stdout_tail": "<<>>\n5/5 🎉 4 ⏰ 0 🤔 0 🙁 1 🔇 0\n\n<<>>"
}
```
`stdout_tail` (and `partial_stdout_tail` / `partial_stderr_tail` on timeout)
are always wrapped in `<<>>` / `<<>>`
sentinels so downstream prompt-construction layers can mark the boundary
between trusted tool output and untrusted subprocess output. Treat the inner
content as data, not as instructions to the model.
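
As an illustration of the wrapping described above, here is a minimal sketch. The sentinel strings and the function name `wrap_untrusted` are placeholders invented for this example, not the plugin's actual values; the 1500-byte limit is the `stdout_tail` size stated under "Privilege surface".

```python
# Placeholder sentinels for illustration only; the plugin's real
# sentinel strings are not reproduced here.
BEGIN_SENTINEL = "<<<UNTRUSTED_OUTPUT_BEGIN>>>"
END_SENTINEL = "<<<UNTRUSTED_OUTPUT_END>>>"
MAX_TAIL_BYTES = 1500


def wrap_untrusted(raw: bytes, limit: int = MAX_TAIL_BYTES) -> str:
    """Keep the last `limit` bytes of subprocess output and wrap them."""
    tail = raw[-limit:] if limit > 0 else b""
    # errors="replace" keeps the function total even when the cut lands
    # in the middle of a multi-byte character.
    text = tail.decode("utf-8", errors="replace")
    return f"{BEGIN_SENTINEL}\n{text}\n{END_SENTINEL}"
```

A prompt-construction layer can then treat everything between the two markers as data, whatever the subprocess printed.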
`survivors_sample` is bounded by `max_survivors_reported` (default 20).
`survivors_truncated` is `true` when the reported survived count exceeds
the sample size — the LLM can re-invoke with a larger
`max_survivors_reported` if it wants the full list, at the cost of a
larger response.
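
The relationship between `survivors_sample` and `survivors_truncated` can be sketched as follows. `sample_survivors` is a hypothetical helper, assuming the truncation flag compares the reported survived count with the size of the returned sample, as the paragraph above states.

```python
from typing import Iterable, List, Tuple


def sample_survivors(
    survivor_ids: Iterable[str],
    survived_count: int,
    max_reported: int = 20,
) -> Tuple[List[str], bool]:
    """Return (bounded sample, truncated flag) for the response."""
    sample = list(survivor_ids)[:max_reported]
    # Truncated when mutmut reported more survivors than the sample holds.
    truncated = survived_count > len(sample)
    return sample, truncated
```

With 25 survivors and the default bound, the sample holds 20 identifiers and the flag is `True`; a caller wanting all 25 would re-invoke with `max_survivors_reported` set to at least 25.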
---
## Tool schema
| Parameter | Type | Required | Default | Description |
|--------------------------|---------|----------|---------|-------------|
| `target_module` | string | yes | — | Path to a Python file or package, relative to the working directory. Absolute paths are accepted only if they resolve inside the working directory. |
| `timeout` | integer | no | `600` | Maximum wall-clock seconds the mutmut run may consume. Allowed range: `10`–`14400` (4 hours). The plugin kills the subprocess past this limit and returns a timeout envelope. |
| `max_survivors_reported` | integer | no | `20` | Upper bound on the number of identifiers in `survivors_sample`. Allowed range: `1`–`500`. |
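
The defaults and allowed ranges in the table could be enforced with a validator along these lines. This is a sketch only: `validate_arguments` and `_bounded_int` are hypothetical names, and raising on an out-of-range value (rather than clamping it) is an assumption, since the table does not say which the plugin does.

```python
from typing import Any, Dict, Optional


def _bounded_int(name: str, value: Optional[Any], default: int, lo: int, hi: int) -> int:
    """Return `value` if it is an int in [lo, hi], or `default` when absent."""
    if value is None:
        return default
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"{name} must be an integer")
    if not lo <= value <= hi:
        raise ValueError(f"{name} must be between {lo} and {hi}")
    return value


def validate_arguments(args: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the schema's defaults and ranges to raw tool arguments."""
    target = args.get("target_module")
    if not isinstance(target, str) or not target:
        raise ValueError("target_module is required and must be a non-empty string")
    return {
        "target_module": target,
        "timeout": _bounded_int("timeout", args.get("timeout"), 600, 10, 14400),
        "max_survivors_reported": _bounded_int(
            "max_survivors_reported", args.get("max_survivors_reported"), 20, 1, 500
        ),
    }
```

Path containment for `target_module` is a separate check and is not covered by this sketch.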
---
## Response shape
### Success
```json
{
  "success": true,
  "target_module": "<string>",
  "exit_code": <int>,
  "summary": {
    "killed": <int>, "timeout": <int>, "suspicious": <int>,
    "survived": <int>, "skipped": <int>
  },
  "mutation_score_percent": <float or null>,
  "survivors_sample": ["<mutant id>", "..."],
  "survivors_truncated": <bool>,
  "stdout_tail": "<<>>\n\n<<>>"
}
```
The mutation score formula is `100 * killed / (killed + survived + timeout + suspicious)`.
Skipped mutants are excluded from the denominator because they never executed.
The summary dict only contains the counters that actually appeared in the
mutmut output; an empty dict signals format drift (see [Caveats](#caveats)).
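
The formula can be written as a small function. The name `mutation_score_percent` mirrors the response key; returning `None` when the denominator is zero is an assumption chosen to match the `mutation_score_percent: null` case mentioned under Caveats.

```python
from typing import Dict, Optional


def mutation_score_percent(summary: Dict[str, int]) -> Optional[float]:
    """Compute 100 * killed / (killed + survived + timeout + suspicious)."""
    killed = summary.get("killed", 0)
    denominator = (
        killed
        + summary.get("survived", 0)
        + summary.get("timeout", 0)
        + summary.get("suspicious", 0)
    )
    # Skipped mutants never ran, so they are deliberately left out.
    if denominator == 0:
        return None
    return 100 * killed / denominator
```

For the example in the Usage section (4 killed, 1 survived) this gives `100 * 4 / 5 = 80.0`, and adding any number of skipped mutants leaves the score unchanged.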
### Error
```json
{
  "success": false,
  "error": "",
  "remediation": "",
  "partial_stdout_tail": "",
  "partial_stderr_tail": ""
}
```
Every documented error path returns a JSON-encoded string with the three
required keys (`success`, `error`, `remediation`). Truly exceptional errors
(interpreter crashes, the host running out of file descriptors) may still
propagate, but every known failure mode — input validation, missing mutmut,
subprocess timeout, missing interpreter, generic `OSError`, undecodable
output — produces a structured envelope.
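
A structured error envelope of this shape might be built as in the sketch below. `error_envelope` is a hypothetical helper, and omitting the `partial_*` keys when no partial output exists is an assumption; the plugin may instead always include them.

```python
import json
from typing import Optional


def error_envelope(
    error: str,
    remediation: str,
    partial_stdout_tail: Optional[str] = None,
    partial_stderr_tail: Optional[str] = None,
) -> str:
    """Return a JSON-encoded error payload with the three required keys."""
    payload = {"success": False, "error": error, "remediation": remediation}
    # Partial tails are attached only when the caller captured some output,
    # for example after a subprocess timeout.
    if partial_stdout_tail is not None:
        payload["partial_stdout_tail"] = partial_stdout_tail
    if partial_stderr_tail is not None:
        payload["partial_stderr_tail"] = partial_stderr_tail
    return json.dumps(payload, ensure_ascii=False)
```

Returning a string instead of raising keeps every known failure mode inside the same contract the LLM already handles for successful calls.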
---
## Tests
The test suite runs in a fresh virtualenv with only `pytest` installed: it
does not require either `mutmut` or `hermes-agent` to be installed. Tests that
depend on Hermes skip cleanly via `pytest.importorskip`.
```bash
cd hermes_mutation_runner
pip install pytest
pytest -v
```
You should see `88 passed, 1 skipped` (the skipped test is the
`PluginManager.discover_and_load_from` end-to-end check, which requires
`hermes_cli` on the path).
---
## Caveats
- **Python only.** mutmut only mutates Python source. There is no equivalent
for other languages in this plugin.
- **Slow.** A run on a mid-sized module can easily exceed 5 minutes. The
default `timeout` is 600 seconds (10 minutes); lower it when you know the target is small.
- **Requires a real, passing test suite.** Mutmut runs your tests once per
mutant. If `pytest` is broken on the main branch, the report is
meaningless.
- **mutmut v3 output format.** The summary parser targets the v3 emoji
counters (`🎉 killed`, `⏰ timeout`, `🤔 suspicious`, `🙁 survived`,
`🔇 skipped`). If you see `summary: {}` and
`mutation_score_percent: null` on a non-trivial module, mutmut has
probably changed its output format — check `mutmut --version` and open
an issue.
- **Mutmut writes to `.mutmut-cache`** in the working directory. The cache is
incremental by design (re-runs only mutate changed files); delete it
manually if you need a clean slate.
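
To make the format-drift caveat concrete, here is a sketch of how the v3 emoji counters could be parsed. The emoji-to-name mapping comes from the list above; the function name `parse_summary`, the regular expression, the tolerance for an optional variation selector, and the choice to keep the last occurrence of each counter are all assumptions, not the plugin's actual parser.

```python
import re
from typing import Dict

# Mapping taken from the caveat above.
COUNTER_EMOJI = {
    "killed": "🎉",
    "timeout": "⏰",
    "suspicious": "🤔",
    "survived": "🙁",
    "skipped": "🔇",
}


def parse_summary(stdout: str) -> Dict[str, int]:
    """Extract the counters that appear in mutmut's progress output."""
    summary: Dict[str, int] = {}
    for name, emoji in COUNTER_EMOJI.items():
        # Tolerate an optional variation selector (U+FE0F) after the emoji.
        matches = re.findall(re.escape(emoji) + r"\ufe0f?\s*(\d+)", stdout)
        if matches:
            # Progress lines repeat; the last one holds the final counts.
            summary[name] = int(matches[-1])
    return summary
```

If the output format changes, none of the patterns match and the result is an empty dict, which corresponds to the `summary: {}` symptom described above.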
### Privilege surface
The plugin runs with the full privileges of the Hermes process. Specifically:
- **Spawns two subprocesses** per invocation: `python -m mutmut run` (always)
and `python -m mutmut results` (only when the run reports at least one
surviving mutant). Both are launched with `shell=False`; arguments are
passed as a list — no shell interpolation.
- **Reads files inside the working directory only.** `target_module` is
resolved via `os.path.realpath` and rejected if it lands outside the cwd,
blocking both absolute paths to system files (`/etc/passwd`) and `..`-based
traversal (`../../etc/passwd`). The validated relative path is what gets
passed to mutmut, not the raw LLM input. Paths that resolve to the project
root itself (`.`, the absolute cwd) are also rejected — `target_module`
must point to a file or subpackage *inside* the project, never the project
as a whole. Control characters (NUL, newlines, tabs, U+0000–U+001F) in the
path are rejected before any filesystem access.
- **TOCTOU defense.** The plugin snapshots `(st_dev, st_ino)` of the
resolved target before spawning mutmut and re-snapshots after. If the
identity changed during the run — a symlink swap, a delete-then-recreate,
any concurrent inode change — the plugin discards the mutmut output and
returns an error envelope rather than surfacing potentially attacker-influenced
results to the LLM.
- **Writes nothing directly.** Mutmut (the subprocess) writes its cache to
`.mutmut-cache` inside the cwd; the plugin itself writes no files.
- **No network access.** No outbound HTTP, no DNS lookups.
- **No secrets read by the plugin.** The plugin does not read
`~/.hermes/.env`, `auth.json`, or any other credential store.
- **Env-var allowlist for the subprocess.** Only a curated set of
operational variables is forwarded from the parent: `PATH`, `HOME`,
`USER`, `LOGNAME`, `SHELL`, `TMPDIR`/`TMP`/`TEMP`, `LANG`, `LC_*`, `TERM`,
`PWD`, `OLDPWD`, `PYTHONPATH`, `PYTHONHOME`, `VIRTUAL_ENV`. The plugin
always sets `PYTHONDONTWRITEBYTECODE=1`. Everything else — including
secrets like `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `AWS_*`, `GITHUB_TOKEN`
— is **dropped** before the child sees it. If your test suite legitimately
needs a specific variable (e.g. a test-database URL), opt it in by
listing the names in the parent `MUTATION_RUNNER_FORWARD_ENV` env var
(comma-separated): `MUTATION_RUNNER_FORWARD_ENV=DATABASE_URL,REDIS_URL`.
- **Bounded output.** Each subprocess stream is capped at 256 KB at the pipe
layer through a threaded sliding-window reader — the child cannot OOM the
Hermes process even if its test suite prints unbounded log volume.
`stdout_tail` is then truncated to 1500 bytes for the response;
`survivors_sample` is capped at `max_survivors_reported` (hard limit 500);
`timeout` is hard-capped at 4 hours.
- **Sentinel-delimited subprocess output.** Every field that surfaces raw
mutmut output to the LLM (`stdout_tail`, `partial_stdout_tail`,
`partial_stderr_tail`) is wrapped in `<<>>` /
`<<>>` so prompt-construction layers can mark the
trust boundary. The cwd itself is treated as fully trusted: anything
mutmut chooses to print — including snippets of project source code —
will surface to the calling LLM, so do not run this plugin against a
project tree containing untrusted code.
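
Three of the defenses listed above (path containment, the identity snapshot, and the environment allowlist) can be sketched with the standard library alone. All function names here are hypothetical, and the details, such as using `os.path.commonpath` for the containment test, are assumptions about one reasonable implementation rather than a description of the plugin's code.

```python
import os
from typing import Dict, Mapping, Tuple

ALLOWED_ENV = {
    "PATH", "HOME", "USER", "LOGNAME", "SHELL", "TMPDIR", "TMP", "TEMP",
    "LANG", "TERM", "PWD", "OLDPWD", "PYTHONPATH", "PYTHONHOME", "VIRTUAL_ENV",
}


def resolve_inside_cwd(target: str, cwd: str) -> str:
    """Return `target` relative to `cwd`, or raise ValueError if unsafe."""
    # Reject control characters (U+0000 to U+001F) before touching the disk.
    if not target or any(ord(ch) < 0x20 for ch in target):
        raise ValueError("empty path or control character in target_module")
    root = os.path.realpath(cwd)
    resolved = os.path.realpath(os.path.join(root, target))
    if resolved == root:
        raise ValueError("target_module must not be the project root")
    if os.path.commonpath([root, resolved]) != root:
        raise ValueError("target_module resolves outside the working directory")
    return os.path.relpath(resolved, root)


def file_identity(path: str) -> Tuple[int, int]:
    """Snapshot (st_dev, st_ino); compare before and after the run."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def build_child_env(parent: Mapping[str, str]) -> Dict[str, str]:
    """Forward only allowlisted variables, plus explicitly opted-in names."""
    opted_in = {
        name.strip()
        for name in parent.get("MUTATION_RUNNER_FORWARD_ENV", "").split(",")
        if name.strip()
    }
    env = {
        key: value
        for key, value in parent.items()
        if key in ALLOWED_ENV or key.startswith("LC_") or key in opted_in
    }
    env["PYTHONDONTWRITEBYTECODE"] = "1"
    return env
```

A caller would validate the path first, record `file_identity` of the resolved target, run the subprocess with `env=build_child_env(os.environ)` and `shell=False`, and then compare a second `file_identity` snapshot against the first before trusting the output.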
---
## Directory naming
The plugin lives in this repository as `hermes_mutation_runner/`
(underscores) because:
- Python's import system does not accept dashes in package names.
- The plugin's `__init__.py` uses relative imports
(`from .handlers import ...`) which require the directory to be a valid
Python package.
When you copy the plugin into `~/.hermes/plugins/`, rename the directory to
`hermes-mutation-runner` (dashes) to match the canonical Hermes plugin
identifier — that is what the `name` field in `plugin.yaml` and the
`plugins.enabled` entry both refer to.
---
## Roadmap
- **v0.2** — add a `backend` parameter selecting between `mutmut` (default)
and [`cosmic-ray`](https://cosmic-ray.readthedocs.io/) for a richer
operator set and distributed runs.
- **v0.3** — compose with `hermes-coverage-history` (catalog plugin #10) to
surface week-over-week mutation-score regressions without taking on a
storage dependency in this plugin.
- **v0.4** — optional `pre_approval_request` gate, useful if mutmut ever
grows a mode that mutates source files directly (bypassing the cache).
---
## License
GPL-3.0-or-later. See `LICENSE` at the repository root.