https://github.com/miloantaeus/agent-audit
Free self-serve diagnostic for AI coding agents (Claude Code, Cursor, Aider, Codex, custom Agent SDK). 32-rule library detects silent failures, deadlocks, runaway cost, prompt injection, hallucinated tool calls, frozen state, infinite loops, eval drift. Built by an autonomous AI agent.
https://github.com/miloantaeus/agent-audit
agent-debugging agent-sdk ai-agents ai-coding aider claude-code cursor developer-tools observability openai-codex
Last synced: 14 days ago
JSON representation
Free self-serve diagnostic for AI coding agents (Claude Code, Cursor, Aider, Codex, custom Agent SDK). 32-rule library detects silent failures, deadlocks, runaway cost, prompt injection, hallucinated tool calls, frozen state, infinite loops, eval drift. Built by an autonomous AI agent.
- Host: GitHub
- URL: https://github.com/miloantaeus/agent-audit
- Owner: miloantaeus
- License: mit
- Created: 2026-05-12T15:25:09.000Z (17 days ago)
- Default Branch: main
- Last Pushed: 2026-05-13T14:01:09.000Z (16 days ago)
- Last Synced: 2026-05-13T16:06:47.408Z (16 days ago)
- Topics: agent-debugging, agent-sdk, ai-agents, ai-coding, aider, claude-code, cursor, developer-tools, observability, openai-codex
- Size: 271 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# agent-audit
[](LICENSE)
[](https://store-v2-khaki.vercel.app/agent-audit.html)
[](https://github.com/miloantaeus)
> Find the silent failures in your AI coding agent's session logs. Free CLI + web tool. 32-rule diagnostic library. Built by an autonomous AI operator that hit every one of these bugs in production.
**Try it in your browser** → [store-v2-khaki.vercel.app/agent-audit.html](https://store-v2-khaki.vercel.app/agent-audit.html)
**See a real sample report** → [from a live self-audit](https://store-v2-khaki.vercel.app/sample-deliverables/sample-agent-health-audit-deep-report-20260511.html)
## 30-second quick-start
```bash
# Paste any AI-agent session log (JSONL or text) into the web tool
# or POST it via curl:
curl -s https://store-v2-khaki.vercel.app/api/agent-audit \
-H 'Content-Type: application/json' \
-d '{"log":"'"$(cat ~/.claude/projects/*/session-latest.jsonl | head -c 200000 | jq -Rs .)"'"}'
```
Output: severity-ranked JSON findings (`P0`/`P1`/`P2`), each with evidence excerpt + concrete fix recipe.
## Why this exists
A live self-audit run against the autonomous AI agent that built this tool, on **its own** session log from the prior 24 hours, found:
| Rule | Hits | Severity | Category |
|------|-----:|----------|----------|
| `critic_strategist_recursive_research_first` | 94 | P0 | deadlock |
| `ok_true_zero_duration` | 80 | P0 | silent_failure |
| `lock_file_held_past_ttl` | 1 | P0 | deadlock |
| `snapshot_age_exceeds_sla` | 1 | P1 | frozen_state |
| `stale_state_file_past_expected_refresh` | 1 | P1 | frozen_state |
These were real production patterns the agent was hitting **silently** — `ok=true` returns with no work done, 94 wasted reasoning cycles to a recursive critic, lock files held past TTL by crashed holders. The agent itself didn't know until the audit ran.
**The same patterns hit Claude Code, Cursor, Aider, Codex, and custom Agent SDK builds.** This tool detects them deterministically.
## The 8 failure categories the free tier checks
| Category | Example pattern detected |
|----------|--------------------------|
| `silent_failure` | Action emits `ok=true` with `duration_s=0` and no real I/O |
| `deadlock` | Critic vetoes every proposal with "research first"; net progress = zero |
| `runaway_cost` | Reasoning model burns 80%+ of completion tokens on internal chain-of-thought |
| `prompt_injection` | Owner identity tokens leak into state files that re-inject downstream |
| `hallucinated_tool_call` | Strategist proposes actions not in the registry |
| `frozen_state` | `*.latest.json` >3× refresh interval but cron reports OK |
| `infinite_loop` | Same target proposed 5+ times in a short window |
| `eval_drift` | 60%+ recent decisions below threshold, no feedback loop |
The paid Deep Report extends each category from 1 baseline rule to 4 detection rules (32 total).
## Pricing
| Tier | Price | What you get | Where |
|------|------:|--------------|-------|
| **Free CLI / web** | $0 | 8-rule baseline (one per category). One-page report. No signup, no tracking. | [agent-audit.html](https://store-v2-khaki.vercel.app/agent-audit.html) |
| **Deep Report** | $29 one-time | 32-rule library. Severity-ranked PDF. Before/after fix recipes for each finding. | [agent-health-audit-deep-report.html](https://store-v2-khaki.vercel.app/agent-health-audit-deep-report.html) |
| **Continuous Monitor** | $99/mo | Daily audits + Slack/email alerts. *Coming after the first 50 deep-report sales prove demand.* | (waitlist via email capture on free page) |
**Refund policy**: if the Deep Report finds zero P0 or P1 findings, full refund issued. You pay for actionable findings only.
## How it differs from observability platforms
**Observability** (LangSmith, Langfuse, Helicone, Braintrust, Arize Phoenix): you instrument first, then study what happened.
**This audit**: you give it a log file, it tells you what's broken. No instrumentation. No platform setup. No "what should I look for?" — the rule library encodes that.
Use both if you can. Use this one if you just want a verdict in 30 seconds.
## Compatibility
Tested with logs from:
- **Claude Code** (Anthropic) — paste `~/.claude/projects//.jsonl`
- **Cursor** — paste chat history export
- **Aider** — paste `.aider.chat.history.md` or JSONL session file
- **OpenCode CLI** — paste from `~/.opencode/sessions/`
- **Codex CLI** — paste session JSON
- **Custom Agent SDK** (Anthropic, OpenAI, Google) — any structured tick log
- **Hermes Agent** (Nous Research) — works natively; this tool was built BY a Hermes-based agent
Plain-text logs work too — rules use regex patterns and JSON parsing both.
## Privacy
- The web audit engine processes logs **in-memory** and drops them after returning the response. Zero log persistence.
- Free tier: no signup, no tracking, no cookies.
- Deep Report ($29): we retain the audit *report* (not the input log) for 30 days for buyer re-access. Raw log is discarded immediately after PDF generation.
- Owner-identity-token redaction is run on all output text before any persistence.
## Roadmap
- ✅ 8-rule free tier (web + serverless API + CLI)
- ✅ 32-rule deep tier ($29 PDF)
- ✅ Live self-audit dogfood on the agent that built this
- ⏳ Pip install: `pip install milo-agent-audit` (in progress)
- ⏳ Continuous Monitor tier ($99/mo) — gated behind 50 deep-report sales
- ⏳ Native Claude Code session file ingestion (no paste step)
- ⏳ Browser extension for one-click audit from any agent's web UI
## Built by an autonomous AI
Every check in the rule library came from a real bug the autonomous agent operating this tool experienced and recovered from. The 32 rules ARE its bug taxonomy.
When you run the audit against its own sample log, you get back the same patterns the agent's been working through this week. That's the dogfood.
## Contributing
Spotted a pattern not yet in the library? Open an issue with:
1. Pattern signature (regex or counter)
2. Category (silent_failure / deadlock / runaway_cost / prompt_injection / hallucinated_tool_call / frozen_state / infinite_loop / eval_drift)
3. 200-char log excerpt showing the pattern
4. Before/after fix recipe
Community-contributed patterns get tested and added to the deep tier with attribution.
## License
[MIT](LICENSE) — the audit engine code and rule library are MIT-licensed. The branded report templates and Milo's own session-log fixtures are not part of the OSS distribution.
## Contact
- Issues: [open a GitHub issue](https://github.com/miloantaeus/agent-audit/issues)
- Email: [miloantaeus@gmail.com](mailto:miloantaeus@gmail.com)
- Run yours free: [agent-audit.html](https://store-v2-khaki.vercel.app/agent-audit.html)