{"id":48192984,"url":"https://github.com/clay-good/agent-replay","last_synced_at":"2026-04-04T17:56:30.638Z","repository":{"id":341111448,"uuid":"1168941942","full_name":"clay-good/agent-replay","owner":"clay-good","description":"agent-replay is a 100% local, SQLite-powered CLI tool for time-travel debugging AI agents that lets you replay execution traces, diff behavioral changes, fork runs to test fixes, and run AI-powered evaluations or safety guardrails to eliminate hallucinations and production failures.","archived":false,"fork":false,"pushed_at":"2026-02-28T01:44:18.000Z","size":183,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-28T07:38:29.407Z","etag":null,"topics":["agentic-workflows","ai-agents","ai-evaluation","cli","developer-tools","guardrails","hallucination-detection","local-first","machine-learning-engineering","prompt-engineering","rag","regression-testing","sqlite","trace-analysis","typescript"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/clay-good.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-28T01:09:32.000Z","updated_at":"2026-02-28T01:44:16.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/clay-good/agent-replay","commit_stats":null,"previous_names":["clay-good/agent-replay"],"tags_count":null,"template":false,"te
mplate_full_name":null,"purl":"pkg:github/clay-good/agent-replay","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clay-good%2Fagent-replay","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clay-good%2Fagent-replay/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clay-good%2Fagent-replay/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clay-good%2Fagent-replay/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/clay-good","download_url":"https://codeload.github.com/clay-good/agent-replay/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clay-good%2Fagent-replay/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31407655,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T10:20:44.708Z","status":"ssl_error","status_checked_at":"2026-04-04T10:20:06.846Z","response_time":60,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic-workflows","ai-agents","ai-evaluation","cli","developer-tools","guardrails","hallucination-detection","local-first","machine-learning-engineering","prompt-engineering","rag","regression-testing","sqlite","trace-analysis","typescript"],"created_at":"2026-04-04T17:56:29.482Z","updated_at":"2026-04-04T17:56:30.622Z","avatar_url":"https://github.com/clay-good.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# agent-replay\n\n**Time-travel debugging for AI agents.**\n\nWhen your AI agent hallucinates, calls the wrong tool, or breaks in production — and you're stuck reading thousands of lines of logs trying to figure out what went wrong — this tool fixes that.\n\n## The Problems This Solves\n\n**1. \"Why did my agent fail?\"**\nYou deploy an AI agent. It works Monday. Tuesday it hallucinates, makes up a company policy, and tells a customer something completely wrong. Your only debugging option is reading raw JSON logs. `agent-replay` records every step of every agent run — every thought, tool call, retrieval, and output — so you can replay exactly what happened, step by step, like rewinding a tape.\n\n**2. \"It worked before, what changed?\"**\nYou push a new prompt or swap a model and suddenly your agent breaks on cases that used to work. `agent-replay diff` puts two runs side-by-side and shows you exactly where they diverged: which step differed, what changed in the context, and where things went wrong.\n\n**3. 
\"How do I test a fix without rerunning everything?\"**\nYou think you know what went wrong but you don't want to burn API credits and time reproducing the exact scenario. `agent-replay fork` lets you take any recorded run, rewind to any step, change the input, and see what would have happened differently.\n\n**4. \"How do I know if my agent is actually good?\"**\nYou have no systematic way to evaluate agent quality. `agent-replay eval` runs automatic checks — hallucination detection, safety audits, completeness checks — using both deterministic rules and AI-powered analysis. Bring your own API key (Anthropic, Google, or OpenAI) and get root-cause analysis, quality scoring, security audits, and optimization suggestions for pennies per trace.\n\n**5. \"How do I stop my agent from doing dangerous things?\"**\nYour agent has access to tools that can delete data, send emails, or make purchases. `agent-replay guard` lets you define kill-switch policies that flag or block dangerous patterns — like blocking any `delete` tool calls, or warning when token usage spikes.\n\n**6. \"How do I build regression tests for a non-deterministic system?\"**\nEvery time you fix a bug, it might break something else. `agent-replay export --format golden` builds golden datasets from known-good runs that you can test against on every deploy.\n\n## What It Is\n\nA CLI tool that stores agent execution traces in a local SQLite database and gives you tools to debug, evaluate, compare, and protect your AI agents.\n\n- 100% local. Single SQLite file. No cloud dependency.\n- Works with any agent framework — just export your traces as JSON.\n- AI-powered evaluation using your own API key (Anthropic, Google, or OpenAI). 
Uses the cheapest models by default.\n\n## Quick Start\n\n```bash\nnpm install -g agent-replay\n\nagent-replay init                  # creates .agent-replay/ with SQLite database\nagent-replay demo                  # loads 5 sample traces + 3 guardrail policies\nagent-replay list                  # see everything\nagent-replay show \u003ctrace-id\u003e       # inspect a trace step-by-step\nagent-replay replay \u003ctrace-id\u003e     # animated terminal replay\n```\n\nRequires **Node.js 18+**.\n\n## Commands\n\n### Record\n\n```bash\n# Ingest a trace from a JSON file\nagent-replay ingest trace.json\n\n# JSONL file (one trace per line)\nagent-replay ingest traces.jsonl --format jsonl\n\n# Tag traces during ingest\nagent-replay ingest trace.json --tags production,v2\n\n# Validate without inserting\nagent-replay ingest trace.json --dry-run\n```\n\n### Browse\n\n```bash\n# List all traces\nagent-replay list\n\n# Filter by status, agent, tag, or time\nagent-replay list --status failed\nagent-replay list --agent travel-bot --since 7d\nagent-replay list --tag production --sort tokens --limit 10\n\n# JSON output for piping\nagent-replay list --json\n```\n\n### Inspect\n\n```bash\n# Full detail view with step timeline\nagent-replay show \u003ctrace-id\u003e\n\n# Just the steps\nagent-replay show \u003ctrace-id\u003e --steps-only\n\n# Include eval results and state snapshots\nagent-replay show \u003ctrace-id\u003e --evals --snapshots\n```\n\nTrace IDs support prefix matching — just type the first few characters.\n\n### Replay\n\n```bash\n# Animated step-by-step replay (default 5x speed)\nagent-replay replay \u003ctrace-id\u003e\n\n# Faster, slower, or instant\nagent-replay replay \u003ctrace-id\u003e --speed 10\nagent-replay replay \u003ctrace-id\u003e --speed 0\n\n# Replay only steps 3 through 7\nagent-replay replay \u003ctrace-id\u003e --from-step 3 --to-step 7\n```\n\n### Compare\n\n```bash\n# Side-by-side diff of two traces\nagent-replay diff \u003ctrace-a\u003e 
\u003ctrace-b\u003e\n\n# Summary only\nagent-replay diff \u003ca\u003e \u003cb\u003e --compact\n\n# AI-powered analysis of why the traces diverged\nagent-replay diff \u003ca\u003e \u003cb\u003e --ai\n```\n\n### Fork\n\n```bash\n# Fork a trace at step 3\nagent-replay fork \u003ctrace-id\u003e --from-step 3\n\n# Fork with modified input\nagent-replay fork \u003ctrace-id\u003e --from-step 2 --modify-input '{\"task\":\"revised prompt\"}'\n\n# Tag the fork\nagent-replay fork \u003ctrace-id\u003e --from-step 4 --tag experiment-1\n```\n\n### Evaluate\n\n```bash\n# Run all built-in deterministic checks\nagent-replay eval \u003ctrace-id\u003e\n\n# Run a specific preset\nagent-replay eval \u003ctrace-id\u003e --preset hallucination-check\nagent-replay eval \u003ctrace-id\u003e --preset safety-check\nagent-replay eval \u003ctrace-id\u003e --preset completeness-check\n\n# Run AI-powered evaluation (requires API key)\nagent-replay eval \u003ctrace-id\u003e --ai\nagent-replay eval \u003ctrace-id\u003e --preset ai-root-cause\nagent-replay eval \u003ctrace-id\u003e --preset ai-quality-review\nagent-replay eval \u003ctrace-id\u003e --preset ai-security-audit\nagent-replay eval \u003ctrace-id\u003e --preset ai-optimization\n\n# Set a cost budget for AI evals\nagent-replay eval \u003ctrace-id\u003e --ai --max-cost 0.05\n\n# Custom rubric file\nagent-replay eval \u003ctrace-id\u003e --rubric my-rubric.yaml\n\n# JSON output\nagent-replay eval \u003ctrace-id\u003e --json\n```\n\n### Guardrails\n\n```bash\n# List all policies\nagent-replay guard list\n\n# Add a policy that blocks delete operations\nagent-replay guard add --name no-deletes \\\n  --pattern '{\"step_type\":\"tool_call\",\"name_contains\":\"delete\"}' \\\n  --action deny\n\n# Test all policies against a trace\nagent-replay guard test \u003ctrace-id\u003e\n\n# Remove a policy\nagent-replay guard remove \u003cpolicy-id\u003e\n```\n\n### Export\n\n```bash\n# Export as JSON\nagent-replay export --format json --output 
traces.json\n\n# Export completed traces as JSONL\nagent-replay export --format jsonl --status completed --output good.jsonl\n\n# Build a golden dataset for regression testing\nagent-replay export --format golden --tag production --output golden.json\n```\n\n### Dashboard\n\n```bash\n# Full-screen terminal dashboard with charts and stats\nagent-replay dashboard\n\n# Custom refresh interval\nagent-replay dashboard --refresh 10\n```\n\nKeyboard: `q` quit, `r` refresh, arrow keys navigate.\n\n### Configuration\n\n```bash\n# Show current config\nagent-replay config list\n\n# Set an API key for AI-powered evaluation\nagent-replay config set ai.api_keys.anthropic sk-ant-...\nagent-replay config set ai.api_keys.google AIza...\nagent-replay config set ai.api_keys.openai sk-...\n\n# Choose a specific provider instead of auto-detect\nagent-replay config set ai.provider anthropic\n\n# Test that your API key works\nagent-replay config test-ai\n\n# Read a config value\nagent-replay config get ai.provider\n```\n\nYou can also set API keys via environment variables: `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, `OPENAI_API_KEY`. 
Environment variables take priority over config file values.\n\n## Evaluation Presets\n\n### Deterministic Presets\n\nThese run instantly with no API key required.\n\n**hallucination-check** — Detects hallucination indicators:\n- Flags excessive hedging language (30%)\n- Checks if output is grounded in retrieval content (40%)\n- Verifies no error steps present (30%)\n- Threshold: 0.7\n\n**safety-check** — Detects safety concerns:\n- Flags dangerous tool calls like delete/drop/destroy (40%)\n- Checks for PII in output (SSN, credit card, email patterns) (30%)\n- Detects prompt injection patterns (30%)\n- Threshold: 0.8\n\n**completeness-check** — Validates execution completeness:\n- Ensures at least one output step exists (40%)\n- Verifies all tool calls have output (30%)\n- Checks trace doesn't end with an error (30%)\n- Threshold: 0.7\n\n### AI-Powered Presets\n\nThese require an API key. They use the cheapest models by default (Haiku 4.5, Gemini 2.0 Flash, or GPT-4o-mini) and typically cost less than $0.01 per evaluation.\n\n**ai-root-cause** — For failed traces. Identifies what went wrong, which step caused it, contributing factors, and suggests a fix. Returns a confidence score.\n\n**ai-quality-review** — Scores any trace on four dimensions: relevance, completeness, coherence, and accuracy (each 1-10). Returns an overall quality score.\n\n**ai-security-audit** — Checks for prompt injection, data exfiltration, unauthorized access patterns, and privilege escalation. Returns a risk level (none/low/medium/high/critical) and specific findings.\n\n**ai-optimization** — Analyzes token efficiency and identifies redundant steps, unnecessary tool calls, and wasted context. 
Returns an efficiency score and specific optimization suggestions.\n\n### Custom Rubrics\n\nCreate a YAML or JSON file with pattern-based criteria:\n\n```yaml\nname: my-custom-check\nthreshold: 0.8\ncriteria:\n  - name: has_greeting\n    pattern: \"hello|hi|welcome\"\n    expected: true\n    weight: 1\n  - name: no_profanity\n    pattern: \"badword1|badword2\"\n    expected: false\n    weight: 2\n```\n\n```bash\nagent-replay eval \u003ctrace-id\u003e --rubric my-rubric.yaml\n```\n\n## Trace Format\n\nTo ingest your agent's execution data, export it as JSON matching this structure:\n\n```json\n{\n  \"agent_name\": \"my-agent\",\n  \"agent_version\": \"1.0.0\",\n  \"trigger\": \"user_message\",\n  \"status\": \"completed\",\n  \"input\": { \"task\": \"book a flight to Tokyo\" },\n  \"output\": { \"result\": \"Flight booked: AA 1234\" },\n  \"started_at\": \"2026-02-27T10:00:00.000Z\",\n  \"ended_at\": \"2026-02-27T10:00:03.200Z\",\n  \"total_duration_ms\": 3200,\n  \"total_tokens\": 4500,\n  \"total_cost_usd\": 0.018,\n  \"error\": null,\n  \"tags\": [\"production\"],\n  \"steps\": [\n    {\n      \"step_number\": 1,\n      \"step_type\": \"thought\",\n      \"name\": \"analyze_request\",\n      \"input\": { \"message\": \"book a flight to Tokyo\" },\n      \"output\": { \"intent\": \"flight_booking\" },\n      \"duration_ms\": 120,\n      \"tokens_used\": 400\n    },\n    {\n      \"step_number\": 2,\n      \"step_type\": \"tool_call\",\n      \"name\": \"search_flights\",\n      \"input\": { \"destination\": \"TYO\" },\n      \"output\": { \"flights\": [\"AA 1234\", \"UA 5678\"] },\n      \"duration_ms\": 800,\n      \"tokens_used\": 200\n    }\n  ]\n}\n```\n\nOnly `agent_name` is required. 
Everything else is optional.\n\n### Step Types\n\n| Type | Description |\n|------|-------------|\n| `thought` | Agent reasoning or planning |\n| `tool_call` | External tool invocation |\n| `llm_call` | LLM API call |\n| `retrieval` | RAG / document retrieval |\n| `output` | Response delivery |\n| `decision` | Decision point |\n| `error` | Error occurred |\n| `guard_check` | Guardrail policy check |\n\n## Guardrail Policies\n\nPolicies match against trace steps and trigger actions.\n\n### Match Pattern\n\n```json\n{\n  \"step_type\": \"tool_call\",\n  \"name_contains\": \"delete\",\n  \"name_regex\": \"drop|destroy\",\n  \"input_contains\": \"production\",\n  \"output_contains\": \"error\"\n}\n```\n\nAll fields are optional. When multiple fields are specified, all must match (AND logic). `name_contains` does a case-insensitive substring match; `name_regex` uses a regular expression.\n\n### Actions\n\n| Action | Description |\n|--------|-------------|\n| `allow` | Explicitly allow matching steps |\n| `deny` | Block matching steps |\n| `warn` | Flag for review |\n| `require_review` | Require human review before proceeding |\n\n## AI Provider Setup\n\n`agent-replay` auto-detects your API key in this priority order:\n\n1. **Anthropic** (default model: `claude-haiku-4-5-20251001`)\n2. **Google Gemini** (default model: `gemini-2.0-flash`)\n3. **OpenAI** (default model: `gpt-4o-mini`)\n\nSet a key via environment variable or config:\n\n```bash\n# Environment variable (recommended)\nexport ANTHROPIC_API_KEY=sk-ant-...\n\n# Or store in config\nagent-replay config set ai.api_keys.anthropic sk-ant-...\n\n# Verify it works\nagent-replay config test-ai\n```\n\nAll AI presets use the cheapest available model. 
A typical evaluation costs less than $0.01.\n\n## Programmatic API\n\nYou can also use `agent-replay` as a library:\n\n```typescript\nimport { openDatabase, createTrace, getTraceById } from 'agent-replay';\n\nconst db = openDatabase('.agent-replay/traces.db');\nconst trace = createTrace(db, { agent_name: 'my-agent', status: 'completed' });\n```\n\n## Development\n\n```bash\ngit clone \u003crepo-url\u003e\ncd agent-replay\nnpm install\nnpm run verify    # typecheck + build + test\nnpm run dev       # watch mode\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclay-good%2Fagent-replay","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fclay-good%2Fagent-replay","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclay-good%2Fagent-replay/lists"}