{"id":50235582,"url":"https://github.com/palios-taey/claude-code-api-watchdog","last_synced_at":"2026-05-26T19:04:17.087Z","repository":{"id":360326405,"uuid":"1249585797","full_name":"palios-taey/claude-code-api-watchdog","owner":"palios-taey","description":"Auto-recover Claude Code sessions stuck on transient API errors. Outer-loop watchdog for unattended tmux Claude Code with usage-limit discrimination, exponential backoff, and 10-retry cap. Single file, no third-party deps.","archived":false,"fork":false,"pushed_at":"2026-05-25T23:30:30.000Z","size":192,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-26T01:23:21.702Z","etag":null,"topics":["agent","anthropic","automation","claude-code","exponential-backoff","python","retry","tmux","unattended","watchdog"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/palios-taey.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-25T21:24:21.000Z","updated_at":"2026-05-25T23:30:33.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/palios-taey/claude-code-api-watchdog","commit_stats":null,"previous_names":["palios-taey/claude-code-api-watchdog"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/palios-taey/claude-code-api-watchdog","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/palios-taey%2Fclaude-code-api-watchdog","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/palios-taey%2Fclaude-code-api-watchdog/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/palios-taey%2Fclaude-code-api-watchdog/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/palios-taey%2Fclaude-code-api-watchdog/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/palios-taey","download_url":"https://codeload.github.com/palios-taey/claude-code-api-watchdog/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/palios-taey%2Fclaude-code-api-watchdog/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33534592,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"ssl_error","status_checked_at":"2026-05-26T15:22:15.568Z","response_time":63,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","anthropic","automation","claude-code","exponential-backoff","python","retry","tmux","unattended","watchdog"],"created_at":"2026-05-26T19:04:16.192Z","updated_at":"2026-05-26T19:04:17.072Z","avatar_url":"https://github.com/palios-taey.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# claude-code-api-watchdog\n\nAuto-recover Claude Code sessions that get stuck on **transient API errors**.\n\n## The problem\n\nIf you run Claude Code in automation — overnight loops, agent fleets, unattended\ntasks — you've hit this: the session dies at 2am because the Anthropic API\nhiccupped for a few seconds. A `529 overloaded`, a transient `429` that is\nexplicitly *not* your usage limit, a `500`, a connection reset. The TUI stops at\nthe prompt waiting for a human to type \"Continue\". Your loop is dead until you\nwake up.\n\nThis watchdog watches your Claude Code tmux sessions and types `Continue` for\nyou when it detects a *transient*-error state at the prompt — a heuristic that\ndeliberately requires the error to persist across two consecutive polls before\nacting, and that can in principle misfire on a pane *displaying* error text\n(see \"How detection works (and its limits)\" below). It distinguishes transient\nerrors from real usage limits (which it leaves alone, because hammering a usage\nlimit just wastes attempts) and backs off exponentially.\n\n\u003e **Point this at UNATTENDED automation sessions — not the interactive session\n\u003e you're actively working in.** Detection is pane-scrape; the safe scope is\n\u003e \"panes where a human isn't typing.\" Run it with `--dry-run` first to watch\n\u003e what it *would* do before you trust it live.\n\n## What it does\n\n- Polls each named tmux session every N seconds (`tmux capture-pane`)\n- Detects a stuck transient-API-error state at the prompt\n- Requires the transient-error state on **two consecutive polls** before acting\n  (debounce against single-frame redraws and momentarily-displayed error text)\n- Injects `Continue` with **exponential backoff** (2s → 4s → … → 120s cap) and a\n  **10-attempt cap**, then escalates and stops (no infinite hammering within an\n  error episode; a session that flaps healthy↔error re-arms per episode)\n- Pre-clears the input line before typing `Continue`, so a misfire can't append\n  to a half-typed human instruction\n- Resets the moment the error clears, so the next error starts fresh\n- Leaves **real usage limits** alone (detects the \"resets at X\" / \"Rate limit\n  reached\" state and waits instead of spamming)\n- Dismisses the \"How is Claude doing?\" feedback overlay if it blocks the prompt\n- **Auto-restart of a dead session is OFF by default** (escalate-only). Opt in\n  with `--resume-cmd`. Use `--no-restart` to force nudge-only for specific\n  sessions even when you've enabled restart globally (e.g. anything that posts,\n  sends, or pays — a resume could re-fire the last action)\n- `--dry-run` mode logs every keystroke it *would* send without sending one\n\n### Disabling\n\nTo stop the watchdog, stop the process: `Ctrl-C` if you ran it in a terminal,\n`systemctl --user stop claude-code-api-watchdog` if you ran it as a service.\nThere is no \"monitor without injecting\" runtime flag — the watchdog's value\nis the `Continue` injection; if you don't want injection, don't run it.\n`--dry-run` is for validating what it *would* do during initial trust-building,\nnot for ongoing production observability.\n\n## Requirements\n\n- Claude Code running inside **named tmux sessions**\n- Python 3.10+\n- `tmux` and `pstree` on `PATH` (used for pane capture, send-keys, and\n  process-tree liveness checks). `pstree` is in `psmisc` on Debian/Ubuntu and\n  is preinstalled on most server distros; install it explicitly if missing.\n- No third-party Python deps. (Escalation is just an external command you\n  provide — wire it to whatever notifier you like.)\n\n## Install\n\n```bash\ncurl -O https://raw.githubusercontent.com/palios-taey/claude-code-api-watchdog/main/watchdog.py\n# or clone the repo\ngit clone https://github.com/palios-taey/claude-code-api-watchdog.git\n```\n\nIt's a single file. Copy it anywhere.\n\n## Run\n\n```bash\n# FIRST: dry-run. Watches your sessions and logs every keystroke it WOULD\n# send, without sending any. Run this for a while and confirm it only \"acts\"\n# on genuinely stuck sessions before you trust it live.\npython3 watchdog.py --sessions mybot,worker1,worker2 --dry-run\n\n# live (escalate-only on dead processes by default)\npython3 watchdog.py --sessions mybot,worker1,worker2\n\n# with everything (incl. opt-in auto-restart of dead sessions)\npython3 watchdog.py \\\n    --sessions mybot,worker1,worker2 \\\n    --interval 30 \\\n    --no-restart mybot \\\n    --resume-cmd \"claude --resume latest --dangerously-skip-permissions\" \\\n    --escalate-cmd \"/usr/local/bin/notify-me\"\n```\n\nOr configure entirely by environment (CLI flags win):\n\n```bash\nexport CCW_SESSIONS=mybot,worker1,worker2\nexport CCW_INTERVAL=30\nexport CCW_MAX_ATTEMPTS=10\nexport CCW_NO_RESTART=mybot\npython3 watchdog.py\n```\n\n### Run it as a service\n\nsystemd user unit (`~/.config/systemd/user/claude-code-api-watchdog.service`):\n\n```ini\n[Unit]\nDescription=claude-code-api-watchdog\nAfter=default.target\n\n[Service]\nExecStart=/usr/bin/python3 /path/to/watchdog.py\nEnvironment=CCW_SESSIONS=mybot,worker1,worker2\nRestart=on-failure\n\n[Install]\nWantedBy=default.target\n```\n\n```bash\nsystemctl --user enable --now claude-code-api-watchdog\n```\n\n## Configuration reference\n\n| Flag | Env | Default | Meaning |\n|---|---|---|---|\n| `--sessions` | `CCW_SESSIONS` | (required) | comma-separated tmux session names |\n| `--dry-run` | `CCW_DRY_RUN` | off | log keystrokes it WOULD send; send nothing. Run this first. |\n| `--interval` | `CCW_INTERVAL` | `30` | poll seconds |\n| `--no-restart` | `CCW_NO_RESTART` | (none) | sessions to nudge-only, never auto-restart |\n| `--resume-cmd` | `CCW_RESUME_CMD` | **(empty = escalate-only)** | how to relaunch a dead session. Empty default never auto-restarts; opt in explicitly (e.g. `claude --resume latest --dangerously-skip-permissions`). |\n| `--escalate-cmd` | `CCW_ESCALATE_CMD` | (none) | command run on escalation; receives the message as one arg |\n| — | `CCW_CONFIRM_POLLS` | `2` | consecutive transient-error polls required before acting |\n| — | `CCW_MAX_ATTEMPTS` | `10` | Continue attempts before escalating + halting |\n| — | `CCW_BACKOFF_BASE` | `2` | backoff base seconds |\n| — | `CCW_BACKOFF_CAP` | `120` | backoff cap seconds |\n| — | `CCW_PROXIMITY` | `20` | error must be within N lines of the prompt to count |\n| — | `CCW_DEAD_THRESHOLD` | `300` | seconds without a Claude process before restart |\n\n## How detection works (and its limits)\n\nDetection is **pane-scrape**: the watchdog reads the bottom ~50 lines of the\ntmux pane and looks for a transient-error pattern within `CCW_PROXIMITY` lines of\nthe prompt marker. It is deliberately conservative on several axes:\n\n- The patterns are anchored to Claude Code's `API Error:` rendering wherever\n  possible, so a developer who merely has `529` / `overloaded` / `502` /\n  `ECONNRESET` on screen (i.e. anyone writing or debugging HTTP error handling —\n  this tool's exact audience) does **not** trip it.\n- An error that has scrolled past `CCW_PROXIMITY` lines above the prompt is\n  ignored.\n- The state must persist across `CCW_CONFIRM_POLLS` (default 2) consecutive polls\n  before any keystroke is sent — a single-frame redraw or a momentarily-shown\n  error won't act.\n- Before typing `Continue`, the input line is cleared (`Ctrl-U`) so a misfire\n  can't append to a half-typed human instruction.\n\nCaveats, stated plainly:\n- It requires tmux. No tmux, no watchdog.\n- Pane-scrape is a heuristic. Point it at **unattended automation sessions**,\n  not the interactive session you're working in. The pattern list, proximity,\n  and confirm-polls are all tunable; they won't catch a TUI rendering this\n  tool's author never saw. Run `--dry-run` first. PRs welcome.\n- The `Continue` submit chain (clear line → literal `Continue` → Enter →\n  Kitty-protocol CSI-u Enter) is what reliably submits across Claude Code's Ink\n  TUI states. If a future Claude Code changes its input handling, this may need\n  updating.\n- A \"busy/working-spinner\" guard is shipped: panes showing active-generation\n  markers (`esc to interrupt`) are classified healthy and never injected into,\n  even if stale error text lingers in nearby scrollback from a just-recovered\n  failure. The two-poll confirmation provides additional debounce on top of\n  this. The exact working-indicator token is version-specific to Claude Code;\n  if Anthropic changes it, this guard needs updating.\n\n## Why not just retry inside Claude Code?\n\nClaude Code does retry transient errors internally — but when the retry budget\nis exhausted, it surfaces the error to the TUI and stops. This watchdog is the\nouter loop that handles the \"it gave up and is now waiting for a human\" case.\n(See the public issue traffic on exactly this:\n[anthropics/claude-code#60577](https://github.com/anthropics/claude-code/issues/60577),\n[#50841](https://github.com/anthropics/claude-code/issues/50841),\n[#44481](https://github.com/anthropics/claude-code/issues/44481).)\n\n## Prior art \u0026 how this differs\n\nThis is not the first tool to nudge Claude Code, and it doesn't claim to be.\nOther tools either supervise Claude Code broadly or handle subscription\nusage-limit waits:\n- `claude-auto-retry` — waits out subscription rate-limit resets, then continues\n- `claude-tmux-orchestration` — a full orchestration system with an embedded\n  rate-limit watchdog\n- Claude Code \"supervisor\" tools — broader hook/triage systems\n\nThis one is intentionally the **smallest practical version of one specific\nrecovery loop**: transient API-error stalls in tmux, with usage-limit\ndiscrimination, exponential backoff, an attempt cap, and escalation — in a\nsingle dependency-free file you can read in one sitting. If a heavier\norchestration tool already fits your workflow, use that.\n\n## License\n\nApache-2.0. See `LICENSE`.\n\n## Not affiliated with Anthropic\n\nThis is an unofficial, third-party tool. It is not affiliated with, endorsed by,\nor sponsored by Anthropic. \"Claude\" and \"Claude Code\" are trademarks of\nAnthropic; they are used here only descriptively, to identify the tool this\nwatchdog monitors.\n\n## Status\n\nExtracted and generalized from a private multi-agent fleet that runs ~10 Claude\nCode instances unattended. The core recovery logic (backoff + cap + transient-vs-\nusage-limit discrimination) is what keeps that fleet alive overnight.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpalios-taey%2Fclaude-code-api-watchdog","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpalios-taey%2Fclaude-code-api-watchdog","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpalios-taey%2Fclaude-code-api-watchdog/lists"}