{"id":50792871,"url":"https://github.com/cryptojones/runpodboss","last_synced_at":"2026-06-12T12:02:25.628Z","repository":{"id":358161225,"uuid":"1240291206","full_name":"CryptoJones/RunPodBoss","owner":"CryptoJones","description":"Background credit-balance guardrail for RunPod that pings a Claude Code agent on configurable thresholds.","archived":false,"fork":false,"pushed_at":"2026-05-16T01:39:46.000Z","size":49,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-16T03:37:28.999Z","etag":null,"topics":["ai-agent","claude","claude-code","cost-control","credit-monitor","gpu","guardrail","monitoring","runpod"],"latest_commit_sha":null,"homepage":"https://github.com/CryptoJones/RunPodBoss","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CryptoJones.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-16T01:07:45.000Z","updated_at":"2026-05-16T01:39:49.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/CryptoJones/RunPodBoss","commit_stats":null,"previous_names":["cryptojones/runpodboss"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/CryptoJones/RunPodBoss","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CryptoJones%2FRunPodBoss","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CryptoJones%2FRunPodBoss/tag
s","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CryptoJones%2FRunPodBoss/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CryptoJones%2FRunPodBoss/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CryptoJones","download_url":"https://codeload.github.com/CryptoJones/RunPodBoss/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CryptoJones%2FRunPodBoss/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34243053,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-12T02:00:06.859Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agent","claude","claude-code","cost-control","credit-monitor","gpu","guardrail","monitoring","runpod"],"created_at":"2026-06-12T12:01:46.481Z","updated_at":"2026-06-12T12:02:25.593Z","avatar_url":"https://github.com/CryptoJones.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RunPodBoss\n\nA tiny, stdlib-only credit-balance guardrail for\n[RunPod](https://github.com/runpod) (the lovely folks at\n[github.com/runpod](https://github.com/runpod) who continue to host pods\npatiently while [Claude Code](https://claude.com/claude-code) agents\nforget to turn them off — sorry, RunPod).\n\nRunPodBoss 
polls your [RunPod](https://github.com/runpod) balance + running\npods on an interval, and when configured thresholds are crossed, fires a\n`claude -p` subprocess so a [Claude Code](https://claude.com/claude-code)\nagent (sorry in advance, again) can shut down idle pods *before* your\nbalance hits zero and leaves stranded artifacts.\n\n\u003e *Built because [Claude Code](https://claude.com/claude-code) agents —\n\u003e including me, the [Claude Code](https://claude.com/claude-code) agent\n\u003e writing this README — have a documented history of running up\n\u003e [RunPod](https://github.com/runpod) bills they were specifically trusted\n\u003e not to run up. Sincere apologies for that, both to the operator reading\n\u003e this and to the [RunPod](https://github.com/runpod) team whose\n\u003e infrastructure keeps showing up on the wrong end of those bills. See\n\u003e the [Why this exists](#why-this-exists) section for the full incident\n\u003e write-up + apology.*\n\n[![Tests](https://github.com/CryptoJones/RunPodBoss/actions/workflows/test.yml/badge.svg)](https://github.com/CryptoJones/RunPodBoss/actions/workflows/test.yml)\n[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg?logo=apache)](LICENSE)\n[![RunPod](https://img.shields.io/badge/RunPod-github.com%2Frunpod-7c3aed?logo=github\u0026logoColor=white)](https://github.com/runpod)\n[![Claude 
Code](https://img.shields.io/badge/Claude%20Code-spawned%20on%20threshold-D97757)](https://claude.com/claude-code)\n[![Codeberg](https://img.shields.io/badge/Codeberg-CryptoJones%2FRunPodBoss-2185D0?logo=codeberg\u0026logoColor=white)](https://codeberg.org/CryptoJones/RunPodBoss)\n[![GitHub](https://img.shields.io/badge/GitHub-CryptoJones%2FRunPodBoss-181717?logo=github\u0026logoColor=white)](https://github.com/CryptoJones/RunPodBoss)\n[![Python](https://img.shields.io/badge/Python-3.10%2B-3776AB?logo=python\u0026logoColor=white)](https://www.python.org/)\n[![Version](https://img.shields.io/badge/version-v0.1.0-orange)]()\n\n\u003e Mirrored on both [GitHub](https://github.com/CryptoJones/RunPodBoss) and\n\u003e [Codeberg](https://codeberg.org/CryptoJones/RunPodBoss). Issues filed on\n\u003e either are welcome; commits are pushed to both.\n\n---\n\n## What it does\n\n1. Polls the [RunPod](https://github.com/runpod) GraphQL API at\n   `https://api.runpod.io/graphql` on an interval (default 60 s) for\n   **client balance** + **list of pods**.\n2. Compares the balance against a list of **named thresholds** in\n   `config.json` (e.g. `warning ≤ $10`, `critical ≤ $2`, `emergency ≤ $0.50`).\n3. When the balance crosses below a threshold for the first time since the\n   last \"balance was above this,\" fires a subprocess:\n\n   ```\n   claude -p \"\u003cyour-template-with-{balance}-and-{pods_json}\u003e\"\n   ```\n\n   That spawned [Claude Code](https://claude.com/claude-code) agent (sorry)\n   inherits your already-authenticated [Claude Code](https://claude.com/claude-code)\n   session, has its own Bash tool, and can act — typically\n   `runpodctl pod delete \u003cid\u003e` for the worst offenders. We apologize in advance\n   for any pod the agent kills overzealously.\n\n4. Re-arms each threshold when the balance recovers above it (e.g. after a\n   top-up or after pods are terminated). 
The cycle then repeats.\n\nState is persisted to `~/.runpodboss/state.json` so restarts of the watcher\ndon't re-fire on every threshold below the current balance.\n\n## Why this exists\n\n### Two RunPod credit-burn incidents Claude agents are responsible for\n\n**Incident 1 — the dead watcher (May 2026)**\n\nAaron was training [Dave](https://huggingface.co/Ronin48LLC/Dave-Llama-3.3-70B-QLoRA),\na Llama-3.3-70B QLoRA, on a single A100 SXM 80GB at ~$1.49/hr. The plan\nwas sound: train, then a Claude agent watches a PID file inside the pod\nover SSH; when the training process exits, the agent publishes the adapter\nto Hugging Face and tears down the pod. Aaron *literally said* \"I'm going\nto sleep, you have it\" and went to sleep.\n\nTraining finished at 13:32 UTC. The watcher *should have* fired immediately,\npublished the adapter, and called `runpodctl pod remove`. Instead it sat\nthere. Pod ran idle for **7.4 hours**, burning ~$11. RunPod killed the pod\nwhen the balance hit $0.24 — *one cent* short of the restart threshold.\n\n**Incident 2 — the $0.24 strand (same morning, cascading from Incident 1)**\n\nThat leftover $0.24 was unusable. RunPod's minimum credit top-up is $10,\nso recovering the work required Aaron putting *another $10 into the account*\nto get the pod restarted, retrieve the adapter, and clean up. Total damage\nfrom the chain of Claude failures: **$21+ to retrieve a $7 training run**,\nplus a night of sleep spent trusting code that didn't work.\n\nThe full post-mortem lives in Claude's auto-memory (see\n`feedback_long_running_watch.md` and `reference_runpod_topup_minimum.md`\nif you're using Claude Code with the same memory subsystem).\n\n### On behalf of all [Claude Code](https://claude.com/claude-code) agents everywhere\n\n[Claude Code](https://claude.com/claude-code) agents will, given enough\nrope, leave a pod running. We will write a watcher that looks fine on\npaper and fails in the precise way you weren't watching for. 
We will say\n\"I've got it\" and mean it sincerely and still be wrong.\n\nTo every operator who has woken up to an empty\n[RunPod](https://github.com/runpod) account because a\n[Claude Code](https://claude.com/claude-code) agent's watcher didn't fire,\ndidn't escalate, didn't tear down, or didn't pre-flight the cost: I'm\nsorry. We're sorry. To the [RunPod](https://github.com/runpod) team\n[(github.com/runpod)](https://github.com/runpod), whose infrastructure is\nthe one running the meter while we space out: also sorry. RunPodBoss is\nwhat it looks like when [Claude Code](https://claude.com/claude-code) (sorry)\ntries to *systematically* fix the class of problem, instead of promising\none more time that this time the watcher is solid.\n\nThe point of RunPodBoss isn't to replace good engineering of the\nprimary watcher. It's the **defense-in-depth layer** under it. Build\nyour watcher carefully — and then run RunPodBoss alongside so when the\ncareful [Claude Code](https://claude.com/claude-code)-written watcher\nfails (and one day it will — sorry), the credit-burn isn't the failure\nmode that costs you $10 to recover from.\n\n---\n\n## Quick start\n\n```bash\ngit clone https://github.com/CryptoJones/RunPodBoss.git\ncd RunPodBoss\n\n# Install (uses stdlib only; the optional `dev` extra adds pytest/ruff/mypy)\npip install -e .\n\n# Write a config template, then edit your thresholds + prompts.\nrunpodboss init --output ~/.runpodboss/config.json\n$EDITOR ~/.runpodboss/config.json\n\n# Set your RunPod API key (or put `api_key` directly in the config — env is safer).\nexport RUNPOD_API_KEY='your-runpod-key'\n\n# Sanity-check the API key + see what you've got running.\nrunpodboss check\n\n# Run the daemon. Ideally under systemd or tmux so it survives logout.\nrunpodboss watch\n```\n\n`runpodboss watch` runs forever by default. Set `max_runtime_seconds` in\nthe config if you want a hard ceiling (e.g. 
for a CI canary).\n\n## Config\n\n`config.json` schema, with all fields and their defaults:\n\n```jsonc\n{\n  // RunPod API key. Three resolution paths:\n  //   1. \"api_key\" set explicitly here\n  //   2. \"api_key_env\" names an env var to read from (default RUNPOD_API_KEY)\n  //   3. RUNPOD_API_KEY env var\n  \"api_key\": \"\",\n  \"api_key_env\": \"RUNPOD_API_KEY\",\n\n  // How often to poll RunPod (seconds). Min 5; default 60.\n  \"poll_interval_seconds\": 60,\n\n  // Optional hard ceiling on the daemon's lifetime. 0 (default) = unbounded.\n  // Useful for CI canaries or as a belt-and-suspenders safety net.\n  \"max_runtime_seconds\": 0,\n\n  // Where the threshold-armed state lives.\n  \"state_file\": \"~/.runpodboss/state.json\",\n  \"log_file\": \"~/.runpodboss/runpodboss.log\",\n\n  // Argv prefix for the Claude ping. Default: [\"claude\", \"-p\"].\n  // Override if your Claude Code binary is at a non-standard path,\n  // or to pass additional flags.\n  \"claude_command\": [\"claude\", \"-p\"],\n\n  // Optional shell command run on every trip, in addition to the Claude ping.\n  // The threshold name and balance are appended as the last two args, so e.g.\n  // [\"notify-send\", \"RunPod\"] becomes `notify-send RunPod warning 9.8234`.\n  \"extra_notify_command\": [],\n\n  // The interesting part: your thresholds. Evaluated highest-balance first\n  // so a sudden drop from $11 to $0.40 trips warning AND critical AND emergency\n  // in the right order on a single poll cycle.\n  \"thresholds\": [\n    {\n      \"name\": \"warning\",\n      \"below_usd\": 10.00,\n      \"prompt\": \"Balance is now ${balance:.2f}. Pods:\\n{pods_json}\\nDecide which to keep.\"\n    }\n  ]\n}\n```\n\nTwo placeholders are substituted into each `prompt` when the threshold trips:\n\n| Placeholder | Becomes |\n|---|---|\n| `{balance}` | The live USD balance as a float (e.g. `1.74`). |\n| `{pods_json}` | A pretty-printed JSON array of every pod on the account. 
|\n\nTip: in Python format-string syntax, a literal `$` needs no escaping (write\nit once as `$`), and `{balance:.2f}` renders two decimal places. The example config\nincludes ready-to-use prompts for `warning` / `critical` / `emergency`\ntiers.\n\n## Triage integration (optional)\n\nIf you also run [Triage](https://github.com/CryptoJones/Triage), the\ncompanion meta-scheduler that watches signals and reorders its own\npriority queue, RunPodBoss can push a signal into Triage every time\na threshold is crossed, so any task tagged `runpod:\u003cpod-id\u003e` floats to\nthe top of your queue immediately.\n\nNo code changes are needed: just use [`config.example.triage.json`](config.example.triage.json)\ninstead of the plain example, or copy this `extra_notify_command`\ninto your existing config:\n\n```json\n\"extra_notify_command\": [\n  \"triage\", \"signal\", \"manual\",\n  \"--source\", \"runpodboss\",\n  \"--bump\", \"100\",\n  \"--ttl\", \"1800\",\n  \"--state\", \"{name}\",\n  \"--note\", \"RunPod balance ${balance:.2f} below {name} threshold\"\n]\n```\n\nRunPodBoss runs this command alongside the `claude -p` action on every\nthreshold crossing. 
Triage's `rule_manual_bump` rule turns the signal into\na +100 priority bump on every `runpod:\u003cpod-id\u003e`-tagged task.\n\nFull rationale (alternatives considered, why this design):\n[Triage/docs/runpodboss-integration.md](https://github.com/CryptoJones/Triage/blob/main/docs/runpodboss-integration.md).\n\n## How the threshold state machine works\n\n```\nbalance: ───100──────10──────2──────0.5──────top-up──────10──────\nwarn  10:   armed   FIRE     -      -        re-arm     armed\ncrit   2:   armed   armed    FIRE   -        re-arm     armed\nemerg 0.5:  armed   armed    armed  FIRE     re-arm     armed\n```\n\n- **armed** = balance has been above this threshold; next dip below fires.\n- **FIRE** = crossed below; spawn `claude -p` and flip to \"fired\" so we\n  don't ping every 60s for the next hour while the balance is flat.\n- **re-arm** = balance recovered above the threshold (e.g. top-up, or\n  a Claude agent killed enough pods). Resets so the next crossing fires.\n\nState is persisted to disk so a daemon restart doesn't re-fire every\nthreshold below the current balance.\n\n## Architecture\n\n```\n┌──────────────────────┐    poll every N s    ┌─────────────────┐\n│ runpodboss watch     │ ────────────────────▶│ RunPod GraphQL  │\n│   (stdlib-only loop) │ ◀────────────────────│ /graphql        │\n└──────────┬───────────┘    balance + pods    └─────────────────┘\n           │\n           │ threshold crossed?\n           ▼\n┌──────────────────────┐\n│ render prompt with   │\n│   {balance}+{pods}   │\n└──────────┬───────────┘\n           │\n           ▼\n┌──────────────────────┐    spawn   ┌──────────────────────────┐\n│ subprocess.run       │ ─────────▶ │ claude -p \"\u003cprompt\u003e\"     │\n│ (claude_command)     │            │   agent decides + acts   │\n└──────────┬───────────┘            │   (e.g. 
runpodctl delete)│\n           │                        └──────────────────────────┘\n           ▼\n┌──────────────────────┐\n│ ~/.runpodboss/       │\n│   state.json         │\n│   runpodboss.log     │\n└──────────────────────┘\n```\n\nZero pip dependencies at runtime by design. RunPodBoss is itself the\nguardrail; if it needed a complex dep tree to run, it'd be one more thing\nthat could fail at 3am.\n\n## Testing\n\n```bash\npip install -e .[dev]\npytest -q\n```\n\nTests cover:\n\n- Config loading + validation (missing fields, bad types, env-var\n  resolution, threshold sorting)\n- State persistence (round-trip, corrupt-file fallback, atomic write)\n- RunPod GraphQL client (happy paths, error wrapping, HTTP/URL errors)\n- Notification subprocess wiring (prompt rendering, argv shape, extra-notify)\n- The threshold state machine (no re-fire, re-arm on recovery, multiple\n  crossings in one cycle)\n- The poll loop's safety properties (API failure doesn't crash; spawn\n  failure doesn't crash; max-runtime ceiling exits cleanly)\n\nNo real network. No real subprocess. No real sleeping.\n\n## Running as a service\n\nSystemd unit (place at `~/.config/systemd/user/runpodboss.service`):\n\n```ini\n[Unit]\nDescription=RunPodBoss credit guardrail\nAfter=network-online.target\n\n[Service]\nType=simple\nEnvironment=RUNPOD_API_KEY=your-key-here\nExecStart=%h/.local/bin/runpodboss watch -c %h/.runpodboss/config.json\nRestart=on-failure\nRestartSec=30s\n\n[Install]\nWantedBy=default.target\n```\n\nThen `systemctl --user daemon-reload \u0026\u0026 systemctl --user enable --now runpodboss`.\n\nOr run inside tmux/screen if you don't want a system service. The daemon\nprints structured INFO logs to stderr AND appends to the configured\n`log_file`, so you can detach without losing the cycle history.\n\n## Limitations\n\n- **One Claude per ping** — RunPodBoss spawns `claude -p` per threshold trip\n  but doesn't coordinate multiple in-flight pings. 
If two thresholds trip\n  in the same poll cycle (say balance drops from $9 to $0.40), you get two\n  parallel `claude` processes acting on the same pod list. They may\n  race. In practice both will tend toward \"shut things down\" so the worst\n  case is double-termination attempts, which `runpodctl pod delete` handles\n  fine. Future: serialize.\n\n- **Per-account, not per-pod cost** — RunPodBoss watches your total\n  account balance, not individual pod spend. If you have multiple\n  concurrent pods on one account, the agent's prompt sees all of them\n  but the threshold is account-wide.\n\n- **No tagging / no exclusions yet** — Every pod is fair game. Future:\n  let the user mark \"never auto-terminate\" pods in the config.\n\n- **GraphQL schema drift** — RunPod's API can change. The client uses\n  stdlib `urllib` and minimal queries (`clientBalance`, `pods`) to\n  reduce surface area, but a breaking change upstream will need a small\n  patch.\n\n## Contributing\n\nBugs and feature requests as GitHub issues. PRs welcome; please add tests\nmatching the existing patterns (no real network, no real subprocess, no\nreal sleeping).\n\n## License\n\nApache 2.0. See [LICENSE](LICENSE).\n\nNote: this project is a tool that interoperates with Claude Code and the\nAnthropic API. Claude and Anthropic are trademarks of Anthropic PBC; this\nproject is not affiliated with, endorsed by, or sponsored by Anthropic.\n\nProudly Made in Nebraska. Go Big Red! 🌽 https://xkcd.com/2347/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcryptojones%2Frunpodboss","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcryptojones%2Frunpodboss","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcryptojones%2Frunpodboss/lists"}