{"id":51251940,"url":"https://github.com/avwohl/aws_watch","last_synced_at":"2026-06-29T07:32:18.537Z","repository":{"id":363161130,"uuid":"1261358475","full_name":"avwohl/aws_watch","owner":"avwohl","description":"Hourly watchdog that e-mails you about idle/wasteful AWS resources (idle EC2 instances, unattached EBS volumes, unassociated Elastic IPs, stale long-running boxes) across all regions, with live load average via SSM.","archived":false,"fork":false,"pushed_at":"2026-06-07T16:58:49.000Z","size":58,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-07T18:27:25.748Z","etag":null,"topics":["aws","boto3","cloud-cost","cloudwatch","cost-optimization","cron","devops","ec2","finops","monitoring","python","ssm"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/avwohl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-06T15:24:20.000Z","updated_at":"2026-06-07T16:58:53.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/avwohl/aws_watch","commit_stats":null,"previous_names":["avwohl/aws_watch"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/avwohl/aws_watch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avwohl%2Faws_watch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avwohl%2Faws_watch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avwohl%2Faws_watch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avwohl%2Faws_watch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/avwohl","download_url":"https://codeload.github.com/avwohl/aws_watch/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avwohl%2Faws_watch/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34918101,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-29T02:00:05.398Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","boto3","cloud-cost","cloudwatch","cost-optimization","cron","devops","ec2","finops","monitoring","python","ssm"],"created_at":"2026-06-29T07:32:17.653Z","updated_at":"2026-06-29T07:32:18.527Z","avatar_url":"https://github.com/avwohl.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# aws_watch\n\nAn hourly watchdog that scans **every** AWS region for EC2 instances, spot\nrequests, EBS volumes and Elastic IPs, reports their creation time and current\nload, and e-mails you when something looks like wasted spend — an idle instance\nleft running, an unattached volume, an unassociated Elastic IP, or a box that\nhas been up far too long.\n\nIt exists because it is easy to leave a big test instance running and quietly\nburn money. aws_watch nags you about exactly that, and stays quiet otherwise.\n\nIt can also, **optionally**, *clean up* a narrowly allowlisted class of\nthrowaway instances for you — see [Reaping orphaned\ninstances](#reaping-orphaned-instances-optional-️-destructive). That feature is\ndestructive and off by default; the core watchdog is strictly read-only.\n\n## What it reports\n\n- **Instances** — id, Name tag, type, architecture, lifecycle (spot/on-demand),\n  state, **creation (launch) time**, age, public IP, and **current load\n  average** (1/5/15-min) pulled live from the instance.\n- **Spot requests** — id, state, status code, type, creation time.\n- **Volumes** — id, state, size, type, **creation time**, age, attachment.\n- **Elastic IPs** — address, allocation id, association.\n\n## What it flags as waste\n\n- **Idle running instances** — 5-minute load-average per vCPU (or CloudWatch CPU%\n  as a fallback) below a threshold, after a startup grace period.\n- **Unattached volumes** — EBS volumes in the `available` state.\n- **Unassociated Elastic IPs** — allocated but attached to nothing (AWS bills these).\n- **Old long-running instances** — on-demand instances running longer than a\n  configurable age (default 24h).\n\nAnything in your **suppress** list is still shown in the inventory but never\ntriggers an alert — use it for resources you intend to run long-term.\n\n## How load average is measured\n\nFor each running instance aws_watch first tries **AWS Systems Manager (SSM)**,\nrunning `cat /proc/loadavg; nproc` on the box (no SSH, no inbound ports — the\ninstance just needs the SSM agent and an instance profile with\n`AmazonSSMManagedInstanceCore`). That yields a true Unix load average.\n\nIf SSM is not available for an instance, it falls back to the **CloudWatch\n`CPUUtilization`** average over the last hour. The report marks which source was\nused (`ssm` or `cw`).\n\n## When it e-mails you\n\nIt runs hourly from cron but is deliberately quiet:\n\n- **Alert mail** — sent as soon as a *new* problem appears. The same resource is\n  not re-reported more often than `renotify_hours` (default 24h), so a persistent\n  idle box does not mail you 24 times a day.\n- **Daily digest** — one full inventory e-mail per day at `digest.hour`.\n- Otherwise it does nothing but log.\n\nReports are plain text with **tab-separated** columns (no box-drawing\ncharacters) so they survive being pasted into e-mail.\n\n## Reaping orphaned instances (optional, ⚠️ DESTRUCTIVE)\n\n\u003e **⚠️ THIS TERMINATES EC2 INSTANCES.** It is the only part of aws_watch that\n\u003e deletes anything. It is **off by default**, runs as a **separate `reap`\n\u003e command** (the hourly watcher never terminates anything), and **previews by\n\u003e default** — it only destroys when you add `--apply`. Read this whole section\n\u003e before enabling it. A careless allowlist or a missing protect rule can delete\n\u003e production. When unsure, leave it disabled.\n\nAutomated build/CI flows sometimes launch a throwaway EC2 box and then crash or\nget killed before deleting it, leaving it to run (and bill) forever. The reaper\nsweeps for exactly those orphans and terminates them — **while refusing to touch\nanything that isn't on an explicit allowlist.**\n\n### The safety model\n\nAn instance is terminated **only when every one of these is true**:\n\n1. `reap.enabled: true` in `config.yaml`.\n2. Its **`Name` (or `Project`) tag matches one of `reap.name_prefixes`** — the\n   allowlist. With an empty list, **nothing is ever a candidate** and the reaper\n   is a no-op. This is the primary gate: keep it narrow (e.g. `iospharo-*`).\n   Matching is **case-sensitive** (`Iospharo-*` will not match `iospharo-…`), and\n   a bare `*` matches *everything* — never use one; the reaper warns if you do.\n3. It matches **none** of the protect rules — `protect_ids`,\n   `protect_name_globs`, `protect_regions`, `protect_zones`, the `protect_tag`\n   (default `Reap=skip`) — **and is not in your `suppress` list.** Any single\n   match spares it, even if it looks idle and old.\n4. It is older than `min_age_minutes` (grace — never reap a box still booting).\n5. It is **idle** (low load/CPU, measured exactly like the idle *alert*) **or**\n   older than `max_age_hours` (a hard cap).\n\nAnd even with all of that, **`aws_watch.py reap` only prints what it would do.**\nTermination requires `aws_watch.py reap --apply`. The `--apply` is itself refused\nunless `reap.enabled` is true *and* `reap.name_prefixes` is non-empty, so a\ndefault or half-configured install can never delete anything.\n\nA box with no load/CPU metric at all, or an undeterminable launch time, is\n**kept** — the reaper never acts on missing data.\n\n### How to use it safely\n\n```sh\n# 1. Configure the allowlist + protections in config.yaml (see config.example.yaml).\n# 2. PREVIEW — terminates nothing, shows the REAP and KEPT tables:\npython3 aws_watch.py reap\n\n# 3. Read both tables carefully. Confirm everything under REAP is genuinely\n#    disposable and nothing production is missing from KEPT.\n# 4. Only then, terminate for real:\npython3 aws_watch.py reap --apply\n\n# 5. Install it on a 15-minute cron (also retires any old reaper cron line):\n./install.sh --with-reaper\n```\n\n`reap --apply` e-mails you a summary whenever it actually terminates something\n(`reap.email_on_reap`). To exempt one specific box without editing config, tag\nit `Reap=skip` (or add it to `suppress`) — but prefer a **keep-alive lease**\n(below) for anything temporary: a tag never expires and is exactly how idle boxes\nget left running for days.\n\n\u003e **Tip:** prefer narrow `name_prefixes` and explicit `regions` over `regions:\n\u003e all`. The allowlist makes an all-region sweep safe in principle, but an\n\u003e explicit region list is one less way to be surprised.\n\n### Reaper configuration (`config.yaml`)\n\n```yaml\nreap:\n  enabled: false                 # master switch\n  name_prefixes: [\"iospharo-*\"]  # ALLOWLIST — only these can ever be reaped\n  match_tag_keys: [Name, Project]\n  regions: [us-east-2]           # [] =\u003e use the top-level `regions`\n  min_age_minutes: 30            # grace period\n  max_age_hours: 12              # hard age cap (null =\u003e idle-only)\n  idle: {enabled: true, load_per_vcpu: 0.10, cpu_percent: 5.0}\n  protect_ids: []                # never-reap instance ids\n  protect_name_globs: []         # never-reap Name globs, e.g. \"*-prod-*\"\n  protect_regions: []\n  protect_zones: []\n  protect_tag: \"Reap=skip\"       # key=value tag that exempts a box (\"\" =\u003e off)\n  delete_alarm_template: null    # e.g. \"iospharo-idle-terminate-{id}\"\n  email_on_reap: true\n  keepalive:                     # SELF-EXPIRING protection (see below)\n    enabled: true\n    stale_after_minutes: 30      # heartbeat older than this no longer protects\n    overrides_max_age: false     # the max_age_hours hard cap still wins\n    on_db_error: keep            # DB unreadable =\u003e spare everything (fail safe)\n    db: {unix_socket: /run/mysqld/mysqld.sock, user: null, database: aws_watch, table: instance_lease}\n```\n\n## Keep-alive leases\n\nA protect tag or `suppress` entry exempts a box **forever** — which is precisely\nhow a \"temporary\" box ends up running for days after its work is done. A\nkeep-alive lease fixes that by being **self-expiring**: a box is spared only\nwhile something that wants it keeps actively saying so.\n\n**The contract**\n\n1. **Register** (the creator, once, within the reaper's `min_age_minutes` grace —\n   keep that ≥ 10 min). This just lists the instance id in the `instance_lease`\n   table; it is not yet a heartbeat.\n2. **Heartbeat** — update the row's `last_beat` more often than\n   `stale_after_minutes` (e.g. every \u003c 20 min for the default 30 min window).\n   **Only something actively working should heartbeat**, and when it stops, the\n   lease must be allowed to go stale. In this project the heartbeat is sent\n   **only by an actively-working Claude**, from a Claude Code `PostToolUse` hook\n   ([`aws-lease-beat-hook.sh`](aws-lease-beat-hook.sh)) — no cron, no daemon, no\n   provisioning script ever beats. The instant the Claude finishes, beats stop.\n3. **Release** (teardown) — delete the row. Optional; a released-or-forgotten\n   lease simply goes stale on its own.\n\nThe reaper then spares any box with a **fresh** lease from *idle* reaping, and\nreaps a box whose lease is stale or absent exactly like any other orphan. The\n`max_age_hours` hard cap still applies (a stuck heartbeat can't keep a box\nforever) unless you set `overrides_max_age: true`. If the lease DB can't be read,\n`on_db_error: keep` spares everything that sweep (never reap on missing data).\n\n**Set up the table (once)**\n\n```sh\nmysql \u003c schema.sql          # creates DB `aws_watch` + table `instance_lease`\n```\n\n**The writer + the locked-down key.** Writers never touch the DB directly from a\nremote box. They run [`lease_cmd.py`](lease_cmd.py) — `register` / `beat` /\n`release` / `fresh` / `list`, with strict validation and fully-parameterized SQL\n([`lease_db.py`](lease_db.py)). To let a box (or a laptop) heartbeat without\ngiving it any other access, expose `lease_cmd.py` as an SSH **forced command** on\na dedicated key:\n\n```sh\n# on the writer:\nssh-keygen -t ed25519 -N \"\" -f ~/.ssh/aws-lease -C aws-lease-beat\n# on this host (~/.ssh/authorized_keys) — this key can do NOTHING else:\ncommand=\"/usr/bin/python3 /home/USER/src/aws_watch/lease_cmd.py\",restrict ssh-ed25519 AAAA…  aws-lease-beat\n```\n\nThen a heartbeat is just `ssh -i ~/.ssh/aws-lease HOST \"beat i-0123…\"`; a shell or\nany other command over that key is refused. The reader (`aws_watch reap`) talks\nto the DB locally via `keepalive.db` (unix_socket auth as the cron user by\ndefault — no password on disk).\n\n**Trust model \u0026 limitations.** By default the forced-command key is *shared* (the\nsame key on every box) and `lease_cmd.py` authorizes by instance-id *format*\nonly, so any key holder can `beat`/`release` any lease and `list` the registry.\nFor a single owner this is fine: a box holding the key can already terminate its\nsame-account siblings via its instance-profile IAM, so the lease path grants\nnothing new, and the reaper only ever honors a lease for an instance already on\n`reap.name_prefixes` — a lease cannot protect or resurrect an off-allowlist box.\nIf leases span mutually-distrusting projects, issue a **per-box key** and pin its\ninstance id in the forced command:\n\n```\ncommand=\"/usr/bin/python3 /home/USER/src/aws_watch/lease_cmd.py --only i-0123…\",restrict ssh-ed25519 AAAA…\n```\n\nWith `--only`, that key may register/beat/release/list only that one instance;\nthe pin is read from `authorized_keys`, not the client, so it cannot be widened.\n(Notes also pass a control-character filter, so a crafted note can't inject\nterminal escapes into `list`; and the reaper releases a box's lease row when it\nreaps it, so the table self-prunes.)\n\n## Requirements\n\n- Python 3.9+\n- `boto3` and `PyYAML` (`pip install -r requirements.txt`)\n- A local MTA for `sendmail` (e.g. postfix), **or** configure SMTP in the config.\n- AWS credentials for a read-only IAM user (below).\n\n## Quick start\n\n```sh\ngit clone \u003cthis repo\u003e aws_watch \u0026\u0026 cd aws_watch\ncp .env.example .env            # then put your AWS keys in .env\ncp config.example.yaml config.yaml   # then set email + thresholds\nchmod 600 .env\n\npython3 aws_watch.py report     # one-off: print the full inventory\npython3 aws_watch.py test-email # confirm e-mail delivery works\n./install.sh                    # install the hourly cron job\n```\n\n`install.sh` installs dependencies if needed, creates `config.yaml`/`.env` from\nthe examples if missing, and adds an idempotent hourly crontab entry.\n\n## IAM policy (least privilege, read-only)\n\nCreate a dedicated IAM user and attach this policy. Everything is read-only\nexcept `ssm:SendCommand`, which only runs the load-average probe; drop the SSM\nstatement if you prefer and aws_watch will use the CloudWatch fallback.\n\n```json\n{\n  \"Version\": \"2012-10-17\",\n  \"Statement\": [\n    {\n      \"Sid\": \"Inventory\",\n      \"Effect\": \"Allow\",\n      \"Action\": [\n        \"ec2:DescribeRegions\",\n        \"ec2:DescribeInstances\",\n        \"ec2:DescribeInstanceTypes\",\n        \"ec2:DescribeVolumes\",\n        \"ec2:DescribeSpotInstanceRequests\",\n        \"ec2:DescribeAddresses\",\n        \"cloudwatch:GetMetricStatistics\",\n        \"sts:GetCallerIdentity\"\n      ],\n      \"Resource\": \"*\"\n    },\n    {\n      \"Sid\": \"LoadAverageProbe\",\n      \"Effect\": \"Allow\",\n      \"Action\": [\n        \"ssm:DescribeInstanceInformation\",\n        \"ssm:SendCommand\",\n        \"ssm:GetCommandInvocation\"\n      ],\n      \"Resource\": \"*\"\n    }\n  ]\n}\n```\n\n### Extra permissions for the reaper (only if you enable it)\n\nThe read-only policy above is enough for the watcher. The **reaper** additionally\nneeds the power to terminate instances (and, optionally, delete the per-instance\nidle alarm). This is a privileged grant — add it only if you use `reap`, and\nprefer a separate credential scoped to the regions/resources you actually reap:\n\n```json\n{\n  \"Sid\": \"Reaper\",\n  \"Effect\": \"Allow\",\n  \"Action\": [\n    \"ec2:TerminateInstances\",\n    \"cloudwatch:DeleteAlarms\"\n  ],\n  \"Resource\": \"*\"\n}\n```\n\nDrop the `cloudwatch:DeleteAlarms` action if you leave `reap.delete_alarm_template`\nunset. You can tighten `Resource`/add `Condition` keys (e.g. a tag condition that\nmirrors your `name_prefixes`) for defense in depth.\n\n## Configuration\n\nCredentials live in `.env` (git-ignored). Everything else is in `config.yaml`\n(also git-ignored); see `config.example.yaml` for the fully documented template.\nKey settings:\n\n- `email.to` / `email.method` (`sendmail` or `smtp`)\n- `regions` — `all` or an explicit list\n- `digest.hour` — local hour for the daily inventory\n- `renotify_hours` — alert de-duplication window\n- `alerts.*` — enable/disable each check and tune its thresholds\n- `suppress` — resource ids or `name:\u003cglob\u003e` to exclude from alerts\n\n## CLI\n\n```\naws_watch.py run          # the hourly cron logic (alerts + daily digest)\naws_watch.py report       # print full inventory to stdout, send nothing\naws_watch.py digest       # force-send a digest now\naws_watch.py test-email   # send a test e-mail\naws_watch.py reap         # DESTRUCTIVE: preview orphaned-instance reaping\naws_watch.py reap --apply # DESTRUCTIVE: actually terminate (see the reaper section)\n```\n\nUseful flags: `--dry-run` (print what would be e-mailed; for `reap`, forces\npreview even with `--apply`), `--apply` (`reap` only — really terminate),\n`--regions us-east-1,us-east-2`, `--config PATH`, `--env PATH`, `-v`.\n\n## A note on S3-compatible endpoints\n\nIf the host is configured to use an S3-compatible service (e.g. Wasabi) via\n`AWS_ENDPOINT_URL` or `~/.aws/config`, that would otherwise hijack these API\ncalls. aws_watch ignores the machine-wide AWS config and uses **only** the\ncredentials in its own `.env`, talking to real AWS endpoints.\n\n## Security\n\n- `.env` is git-ignored and should be `chmod 600`. Never commit real keys.\n- Use a dedicated, least-privilege IAM user (policy above).\n- If a key is ever pasted somewhere it shouldn't be, rotate it in IAM.\n\n## Development\n\n```sh\npython3 -m unittest discover -s tests -v\n```\n\nThe tests cover the pure logic (alerting, suppression, de-dup, age/load parsing,\nthe reaper's allowlist/protect/grace/idle decisions, and the no-line-drawing\nreport guarantee) and make no AWS calls.\n\n## License\n\nGPL-3.0-or-later. See [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Favwohl%2Faws_watch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Favwohl%2Faws_watch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Favwohl%2Faws_watch/lists"}