{"id":49594229,"url":"https://github.com/timo-kang/watchdog","last_synced_at":"2026-05-04T03:08:29.242Z","repository":{"id":352527699,"uuid":"1210250674","full_name":"timo-kang/watchdog","owner":"timo-kang","description":"Production-oriented edge watchdog for robots and sensor-dense edge devices with local supervision, module heartbeats, and CAN/EtherCAT health monitoring.","archived":false,"fork":false,"pushed_at":"2026-04-30T06:34:58.000Z","size":131,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-30T08:14:12.770Z","etag":null,"topics":["can-bus","edge-computing","ethercat","robotics","watchdog"],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/timo-kang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-14T08:21:54.000Z","updated_at":"2026-04-30T06:33:26.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/timo-kang/watchdog","commit_stats":null,"previous_names":["timo-kang/watchdog"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/timo-kang/watchdog","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timo-kang%2Fwatchdog","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timo-kang%2Fwatchdog/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timo-kang%2Fwatchdog/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timo-kang%2Fwatchdog/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/timo-kang","download_url":"https://codeload.github.com/timo-kang/watchdog/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timo-kang%2Fwatchdog/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32592740,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-03T22:12:39.696Z","status":"online","status_checked_at":"2026-05-04T02:00:06.625Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["can-bus","edge-computing","ethercat","robotics","watchdog"],"created_at":"2026-05-04T03:08:28.697Z","updated_at":"2026-05-04T03:08:29.234Z","avatar_url":"https://github.com/timo-kang.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Watchdog\n\n`watchdog` is a local-first health watchdog for robots and other sensor-dense edge devices.\n\nIt is meant to run on the robot, not in the cloud. It polls local health sources, writes incident snapshots, and sends structured action requests to a local supervisor. Remote dashboards and fleet systems can be added later, but they are not part of the safety-critical path.\n\n## What Ships Today\n\nCurrent binaries:\n\n- `watchdog`: polling daemon\n- `watchdog-supervisor`: local action receiver and latch\n- `watchdogctl`: operator status tool\n\nCurrent built-in source families:\n\n- `host`: CPU temperature, memory, load, hottest sensor\n- `module_reports`: JSON heartbeats from local modules, including C++ producers\n- `systemd`: service state, main PID, restart count\n- `network`: Linux link state and interface counters\n- `power`: Linux `power_supply` state\n- `storage`: free space, read-only state, busy percentage\n- `time_sync`: `timedatectl` state, RTC drift, sync grace window\n- `can`: SocketCAN and command-based probes\n- `ethercat`: SOEM, partial IgH, and command-based probes\n\nCurrent local action policy:\n\n- `warn` -\u003e `notify`\n- `fail` or `stale` -\u003e `degrade`\n- EtherCAT lost slave or required link down -\u003e `safe_stop`\n- recovery -\u003e `resolve`\n\nPrometheus-compatible `/metrics` endpoints are now built into both `watchdog` and `watchdog-supervisor`, so the same surface can feed Prometheus, Grafana, or Datadog OpenMetrics collection.\n\n## Install From Release\n\nThe published Linux target today is Ubuntu 24.04 x86_64.\n\nDownload the latest release asset:\n\n- `watchdog-v\u003cversion\u003e-ubuntu24-amd64.tar.gz`\n- `watchdog-v\u003cversion\u003e-ubuntu24-amd64.tar.gz.sha256`\n\nInstall it on the robot:\n\n```bash\ntar -xzf watchdog-v\u003cversion\u003e-ubuntu24-amd64.tar.gz\ncd watchdog-v\u003cversion\u003e-ubuntu24-amd64\n\nsudo install -d /etc/watchdog\nsudo install -m 0755 bin/watchdog /usr/local/bin/watchdog\nsudo install -m 0755 bin/watchdog-supervisor /usr/local/bin/watchdog-supervisor\nsudo install -m 0755 bin/watchdogctl /usr/local/bin/watchdogctl\nsudo install -m 0644 configs/watchdog.json /etc/watchdog/watchdog.json\nsudo install -m 0644 configs/watchdog-supervisor.json /etc/watchdog/watchdog-supervisor.json\nsudo install -m 0644 systemd/watchdog.service /etc/systemd/system/watchdog.service\nsudo install -m 0644 systemd/watchdog-supervisor.service /etc/systemd/system/watchdog-supervisor.service\nsudo systemctl daemon-reload\nsudo systemctl enable --now watchdog-supervisor watchdog\n```\n\nThe robot does not need the Go toolchain installed.\n\n## Build From Source\n\n```bash\ngo build ./...\n```\n\nBuild release-style binaries locally:\n\n```bash\nmkdir -p dist/linux-amd64\ngo build -o dist/linux-amd64/watchdog ./cmd/watchdog\ngo build -o dist/linux-amd64/watchdog-supervisor ./cmd/watchdog-supervisor\ngo build -o dist/linux-amd64/watchdogctl ./cmd/watchdogctl\n```\n\n## First Config To Use\n\nFor a real Ubuntu 24.04 x86_64 node, start from:\n\n- `configs/watchdog.ubuntu24-amd64.json`\n- `configs/watchdog-supervisor.ubuntu24-amd64.json`\n\nThat baseline is intentionally conservative:\n\n- enabled by default: `host`, `storage`, `time_sync`, module ingest, supervisor actions\n- disabled until you fill real platform values: `systemd`, `network`, `power`, `can`, `ethercat`\n\nFor a fuller robot bring-up, use:\n\n- `configs/watchdog.robot-baseline.example.json`\n\nEdit these first:\n\n- `systemd.units`\n- `network.interfaces`\n- `power.supplies`\n- `can.interfaces`\n- `ethercat.masters`\n- `sources.time_sync.require_synchronized`\n- `sources.time_sync.sync_grace_period`\n\n## Runtime Layout\n\nConfig:\n\n- `/etc/watchdog/watchdog.json`\n- `/etc/watchdog/watchdog-supervisor.json`\n\nRuntime sockets:\n\n- `/run/watchdog/module.sock`\n- `/run/watchdog/supervisor.sock`\n\nPersistent state and incidents:\n\n- `/var/lib/watchdog/incidents/`\n- `/var/lib/watchdog/actions/`\n- `/var/lib/watchdog/supervisor/current_state.json`\n- `/var/lib/watchdog/supervisor/latest.json`\n- `/var/lib/watchdog/supervisor/requests/`\n\nService logs:\n\n- `journalctl -u watchdog`\n- `journalctl -u watchdog-supervisor`\n\nMetrics endpoints:\n\n- `watchdog`: `127.0.0.1:9108/metrics`\n- `watchdog-supervisor`: `127.0.0.1:9109/metrics`\n\nIf Grafana is running on a different machine, these loopback binds are not reachable from it. In that case, use the `remote-metrics` example configs and scrape the robot's real IP or hostname instead.\n\n## How To Inspect It\n\nLive logs:\n\n```bash\nsudo journalctl -u watchdog -f\nsudo journalctl -u watchdog-supervisor -f\n```\n\nOperator view:\n\n```bash\nwatchdogctl status -config /etc/watchdog/watchdog-supervisor.json\nwatchdogctl status -config /etc/watchdog/watchdog-supervisor.json -verbose\nwatchdogctl status -config /etc/watchdog/watchdog-supervisor.json -json -verbose\n```\n\nImportant files:\n\n```bash\nsudo jq . /var/lib/watchdog/supervisor/current_state.json\nsudo jq . /var/lib/watchdog/supervisor/latest.json\nsudo ls -lt /var/lib/watchdog/incidents\n```\n\n## Prometheus and Grafana\n\nBoth processes can expose Prometheus-compatible metrics:\n\n```json\n\"metrics\": {\n  \"enabled\": true,\n  \"listen_address\": \"127.0.0.1:9108\",\n  \"path\": \"/metrics\"\n}\n```\n\nUse loopback if Prometheus runs on the robot. Use a real interface bind such as `0.0.0.0:9108` only when a central Prometheus server is meant to scrape the robot directly.\n\nThe repository includes a local observability stack for the Docker sim:\n\n- `deploy/observability/prometheus/prometheus.docker-sim.yml`\n- `deploy/observability/grafana/provisioning/...`\n- `deploy/observability/grafana/dashboards/watchdog-overview.json`\n\nRun it with the simulator:\n\n```bash\ndocker compose -f deploy/docker/docker-compose.sim.yml --profile observability up --build\n```\n\nThen open:\n\n- Prometheus: `http://localhost:9091`\n- Grafana: `http://localhost:3300`\n  - login: `admin`\n  - password: `admin`\n\nThe provisioned Grafana dashboard is `Watchdog Overview`.\n\n### Central Prometheus Scraping Real Robots\n\nFor a monitoring server or laptop that scrapes one or more robots directly:\n\n1. On each robot, make the metrics endpoints reachable.\n   Start from:\n   - `configs/watchdog.ubuntu24-amd64.remote-metrics.example.json`\n   - `configs/watchdog-supervisor.ubuntu24-amd64.remote-metrics.example.json`\n\n2. Install those as:\n   - `/etc/watchdog/watchdog.json`\n   - `/etc/watchdog/watchdog-supervisor.json`\n\n3. Restart the services:\n\n```bash\nsudo systemctl restart watchdog-supervisor watchdog\n```\n\n4. From the monitoring server, verify basic reachability before opening Grafana:\n\n```bash\ncurl http://ROBOT_IP:9108/metrics\ncurl http://ROBOT_IP:9109/metrics\n```\n\nIf Prometheus runs in Docker on the same Linux machine as the robot processes, use `host.docker.internal:9108` and `host.docker.internal:9109` in the Prometheus target list. The provided `docker-compose.server.yml` already maps `host.docker.internal` to the host gateway.\n\n5. Run the central observability stack:\n\n```bash\ncd deploy/observability\n$EDITOR prometheus/prometheus.robot-server.example.yml\ndocker compose -f docker-compose.server.yml up -d\n```\n\nThen open:\n\n- Prometheus: `http://SERVER_IP:9091`\n- Grafana: `http://SERVER_IP:3300`\n\nIf you see \"no data\" in Grafana, the first checks should be:\n\n```bash\ncurl http://ROBOT_IP:9108/metrics\ncurl http://ROBOT_IP:9109/metrics\ncurl http://SERVER_IP:9091/api/v1/targets\n```\n\nThe common failure cases are:\n\n- Prometheus is still scraping the Docker sim targets instead of the robot IPs\n- robot metrics are still bound to `127.0.0.1`\n- firewall or network policy blocks `9108` or `9109`\n- Prometheus target entries do not match the robot address\n\n## Time Sync Behavior\n\n`time_sync` now has a configurable grace window before an unsynchronized clock becomes a hard failure.\n\nConfig:\n\n```json\n\"time_sync\": {\n  \"enabled\": true,\n  \"source_id\": \"system-clock\",\n  \"require_synchronized\": true,\n  \"warn_on_local_rtc\": true,\n  \"sync_grace_period\": \"10m\"\n}\n```\n\nBehavior:\n\n- during the grace window: `warn`\n- after the grace window expires: `fail`\n- incident snapshots are written on state transitions, not every repeated poll\n\nUse this intentionally:\n\n- if the robot must eventually sync, keep `require_synchronized=true` and tune `sync_grace_period`\n- if the robot is expected to run without synchronized time, set `require_synchronized=false`\n\n## Module Heartbeats\n\nLocal modules send one JSON datagram per heartbeat to `module.sock`.\n\nMinimal example payload:\n\n```json\n{\n  \"source_id\": \"planner\",\n  \"severity\": \"warn\",\n  \"reason\": \"deadline miss\",\n  \"stale_after_ms\": 1500,\n  \"metrics\": {\n    \"deadline_miss_ms\": 18.5\n  },\n  \"labels\": {\n    \"process\": \"planner_main\"\n  }\n}\n```\n\nC++ helper code is in:\n\n- `sdk/cpp/include/watchdog/client.hpp`\n- `sdk/cpp/examples/send_heartbeat.cpp`\n\nSOEM helper code is in:\n\n- `sdk/cpp/include/watchdog/ethercat_probe.hpp`\n- `sdk/cpp/examples/emit_soem_probe.cpp`\n\n## Simulation\n\nLocal demo config:\n\n- `configs/watchdog.local-demo.example.json`\n\nDocker simulation stack:\n\n- `deploy/docker/docker-compose.sim.yml`\n\nRun the Docker sim:\n\n```bash\ndocker compose -f deploy/docker/docker-compose.sim.yml up --build\n```\n\nInspect it:\n\n```bash\ndocker compose -f deploy/docker/docker-compose.sim.yml logs -f watchdog watchdog-supervisor planner-sim\ndocker compose -f deploy/docker/docker-compose.sim.yml exec watchdog-supervisor /usr/local/bin/watchdogctl status -config /configs/watchdog-supervisor.docker-sim.json\n```\n\n## Repository Layout\n\n- `cmd/watchdog`: daemon entrypoint\n- `cmd/watchdog-supervisor`: local receiver and hook dispatcher\n- `cmd/watchdogctl`: status CLI\n- `cmd/watchdog-sim-module`: simulation producer\n- `internal/adapters`: collectors\n- `internal/actions`: action request building and delivery\n- `internal/config`: config loading and validation\n- `internal/health`: normalized health model\n- `internal/incident`: incident persistence\n- `internal/rules`: severity evaluation\n- `internal/supervisor`: local supervisor state and hook execution\n- `deploy/systemd`: unit files\n- `deploy/docker`: simulation stack\n- `configs`: example configs\n- `docs`: roadmap and interface notes\n\n## More Docs\n\n- `docs/milestones.md`: project milestones\n- `docs/bus-integration.md`: CAN and EtherCAT integration handoff\n- `docs/action-interface.md`: watchdog to supervisor contract\n- `docs/observability.md`: metrics, Prometheus, Grafana, and dashboard notes\n\n## Current Boundaries\n\nThis is a local watchdog stack, not yet a full robot control-plane product.\n\nWhat it already does well:\n\n- local health polling\n- component-level state derivation\n- incident snapshot writing\n- supervisor latching and audit\n- C++ heartbeat integration\n- baseline host, storage, time, network, power, CAN, and EtherCAT inputs\n\nWhat still belongs outside this repo:\n\n- hard real-time actuator safety\n- final robot FSM and autonomy policy\n- fleet dashboards and remote command center\n- vendor-specific telemetry for every module on the robot\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftimo-kang%2Fwatchdog","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftimo-kang%2Fwatchdog","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftimo-kang%2Fwatchdog/lists"}