https://github.com/timo-kang/watchdog
Production-oriented edge watchdog for robots and sensor-dense edge devices with local supervision, module heartbeats, and CAN/EtherCAT health monitoring.
https://github.com/timo-kang/watchdog
can-bus edge-computing ethercat robotics watchdog
Last synced: 2 months ago
JSON representation
Production-oriented edge watchdog for robots and sensor-dense edge devices with local supervision, module heartbeats, and CAN/EtherCAT health monitoring.
- Host: GitHub
- URL: https://github.com/timo-kang/watchdog
- Owner: timo-kang
- Created: 2026-04-14T08:21:54.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-04-30T06:34:58.000Z (2 months ago)
- Last Synced: 2026-04-30T08:14:12.770Z (2 months ago)
- Topics: can-bus, edge-computing, ethercat, robotics, watchdog
- Language: Go
- Size: 128 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Watchdog
`watchdog` is a local-first health watchdog for robots and other sensor-dense edge devices.
It is meant to run on the robot, not in the cloud. It polls local health sources, writes incident snapshots, and sends structured action requests to a local supervisor. Remote dashboards and fleet systems can be added later, but they are not part of the safety-critical path.
## What Ships Today
Current binaries:
- `watchdog`: polling daemon
- `watchdog-supervisor`: local action receiver and latch
- `watchdogctl`: operator status tool
Current built-in source families:
- `host`: CPU temperature, memory, load, hottest sensor
- `module_reports`: JSON heartbeats from local modules, including C++ producers
- `systemd`: service state, main PID, restart count
- `network`: Linux link state and interface counters
- `power`: Linux `power_supply` state
- `storage`: free space, read-only state, busy percentage
- `time_sync`: `timedatectl` state, RTC drift, sync grace window
- `can`: SocketCAN and command-based probes
- `ethercat`: SOEM, partial IgH, and command-based probes
Current local action policy:
- `warn` -> `notify`
- `fail` or `stale` -> `degrade`
- EtherCAT lost slave or required link down -> `safe_stop`
- recovery -> `resolve`
Prometheus-compatible `/metrics` endpoints are now built into both `watchdog` and `watchdog-supervisor`, so the same surface can feed Prometheus, Grafana, or Datadog OpenMetrics collection.
## Install From Release
The published Linux target today is Ubuntu 24.04 x86_64.
Download the latest release asset:
- `watchdog-v-ubuntu24-amd64.tar.gz`
- `watchdog-v-ubuntu24-amd64.tar.gz.sha256`
Install it on the robot:
```bash
tar -xzf watchdog-v-ubuntu24-amd64.tar.gz
cd watchdog-v-ubuntu24-amd64
sudo install -d /etc/watchdog
sudo install -m 0755 bin/watchdog /usr/local/bin/watchdog
sudo install -m 0755 bin/watchdog-supervisor /usr/local/bin/watchdog-supervisor
sudo install -m 0755 bin/watchdogctl /usr/local/bin/watchdogctl
sudo install -m 0644 configs/watchdog.json /etc/watchdog/watchdog.json
sudo install -m 0644 configs/watchdog-supervisor.json /etc/watchdog/watchdog-supervisor.json
sudo install -m 0644 systemd/watchdog.service /etc/systemd/system/watchdog.service
sudo install -m 0644 systemd/watchdog-supervisor.service /etc/systemd/system/watchdog-supervisor.service
sudo systemctl daemon-reload
sudo systemctl enable --now watchdog-supervisor watchdog
```
The robot does not need the Go toolchain installed.
## Build From Source
```bash
go build ./...
```
Build release-style binaries locally:
```bash
mkdir -p dist/linux-amd64
go build -o dist/linux-amd64/watchdog ./cmd/watchdog
go build -o dist/linux-amd64/watchdog-supervisor ./cmd/watchdog-supervisor
go build -o dist/linux-amd64/watchdogctl ./cmd/watchdogctl
```
## First Config To Use
For a real Ubuntu 24.04 x86_64 node, start from:
- `configs/watchdog.ubuntu24-amd64.json`
- `configs/watchdog-supervisor.ubuntu24-amd64.json`
That baseline is intentionally conservative:
- enabled by default: `host`, `storage`, `time_sync`, module ingest, supervisor actions
- disabled until you fill real platform values: `systemd`, `network`, `power`, `can`, `ethercat`
For a fuller robot bring-up, use:
- `configs/watchdog.robot-baseline.example.json`
Edit these first:
- `systemd.units`
- `network.interfaces`
- `power.supplies`
- `can.interfaces`
- `ethercat.masters`
- `sources.time_sync.require_synchronized`
- `sources.time_sync.sync_grace_period`
## Runtime Layout
Config:
- `/etc/watchdog/watchdog.json`
- `/etc/watchdog/watchdog-supervisor.json`
Runtime sockets:
- `/run/watchdog/module.sock`
- `/run/watchdog/supervisor.sock`
Persistent state and incidents:
- `/var/lib/watchdog/incidents/`
- `/var/lib/watchdog/actions/`
- `/var/lib/watchdog/supervisor/current_state.json`
- `/var/lib/watchdog/supervisor/latest.json`
- `/var/lib/watchdog/supervisor/requests/`
Service logs:
- `journalctl -u watchdog`
- `journalctl -u watchdog-supervisor`
Metrics endpoints:
- `watchdog`: `127.0.0.1:9108/metrics`
- `watchdog-supervisor`: `127.0.0.1:9109/metrics`
If Grafana is running on a different machine, these loopback binds are not reachable from it. In that case, use the `remote-metrics` example configs and scrape the robot's real IP or hostname instead.
## How To Inspect It
Live logs:
```bash
sudo journalctl -u watchdog -f
sudo journalctl -u watchdog-supervisor -f
```
Operator view:
```bash
watchdogctl status -config /etc/watchdog/watchdog-supervisor.json
watchdogctl status -config /etc/watchdog/watchdog-supervisor.json -verbose
watchdogctl status -config /etc/watchdog/watchdog-supervisor.json -json -verbose
```
Important files:
```bash
sudo jq . /var/lib/watchdog/supervisor/current_state.json
sudo jq . /var/lib/watchdog/supervisor/latest.json
sudo ls -lt /var/lib/watchdog/incidents
```
## Prometheus and Grafana
Both processes can expose Prometheus-compatible metrics:
```json
"metrics": {
"enabled": true,
"listen_address": "127.0.0.1:9108",
"path": "/metrics"
}
```
Use loopback if Prometheus runs on the robot. Use a real interface bind such as `0.0.0.0:9108` only when a central Prometheus server is meant to scrape the robot directly.
The repository includes a local observability stack for the Docker sim:
- `deploy/observability/prometheus/prometheus.docker-sim.yml`
- `deploy/observability/grafana/provisioning/...`
- `deploy/observability/grafana/dashboards/watchdog-overview.json`
Run it with the simulator:
```bash
docker compose -f deploy/docker/docker-compose.sim.yml --profile observability up --build
```
Then open:
- Prometheus: `http://localhost:9091`
- Grafana: `http://localhost:3300`
- login: `admin`
- password: `admin`
The provisioned Grafana dashboard is `Watchdog Overview`.
### Central Prometheus Scraping Real Robots
For a monitoring server or laptop that scrapes one or more robots directly:
1. On each robot, make the metrics endpoints reachable.
Start from:
- `configs/watchdog.ubuntu24-amd64.remote-metrics.example.json`
- `configs/watchdog-supervisor.ubuntu24-amd64.remote-metrics.example.json`
2. Install those as:
- `/etc/watchdog/watchdog.json`
- `/etc/watchdog/watchdog-supervisor.json`
3. Restart the services:
```bash
sudo systemctl restart watchdog-supervisor watchdog
```
4. From the monitoring server, verify basic reachability before opening Grafana:
```bash
curl http://ROBOT_IP:9108/metrics
curl http://ROBOT_IP:9109/metrics
```
If Prometheus runs in Docker on the same Linux machine as the robot processes, use `host.docker.internal:9108` and `host.docker.internal:9109` in the Prometheus target list. The provided `docker-compose.server.yml` already maps `host.docker.internal` to the host gateway.
5. Run the central observability stack:
```bash
cd deploy/observability
$EDITOR prometheus/prometheus.robot-server.example.yml
docker compose -f docker-compose.server.yml up -d
```
Then open:
- Prometheus: `http://SERVER_IP:9091`
- Grafana: `http://SERVER_IP:3300`
If you see "no data" in Grafana, the first checks should be:
```bash
curl http://ROBOT_IP:9108/metrics
curl http://ROBOT_IP:9109/metrics
curl http://SERVER_IP:9091/api/v1/targets
```
The common failure cases are:
- Prometheus is still scraping the Docker sim targets instead of the robot IPs
- robot metrics are still bound to `127.0.0.1`
- firewall or network policy blocks `9108` or `9109`
- Prometheus target entries do not match the robot address
## Time Sync Behavior
`time_sync` now has a configurable grace window before an unsynchronized clock becomes a hard failure.
Config:
```json
"time_sync": {
"enabled": true,
"source_id": "system-clock",
"require_synchronized": true,
"warn_on_local_rtc": true,
"sync_grace_period": "10m"
}
```
Behavior:
- during the grace window: `warn`
- after the grace window expires: `fail`
- incident snapshots are written on state transitions, not every repeated poll
Use this intentionally:
- if the robot must eventually sync, keep `require_synchronized=true` and tune `sync_grace_period`
- if the robot is expected to run without synchronized time, set `require_synchronized=false`
## Module Heartbeats
Local modules send one JSON datagram per heartbeat to `module.sock`.
Minimal example payload:
```json
{
"source_id": "planner",
"severity": "warn",
"reason": "deadline miss",
"stale_after_ms": 1500,
"metrics": {
"deadline_miss_ms": 18.5
},
"labels": {
"process": "planner_main"
}
}
```
C++ helper code is in:
- `sdk/cpp/include/watchdog/client.hpp`
- `sdk/cpp/examples/send_heartbeat.cpp`
SOEM helper code is in:
- `sdk/cpp/include/watchdog/ethercat_probe.hpp`
- `sdk/cpp/examples/emit_soem_probe.cpp`
## Simulation
Local demo config:
- `configs/watchdog.local-demo.example.json`
Docker simulation stack:
- `deploy/docker/docker-compose.sim.yml`
Run the Docker sim:
```bash
docker compose -f deploy/docker/docker-compose.sim.yml up --build
```
Inspect it:
```bash
docker compose -f deploy/docker/docker-compose.sim.yml logs -f watchdog watchdog-supervisor planner-sim
docker compose -f deploy/docker/docker-compose.sim.yml exec watchdog-supervisor /usr/local/bin/watchdogctl status -config /configs/watchdog-supervisor.docker-sim.json
```
## Repository Layout
- `cmd/watchdog`: daemon entrypoint
- `cmd/watchdog-supervisor`: local receiver and hook dispatcher
- `cmd/watchdogctl`: status CLI
- `cmd/watchdog-sim-module`: simulation producer
- `internal/adapters`: collectors
- `internal/actions`: action request building and delivery
- `internal/config`: config loading and validation
- `internal/health`: normalized health model
- `internal/incident`: incident persistence
- `internal/rules`: severity evaluation
- `internal/supervisor`: local supervisor state and hook execution
- `deploy/systemd`: unit files
- `deploy/docker`: simulation stack
- `configs`: example configs
- `docs`: roadmap and interface notes
## More Docs
- `docs/milestones.md`: project milestones
- `docs/bus-integration.md`: CAN and EtherCAT integration handoff
- `docs/action-interface.md`: watchdog to supervisor contract
- `docs/observability.md`: metrics, Prometheus, Grafana, and dashboard notes
## Current Boundaries
This is a local watchdog stack, not yet a full robot control-plane product.
What it already does well:
- local health polling
- component-level state derivation
- incident snapshot writing
- supervisor latching and audit
- C++ heartbeat integration
- baseline host, storage, time, network, power, CAN, and EtherCAT inputs
What still belongs outside this repo:
- hard real-time actuator safety
- final robot FSM and autonomy policy
- fleet dashboards and remote command center
- vendor-specific telemetry for every module on the robot