https://github.com/compscifutures/apmonitor
On-prem/LAN availability monitoring with realtime guarantees & decaying alert pacing. Multithreaded high speed availability checking for PING, TCP/UDP, QUIC & HTTP/S resources incl. SSL/TLS cert. pinning. Integrates w/Site24x7 heartbeat monitoring for failover alerts + Slack & Pushover webhooks. Thread safe, reentrant, easily modifiable.
availability availability-and-monitoring discord h3 heartbeat-monitor http http3 https lan monitoring on-premise pagerduty pushover quic realtime servermasters site24x7 slack ssl-pinning tcp
- Host: GitHub
- URL: https://github.com/compscifutures/apmonitor
- Owner: CompSciFutures
- License: other
- Created: 2025-11-22T12:34:55.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2026-02-11T06:07:38.000Z (3 months ago)
- Last Synced: 2026-02-11T06:54:34.259Z (3 months ago)
- Topics: availability, availability-and-monitoring, discord, h3, heartbeat-monitor, http, http3, https, lan, monitoring, on-premise, pagerduty, pushover, quic, realtime, servermasters, site24x7, slack, ssl-pinning, tcp
- Language: Python
- Homepage: https://blog.andrewprendergast.com
- Size: 6.56 MB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.md
- License: LICENSE.txt

# `APMonitor.py` - A Hands-Off Layer 2 & 4 On-Premises Monitoring Tool with Alert Delivery Guarantees
***Built for NOCs and OT/ICS Sensor Networks***: This is an on-prem monitoring tool written completely in very clear Python-only code (so you can modify it) and is designed to work on a LAN for on-prem availability monitoring of resources that aren't necessarily connected to The Internet, and/or where the on-prem monitoring itself is also required to have availability guarantees.
It is particularly suited to availability monitoring of embedded devices +/- 10 secs. It's designed primarily for firewalls, switches, routers, hubs, environmental sensors & #OT / #ICS systems, but works with normal servers & services as well.
It supports multi-threaded availability checking of monitored resources for high-speed, near-realtime performance, if that is what you need (see the `-t` command line option). The default mode is single-threaded, which keeps the logs easy to read and runs well on small systems like a Raspberry Pi.
It also supports pacing of monitoring alarms using a decaying curve that delivers alert notifications quickly at the start, then slows down notifications over time.
`APMonitor.py` (APMonitor) is primarily designed to work in tandem with [Site24x7](https://site24x7.com) and integrates very well with their "[Heartbeat Monitoring](https://www.site24x7.com/help/heartbeat/)".
To achieve **guaranteed always-on monitoring service levels**, simply set up local availability monitors in your config, [sign up for a Pro Plan at Site24x7](https://www.site24x7.com/site24x7-pricing.html), then use the `heartbeat_url` and `heartbeat_every_n_secs` configuration options of `APMonitor.py` to ping a [Heartbeat Monitoring](https://www.site24x7.com/help/heartbeat/) URL endpoint at [Site24x7](https://site24x7.com) whenever the monitored resource is up. This ensures that when a heartbeat doesn't arrive from APMonitor, monitoring alerts fall back to Site24x7, and when both are working you have second-opinion availability monitoring reporting.
**The service level guarantee works as follows:** If the resource is down, `APMonitor.py` won't hit the [Heartbeat Monitoring](https://www.site24x7.com/help/heartbeat/) endpoint URL, and Site24x7 will then send an alert about the missed heartbeat without the need for any additional dependencies on-prem/on-site. So the entire machine `APMonitor.py` is running on can fall over, and you still get availability monitoring alerts sent, with all the benefits of having on-prem monitoring on your local network behind your firewall.
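
The gating described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `APMonitor.py`: the names `HeartbeatPacer` and `send_heartbeat` are hypothetical, and it assumes the heartbeat endpoint accepts a plain HTTP GET.

```python
import time
import urllib.error
import urllib.request


class HeartbeatPacer:
    """Decides when a heartbeat is due for one monitored resource (hypothetical helper)."""

    def __init__(self, heartbeat_every_n_secs, clock=time.monotonic):
        self.every = heartbeat_every_n_secs
        self.clock = clock
        self.last_sent = None  # no heartbeat sent yet

    def due(self, resource_is_up):
        # A down resource never produces a heartbeat; the missed heartbeat
        # is what makes the remote service raise the alert.
        if not resource_is_up:
            return False
        if self.last_sent is None:
            return True
        return self.clock() - self.last_sent >= self.every

    def mark_sent(self):
        self.last_sent = self.clock()


def send_heartbeat(url, timeout=10):
    """GET the heartbeat URL; returns True on a 2xx response, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False
```

In a polling loop you would call `pacer.due(check_passed)` after each availability check, and call `send_heartbeat(...)` followed by `pacer.mark_sent()` only when it returns `True`. Nothing is sent while the resource is down, so the remote side sees the gap.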
You can quickly sign up for a [Site24x7.com Lite or Pro Plan](https://www.site24x7.com/site24x7-pricing.html) for \$10-\$50 USD per month, then set up a bunch of [Heartbeat Monitoring](https://www.site24x7.com/help/heartbeat/) URL endpoints that work with `APMonitor.py` rather easily.
**Note: Heartbeat Monitoring is not available on their Website Monitoring plans. You need an 'Infrastructure Monitoring' or 'All-In-One' plan for it to work correctly.**
APMonitor also integrates well with [Slack](https://slack.com/) and [Pushover](https://pushover.net/) via webhook URL endpoints, and supports email notifications via SMTP.
APMonitor is a neat way to guarantee your on-prem availability monitoring will always let you know about an outage and to avoid putting resources onto the net that don't need to be.
Andrew (AP) Prendergast
https://linktr.ee/CompSciFutures
Master of Science
Ex-ServerMasters
Ex-Googler
Ex-Xerox PARC/PARK
Ex-Intel Foundry
Ex Chief Scientist @ Clemenger BBDO / Omnicom
[ACM](https://acm.org/), [IEEE](https://ieee.org) & [INFORMS](https://informs.org) member.
[Donate via PayPal](https://www.paypal.com/donate/?hosted_button_id=WN472NX5XC5CJ)
If you find APMonitor.py useful in your NOC, for monitoring your IOT/ICS devices,
or would like email / telephone support, please consider
a regular donation via Buy me a coffee,
so I can keep improving it.
Telephone Support: +61497222775
Support email: hello@enertium.org
# Quickstart
To run APMonitor with a configuration file and auto-derived statefile under `/var/tmp/APMonitor/`:
```bash
./APMonitor.py test-apmonitor-config.yaml --generate-rrds
./APMonitor.py site1.yaml site2.yaml --generate-mrtg-config
```
To properly set up `APMonitor.py`:
1. Spin up Debian Linux on a VM or PC on a Card/PC on a Chip (e.g., rPI) - optional but recommended
A dedicated machine is needed if you install the MRTG web interface, because the install takes over control of `/var/www/html`.
2. Install APMonitor (to spin up `APMonitor.py` in `systemctl` as `apmonitor.service`)
```bash
sudo make install
```
3. Install MRTG web interface (to spin up an NGINX webserver for MRTG charts in `systemctl` as `apmonitor-nginx.service`)
```bash
sudo make installmrtg
```
4. Edit `/usr/local/etc/apmonitor-config.yaml`
See Configuration Options for site file configuration details.
5. Test the config (using `./APMonitor.py --test-config /usr/local/etc/apmonitor-config.yaml`):
```
sudo make test-config
```
6. Start monitoring:
```bash
sudo make enable
```
**Note:** Statefiles are stored under `/var/tmp/APMonitor/` by default, e.g. `/var/tmp/APMonitor/apmonitor-config.statefile.json` for a default install. The `-s` flag overrides this for single-config invocations only.
That's it!
> [!WARNING]
> If you are upgrading to the 1.3.x stream: This is a schema change release stream that contains RRD & config YAML schema changes that require existing RRD files to be deleted and recreated before upgrading.
> APMonitor will auto-heal existing RRDs on first run when `--generate-rrds` or `--generate-mrtg-config` is specified.
>
> To do a full upgrade change your YAML to replace `type: snmp` with `type: ports` then execute something similar to this command:
>
> ```
> cp tellusion-apmonitor-config.yaml /usr/local/etc/apmonitor-config.yaml; \
> make install; make installmrtg; \
> rm /var/tmp/apmonitor-statefile.rrd/*
> ```
# Expected Output with MRTG/RRD Integration Enabled
Installing MRTG with `make install; make installmrtg` will spin up via `rc.d` a small lightweight NGINX web server with FastCGI on http://localhost:888/, as follows:

This layout is specifically designed for now commonly available 4K Ultra HD (3840x2160 16:9 2160p) screens. It's not uncommon to see modern NOCs with an array of these on the wall at eye height when someone is sitting down.
Instead of just having CCTV, you can now add some proper network telemetry and instrumentation, say with one YAML site file per screen, on the top row of screens.
Clicking on the heading associated with a set of ports will provide more L2/L3 information (depending on what's available via SNMP):

Note the NGINX/FastCGI combination means we don't need to keep a machine chewing on itself generating charts anymore - they are now generated on demand in near-realtime and extremely efficiently. The only I/O is the RRD files, which under the hood operate very much like the older MRTG text file format.
I chose RRD because it's a rather good frequency domain format for data warehousing of frequency domain sample data that's still compatible with Tier 1 NOCs.
If you want to work with this data directly, consider looking at LibROSA from NYU's Fourier Lab team.
It is designed for working with Frequency Domain/Time Domain data and has a rather nifty spectrogram visualisation which might be relevant to you, amongst other things.
See the launch lecture given at SciPy for more information.
You might also want to look at nixtla.io or R's seasonal decomposition function called `stl`. Nixtla is more advanced and I've posted on X about it here.
# Design Philosophy & Provenance
Once upon a time, I was well known in data center circles along Highway 101 in Silicon Valley for carrying in my back pocket a super lightweight pure C/libc cross-platform availability monitoring tool with no dependencies whatsoever called `APMonitor.c`. I'd graciously provide the source code to anyone who asked.
This is a rebuild of that project with enhanced features, starting with a Python prototype.
The design philosophy centers on simplicity and elegance: a single, unified source file containing the main execution flow for a 100% on-premises/LAN availability monitoring tool with guaranteed alerts and intelligent pacing.
Key Features:
- Near-realtime programming so heartbeats and alerts arrive when they say they are going to (+/- 10 secs)
- Multithreaded high-speed availability checking for PING, TCP, UDP, QUIC, HTTP(S), and SNMP resources
- SSL/TLS certificate checking and pinning so you can use self-signed certificates on-lan safely
- SNMP monitoring for network device interface bandwidth, I/O statistics, and TCP retransmit metrics
- Host performance monitoring (CPU, memory, disk I/O, swap, interrupts) per *System Performance Tuning* by Musumeci & Loukides (O'Reilly)
- Integration with Site24x7/PagerDuty heartbeat monitoring for high-availability second-opinion and failover alerting
- Integration with Slack and Pushover webhooks for notifications, plus standard email support
- Smart notification pacing: rapid alerts initially, then gradually decreasing frequency for extended outages
- Multi-site monitoring: for multiple single panes of glass, pass multiple config files on the command line; each runs concurrently as an independent subprocess with its own statefile, RRD database, and MRTG index
- Runs on everything from Raspberry Pi to enterprise systems
- Super accurate, high-frequency monitoring for real-time / embedded / heartbeat monitored environments
- Thread-safe, reentrant, and easily modifiable
- GPL 3.0 free open source always, so you know there are no backdoors
## Alternatives
If lightweight or realtime guarantees aren't important to you, and you want something more feature packed,
consider these on-prem alternatives:
- Uptime Kuma
- Statping
- UptimeRobot
- Paessler PRTG
APMonitor is simple, minimalist, elegant and lightweight and comes from a reliable line of heritage so you can spin
it up fast as a 2nd opinion monitoring tool with little more than a `make install`. If you want something more
sophisticated that's less focused on realtime programming or elegant simplicity, take a look at those very capable
alternatives.
# Relevance to the 12 Pillars of Information Security
NB: This tool is useful for implementing the second & third pillars (Availability & System Integrity)
from the 12 Pillars of Information Security, for Necessary, Sufficient & Complete Security:

Also be mindful of the Attack Surface Kill-Switch Riddle:

To address this riddle, you should try to configure your machines & devices so that even if they are shutdown or halted in some way,
the Ethernet MAC address can still be read at Layer 2 so you can still receive alerts like this:

NB. Make sure your definition of "Kill Switched" is well defined and tested before the time comes to use it.
E.g., downing a port never works long term, it's merely advisory and something one does as they walk across the floor to unplug the cable from a switch.
Or is it, if you have this? YMMV.
See DOI [10.13140/RG.2.2.12609.84321](https://doi.org/10.13140/RG.2.2.12609.84321) and associated [LinkedIn post](https://www.linkedin.com/feed/update/urn:li:activity:7331490410197905409/) for more information on the Pillars of Information Security. It borrows from a piece of work I did back when #PARC needed me to work on #BookMasters in the digital era.
## Recommended configurations for addressing the first pillar: Physical Security
Using `APMonitor.py` to address Availability & System Integrity can help with maintaining Physical Security. Here are some tips from the trenches on keeping server equipment secure.
### Removing SIM Cards from Inner Range T4000 remote monitored alarm devices
Inner Range has become a dominating force in access control and alarm systems in IDCs, offices and high-end homes around the western world in recent times.
What installers don't tell you is that they are full of vendor backdoors. The best way to address this is to remove its 3G/4G access to your monitoring station via The Internets entirely and put it on your LAN, so it goes through normal governance, risk and compliance like all other devices.
NB: Know this: in addition to vendor backdoors, every remote monitored alarm is a reverse shell. That's just how it is.
Steps to securing your T4000 and Inner Range devices from Vendor Backdoors:
1. Block all communications with Inner Range directly from your IOT network:
You do not want your T4000, Inception or Integriti devices communicating with the default IPs associated with Inner Range which are published here.
2. Remove the SIMs from your T4000 so all traffic routes through your availability monitored network:
A boxed T4000 unit:

A T4000 unit with its SIMs removed:

This will stop it talking to home base with reverse shells and vendor backdoors.
3. Plug in the GigE adapter from your IOT network to the T4000 (grey cable in picture above).
NB: Removing the SIMs breaks the circuit that allows the device to communicate wirelessly.
NNB: This is a valid enterprise grade T4000 configuration.
### Using Chinese made pin entry locks with protective covers
All locks can be picked, and all high security registered key systems can have additional keys cut by the police
or anyone persuasive enough (read: vendor backdoors & $$$ respectively) to get a locksmith to make a spare key.
I've seen it happen to server rooms several times over the years.
To get around the problem, we combine normal physical locks with Chinese made electronic pin locks from eBay,
but they all suffer the same issue of being easily circumventable using a credit card or knife, as this video demonstrates:

To address the problem, we get a metal fab to manufacture a protective plate to cover the lock so it can't be so easily circumvented:

Here is the same video for a lock with a plate installed - can't open it now:

And here are the basic plans to get a metal fab to create a Protective Striker Cover Plate for you:
[](physical-security/Striker-Plate-Cover-CAD-design.pdf)
For maximum security, try to customize the lip that covers the front of the door to be as wide as possible without
bumping into the actual lock (marked as 35.0 and 19.3 in the CAD diagram).
### Using a span port + tcpdump to analyse IOT traffic for security devices
Sometimes we just want to know what a device or an IOT network is communicating with on The Internets. Here is how it's done.
First you need to slurp up some packets using tcpdump + spans, then analyse it using tshark and sed/awk/grep, as follows.
Steps to monitor TCP/IP connectivity by a device:
1. Set up your IOT switch so that all traffic over the uplink port is spanned onto a secondary port (all managed switches do this - look at the manual on how to set up a span).
NB: `APMonitor.py` may take this input as a live feed in future, so get used to working with spans and taps.
2. Plug a linux box into the span port and dump the traffic on the port using `tcpdump` into daily `.pcap` files:
```
apt install tcpdump wireshark tshark
tcpdump -i eno1 \
-nn -e -v -t --print --immediate-mode -l \
-G 86400 -Z ap -w %Y%m%d-%H%M%S-eno1.pcap -W 90 -C 10240
```
3. Run this script over the `.pcap` files:
```
ls *.pcap | \
xargs -I {} tshark -r {} -d tcp.port==40844,http -d tcp.port==40844,tls -Y '(eth.addr==00:11:b9:06:93:fe or eth.addr==00:11:b9:09:04:ff) and (ip or ipv6)' -T fields -e eth.src -e eth.dst -e ip.version -e ip.proto -e ip.src -e ip.dst -e tcp.srcport -e tcp.dstport -e udp.srcport -e udp.dstport -e http.host -e tls.handshake.extensions_server_name > /tmp/tshark_output.txt
awk -F'\t' '
# Pass 1: Build lookup table
NR==FNR {
ip = ($1 == "00:11:b9:06:93:fe" || $1 == "00:11:b9:09:04:ff") ? $6 : $5;
http_host = $11;
tls_sni = $12;
if ((http_host || tls_sni) && !app_hosts[ip]) {
app_hosts[ip] = http_host ? http_host : tls_sni;
print "added: " ip " = " app_hosts[ip] > "/dev/stderr";
}
next;
}
# Pass 2: Use lookup table
{
mac = ($1 == "00:11:b9:06:93:fe" || $1 == "00:11:b9:09:04:ff") ? $1 : $2;
ip = ($1 == "00:11:b9:06:93:fe" || $1 == "00:11:b9:09:04:ff") ? $6 : $5;
proto = ($4 == "6") ? "tcp" : ($4 == "17") ? "udp" : $4;
src_port = $7 ? $7 : $9;
dst_port = $8 ? $8 : $10;
remote_port = ($1 == "00:11:b9:06:93:fe" || $1 == "00:11:b9:09:04:ff") ? dst_port : src_port;
app_host = (app_hosts[ip] ? app_hosts[ip] : "-");
if (remote_port) print mac "\t" ip "\t" remote_port "/" proto "\t" app_host;
}
' /tmp/tshark_output.txt /tmp/tshark_output.txt | \
sort | uniq -c | \
awk '{print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5}' | \
while IFS=$'\t' read count mac ip port_proto app_host; do
hostname=$(host $ip 2>/dev/null | awk '{print $NF}' | sed 's/\.$//')
port=$(echo $port_proto | cut -d/ -f1)
proto=$(echo $port_proto | cut -d/ -f2)
service=$(getent services "$port/$proto" 2>/dev/null | awk '{print $1}')
echo "$count $mac $ip $port_proto ${service:-unknown} $app_host $hostname"
done && rm /tmp/tshark_output.txt
```
Which for a T4000 should generate output such as the following:
```
added: 142.251.2.109 = smtp.gmail.com
added: 74.125.137.108 = smtp.gmail.com
added: 74.125.137.109 = smtp.gmail.com
added: 142.251.2.108 = smtp.gmail.com
added: 142.250.101.108 = smtp.gmail.com
added: 142.250.141.108 = smtp.gmail.com
added: 142.250.141.109 = smtp.gmail.com
added: 142.250.101.109 = smtp.gmail.com
added: 212.227.81.55 = ipv4.connman.net
added: 172.67.221.214 = irmsg.vizdynamics.com
added: 104.21.67.116 = irmsg.vizdynamics.com
201 00:11:b9:06:93:fe 137.116.114.112 40844/tcp unknown - 3(NXDOMAIN)
16 00:11:b9:06:93:fe 192.168.68.1 67/udp bootps - 3(NXDOMAIN)
5382 00:11:b9:06:93:fe 23.101.229.107 40844/tcp unknown - 3(NXDOMAIN)
11 00:11:b9:06:93:fe 255.255.255.255 67/udp bootps - 3(NXDOMAIN)
2 00:11:b9:06:93:fe 9.9.9.9 53/udp domain - dns9.quad9.net
12 00:11:b9:09:04:ff 104.21.67.116 443/tcp https irmsg.vizdynamics.com 3(NXDOMAIN)
16 00:11:b9:09:04:ff 115.70.68.136 123/udp ntp - 115-70-68-136.ip4.exetel.com.au
12 00:11:b9:09:04:ff 119.18.6.37 123/udp ntp - smtp.juneks.com.au
31 00:11:b9:09:04:ff 129.250.35.251 123/udp ntp - y.ns.gin.ntt.net
3 00:11:b9:09:04:ff 129.250.35.251,192.168.68.204 40756/1,17 unknown - 3(NXDOMAIN)
18 00:11:b9:09:04:ff 13.55.50.68 123/udp ntp - ec2-13-55-50-68.ap-southeast-2.compute.amazonaws.com
46700 00:11:b9:09:04:ff 137.116.114.112 40844/tcp unknown - 3(NXDOMAIN)
34 00:11:b9:09:04:ff 139.180.160.82 123/udp ntp - syd.clearnet.pw
6 00:11:b9:09:04:ff 139.99.135.247 123/udp ntp - vps-b7eaeed7.vps.ovh.ca
76 00:11:b9:09:04:ff 142.250.101.108 587/tcp submission smtp.gmail.com dz-in-f108.1e100.net
230 00:11:b9:09:04:ff 142.250.101.109 587/tcp submission smtp.gmail.com dz-in-f109.1e100.net
2065 00:11:b9:09:04:ff 142.250.141.108 587/tcp submission smtp.gmail.com dd-in-f108.1e100.net
1500 00:11:b9:09:04:ff 142.250.141.109 587/tcp submission smtp.gmail.com dd-in-f109.1e100.net
380 00:11:b9:09:04:ff 142.251.2.108 587/tcp submission smtp.gmail.com dl-in-f108.1e100.net
1600 00:11:b9:09:04:ff 142.251.2.109 587/tcp submission smtp.gmail.com dl-in-f109.1e100.net
15719 00:11:b9:09:04:ff 149.112.112.112 53/udp domain - dns.quad9.net
54 00:11:b9:09:04:ff 150.107.75.115 123/udp ntp - time.pickworth.net
16 00:11:b9:09:04:ff 159.196.178.7 123/udp ntp - 3(NXDOMAIN)
37 00:11:b9:09:04:ff 159.196.3.239 123/udp ntp - 159-196-3-239.9fc403.mel.nbn.aussiebb.net
16 00:11:b9:09:04:ff 159.196.45.149 123/udp ntp - record
20 00:11:b9:09:04:ff 162.159.200.1 123/udp ntp - time.cloudflare.com
24 00:11:b9:09:04:ff 162.159.200.123 123/udp ntp - time.cloudflare.com
32 00:11:b9:09:04:ff 167.179.162.50 123/udp ntp - 167-179-162-50.a7b3a2.bne.nbn.aussiebb.net
16 00:11:b9:09:04:ff 172.105.179.71 123/udp ntp - 172-105-179-71.ip.linodeusercontent.com
100218 00:11:b9:09:04:ff 172.67.221.214 443/tcp https irmsg.vizdynamics.com 3(NXDOMAIN)
20826 00:11:b9:09:04:ff 172.67.221.214 80/tcp http irmsg.vizdynamics.com 3(NXDOMAIN)
6 00:11:b9:09:04:ff 180.150.8.191 123/udp ntp - bitburger.simonrumble.com
11 00:11:b9:09:04:ff 192.168.68.1 123/udp ntp - 3(NXDOMAIN)
1 00:11:b9:09:04:ff 192.168.68.1,192.168.68.203 34051/1,17 unknown - 3(NXDOMAIN)
1 00:11:b9:09:04:ff 192.168.68.1,192.168.68.203 35951/1,17 unknown - 3(NXDOMAIN)
1 00:11:b9:09:04:ff 192.168.68.1,192.168.68.203 36204/1,17 unknown - 3(NXDOMAIN)
1 00:11:b9:09:04:ff 192.168.68.1,192.168.68.203 38036/1,17 unknown - 3(NXDOMAIN)
1 00:11:b9:09:04:ff 192.168.68.1,192.168.68.203 40942/1,17 unknown - 3(NXDOMAIN)
1 00:11:b9:09:04:ff 192.168.68.1,192.168.68.203 44065/1,17 unknown - 3(NXDOMAIN)
1 00:11:b9:09:04:ff 192.168.68.1,192.168.68.203 48603/1,17 unknown - 3(NXDOMAIN)
1 00:11:b9:09:04:ff 192.168.68.1,192.168.68.203 55896/1,17 unknown - 3(NXDOMAIN)
1 00:11:b9:09:04:ff 192.168.68.1,192.168.68.204 42573/1,17 unknown - 3(NXDOMAIN)
1 00:11:b9:09:04:ff 192.168.68.1,192.168.68.204 52984/1,17 unknown - 3(NXDOMAIN)
1 00:11:b9:09:04:ff 192.168.68.1,192.168.68.204 57294/1,17 unknown - 3(NXDOMAIN)
31 00:11:b9:09:04:ff 192.168.68.1 67/udp bootps - 3(NXDOMAIN)
6 00:11:b9:09:04:ff 194.195.249.28 123/udp ntp - ap-southeast-2.clearnet.pw
50 00:11:b9:09:04:ff 203.12.5.225 123/udp ntp - my.blockbluemedia.com
24 00:11:b9:09:04:ff 203.14.0.250 123/udp ntp - tic.ntp.telstra.net
50 00:11:b9:09:04:ff 212.227.81.55 80/tcp http ipv4.connman.net ipv4.connman.net
48 00:11:b9:09:04:ff 220.158.215.20 123/udp ntp - 220-158-215-20.broadband.telesmart.co.nz
99 00:11:b9:09:04:ff 224.0.0.251 5353/udp mdns - mdns.mcast.net
6187 00:11:b9:09:04:ff 23.101.229.107 40844/tcp unknown - 3(NXDOMAIN)
1 00:11:b9:09:04:ff 239.255.255.250 1902/udp unknown - 3(NXDOMAIN)
38 00:11:b9:09:04:ff 255.255.255.255 67/udp bootps - 3(NXDOMAIN)
48 00:11:b9:09:04:ff 27.124.125.250 123/udp ntp - ntp1.ds.network
6 00:11:b9:09:04:ff 45.124.53.221 123/udp ntp - ns1.adelaidewebsites.com.au
8 00:11:b9:09:04:ff 67.219.100.202 123/udp ntp - mel.clearnet.pw
494 00:11:b9:09:04:ff 74.125.137.108 587/tcp submission smtp.gmail.com dy-in-f108.1e100.net
643 00:11:b9:09:04:ff 74.125.137.109 587/tcp submission smtp.gmail.com dy-in-f109.1e100.net
70 00:11:b9:09:04:ff 82.165.8.211 80/tcp http - 3(NXDOMAIN)
15739 00:11:b9:09:04:ff 9.9.9.9 53/udp domain - dns9.quad9.net
```
4. Inspect the list and go through each host/protocol and build a whitelist of what you want to allow.
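
To make the inspection step less tedious, the report can be collapsed to one row per remote endpoint before you decide what to whitelist. The following is a rough Python sketch, not part of APMonitor: the function name `summarise` is hypothetical, and it assumes the seven whitespace-separated columns printed by the script above (count, MAC, IP, port/proto, service, application host, reverse DNS name).

```python
from collections import defaultdict


def summarise(lines):
    """Total packet counts per (remote ip, port/proto), busiest endpoints first."""
    totals = defaultdict(int)
    names = {}
    for line in lines:
        fields = line.split()
        # Skip the "added: ..." progress lines and anything malformed.
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        count, _mac, ip, port_proto, _service, app_host, hostname = fields
        key = (ip, port_proto)
        totals[key] += int(count)
        # Prefer the application-layer name (HTTP Host / TLS SNI) over reverse DNS.
        names[key] = app_host if app_host != "-" else hostname
    return sorted(
        ((n, ip, pp, names[(ip, pp)]) for (ip, pp), n in totals.items()),
        reverse=True,
    )


if __name__ == "__main__":
    import sys

    # Example: ./analyse.sh 2>/dev/null | python3 summarise.py
    for n, ip, pp, name in summarise(sys.stdin):
        print(f"{n:>8}  {ip:<18} {pp:<12} {name}")
```

Merging the rows for both MAC addresses is a deliberate simplification; keep the MAC in the key if you want a separate whitelist per device.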
# Recommended configuration for real-time environments
To put APMonitor into near-realtime mode so that it checks resources multiple times per second, use these global settings:
- Dial up threads with `-t 15` on the command line or `max_threads: 15` in the site config,
- set `max_retries` to `1` and
- dial down `max_try_secs` to `10` or `15` seconds
for real-time environments.
NB: If you are running `APMonitor.py` out of `systemd` with a default install and do not specify `max_threads`, the thread count defaults to `20`.
> [!WARNING]
> You need to make sure your configs have enough threads to finish in << 10 seconds to get near-realtime performance.
> Make sure `max_threads` & `max_try_secs` are configured appropriately. Also note that separate site configs are executed
> in parallel as subprocesses, so any down monitors in one site do not slow down monitors in other sites, regardless of settings.
>
> Note that what usually slows down a site configuration is monitors that are down:
> you need enough threads to cover the maximum number of down monitors at any one time, on average.
> We say 'on average' because not all monitors are polled simultaneously after a decent period of
> a site config having been operational.
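
As a back-of-envelope check on the sizing advice above, the arithmetic can be written out in Python. This is a simplified model rather than APMonitor's scheduler: the helper name `worst_case_pass_secs` is hypothetical, and it assumes each down monitor ties up one thread for the full `max_try_secs` while healthy monitors take negligible time.

```python
import math


def worst_case_pass_secs(down_monitors, max_threads, max_try_secs):
    """Rough upper bound on one pass over a site, driven only by down monitors."""
    if max_threads < 1:
        raise ValueError("max_threads must be at least 1")
    # Down monitors are worked through in "waves" of max_threads at a time,
    # and each wave can block for up to max_try_secs.
    waves = math.ceil(down_monitors / max_threads)
    return waves * max_try_secs


if __name__ == "__main__":
    for down in (0, 5, 15, 16, 30):
        secs = worst_case_pass_secs(down, max_threads=15, max_try_secs=10)
        print(f"{down:>2} down monitors -> up to {secs}s per pass")
```

Under this model, 15 threads with `max_try_secs: 10` stay within a 10 second pass only while no more than 15 monitors are down at once; a sixteenth pushes the bound to 20 seconds.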
# Recommended configuration for securing IOT/OT/ICS networks
***IOT is not supposed to be a thing*** - to compensate **if you have an NVR**, you need L2 monitoring of MAC address changes for each OT/ICS device such as cameras, NVRs & Security Computer on your IOT network.
Use Layer 2 Port MAC Change Monitoring,
Layer 4 HTTPS Self-Signed Certificate Pinning and
Layer 2 MAC Address Pinning so your network can't be tampered with.
To avoid vendor backdoors, disable IPV6 and stop your IOT devices from communicating directly with The Internets excepting whitelisted addresses for purposes you specify (don't whitelist any cloud admin reverse shells).
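
For readers new to certificate pinning, here is a minimal Python sketch of the idea using only the standard library. It is an illustration, not the check APMonitor performs: the function names are hypothetical, and it assumes the pinned value is the SHA-256 digest of the DER-encoded server certificate written as hex, which matches the 64-character `ssl_fingerprint` values used in the configs.

```python
import hashlib
import hmac
import socket
import ssl


def fingerprint_sha256(der_bytes):
    """Hex SHA-256 digest of a DER-encoded certificate."""
    return hashlib.sha256(der_bytes).hexdigest()


def fetch_cert_der(host, port=443, timeout=10):
    """Fetch the server's leaf certificate without CA validation."""
    # CA and hostname checks are switched off on purpose: a self-signed
    # certificate has no chain to validate, so trust comes from the pin.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert(binary_form=True)


def matches_pin(der_bytes, pinned_hex):
    """Compare a certificate against a pinned fingerprint (colons and case ignored)."""
    normalised = pinned_hex.replace(":", "").lower()
    return hmac.compare_digest(fingerprint_sha256(der_bytes), normalised)
```

A check would then be `matches_pin(fetch_cert_der("192.168.1.12"), pinned)`; any change of certificate on the device, legitimate or not, makes the comparison fail and should raise an alert.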
# Recommended configuration of Site24x7 Heartbeat Monitor Thresholds for HA Availability Monitoring
You do need to configure Site24x7's Heartbeat Monitoring to achieve high-availability second opinion availability monitoring.
As an exemplar, for the following monitored resource:
```yaml
monitors:
- type: http
name: home-nas
address: https://192.168.1.12/api/bump
expect: "(C) COPYRIGHT 2005, Super NAS Storage Inc."
ssl_fingerprint: a1b2c3d4e5f67890abcdef1234567890abcdef1234567890abcdef1234567890
heartbeat_url: https://plus.site24x7.com/hb/your-unique-heartbeat-id/homenas
heartbeat_every_n_secs: 300
```
Set up Site24x7 as follows:

This will send a heartbeat to [Site24x7](https://site24x7.com) every 5 minutes, and Site24x7 will drop an alarm whenever a heartbeat
doesn't arrive or arrives out of sequence +/- 1 minute (i.e., if the heartbeat doesn't arrive or is > 60 seconds out).
This ensures availability monitoring will always function, even when one of APMonitor or Site24x7 is down.
This also means you don't need to expose internal LAN network resources to The Internets.
APMonitor's near-realtime capabilities will deliver heartbeats +/- 10 secs, so if you want high-precision alerts,
configure Site24x7 to drop an alarm if a heartbeat does not arrive bang on 5 minutes apart +/- 10 secs.
To see the accuracy, configure Site24x7 as follows:

Site24x7 will record the error in their dashboard for anything that is more than +/- 1000 ms out,
so you can keep a record of how accurate the near-realtime heartbeat timing is.
See Site24x7 docs for more info:
- [Heartbeat Monitoring](https://www.site24x7.com/help/heartbeat/)
- [Thresholds configuration](https://www.site24x7.com/help/admin/configuration-profiles/threshold-and-availability/server-monitor.html)
NB: "+/- 10 secs" means your errors should be measurable in 10ths of a minute. Once Mercator Queues are added, this will
drop down to "+/- 1 sec" or possibly "+/- 100 ms", depending on how well Python performs with high-speed realtime
programming. A workaround in the meantime is to make sure your number of threads is equal to the number of monitored
resources - something that is not necessarily practical or required in most settings.
# Recommended configuration for 'Hands-Off' alarm notification pacing
If you want to avoid the need to connect to the monitoring server to hush alarms as they happen and ensure you receive
UP notifications as soon as things return to normal, you might also want to consider alarm notification pacing, so that
recently down resources generate more frequent messages, whilst long outages are notified less frequently. To enable:
- Set `notify_every_n_secs` to `3600` seconds (i.e., 1 hour), and
- Set `after_every_n_notifications` to `8`,
which will slow alarms down to one per hour after 8 notifications.
An alternate config for monitored resources that have long outages is as follows:
- Set `notify_every_n_secs` to `43200` (i.e., 12 hours), and
- Set `after_every_n_notifications` to `6`,
which will slow alarms down to one every 12 hours after 6 notifications, which means after a few days you will only get at most one alarm whilst asleep.
To see how the alarm pacing will accelerate then subsequently delay notifications, use the example calculations spreadsheet in [20151122 Reminder Timing with Quadratic Bezier Curve.xlsx](devnotes/20151122%20Reminder%20Timing%20with%20Quadratic%20Bezier%20Curve.xlsx) to experiment with various configuration scenarios:

Note that alarm pacing can be set at a global level in the `site:` config, and is overridden when set at a per monitored resource level in the `monitors:` section of the config.
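
To get a feel for the shape of the pacing curve without opening the spreadsheet, here is a small Python sketch. It is an assumption-laden illustration and may not match the exact curve `APMonitor.py` uses: it takes a quadratic Bezier curve with control points `0`, `0` and `notify_every_n_secs`, samples it at `n_sent / after_every_n_notifications`, and holds the interval at `notify_every_n_secs` from then on. The function names are hypothetical.

```python
def quadratic_bezier(p0, p1, p2, t):
    """Standard quadratic Bezier: B(t) = (1-t)^2 p0 + 2(1-t)t p1 + t^2 p2."""
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2


def notification_interval(n_sent, notify_every_n_secs, after_every_n_notifications):
    """Seconds to wait before the next notification, once n_sent have gone out."""
    if after_every_n_notifications <= 0 or n_sent >= after_every_n_notifications:
        return float(notify_every_n_secs)  # fully decayed: steady-state pace
    t = n_sent / after_every_n_notifications
    # Assumed control points (0, 0, max): short gaps first, growing quadratically.
    return quadratic_bezier(0.0, 0.0, float(notify_every_n_secs), t)


if __name__ == "__main__":
    for n in range(0, 10):
        print(n, notification_interval(n, 3600, 8))
```

With `notify_every_n_secs: 3600` and `after_every_n_notifications: 8`, this model waits about 56 seconds after the first notification, 15 minutes after the fourth, and a full hour from the eighth onwards.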
# Recommended configuration for running multiple site configurations & panes of glass
APMonitor supports monitoring multiple sites from a single service instance by passing multiple configuration files on the command line. Each config file is processed as an independent site with its own statefile, RRD database, and MRTG index page under `/var/www/html/mrtg/<site-name>/`.
This is useful for running multiple single panes of glass out of one monitoring box.
If you are running multiple single panes of glass out of one computer, consider buying a USB Air Mouse or three till you find one that works well for you, like this one:

## How it works
When multiple config files are specified, APMonitor spawns one subprocess per config file and runs them concurrently, joining all subprocesses before exiting. Each subprocess:
- Derives its own statefile automatically from the config filename under `/var/tmp/APMonitor/` (e.g. `apmonitor-config.yaml` → `/var/tmp/APMonitor/apmonitor-config.statefile.json`)
- Writes its MRTG index and detail pages to `/var/www/html/mrtg/<site-name>/` where `<site-name>` is derived from `site.name` in the config
- Maintains completely independent monitoring state, notification history, and RRD data
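
The fan-out described above can be sketched with the standard library as follows. This is an illustrative outline, not the code in `APMonitor.py`: the names `derive_statefile` and `run_sites` are hypothetical, and the default `base_cmd` is a placeholder for however the per-site worker is actually launched.

```python
import subprocess
import sys
from pathlib import Path

STATE_DIR = Path("/var/tmp/APMonitor")


def derive_statefile(config_path, state_dir=STATE_DIR):
    """Statefile path derived from the config filename stem."""
    return Path(state_dir) / (Path(config_path).stem + ".statefile.json")


def run_sites(config_paths, base_cmd=(sys.executable, "APMonitor.py")):
    """Start one subprocess per config, then join them all; returns the exit codes."""
    # Start everything first so the sites run concurrently...
    procs = [subprocess.Popen([*base_cmd, str(cfg)]) for cfg in config_paths]
    # ...then wait for every subprocess before returning.
    return [p.wait() for p in procs]
```

Because each site is a separate process with its own statefile, a slow or failing site cannot hold a lock or a thread that another site needs.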
## Systemd service configuration
Edit `/etc/systemd/system/apmonitor.service` to list all config files on the `ExecStart` line:
```
[Unit]
Description=APMonitor Network Resource Monitor
After=network.target
[Service]
Type=simple
ExecStart=/bin/bash -c 'while true; do /usr/local/bin/APMonitor.py -t 20 -vv /usr/local/etc/apmonitor-config.yaml /usr/local/etc/site2-config.yaml /usr/local/etc/site3-config.yaml --generate-mrtg-config; sleep 10; done'
Restart=always
RestartSec=10
User=monitoring
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
```
It is useful to keep a commented-out single-site `ExecStart` line for quick debugging:
```
#ExecStart=/bin/bash -c 'while true; do /usr/local/bin/APMonitor.py -vv /usr/local/etc/apmonitor-config.yaml --generate-mrtg-config; sleep 10; done'
```
After editing the service file, reload systemd and restart the service:
```bash
sudo systemctl daemon-reload
sudo systemctl restart apmonitor.service
```
Note that `make install` will preserve a customized `ExecStart` line on subsequent installs; it only writes the default if no service file exists yet.
## Statefiles and MRTG output
Each config file produces its own set of derived files. Statefiles are stored under `/var/tmp/APMonitor/` (mode 755, so www-data has no write access) and MRTG output is written into a per-site subdirectory of the MRTG working directory:
| Config file | Statefile | MRTG index |
|---|---|---|
| `apmonitor-config.yaml` | `/var/tmp/APMonitor/apmonitor-config.statefile.json` | `http://:888/mrtg/HomeLab/` |
| `site2-config.yaml` | `/var/tmp/APMonitor/site2-config.statefile.json` | `http://:888/mrtg/TellusionLab/` |
| `site3-config.yaml` | `/var/tmp/APMonitor/site3-config.statefile.json` | `http://:888/mrtg/OfficeLab/` |
The MRTG subdirectory name comes from `site.name` in each config file (sanitised to a filesystem-safe string), not from the config filename. The statefile name is always derived from the config filename stem.
## Default state file location
On Unix-like systems, APMonitor stores all statefiles under `/var/tmp/APMonitor/`:
- Directory is created automatically with mode `755` (no group write; www-data is explicitly excluded)
- Persists across reboots (unlike `/tmp`)
- All sibling files (`.json`, `.json.new`, `.json.old`, `.mrtg.cfg`, `.rrd/`) live in this directory
The `-s/--statefile` flag overrides this for single-config invocations. It is not valid when multiple config files are specified.
## Migrating statefiles from older versions
If upgrading from a version that stored statefiles in `/var/tmp/` directly, run:
```bash
sudo make migrate
```
This performs a two-phase migration:
1. Renames `apmonitor-statefile.*` → `apmonitor-config.statefile.*` in `/var/tmp/` (legacy name fix)
2. Moves all `apmonitor-*.statefile.*` files and `.rrd` directories from `/var/tmp/` into `/var/tmp/APMonitor/`
The service is stopped before migration and restarted afterwards. If a destination file already exists it is skipped with a warning rather than overwritten.
## Threading with multiple sites
The `-t` flag sets the number of monitor-checking threads **per site**, not globally. With three sites and `-t 20`, up to 60 threads may be active concurrently across all subprocesses. Size `-t` based on the largest single site's monitor count rather than the total across all sites.
## Notes
- `-s/--statefile` is not valid when multiple config files are specified; each site always derives its own statefile automatically from the config filename.
- `make install` writes a default single-site `ExecStart`. Edit it manually after installation to add additional config files; subsequent `make install` runs will preserve your customized `ExecStart`.
- `make test-config` only tests the default config at `$(CONFIG_DIR)/apmonitor-config.yaml`. Test additional configs directly: `APMonitor.py --test-config /usr/local/etc/site2-config.yaml`.
# Recommended configuration for SNMP monitoring on Debian Linux
To enable SNMP monitoring on a Debian host so that APMonitor can poll it, install and configure `snmpd` with a read-only community string restricted to your APMonitor machine.
## Install
```bash
sudo apt install snmpd snmp
```
## Configure `/etc/snmp/snmpd.conf`
Replace the default config with the following minimal read-only configuration:
```
# Listen on all interfaces (lock to a specific IP if preferred)
agentAddress udp:161
# Read-only community, restricted to your APMonitor host only
# Replace 192.168.1.50 with the IP of your APMonitor machine
rocommunity YourCommunityString 192.168.1.50
# Optional: identify the device
sysLocation "Server Room Rack 3"
sysContact "admin@example.com"
sysName "my-debian-host"
```
## Enable and restart
```bash
sudo systemctl restart snmpd
sudo systemctl enable snmpd
```
## Firewall
If the host runs a firewall, allow UDP port 161 from your APMonitor machine only:
```bash
# ufw
sudo ufw allow from 192.168.1.50 to any port 161 proto udp
# iptables
sudo iptables -A INPUT -s 192.168.1.50 -p udp --dport 161 -j ACCEPT
```
## Test from your APMonitor host
```bash
snmpwalk -v 2c -c YourCommunityString 192.168.1.x
```
## Notes
- `rocommunity` is the read-only directive; the absence of any `rwcommunity` line is what keeps access strictly read-only.
- Locking the source IP to your APMonitor machine is the primary access control on a LAN. Do not use `default` or `0.0.0.0/0` unless there is no alternative.
- Change `YourCommunityString` to something non-obvious; `public` is the first string any scanner tries.
- SNMPv3 with authentication and encryption is the correct choice for hosts on networks you do not fully trust. For a closed LAN behind a firewall, SNMPv2c with a non-default community string and source IP restriction is workable.
## APMonitor configuration
Once `snmpd` is running, add a `ports` monitor pointing at the host:
```yaml
- type: ports
  name: my-debian-ports
  address: "snmp://192.168.1.x"
  community: "YourCommunityString"
  check_every_n_secs: 300
```
For host performance monitoring (CPU, memory, disk I/O), use `type: host` instead:
```yaml
- type: host
  name: my-debian-host
  address: "snmp://192.168.1.x"
  community: "YourCommunityString"
  check_every_n_secs: 300
```
# MRTG/RRD Integration for Performance Graphing
APMonitor integrates with MRTG (Multi Router Traffic Grapher) and RRDtool to provide historical performance graphs of resource availability and response times. This integration enables trend analysis, capacity planning, and visual monitoring dashboards.
## Quick Start
Install MRTG and related dependencies:
```bash
sudo make installmrtg
```
This installs nginx on port 888, fcgiwrap for CGI support, and sets up the MRTG web interface.
Enable RRD data collection by running APMonitor with `--generate-mrtg-config`:
```bash
./APMonitor.py -vv -s /var/tmp/apmonitor-statefile.json config.yaml --generate-mrtg-config
```
Access graphs at `http://localhost:888/mrtg//` or `http://:888/mrtg//`.
## How It Works
When `--generate-mrtg-config` is specified:
1. **RRD Collection Enabled**: APMonitor records response times and availability status to RRDtool databases
2. **MRTG Config Generated**: Creates a `.mrtg.cfg` file derived from the statefile path
3. **Site subdirectory created**: MRTG output (index.html, detail pages) is written to `/var/www/html/mrtg//` where `` is sanitised from `site.name` in the config
4. **Web Interface Updated**: Updates `mrtg-rrd.cgi.pl` with the new config path and generates `index.html`
5. **Continuous Updates**: Subsequent runs update RRD files and regenerate the index with latest metrics and outage state
**Output file locations:**
- Statefile: `/var/tmp/APMonitor/.statefile.json`
- MRTG config: `/var/tmp/APMonitor/.statefile.mrtg.cfg`
- RRD databases:
  - Availability monitors: `/var/tmp/APMonitor/.statefile.rrd/-availability.rrd`
  - SNMP monitors: `/var/tmp/APMonitor/.statefile.rrd/-snmp.rrd`
- MRTG index: `/var/www/html/mrtg//index.html`
- Detail pages: `/var/www/html/mrtg//--detail.html`
- Web interface: `http://localhost:888/mrtg//`
## Command Options
Generate MRTG config with default base working directory (`/var/www/html/mrtg`):
```bash
./APMonitor.py apmonitor-config.yaml --generate-mrtg-config
```
Specify a custom base working directory (site subdirectory is always appended):
```bash
./APMonitor.py apmonitor-config.yaml --generate-mrtg-config /var/www/html/graphs
```
## RRD Data Collection
### Availability Monitors (ping, http, quic, tcp, udp)
Each availability monitor's RRD file tracks two metrics:
- **`response_time`** (GAUGE, milliseconds): Time taken for check to complete
  - Range: 0 to unlimited
  - Value: `U` (unknown) when check fails
- **`is_up`** (GAUGE, boolean): Service availability
  - `100` = service up
  - `0` = service down
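As a rough illustration of how one check result maps onto these two data sources, the sketch below builds the value string for an `rrdtool update` call. The function name and the DS order (`response_time` first, then `is_up`) are assumptions made for the example; only the `U`, `100` and `0` encodings come from the list above.

```python
from typing import Optional

def availability_sample(response_ms: Optional[float], is_up: bool) -> str:
    """Build the value part of an `rrdtool update` argument.

    Assumes the DS order is response_time, then is_up. "N" means "now".
    """
    # A failed check stores response_time as unknown ("U").
    rt = "U" if (response_ms is None or not is_up) else f"{response_ms:.3f}"
    up = "100" if is_up else "0"
    return f"N:{rt}:{up}"
```

Storing `is_up` as 0 or 100 means an averaged archive reads directly as an availability percentage.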
### SNMP Monitors (port, ports, host)
All SNMP-family monitors (`port`, `ports`, `host`) use a single unified RRD schema per device. The schema is divided into three sections: per-interface DS pairs (used by `ports`/`port` only), fixed aggregate network DS (used by `ports`/`port`; stored as `U` for `host`), and fixed host performance DS (used by `host`; stored as `U` for `ports`/`port`).
**Filename**: `/var/tmp/APMonitor/.statefile.rrd/-snmp.rrd`
**Per-Interface Data Sources** (one pair per discovered interface, COUNTER; `ports`/`port` only):
- **`if{index}_in`**: Inbound bytes for interface at ifIndex `{index}` (IF-MIB::ifInOctets)
- **`if{index}_out`**: Outbound bytes for interface at ifIndex `{index}` (IF-MIB::ifOutOctets)
DS names use the raw ifIndex integer (e.g., `if1_in`, `if2_out`), not the interface description string. DS order is stable: interfaces are sorted numerically by ifIndex at both create and update time.
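The naming and ordering rule can be sketched as follows. The helper name is hypothetical, and placing each `_in` immediately before its `_out` is an assumption; the text only states that interfaces are sorted numerically by ifIndex.

```python
def interface_ds_names(if_indexes):
    """Return per-interface DS names, sorted numerically by ifIndex."""
    names = []
    # int() conversion matters: a string sort would place "10" before "2".
    for idx in sorted(int(i) for i in if_indexes):
        names.append(f"if{idx}_in")
        names.append(f"if{idx}_out")
    return names
```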
**Fixed Aggregate Network Data Sources** (COUNTER; `ports`/`port` populated, `host` stores `U`):
- **`tcp_retrans`**: Global TCP retransmit segment counter (TCP-MIB::tcpRetransSegs), `ports` only
- **`total_bits_in`**: Sum of inbound octets × 8 across all interfaces
- **`total_bits_out`**: Sum of outbound octets × 8 across all interfaces
- **`total_pkts_in`**: Sum of all inbound packets (unicast + multicast + broadcast) across all interfaces
- **`total_pkts_out`**: Sum of all outbound packets across all interfaces
- **`total_errors_in`**: Sum of inbound interface errors across all interfaces (IF-MIB::ifInErrors)
- **`total_errors_out`**: Sum of outbound interface errors across all interfaces (IF-MIB::ifOutErrors)
- **`total_pkts_ucast`**: Total unicast packets in+out combined across all interfaces
- **`total_pkts_bmcast`**: Total broadcast+multicast packets in+out combined across all interfaces
**System Resource Data Sources** (GAUGE; all types):
- **`cpu_load`**: CPU utilization percentage, range 0–100. Sourced from vendor-specific OIDs (Cisco/HP/Juniper/Ubiquiti) with HOST-RESOURCES-MIB::hrProcessorLoad as fallback. Stored as `U` if unavailable.
- **`memory_pct`**: Memory utilization percentage, range 0–100. Sourced from vendor-specific OIDs with HOST-RESOURCES-MIB::hrStorage as fallback. Stored as `U` if unavailable.
**Fixed Host Performance Data Sources** (COUNTER/GAUGE; `host` populated, `ports`/`port` store `U`):
- **`context_switches`** (COUNTER): Raw context switch counter (UCD-SNMP-MIB::ssRawContexts)
- **`swap_io`** (COUNTER): Raw swap pages in + out combined (UCD-SNMP-MIB::ssRawSwapIn + ssRawSwapOut)
- **`disk_read`** (COUNTER): Disk read bytes summed across all block devices (UCD-DISKIO-MIB::diskIOReadX)
- **`disk_write`** (COUNTER): Disk write bytes summed across all block devices (UCD-DISKIO-MIB::diskIOWriteX)
- **`disk_space_pct`** (GAUGE): Root filesystem utilization percentage 0–100 (HOST-RESOURCES-MIB::hrStorage `/` entry). Also persisted to statefile for display in MRTG index and detail page headers.
- **`swap_used`** (GAUGE): Swap space used in bytes (HOST-RESOURCES-MIB::hrStorage virtual memory entry, with UCD-SNMP-MIB::memTotalSwap - memAvailSwap as fallback)
- **`interrupts`** (COUNTER): Raw hardware interrupt counter (UCD-SNMP-MIB::ssRawInterrupts)
**Fixed Tamper/Network Capacity Data Sources** (GAUGE; `ports` only, `port`/`host` store `U`):
- **`ports_up_count`**: Count of interfaces with oper=up
- **`nvram_flash_bytes`**: Sum of used bytes across NVRAM/flash hrStorage entries
- **`mac_count`**: Count of learned FDB entries via Q-BRIDGE-MIB
- **`arp_count`**: Count of ARP entries via ipNetToPhysicalTable / ipNetToMediaTable
**Total fixed DS count: 22** (11 network/system + 7 host performance + 4 tamper/network). Expected DS count for auto-heal check = `(2 × interface_count) + 22`.
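The DS count formula is easy to express directly. The sketch below uses hypothetical names (`expected_ds_count`, `needs_rebuild`) and assumes the auto-heal check only triggers when the expected count exceeds what the existing RRD contains, as the notes in this section describe.

```python
# 11 network/system + 7 host performance + 4 tamper/network, per the text.
FIXED_DS_COUNT = 11 + 7 + 4

def expected_ds_count(interface_count: int) -> int:
    """Two DS (in/out) per interface plus the fixed DS."""
    return 2 * interface_count + FIXED_DS_COUNT

def needs_rebuild(existing_ds_count: int, interface_count: int) -> bool:
    """True when the RRD has fewer DS than the current interface list needs."""
    return expected_ds_count(interface_count) > existing_ds_count
```

For example, a 24-port switch would need `2 × 24 + 22 = 70` data sources.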
**MRTG Targets generated per monitor type:**
| Target suffix | DS pair | Monitor types | Description |
|---|---|---|---|
| `-bandwidth` | `total_bits_in` / `total_bits_out` | `ports`, `port` | Total bandwidth in/out (bits) |
| `-packets` | `total_pkts_in` / `total_pkts_out` | `ports`, `port` | Total packets in/out |
| `-packets-type` | `total_pkts_ucast` / `total_pkts_bmcast` | `ports`, `port` | Unicast vs broadcast+multicast |
| `-errors` | `total_errors_in` / `total_errors_out` | `ports`, `port` | Interface errors in/out |
| `-retransmits` | `tcp_retrans` / `tcp_retrans` | `ports` only | TCP retransmits (single line) |
| `-system` | `cpu_load` / `memory_pct` | `ports` only | CPU & memory utilization |
| `-tamper` | `ports_up_count` / `nvram_flash_bytes` | `ports` only | Active ports & NVRAM/flash bytes |
| `-network` | `mac_count` / `arp_count` | `ports` only | Learned MACs & ARP entries |
| `-system1` | `cpu_load` / `context_switches` | `host` | CPU & Load |
| `-system2` | `memory_pct` / `swap_io` | `host` | Memory & Paging |
| `-system3` | `disk_read` / `disk_write` | `host` | Disk I/O (Disk Use % in PageTop) |
| `-system4` | `swap_used` / `interrupts` | `host` | System Thrashing |
**Notes:**
- COUNTER type automatically calculates per-second rates and handles 32/64-bit wraparound.
- All interfaces for a device are stored in a single RRD for atomic updates. If the interface list changes, stale DS entries remain in the RRD unused; the RRD is never recreated on interface list change alone.
- If the discovered interface count grows such that the expected DS count exceeds what was created, APMonitor auto-heals by deleting and recreating the RRD on the next run.
- `disk_space_pct` is stored in the RRD as a GAUGE DS and also persisted to the statefile so that `generate_mrtg_config()` and `generate_mrtg_index()` can embed the live value (e.g., `Disk Use: 73.4%`) in MRTG PageTop headers and index cell headings without a live SNMP poll at generation time. Displays as `Disk Use: N/A` until the first successful poll.
- UCD-SNMP-MIB host performance metrics (context switches, swap I/O, disk I/O, interrupts) are Linux `net-snmp` specific. On network devices (Cisco, HP, Juniper, Ubiquiti), these DS will store `U`.
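To make the COUNTER behaviour in the first note concrete, here is a simplified model of how a per-second rate can be derived from two raw counter readings with wraparound handling. It is an illustration of the idea only, not RRDtool's source code, and the function name is hypothetical.

```python
def counter_rate(prev: int, curr: int, elapsed_secs: float) -> float:
    """Per-second rate between two counter readings, allowing for one wrap."""
    if elapsed_secs <= 0:
        raise ValueError("elapsed_secs must be positive")
    delta = curr - prev
    if delta < 0:
        # First assume the counter wrapped at the 32-bit boundary.
        delta += 2**32
        if delta < 0:
            # Still negative, so treat it as a 64-bit counter wrap instead.
            delta += 2**64 - 2**32
    return delta / elapsed_secs
```

A counter reset (for example, a device reboot) looks like a wrap in this model and would produce a spurious spike, which is why maximum values are commonly set on COUNTER data sources.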
### RRD Retention Policy
| Time Range | Resolution | MRTG Standard Rows | APMonitor Default |
|---|---|---|---|
| High-resolution recent | Native step | 1 day native | 31 days native |
| Short-term | 5-minute | 600 (~2 days) | 18600 (~64 days) |
| Medium-term | 30-minute | 600 (~12.5 days) | 18600 (~387 days) |
| Long-term | 1-hour | n/a | 43830 (~5 years) |
| Historical | 1-day | 732 (~2 years) | 22692 (~62 years) |
> [!WARNING]
> Be careful if upgrading to the 1.3.x stream. This release contains RRD schema changes that require existing RRD files to be deleted and recreated before upgrading. APMonitor will auto-heal existing RRDs on first run when `--generate-rrds` or `--generate-mrtg-config` is specified.
To use custom retention, modify the row constants in `create_rrd_rras()`:
```python
rows_1day_native = 86400 // step_secs * 31 # 31 days at native resolution
rows_2days_5min = 18600 # ~64 days at 5-min
rows_12days_30min = 18600 # ~387 days at 30-min
rows_5years_1hour = 43830 # ~5 years at 1-hour
rows_2years_daily = 22692 # ~62 years at 1-day
```
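When choosing custom row counts, the coverage of each archive is simply rows multiplied by resolution. The helpers below are hypothetical (they are not part of APMonitor) and only reproduce the arithmetic behind the retention table and the `rows_1day_native` expression.

```python
SECS_PER_DAY = 86400

def coverage_days(rows: int, resolution_secs: int) -> float:
    """Days of history held by an archive of `rows` rows at the given resolution."""
    return rows * resolution_secs / SECS_PER_DAY

def native_rows(step_secs: int, days: int = 31) -> int:
    """Rows needed to hold `days` of data at the native step."""
    return SECS_PER_DAY // step_secs * days
```

For instance, `coverage_days(18600, 1800)` gives 387.5 days for the 30-minute archive, and `native_rows(60)` gives 44640 rows for a 60-second step.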
## Working with RRD Files Directly
```bash
# Query availability RRD database info
rrdtool info /var/tmp/APMonitor/apmonitor-config.statefile.rrd/monitor-name-availability.rrd
# Query SNMP RRD database info
rrdtool info /var/tmp/APMonitor/apmonitor-config.statefile.rrd/switch-snmp.rrd
# Run APMonitor with MRTG & RRD enabled
./APMonitor.py -vv apmonitor-config.yaml --generate-mrtg-config
# Check when the RRD was created
ls -la /var/tmp/APMonitor/apmonitor-config.statefile.rrd/tellusion-gw-availability.rrd
# Dump RRD info to see its structure
rrdtool info /var/tmp/APMonitor/apmonitor-config.statefile.rrd/tellusion-gw-availability.rrd | head -50
# Check the last update timestamp
rrdtool lastupdate /var/tmp/APMonitor/apmonitor-config.statefile.rrd/tellusion-gw-availability.rrd
# Fetch the last 300 seconds
rrdtool fetch /var/tmp/APMonitor/apmonitor-config.statefile.rrd/tellusion-gw-availability.rrd AVERAGE -s end-300 -e now
# Fetch SNMP interface data
rrdtool fetch /var/tmp/APMonitor/apmonitor-config.statefile.rrd/switch-snmp.rrd AVERAGE -s end-3600 -e now
```
**References:**
- [MRTG-RRD Documentation](https://directory.fsf.org/wiki/Mrtg-rrd)
- [mrtg-rrd.cgi FAQ](https://web.archive.org/web/20081228131907/http://www.fi.muni.cz:80/~kas/mrtg-rrd/cvsweb.cgi/FAQ?rev=HEAD)
- *System Performance Tuning*, 2nd Ed., Gian-Paolo D. Musumeci & Mike Loukides (O'Reilly): the canonical reference for the host performance metrics collected by `type: host`
**Note:** RRD data collection is disabled by default. Run with `--generate-mrtg-config` once to enable, then continue normal monitoring to collect historical data.
# `APMonitor.py` YAML/JSON Site Configuration Options
APMonitor uses a YAML or JSON configuration file to define the site being monitored and the resources to check. The configuration consists of two main sections: site-level settings that apply globally, and per-monitor settings that define individual resources to check.
## Complete Example Configuration
Here's a complete example showing all available configuration options:
```yaml
site:
  name: "HomeLab"
  email_server:
    smtp_host: "smtp.gmail.com"
    smtp_port: 587
    smtp_username: "alerts@example.com"
    smtp_password: "app_password_here"
    from_address: "alerts@example.com"
    use_tls: true
  outage_emails:
    - email: "admin@example.com"
      email_outages: true
      email_recoveries: true
      email_reminders: true
    - email: "manager@example.com"
      email_outages: yes
      email_recoveries: yes
      email_reminders: no
  outage_webhooks:
    - endpoint_url: "https://api.pushover.net/1/messages.json"
      request_method: POST
      request_encoding: JSON
      request_prefix: "token=your_app_token&user=your_user_key&message="
      request_suffix: ""
  max_threads: 1
  max_retries: 3
  max_try_secs: 20
  check_every_n_secs: 60
  notify_every_n_secs: 600
  after_every_n_notifications: 1
monitors:
  # Single-port MAC-pinning monitor (hidden from MRTG display, monitoring continues)
  - type: port
    name: "switch-port0"
    address: snmp://192.168.1.6
    community: TellusionLab
    check_every_n_secs: 10
    notify_every_n_secs: 60
    after_every_n_notifications: 6
    port: 0
    mac: 18:E8:29:45:F8:F7
    always_up: yes
    display: false
  # Switch port status + SNMP metrics monitoring
  - type: ports
    name: office-switch
    address: "snmp://192.168.1.6"
    community: "public"
    percentile: 95
    check_every_n_secs: 10
    notify_every_n_secs: 3600
    after_every_n_notifications: 1
  # Host performance monitoring (CPU, memory, disk I/O, swap, interrupts)
  - type: host
    name: debmon-host
    address: "snmp://192.168.1.10"
    community: "public"
    check_every_n_secs: 300
  # TCP port check with send/receive
  - type: tcp
    name: smtp-server
    address: "tcp://mail.example.com:25"
    send: "EHLO apmonitor\r\n"
    content_type: text
    expect: "250"
    check_every_n_secs: 60
  # TCP connection-only check
  - type: tcp
    name: mysql-db
    address: "tcp://192.168.1.100:3306"
    check_every_n_secs: 30
  # UDP send with hex data
  - type: udp
    name: custom-protocol
    address: "udp://192.168.1.200:9999"
    send: "01 02 03 04"
    content_type: hex
    expect: "OK"
    check_every_n_secs: 60
  # UDP send with text data
  - type: udp
    name: syslog-collector
    address: "udp://192.168.1.50:514"
    send: "<134>APMonitor: test message"
    check_every_n_secs: 300
  - type: ping
    name: home-fw
    address: "192.168.1.1"
    check_every_n_secs: 60
    email: true
    heartbeat_url: "https://hc-ping.com/uuid-here"
    heartbeat_every_n_secs: 300
  - type: http
    name: in3245622
    address: "http://192.168.1.21/Login?oldUrl=Index"
    expect: "System Name: HomeLab"
    check_every_n_secs: 120
    notify_every_n_secs: 3600
    after_every_n_notifications: 5
    email: yes
  - type: http
    name: json-api
    address: "https://api.example.com/webhook"
    send: '{"event": "test", "status": "ok"}'
    content_type: "application/json"
    expect: "success"
  - type: http
    name: nvr0
    address: "https://192.168.1.12/api/system"
    expect: "nvr0"
    ssl_fingerprint: "a1b2c3d4e5f67890abcdef1234567890abcdef1234567890abcdef1234567890"
    ignore_ssl_expiry: true
    email: false
    heartbeat_url: "https://plus.site24x7.com/hb/uuid/nvr0"
    heartbeat_every_n_secs: 60
  - type: quic
    name: fast-api
    address: "https://192.168.1.50/api/health"
    expect: "ok"
    check_every_n_secs: 30
```
## site: configuration options
The `site` section defines global settings for the monitoring site.
### Required Fields
- **`name`** (string): The name of the site being monitored. Used in notification messages and as the MRTG output subdirectory name (sanitised to a filesystem-safe string).
```yaml
site:
  name: "HomeLab"
```
### Optional Fields
- **`email_server`** (object, optional): SMTP server configuration for sending email notifications. Required if `outage_emails` is configured.
```yaml
email_server:
  smtp_host: "smtp.gmail.com"
  smtp_port: 587
  smtp_username: "alerts@example.com"
  smtp_password: "app_password_here"
  from_address: "alerts@example.com"
  use_tls: true
```
- **`smtp_host`** (string, required): SMTP server hostname or IP address
- **`smtp_port`** (integer, required): SMTP server port (typically 587 for TLS, 465 for SSL, 25 for unencrypted). Must be between 1 and 65535
- **`smtp_username`** (string, optional): SMTP authentication username
- **`smtp_password`** (string, optional): SMTP authentication password. Use app-specific passwords for Gmail/Google Workspace
- **`from_address`** (string, required): Email address to use in the "From" field. Must be a valid email address
- **`use_tls`** (boolean, optional): Whether to use TLS/STARTTLS encryption. Default: true
**Note**: For Gmail/Google Workspace, you must use an [app-specific password](https://support.google.com/accounts/answer/185833) rather than your account password. Port 587 with `use_tls: true` is the recommended configuration for most SMTP servers.
- **`outage_emails`** (list of objects, optional): Email addresses to notify when resources go down or recover. Requires `email_server` to be configured.
```yaml
outage_emails:
  - email: "admin@example.com"
    email_outages: true
    email_recoveries: true
    email_reminders: true
  - email: "oncall@example.com"
    email_outages: yes
    email_recoveries: no
```
- **`email`** (string, required): Valid email address
- **`email_outages`** (boolean/integer/string, optional): Send email when resource goes down. Default: true
- **`email_recoveries`** (boolean/integer/string, optional): Send email when resource recovers. Default: true
- **`email_reminders`** (boolean/integer/string, optional): Send email for ongoing outage reminders. Default: true
- **`outage_webhooks`** (list of objects, optional): Webhook endpoints to call when resources go down or recover.
```yaml
outage_webhooks:
  - endpoint_url: "https://api.example.com/alerts"
    request_method: POST
    request_encoding: JSON
    request_prefix: ""
    request_suffix: ""
```
- **`endpoint_url`** (string, required): Valid URL with scheme and host
- **`request_method`** (string, required): HTTP method, must be `GET` or `POST`
- **`request_encoding`** (string, required): Message encoding format:
- `URL`: URL-encode the message (for query parameters or form data)
- `HTML`: HTML-escape the message
- `JSON`: Send as JSON object with `message` field (POST only)
- `CSVQUOTED`: CSV-quote the message for comma-separated values
- **`request_prefix`** (string, optional): String to prepend to encoded message (e.g., API tokens, field names)
- **`request_suffix`** (string, optional): String to append to encoded message
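The four encodings can be approximated with the Python standard library. This sketch is an assumption about how the options might behave, not APMonitor's implementation: the function name `encode_message` is hypothetical, and the text does not say exactly how `request_prefix` and `request_suffix` combine with the `JSON` encoding, so the example simply concatenates them around the encoded message in every case.

```python
import csv
import html
import io
import json
from urllib.parse import quote

def encode_message(message: str, encoding: str, prefix: str = "", suffix: str = "") -> str:
    """Encode an alert message, then wrap it in the configured prefix/suffix."""
    if encoding == "URL":
        body = quote(message, safe="")           # percent-encode everything unsafe
    elif encoding == "HTML":
        body = html.escape(message)              # &, <, >, quotes -> entities
    elif encoding == "JSON":
        body = json.dumps({"message": message})  # object with a `message` field
    elif encoding == "CSVQUOTED":
        buf = io.StringIO()
        csv.writer(buf, quoting=csv.QUOTE_ALL).writerow([message])
        body = buf.getvalue().rstrip("\r\n")     # drop the row terminator
    else:
        raise ValueError(f"unsupported request_encoding: {encoding!r}")
    return f"{prefix}{body}{suffix}"
```

With `URL` encoding and a prefix such as `msg=`, the result is a ready-made query string or form body.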
- **`max_threads`** (integer, optional): Number of concurrent threads for checking resources in parallel. Must be ≥ 1. Default: 1 (single-threaded). Can be overridden by command line `-t` option.
```yaml
max_threads: 1
```
**Note**: For near-realtime monitoring environments, set `max_threads` to 5-15 to enable parallel checking of multiple resources. Single-threaded mode (1) is recommended for small systems like Raspberry Pi or when log clarity is important. This setting is overridden by the `-t` command line argument if specified.
- **`max_retries`** (integer, optional): Number of times to retry failed checks before marking resource as down. Must be ≥ 1. Default: 3
```yaml
max_retries: 3
```
**Note**: For near-realtime monitoring, set `max_retries: 1` to reduce detection latency. Higher values (3-5) are better for unstable networks where transient failures are common.
- **`max_try_secs`** (integer, optional): Timeout in seconds for each individual check attempt. Must be ≥ 1. Default: 20
```yaml
max_try_secs: 20
```
- **`check_every_n_secs`** (integer, optional): Default seconds between checks for all monitors. Individual monitors can override this with their own `check_every_n_secs` setting. Must be ≥ 1. Default: 60
```yaml
check_every_n_secs: 300
```
**Note**: This sets the baseline check interval for all monitors. Can be overridden per-monitor for resources requiring different check frequencies. When a monitor's configuration changes (detected via SHA-256 checksum), it is checked immediately regardless of this interval.
- **`notify_every_n_secs`** (integer, optional): Default minimum seconds between outage notifications for all monitors. Individual monitors can override this with their own `notify_every_n_secs` setting. Must be ≥ 1. Default: 600
```yaml
notify_every_n_secs: 1800
```
**Note**: This sets the baseline notification throttling interval. Combined with `after_every_n_notifications`, controls the notification escalation curve for all monitors unless overridden per-monitor.
- **`after_every_n_notifications`** (integer, optional): Default number of notifications after which the notification interval reaches `notify_every_n_secs` for all monitors. Individual monitors can override this with their own `after_every_n_notifications` setting. Must be ≥ 1. Default: 1 (constant notification intervals)
```yaml
after_every_n_notifications: 1
```
**Note**: When set to a value > 1, notification intervals start shorter and gradually increase following a quadratic Bezier curve until reaching `notify_every_n_secs` after the specified number of notifications. This provides more frequent alerts at the start of an outage when immediate attention is needed, then reduces notification frequency as the outage continues. A value of 1 maintains constant notification intervals (original behavior).
- **`alarms`** (boolean/integer/string, optional): Master switch to enable/disable all outage/recovery/reminder notifications for every monitor in this site. Accepts: `true`/`yes`/`on`/`1` (case-insensitive) for enabled, `false`/`no`/`off`/`0` for disabled. Default: true
```yaml
alarms: false
```
**Note**: When set to `false`, no email or webhook notifications are sent for any monitor in the site. Monitoring, state tracking, heartbeats, RRD collection, and MRTG display all continue unaffected. Useful for silencing a site during planned maintenance or initial deployment. Can be overridden per-monitor with a monitor-level `alarms` setting.
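Several options in this file accept booleans in multiple spellings. A permissive parser along these lines would cover the accepted values listed for `alarms`; the names `parse_flag` and `effective_alarms` are hypothetical and this is a sketch, not APMonitor's code. Note that YAML 1.1 parsers such as PyYAML already convert unquoted `yes`/`no`/`on`/`off` to booleans, so string handling mainly matters for quoted values and JSON configs.

```python
_TRUE = {"true", "yes", "on", "1"}
_FALSE = {"false", "no", "off", "0"}

def parse_flag(value, default=True):
    """Interpret a boolean/integer/string config value, case-insensitively."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    raise ValueError(f"not a boolean-like value: {value!r}")

def effective_alarms(site_cfg: dict, monitor_cfg: dict) -> bool:
    """A monitor-level `alarms` overrides the site-level one; the default is True."""
    site_default = parse_flag(site_cfg.get("alarms"), True)
    return parse_flag(monitor_cfg.get("alarms"), site_default)
```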
## monitors: configuration options
The `monitors` section is a list of resources to monitor. Each monitor defines what to check and how often.
### Required Fields (All Monitor Types)
- **`type`** (string): Type of check to perform. Must be one of:
  - `ping`: ICMP ping check
  - `http`: HTTP/HTTPS endpoint check (supports both HTTP and HTTPS schemes, follows and checks redirect chain for errors)
  - `quic`: HTTP/3 over QUIC endpoint check (UDP-based, faster than HTTP/HTTPS for high-latency networks)
  - `tcp`: TCP port connectivity and protocol check
  - `udp`: UDP datagram send/receive check
  - `ports`: SNMP network device monitor that collects interface bandwidth/packet/error metrics, TCP retransmits, CPU & memory, and tracks per-interface oper/admin state and MAC address changes
  - `port`: SNMP single-port MAC-pinning monitor (pins one switch port to one MAC address; fires alerts on wrong MAC, port down, or MAC absence depending on `always_up`)
  - `host`: SNMP host performance monitor that collects CPU, memory, disk I/O, swap activity, and hardware interrupt metrics per *System Performance Tuning* (Musumeci & Loukides, O'Reilly)
> [!NOTE]
> `type: snmp` has been removed. Use `type: ports` for network device monitoring or `type: host` for server performance monitoring.
- **`name`** (string): Unique identifier for this monitor.
- **`address`** (string): Resource to check. Format depends on monitor type:
  - For `ping`: Valid hostname, IPv4, or IPv6 address
  - For `http`/`quic`: Full URL with scheme and host
  - For `tcp`: URL with `tcp://` scheme, hostname/IP, and port (e.g., `tcp://server.example.com:22`)
  - For `udp`: URL with `udp://` scheme, hostname/IP, and port (e.g., `udp://192.168.1.1:161`)
  - For `ports`: URL with `snmp://` scheme and hostname/IP (e.g., `snmp://192.168.1.1` or `snmp://192.168.1.1:161`)
  - For `port`: URL with `snmp://` scheme and hostname/IP; uses SNMP transport, same format as `ports` (e.g., `snmp://192.168.1.6`)
  - For `host`: URL with `snmp://` scheme and hostname/IP; uses SNMP transport, same format as `ports` (e.g., `snmp://192.168.1.10`)
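All of the URL-style address formats above can be split with `urllib.parse.urlsplit`. The helper below is a hypothetical sketch of that parsing, not APMonitor's validator; the `default_port` parameter is an assumption used here to supply the standard SNMP port 161 when none is given.

```python
from urllib.parse import urlsplit

def parse_address(address: str, default_port=None):
    """Split a scheme://host[:port] address into (scheme, host, port)."""
    parts = urlsplit(address)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"address needs a scheme and host: {address!r}")
    port = parts.port if parts.port is not None else default_port
    return parts.scheme, parts.hostname, port
```

Plain `ping` addresses such as `192.168.1.1` have no scheme and would be rejected by this helper, so they need separate handling.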
### Optional Fields (All Monitor Types)
- **`check_every_n_secs`** (integer, optional): Seconds between checks for this resource. Overrides site-level `check_every_n_secs`. Must be ≥ 1. Default: 60 (or site-level setting if configured)
```yaml
check_every_n_secs: 300
```
**Note**: When a monitor's configuration changes (any field modification), the monitor is checked immediately on the next run regardless of this interval. Configuration changes are detected via SHA-256 checksum stored in the state file.
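Change detection by checksum can be sketched as below. The canonical serialisation (sorted-key JSON) and the function names are assumptions made for illustration; the text only states that a SHA-256 checksum of the monitor configuration is kept in the state file.

```python
import hashlib
import json

def monitor_checksum(monitor: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(monitor, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def config_changed(monitor: dict, stored_checksum: str) -> bool:
    """True when the monitor differs from what the state file recorded."""
    return monitor_checksum(monitor) != stored_checksum
```

Sorting the keys means that reordering fields in the YAML file would not, in this sketch, trigger an immediate re-check; only a changed value or an added or removed field would.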
- **`notify_every_n_secs`** (integer, optional): Minimum seconds between outage notifications while resource remains down. Must be ≥ 1 and ≥ `check_every_n_secs`. Default: 600
```yaml
notify_every_n_secs: 1800
```
- **`after_every_n_notifications`** (integer, optional): Number of notifications after which the notification interval reaches `notify_every_n_secs` for this specific monitor. Overrides site-level `after_every_n_notifications`. Can only be specified if `notify_every_n_secs` is present. Must be ≥ 1.
```yaml
notify_every_n_secs: 3600
after_every_n_notifications: 5
```
**Behavior**: Notification timing follows a quadratic Bezier curve: intervals start shorter and gradually increase over the first N notifications until reaching the full `notify_every_n_secs` interval. After N notifications, the interval remains constant at `notify_every_n_secs`. This provides aggressive early alerting that tapers off as outages persist.
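The shape of such a pacing curve can be illustrated with the standard quadratic Bezier formula `B(t) = (1-t)²·P0 + 2(1-t)t·P1 + t²·P2`. The control points APMonitor actually uses are not given in this document, so the defaults below (`P0 = P1 = 0`, `P2 = notify_every_n_secs`) are assumptions chosen purely for illustration, as is the function name.

```python
def notification_interval(sent: int, notify_every_n_secs: float,
                          after_every_n_notifications: int,
                          p0: float = 0.0, p1: float = 0.0) -> float:
    """Seconds to wait before the next alert, after `sent` notifications.

    p0 and p1 are hypothetical control points; p2 is the configured ceiling.
    """
    n = after_every_n_notifications
    t = min(sent, n) / n            # progress along the curve, capped at 1
    p2 = float(notify_every_n_secs)
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2
```

With `after_every_n_notifications: 1` the curve collapses to a constant `notify_every_n_secs`, matching the constant-interval behaviour described for the default.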
- **`email`** (boolean/integer/string, optional): Master switch to enable/disable email notifications for this specific monitor. Accepts: `true`/`yes`/`on`/`1` (case-insensitive) for enabled, `false`/`no`/`off`/`0` for disabled. Default: true (enabled if `email_server` configured)
```yaml
email: true
```
**Note**: When set to `false`, this monitor will not send any email notifications regardless of site-level `outage_emails` configuration. Useful for non-critical resources or during maintenance windows. This is a monitor-level override that takes precedence over all other email settings.
- **`display`** (boolean/integer/string, optional): Controls whether this monitor appears in the MRTG index page. Accepts: `true`/`yes`/`on`/`1` (case-insensitive) for visible, `false`/`no`/`off`/`0` for hidden. Default: true (displayed)
```yaml
display: false
```
**Note**: When set to `false`, the monitor is completely excluded from the MRTG index HTML output and MRTG config file: no graphs are generated and no graph cells appear. Monitoring, alerting, heartbeats, and RRD data collection continue unaffected. Hidden monitors are listed by name in a small audit footer at the bottom of the MRTG index page; if a hidden monitor is down, its name appears in red in that footer so outages remain visible as a detective control. Useful for suppressing internal infrastructure monitors (e.g., the APMonitor host itself) that would clutter the dashboard without adding operational value.
- **`alarms`** (boolean/integer/string, optional): Enable/disable all outage/recovery/reminder notifications for this specific monitor. Accepts: `true`/`yes`/`on`/`1` (case-insensitive) for enabled, `false`/`no`/`off`/`0` for disabled. Default: true (or site-level `alarms` setting if configured)
```yaml
alarms: false
```
**Note**: Monitor-level `alarms` overrides site-level `alarms`. When set to `false`, no email or webhook notifications are sent for this monitor. Monitoring, state tracking, heartbeats, RRD collection, and MRTG display all continue unaffected. Useful for silencing noisy or non-critical monitors without removing them from the config.
- **`heartbeat_url`** (string, optional): URL to ping (HTTP GET) when resource check succeeds. Useful for external monitoring services like Site24x7 or Healthchecks.io. Must be valid URL with scheme and host.
```yaml
heartbeat_url: "https://hc-ping.com/your-uuid-here"
```
- **`heartbeat_every_n_secs`** (integer, optional): Seconds between heartbeat pings. Must be ≥ 1. Can only be specified if `heartbeat_url` is present. If not specified, heartbeat is sent on every successful check.
```yaml
heartbeat_every_n_secs: 300
```
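The flag fields above (`display`, `alarms`, and similar) all accept the same set of truthy and falsy spellings. A minimal sketch of how such a value could be interpreted, and how a monitor-level `alarms` could override a site-level one, is shown below. The function names `parse_flag` and `alarms_enabled` are hypothetical and are not taken from `APMonitor.py`.

```python
# Hypothetical helpers, not APMonitor's actual code.
TRUE_WORDS = {"true", "yes", "on", "1"}
FALSE_WORDS = {"false", "no", "off", "0"}


def parse_flag(value, default=True):
    """Interpret a YAML boolean/integer/string flag; None means 'not set'."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    word = str(value).strip().lower()  # case-insensitive comparison
    if word in TRUE_WORDS:
        return True
    if word in FALSE_WORDS:
        return False
    raise ValueError(f"unrecognised flag value: {value!r}")


def alarms_enabled(site_cfg, monitor_cfg):
    """Monitor-level `alarms` wins; otherwise use the site level, then True."""
    site_default = parse_flag(site_cfg.get("alarms"), default=True)
    return parse_flag(monitor_cfg.get("alarms"), default=site_default)
```

For example, `alarms_enabled({"alarms": "off"}, {"alarms": True})` re-enables notifications for one monitor on an otherwise silenced site.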
### HTTP/QUIC Monitor Specific Fields
These fields are only valid for monitors with `type: http` or `type: quic`:
- **`expect`** (string, optional): Substring that must appear in the HTTP response body for the check to succeed. If `expect` is omitted, any 200 OK response is considered successful. The check performs a simple string search: if the expected content appears anywhere in the response body, the check passes.
```yaml
expect: "System Name: HomeLab"
```
**Note**: The `expect` field is string-only for simplicity. It performs exact substring matching (case-sensitive). For complex validation scenarios requiring status code checks, header validation, or regex matching, consider using external monitoring tools or extending APMonitor.
- **`ssl_fingerprint`** (string, optional): SHA-256 fingerprint of the expected SSL/TLS certificate (with or without colons). Enables certificate pinning for self-signed certificates. When specified, the certificate is verified before making the HTTP request.
```yaml
ssl_fingerprint: "e85260e8f8e85629cfa4d023ea0ae8dd3ce8ccc0040b054a4753c2a5ab269296"
```
- **`ignore_ssl_expiry`** (boolean/integer/string, optional): Skip SSL/TLS certificate expiration checking. Accepts: `true`/`1`/`"yes"`/`"ok"` (case-insensitive) for true, or `false`/`0`/`"no"` for false. Useful for development environments or when certificate renewal is managed separately.
```yaml
ignore_ssl_expiry: true
```
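To illustrate the `ssl_fingerprint` comparison described above, here is a minimal sketch that normalises a pinned fingerprint (with or without colons) and compares it to the SHA-256 digest of a DER-encoded certificate. The helper names are hypothetical, and how APMonitor itself retrieves and compares the certificate is not shown in this document.

```python
import hashlib


def normalise_fingerprint(fp):
    """Strip colons/spaces and lowercase so both notations compare equal."""
    return fp.replace(":", "").replace(" ", "").lower()


def fingerprint_matches(der_cert_bytes, pinned):
    """Compare the SHA-256 digest of a DER certificate to the pinned value."""
    actual = hashlib.sha256(der_cert_bytes).hexdigest()
    return actual == normalise_fingerprint(pinned)
```

In practice the DER bytes could come from the standard library call `ssl.SSLSocket.getpeercert(binary_form=True)` on an established TLS connection.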
### HTTP/QUIC POST Request Fields
These optional fields enable HTTP/QUIC monitors to send POST requests with data:
- **`send`** (string, optional): Data to send in HTTP/QUIC POST request body. When specified, the monitor sends a POST request instead of GET. Data is always UTF-8 encoded.
```yaml
send: '{"event": "test", "status": "ok"}'
```
- **`content_type`** (string, optional): MIME type for the Content-Type header. Can only be specified if `send` is present. This is a raw MIME type string (e.g., `application/json`, `application/x-www-form-urlencoded`, `text/plain`). Default: `text/plain; charset=utf-8`
```yaml
content_type: "application/json"
send: '{"event": "test", "status": "ok"}'
```
**HTTP JSON POST Example:**
```yaml
- type: http
  name: json-api
  address: "https://api.example.com/webhook"
  send: '{"event": "test", "status": "ok"}'
  content_type: "application/json"
  expect: "success"
```
**HTTP Form POST Example:**
```yaml
- type: http
  name: form-submit
  address: "https://example.com/submit"
  send: "name=test&value=123"
  content_type: "application/x-www-form-urlencoded"
  expect: "received"
```
**QUIC POST Example:**
```yaml
- type: quic
  name: text-endpoint
  address: "https://fast.example.com/log"
  send: "Test message"
  content_type: "text/plain; charset=utf-8"
```
**Note**: HTTP/QUIC monitors without `send` perform GET requests (original behavior). The `content_type` for HTTP/QUIC is a raw MIME type header, unlike TCP/UDP where it specifies encoding format (text/hex/base64).
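The GET-versus-POST rule above can be sketched as a small pure function. This is an illustration only: `build_http_request` is a hypothetical name, and the real request is presumably issued through an HTTP client rather than returned as a tuple.

```python
DEFAULT_CONTENT_TYPE = "text/plain; charset=utf-8"


def build_http_request(monitor):
    """Return (method, headers, body) for an HTTP/QUIC monitor dict."""
    send = monitor.get("send")
    if send is None:
        # No `send`: plain GET with no body (original behavior).
        return "GET", {}, None
    content_type = monitor.get("content_type", DEFAULT_CONTENT_TYPE)
    # The body is always UTF-8 encoded; content_type is passed through as-is.
    return "POST", {"Content-Type": content_type}, send.encode("utf-8")
```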
### TCP/UDP Monitor Specific Fields
These fields are only valid for monitors with `type: tcp` or `type: udp`:
- **`send`** (string, optional for TCP, **required for UDP**): Data to send to the service. UDP monitors require this parameter because UDP is connectionless and needs application-layer data to verify connectivity.
```yaml
send: "EHLO apmonitor\r\n"
```
- **`content_type`** (string, optional): Encoding format for the `send` data. Can only be specified if `send` is present. Valid values:
  - `text` (default): UTF-8 encoded string
  - `hex`: Hexadecimal byte string (spaces and colons are stripped)
  - `base64`: Base64-encoded binary data
```yaml
content_type: hex
send: "01 02 03 04"
```
**Note**: TCP monitors without `send` perform connection-only checks. TCP monitors automatically attempt to receive data after connecting (useful for banner protocols like SSH, SMTP, FTP). UDP monitors without `expect` succeed if the packet is sent without socket errors, but cannot verify if the service is actually listening.
- **`expect`** (string, optional): Substring that must appear in the response for the check to succeed. For TCP, this validates the received banner or response. For UDP, this requires a matching response to be received.
```yaml
expect: "SSH-2.0"
```
**UDP Behavior Notes**:
- **With `expect`**: Real service validation (recommended for SNMP, DNS, NTP) - waits for response and validates content
- **Without `expect`**: Fire-and-forget (useful for syslog, statsd) - succeeds if packet sends without socket error, cannot detect if port is listening
- UDP is connectionless, so there's no "connection established" signal like TCP's three-way handshake
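As a rough illustration of the `content_type` encodings and the `expect` substring rule for TCP/UDP monitors, the sketch below converts a `send` string to bytes and checks a response. Both function names are hypothetical, and the simplification that a missing `expect` always passes ignores socket-level errors.

```python
import base64


def encode_payload(send, content_type="text"):
    """Turn the `send` string into bytes according to `content_type`."""
    if content_type == "text":
        return send.encode("utf-8")
    if content_type == "hex":
        # Spaces and colons are stripped before decoding.
        cleaned = send.replace(" ", "").replace(":", "")
        return bytes.fromhex(cleaned)
    if content_type == "base64":
        return base64.b64decode(send)
    raise ValueError(f"content_type must be text, hex or base64, got {content_type!r}")


def response_ok(response, expect=None):
    """Without `expect` any response passes; with it, require the substring."""
    if expect is None:
        return True
    return expect.encode("utf-8") in response
```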
### Ports Monitor Specific Fields
The `ports` monitor type polls a managed network switch, router, or Linux host via SNMPv2c. It combines two orthogonal functions in one monitor: it collects bandwidth, packet, error, TCP retransmit, CPU, and memory metrics into RRD (the former `type: snmp` function), and it also tracks the operational and administrative status of every interface plus the set of learned MAC addresses on each port (the original `ports` function), firing one notification per changed interface.
> [!NOTE]
> `type: ports` subsumes the former `type: snmp`. If you previously used `type: snmp` for bandwidth/metric monitoring, change it to `type: ports`. The only functional difference is that `ports` also performs port state and MAC change detection; for devices where that is not relevant (e.g., a Linux host with no managed switching), the MAC walk will simply return empty results harmlessly.
**Required Fields:**
- **`type`**: Must be `ports`
- **`address`**: URL with `snmp://` scheme and hostname/IP, in the same format as former `snmp` monitors (e.g., `snmp://192.168.1.6`). Uses IF-MIB via SNMP transport.
**Optional Fields:**
- **`community`** (string, optional): SNMP community string. Default: `public`
- **`percentile`** (integer, optional): Percentile value to compute and display beneath each MRTG graph (e.g., `95` for 95th percentile billing). Must be an integer between 1 and 99. When specified, the Nth percentile is calculated over the graphed time range and shown in the stats table below each graph alongside Max/Average/Current.
The 95th percentile is the standard metric for burstable bandwidth ("95th percentile billing"), which discards the top 5% of traffic samples to allow for short bursts without penalising peak usage in capacity planning.
```yaml
- type: ports
  name: office-switch
  address: "snmp://192.168.1.6"
  community: "public"
  percentile: 95
  check_every_n_secs: 300
```
**Note**: `percentile` is only valid for `ports` and `port` monitors and has no effect unless `--generate-mrtg-config` is also used.
- **`notify_every_n_secs`** / **`after_every_n_notifications`** (integers, optional): Control the per-interface silence window for port state change alerts. Default values from site config apply.
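To make the `percentile` option concrete, here is a minimal sketch of a nearest-rank percentile over a list of traffic samples. The document does not state which percentile method APMonitor uses, so the nearest-rank definition and the function name are assumptions.

```python
def nearest_rank_percentile(samples, percentile):
    """Return the Nth percentile (1-99) of samples using the nearest-rank method."""
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be an integer between 1 and 99")
    if not samples:
        return None
    ordered = sorted(samples)
    # Integer ceiling of percentile/100 * len(ordered), avoiding float rounding.
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

With 20 samples where one is a short burst, the 95th percentile lands on the 19th ranked value, so the burst is discarded.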
**Monitored MIB Objects:**
- **IF-MIB::ifDescr** (1.3.6.1.2.1.2.2.1.2): Interface name/description (single walk shared by metrics and state)
- **IF-MIB::ifOperStatus** (1.3.6.1.2.1.2.2.1.8): Operational status
- **IF-MIB::ifAdminStatus** (1.3.6.1.2.1.2.2.1.7): Administrative status
- **IF-MIB::ifInOctets / ifOutOctets** (1.3.6.1.2.1.2.2.1.10/16): Byte counters per interface
- **IF-MIB::ifInErrors / ifOutErrors** (1.3.6.1.2.1.2.2.1.14/20): Error counters per interface
- **IF-MIB::ifHCIn/OutUcastPkts, ifHCIn/OutMulticastPkts, ifHCIn/OutBroadcastPkts**: 64-bit packet counters
- **TCP-MIB::tcpRetransSegs** (1.3.6.1.2.1.6.12.0): Global TCP retransmit counter
- **Vendor-specific CPU OIDs** (Cisco/HP/Juniper/Ubiquiti): falls back to HOST-RESOURCES-MIB::hrProcessorLoad
- **Vendor-specific memory OIDs** (Cisco/HP/Juniper/Ubiquiti): falls back to HOST-RESOURCES-MIB::hrStorage
- **Q-BRIDGE-MIB::dot1qTpFdbPort** (1.3.6.1.2.1.17.7.1.2.2.1.2): MAC-to-port mappings
- **Q-BRIDGE-MIB::dot1qTpFdbStatus** (1.3.6.1.2.1.17.7.1.2.2.1.3): FDB entry status (learned=3 filter)
**MRTG Targets generated:** `-bandwidth`, `-packets`, `-packets-type`, `-errors`, `-retransmits`, `-system`, `-tamper`, `-network` (see MRTG targets table above).
**State Tracking:**
The state file stores one key per `ports` monitor:
- `ports_state`: committed baseline, a dict of `{if_index: {name, oper, admin, macs}}` per interface; advances to current state on each successful poll
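A minimal sketch of how a committed baseline of this shape could be compared with a fresh poll to find changed interfaces follows. The function name and the shape of the returned records are hypothetical; only the `{if_index: {name, oper, admin, macs}}` layout comes from the description above.

```python
def diff_port_state(baseline, current):
    """Return {if_index: change-dict} for interfaces that differ from the baseline."""
    changes = {}
    for if_index, now in current.items():
        before = baseline.get(if_index)
        if before is None:
            continue  # first sighting: nothing to compare against yet
        state_changed = (before["oper"], before["admin"]) != (now["oper"], now["admin"])
        appeared = sorted(set(now["macs"]) - set(before["macs"]))
        disappeared = sorted(set(before["macs"]) - set(now["macs"]))
        if state_changed or appeared or disappeared:
            changes[if_index] = {
                "name": now["name"],
                "state_changed": state_changed,
                "appeared": appeared,
                "disappeared": disappeared,
            }
    return changes
```

Each key in the result corresponds to one changed interface, which lines up with the one-notification-per-changed-interface behavior described earlier.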
**Field Restrictions:**
- `expect`, `ssl_fingerprint`, `ignore_ssl_expiry`, `send`, `content_type` are not valid for `ports` monitors
- `ports` monitors support `heartbeat_url` and `heartbeat_every_n_secs` like other monitor types
**Example Ports Monitor Configuration:**
```yaml
- type: ports
  name: office-switch
  address: "snmp://192.168.1.6"
  community: "public"
  percentile: 95
  check_every_n_secs: 30
  notify_every_n_secs: 3600
  after_every_n_notifications: 1
```
**Sample Notification Output:**
```
##### PORT CHANGE: office-switch in HomeLab: GigabitEthernet0/2 oper=down admin=up (was oper=up admin=up) at 2:15 PM #####
##### PORT MAC CHANGE: office-switch in HomeLab: GigabitEthernet0/1 MAC change appeared=[AA:BB:CC:DD:EE:FF] at 2:22 PM #####
```
### Host Monitor Specific Fields
The `host` monitor type polls a Linux host (or any net-snmp compatible device) via SNMPv2c for system performance metrics drawn from UCD-SNMP-MIB and HOST-RESOURCES-MIB. The four MRTG charts generated correspond directly to the canonical performance tuning metrics defined in *System Performance Tuning* by Gian-Paolo D. Musumeci & Mike Loukides (O'Reilly, 2nd Ed.).
`type: host` uses the same SNMP RRD schema as `ports` and `port`. Network DS (`total_bits_*`, `total_pkts_*`, etc.) are stored as `U` since `host` does not poll interface counters.
**Required Fields:**
- **`type`**: Must be `host`
- **`address`**: URL with `snmp://` scheme and hostname/IP (e.g., `snmp://192.168.1.10`)
**Optional Fields:**
- **`community`** (string, optional): SNMP community string. Default: `public`
**MRTG Charts Generated:**
| Slot | DS pair | Title | Description |
|---|---|---|---|
| `-system1` | `cpu_load` / `context_switches` | CPU & Load | CPU utilization % + context switches/sec |
| `-system2` | `memory_pct` / `swap_io` | Memory & Paging | Memory utilization % + swap I/O rate |
| `-system3` | `disk_read` / `disk_write` | Disk I/O | Disk read/write bytes/sec (all devices summed). Disk space utilization % shown in PageTop header as *Disk Use: ##.#%* |
| `-system4` | `swap_used` / `interrupts` | System Thrashing | Swap used bytes + hardware interrupts/sec |
**Disk Space Display**: The current root filesystem utilization percentage is embedded in the MRTG `-system3` detail page header (PageTop) and in the MRTG index cell heading, where the `Disk I/O` title is followed by a value such as `Disk Use: 73.4%`. The value is read from state (persisted on each successful poll) so it updates on every monitoring cycle without requiring a live SNMP poll at graph generation time. Displays as `Disk Use: N/A` until the first successful poll.
**Monitored MIB Objects:**
- **HOST-RESOURCES-MIB::hrProcessorLoad** (1.3.6.1.2.1.25.3.3.1.2): CPU load per core (averaged)
- **HOST-RESOURCES-MIB::hrStorage** (1.3.6.1.2.1.25.2.3.1.*): Physical memory, swap, and root filesystem utilization
- **UCD-SNMP-MIB::ssRawContexts** (1.3.6.1.4.1.2021.11.60.0): Raw context switch counter
- **UCD-SNMP-MIB::ssRawSwapIn** (1.3.6.1.4.1.2021.11.62.0): Raw swap-in counter
- **UCD-SNMP-MIB::ssRawSwapOut** (1.3.6.1.4.1.2021.11.63.0): Raw swap-out counter
- **UCD-SNMP-MIB::ssRawInterrupts** (1.3.6.1.4.1.2021.11.59.0): Raw hardware interrupt counter
- **UCD-SNMP-MIB::memTotalReal / memAvailReal** (1.3.6.1.4.1.2021.4.5/6.0): Memory fallback if hrStorage unavailable
- **UCD-SNMP-MIB::memTotalSwap / memAvailSwap** (1.3.6.1.4.1.2021.4.3/4.0): Swap fallback if hrStorage unavailable
- **UCD-DISKIO-MIB::diskIOReadX** (1.3.6.1.4.1.2021.13.15.1.1.5): 64-bit disk read bytes per device (walked, summed)
- **UCD-DISKIO-MIB::diskIOWriteX** (1.3.6.1.4.1.2021.13.15.1.1.6): 64-bit disk write bytes per device (walked, summed)
**Notes:**
- UCD-SNMP-MIB OIDs (`ssRaw*`, `diskIO*`) are Linux `net-snmp` specific. On network devices these DS store `U`.
- Disk I/O bytes are summed across all block devices discovered by `diskIOTable`. This gives aggregate host I/O throughput rather than per-device breakdown.
- hrStorage physical memory and swap are used preferentially; UCD memTotal/memAvail OIDs are fallback.
- Root filesystem is identified by matching hrStorageDescr against `/`, `root`, `c:\`, or `c:`.
**Field Restrictions:**
- `expect`, `ssl_fingerprint`, `ignore_ssl_expiry`, `send`, `content_type`, `percentile` are not valid for `host` monitors
- `host` monitors support `heartbeat_url` and `heartbeat_every_n_secs` like other monitor types
**Example Host Monitor Configuration:**
```yaml
- type: host
  name: debmon-host
  address: "snmp://192.168.1.10"
  community: "YourCommunityString"
  check_every_n_secs: 300
  heartbeat_url: "https://hc-ping.com/uuid-here"
  heartbeat_every_n_secs: 600
```
### Port Monitor Specific Fields
The `port` monitor type polls a single switch port by ifIndex via SNMPv2c, pinning it to a specific MAC address. It is orthogonal to the `ports` type: `ports` watches all interfaces on a device holistically; `port` watches one interface with a hard MAC binding.
**Required Fields:**
- **`type`**: Must be `port`
- **`address`**: URL with `snmp://` scheme and hostname/IP, in the same format as `snmp`/`ports` (e.g., `snmp://192.168.1.6`)
- **`port`** (integer): ifIndex of the switch port to monitor. Must be a non-negative integer. This is the raw ifIndex as returned by IF-MIB, not a zero-based port number.
- **`mac`** (string): Pinned MAC address in `XX:XX:XX:XX:XX:XX` format (case-insensitive). This is the expected device on the port.
**Optional Fields:**
- **`community`** (string, optional): SNMP community string. Default: `public`
- **`percentile`** (integer, optional): Percentile value for MRTG graphs. Must be an integer between 1 and 99. See `ports` monitor for details.
- **`always_up`** (boolean/integer/string, optional): Controls alarm semantics. Default: `false`
**Alarm Logic:**
| Condition | `always_up: true` | `always_up: false` |
|---|---|---|
| Port oper ≠ up | Alarm | No alarm |
| Pinned MAC absent from port | Alarm | No alarm |
| Wrong MAC present on port | Alarm | Alarm |
| All clear | Recovery | Recovery |
- **`always_up: true`**: The port must be operationally up AND the pinned MAC must be present AND be the only learned MAC. Any deviation alarms.
- **`always_up: false`**: Only alarms when a non-pinned MAC is present on the port. Port down and MAC absence are silent (useful for ports that legitimately go idle).
**Recovery:** A recovery notification fires whenever all alarm conditions clear.
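The alarm table above can be expressed as a small decision function. This is a sketch under stated assumptions: the function name, the reason strings, and the order in which conditions are tested are hypothetical, and only the alarm/no-alarm outcomes follow the table.

```python
def port_alarm_reason(oper_up, learned_macs, pinned_mac, always_up):
    """Return a reason string when the port should alarm, else None (all clear)."""
    pinned = pinned_mac.upper()  # MAC comparison is case-insensitive
    learned = {mac.upper() for mac in learned_macs}
    wrong = sorted(learned - {pinned})
    if wrong:
        # A non-pinned MAC alarms regardless of always_up.
        return f"wrong MAC: expected {pinned}, got {', '.join(wrong)}"
    if always_up and not oper_up:
        return "port is down"
    if always_up and pinned not in learned:
        return "pinned MAC absent"
    return None
```

A `None` result following an earlier non-`None` result corresponds to the recovery case.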
**MAC Resolution:**
Uses Q-BRIDGE-MIB (RFC 2674) `dot1qTpFdbTable`, the correct table for VLAN-aware managed switches. The classic `dot1dTpFdbTable` (BRIDGE-MIB) returns zero entries on VLAN-aware hardware because its FDB is partitioned per VLAN. MAC walk failure is non-fatal: monitoring continues with `current_mac=None`, which only triggers alarms when `always_up=true` (MAC absent condition).
**State Tracking:**
The state file stores one key per `port` monitor:
- `port_state`: dict of `{oper, mac}` from last successful poll, used for observability and future state transition logging
**Field Restrictions:**
- `expect`, `ssl_fingerprint`, `ignore_ssl_expiry`, `send`, `content_type` are not valid for `port` monitors
- `port` monitors support `heartbeat_url` and `heartbeat_every_n_secs` like other monitor types
**Example Configuration:**
```yaml
- type: port
  name: "switch-port0"
  address: snmp://192.168.1.6
  community: TellusionLab
  check_every_n_secs: 10
  notify_every_n_secs: 60
  after_every_n_notifications: 6
  port: 0
  mac: 18:E8:29:45:F8:F7
  always_up: yes
```
With `always_up: yes`, this fires an alarm if ifIndex 0 is not oper=up, if `18:E8:29:45:F8:F7` is absent, or if any other MAC is present on that port.
**Sample Notification Output:**
```
##### NEW OUTAGE: switch-port0 in HomeLab new outage: port ifIndex=0 18:E8:29:45:F8:F7 is down (admin=up) (snmp://192.168.1.6) at 2:15 PM, down for 0 secs #####
##### NEW OUTAGE: switch-port0 in HomeLab new outage: port ifIndex=0 is up but pinned MAC 18:E8:29:45:F8:F7 absent (snmp://192.168.1.6) at 2:16 PM, down for 0 secs #####
##### NEW OUTAGE: switch-port0 in HomeLab new outage: port ifIndex=0 wrong MAC: expected 18:E8:29:45:F8:F7, got AA:BB:CC:DD:EE:FF (snmp://192.168.1.6) at 2:17 PM, down for 0 secs #####
##### RECOVERY: switch-port0 in HomeLab is UP (snmp://192.168.1.6) at 2:18 PM, outage lasted 1 mins 3 secs #####
```
### Example Configurations
#### **Ping Monitor:**
```yaml
- type: ping
  name: home-gateway
  address: "192.168.1.1"
  check_every_n_secs: 60
  heartbeat_url: "https://hc-ping.com/uuid-here"
```
#### **HTTP Monitor with Content Check:**
```yaml
- type: http
  name: web-server
  address: "http://192.168.1.100/health"
  expect: "status: ok"
  check_every_n_secs: 120
  notify_every_n_secs: 3600
```
#### **HTTPS Monitor with Certificate Pinning:**
```yaml
- type: http
  name: nvr0
  address: "https://192.168.1.12/api/system"
  expect: "nvr0"
  ssl_fingerprint: "e85260e8f8e85629cfa4d023ea0ae8dd3ce8ccc0040b054a4753c2a5ab269296"
  ignore_ssl_expiry: true
  heartbeat_url: "https://plus.site24x7.com/hb/uuid/nvr0"
  heartbeat_every_n_secs: 60
```
#### **QUIC Monitor (HTTP/3):**
```yaml
- type: quic
  name: fast-api
  address: "https://api.example.com/health"
  expect: "healthy"
  check_every_n_secs: 30
  ssl_fingerprint: "a1b2c3d4e5f67890abcdef1234567890abcdef1234567890abcdef1234567890"
```
**Note**: QUIC monitoring uses HTTP/3 over UDP (port 443 by default) and is particularly effective for high-latency networks or when monitoring resources over unreliable connections. QUIC provides built-in connection migration and improved performance compared to TCP-based HTTP/2.
#### **TCP Banner Check (SSH):**
```yaml
- type: tcp
  name: ssh-server
  address: "tcp://server.example.com:22"
  expect: "SSH-2.0"
  check_every_n_secs: 60
```
#### **TCP Send/Receive (SMTP):**
```yaml
- type: tcp
  name: smtp-server
  address: "tcp://mail.example.com:25"
  send: "EHLO apmonitor\r\n"
  content_type: text
  expect: "250"
  check_every_n_secs: 60
```
#### **TCP Connection-Only Check:**
```yaml
- type: tcp
  name: mysql-db
  address: "tcp://192.168.1.100:3306"
  check_every_n_secs: 30
```
#### **UDP with Response Validation (DNS):**
```yaml
- type: udp
  name: dns-server
  address: "udp://8.8.8.8:53"
  send: "..." # DNS query packet
  content_type: hex
  expect: "..." # Expected response
  check_every_n_secs: 60
```
#### **UDP Fire-and-Forget (Syslog):**
```yaml
- type: udp
  name: syslog-collector
  address: "udp://192.168.1.50:514"
  send: "<134>APMonitor: test message"
  check_every_n_secs: 300
```
#### **Network Switch with 95th Percentile (formerly `type: snmp`):**
```yaml
- type: ports
  name: office-switch
  address: "snmp://192.168.1.6"
  community: "public"
  percentile: 95
  check_every_n_secs: 300
  heartbeat_url: "https://hc-ping.com/uuid-switch"
  heartbeat_every_n_secs: 600
```
#### **Host Performance Monitor:**
```yaml
- type: host
  name: debmon-host
  address: "snmp://192.168.1.10"
  community: "public"
  check_every_n_secs: 300
```
#### **Switch Port Status + Metrics + MAC Change Monitor:**
```yaml
- type: ports
  name: office-switch
  address: "snmp://192.168.1.6"
  community: "public"
  check_every_n_secs: 30
  notify_every_n_secs: 3600
  after_every_n_notifications: 1
```
#### **Single Port MAC Pinning Monitor:**
```yaml
- type: port
  name: "switch-port0"
  address: snmp://192.168.1.6
  community: TellusionLab
  check_every_n_secs: 10
  notify_every_n_secs: 60
  after_every_n_notifications: 6
  port: 0
  mac: 18:E8:29:45:F8:F7
  always_up: yes
```
#### **Hidden Monitor (monitoring continues, excluded from MRTG display):**
```yaml
- type: port
  name: "switch-port0"
  address: snmp://192.168.1.6
  community: TellusionLab
  port: 0
  mac: 18:E8:29:45:F8:F7
  always_up: yes
  display: false
```
#### **Silenced Monitor (monitoring and display continue, notifications suppressed):**
```yaml
- type: ports
  name: office-switch
  address: "snmp://192.168.1.6"
  community: "public"
  alarms: false
```
### Validation Rules
The configuration validator enforces these rules:
1. Monitor names must be unique across all monitors
2. `notify_every_n_secs` must be ≥ `check_every_n_secs` if both specified
3. `heartbeat_every_n_secs` can only be specified if `heartbeat_url` exists
4. `ssl_fingerprint` and `ignore_ssl_expiry` are only valid for HTTP/QUIC monitors; `expect` is valid for HTTP/QUIC and TCP/UDP monitors
5. `expect` must be a non-empty string if specified
6. All URLs must include both scheme (http/https/tcp/udp/snmp) and hostname
7. Email addresses must match standard email format (RFC 5322 simplified)
8. SSL fingerprints must be valid hexadecimal strings with length that's a power of two
9. `after_every_n_notifications` can only be specified if `notify_every_n_secs` is present
10. `outage_emails` can only be specified if `email_server` is configured
11. If `email_server` is present, `smtp_host`, `smtp_port`, and `from_address` are required
12. `smtp_username` and `smtp_password` are optional (for servers without authentication)
13. Email control flags (`email_outages`, `email_recoveries`, `email_reminders`) accept boolean or string values
14. Monitor-level `email` flag accepts boolean or string values
15. TCP monitors must use `tcp://` scheme, UDP monitors must use `udp://` scheme
16. TCP/UDP addresses must include hostname/IP and port
17. UDP monitors require `send` parameter
18. `content_type` can only be specified if `send` is present
19. `content_type` for TCP/UDP must be one of: text, hex, base64 (for HTTP/QUIC it's a raw MIME type string)
20. `ssl_fingerprint` and `ignore_ssl_expiry` are not allowed for TCP/UDP monitors
21. `ports` monitors must use `snmp://` scheme (SNMP transport)
22. `community` field is optional for `ports`/`port`/`host` monitors and must be a non-empty string if specified
23. `expect`, `ssl_fingerprint`, `ignore_ssl_expiry`, `send`, and `content_type` are not allowed for `ports` monitors
24. `ports` monitors support `heartbeat_url` and `heartbeat_every_n_secs` like other monitor types
25. `percentile` is only valid for `ports` and `port` monitors and must be an integer between 1 and 99
26. `port` monitors must use `snmp://` scheme (SNMP transport)
27. `port` monitors require `port` (non-negative integer ifIndex) and `mac` (valid `XX:XX:XX:XX:XX:XX` address)
28. `always_up` is optional for `port` monitors and accepts boolean or string values
29. `expect`, `ssl_fingerprint`, `ignore_ssl_expiry`, `send`, `content_type` are not allowed for `port` monitors
30. `port` monitors support `heartbeat_url` and `heartbeat_every_n_secs` like other monitor types
31. `host` monitors must use `snmp://` scheme (SNMP transport)
32. `expect`, `ssl_fingerprint`, `ignore_ssl_expiry`, `send`, `content_type`, `percentile` are not allowed for `host` monitors
33. `host` monitors support `heartbeat_url` and `heartbeat_every_n_secs` like other monitor types
34. `type: snmp` is not valid; the validator emits: *"type 'snmp' is not valid. Did you mean type: ports?"*
35. `display` is optional for all monitor types and accepts boolean or string values; when `false`, the monitor is excluded from MRTG index output but monitoring, alerting, heartbeats, and RRD collection continue unaffected; hidden monitors appear in the MRTG index audit footer and render in red when down
36. `alarms` is optional at both site and monitor level; accepts boolean or string values; monitor-level `alarms` overrides site-level `alarms`; when `false`, all outage/recovery/reminder notifications are suppressed while monitoring, state tracking, heartbeats, RRD collection, and MRTG display continue unaffected
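A few of the rules above (2, 3, 9, 17, 18, and 34) are simple enough to sketch as code. The function below is an illustration only: `validate_monitor` and its message wording are hypothetical, apart from the `type: snmp` message, which is quoted from rule 34.

```python
def validate_monitor(monitor):
    """Return a list of error strings for a single monitor dict (empty if valid)."""
    errors = []
    name = monitor.get("name", "<unnamed>")
    mtype = monitor.get("type")
    check = monitor.get("check_every_n_secs")
    notify = monitor.get("notify_every_n_secs")

    if check is not None and notify is not None and notify < check:  # rule 2
        errors.append(f"{name}: notify_every_n_secs must be >= check_every_n_secs")
    if "heartbeat_every_n_secs" in monitor and "heartbeat_url" not in monitor:  # rule 3
        errors.append(f"{name}: heartbeat_every_n_secs requires heartbeat_url")
    if "after_every_n_notifications" in monitor and notify is None:  # rule 9
        errors.append(f"{name}: after_every_n_notifications requires notify_every_n_secs")
    if mtype == "udp" and "send" not in monitor:  # rule 17
        errors.append(f"{name}: UDP monitors require send")
    if "content_type" in monitor and "send" not in monitor:  # rule 18
        errors.append(f"{name}: content_type requires send")
    if mtype == "snmp":  # rule 34
        errors.append(f"{name}: type 'snmp' is not valid. Did you mean type: ports?")
    return errors
```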
# Dependencies
Install system-wide for production use:
```bash
sudo apt install python3-rrdtool librrd-dev python3-dev mrtg rrdtool librrds-perl libsnmp-dev
sudo pip3 install --break-system-packages PyYAML requests pyOpenSSL urllib3 aioquic rrdtool easysnmp
```
**Note**:
- The `aioquic` package is required for QUIC/HTTP3 monitoring support. If you don't plan to use `type: quic` monitors, you can omit this dependency.
- The `easysnmp` package and `libsnmp-dev` system library are required for SNMP monitoring support. If you don't plan to use `type: ports`, `type: port`, or `type: host` monitors, you can omit these dependencies.
# Example invocations
```bash
# Single site, auto-derived statefile
./APMonitor.py homelab-monitorhosts.yaml
# Single site, explicit statefile
./APMonitor.py -s /tmp/statefile.json homelab-monitorhosts.yaml
# Multiple sites (concurrent subprocesses, no -s allowed)
./APMonitor.py site1.yaml site2.yaml site3.yaml --generate-mrtg-config
# Test configuration
./APMonitor.py --test-config homelab-monitorhosts.yaml
# Test webhooks
./APMonitor.py --test-webhooks -v homelab-monitorhosts.yaml
# Test emails
./APMonitor.py --test-emails -v homelab-monitorhosts.yaml
```
# Command Line Usage
APMonitor is invoked from the command line with various options to control verbosity, threading, state file location, and testing modes.
## Synopsis
```
./APMonitor.py [OPTIONS] config_file [config_file ...]
```
## Command Line Options
- **`config_file`** (required, repeatable): Path to one or more YAML or JSON configuration files. When multiple files are specified, each runs as an independent subprocess concurrently. `-s` is not valid with multiple config files.
- **`-v, --verbose`**: Increase verbosity level (can be repeated: `-v`, `-vv`, `-vvv`).
- **`-t, --threads`**: Number of concurrent threads per site for checking resources (default: 1). Overrides `max_threads` in site config.
- **`-s, --statefile`**: Path to state file. Only valid with a single config file. Default: `/var/tmp/APMonitor/.statefile.json`.
- **`--test-config`**: Validate configuration and print a summary of monitors, then exit. Does not check resources or touch the statefile.
- **`--test-webhooks`**: Send a test alert to all configured webhooks, then exit.
- **`--test-emails`**: Send a test alert to all configured email addresses, then exit.
- **`--generate-rrds`**: Enable RRD database creation and updates (implied by `--generate-mrtg-config`).
- **`--generate-mrtg-config [WORKDIR]`**: Generate MRTG config, update `mrtg-rrd.cgi.pl`, write `index.html` and detail pages into `WORKDIR//`. Default WORKDIR: `/var/www/html/mrtg`. Implies `--generate-rrds`.
## Common Usage Examples
### Basic Monitoring (Single-Threaded)
Run with default settings, with state stored under `/tmp` (tmpfs on many systems):
```
./APMonitor.py -s /tmp/statefile.json monitoring-config.yaml
```
### Verbose Monitoring for Debugging
Show detailed progress and decision-making:
```
./APMonitor.py -v -s /tmp/statefile.json monitoring-config.yaml
```
### High-Frequency Monitoring (Multiple Threads)
Check many resources concurrently for near-realtime behavior:
```
./APMonitor.py -t 10 -s /tmp/statefile.json monitoring-config.yaml
```
Use higher thread counts (`-t 5` to `-t 20`) when:
- Monitoring many independent resources (50+)
- Resources have long check timeouts
- Near-realtime alerting is required
- System has sufficient CPU cores
**Warning**: High thread counts increase lock contention. Test with `-v` to ensure checks aren't blocking each other.
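To illustrate what a per-site thread count means in practice, here is a minimal sketch that runs a check function over a list of monitors with a bounded worker pool. It uses the standard library's `ThreadPoolExecutor` as an assumption; this document does not say which threading primitive `APMonitor.py` uses, and `run_checks` and `check_fn` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_checks(monitors, check_fn, max_threads=1):
    """Run check_fn over all monitors with at most max_threads workers."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map preserves input order, so results line up with monitors.
        results = list(pool.map(check_fn, monitors))
    return {monitor["name"]: result for monitor, result in zip(monitors, results)}
```

With `max_threads=1` the checks run one after another, so a single slow timeout delays every check behind it; a larger pool lets slow checks overlap.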
### Test Webhook Configuration
Verify webhooks are configured correctly before production use:
```
./APMonitor.py --test-webhooks -v monitoring-config.yaml
```
This sends test messages to all configured webhooks with verbose output showing request/response details.
### Test Email Configuration
Verify email settings work correctly:
```
./APMonitor.py --test-emails -v monitoring-config.yaml
```
## Running `APMonitor.py` Continuously
APMonitor is designed to be run repeatedly rather than as a long-running daemon.
### Option 1: Cron (Recommended for Most Cases)
```
* * * * * /path/to/APMonitor.py /path/to/monitoring-config.yaml 2>&1 | logger -t apmonitor
```
**Note**: PID file locking prevents overlapping runs if an invocation takes longer than the one-minute cron interval.
**Advantages**:
- Automatic restart if process crashes
- Built-in scheduling
- System handles process lifecycle
- Easy to enable/disable (comment out cron entry)
**Best for**: Production systems, servers with standard monitoring requirements (check intervals ≥ 60 seconds)
### Option 2: While Loop (For Sub-Minute Monitoring)
Run continuously with short sleep intervals for near-realtime monitoring:
```
#!/bin/bash
while true; do
    ./APMonitor.py -t 5 monitoring-config.yaml
    sleep 10
done
```
Or as a one-liner:
```
while true; do ./APMonitor.py -s /tmp/statefile.json monitoring-config.yaml; sleep 30; done
```
**Advantages**:
- Sub-minute check intervals
- Near-realtime alerting
- Fine control over execution frequency
**Best for**: Development, testing, systems requiring rapid failure detection (check intervals < 60 seconds)
**Note**: Use short sleep intervals (5-30 seconds) combined with per-resource `check_every_n_secs` settings to balance responsiveness and system load. APMonitor's internal scheduling prevents redundant checks even with frequent invocations.
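The idea that frequent invocations do not cause redundant checks can be sketched as a due-time test against the persisted `last_checked` timestamp. This is a simplified illustration: `is_due` is a hypothetical name and APMonitor's actual scheduling logic may consider more than this.

```python
from datetime import datetime, timedelta


def is_due(last_checked_iso, check_every_n_secs, now):
    """True when the resource was never checked or its interval has elapsed."""
    if last_checked_iso is None:
        return True
    last_checked = datetime.fromisoformat(last_checked_iso)
    return now - last_checked >= timedelta(seconds=check_every_n_secs)
```

With a 10-second loop and `check_every_n_secs: 60`, five out of six invocations would skip that resource under this rule.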
### Systemd Service (Alternative)
For production deployments requiring process supervision:
```
[Unit]
Description=APMonitor Network Resource Monitor
After=network.target
[Service]
Type=simple
ExecStart=/bin/bash -c 'while true; do /usr/local/bin/APMonitor.py -vv /usr/local/etc/apmonitor-config.yaml --generate-mrtg-config; sleep 10; done'
Restart=always
RestartSec=10
User=monitoring
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
```
## Default State File Location
APMonitor automatically selects a platform-appropriate default location for the state file if the `-s/--statefile` option is not specified:
### Linux, macOS, FreeBSD, OpenBSD, NetBSD
**Default**: `/var/tmp/APMonitor/.statefile.json`
- Directory `/var/tmp/APMonitor/` is created automatically with mode `755` (no www-data write access)
- Persists across system reboots (unlike `/tmp`)
- All sibling files (`.new`, `.old`, `.mrtg.cfg`, `.rrd/`) live in the same directory
### Windows
**Default**: `%TEMP%\APMonitor\.statefile.json`
### Unknown/Other Platforms
**Default**: `./.statefile.json`
## Concurrency and Multiple Instances
When multiple config files are passed on the command line, APMonitor spawns one subprocess per config and joins all before exiting. Each subprocess runs completely independently with its own statefile, RRD database, lock file, and MRTG output directory. A PID lockfile (hashed from the config path) in `/tmp/` prevents duplicate instances per config.
For manual multi-instance operation with separate invocations, use separate config files. The config filename determines the statefile path and PID lock, so correct cardinality is enforced automatically:
```bash
# Instance 1: Production monitoring
./APMonitor.py prod-apmonitor-config.yaml --generate-mrtg-config
# Instance 2: Development monitoring
./APMonitor.py dev-apmonitor-config.yaml --generate-mrtg-config
```
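A per-config PID lockfile of the kind described above could look roughly like the sketch below. The hash algorithm, the file naming, and both function names are assumptions for illustration, and stale-lock cleanup (a lockfile left behind by a crashed process) is deliberately omitted.

```python
import hashlib
import os


def lockfile_path(config_path, lock_dir="/tmp"):
    """Derive a lockfile name from the absolute config path (naming is hypothetical)."""
    absolute = os.path.abspath(config_path)
    digest = hashlib.sha256(absolute.encode("utf-8")).hexdigest()[:16]
    return os.path.join(lock_dir, f"apmonitor-{digest}.pid")


def acquire_lock(path):
    """Create the lockfile exclusively; return False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owning process ID
    return True
```

Because the name is derived from the config path, two different config files never contend for the same lock, while two invocations with the same config do.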
# Developer Notes for modifying `APMonitor.py`
## State File
APMonitor uses a JSON state file to persist monitoring data across runs:
- **Location**: `/var/tmp/APMonitor/.statefile.json` by default
- **Format**: JSON with per-resource nested objects containing timestamps, status, and counters
- **Atomic Updates**: Uses `.new` and `.old` rotation to prevent corruption on crashes
- **Thread Safety**: Protected by internal lock during concurrent access
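The `.new`/`.old` rotation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the suffixes are appended to the full statefile path, the fallback to `.old` on load is a guess at the intent, and `save_state`/`load_state` are hypothetical names.

```python
import json
import os


def save_state(path, state):
    """Write to `<path>.new`, keep the previous file as `<path>.old`, then swap."""
    new_path = path + ".new"
    old_path = path + ".old"
    with open(new_path, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes reach disk before renaming
    if os.path.exists(path):
        os.replace(path, old_path)  # previous good copy survives as .old
    os.replace(new_path, path)  # rename is atomic on POSIX filesystems


def load_state(path):
    """Load the state file, falling back to `.old` if it is missing or corrupt."""
    for candidate in (path, path + ".old"):
        try:
            with open(candidate, encoding="utf-8") as handle:
                return json.load(handle)
        except (FileNotFoundError, json.JSONDecodeError):
            continue
    return {}
```

A crash during the write leaves only a partial `.new` file, so the last complete state remains readable.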
The state file tracks per-resource:
- `is_up`: Current resource status
- `last_checked`: When resource was last checked (ISO 8601 timestamp)
- `last_response_time_ms`: Response time in milliseconds for successful checks
- `last_notified`: When last notification was sent (ISO 8601 timestamp)
- `last_alarm_started`: When current/last outage began (ISO 8601 timestamp)
- `last_successful_heartbeat`: When heartbeat URL last succeeded (ISO 8601 timestamp)
- `down_count`: Consecutive failed checks
- `notified_count`: Number of notifications sent for current outage
- `error_reason`: Last error message
- `last_config_check