{"id":37103714,"url":"https://github.com/whiskeyjimbo/checkmate","last_synced_at":"2026-01-14T12:33:03.707Z","repository":{"id":269617354,"uuid":"906335972","full_name":"whiskeyjimbo/CheckMate","owner":"whiskeyjimbo","description":"A modern service monitoring tool written in Go that provides real-time health checks and metrics for infrastructure. Supports multiple protocols with Prometheus integration and rule-based monitoring.","archived":false,"fork":false,"pushed_at":"2025-01-14T05:35:47.000Z","size":211,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-14T06:26:58.419Z","etag":null,"topics":["availability","down-detector","go","golang","latency-monitor","monitoring","monitoring-tool","prometheus","service-health","uptime-monitor"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/whiskeyjimbo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-20T17:07:35.000Z","updated_at":"2025-01-14T05:35:49.000Z","dependencies_parsed_at":"2024-12-24T21:26:00.022Z","dependency_job_id":"63451642-1320-4c84-817b-d966f9c03295","html_url":"https://github.com/whiskeyjimbo/CheckMate","commit_stats":null,"previous_names":["whiskeyjimbo/checkmate"],"tags_count":12,"template":false,"template_full_name":null,"purl":"pkg:github/whiskeyjimbo/CheckMate","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/whiskeyjimbo%2FCheckMate","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/whiskeyjimbo%2FCheckMate/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/whiskeyjimbo%2FCheckMate/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/whiskeyjimbo%2FCheckMate/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/whiskeyjimbo","download_url":"https://codeload.github.com/whiskeyjimbo/CheckMate/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/whiskeyjimbo%2FCheckMate/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28420791,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T10:47:48.104Z","status":"ssl_error","status_checked_at":"2026-01-14T10:46:19.031Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["availability","down-detector","go","golang","latency-monitor","monitoring","monitoring-tool","prometheus","service-health","uptime-monitor"],"created_at":"2026-01-14T12:33:02.883Z","updated_at":"2026-01-14T12:33:03.686Z","avatar_url":"https://github.com/whiskeyjimbo.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CheckMate\n\n![License](https://img.shields.io/badge/license-GPLv3-blue.svg)\n![Go 
Version](https://img.shields.io/badge/language-go-blue.svg)\n\nCheckMate is a service monitoring tool written in Go that provides real-time health checks and metrics for infrastructure. It supports multiple protocols, customizable rules, and Prometheus integration.\n\nDISCLAIMER: This is a personal project under heavy development. It is not feature-complete, secure, or tested, and is not meant for use in a production environment.\n\n## Features\n\n### Core Features\n- Multi-protocol support (TCP, HTTP, HTTPS with cert validation, SMTP, DNS)\n- Hierarchical configuration (Sites → Groups → Hosts → Checks)\n- High availability monitoring with configurable modes\n- Configurable check intervals per service\n- Prometheus metrics integration\n- Simple rule-based monitoring with custom conditions\n- Flexible notification system\n- Service tagging system\n- TLS certificate expiration monitoring\n- Extensible design for easy protocol additions\n\n### High Availability Monitoring\n\nGroups support two monitoring modes that can be configured at different levels:\n\n- **All Mode (Default)**\n  - Group is considered \"up\" if any host is responding\n  - Rules only trigger when all hosts are down\n  - For redundant services where one available host is sufficient\n\n- **Any Mode**\n  - Group monitoring tracks all hosts individually\n  - Rules trigger when any host goes down\n  - Suitable for services where each host's availability is critical\n\nRule modes can be configured at three levels (in order of precedence):\n1. Check level - Overrides group settings for specific checks\n2. Group level - Default for all checks in the group\n3. 
Default - Falls back to \"all\" mode if not specified\n\n## Configuration\n\n### Site Configuration\n- `monitor_site`: Name of the monitoring instance\n- `sites`: List of infrastructure sites to monitor\n  - `name`: Site identifier\n  - `tags`: Site-level tags\n  - `groups`: List of service groups\n\n### Group Configuration\n- `name`: Group identifier\n- `tags`: Group-level tags (combined with site tags)\n- `hosts`: List of hosts to monitor\n  - `host`: Hostname or IP\n  - `tags`: Host-specific tags\n- `checks`: Service checks applied to all hosts\n  - `port`: Port number\n  - `protocol`: TCP, HTTP, HTTPS, SMTP, or DNS\n  - `interval`: Check frequency (e.g., \"30s\", \"1m\")\n  - `tags`: Check-specific tags\n  - `rule_mode`: Overrides the group's rule mode\n  - `verify_cert`: Enable certificate checking\n- `rule_mode`: Group-level rule mode (\"all\" or \"any\")\n\n### Rule Configuration\nRules define conditions for generating notifications. Each rule requires a `type` field:\n\n```yaml\n# Standard Rule Example\n- name: \"prod_service_degraded\"\n  type: \"standard\"\n  condition: \"responseTime \u003e 1000 || downtime \u003e 0\"\n  tags: [\"prod\", \"critical\"]\n  notifications: [\"log\"]\n# Certificate Rule Example\n- name: \"cert_expiring_soon\"\n  type: \"cert\"\n  min_days_validity: 30\n  tags: [\"https-api\"]\n  notifications: [\"log\"]\n```\n\nCommon Fields:\n- `name`: Rule identifier\n- `type`: Either \"standard\" or \"cert\"\n- `tags`: Tags to match against groups/checks\n- `notifications`: Notification types to use\n\nType-specific Fields:\n- Standard Rules:\n  - `condition`: Expression using `downtime` and `responseTime` variables\n- Certificate Rules:\n  - `min_days_validity`: Days before expiration to trigger alert\n\n### Notification Configuration\n- `type`: Notification type (\"log\", more coming soon)\n\n## Metrics\n\nCheckMate exposes Prometheus metrics at `:9100/metrics`.\n\n### Core Metrics\n- `checkmate_host_check_status`: Service availability (1 = up, 0 = 
down)\n- `checkmate_host_check_latency_milliseconds`: Response time in milliseconds\n- `checkmate_check_latency_histogram_seconds`: Response time distribution\n- `checkmate_hosts_up`: Number of hosts up in a group\n- `checkmate_hosts_total`: Total number of hosts in a group\n- `checkmate_cert_expiry_days`: Days until certificate expiration\n\n### Graph Visualization Metrics (In Development)\n\u003e Note: These metrics are designed for Grafana's Node Graph visualization and are currently in flux\n\n- `checkmate_node_info`: Node information for graph visualization\n  - Labels: id, type (site/group/host), name, tags, port, protocol\n  - Values: 1 for active nodes, 0 for inactive\n\n- `checkmate_edge_info`: Edge information with latency\n  - Labels: source, target, type, metric, port, protocol\n  - Values: latency in milliseconds\n\nExample Prometheus queries:\n```promql\n# Filter check status by site\ncheckmate_host_check_status{site=\"mars-lab\"}\n\n# Average response time for production APIs\navg(checkmate_host_check_latency_milliseconds{tags=~\".*prod.*\", tags=~\".*api.*\"})\n\n# 95th percentile latency by site\nhistogram_quantile(0.95, sum(rate(checkmate_check_latency_histogram_seconds_bucket[5m])) by (le, site))\n\n# Host availability ratio per group\nsum(checkmate_hosts_up) by (id) / sum(checkmate_hosts_total) by (id)\n\n# Graph Visualization (In Development)\ncheckmate_node_info{type=\"host\", port=\"443\", protocol=\"HTTPS\"}\navg(checkmate_edge_info{type=\"contains\", metric=\"latency\"}) by (source, target, port, protocol)\n```\n\n### Grafana Node Graph Setup (In Development)\nTo visualize your infrastructure in Grafana's Node Graph:\n\n1. Create a new Node Graph panel\n2. Configure the Node Query:\n   ```promql\n   checkmate_node_info\n   ```\n3. Configure the Edge Query:\n   ```promql\n   checkmate_edge_info{metric=\"latency\"}\n   ```\n4. 
Set transformations:\n   - Nodes: Use 'id' for node ID, 'type' for node class\n   - Edges: Use 'source' and 'target' for connections\n\n\u003e Note: Graph visualization features are in flux and the query/configuration interface may change\n\n## Health Checks\n\nCheckMate provides Kubernetes-compatible health check endpoints:\n\n- `/health/live` - Liveness probe\n  - Returns 200 OK when the service is running\n\n- `/health/ready` - Readiness probe\n  - Returns 200 OK when ready to receive traffic\n  - Returns 503 Service Unavailable during initialization\n\nAll health check endpoints are served on port 9100 alongside metrics.\n\n## Mini Roadmap\n\n### High Pri\n- [ ] Config Hot Reload\n- [ ] Notification system expansion (Slack, Email)\n- [ ] Configurable notification thresholds\n  - time between alerts\n  - service restoration notification\n  - configurable custom alert levels (example: insignificant, minor, critical, all hands on deck)\n  - etc.\n- [ ] Move alert logic to notifications (any/all)\n\n### Low Pri\n- [ ] Database support for historical data\n- [ ] Web UI for monitoring (MAYBE)\n\n## Completed\n- [x] Env Variables for config\n- [x] Dockerfile for dev\n- [x] Additional protocol support (HTTPS, TLS verification)\n- [x] Kubernetes readiness/liveness probe support\n- [x] Multiple host monitoring\n- [x] Multi-protocol per host\n- [x] Service tagging system\n- [x] Site-based infrastructure organization\n- [x] High availability group monitoring\n\n## License\n\nThis project is licensed under the GNU General Public License v3.0 - see the [LICENSE](LICENSE) file for details.\n\n## Development\n\n### Prerequisites\n- Go 1.21 or higher\n- [air](https://github.com/air-verse/air) for live reloading (optional)\n\n### Live Reloading\nFor development with automatic rebuilding on code changes:\n\n1. Install Air:\n```bash\ngo install github.com/air-verse/air@latest\n```\n\n2. 
Run with Air:\n```bash\nair\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwhiskeyjimbo%2Fcheckmate","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwhiskeyjimbo%2Fcheckmate","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwhiskeyjimbo%2Fcheckmate/lists"}