{"id":50327352,"url":"https://github.com/rtmuller/observability-reliability-lab","last_synced_at":"2026-05-29T07:31:42.064Z","repository":{"id":353331602,"uuid":"1151462985","full_name":"rtmuller/observability-reliability-lab","owner":"rtmuller","description":"Hands-on lab demonstrating observability reliability patterns: meta-monitoring, chaos scenarios, Watchdog, absent() rules.","archived":false,"fork":false,"pushed_at":"2026-04-23T11:51:02.000Z","size":18,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-23T13:32:15.861Z","etag":null,"topics":["alertmanager","chaos-engineering","grafana","meta-monitoring","observability","prometheus","sre"],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rtmuller.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-06T13:55:22.000Z","updated_at":"2026-04-23T11:51:06.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/rtmuller/observability-reliability-lab","commit_stats":null,"previous_names":["rtmuller/observability-reliability-lab"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/rtmuller/observability-reliability-lab","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rtmuller%2Fobservability-reliability-lab","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/rtmuller%2Fobservability-reliability-lab/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rtmuller%2Fobservability-reliability-lab/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rtmuller%2Fobservability-reliability-lab/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rtmuller","download_url":"https://codeload.github.com/rtmuller/observability-reliability-lab/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rtmuller%2Fobservability-reliability-lab/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33642256,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-29T02:00:06.066Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alertmanager","chaos-engineering","grafana","meta-monitoring","observability","prometheus","sre"],"created_at":"2026-05-29T07:31:41.564Z","updated_at":"2026-05-29T07:31:42.056Z","avatar_url":"https://github.com/rtmuller.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Observability Reliability Lab\n\nA hands-on lab that demonstrates why your monitoring pipeline is a single point of failure — and how meta-monitoring fixes it.\n\n---\n\n## The Observability Trilogy\n\nThis lab is 
the companion to a three-part article series on building intentional, reliable, and cost-effective observability for cloud-native systems at scale.\n\nEach article tackles a different dimension of observability maturity — and each one builds on the lessons of the previous.\n\n### Article 1 — [Beyond Monitoring: The Hidden Cost of Observability at Scale](https://medium.com/@rafael_muller/beyond-monitoring-the-hidden-cost-of-observability-at-scale-adbee5ae5f8a)\n\nObservability costs don't explode because of traffic — they explode because of unchecked **cardinality**. A single unbounded label like `request_id` or `collection_id` can silently generate hundreds of thousands of active time series. This article covers how a single high-cardinality label caused an 81% cost increase in Grafana Cloud, and how relabeling rules and metric auditing bring costs back under control.\n\n### Article 2 — [The Silent Killer: Why \"No Data\" Is Often Worse Than Bad Data](https://medium.com/@rafael_muller/the-silent-killer-why-no-data-is-often-worse-than-bad-data-c811fa664371)\n\nMost alerts assume data exists — they trigger when a metric crosses a threshold. But when a metric **disappears entirely**, the alert never fires. There's no alarm, no page, just silence. This article introduces Prometheus's `absent()` function as a way to detect when critical metrics stop reporting — turning silence into an actionable signal.\n\n### Article 3 — The Observability Reliability Paradox *(this lab)*\n\nYou've got cardinality under control. You're alerting on missing data. But what happens when **Prometheus itself goes down**? Your `absent()` rules can't evaluate. Your Watchdog heartbeat stops. Your dashboards freeze. And nobody gets notified — because the notification path runs through the very system that failed. 
This lab lets you experience that blind spot firsthand and demonstrates the meta-monitoring patterns that solve it.\n\n---\n\n## What This Lab Demonstrates\n\nMost teams build alerting around thresholds and missing data — but never test what happens when the **monitoring stack itself fails**.\n\nThis lab lets you:\n\n- Run a full Prometheus + AlertManager + Grafana stack locally\n- **Kill Prometheus** and watch dashboards freeze, alerts go silent, and `absent()` rules become useless\n- **Kill AlertManager** and see Prometheus detect the failure but fail to notify anyone\n- Understand why **meta-monitoring** (Watchdog alerts, external heartbeats, blackbox probes) is the missing layer\n\n## Architecture\n\n```\n┌──────────────────────────────────────────────────────────┐\n│                     Docker Network                        │\n│                                                           │\n│  ┌──────────────┐     scrapes      ┌───────────────────┐ │\n│  │  Prometheus   │◄────────────────│  Payment Service   │ │\n│  │  :9090        │                 │  :8000             │ │\n│  └──────┬───────┘                 └───────────────────┘ │\n│         │                                                 │\n│    alerts│         ┌───────────────────┐                  │\n│         ▼         │ Blackbox Exporter  │                  │\n│  ┌──────────────┐ │  :9115            │                  │\n│  │ AlertManager  │ │  (probes health   │                  │\n│  │  :9093        │ │   endpoints)      │                  │\n│  └──────────────┘ └───────────────────┘                  │\n│                                                           │\n│  ┌──────────────┐                                         │\n│  │   Grafana     │  ← Meta-Monitoring Dashboard           │\n│  │  :3000        │                                        │\n│  └──────────────┘                                         │\n└──────────────────────────────────────────────────────────┘\n```\n\n## 
Prerequisites\n\n- [Docker](https://docs.docker.com/get-docker/) and Docker Compose\n- Ports `3000`, `8000`, `9090`, `9093`, `9115` available\n\n## Quick Start\n\n```bash\n# Clone the repo\ngit clone https://github.com/rtmuller/observability-reliability-lab.git\ncd observability-reliability-lab\n\n# Start everything\ndocker compose up -d --build\n\n# Verify all services are running\ndocker compose ps\n```\n\n### Access the UIs\n\n| Service          | URL                            | Credentials   |\n|------------------|--------------------------------|---------------|\n| **Grafana**      | http://localhost:3000           | admin / admin |\n| **Prometheus**   | http://localhost:9090           | —             |\n| **AlertManager** | http://localhost:9093           | —             |\n| **Blackbox**     | http://localhost:9115           | —             |\n| **Payment App**  | http://localhost:8000/metrics   | —             |\n\nOpen Grafana and navigate to **Dashboards → Meta-Monitoring — Observability Reliability** to see the pre-built dashboard.\n\n---\n\n## Chaos Scenarios\n\n### Scenario 1: Kill Prometheus\n\nSimulates Prometheus being OOMKilled or evicted from a node.\n\n```bash\n./chaos/kill-prometheus.sh\n```\n\n**What happens:**\n\n```\nPrometheus Health Endpoint:\n  CONNECTION REFUSED (Prometheus is down)\n\nPayment Service:\n  OK (app is fine, but nobody is watching it)\n\nGrafana:\n  Cannot reach Prometheus — dashboards are frozen\n\nAlertManager Active Alerts:\n  Watchdog       status=active (stale — will expire)\n\nBlackbox probe for Prometheus:\n  probe_success = 0 (FAIL — Prometheus unreachable)\n```\n\nBlackbox Exporter **detects** Prometheus is down. But Prometheus is the one that reads blackbox results. Nobody is consuming the data. 
The detection is useless.\n\n### Scenario 2: Kill AlertManager\n\nSimulates AlertManager crashing or becoming unreachable.\n\n```bash\n./chaos/kill-alertmanager.sh\n```\n\n**What happens:**\n\n```\nPrometheus Firing Alerts:\n  [critical] TargetDown              target=alertmanager:9093\n  [critical] MonitoringComponentDown target=http://alertmanager:9093/-/healthy\n\nPrometheus Notification Delivery:\n  notifications_dropped_total = 10\n  notifications_errors_total  = 9\n\nBlackbox probe for AlertManager:\n  probe_success = 0 (FAIL — AlertManager unreachable)\n```\n\nPrometheus **knows** AlertManager is down. It fires critical alerts. But it delivers alerts through AlertManager — the very thing that's broken. 10 dropped notifications. Nobody gets paged.\n\n### Restore Everything\n\n```bash\n./chaos/restore-all.sh\n```\n\n---\n\n## Key Concepts\n\n### Watchdog Alert (DeadMansSwitch)\n\nAn alert that **always fires**. If it stops, your pipeline is broken.\n\n```yaml\n- alert: Watchdog\n  expr: vector(1)\n  labels:\n    severity: none\n```\n\nRoute it to an external heartbeat service (Healthchecks.io, PagerDuty, Dead Man's Snitch). If the heartbeat stops arriving, the external service alerts you through an independent path.\n\n### Blackbox Health Probes\n\nProbes the health endpoints of monitoring components themselves:\n\n- `http://prometheus:9090/-/healthy`\n- `http://alertmanager:9093/-/healthy`\n- `http://grafana:3000/api/health`\n\n### absent() Rules\n\nFrom [Article 2](https://medium.com/@rafael_muller/the-silent-killer-why-no-data-is-often-worse-than-bad-data-c811fa664371) — detects when metrics disappear. 
But as this lab demonstrates, `absent()` only works if Prometheus is alive to evaluate it.\n\n---\n\n## File Structure\n\n```\n.\n├── docker-compose.yml              # Full stack definition\n├── app/\n│   ├── Dockerfile                  # Sample payment service\n│   ├── main.py                     # Python app with Prometheus metrics\n│   └── requirements.txt\n├── prometheus/\n│   ├── prometheus.yml              # Scrape config + meta-monitoring\n│   └── alerts/\n│       ├── watchdog.yml            # Watchdog, TargetDown, meta-alerts\n│       └── absent.yml              # absent() rules\n├── alertmanager/\n│   └── alertmanager.yml            # Routing with DeadMansSwitch receiver\n├── blackbox/\n│   └── blackbox.yml                # HTTP health probe config\n├── grafana/\n│   ├── datasources.yml             # Auto-provisioned Prometheus source\n│   └── dashboards/\n│       ├── dashboard.yml           # Provisioning config\n│       └── meta-monitoring.json    # Pre-built dashboard\n└── chaos/\n    ├── kill-prometheus.sh          # Stop Prometheus\n    ├── kill-alertmanager.sh        # Stop AlertManager\n    └── restore-all.sh             # Restore all services\n```\n\n## Cleanup\n\n```bash\ndocker compose down -v\n```\n\n## License\n\nMIT\n\n---\n\n**Author:** [Rafael Muller](https://github.com/rtmuller) — Staff Cloud Engineer at Airbnb, working on platform infrastructure at 8M+ listings scale. Writing at [Medium](https://medium.com/@rafael_muller).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frtmuller%2Fobservability-reliability-lab","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frtmuller%2Fobservability-reliability-lab","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frtmuller%2Fobservability-reliability-lab/lists"}