https://github.com/rsionnach/nthlayer
Generate the complete reliability stack from a service spec in 5 minutes. Dashboards, alerts, SLOs, PagerDuty - zero toil.
- Host: GitHub
- URL: https://github.com/rsionnach/nthlayer
- Owner: rsionnach
- Created: 2025-11-26T20:23:55.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-01-12T22:03:43.000Z (23 days ago)
- Last Synced: 2026-01-13T01:57:23.971Z (23 days ago)
- Topics: alerts, devops, grafana, monitoring, observability, pagerduty, prometheus, python, slo, sre
- Language: Python
- Homepage: https://rsionnach.github.io/nthlayer/
- Size: 14.4 MB
- Stars: 13
- Watchers: 0
- Forks: 1
- Open Issues: 30
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- Agents: AGENTS.md
# NthLayer
### The Missing Layer of Reliability
**Reliability requirements as code.**
NthLayer lets you define what "production-ready" means for a service,
then generates, validates, and enforces those requirements automatically.
**Define once. Generate everything. Block bad deploys.**
---
## The Problem
For every new service, teams are expected to:
- Manually create dashboards
- Hand-craft alerts and recording rules
- Define SLOs and error budgets
- Configure incident escalation
- Decide if a service is "ready" for production
These decisions are usually made **after deployment**, enforced **inconsistently**, or revisited **only during incidents**.
## The Solution
NthLayer moves reliability left in the delivery lifecycle:
```
┌─────────────────────────────────────────────────────────────────────────────┐
│  service.yaml → generate → lint → verify → check-deploy → deploy            │
│                    ↓        ↓       ↓           ↓                           │
│                artifacts  valid? metrics?   budget ok?                      │
│                                                                             │
│  "Is this production-ready?" - answered BEFORE deployment                   │
└─────────────────────────────────────────────────────────────────────────────┘
```
```bash
# In your Tekton/GitHub Actions pipeline:
nthlayer apply service.yaml --lint # Generate + validate PromQL syntax
nthlayer verify service.yaml # Verify declared metrics exist
nthlayer check-deploy service.yaml # Check error budget gate
# Only if all pass: deploy to production
```
Works with: **Tekton**, **GitHub Actions**, **GitLab CI**, **ArgoCD**, **Mimir/Cortex**
---
## 🚦 Shift Left Features
| Command | What It Does | Pipeline Exit Code |
|---------|--------------|-------------------|
| `nthlayer verify` | Validates declared metrics exist in Prometheus | 1 if missing metrics |
| `nthlayer check-deploy` | Checks error budget - blocks if exhausted | 2 if budget exhausted |
| `nthlayer drift` | Detects reliability degradation trends over time | 1 on warning, 2 on critical |
| `nthlayer apply --lint` | Validates PromQL syntax with pint | 1 if invalid queries |
### Deployment Gate Example
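
A minimal sketch of such a gate, assuming only the three pipeline commands and the exit codes listed in the table above. The `gate_steps` and `run_gate` helpers and the injectable `runner` parameter are hypothetical names for illustration, not part of NthLayer; many pipelines would simply call the CLI from their CI steps instead. A CI job could call `run_gate("service.yaml")` and exit with the returned code, so that a failed lint or verify step (exit 1) or an exhausted error budget (exit 2) stops the deploy.

```python
import subprocess


def gate_steps(spec):
    """Return the gate commands in pipeline order (commands as shown in this README)."""
    return [
        ("lint", ["nthlayer", "apply", spec, "--lint"]),
        ("verify", ["nthlayer", "verify", spec]),
        ("check-deploy", ["nthlayer", "check-deploy", spec]),
    ]


def run_gate(spec, runner=subprocess.call):
    """Run each step in order and stop at the first non-zero exit code.

    `runner` takes a command list and returns its exit code. It defaults to
    subprocess.call and can be swapped for a stub in tests.
    Returns (ok, failed_step, exit_code).
    """
    for name, cmd in gate_steps(spec):
        code = runner(cmd)
        if code != 0:
            # Any non-zero code blocks the deploy; the code itself is passed through.
            return False, name, code
    return True, None, 0
```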
---
## ⚡ Quick Start
```bash
pipx install nthlayer
nthlayer apply service.yaml
# Output: generated/payment-api/
#   ├── dashboard.json        → Grafana
#   ├── alerts.yaml           → Prometheus
#   ├── slos.yaml             → OpenSLO
#   └── recording-rules.yaml  → Prometheus
```
---
## What NthLayer Is
- A **reliability specification** that defines production-readiness
- A **compiler** from service intent to operational reality
- A **CI/CD-native** way to standardize reliability across teams
NthLayer integrates with existing tools (Prometheus, Grafana, PagerDuty) but operates **before** them - deciding what is allowed to reach production.
## What NthLayer Is Not
- Not a service catalog
- Not an observability platform
- Not an incident management system
- Not a runtime control plane
NthLayer **complements** these systems by ensuring services meet reliability expectations before they are deployed.
## Why NthLayer?
| With NthLayer | Without NthLayer |
|---------------|------------------|
| Platform teams encode reliability standards **once** | Standards recreated per service |
| Service teams inherit sane defaults **automatically** | Each team invents their own |
| "Is this production-ready?" = **deterministic check** | "Is this ready?" = negotiated opinion |
| Reliability is **enforced by default** | Reliability is **reactive and inconsistent** |
---
## 📥 What You Put In
### 1. Service Spec (`service.yaml`)
```yaml
# Minimal example (5 lines)
name: payment-api
tier: critical
type: api
dependencies:
  - postgresql
```
### 2. Environment Variables (optional)
```bash
# 📟 PagerDuty - auto-create team, escalation policy, service
export PAGERDUTY_API_KEY=...
# 📊 Grafana - auto-push dashboards
export NTHLAYER_GRAFANA_URL=...
export NTHLAYER_GRAFANA_API_KEY=...
export NTHLAYER_GRAFANA_ORG_ID=1 # Default: 1
# 🔍 Prometheus - metric discovery for intent resolution
export NTHLAYER_PROMETHEUS_URL=...
export NTHLAYER_METRICS_USER=... # If auth required
export NTHLAYER_METRICS_PASSWORD=...
```
---
## 📤 What You Get Out
| Output | File | Deploy To |
|--------|------|-----------|
| 📊 Dashboard | `generated/<service>/dashboard.json` | Grafana |
| 🚨 Alerts | `generated/<service>/alerts.yaml` | Prometheus |
| 🎯 SLOs | `generated/<service>/slos.yaml` | OpenSLO-compatible |
| ⚡ Recording Rules | `generated/<service>/recording-rules.yaml` | Prometheus |
| 📟 PagerDuty | Created via API | Team, escalation policy, service |
---
## 📊 SLO Portfolio
Track reliability across your entire organization:
```bash
nthlayer portfolio # Org-wide reliability view
nthlayer portfolio --format json # Machine-readable for dashboards
nthlayer slo collect service.yaml # Query current budget from Prometheus
```
---
## 📝 Full Service Example
```yaml
name: payment-api
tier: critical            # critical | standard | low
type: api                 # api | worker | stream
team: payments
slos:
  availability: 99.95     # Generates Prometheus alerts
  latency_p99_ms: 200     # Generates histogram queries
dependencies:
  - postgresql            # Adds PostgreSQL panels
  - redis                 # Adds Redis panels
  - kubernetes            # Adds K8s pod metrics
pagerduty:
  enabled: true
  support_model: self     # self | shared | sre | business_hours
```
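
The allowed values noted in the comments above can be checked before a spec reaches the generator. The following is a hedged sketch, not NthLayer's own validation (policy checks are handled by `nthlayer validate-spec`). `validate_spec` is a hypothetical helper that takes an already-parsed spec as a plain dictionary and reports fields outside the documented values. Treating `name`, `tier`, and `type` as required is an assumption based on the minimal example.

```python
# Allowed values copied from the comments in the full service example.
ALLOWED = {
    "tier": {"critical", "standard", "low"},
    "type": {"api", "worker", "stream"},
}
SUPPORT_MODELS = {"self", "shared", "sre", "business_hours"}


def validate_spec(spec):
    """Return a list of human-readable problems; an empty list means none found."""
    errors = []
    # Assumption: name, tier, and type are mandatory (they appear in the minimal example).
    if not spec.get("name"):
        errors.append("name is required")
    for field, allowed in ALLOWED.items():
        value = spec.get(field)
        if value not in allowed:
            errors.append(f"{field}={value!r} is not one of {sorted(allowed)}")
    # The pagerduty block is optional; only check support_model when it is set.
    model = spec.get("pagerduty", {}).get("support_model")
    if model is not None and model not in SUPPORT_MODELS:
        errors.append(f"pagerduty.support_model={model!r} is not one of {sorted(SUPPORT_MODELS)}")
    return errors
```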
---
## 💰 The Value
### Generation: 20 hours → 5 minutes per service
| Task | Manual Effort | With NthLayer |
|------|---------------|---------------|
| 🎯 Define SLOs & error budgets | 6 hours | Generated from tier |
| 🚨 Research & configure alerts | 4 hours | 400+ battle-tested rules |
| 📊 Build Grafana dashboards | 5 hours | 12-28 panels auto-generated |
| 📟 PagerDuty escalation setup | 2 hours | Tier-based defaults |
| 📋 Write recording rules | 3 hours | 20+ pre-computed metrics |
### Validation: Catch issues before production
| Problem | Without NthLayer | With NthLayer |
|---------|------------------|---------------|
| Missing metrics | Discover after deploy | `nthlayer verify` blocks promotion |
| Invalid PromQL | Prometheus rejects rules | `--lint` catches in CI |
| Policy violations | Manual review | `nthlayer validate-spec` enforces |
| Exhausted budget | Deploy anyway, incident | `check-deploy` blocks risky deploys |
### At Scale
| Scale | Generation Saved | Incidents Prevented* |
|-------|------------------|---------------------|
| 🚀 50 services | 996 hours ($100K) | ~12/year |
| 📈 200 services | 3,983 hours ($400K) | ~48/year |
| 🏢 1,000 services | 19,917 hours ($2M) | ~240/year |
*Estimate assumes a 60% reduction in "missing monitoring" incidents. Dollar values assume an engineering cost of $100/hr.
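
The hour and dollar figures in the table follow from the per-service numbers under "Generation". As a worked check, assume each service saves the 20 hours of manual effort minus the 5 minutes of generation time, priced at the stated $100/hr (the name `generation_savings` is illustrative). This gives 996, 3,983, and 19,917 hours, matching the table; the corresponding dollar values (roughly $99.6K, $398K, and $1.99M) appear in the table rounded to $100K, $400K, and $2M.

```python
# Per-task manual estimates from the Generation table: 6 + 4 + 5 + 2 + 3 = 20 hours.
MANUAL_HOURS = 6 + 4 + 5 + 2 + 3
GENERATED_HOURS = 5 / 60      # "5 minutes per service"
HOURLY_RATE_USD = 100         # engineering cost stated in the footnote


def generation_savings(services):
    """Return (hours saved, dollars saved), each rounded to the nearest whole number."""
    hours = services * (MANUAL_HOURS - GENERATED_HOURS)
    return round(hours), round(hours * HOURLY_RATE_USD)


for n in (50, 200, 1000):
    hours, usd = generation_savings(n)
    print(f"{n:>5} services: {hours:,} hours, ${usd:,}")
```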
---
## 🧠 How It Works
### Generation
| Step | What Happens |
|------|--------------|
| 🎯 **Intent Resolution** | Maps "availability SLO" → best matching PromQL query |
| 🔀 **Type Routing** | API services get HTTP metrics, workers get job metrics |
| ⚡ **Tier Defaults** | Critical = 99.95% SLO + 5 min escalation; Low = 99.5% SLO + 60 min escalation |
| 🏗️ **Technology Templates** | 23 built-in: PostgreSQL, Redis, Kafka, MongoDB, etc. |
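
To make the tier defaults concrete, an availability target translates into an error budget: the amount of downtime allowed within a window. The sketch below uses only the two tiers whose defaults are stated in the table and assumes a 30-day window, which is a common convention but not something this README specifies. `TIER_DEFAULTS` and `error_budget_minutes` are illustrative names, not NthLayer internals.

```python
# Defaults for the two tiers listed in the table; other tiers are not stated there.
TIER_DEFAULTS = {
    "critical": {"availability": 99.95, "escalation_minutes": 5},
    "low": {"availability": 99.5, "escalation_minutes": 60},
}
WINDOW_DAYS = 30  # assumption: rolling 30-day SLO window


def error_budget_minutes(availability_pct, window_days=WINDOW_DAYS):
    """Minutes of allowed unavailability for a given availability target."""
    window_minutes = window_days * 24 * 60
    return (100 - availability_pct) / 100 * window_minutes


for tier, defaults in TIER_DEFAULTS.items():
    budget = error_budget_minutes(defaults["availability"])
    print(f"{tier}: {budget:.1f} min of downtime per {WINDOW_DAYS} days")
```

Under these assumptions the critical tier allows about 21.6 minutes of downtime per 30 days and the low tier about 216 minutes. This budget is the quantity that `nthlayer check-deploy` is described as gating on.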
### CI/CD Pipeline
```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Generate   │───▶│  Validate   │───▶│   Protect   │───▶│   Deploy    │
│  nthlayer   │    │   --lint    │    │ check-deploy│    │   kubectl   │
│    apply    │    │   verify    │    │             │    │   argocd    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
       │                  │                  │
       ▼                  ▼                  ▼
   artifacts          exit 1 if          exit 2 if
    to git             invalid       budget exhausted
```
Works with: **GitHub Actions**, **GitLab CI**, **ArgoCD**, **Tekton**, **Jenkins**
---
## 🛠️ CLI Commands
### Generate
```bash
nthlayer init # Interactive service.yaml creation
nthlayer plan service.yaml # Preview what will be generated
nthlayer apply service.yaml # Generate all artifacts
nthlayer apply --push # Also push dashboard to Grafana
nthlayer apply --push-ruler # Push alerts to Mimir/Cortex Ruler API
```
### Validate
```bash
nthlayer apply --lint # Validate PromQL syntax (pint)
nthlayer validate-spec service.yaml # Check against policies (OPA/Rego)
nthlayer verify service.yaml # Verify metrics exist in Prometheus
```
### Protect
```bash
nthlayer check-deploy service.yaml # Check error budget gate (exit 2 = blocked)
nthlayer drift service.yaml # Analyze reliability drift trends
nthlayer portfolio # Org-wide SLO health
nthlayer portfolio --drift # Include drift analysis in portfolio
nthlayer slo collect service.yaml # Query current budget from Prometheus
```
---
## 🔮 Roadmap
| Feature | Description | Status |
|---------|-------------|--------|
| 💰 **Error Budgets** | Track budget consumption, correlate with deploys | ✅ Done |
| 📊 **SLO Portfolio** | Org-wide reliability view across all services | ✅ Done |
| 🚦 **Deployment Gates** | Block deploys when error budget exhausted | ✅ Done |
| ✅ **Contract Verification** | Verify declared metrics exist before promotion | ✅ Done |
| 📉 **Drift Detection** | Detect reliability degradation trends, project budget exhaustion | ✅ Done |
| 📝 **Loki Integration** | Generate LogQL alert rules, technology-specific log patterns | 🔨 Next |
| 🤖 **AI Generation** | Conversational service.yaml creation via MCP | 📋 Planned |
---
## 📦 Installation
```bash
# Recommended
pipx install nthlayer
# Or with pip
pip install nthlayer
# Verify
nthlayer --version
```
---
## 🌐 Live Demo
See NthLayer in action with real Grafana dashboards and generated configs:
- [Live Grafana dashboards](https://nthlayer.grafana.net)
- [Demo site](https://rsionnach.github.io/nthlayer/demo/)
---
## 📚 Documentation
**[Full Documentation](https://rsionnach.github.io/nthlayer/)** - Comprehensive guides and reference.
[DeepWiki](https://deepwiki.com/rsionnach/nthlayer)
| Quick Links | |
|-------------|---|
| 🚀 [Quick Start](https://rsionnach.github.io/nthlayer/getting-started/quick-start/) | Get running in 5 minutes |
| 🔧 [Setup Wizard](https://rsionnach.github.io/nthlayer/commands/setup/) | Interactive configuration |
| 📊 [SLO Portfolio](https://rsionnach.github.io/nthlayer/commands/portfolio/) | Org-wide reliability view |
| 🔌 [18 Technologies](https://rsionnach.github.io/nthlayer/integrations/technologies/) | PostgreSQL, Redis, Kafka... |
| 📖 [CLI Reference](https://rsionnach.github.io/nthlayer/reference/cli/) | All commands |
| 🤝 [Contributing](CONTRIBUTING.md) | How to contribute |
To build the docs locally:
```bash
uv sync --extra docs
uv run mkdocs serve  # Serves the docs at http://localhost:8000
```
---
## 🤝 Contributing
```bash
# Install uv (https://docs.astral.sh/uv/)
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/rsionnach/nthlayer.git
cd nthlayer
make setup # Install deps, start services
make test # Run tests
```
See [CONTRIBUTING.md](CONTRIBUTING.md) for details.
---
## 📄 License
MIT - See [LICENSE.txt](LICENSE.txt)
---
## 🙏 Acknowledgments
### Core Dependencies
- [grafana-foundation-sdk](https://github.com/grafana/grafana-foundation-sdk) - Dashboard generation SDK (Apache 2.0)
- [awesome-prometheus-alerts](https://github.com/samber/awesome-prometheus-alerts) - 580+ battle-tested alert rules (CC BY 4.0)
- [pint](https://github.com/cloudflare/pint) - PromQL linting and validation (Apache 2.0)
- [conftest](https://github.com/open-policy-agent/conftest) / [OPA](https://github.com/open-policy-agent/opa) - Policy validation (Apache 2.0)
- [PagerDuty Python SDK](https://github.com/PagerDuty/pdpyras) - Incident management integration (MIT)
### Architecture Inspiration
- [autograf](https://github.com/FUSAKLA/autograf) - Dynamic Prometheus metric discovery
- [Sloth](https://github.com/slok/sloth) - SLO specification and burn rate calculations
- [OpenSLO](https://github.com/openslo/openslo) - SLO specification standard
### CLI & Documentation
- [Rich](https://github.com/Textualize/rich) - Terminal formatting and styling (MIT)
- [Questionary](https://github.com/tmbo/questionary) - Interactive CLI prompts (MIT)
- [MkDocs Material](https://github.com/squidfunk/mkdocs-material) - Documentation theme (MIT)
- [VHS](https://github.com/charmbracelet/vhs) - Terminal demo recordings (MIT)
- [Nord Theme](https://www.nordtheme.com/) - Color palette inspiration (MIT)
### Tooling
- [Shields.io](https://shields.io/) - Badges
- [Slidev](https://sli.dev/) - Presentation framework