https://github.com/mizcausevic-dev/slo-budget-tracker
SLO + error-budget tracker for Python services. FastAPI middleware, Prometheus exporter, multi-window burn-rate alerts. Part of the Platform Reliability Stack.
asgi burn-rate error-budget fastapi monitoring prometheus python reliability slo sre
- Host: GitHub
- URL: https://github.com/mizcausevic-dev/slo-budget-tracker
- Owner: mizcausevic-dev
- License: mit
- Created: 2026-05-14T22:22:37.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-05-15T19:28:01.000Z (about 1 month ago)
- Last Synced: 2026-05-15T22:18:54.612Z (about 1 month ago)
- Topics: asgi, burn-rate, error-budget, fastapi, monitoring, prometheus, python, reliability, slo, sre
- Language: Python
- Homepage: https://kineticgain.com/
- Size: 24.4 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# slo-budget-tracker
**SLO + error-budget tracker for Python services** — drop-in FastAPI middleware, Prometheus exporter, and a small standalone library you can wire into any ASGI app or background worker.
Built around the math in the [Google SRE Workbook](https://sre.google/workbook/alerting-on-slos/): one rolling window, multi-window burn-rate alerts (defaults to 1h + 6h at burn rate ≥ 14.4), and an explicit error-budget remaining gauge so dashboards stop lying about reliability.
---
## Why
Most "SLO dashboards" you find in the wild conflate _availability_ with _uptime_ and surface neither error budget nor burn rate. You can't tell, at a glance, whether the freshly deployed service is **burning the next 30 days of error budget in the next 30 minutes**. This library makes that visible by default.
Two things matter:
1. **Error budget remaining**: a ratio that starts at `1.0` (budget untouched) and drops to `0` or below once the budget is spent, shown on every dashboard.
2. **Burn rate**: `(1 − actual_success_ratio) / (1 − target)`, sampled over short windows so fast-burn incidents page before the budget is spent.
---
## Install
```bash
pip install slo-budget-tracker
# or, with the FastAPI extras:
pip install "slo-budget-tracker[fastapi]"
```
Python 3.11+. Single runtime dep: `prometheus-client`.
---
## Quick start — standalone library
```python
from slo_budget_tracker import SLODefinition, SLOTracker

tracker = SLOTracker(
    SLODefinition(
        name="availability",
        target=0.999,                     # three nines
        window_seconds=30 * 24 * 3600,    # 30-day rolling window
        burn_rate_windows=(3600, 21600),  # alert on 1h and 6h
        burn_rate_threshold=14.4,         # SRE workbook fast-burn page
    )
)

# Hot path: O(1)
tracker.record_success()
tracker.record_failure()

snap = tracker.snapshot()
print(f"success ratio: {snap.success_ratio:.4f}")
print(f"budget left: {snap.error_budget_remaining:.2%}")
print(f"burn rate: {snap.burn_rate:.2f}")

if snap.is_budget_exhausted:
    print("Freeze deploys.")

for alert in tracker.check_burn_rate():
    print(f"FAST BURN over {alert.window_seconds}s: {alert.burn_rate:.1f}x budget")
```
---
## FastAPI middleware
`SLOMiddleware` auto-classifies every HTTP response — by default 5xx and unhandled exceptions are failures, everything else is a success. Override with your own classifier when 4xx (or specific routes) should burn budget.
```python
from fastapi import FastAPI
from fastapi.responses import Response

from slo_budget_tracker import (
    PrometheusExporter,
    SLODefinition,
    SLOMiddleware,
    SLORegistry,
)

registry = SLORegistry()
registry.define(SLODefinition(name="availability", target=0.999))
registry.define(SLODefinition(name="freshness", target=0.99))

app = FastAPI()
app.add_middleware(SLOMiddleware, registry=registry, slo_name="availability")

exporter = PrometheusExporter(registry)


@app.get("/metrics")
async def metrics() -> Response:
    body, content_type = exporter.render()
    return Response(content=body, media_type=content_type)


@app.get("/slo")
async def slo_snapshot() -> dict[str, object]:
    return {"slos": [s.__dict__ for s in registry.snapshot_all()]}
```
Point your Prometheus scrape at `/metrics` and you get:
```
slo_target{slo="availability"} 0.999
slo_success_ratio{slo="availability"} 0.9991
slo_error_budget_remaining{slo="availability"} 0.42
slo_burn_rate{slo="availability",window_seconds="3600"} 2.1
slo_burn_rate{slo="availability",window_seconds="21600"} 0.8
slo_breached{slo="availability"} 0.0
```
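For intuition about what a middleware of this kind does on each request, here is a minimal pure-ASGI sketch. It is not the source of `SLOMiddleware`: the names `OutcomeMiddleware`, `is_success`, and `on_outcome` are hypothetical, and treating a request that never starts a response as a 500 is an assumption made for this illustration.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional

Scope = dict[str, Any]
Message = dict[str, Any]
Receive = Callable[[], Awaitable[Message]]
Send = Callable[[Message], Awaitable[None]]
ASGIApp = Callable[[Scope, Receive, Send], Awaitable[None]]


def is_success(status: int, exc: Optional[BaseException]) -> bool:
    """Mirror the documented default: no exception and status below 500."""
    return exc is None and status < 500


class OutcomeMiddleware:
    """Hypothetical middleware: reports one success/failure per HTTP request."""

    def __init__(self, app: ASGIApp, on_outcome: Callable[[bool], None]) -> None:
        self.app = app
        self.on_outcome = on_outcome

    async def __call__(self, scope: Scope, receive: Receive, send: Send) -> None:
        if scope["type"] != "http":
            # Pass lifespan and websocket traffic through untouched.
            await self.app(scope, receive, send)
            return

        status = 500  # assumed value if the app never starts a response

        async def capture_status(message: Message) -> None:
            nonlocal status
            if message["type"] == "http.response.start":
                status = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, capture_status)
        except Exception as exc:
            self.on_outcome(is_success(status, exc))
            raise
        self.on_outcome(is_success(status, None))


async def demo() -> list[bool]:
    """Drive two fake ASGI apps through the middleware and collect outcomes."""
    outcomes: list[bool] = []

    def make_app(status: int) -> ASGIApp:
        async def app(scope: Scope, receive: Receive, send: Send) -> None:
            await send({"type": "http.response.start", "status": status, "headers": []})
            await send({"type": "http.response.body", "body": b""})
        return app

    async def receive() -> Message:
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message: Message) -> None:
        pass  # discard the response in this demo

    for status in (200, 503):
        await OutcomeMiddleware(make_app(status), outcomes.append)(
            {"type": "http"}, receive, send
        )
    return outcomes
```

In a real service the `on_outcome` callback would forward to something like `tracker.record_success()` or `tracker.record_failure()` from the quick start above.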
---
## Custom classification
Default: anything `< 500` and no exception is a success. Want 4xx to burn budget? Pass `classify=`:
```python
app.add_middleware(
    SLOMiddleware,
    registry=registry,
    slo_name="availability",
    classify=lambda status, exc: exc is None and status < 400,
)
```
The classifier receives `(status_code, exception_or_None)` and returns `True` for success.
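
When the policy outgrows a lambda, a named function with the same `(status_code, exception_or_None)` signature is easier to test. The sketch below is one possible policy, not a recommendation from the library: the function name `classify_strict` and the set `CLIENT_FAULT_STATUSES` are hypothetical, and which 4xx codes count as the caller's fault is an assumption to adjust per service.

```python
from typing import Optional

# Hypothetical policy: 4xx codes treated as the caller's fault,
# so they do not consume error budget.
CLIENT_FAULT_STATUSES = frozenset({401, 403, 404, 422})


def classify_strict(status: int, exc: Optional[BaseException]) -> bool:
    """Return True when the request should count as a success."""
    if exc is not None:
        return False  # unhandled exceptions always burn budget
    if status >= 500:
        return False  # server errors always burn budget
    if 400 <= status < 500:
        # Only the listed client errors are forgiven; others (e.g. 429) burn budget.
        return status in CLIENT_FAULT_STATUSES
    return True
```

It would be passed the same way as the lambda above, as `classify=classify_strict`.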
---
## API surface
| Object | Purpose |
| ------------------ | ------------------------------------------------------------------ |
| `SLODefinition` | Frozen dataclass: name, target, window, burn-rate windows + threshold. Validates at construction. |
| `SLOTracker` | Records observations, computes snapshots and burn-rate alerts. |
| `SLORegistry` | Holds many named trackers; supports `snapshot_all()` and `check_burn_rates()`. |
| `SLOMiddleware` | ASGI middleware that auto-records HTTP outcomes against a tracker. |
| `PrometheusExporter` | Renders the registry as Prometheus text format on demand. |
| `Observation` | `(timestamp, success)` event. |
| `SLOSnapshot` | Point-in-time view: ratios, failures, budget remaining, burn rate. |
| `BurnRateAlert` | One short window has crossed the configured threshold. |
| `BurnRateSample` | One short-window measurement attached to a snapshot. |
---
## Burn-rate math
```
error_budget = (1 - target) * total_requests_in_window
budget_used = failures_in_window
remaining_pct = (error_budget - budget_used) / error_budget
burn_rate(short_window) = (1 - success_ratio(short_window)) / (1 - target)
```
A `burn_rate == 1.0` means the service is failing at exactly the rate the SLO allows. `burn_rate == 14.4` means the next 30-day budget is being eaten in ~2 days. The default threshold of `14.4` follows the [SRE Workbook fast-burn page](https://sre.google/workbook/alerting-on-slos/#5-multiwindow-multi-burn-rate-alerts).
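
The formulas above can be checked with a few lines of plain Python. This is a standalone sketch of the arithmetic, not the library's code: the function names are hypothetical, and returning a full budget when there is no traffic is an assumption.

```python
def error_budget_remaining(total: int, failures: int, target: float) -> float:
    """Fraction of the error budget left; goes negative once overspent."""
    if total == 0:
        return 1.0  # assumption: no traffic means an untouched budget
    budget = (1.0 - target) * total
    return (budget - failures) / budget


def burn_rate(success_ratio: float, target: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    return (1.0 - success_ratio) / (1.0 - target)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone if the burn rate holds."""
    return float("inf") if rate <= 0 else window_days / rate


def budget_consumed(rate: float, elapsed_hours: float, window_hours: float = 720.0) -> float:
    """Fraction of the whole budget spent after `elapsed_hours` at this rate."""
    return rate * elapsed_hours / window_hours


print(f"{error_budget_remaining(1_000_000, 580, 0.999):.2f}")  # 0.42
print(f"{burn_rate(0.9856, 0.999):.1f}")                       # 14.4
print(f"{days_to_exhaustion(14.4):.2f}")                       # 2.08
print(f"{budget_consumed(14.4, elapsed_hours=1):.1%}")         # 2.0%
```

The last line shows where the `14.4` threshold comes from: at that rate, one hour of burning spends 2% of a 30-day budget.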
---
## Storage backends
The default `InMemoryStore` keeps a thread-safe deque trimmed to the window. For services handling more than roughly 100 rps you'll want a sampling or bucketed backend. Wire one in by passing `store=` to `SLOTracker`. The protocol is small:
```python
from typing import Protocol


class ObservationStore(Protocol):
    def record(self, observation: Observation) -> None: ...
    def window(self, now: float, seconds: int) -> list[Observation]: ...
    def trim(self, before: float) -> None: ...
    def __len__(self) -> int: ...
```
A Redis sorted-set backend is on the roadmap (`ZADD`/`ZREMRANGEBYSCORE`); contributions welcome.
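
To show the shape of a custom backend, here is a minimal sketch of a store that satisfies the four protocol methods using a sorted list and `bisect`. Several things are assumptions: the `Observation` class below is a stand-in with the `timestamp` and `success` fields listed in the API table, `window` is taken to mean the inclusive range `[now - seconds, now]`, and `trim` is taken to drop everything strictly older than `before`. It illustrates the protocol only and does not address the high-throughput case.

```python
import bisect
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Stand-in for the library's Observation; field names are assumed."""
    timestamp: float
    success: bool


class SortedListStore:
    """Hypothetical backend: observations kept sorted by timestamp."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._timestamps: list[float] = []
        self._observations: list[Observation] = []

    def record(self, observation: Observation) -> None:
        with self._lock:
            # Insert in timestamp order so late arrivals land in the right place.
            idx = bisect.bisect_right(self._timestamps, observation.timestamp)
            self._timestamps.insert(idx, observation.timestamp)
            self._observations.insert(idx, observation)

    def window(self, now: float, seconds: int) -> list[Observation]:
        with self._lock:
            lo = bisect.bisect_left(self._timestamps, now - seconds)
            hi = bisect.bisect_right(self._timestamps, now)
            return self._observations[lo:hi]

    def trim(self, before: float) -> None:
        with self._lock:
            idx = bisect.bisect_left(self._timestamps, before)
            del self._timestamps[:idx]
            del self._observations[:idx]

    def __len__(self) -> int:
        with self._lock:
            return len(self._timestamps)
```

An instance would be handed to the tracker through the `store=` argument mentioned above.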
---
## Tests
```bash
pip install -e ".[dev]"
ruff check src tests && ruff format --check src tests
mypy src
pytest -v
```
The CI matrix runs Python 3.11 / 3.12 / 3.13.
---
## Related work in this ecosystem
This is part of the [**Platform Reliability Stack**](https://github.com/mizcausevic-dev) — small, focused libraries that compose into a production reliability story:
- **[procurement-decision-api](https://github.com/mizcausevic-dev/procurement-decision-api)** — drafts AI Procurement Decision Cards from vendor Suite documents.
- **reliability-toolkit-rs** — async rate-limit + circuit-breaker + retry + bulkhead in Rust _(coming next)_.
- More at [kineticgain.com](https://kineticgain.com/).
---
## License
MIT. See [LICENSE](LICENSE).