https://github.com/gocardless/slo-builder

Templates for building SLOs with Prometheus rules and alerts

# slo-builder [![Documentation](https://godoc.org/github.com/gocardless/slo-builder/pkg/templates?status.svg)](http://godoc.org/github.com/gocardless/slo-builder/pkg/templates)

This repo provides a framework that developers can use to specify system SLOs
without requiring in-depth Prometheus knowledge.

## Installation

Run the following command **outside of a `go.mod` repository** to install the `slo-builder` binary.

```bash
go install github.com/gocardless/slo-builder/cmd/slo-builder@latest
```

You also need to ensure your `go/bin` directory is available in your `$PATH`:

```bash
export PATH="$HOME/go/bin:$PATH"
```

## Examples

See [`example-definitions.yaml`](./example-definitions.yaml) and [`example-rules.yaml`](./example-rules.yaml) for full examples of all available SLO templates.

## Why?

SLOs are often formulated in business terms first, then translated into
monitoring system rules. Good SLOs should be formed as a ratio of good events to
total events, and come with an associated error budget: the margin of error
you'd expect to consume in normal operation.

By forcing a homogeneous format for every SLO, it becomes possible to apply
generically useful rules to all different types of SLO. This matters all the
more because such rules are tricky to implement, and the learning required to
produce them is substantial.

## Steps to an SLO

You start with a system, often with a number of SLIs. You then:

1. Formulate SLOs in business terms
2. Implement the SLOs in the monitoring system (Prometheus)
3. Write multi-window alerts for burning error budgets

System components will need different categories of SLO: an SLO for HTTP
requests will be structured differently from one for a batch processing system,
for example. This framework offers a collection of predefined templates that map
to different types of system, and can help someone unfamiliar with SLOs quickly
produce rules that Just Work.

When formulating SLOs in business terms (1), you can use these predefined
templates to help inform your selections. This framework can then produce the
rules (2) that generate a common input to multi-windowed alerts (3), which are
included at the end of this SLO pipeline.

## `BaseSLO`

Every SLO has some common behaviour, as represented by the `BaseSLO` type. An
example core SLO definition would be:

```go
BaseSLO{
	Name:        "MarkPaymentsAsPaidMeetsDeadline",
	ErrorBudget: 0.1,
}
```

In rule form, we produce a `job:slo_definition:none` rule which tracks the
parameters of the base SLO. This writes a time-series that can be inspected for
how the SLO definition changed with time.

We also produce a `job:slo_error_budget:ratio` rule, which is used at the end of
the SLO pipeline to apply alerting rules. Each of these rules carries a `name`
label that is assumed to be unique to each SLO, allowing Prometheus to join
series on that label.

```
job:slo_definition:none{name="MarkPaymentsAsPaidMeetsDeadline",error_budget="0.1"} 1.0
job:slo_error_budget:ratio{name="MarkPaymentsAsPaidMeetsDeadline"} 0.1
```
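
To make the mapping from definition to series concrete, here is a minimal sketch that renders the two samples above from a `BaseSLO` value. The local `BaseSLO` copy and the helpers `definitionSeries` and `errorBudgetSeries` are assumptions made for illustration. They are not part of the slo-builder API, which emits Prometheus recording rules rather than formatted strings.

```go
package main

import "fmt"

// BaseSLO is a local stand-in with the two fields used in the example above
// (an assumption so this sketch compiles on its own).
type BaseSLO struct {
	Name        string
	ErrorBudget float64
}

// definitionSeries is a hypothetical helper: it renders the definition sample,
// carrying the SLO parameters as labels with a constant value of 1.0.
func definitionSeries(s BaseSLO) string {
	return fmt.Sprintf(`job:slo_definition:none{name=%q,error_budget="%g"} 1.0`,
		s.Name, s.ErrorBudget)
}

// errorBudgetSeries is a hypothetical helper: it renders the error budget
// sample, keyed only by the name label so it can be joined against later.
func errorBudgetSeries(s BaseSLO) string {
	return fmt.Sprintf(`job:slo_error_budget:ratio{name=%q} %g`, s.Name, s.ErrorBudget)
}

func main() {
	slo := BaseSLO{Name: "MarkPaymentsAsPaidMeetsDeadline", ErrorBudget: 0.1}
	fmt.Println(definitionSeries(slo))
	fmt.Println(errorBudgetSeries(slo))
}
```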

## `BatchProcessingSLO`

As an example, take a batch process that transitions many payments into a paid
state, to which we want to apply an SLO.

```go
MarkPaymentsAsPaidMeetsDeadline = BatchProcessingSLO{
	BaseSLO:  BaseSLO{"MarkPaymentsAsPaidMeetsDeadline", 0.1},
	Deadline: 2 * time.Hour,
	// Estimated max volume: 1.5x the largest 8h increase seen in the last 60 days.
	Volume: `
1.5 * max_over_time(
  (
    sum by (namespace, release) (
      increase(paysvc_mark_payments_as_paid_marked_as_paid_total[8h])
    )
  )[60d:1h]
)`,
	// Current throughput per second over 1m, only while the job is active.
	Throughput: `
sum by (namespace, release) (
  rate(paysvc_mark_payments_as_paid_marked_as_paid_total[1m])
) > 0`,
}
```

Users provide a deadline by which an entire batch must complete, along with an
estimate of maximum volume and a measurement of current throughput. This is
enough information to infer a target throughput (how fast items must be
processed to hit the deadline), which is used to score each minute of activity
from the job.

In total, we produce three rules for this specific SLO:

```
job:slo_batch_volume:max{name="MarkPaymentsAsPaidMeetsDeadline",namespace="production",release="paysvc-live"} 2058877.53186143
job:slo_batch_throughput_target:max{name="MarkPaymentsAsPaidMeetsDeadline",namespace="production",release="paysvc-live"} 285.95521275853196
job:slo_batch_throughput:interval{name="MarkPaymentsAsPaidMeetsDeadline",namespace="production",release="paysvc-live"} 369.7091257289802
```

These rules are then consumed by a generic batch processing rule that translates
into `job:slo_error:ratio` rules for each common alert window. The most
important rule is the generation of an error 'score' for the batch process,
which looks like:

```yaml
- record: job:slo_batch_error:interval
  expr: |
    1.0 - clamp_max(
      job:slo_batch_throughput:interval / job:slo_batch_throughput_target:max,
      1.0
    )
```

In this context, `job:slo_batch_error:interval` is the error score for each
interval of the throughput given in the SLO specification. For
MarkPaymentsAsPaid, with the target throughput of 285 payments/s calculated over
a 1m window, you'd have the following scores:

| Throughput (payments/s) | `job:slo_batch_throughput_target:max` | `job:slo_batch_error:interval` |
| --- | --- | --- |
| 0 | 285 | 100% |
| 100 | 285 | 65% |
| 285 | 285 | 0% |
| 300 | 285 | 0% |

This means you burn your error budget when the batch job performs below the
target throughput, and the rate at which you burn it depends on how far you
fall short. It's also important to note that minutes where the throughput
greatly exceeds the target don't 'recoup' error budget. This is an
implementation decision, and might be the wrong choice.
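
The scoring rule can be mirrored in plain Go to reproduce the table above. This is a sketch of the same arithmetic as the recording rule, not code from the repo: `errorScore` is a hypothetical name, and the target is assumed to be greater than zero.

```go
package main

import (
	"fmt"
	"math"
)

// errorScore mirrors 1.0 - clamp_max(throughput / target, 1.0).
// Capping the ratio at 1.0 is what stops fast minutes from recouping budget.
// Assumes target > 0.
func errorScore(throughput, target float64) float64 {
	return 1.0 - math.Min(throughput/target, 1.0)
}

func main() {
	const target = 285.0 // payments/s, rounded as in the table above
	for _, throughput := range []float64{0, 100, 285, 300} {
		fmt.Printf("throughput=%3.0f error=%.0f%%\n",
			throughput, 100*errorScore(throughput, target))
	}
}
```

A throughput of 300 scores the same as 285 because the ratio is capped before it is subtracted from 1.0.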

## Alerting

Every SLO template conforms to our definition of an SLO, which is something that
has a name, associated error budget and a constantly refreshed error ratio. In
Prometheus terms, that means your SLOs will eventually produce the following
time series:

- `job:slo_error:ratio1m`
- `job:slo_error:ratio5m`
- `job:slo_error:ratio30m`
- `job:slo_error:ratio1h`
- `job:slo_error:ratio2h`
- `job:slo_error:ratio6h`
- `job:slo_error:ratio1d`
- `job:slo_error:ratio7d`

As we get these series for every SLO, we can write generic alerting rules that
work across any SLO. Building useful alerts on SLO measurements is more complex
than it might seem, so sharing one set of generic alerts keeps each SLO
definition simple.

We use a combination of the [SRE
workbook](https://landing.google.com/sre/workbook/chapters/alerting-on-slos/)
and [SoundCloud: Alerting on SLOs like
Pros](https://developers.soundcloud.com/blog/alerting-on-slos) to form
multi-window error budget burn alerts. The term 'multi-window' indicates that
alerts are only triggered when error budget is being burned in both short and
long-term intervals: this reduces alert false positives and improves alert reset
time, causing alerts to resolve as soon as the problem has been corrected
instead of hours after.

Depending on the urgency of the detected error, we'll either page an on-call
engineer or open a ticket to handle the error budget burn in business hours. The
detection sensitivity windows are listed here:

| Alert | Long Window | Short Window | `for` Duration | Burn Rate Factor | Error Budget Consumed |
| --- | --- | --- | --- | --- | --- |
| Page | 1h | 5m | 2m | 14.4 | 2% |
| Page | 6h | 30m | 15m | 6 | 5% |
| Ticket | 1d | 2h | 1h | 3 | 10% |
| Ticket | 3d | 6h | 1h | 1 | 10% |
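
The "Error Budget Consumed" column follows from the burn rate and the long window. The sketch below reproduces it, assuming a 30-day SLO period (an assumption, chosen because it matches every row of the table). It also shows the multi-window condition in the form the SRE workbook describes, where an alert fires only when both windows exceed the burn rate multiplied by the error budget. The `burnAlert` type and its methods are hypothetical, the `for` duration is left out, and the rules slo-builder actually generates may differ in detail.

```go
package main

import (
	"fmt"
	"time"
)

// sloPeriod is an assumption: 30 days reproduces the table's last column.
const sloPeriod = 30 * 24 * time.Hour

// burnAlert is a hypothetical type holding one row of the table above.
type burnAlert struct {
	Severity    string
	LongWindow  time.Duration
	ShortWindow time.Duration
	BurnRate    float64
}

// alerts lists the four rows of the table.
var alerts = []burnAlert{
	{"Page", 1 * time.Hour, 5 * time.Minute, 14.4},
	{"Page", 6 * time.Hour, 30 * time.Minute, 6},
	{"Ticket", 24 * time.Hour, 2 * time.Hour, 3},
	{"Ticket", 72 * time.Hour, 6 * time.Hour, 1},
}

// budgetConsumed is the fraction of the period's error budget spent if errors
// burn at BurnRate for the whole long window.
func (a burnAlert) budgetConsumed() float64 {
	return a.BurnRate * a.LongWindow.Hours() / sloPeriod.Hours()
}

// firing reports whether both the long and short window error ratios exceed
// BurnRate times the error budget (the multi-window condition).
func (a burnAlert) firing(longRatio, shortRatio, errorBudget float64) bool {
	threshold := a.BurnRate * errorBudget
	return longRatio > threshold && shortRatio > threshold
}

func main() {
	for _, a := range alerts {
		fmt.Printf("%-6s long=%v short=%v burn=%g consumed=%.0f%%\n",
			a.Severity, a.LongWindow, a.ShortWindow, a.BurnRate, 100*a.budgetConsumed())
	}

	// With a 1% error budget the first page alert has a threshold of 0.144.
	page := alerts[0]
	fmt.Println(page.firing(0.2, 0.3, 0.01))  // both windows burning: true
	fmt.Println(page.firing(0.2, 0.05, 0.01)) // short window recovered: false
}
```

The second call shows why alerts reset quickly: once the short window drops below the threshold, the alert stops firing even though the long window still reflects the earlier burn.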

Every SLO created with this framework is automatically subscribed to these
alerts. Where they get routed (both who is paged, and where a ticket gets
created) depends on the team assigned to the SLO.