Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/last9/slo-computer
SLOs, Error windows and alerts are complicated. Here's an attempt to make it easy.
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/last9/slo-computer
- Owner: last9
- Created: 2021-04-29T02:53:24.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2023-11-18T06:04:05.000Z (10 months ago)
- Last Synced: 2024-06-20T10:17:13.418Z (3 months ago)
- Topics: metrics, observability, service-level-indicator, service-level-objective, sla, sli, slo, sre, sre-team
- Language: Go
- Homepage:
- Size: 47.9 KB
- Stars: 128
- Watchers: 24
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-platform-engineering-tools - SLO Calculator
- awesome-sre - SLO Computer - SLOs, Error windows and alerts are complicated. Here's an attempt to make it easy. (11. Tools)
README
# slo-computer
> [!Note]
> Around two years ago @last9 started advocating using Service Level Objectives.
> One of the biggest challenges was the lack of practical algorithms behind Burn Rate and alerting. This was our first attempt at it.
> If you would like us to release these algorithms, go ahead and help us reach 250 stars ⭐️.

SLOs, Error windows and alerts are complicated. Here's an attempt to make it
easy.

SLO, burn_rate, error_rate, budget_spend are convoluted terms that can throw one
off. Even the SRE workbook by Google can leave you with a lot of open questions.

The concept of SLOs and SLIs has existed for a long time now, but we continue to
be amazed by how widely misunderstood this topic is (and how easy it can make
your lives if used well).

We are building a sandbox for our DevOps and SRE community - SLO computer - a
product that makes setting and monitoring SLOs for all your services intuitively
seamless and blazingly fast.

### Usage
```bash
usage: slo [] [ ...]

Last9 SLO toolkit

Flags:
  --help     Show context-sensitive help (also try --help-long and --help-man).
  --version  Show application version.

Commands:
  help [...]
    Show help.

  suggest --throughput=THROUGHPUT --slo=SLO --duration=DURATION
    suggest alerts based on service throughput and SLO duration

  cpu-suggest --instance=INSTANCE --utilization=UTILIZATION
    suggest alerts based on CPU utilization and Instance type
```

The goal of this command (it has an importable lib too) is to factor in some "bare
minimum" input to:

- Decide whether this is a low-traffic service, in which case it makes little sense to use an
SLO approach
- Compute the _actual_ alert values and conditions to set alerts on

### Examples
**Q: What alerts should I set for my service to achieve 99.9 % availability over
30 days**

```bash
✗ ./slo-computer suggest --throughput=4200 --slo=99.9 --duration=720

Alert if error_rate > 0.002 for last [24h0m0s] and also last [2h0m0s]
This alert will trigger once 6.67% of error budget is consumed,
and leaves 360h0m0s before the SLO is defeated.

Alert if error_rate > 0.010 for last [1h0m0s] and also last [5m0s]
This alert will trigger once 1.39% of error budget is consumed,
and leaves 72h0m0s before the SLO is defeated.
```

**Q: What alerts should I set for my service with throughput 100rpm to achieve
99.9 % availability over 7 days**

```bash
✗ ./slo-computer suggest --throughput=100 --slo=99.9 --duration=168
slo-computer: error:
If this service reported 10.000 errors for a duration of 5m0s
SLO (for the entire duration) will be defeated wihin 1h40m47s

Probably
- Use ONLY spike alert model, and not SLOs (easiest)
- Reduce the MTTR for this service (toughest)
- SLO is too aggressive and can be lowerd (business decision)
- Combine multiple services into one single service (teamwide)
, try --help
```

**Q: What alerts should I set for my burst CPU**

```bash
✗ ./slo-computer cpu-suggest --instance=t3a.xlarge --utilization=15

Alert if 100.00 % consumption sustains for 10m0s AND recent 5m0s.
At this rate, burst credits will deplete after 10h0m0s

Alert if 80.00 % consumption sustains for 3h45m0s AND recent 55m0s.
At this rate, burst credits will deplete after 15h0m0s
```

---
# About Last9
This project is sponsored and maintained by [Last9](https://last9.io). Last9 builds reliability tools for SRE and DevOps.