{"id":13463700,"url":"https://github.com/last9/slo-computer","last_synced_at":"2025-04-06T00:10:21.475Z","repository":{"id":40540039,"uuid":"362671182","full_name":"last9/slo-computer","owner":"last9","description":"SLOs, Error windows and alerts are complicated. Here an attempt to make it easy","archived":false,"fork":false,"pushed_at":"2025-03-04T23:12:42.000Z","size":67,"stargazers_count":130,"open_issues_count":1,"forks_count":3,"subscribers_count":23,"default_branch":"master","last_synced_at":"2025-03-29T23:11:11.033Z","etag":null,"topics":["metrics","observability","service-level-indicator","service-level-objective","sla","sli","slo","sre","sre-team"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/last9.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-04-29T02:53:24.000Z","updated_at":"2025-03-04T22:59:20.000Z","dependencies_parsed_at":"2022-06-27T21:39:24.067Z","dependency_job_id":"79f68b8c-4c56-4b4f-9aa8-8ab56deb0da1","html_url":"https://github.com/last9/slo-computer","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/last9%2Fslo-computer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/last9%2Fslo-computer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/last9%2Fslo-computer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/last9%2Fslo-computer/manifests","ow
ner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/last9","download_url":"https://codeload.github.com/last9/slo-computer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247415973,"owners_count":20935387,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["metrics","observability","service-level-indicator","service-level-objective","sla","sli","slo","sre","sre-team"],"created_at":"2024-07-31T14:00:26.801Z","updated_at":"2025-04-06T00:10:21.455Z","avatar_url":"https://github.com/last9.png","language":"Go","funding_links":[],"categories":["Service Level Objectives","Incident Management / Incident Response / IT Alerting / On-Call","11. Tools"],"sub_categories":["Tools","Container Orchestration"],"readme":"\u003ca href=\"https://last9.io\"\u003e\u003cimg src=\"https://last9.github.io/assets/last9-github-badge.svg\" align=\"right\" /\u003e\u003c/a\u003e\n\n# SLO Computer\n\n\u003e [!Note]\n\u003e @last9 advocates using Service Level Objectives.\n\u003e One of the biggest challenges we run into is the lack of practical algorithms behind Burn Rate and alerting. This is our first attempt at it.\n\n## What is SLO Computer?\n\nSLO Computer simplifies the complex world of Service Level Objectives (SLOs), error budgets, and alerting. \n\nSLOs, error windows, burn rates, and budget spend are convoluted terms that can throw anyone off. Even the SRE workbook by Google can leave you with a lot of open questions. 
We continue to be amazed by how widely misunderstood this topic is (and how much easier it can make your life when used well).\n\nThis toolkit helps SREs and DevOps engineers:\n\n- Calculate appropriate alert thresholds based on service throughput and desired SLO targets\n- Determine if a service has enough traffic to benefit from SLO-based alerting\n- Generate alert policies for AWS burstable CPU instances\n\n## Installation and Setup\n\n### Prerequisites\n- Go 1.16 or later\n\n### Building from Source\n\n```bash\n# Clone the repository\ngit clone https://github.com/last9/slo-computer.git\ncd slo-computer\n\n# Build using Make\nmake build\n```\n\n### Quick Start\n\nThe project includes a Makefile with helpful commands:\n\n```bash\n# Build the application\nmake build\n\n# Run tests\nmake test\n\n# Run an example service SLO calculation\nmake example-service\n\n# Run an example CPU burst calculation\nmake example-cpu\n\n# See all available commands\nmake help\n```\n\n## Usage\n\n```bash\nusage: slo [\u003cflags\u003e] \u003ccommand\u003e [\u003cargs\u003e ...]\n\nLast9 SLO toolkit\n\nFlags:\n  --help     Show context-sensitive help (also try --help-long and --help-man).\n  --version  Show application version.\n\nCommands:\n  help [\u003ccommand\u003e...]\n    Show help.\n\n  suggest --throughput=THROUGHPUT --slo=SLO --duration=DURATION\n    suggest alerts based on service throughput and SLO duration\n\n  cpu-suggest --instance=INSTANCE --utilization=UTILIZATION\n    suggest alerts based on CPU utilization and Instance type\n```\n\n### Command Parameters\n\n#### `suggest` Command\n- `--throughput`: Number of requests per minute your service handles\n- `--slo`: Your desired SLO percentage (e.g., 99.9)\n- `--duration`: SLO time period in hours (e.g., 720 for 30 days)\n\n#### `cpu-suggest` Command\n- `--instance`: AWS instance type (e.g., t3.micro, t3a.xlarge)\n- `--utilization`: Average CPU utilization percentage (0-100)\n\nThe goal of these commands is to factor in some 
\"bare minimum\" input to:\n\n- Determine if this is a low traffic service where an SLO approach makes little sense\n- Compute the _actual_ alert values and conditions to set alerts on\n\n## Examples\n\n### Service SLO Alerts\n\n**Q: What alerts should I set for my service to achieve 99.9% availability over 30 days?**\n\n```bash\n./slo-computer suggest --throughput=4200 --slo=99.9 --duration=720\n```\n\nOutput:\n```\nAlert if error_rate \u003e 0.002 for last [24h0m0s] and also last [2h0m0s]\nThis alert will trigger once 6.67% of error budget is consumed,\nand leaves 360h0m0s before the SLO is defeated.\n\n\nAlert if error_rate \u003e 0.010 for last [1h0m0s] and also last [5m0s]\nThis alert will trigger once 1.39% of error budget is consumed,\nand leaves 72h0m0s before the SLO is defeated.\n```\n\n**Q: What about a low-traffic service?**\n\n```bash\n./slo-computer suggest --throughput=100 --slo=99.9 --duration=168\n```\n\nOutput:\n```\nslo-computer: error:\n\tIf this service reported 10.000 errors for a duration of 5m0s\n\tSLO (for the entire duration) will be defeated wihin 1h40m47s\n\n\tProbably\n\t- Use ONLY spike alert model, and not SLOs (easiest)\n\t- Reduce the MTTR for this service (toughest)\n\t- SLO is too aggressive and can be lowered (business decision)\n\t- Combine multiple services into one single service (team wide)\n```\n\n### CPU Burst Credit Alerts\n\n**Q: What alerts should I set for my AWS burstable instance?**\n\n```bash\n./slo-computer cpu-suggest --instance=t3a.xlarge --utilization=15\n```\n\nOutput:\n```\nAlert if 100.00 % consumption sustains for 10m0s AND recent 5m0s.\nAt this rate, burst credits will deplete after 10h0m0s\n\n\nAlert if 80.00 % consumption sustains for 3h45m0s AND recent 55m0s.\nAt this rate, burst credits will deplete after 15h0m0s\n```\n\n## Understanding the Results\n\n### For Service SLOs\n\nThe tool generates two types of alerts:\n1. 
**Slow burn alert**: Detects gradual error rate increases that would eventually exhaust your error budget\n2. **Fast burn alert**: Detects sudden spikes in error rates that require immediate attention\n\nEach alert includes:\n- The error rate threshold to monitor\n- The time windows to evaluate\n- How much of your error budget would be consumed when the alert triggers\n- How much time remains before your SLO is breached if the error rate continues\n\n### For CPU Burst Credits\n\nThe tool generates alerts that help you monitor when your AWS burstable instance might run out of CPU credits:\n- Alert thresholds for different CPU utilization levels\n- Time windows to monitor\n- Time until credit depletion at the current rate\n\n## Key Concepts\n\n### Service SLOs\n- **Throughput**: The number of requests your service handles per minute\n- **SLO**: Your Service Level Objective (e.g., 99.9% availability)\n- **Duration**: The time period for your SLO in hours (e.g., 720 for 30 days)\n- **Error Budget**: The amount of allowable errors within your SLO period (calculated as `(100% - SLO%) * total requests`)\n- **Burn Rate**: How quickly you're consuming your error budget relative to the expected rate\n\n### CPU Burst Credits\n- **Instance**: AWS burstable instance type (T2, T3, T4g families)\n- **Utilization**: Average CPU utilization percentage\n- **Credit Rate**: How quickly the instance earns CPU credits\n- **Baseline Performance**: The CPU performance level the instance can sustain indefinitely\n\n## Using as a Library\n\nYou can also use SLO Computer as a library in your Go projects:\n\n```go\nimport (\n    \"time\"\n    \"github.com/last9/slo-computer/slo\"\n)\n\n// Create a new SLO\ns, err := slo.NewSLO(\n    time.Duration(720)*time.Hour, // SLO period of 30 days\n    4200,                         // 4200 requests per minute\n    99.9,                         // 99.9% availability target\n)\n\n// Calculate alerts\nalerts := slo.AlertCalculator(s)\n\n// For CPU burst 
calculations\ncc := slo.InstanceCapacity(\"t3.micro\")\nb, err := slo.NewBurstCPU(cc, 75.0) // 75% utilization\nburstAlerts := slo.BurstCalculator(b)\n```\n\n## Troubleshooting\n\n### Common Errors\n\n**Error: \"strconv.ParseFloat: parsing \"SLO\": invalid syntax\"**  \nMake sure to replace \"SLO\" with an actual number (e.g., 99.9) in your command:\n```bash\n# Incorrect\n./slo-computer suggest --throughput=1000000 --slo=SLO --duration=90\n\n# Correct\n./slo-computer suggest --throughput=1000000 --slo=99.9 --duration=90\n```\n\n**Error about low traffic services**  \nIf you receive a message about your service being low-traffic, consider:\n- Using spike-based alerting instead of SLO-based alerting\n- Combining multiple services to increase the traffic volume\n- Lowering your SLO target to a more achievable level\n\n## Roadmap\n\nWe're actively working on improving SLO Computer. Check out our roadmap:\n- [Open Issues](OPEN_ISSUES.md) - Planned improvements and bug fixes\n- [Feature Enhancements](FEATURES.md) - Upcoming features and user experience improvements\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n1. Fork the repository\n2. Create your feature branch (`git checkout -b feature/amazing-feature`)\n3. Commit your changes (`git commit -m 'Add some amazing feature'`)\n4. Push to the branch (`git push origin feature/amazing-feature`)\n5. Open a Pull Request\n\n---\n\n# About Last9\n\nThis project is sponsored and maintained by [Last9](https://last9.io). 
Last9 is a telemetry data platform.\n\n\u003ca href=\"https://last9.io\"\u003e\u003cimg src=\"https://last9.github.io/assets/email-logo-green.png\" alt=\"\" loading=\"lazy\" height=\"40px\" /\u003e\u003c/a\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flast9%2Fslo-computer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flast9%2Fslo-computer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flast9%2Fslo-computer/lists"}