https://github.com/adriannovegil/awesome-sre

# Awesome SRE [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)

You want your computer systems to run well, and the subjective definition of what well means depends on the nature of the system and your goals regarding it.

Most of the time, the primary motivation for companies is to create profit for the owners and shareholders.

The definition of running well will therefore be a derivative of the business model objectives.

> "Hope is not a strategy."

## Contents

- [1. Site Reliability Engineering](#1-site-reliability-engineering)
- [2. SRE Culture](#2-sre-culture)
- [3. DevOps](#3-devops)
- [4. Monitoring and Observability](#4-monitoring-and-observability)
- [5. Alerting](#5-alerting)
- [6. Incident Response and Post-Mortem](#6-incident-response-and-post-mortem)
- [7. On-Call](#7-on-call)
- [8. Chaos Engineering](#8-chaos-engineering)
- [9. Automation](#9-automation)
- [10. Performance](#10-performance)
- [11. Tools](#11-tools)
- [12. Books](#12-books)
- [13. References](#13-references)
- [14. License](#14-license)
- [15. Contributing](#15-contributing)

## 1. Site Reliability Engineering

## 2. SRE Culture

## 3. DevOps

## 4. Monitoring and Observability

- [My Awesome Observability Repo ;-)](https://github.com/adriannovegil/awesome-observability)

## 5. Alerting

- [My Awesome Observability Repo ;-)](https://github.com/adriannovegil/awesome-observability)

## 6. Incident Response and Post-Mortem

- [A collection of post-mortems](https://github.com/danluu/post-mortems)
- [A collection of postmortem templates](https://github.com/dastergon/postmortem-templates)
- [Our incident postmortem template](https://www.hostedgraphite.com/blog/incident-postmortem-template) - Hosted Graphite postmortem template.
- [Postmortem exercise](https://docs.google.com/document/d/1ob0dfG_gefr_gQ8kbKr0kS4XpaKbc0oVAk4Te9tbDqM/edit)
- [Squadcast](https://www.squadcast.com) - Experience the journey from On-Call to SRE.
- [PagerDuty](https://www.pagerduty.com/) - Your platform for digital operations management.
- [VictorOps](https://victorops.com/) - VictorOps is now Splunk On-Call.
- [Splunk On-Call](https://www.splunk.com/en_us/investor-relations/acquisitions/splunk-on-call.html) - Helps developers, DevOps, and operations teams make on-call suck less while reducing mean time to acknowledge and restore outages.
- [OpsGenie](https://www.opsgenie.com/) - On-call and alert management to keep services always on.
- [AlertOps](https://alertops.com/) - Transform real-time operational intelligence into automated incident response.
- [Blameless](https://www.blameless.com/) - The Blameless SRE Platform empowers engineering and DevOps teams through incidents, retrospectives, and detecting the interesting patterns. With the right data, of course.
- [OnPage](https://www.onpage.com/) - Incident alert management system with a secure smartphone app, enabling response teams to get the most out of their digital technology investments.
- [PagerTree](https://pagertree.com/) - Intelligent alert routing for the modern team.
- [Cabot](https://cabotapp.com/) - Get alerted when services go down or metrics go crazy.
- [xMatters](https://www.xmatters.com/) - Automate operations workflows, ensure applications are always working, and deliver remarkable products at scale with the xMatters service reliability platform.
- [Derdack Enterprise Alert](https://www.derdack.com/) - Enterprise Alert Notification Software.
- [BigPanda](https://www.bigpanda.io/) - AIOps Event Correlation and Automation platform that enables Tech Ops teams to keep the digital economy running.
- [OpenDuty](https://github.com/ustream/openduty) - Openduty is an incident escalation tool, just like PagerDuty (no longer maintained).
- [ngDesk](https://www.ngdesk.com/) - ngDesk includes support, sales, asset management, marketing and pager in an all-in-one application that is ready to go and easy to use.
- [Geneos](https://www.itrsgroup.com/products/geneos) - Real-time monitoring for all your environments in one platform.
- [FireHydrant](https://www.firehydrant.com) - Gives teams the tools to maintain service catalogs, respond to incidents, communicate through status pages, and learn with retrospectives.
- [Rootly](https://www.rootly.io) - The fastest way to declare an incident.

## 7. On-Call

## 8. Chaos Engineering

- [My Awesome Chaos Repo ;-)](https://github.com/adriannovegil/awesome-chaos-engineering)

## 9. Automation

## 10. Performance

## 11. Tools

- [SLO Generator](https://github.com/google/slo-generator) - Tool to compute and export Service Level Objectives (SLOs), Error Budgets and Burn Rates, using configurations written in YAML (or JSON) format.
- [SLO Computer](https://github.com/last9/slo-computer) - SLOs, Error windows and alerts are complicated. Here's an attempt to make it easy.
- [SLO Tracker](https://github.com/roshan8/slo-tracker) - A simple but effective way to track SLOs and error budgets. SLO Tracker can be integrated with a few alerting tools via webhooks to receive SLO-violating incidents.
- [SLO exporter](https://github.com/seznam/slo-exporter) - Computes standardized Service Level Indicator (SLI) and Service Level Objectives (SLO) metrics based on events coming from various data sources.
- [Pyrra](https://github.com/pyrra-dev/pyrra) - Making SLOs with Prometheus manageable, accessible, and easy to use for everyone.
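
The tools above all revolve around the same arithmetic: an SLO target implies an error budget, and the burn rate tells you how fast that budget is being spent. Here is a minimal sketch of that calculation in plain Python. It is not taken from any of the listed projects; the function names `error_budget` and `burn_rate` and the event-count inputs are assumptions made for illustration only.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of failed events the SLO tolerates over the window.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_events


def burn_rate(slo_target: float, good_events: int, total_events: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget is being spent exactly on schedule; values
    above 1.0 mean it will run out before the window ends.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        # No traffic means no errors, so nothing is burning.
        return 0.0
    observed_error_rate = 1.0 - good_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% target over one million requests,
    # of which 998,000 succeeded.
    print(f"budget: {error_budget(0.999, 1_000_000):.0f} failed requests")
    print(f"burn rate: {burn_rate(0.999, 998_000, 1_000_000):.2f}")
```

Real SLO tooling usually evaluates burn rate over several time windows and alerts on combinations of them; this sketch covers only the single-window case.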

## 12. Books

- [Site Reliability Engineering](https://landing.google.com/sre/book/index.html)
- [Site Reliability Workbook](https://landing.google.com/sre/workbook/toc/)
- [Building Secure and Reliable Systems](https://landing.google.com/sre/resources/foundationsandprinciples/srs-book/)

## 13. References

- https://livebook.manning.com/book/chaos-engineering/chapter-1/17
- https://github.com/dastergon/awesome-sre
- https://github.com/michael-kehoe/awesome-sre-cheatsheets
- https://github.com/andrealmar/sre-university
- https://github.com/awesome-sre/awesome-sre
- https://github.com/jdrowne/awesome-sre-books
- https://github.com/hekonsek/awesome-sre
- https://github.com/mterwill/awesome-sre
- https://github.com/operate-first/SRE
- https://github.com/SquadcastHub/awesome-sre-tools
- https://github.com/mxssl/sre-interview-prep-guide
- https://github.com/rishiloyola/SRE-Interviews
- https://github.com/unixorn/sysadmin-reading-list
- https://github.com/linkedin/school-of-sre
- https://github.com/upgundecha/howtheysre
- [Site Reliability Engineering - Rodolpho Eckhardt](https://www.youtube.com/watch?v=XI2zUFIsMwg)
- [Site Reliability Engineering: How Google Runs Production Systems](https://www.amazon.com/gp/product/149192912X/ref=x_gr_w_bb?ie=UTF8&tag=x_gr_w_bb-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=149192912X&SubscriptionId=1MGPYB6YW3HWK55XCGG2)
- [Site Reliability Engineering at Dropbox](https://www.youtube.com/watch?v=ggizCjUCCqE)
- [Site Reliability Engineering: How Google Runs Production Systems](https://www.goodreads.com/book/show/27968891-site-reliability-engineering)
- [Episódio 98: Rodolpho Eckhardt - YouTube, Google e SRE](http://castalio.info/episodio-98-rodolpho-eckhardt-youtube-google-e-sre.html)
- [Site Reliability Engineering](https://landing.google.com/sre/)
- [What is the role of a Site Reliability Engineer?](https://cloudacademy.com/blog/what-is-the-role-of-a-site-reliability-engineer/)
- [Love DevOps? Wait until you meet SRE](https://www.atlassian.com/it-unplugged/devops/site-reliability-engineering-sre)

## 14. License

[![CC0](https://mirrors.creativecommons.org/presskit/buttons/88x31/svg/cc-zero.svg)](https://creativecommons.org/publicdomain/zero/1.0)

## 15. Contributing

Contributions welcome! Read the [contribution guidelines](contributing.md) first.

Thank you!