{"id":33127015,"url":"https://github.com/shibumi/SRE-cheat-sheet","last_synced_at":"2025-11-29T15:02:00.035Z","repository":{"id":40323053,"uuid":"150025150","full_name":"shibumi/SRE-cheat-sheet","owner":"shibumi","description":"A vocabulary collection for SREs","archived":false,"fork":false,"pushed_at":"2022-05-15T04:33:08.000Z","size":56,"stargazers_count":228,"open_issues_count":0,"forks_count":31,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-10-07T03:58:17.341Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shibumi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-09-23T20:59:45.000Z","updated_at":"2025-09-17T20:21:41.000Z","dependencies_parsed_at":"2022-07-12T03:00:38.823Z","dependency_job_id":null,"html_url":"https://github.com/shibumi/SRE-cheat-sheet","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/shibumi/SRE-cheat-sheet","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shibumi%2FSRE-cheat-sheet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shibumi%2FSRE-cheat-sheet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shibumi%2FSRE-cheat-sheet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shibumi%2FSRE-cheat-sheet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shibumi","download_url":"https://codeload.github.com/shibumi/SRE-cheat-sheet/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/shibumi%2FSRE-cheat-sheet/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":27355543,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-29T02:00:06.589Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-11-15T07:00:30.204Z","updated_at":"2025-11-29T15:02:00.025Z","avatar_url":"https://github.com/shibumi.png","language":null,"readme":"# SRE-cheat-sheet\nA cheatsheet for SREs (mostly influenced by Google SREs). It is meant as a landing page to quickly look up a certain keyword. If you want to go more into the details, I suggest you read the Google SRE book. 
It is free: https://landing.google.com/sre/book/\n\n## Dictionary\n### Site Reliability Engineering\n\"Fundamentally, it’s what happens when you ask a software engineer to\ndesign an operations function\" -- Ben Treynor, VP of Engineering @ Google.\u003csup id=\"a1\"\u003e[1](#f1)\u003c/sup\u003e\n\n### Uptime\n| Availability % | Downtime per year | Downtime per month | Downtime per week |\n|----------------|-------------------|--------------------|-------------------|\n| 90%            | 36.5 days         | 72 hours           | 16.8 hours        |\n| 95%            | 18.25 days        | 36 hours           | 8.4 hours         |\n| 98%            | 7.30 days         | 14.4 hours         | 3.36 hours        |\n| 99%            | 3.65 days         | 7.20 hours         | 1.68 hours        |\n| 99.5%          | 1.83 days         | 3.60 hours         | 50.4 minutes      |\n| 99.8%          | 17.52 hours       | 86.4 minutes       | 20.16 minutes     |\n| 99.9%          | 8.76 hours        | 43.2 minutes       | 10.1 minutes      |\n| 99.95%         | 4.38 hours        | 21.6 minutes       | 5.04 minutes      |\n| 99.99%         | 52.6 minutes      | 4.32 minutes       | 1.01 minutes      |\n| 99.999%        | 5.26 minutes      | 25.9 seconds       | 6.05 seconds      |\n| 99.9999%       | 31.5 seconds      | 2.59 seconds       | 0.605 seconds     |\n\nDowntime per month is calculated at 30 days.\u003csup id=\"a2\"\u003e[2](#f2)\u003c/sup\u003e\n\n### Error Budget\nAn error budget is generally the budget you can spend on pushing features. Let's say your application or service has an uptime target of 90%. This allows 36.5 days of downtime per year, or 72 hours per month. You can either spend this downtime on fixing errors, or build a reliable system and spend it on pushing new features. It's fully up to you. Just make sure that, once the budget is exhausted, you freeze new features until it has recovered. 
This has several advantages:\n\n1. Your software engineers will try to build your application to be as stable as possible, because if the application is unstable they will need their error budget for fixing errors instead of pushing new features.\n2. If you have a stable application, you are free to push new features as much as your error budget allows.\n3. Your uptime will be consistent with your SLA. Nobody wants to get sued, trust me.\n\n### Dickerson's Hierarchy of Service Reliability\n\n### Four Golden Signals\nA group of basic questions about your service regarding monitoring.\u003csup id=\"a5\"\u003e[5](#f5)\u003c/sup\u003e  \n\u003cimg src=\"https://github.com/shibumi/SRE-vocabulary/raw/master/figures/four_golden_signals.png\" width=\"352\" height=\"307\" /\u003e\n\n#### Saturation\nThis definition is up to you. It can be the capacity of the service, such as CPU utilization. Ask yourself at what point your service could fall over and try to measure the metric for that point.\n\n#### Latency\nUsers expect blazing fast apps these days, so you definitely want to monitor your latency.\nAt Google they measure latency in three numbers:\n\n* P50: The 50th percentile or the median latency.\n* P90: The 90th percentile.\n* P99: The 99th percentile.\n\nDo not treat latency as an average.\u003csup id=\"a5\"\u003e[5](#f5)\u003c/sup\u003e\n\n#### Errors\nIndicator for failures while serving your traffic. Usually measured in EPS (Errors Per Second).\n\n#### Traffic\nNormally measured in RPS (Requests Per Second) or QPS (Queries Per Second).\n\n### Valid Monitoring Output\n#### Alerts\nAn alert is something where a human needs to take action immediately to prevent a system crash or a degradation of your service.\n\n#### Tickets\nA ticket is anything where a human needs to take action, but not right away. 
Usually you have enough time to fix the issue; otherwise it is an alert.\n\n#### Logs\nLogs are only for diagnostic and forensic purposes and for postmortems.\n\n### Defense In Depth\nFailures will always happen. Get used to it. You cannot prevent them, but you can tolerate them and have them fixed automatically. If you design your system so that it tolerates point failures, you already have one problem less.\u003csup id=\"a1\"\u003e[1](#f1)\u003c/sup\u003e\n\n### Graceful Degradation\n\"Graceful degradation is the ability to tolerate failures without having complete collapse. For example, if a user's network is running slowly, the Hangout video system will reduce the video resolution and preserve the audio. For Gmail, a slow network might mean that big attachments won't load, but users can still read their email. All these are automated responses that give you high availability without a human having to do anything.\"\u003csup id=\"a1\"\u003e[1](#f1)\u003c/sup\u003e\n\n### Wheel Of Misfortune\nThe \"Wheel Of Misfortune\" is a role-playing game in which a previous postmortem is reenacted with a cast of engineers playing roles as laid out in the postmortem.\u003csup id=\"a6\"\u003e[6](#f6)\u003c/sup\u003e\n\n### Mean Time To Recover (MTTR)\nMTTR is the average time that a device will take to recover from any failure.\u003csup id=\"a3\"\u003e[3](#f3)\u003c/sup\u003e\n\n### Mean Time Between Failures (MTBF)\nMTBF is the predicted elapsed time between inherent failures of a mechanical or electronic system, during normal system operation. MTBF can be calculated as the arithmetic mean (average) time between failures of a system. 
The term is used for repairable systems, while mean time to failure (MTTF) denotes the expected time to failure for a non-repairable system.\u003csup id=\"a4\"\u003e[4](#f4)\u003c/sup\u003e\n\n### Mean Time To Failure (MTTF)\nMTTF denotes the expected time to failure for a non-repairable system.\u003csup id=\"a4\"\u003e[4](#f4)\u003c/sup\u003e\n\n### Service Level Indicator (SLI)\nSLI is a carefully defined quantitative measure of some aspect of the level of service that is provided.\u003csup id=\"a5\"\u003e[5](#f5)\u003c/sup\u003e\n\n### Service Level Objective (SLO)\nSLO is a target value or range of values for a service level that is measured by an SLI.\u003csup id=\"a5\"\u003e[5](#f5)\u003c/sup\u003e\n\n### Service Level Agreement (SLA)\nSLA is a (legal) agreement with repercussions for failure to meet.\u003csup id=\"a5\"\u003e[5](#f5)\u003c/sup\u003e\n\n### Capability Maturity Model (CMM)\nCMM is a development model created after a study of data collected from organizations that contracted with the U.S. Department of Defense, who funded the research. 
The term \"maturity\" relates to the degree of formality and optimization of processes, from ad hoc practices, to formally defined steps, to managed result metrics, to active optimization of the processes.\u003csup id=\"a7\"\u003e[7](#f7)\u003c/sup\u003e\n\n### Postmortem\n\n## Sources\n\u003cb id=\"f1\"\u003e1\u003c/b\u003e Google SRE Interview, Niall Murphy and Ben Treynor, \"What is 'Site Reliability Engineering'\", 2018-09-26, https://landing.google.com/sre/interview/ben-treynor.html [↩](#a1)  \n\u003cb id=\"f2\"\u003e2\u003c/b\u003e https://interworks.com/blog/rclapp/2010/05/06/what-does-availabilityuptime-mean-real-world/ [↩](#a2)  \n\u003cb id=\"f3\"\u003e3\u003c/b\u003e https://en.wikipedia.org/wiki/Mean_time_to_recovery [↩](#a3)  \n\u003cb id=\"f4\"\u003e4\u003c/b\u003e https://en.wikipedia.org/wiki/Mean_time_between_failures [↩](#a4)  \n\u003cb id=\"f5\"\u003e5\u003c/b\u003e Google Cloud Next 2018: Nori and Dan, \"Best Practices from Google SRE\", 2018-07-26, https://www.youtube.com/watch?v=XPtoEjqJexs [↩](#a5)  \n\u003cb id=\"f6\"\u003e6\u003c/b\u003e \"Postmortem Culture: Learning from Failure\", John Lunney and Sue Lueder, 2018-09-26, https://landing.google.com/sre/book/chapters/postmortem-culture.html [↩](#a6)  \n\u003cb id=\"f7\"\u003e7\u003c/b\u003e https://en.wikipedia.org/wiki/Capability_Maturity_Model [↩](#a7)  \n\n## Additional Links\nA more technical cheatsheet: https://github.com/michael-kehoe/awesome-sre-cheatsheets  \nDevOps: \"Where do I start? 
Cheatsheet\" by Microsoft: https://blogs.technet.microsoft.com/juliens/2016/02/14/devops-where-do-i-start-cheat-sheet/  \n\"So you want to be an SRE\" by hackernoon.com: https://hackernoon.com/so-you-want-to-be-an-sre-34e832357a8c  \nThe Google SRE Landing page: https://google.com/sre\n","funding_links":[],"categories":["SRE Tools","📋 Templates e Checklists"],"sub_categories":["Checklists"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshibumi%2FSRE-cheat-sheet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshibumi%2FSRE-cheat-sheet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshibumi%2FSRE-cheat-sheet/lists"}