{"id":572,"url":"https://github.com/dastergon/awesome-sre","last_synced_at":"2025-09-27T10:30:48.990Z","repository":{"id":38474748,"uuid":"56103979","full_name":"dastergon/awesome-sre","owner":"dastergon","description":"A curated list of Site Reliability and Production Engineering resources.","archived":false,"fork":false,"pushed_at":"2023-12-03T02:39:07.000Z","size":1224,"stargazers_count":11567,"open_issues_count":36,"forks_count":1542,"subscribers_count":502,"default_branch":"master","last_synced_at":"2024-05-23T06:05:14.113Z","etag":null,"topics":["alerting","availability","awesome","awesome-list","capacity-planning","devops","incident-response","list","monitoring","on-call","post-mortem","postmortem","production","reliability","reliability-engineering","scalability","service-level-agreement","site-reliability","site-reliability-engineering","sre"],"latest_commit_sha":null,"homepage":"https://sre.xyz","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dastergon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2016-04-12T23:06:20.000Z","updated_at":"2024-05-22T22:57:05.000Z","dependencies_parsed_at":"2023-02-08T19:31:35.344Z","dependency_job_id":"bcd26826-17ca-4891-b859-bf76ca789f74","html_url":"https://github.com/dastergon/awesome-sre","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dastergon%2Fawesome-sre","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dastergon%2Fawesome-sre/tags","releases_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dastergon%2Fawesome-sre/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dastergon%2Fawesome-sre/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dastergon","download_url":"https://codeload.github.com/dastergon/awesome-sre/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":219867307,"owners_count":16554317,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alerting","availability","awesome","awesome-list","capacity-planning","devops","incident-response","list","monitoring","on-call","post-mortem","postmortem","production","reliability","reliability-engineering","scalability","service-level-agreement","site-reliability","site-reliability-engineering","sre"],"created_at":"2024-01-05T20:12:58.512Z","updated_at":"2025-09-27T10:30:43.905Z","avatar_url":"https://github.com/dastergon.png","language":null,"funding_links":[],"categories":["Operations","Miscellaneous","Others","Technical","Site Reliability Engineering","Bookmarks","Blogs \u0026 Newsletters","Resourses","DevOps","Uncategorized","HarmonyOS","my-awesome-list","杂项","Backend","Live Site:   [searchAwesome](https://search-awesome.vercel.app/)","Related Awesome Lists","网络服务","其他","Miscellaneous \u0026 Experimental Tools","Articles","Other Lists","awesome-list","devops","Don't forget to give a :star: to make the project popular","资源链接：","Bachelor-Level","Awesome Awesome","📚 Learning \u0026 Resources","Themed Directories","Reference 
materials","SRE/DevOps/WebOps","16. Community and Forums"],"sub_categories":["awesome-*","Burn Iso","Github","Reliability (SRE)","Uncategorized","Windows Manager","Orchestration \u0026 CD","网络服务_其他","🙌 Acknowledgements","Kubernetes","TeX Lists","B.Sc.: Big Data and Cloud Computing for AI","Free software (free as in freedom)","Lists","Updated in the last year"],"readme":"# Awesome Site Reliability Engineering  [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)\n[\u003cimg src=\"awesome-sre-logo.svg\" align=\"right\" width=\"100\"\u003e](https://dastergon.gr/awesome-sre)\n\n\nA curated list of awesome [Site Reliability](https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre) and [Production](https://www.usenix.org/conference/srecon15/program/presentation/canahuati) Engineering resources.\n\n#### What is Site Reliability Engineering?\n\u003e \"Fundamentally, it's what happens when you ask a software engineer to design an operations function.\" - Ben Treynor Sloss, VP Google Engineering, founder of Google SRE\n\n## Contributing\n\nPlease take a look at the [contribution guidelines](CONTRIBUTING.md) first.\nContributions are always welcome!\n\n## Contents\n- [Culture](#culture)\n- [Education](#education)\n- [Books](#books)\n- [Hiring](#hiring)\n- [Reliability](#reliability)\n- [Monitoring \u0026 Observability \u0026 Alerting](#monitoring--observability--alerting)\n- [On-Call](#on-call)\n- [Post-Mortem](#post-mortem)\n- [Capacity Planning](#capacity-planning)\n- [Service Level Agreement](#service-level-agreement)\n- [Performance](#performance)\n- [Programming](#programming)\n- [Misc Articles](#misc-articles)\n- [Real-time Messaging](#real-time-messaging)\n- [Blogs](#blogs)\n- [Newsletters](#newsletters)\n- [Conferences \u0026 Meetups](#conferences-meetups)\n- [Twitter](#twitter)\n- [SRE Tools](#sre-tools)\n- [SRE Podcasts](#podcasts)\n\n## 
Culture\n* [What is Site Reliability Engineering?](https://landing.google.com/sre/interview/ben-treynor.html)\n* [Keys To SRE by Ben Treynor](https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre)\n* [Google SRE Resources](https://landing.google.com/sre/resources.html)\n* [Notes from Production Engineering by Pedro Canahuati](https://www.usenix.org/conference/srecon15/program/presentation/canahuati)\n* [PostOps: Recovery from Operations](https://www.usenix.org/conference/srecon15europe/program/presentation/underwood)\n* [Love DevOps? Wait 'till you meet SRE](https://www.atlassian.com/it-service/site-reliability-engineering-sre) [[video]](https://youtu.be/fsTpRx8Pt-k)\n* [How Google Does Planet-Scale Engineering for Planet-Scale Infra](https://www.youtube.com/watch?v=H4vMcD7zKM0)\n* [Site Reliability Engineering at Facebook](https://www.facebook.com/notes/facebook-engineering/site-reliability-engineering-at-facebook/291616313919/)\n* [A History of Site Reliability Engineering at Uber](https://www.youtube.com/watch?v=qJnS-EfIIIE\u0026nohtml5=False)\n* [Case Study: Adopting SRE Principles at StackOverflow](https://www.usenix.org/conference/srecon15/program/presentation/limoncelli)\n* [Site Reliability Engineering at Dropbox](https://www.youtube.com/watch?v=ggizCjUCCqE)\n* [Site Reliability Engineers — Keeping Google up and running 24/7](https://www.youtube.com/watch?v=yXI7r0_J29M)\n* [Site Reliability Engineering at Salesforce](https://www.salesforce.com/video/193050/)\n* From Sys Admin to Netflix SRE - [video](https://www.youtube.com/watch?v=lZI51YzIgVE) and [slides](https://www.socallinuxexpo.org/sites/default/files/presentations/Scale%20x14%20Slides.pdf)\n* [SRE@Google: Thousands of DevOps Since 2004](https://www.youtube.com/watch?v=iIuTnhdTzK0)\n* [Transactional System Administration Is Killing Us and Must be Stopped](https://www.usenix.org/conference/lisa15/conference-program/presentation/limoncelli)\n* [A hierarchy of SRE 
needs](https://web.archive.org/web/20190401220948/https://plus.google.com/+lizthegrey/posts/MLAJFVyEb2f)\n* [PostOps: A Non-Surgical Tale of Software, Fragility, and Reliability](https://www.usenix.org/conference/lisa13/technical-sessions/plenary/underwood)\n* [SRE: An incomplete guide to cultural Narnia](https://web.archive.org/web/20180820235243/http://anthonycaiafa.com/2016/04/10/sre-cultural-narnia/) - [[Video]](https://www.youtube.com/watch?v=__wypEhdcrQ\u0026t=0s)\n* [Putting Together Great SRE Teams](https://www.usenix.org/conference/srecon16/program/presentation/krishnan)\n* [Work at Google: Meet our Production Engineers for Site Reliability Hangout on Air](https://www.youtube.com/watch?v=bwt6TZjefGM)\n* [Toil: A Word Every Engineer Should Know](https://sharpend.io/toil-a-word-every-engineer-should-know/)\n* [Engineering Reliability into Web Sites: Google SRE](https://research.google.com/pubs/pub32583.html)\n* [DEVOPS \u0026 SRE AMA - Building High Performance Organizations](https://vimeo.com/179914447)\n* [John Allspaw's AMA on Incident Analysis and Postmortems](https://community.atlassian.com/t5/Jira-Ops-questions/I-m-John-Allspaw-Ask-Me-Anything-about-incident-analysis-and/qaq-p/957084)\n* Site Reliability Engineering with Paul Newson - [Part 1](https://www.gcppodcast.com/post/episode-38-site-reliability-engineering-with-paul-newson/) \u0026 [Part 2](https://gcppodcast.com/post/episode-59-sre-ii-with-paul-newson/)\n* [How SysAdmins Devalue Themselves](https://queue.acm.org/detail.cfm?id=2891413)\n* [The Softer Side of DevOps](https://www.youtube.com/watch?v=ry51Llzil1I)\n* [SRE, noun. See also: confidence, trust.](https://medium.com/@kobolog/sre-noun-see-also-confidence-trust-e7e33e19efc1)\n* [Site Reliability Engineering with Stephen Weinberg](https://youtu.be/24xb7oZgu-I?t=29m24s)\n* [We are the Google Site Reliability team. We make Google’s websites work. 
Ask us Anything!](https://www.reddit.com/r/IAmA/comments/177267/we_are_the_google_site_reliability_team_we_make)\n* [We are the Google Site Reliability Engineering team. Ask us Anything!](https://www.reddit.com/r/IAmA/comments/1w1y5m/we_are_the_google_site_reliability_engineering/)\n* [The Ops Identity Crisis](http://www.susanjfowler.com/blog/2016/10/13/the-ops-identity-crisis)\n* [The Irreproducibility Of Bugs In Large-Scale Production Systems](http://www.susanjfowler.com/blog/2016/11/2/the-irreproducibility-of-bugs-in-large-scale-production-systems)\n* [SE-Radio Episode 276: Björn Rabenstein on Site Reliability Engineering](http://www.se-radio.net/2016/12/se-radio-episode-276-bjorn-rabenstein-on-site-reliability-engineering/)\n* [Microservices, DevOps and Production Complexity](https://blog.netsil.com/microservices-devops-and-operational-complexity-be98cb01b660)\n* [Introducing Google Customer Reliability Engineering](https://cloudplatform.googleblog.com/2016/10/introducing-a-new-era-of-customer-support-Google-Customer-Reliability-Engineering.html)\n* [Evolution or Rebellion? 
The rise of Site Reliability Engineers (SRE)](https://robhirschfeld.com/2016/12/29/evolution-or-rebellion-the-rise-of-site-reliability-engineers-sre/)\n* [The difference between Site Reliability Engineering, System Administration, and DevOps](https://standalone-sysadmin.com/the-difference-between-site-reliability-engineering-system-administration-and-devops-d05031495499)\n* [SRE in the Small and in the Large](https://www.usenix.org/conference/lisa16/conference-program/presentation/closing-plenary)\n* [SBSRE Meetup: Different SRE roles and challenges(Netflix)](https://www.youtube.com/watch?v=zLXf0cKDOv0)\n* [Panel: Who/What Is SRE?](https://www.usenix.org/conference/srecon16/program/presentation/definition-of-sre-panel)\n* [Hope Is Not a Strategy](https://medium.com/@jerub/hope-is-not-a-strategy-6a7d0a3b1c08)\n* [Tenets of SRE](https://medium.com/@jerub/tenets-of-sre-8af6238ae8a8)\n* [Site Reliability Engineering Demystified](https://medium.com/@venkatachalamrangasamy/site-reliability-engineering-demystified-ed676e0a7d56)\n* [Is Site Reliability Engineering the True ‘Ops’ in DevOps?](https://devops.com/site-reliability-engineering-sre-true-ops-devops/)\n* [SRE vs. DevOps vs. Cloud Native: The Server Cage Match](https://devops.com/sre-devops-cloud-native-server-cage-match/)\n* [SRE: What’s The Big Idea?](https://youtu.be/8dfYLRAWn_c)\n* [Building the SRE Culture at LinkedIn](https://engineering.linkedin.com/blog/2017/05/building-the-sre-culture-at-linkedin)\n* [Podcast #111 – SRE: Occasionally Maintaining Infrastructure That You Hate](https://stackoverflow.blog/2017/06/12/podcast-111-sre-occasionally-maintaining-infrastructure-hate/)\n* [Splicing SRE DNA Sequences in the Biggest Software Company on the Planet](https://www.usenix.org/conference/srecon16europe/program/presentation/splicing-sre-dna-sequences-biggest-software-company)\n* [Why should your app get SRE support? 
- CRE life lessons](https://cloudplatform.googleblog.com/2017/06/why-should-your-app-get-SRE-support-CRE-life-lessons.html)\n* [How SREs find the landmines in a service - CRE life lessons](https://cloudplatform.googleblog.com/2017/06/how-SREs-find-the-landmines-in-a-service-CRE-life-lessons.html)\n* [Making the most of an SRE service takeover - CRE life lessons](https://cloudplatform.googleblog.com/2017/07/making-the-most-of-an-SRE-service-takeover-CRE-life-lessons.html)\n* [The Cloudcast #301: SRE and Infrastructure Operations (Podcast)](https://dzone.com/articles/the-cloudcast-301-sre-and-infrastructure-operation)\n* [The SRE model](https://medium.com/@rakyll/the-sre-model-6e19376ef986)\n* [Onboarding New Site Reliability Engineers](https://circleci.com/blog/onboarding-new-site-reliability-engineers/)\n* [Building Blocks for Site Reliability At Google](https://www.youtube.com/watch?v=nQv9ySa8MTU)\n* [Beyond Google SRE: What is Site Reliability Engineering like at Medium?](https://blog.netsil.com/beyond-google-sre-what-is-site-reliability-engineering-like-at-medium-71c65bd35f4e)\n* [Intelligent Site Reliability Engineering – A Machine Learning Perspective](http://blog.adnanmasood.com/2016/05/19/intelligent-site-reliability-engineering-a-machine-learning-perspective/)\n* [A crash course in LinkedIn's global site operations](https://engineering.linkedin.com/day-life/crash-course-linkedins-global-site-operations)\n* [Google’s Site Reliability Engineering with Todd Underwood](https://softwareengineeringdaily.com/2016/06/14/googles-site-reliability-engineering-todd-underwood/)\n* [What is Site Reliability Engineering? 
(VMware)](https://blogs.vmware.com/services-education-insights/2018/02/site-reliability-engineering.html)\n* [A Gentle Introduction to SRE](http://geekologist.co/introduction-to-sre/)\n* [Understanding Site Reliability Engineering through Movies and Books](http://engineering.medallia.com/blog/posts/understanding-site-reliability-engineering-through-movies-and-books/)\n* [GOTO 2017 • Site Reliability Engineering at Google • Christof Leng](https://www.youtube.com/watch?v=Cxb7a8lTv8A)\n* The Makeup of Successful Geographically-Distributed SRE Teams - [Part1](https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p) \u0026 [Part2](https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p0)\n* [Tech Leadership in SRE](https://www.youtube.com/watch?v=6G2V1xPIM64)\n* [The Azure Podcast: Episode 227 - Azure SRE](http://azpodcast.azurewebsites.net/post/Episode-227-Azure-SRE1)\n* [The human scalability of \"DevOps\"](https://medium.com/@mattklein123/the-human-scalability-of-devops-e36c37d3db6a)\n* [Podcast: Site Reliability Management with Mike Hiraga](https://softwareengineeringdaily.com/2018/04/09/site-reliability-management-with-mike-hiraga/)\n* [How a cat inspired system reliability at Knowlarity](https://medium.com/@Knowlarity_Engineering/how-a-cat-inspired-system-reliability-at-knowlarity-ad73c24f29a7)\n* [Getting Started with Site Reliability Engineering](https://github.com/devopsenterprise/2018-London/blob/master/Tuesday/Breakout%20Sessions/Throne%2C%20Stephen%2C%20Getting%20Started%20with%20Site%20Reliability%20Engineering.pdf)\n* [\"Practical Applications of the Dickerson Pyramid\" by Nat Welch](https://www.youtube.com/watch?v=xWAfTAu0Mww)\n* [LinkedIn’s Kurt Andersen Uncovers Blindspots in SRE Implementations](https://blameless.com/blog/sre-implementations-blindspots/)\n* [Interview with Betsy Beyer, Stephen Thorne of 
Google](https://driftboatdave.com/2018/10/09/interview-with-betsy-beyer-stephen-thorne-of-google/)\n* [Less Risk Through Greater Humanity - Dave Rensin](https://www.youtube.com/watch?v=0zqBlRW_6jA)\n* [Getting Started with SRE - Stephen Thorne, Google](https://www.youtube.com/watch?v=c-w_GYvi0eA)\n* [Building Successful SRE in Large Enterprises](https://drive.google.com/file/d/1FXwHm6mpmRA9NaIJEu4cB1s6ffbyGBfl/view)\n* [Solving Reliability Fears with Site Reliability Engineering](https://www.youtube.com/watch?v=ZcZtU_TiFEM)\n* [SRE vs. DevOps: competing standards or close friends?](https://cloud.google.com/blog/products/gcp/sre-vs-devops-competing-standards-or-close-friends)\n* [How to Avoid the 5 SRE Implementation Traps that Catch Even the Best Teams](https://thenewstack.io/how-to-avoid-the-5-sre-implementation-traps-that-catch-even-the-best-teams/)\n* [Reliability Engineering – The Essential Discipline for Complex Systems](https://vimeo.com/344515149)\n* [The Modern Site Reliability Workbench on Top of OCI](https://www.youtube.com/watch?v=bC5dIPzNH24)\n* [SRE in the Third Age](https://www.usenix.org/conference/srecon19emea/presentation/rabenstein)\n* [About SRE and how (not) to apply it](https://www.youtube.com/watch?v=vF6ajM3P_wM)\n* [Transitioning a typical engineering ops team into an SRE powerhouse](https://cloud.google.com/blog/products/management-tools/transitioning-a-typical-engineering-ops-team-into-an-sre-powerhouse)\n* [Making a Lion Bulletproof: SRE in Banking](https://www.infoq.com/presentations/ing-sre-teams-practices/)\n* [Identifying and tracking toil using SRE principles](https://cloud.google.com/blog/products/management-tools/identifying-and-tracking-toil-using-sre-principles)\n* [From Ops to SRE: Evolution of the OpenShift Dedicated Team](https://www.openshift.com/blog/from-ops-to-sre-evolution-of-the-openshift-dedicated-team)\n* [Meeting reliability challenges with SRE 
principles](https://cloud.google.com/blog/products/management-tools/meeting-reliability-challenges-with-sre-principles)\n* [A quick introduction to SRE principles](https://github.com/fhivemind/sre-playground)\n* [The SRE I Aspire to Be](https://www.youtube.com/watch?v=KnC2eRUZMKY)\n* [Taming Operational Load with VMware CRE](https://tanzu.vmware.com/content/blog/taming-operational-load-vmware-cre)\n* [SRE Cultural Values](https://dubrie.medium.com/sre-cultural-values-a0073b475183)\n* [Are we there yet? Thoughts on assessing an SRE team’s maturity](https://cloud.google.com/blog/products/devops-sre/evaluating-where-your-team-lies-on-the-sre-spectrum)\n* [What SREs have to do with project-based services?](https://www.linkedin.com/pulse/what-sres-have-do-project-based-services-rod-anami/)\n* [Making operational work more visible](https://github.com/readme/guides/ops-work-visible)\n* [SRE vs. DevOps: What’s the Difference Between Them?](https://spacelift.io/blog/sre-vs-devops)\n\n## Education\n* [Panel: Educating SRE](https://www.usenix.org/conference/srecon15/program/presentation/sebenik)\n* [From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams](https://www.usenix.org/conference/srecon15/program/presentation/widdowson)\n* [New to an SRE team?](https://www.linkedin.com/pulse/new-sre-team-anthony-caiafa/)\n* [The Systems Engineering Side of Site Reliability Engineering](https://www.usenix.org/publications/login/june15/hixson)\n* [Graduating from Bootcamp and interested in becoming a Site Reliability Engineer?](https://medium.com/@tammybutow/graduating-from-bootcamp-and-interested-in-becoming-a-site-reliability-engineer-b69a38ce858b)\n* [So you want to be a Site Reliability Engineer?](https://www.loomsystems.com/single-post/2016/03/23/So-you-want-to-be-a-Site-Reliability-Engineer)\n* [Spiraling Ops Debt \u0026 the SRE Coding Imperative](https://www.loomsystems.com/blog/2017/02/06/spiraling-ops-debt-the-sre-coding-imperative)\n* [So you want 
to be an SRE?](https://hackernoon.com/so-you-want-to-be-an-sre-34e832357a8c)\n* [Career Profiles/Site Reliability Engineer](https://www.khanacademy.org/college-careers-more/career-content/career-profile-videos/site-reliability-engineer/v/ruth-grace-site-reliability-engineer-what-i-do-and-how-much-i-make)\n* [What is the role of a Site Reliability Engineer?](https://cloudacademy.com/blog/what-is-the-role-of-a-site-reliability-engineer/)\n* [Lynda.com: DevOps Foundations: Site Reliability Engineering](https://www.lynda.com/Software-Development-tutorials/DevOps-Foundations-Site-Reliability-Engineering/669542-2.html)\n* [Incident Management Training: Wheel of Misfortune](https://dastergon.gr/wheel-of-misfortune/)\n* [Site Un-Reliability Engineering [Video Series]](https://www.youtube.com/watch?v=rmY8_PHanuI)\n* [The Ultimate Guide to Structuring a 90-Day Onboarding Plan](https://medium.com/swlh/the-ultimate-guide-to-structuring-a-90-day-onboarding-plan-c91af947376)\n* [SRE fundamentals: SLIs, SLAs and SLOs](https://cloud.google.com/blog/products/gcp/sre-fundamentals-slis-slas-and-slos)\n* [How to Get Into SRE](https://blog.alicegoldfuss.com/how-to-get-into-sre/)\n* [Do you have an SRE team yet? 
How to start and assess your journey](https://cloud.google.com/blog/products/devops-sre/how-to-start-and-assess-your-sre-journey)\n* [How SRE teams are organized, and how to get started](https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started)\n* [Why SRE Documents Matter](https://queue.acm.org/detail.cfm?id=3283589)\n* [How to get started with site reliability engineering (SRE)](https://www.oreilly.com/ideas/how-to-get-started-with-site-reliability-engineering-sre)\n* [Duties of a Site Reliability Engineering Manager](https://victorops.com/blog/duties-of-a-site-reliability-engineering-manager)\n* [Designing distributed systems using NALSD flashcards](https://cloud.google.com/blog/products/management-tools/sre-principles-and-flashcards-to-design-nalsd)\n* [Training Site Reliability Engineers: What Your Organization Needs to Create a Learning Program](https://landing.google.com/sre/resources/practicesandprocesses/training-site-reliability-engineers)\n* [SRE Classroom: Distributed PubSub workshop](https://landing.google.com/sre/resources/practicesandprocesses/sre-classroom/)\n* [School of SRE: Curriculum for onboarding non-traditional hires and new grads](https://linkedin.github.io/school-of-sre/)\n\n## Books\n* [Practical Linux Infrastructure](https://link.springer.com/book/10.1007/978-1-4842-0511-2)\n* [Site Reliability Engineering: How Google Runs Production Systems](https://landing.google.com/sre/book.html)\n* [The Site Reliability Workbook: Practical Ways to Implement SRE](https://landing.google.com/sre/book.html)\n* [Observability Engineering: Achieving Production Excellence](https://info.honeycomb.io/observability-engineering-oreilly-book-2022)\n* [The Practice Of Cloud System Administration: Designing and Operating Large Distributed Systems](http://the-cloud-book.com/)\n* [Web Operations - Keeping the Data On Time](http://shop.oreilly.com/product/0636920000136.do)\n* [The Checklist Manifesto: How to Get Things 
Right](http://atulgawande.com/book/the-checklist-manifesto/)\n* [Microservices in Production - Standard Principles and Requirements](http://www.oreilly.com/programming/free/microservices-in-production.csp)\n* [Production-Ready Microservices - Building Standardized Systems Across an Engineering Organization](http://shop.oreilly.com/product/0636920053675.do)\n* [Systems Performance: Enterprise and the Cloud](https://www.amazon.com/Systems-Performance-Enterprise-Brendan-Gregg/dp/0133390098/) \\[Sample chapter titled [CPUs](http://ptgmedia.pearsoncmg.com/images/9780133390094/samplepages/0133390098.pdf)\\]\n* [Monitoring Distributed Systems: Case Studies from Google's SRE Teams](http://www.oreilly.com/webops-perf/free/monitoring-distributed-systems.csp)\n* [The Human Side of Postmortems: Managing Stress and Cognitive Biases](http://www.oreilly.com/webops-perf/free/the-human-side-of-postmortems.csp)\n* [Chaos Engineering: Building Confidence in System Behavior through Experiment](http://www.oreilly.com/webops-perf/free/chaos-engineering.csp)\n* [Post-Incident Reviews: Learning from Failure for Improved Incident Responses](https://victorops.com/oreilly-post-incident-review/)\n* [Antifragile Systems and Teams](http://www.oreilly.com/webops-perf/free/antifragile-systems-and-teams.csp)\n* [How to Monitoring the SRE Golden Signals (E-Book)](https://www.slideshare.net/OpsStack/how-to-monitoring-the-sre-golden-signals-ebook/)\n* [Incident Management for Operations](http://shop.oreilly.com/product/0636920036159.do)\n* [Real-World SRE](https://www.packtpub.com/web-development/real-world-sre)\n* [Seeking SRE](http://shop.oreilly.com/product/0636920063964.do)\n* [What is SRE?](https://www.verizondigitalmedia.com/e-book/oreilly-what-is-sre/)\n* [Engineering Reliable Mobile Applications: Strategies for Developing Resilient Native Mobile Applications](https://landing.google.com/sre/resources/practicesandprocesses/engineering-reliable-mobile-applications/)\n* [Building Secure and Reliable 
Systems](https://landing.google.com/sre/book.html)\n* [Chaos Engineering: Crash test your applications](https://www.manning.com/books/chaos-engineering/)\n* [97 Things Every SRE Should Know](https://www.oreilly.com/library/view/97-things-every/9781492081487/)\n* [Four Steps to Creating Effective Game Day Tests](https://shopify.engineering/four-steps-creating-effective-game-day-tests)\n* [The Linux Programming Interface](https://nostarch.com/tlpi)\n\n## Hiring\n* [SRE Hiring](https://www.usenix.org/conference/srecon15/program/presentation/fong)\n* [Hiring SREs at LinkedIn](https://engineering.linkedin.com/engineering-culture/hiring-sres-linkedin)\n* [Hiring Site Reliability Engineers](https://www.usenix.org/publications/login/june15/hiring-site-reliability-engineers)\n* [Hiring your first SRE](https://sreally.com/hiring-your-first-sre-bdda38ee175d#.2m3sqyuw9)\n* [Growing the Site Reliability Team at LinkedIn: Hiring is Hard](https://www.youtube.com/watch?v=ZemNg9GYvOA)\n* [Engineering Manager - Site Reliability Engineering Interview Preparation](https://danrl.com/blog/srm)\n\n## Reliability\n* [The Realities of the Job of Delivering Reliability](https://www.usenix.org/conference/srecon16/program/presentation/kroll)\n* [Fail at Scale by Ben Maurer](http://queue.acm.org/detail.cfm?id=2839461)\n* [Embracing Failure: Fault-Injection and Service Reliability](https://www.youtube.com/watch?v=wrY7XoOnysg)\n* [10 Years of Crashing Google](https://www.usenix.org/conference/lisa15/conference-program/presentation/krishnan)\n* [How we break things at Twitter: failure testing](https://blog.twitter.com/2015/how-we-break-things-at-twitter-failure-testing)\n* [Reliable Cron across the Planet](http://queue.acm.org/detail.cfm?id=2745840)\n* [Push our limits - reliability testing at Twitter](https://blog.twitter.com/2014/push-our-limits-reliability-testing-at-twitter)\n* [The Verification of a Distributed System by Caitie 
McCaffrey](http://queue.acm.org/detail.cfm?ref=rss\u0026id=2889274)\n* [Weathering the Unexpected](http://queue.acm.org/detail.cfm?id=2371516)\n* [SRE Hour: Tech Talks by Box \u0026 Yelp](https://www.youtube.com/watch?v=YFDwdRVTg4g)\n* [Simplicity: A Prerequisite for Reliability](https://sharpend.io/simplicity-a-prerequisite-for-reliability/)\n* [The Two Sides to Google Infrastructure for Everyone Else](https://speakerdeck.com/garethr/the-two-sides-to-google-infrastructure-for-everyone-else)\n* [How Embracing Continuous Release Reduced Change Complexity](https://www.usenix.org/conference/ures14west/summit-program/presentation/dickson)\n* [Making \"Push On Green\" a Reality](https://www.usenix.org/publications/login/october-2014-vol-39-no-5/making-push-green-reality)\n* [BeyondCorp: A New Approach to Enterprise Security](https://www.usenix.org/publications/login/dec14/ward)\n* [Brainstorming Failure by Jeff Smith](https://www.youtube.com/watch?v=dKe9S8u44Yk)\n* [The Ripple Effect Of Outages And Downtime Cannot Be Underestimated](http://cloudtweaks.com/2016/04/outages-and-downtime/)\n* [The infrastructure behind Twitter: efficiency and optimization](https://blog.twitter.com/2016/the-infrastructure-behind-twitter-efficiency-and-optimization)\n* [Dickerson's Hierarchy of Reliability](https://docs.google.com/drawings/d/1kshrK2RLkW-XV8enmWZxeRFRgADj6d4Ru_w5txz_k9I/edit)\n* [The Morning Paper on Operability](https://blog.acolyer.org/2016/09/21/the-morning-paper-on-operability/)\n* [Production is all that matters](http://naildrivin5.com/blog/2013/06/16/production-is-all-that-matters.html)\n* [Using load shedding to survive a success disaster - CRE life lessons](https://cloudplatform.googleblog.com/2016/12/using-load-shedding-to-survive-a-success-disaster-CRE-life-lessons.html)\n* [How to avoid a self-inflicted DDoS Attack - CRE life lessons](https://cloudplatform.googleblog.com/2016/11/how-to-avoid-a-self-inflicted-DDoS-Attack-CRE-life-lessons.html)\n* [Don't gamble when 
it comes to reliability](https://www.oreilly.com/ideas/dont-gamble-when-it-comes-to-reliability)\n* [Resilience Engineering: Learning to Embrace Failure](https://queue.acm.org/detail.cfm?id=2371297)\n* [The Infrastructure Behind Twitter: Scale](https://blog.twitter.com/2017/the-infrastructure-behind-twitter-scale)\n* [Scaling Reliability at Twitter: So You Want to Add a 9](https://www.youtube.com/watch?v=hYu13kBenjE)\n* [Principles Of Chaos Engineering](http://principlesofchaos.org/)\n* [Chaos Engineering](https://www.infoq.com/articles/chaos-engineering)\n* [Available...or not? That is the question - CRE life lessons](https://cloudplatform.googleblog.com/2017/01/available-or-not-that-is-the-question-CRE-life-lessons.html)\n* [How Google Backs Up The Internet Along With Exabytes Of Other Data](http://highscalability.com/blog/2014/2/3/how-google-backs-up-the-internet-along-with-exabytes-of-othe.html)\n* [Performance, Scalability, And High Availability: 3 Key Infrastructure Adaptability Requirements](http://highscalability.com/blog/2017/2/2/performance-scalability-and-high-availability-3-key-infrastr.html)\n* The Production Environment at Google - [Part 1](https://medium.com/@jerub/the-production-environment-at-google-8a1aaece3767) \u0026 [Part 2](https://medium.com/@jerub/the-production-environment-at-google-part-2-610884268aaa)\n* [Reliable releases and rollbacks - CRE life lessons](https://cloudplatform.googleblog.com/2017/03/reliable-releases-and-rollbacks-CRE-life-lessons.html)\n* [How release canaries can save your bacon - CRE life lessons](https://cloudplatform.googleblog.com/2017/03/how-release-canaries-can-save-your-bacon-CRE-life-lessons.html)\n* [Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites](https://zwischenzugs.wordpress.com/2017/04/04/things-i-learned-managing-site-reliability-for-some-of-the-worlds-busiest-gambling-sites/)\n* [Every Day Is Monday in 
Operations](https://www.linkedin.com/pulse/introduction-every-day-monday-operations-benjamin-purgason)\n* [Under the Hood: Ensuring Site Reliability](https://engineering.squarespace.com/blog/2017/under-the-hood-ensuring-site-reliability)\n* [Designing reliable systems with cloud infrastructure (Google Cloud Next '17)](https://www.youtube.com/watch?v=7Hy_6SMn8pY)\n* [A Google SRE explores GitHub reliability with BigQuery](https://cloud.google.com/blog/big-data/2016/10/a-google-sre-explores-github-reliability-with-bigquery)\n* [Know thy enemy: how to prioritize and communicate risks - CRE life lessons](https://cloudplatform.googleblog.com/2017/05/know-thy-enemy-how-to-prioritize-and-communicate-risks-CRE-life-lessons.html)\n* [Chaos Engineering resources](https://github.com/dastergon/awesome-chaos-engineering)\n* [CRE life lessons: What is a dark launch, and what does it do for me?](https://cloudplatform.googleblog.com/2017/08/CRE-life-lessons-what-is-a-dark-launch-and-what-does-it-do-for-me.html)\n* [Why you should pick strong consistency, whenever possible](https://cloudplatform.googleblog.com/2018/01/why-you-should-pick-strong-consistency-whenever-possible.html)\n* [The Network is Reliable](https://queue.acm.org/detail.cfm?id=2655736)\n* [Are You Load Balancing Wrong?](https://queue.acm.org/detail.cfm?id=3028689)\n* [How production engineers support global events on Facebook](https://code.facebook.com/posts/166966743929963/how-production-engineers-support-global-events-on-facebook/)\n* [Google: A Collection Of Best Practices For Production Services](http://highscalability.com/blog/2018/4/16/google-a-collection-of-best-practices-for-production-service.html)\n* [Canary Analysis Service](https://queue.acm.org/detail.cfm?id=3194655)\n* [Tips for High Availability](https://medium.com/@NetflixTechBlog/tips-for-high-availability-be0472f2599c)\n* [Progressive Service Architecture At Auth0](https://auth0.com/blog/progressive-service-architecture-at-auth0/)\n* [Google Cloud 
Production Guideline](https://medium.com/google-cloud/production-guideline-9d5d10c8f1e)\n* [production readiness](https://jbd.dev/prod-readiness/)\n* [Trust By Design: The Fusion of Operational Maturity and Risk Modeling](https://www.youtube.com/watch?v=Vvd3uvNvMns)\n* [Top Seven Myths of Robust Systems](https://www.verica.io/top-seven-myths-of-robust-systems/)\n* [Taming chaos: Preparing for your next incident](https://www.oreilly.com/ideas/taming-chaos-preparing-for-your-next-incident)\n* [PID Loops and the Art of Keeping Systems Stable](https://www.youtube.com/watch?v=3AxSwCC7I4s)\n* [Are you ready for production?](https://www.youtube.com/watch?v=YptJ2rrGAYY) - [Slides](https://speakerdeck.com/rakyll/are-you-ready-for-production)\n* [Production Checklist for Web Apps on Kubernetes](https://srcco.de/posts/web-service-on-kubernetes-production-checklist-2019.html)\n* [Finding a problem at the bottom of the Google stack](https://cloud.google.com/blog/products/management-tools/sre-keeps-digging-to-prevent-problems)\n* [Rethinking Task Size in SRE](https://www.oreilly.com/content/rethinking-task-size-in-sre/)\n* [How maintenance windows affect your error budget](https://cloud.google.com/blog/products/management-tools/sre-error-budgets-and-maintenance-windows)\n* [The Production Readiness Spectrum](https://dastergon.gr/posts/2020/09/the-production-readiness-spectrum/)\n* [Generic mitigations](https://www.oreilly.com/content/generic-mitigations/)\n* [How we’re building a production readiness review process at Grafana Labs](https://grafana.com/blog/2021/10/13/how-were-building-a-production-readiness-review-process-at-grafana-labs/)\n* [Resiliency Planning for High-Traffic Events](https://shopify.engineering/resiliency-planning-for-high-traffic-events)\n* [Using Fault Injection Testing to Improve DoorDash Reliability](https://doordash.engineering/2022/04/25/using-fault-injection-testing-to-improve-doordash-reliability/)\n\n## Monitoring \u0026 Observability \u0026 
Alerting\n* [A Working Theory-of-Monitoring](https://www.usenix.org/conference/lisa13/working-theory-monitoring)\n* [The Evolution of Monitoring Systems at Google - Tony Rippy](https://vimeo.com/131484321)\n* [Monitoring without Infrastructure @ Airbnb](https://www.usenix.org/conference/srecon15/program/presentation/serebryany)\n* [Monitoring distributed systems](https://www.oreilly.com/ideas/monitoring-distributed-systems)\n* [Observability at Uber Engineering: Past, Present, Future](https://www.youtube.com/watch?v=2JAnmzVwgP8)\n* [The 4 Golden Signals of API Health and Performance in Cloud-Native Applications](https://blog.netsil.com/the-4-golden-signals-of-api-health-and-performance-in-cloud-native-applications-a6e87526e74)\n* [My Philosophy on Alerting by Rob Ewaschuk](https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/preview#)\n* [Time To Detect - Netflix](https://www.youtube.com/watch?v=wsgpV67MLFo)\n* [Why Percentiles Don’t Work the Way you Think](https://www.vividcortex.com/blog/why-percentiles-dont-work-the-way-you-think)\n* [Building Twitter’s Next-Gen Alerting System](https://www.youtube.com/watch?v=jQggG0qIjTM)\n* [Instrumentation: Worst case performance matters](https://honeycomb.io/blog/2017/01/instrumentation-worst-case-performance-matters/)\n* [Instrumentation: What does 'uptime' mean?](https://honeycomb.io/blog/2017/01/instrumentation-what-does-uptime-mean/)\n* [Incidents + Outages at CircleCI: Our Playbook and What We’ve Learned](https://circleci.com/blog/incidents-outages-at-circleci-our-playbook-and-what-we-ve-learned/)\n* [An introduction to monitoring and alerting with timeseries at scale, with Prometheus](https://www.youtube.com/watch?v=gNmWzkGViAY)\n* [Detecting outliers and anomalies in realtime at Datadog](https://www.youtube.com/watch?v=mG4ZpEhRKHA)\n* [How to Monitor the SRE Golden Signals](https://medium.com/devopslinks/how-to-monitor-the-sre-golden-signals-1391cadc7524)\n* [Monitoring in a DevOps 
World](https://queue.acm.org/detail.cfm?id=3178371)\n* [Monitoring Your Monitoring’s Monitoring](https://medium.com/@jerub/monitoring-your-monitorings-monitoring-51d479100f4c)\n* [Observability: the new wave or buzzword?](https://medium.com/@dlite/observability-the-new-wave-or-buzzword-fc23a68abf72)\n* [Monitoring Isn't Observability](https://www.vividcortex.com/blog/monitoring-isnt-observability)\n* [Monitoring in the time of Cloud Native](https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e)\n* [Principles of Monitoring Microservices](https://www.youtube.com/watch?v=2LNHv0JyBUk)\n* [The Many Ways Your Monitoring Is Lying to You](https://www.usenix.org/node/197446)\n* [GitOps Part 3 - Observability](https://www.weave.works/blog/gitops-part-3-observability)\n* [Want to Debug Latency?](https://medium.com/observability/want-to-debug-latency-7aa48ecbe8f7)\n* [Debugging Latency in Go 1.11](https://medium.com/observability/debugging-latency-in-go-1-11-9f97a7910d68)\n* [Alerting on SLOs like Pros](https://developers.soundcloud.com/blog/alerting-on-slos)\n* [Applied Alerting Philosophy](https://www.youtube.com/watch?v=JhxfZ0VIPP0)\n* [Observations on Observability](https://blog.colinbreck.com/observations-on-observability/)\n* [Deploys: It's Not Actually About Fridays](https://charity.wtf/2019/10/28/deploys-its-not-actually-about-fridays/)\n* [Site Reliability Engineering Best Practices for Data Pipelines](https://medium.com/better-programming/site-reliability-engineering-best-practices-for-data-pipelines-44a78e91f6f0)\n* [Elastic Observability in SRE and Incident Response](https://www.elastic.co/blog/elastic-observability-sre-incident-response)\n* [Error Budget Policy - Part 1 - Adoption at Expedia Group](https://medium.com/expedia-group-tech/error-budget-policy-adoption-at-expedia-group-7d80d41c4a8b)\n* [Error Budget Policy - Part 2 - Practices at Expedia 
Group](https://medium.com/expedia-group-tech/error-budget-policies-in-practice-4c98f56a28c1)\n\n## On-Call\n* [Being an On-Call Engineer: A Google SRE Perspective](http://research.google.com/pubs/pub44813.html)\n* [Inside Atlassian: how our site reliability engineers do incident management](https://www.atlassian.com/blog/it-teams/inside-atlassian-site-reliability-engineers-incident-management)\n* [Inside Atlassian: how IT \u0026 SRE use ChatOps to run incident management](https://www.atlassian.com/blog/2016/02/inside-atlassian-sre-use-chatops-run-incident-management)\n* [Incident Response at Heroku](https://blog.heroku.com/archives/2014/5/9/incident-response-at-heroku)\n* [Who's On Call?](http://www.susanjfowler.com/blog/2016/9/6/whos-on-call)\n* [SysAdvent - Day 6 - No More On-Call Martyrs](https://sysadvent.blogspot.com/2016/12/day-6-no-more-on-call-martyrs.html)\n* [On Being On Call](http://naildrivin5.com/blog/2016/12/07/on-call.html)\n* [The On-Call Handbook](https://github.com/alicegoldfuss/oncall-handbook)\n* [Incident management at Google — adventures in SRE-land](https://cloudplatform.googleblog.com/2017/02/Incident-management-at-Google-adventures-in-SRE-land.html)\n* [Run Book / Operations Manual template](https://github.com/SkeltonThatcher/run-book-template)\n* [Automating Your Oncall: Open Sourcing Fossor and Ascii Etch](https://engineering.linkedin.com/blog/2017/12/open-sourcing-fossor-and-ascii-etch)\n* [Project STAR*: Streamlining Our On-Call Process](https://engineering.linkedin.com/blog/2018/01/project-star-streamlining-our-on-call-process)\n* [SRE@Xero: Managing Incidents Part I](https://devblog.xero.com/sre-xero-managing-incidents-part-i-7d02d650a71c)\n* [SRE@Xero: Managing Incidents Part II](https://devblog.xero.com/sre-xero-managing-incidents-part-ii-224a6e06f426)\n* [How To Establish a High Severity Incident Management Program](https://www.gremlin.com/how-to-establish-a-high-severity-incident-management-program/)\n* [How Your Systems Keep 
Running Day After Day - John Allspaw](https://www.youtube.com/watch?v=xA5U85LSk0M)\n* [On-call doesn’t have to suck](https://medium.com/@copyconstruct/on-call-b0bd8c5ea4e0)\n* [Why, as a Netflix infrastructure manager, am I on call?](https://medium.com/@awspyker/why-as-a-netflix-infrastructure-manager-am-i-on-call-bdc551ac01fe)\n* [Oncall and Sustainable Software Development](https://honeycomb.io/blog/2018/02/oncall-and-sustainable-software-development/)\n* [On Call Rotations: How Best to Wake Devs Up in the Middle of the Night](https://thenewstack.io/call-rotations-best-wake-devs-middle-night/)\n* [Understanding The Role Of The Incident Manager On-Call (IMOC)](https://www.gremlin.com/community/tutorials/understanding-the-role-of-the-incident-manager-on-call-imoc/)\n* [3 Ways to Minimize the Impact of High Severity Incidents](https://devops.com/three-ways-to-minimize-the-impact-of-high-severity-incidents/)\n* [Advice to Management Teams While Enrolling Changes to On-Call Systems](https://thenewstack.io/advice-management-teams-enrolling-changes-on-call-systems/)\n* [Moving Past Shallow Incident Data](http://www.adaptivecapacitylabs.com/blog/2018/03/23/moving-past-shallow-incident-data/)\n* [Sustainable On-Call](https://codywilbourn.com/2018/03/22/sustainable-on-call/)\n* [dotScale 2017 - Aish Raj Dahal - Chaos management during a major incident](https://youtu.be/8pPrtf1J1Z8)\n* [Incident Management at Netflix Velocity](https://www.infoq.com/presentations/netflix-incident-management)\n* [Incidents, fixes, and the day after](https://medium.com/booking-com-infrastructure/incidents-fixes-and-the-day-after-c5d9aeae28c3)\n* [10 Steps to Develop an Incident Response Plan You’ll ACTUALLY Use](https://engineering.salesforce.com/10-steps-to-develop-an-incident-response-plan-youll-actually-use-6cc49d9bf94c)\n* [Checklists: a stupidly simple but valuable operational gift](https://tech.buzzfeed.com/checklists-an-operational-gift-aaf42cf0be12)\n* [How to write a status page 
update](https://blog.hostedgraphite.com/2018/09/13/how-to-write-a-status-page-update/)\n* [Atlassian Incident Handbook](https://www.atlassian.com/software/jira/ops/handbook)\n* [PagerDuty Incident Response Handbook](https://response.pagerduty.com/)\n* [Avoiding Burnout for SREs](https://blog.zenduty.com/blog/2019/05/02/Avoiding-SRE-Burnout)\n* [Better On-Call the SRE way](https://vimeo.com/344516642)\n* [Managing Incidents at Monzo](https://www.youtube.com/watch?v=ZqwVlsIonIw)\n* [Making On-Call Not Suck](https://dev.to/molly_struve/making-on-call-not-suck-490)\n* [How we (Monzo) respond to incidents](https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents)\n* [How we’ve evolved on-call at Monzo](https://monzo.com/blog/how-weve-evolved-on-call-at-monzo)\n* [Code Yellow: When Operations Isn’t Perfect](https://devops.com/code-yellow-when-operations-isnt-perfect/)\n* [MTTR is dead, long live CIRT](https://opensource.com/article/19/7/measure-operational-performance)\n* [Extended Dreyfus Model for Incident Lifecycles](https://github.com/preed/incident-lifecycle-model)\n* [Inhumanity of Root Cause Analysis](https://www.verica.io/inhumanity-of-root-cause-analysis/)\n* [Incident insights from NASA, NTSB, and the CDC](https://www.youtube.com/watch?v=ODYO2MPymJ4)\n* [How to avoid On-Call Burnout the SRE Way](https://www.squadcast.com/blog/how-to-avoid-on-call-burnout)\n* [My week shadowing a GitLab Site Reliability Engineer](https://about.gitlab.com/blog/2019/12/16/sre-shadow/)\n* [How our production team runs the weekly on-call handover](https://about.gitlab.com/blog/2018/03/14/the-on-call-handover-at-gitlab/)\n* [Writing Runbook Documentation When You’re An SRE](https://www.transposit.com/blog/2020.01.30-writing-runbook-documentation-when-youre-an-sre/)\n* [Incident response, programs and you(r startup)](https://lethain.com/incident-response-programs-and-your-startup/)\n* [An Incident Command Training 
Handbook](https://blog.danslimmon.com/2019/06/24/an-incident-command-training-handbook/)\n* [Shrinking the time to mitigate production incidents](https://cloud.google.com/blog/products/management-tools/shrinking-the-time-to-mitigate-production-incidents)\n* [Incident writeup as sociological storytelling](https://surfingcomplexity.blog/2021/06/11/incident-writeup-as-sociological-storytelling/)\n* [Elephant in the Blameless War Room: Accountability](https://www.blameless.com/incident-response/elephant-in-the-blameless-war-room-accountability)\n* [Naming names in incident writeups](https://surfingcomplexity.blog/2021/05/22/naming-names-in-incident-writeups/)\n* [Building On-Call Culture at GitHub](https://github.blog/2021-01-06-building-on-call-culture-at-github/)\n\n## Post-Mortem\n* [A collection of post-mortems](https://github.com/danluu/post-mortems)\n* [Collection of Kubernetes Failure Stories](https://github.com/hjacobs/kubernetes-failure-stories)\n* [Blameless PostMortems and a Just Culture](https://codeascraft.com/2012/05/22/blameless-postmortems/)\n* [A Tale of Postmortems](https://blog.box.com/blog/a-tale-of-postmortems/)\n* [Building a Blameless Post-Mortem Culture with Jason Hand](http://runasradio.com/Shows/Show/486)\n* [The infinite hows](https://www.oreilly.com/ideas/the-infinite-hows)\n* [Failure is Always An Option: How a Blameless Culture Leads to Better Results](https://victorops.com/blog/blameless-culture/)\n* [SysAdvent - Day 1 - Why You Need a Postmortem Process](https://sysadvent.blogspot.com/2016/12/day-1-why-you-need-postmortem-process.html)\n* [Etsy’s Debriefing Facilitation Guide for Blameless Postmortems](https://codeascraft.com/2016/11/17/debriefing-facilitation-guide/)\n* [Writing Your First Postmortem](https://sharpend.io/writing-your-first-postmortem/)\n* [How to Write Great Outage Post-Mortems](https://artsy.github.io/blog/2014/11/19/how-to-write-great-outage-post-mortems/)\n* [A collection of postmortem 
templates](https://github.com/dastergon/postmortem-templates)\n* [Embracing Feedback](https://blog.heptio.com/embracing-feedback-2fd703da714f)\n* [Postmortem Action Items: Plan the Work and Work the Plan](https://www.usenix.org/conference/srecon17americas/program/presentation/lueder)\n* [Social Issues In Postmortems](https://medium.com/@allspaw/social-issues-in-postmortems-d48dde624d18)\n* [Google Has an Official Process in Place for Learning From Failure--and It's Absolutely Brilliant](https://www.inc.com/justin-bariso/meet-postmortem-googles-brilliant-process-tool-for-learning-from-failure.html)\n* [Postmortem culture: how you can learn from failure](https://rework.withgoogle.com/blog/postmortem-culture-how-you-can-learn-from-failure/)\n* [re:Work - Postmortem discussion template](https://docs.google.com/document/d/1ob0dfG_gefr_gQ8kbKr0kS4XpaKbc0oVAk4Te9tbDqM/edit)\n* [Post-mortems to the rescue](https://increment.com/documentation/post-mortems-to-the-rescue/)\n* [Postmortem Action Items: Plan the Work and Work the Plan](https://ai.google/research/pubs/pub45906)\n* [Why Every Company Can Benefit from a Blameless Culture](https://www.blameless.com/why-companies-can-benefit-from-blameless-culture/)\n* [\"It's dead, Jim\": How we write an incident postmortem](https://www.hostedgraphite.com/blog/its-dead-jim-how-we-write-an-incident-postmortem)\n* [Our incident postmortem template](https://www.hostedgraphite.com/blog/incident-postmortem-template)\n* [Learn out of mistakes. 
Postmortems to the rescue.](https://fernandocejas.com/2020/03/21/learn-out-of-mistakes-postmortems/)\n* [Improving Postmortem Practices with Veteran Google SRE, Steve McGhee](https://www.blameless.com/improve-postmortem-with-sre-steve-mcghee/)\n* [Inhumanity of Root Cause Analysis](https://www.verica.io/blog/inhumanity-of-root-cause-analysis/)\n\n## Capacity Planning\n* [Capacity Planning](https://www.usenix.org/system/files/login/articles/login_feb15_07_hixson.pdf)\n* [SouthBay SRE: Cloud Capacity Planning](https://www.youtube.com/watch?v=MDQ0uEUmLOo)\n* [Intent-based Capacity Planning and Autoscaling with Kubernetes](https://www.squadcast.com/blog/intent-based-capacity-planning-and-autoscaling-with-kubernetes)\n* [How do you do Capacity Planning](https://jvns.ca/blog/2016/03/20/how-do-you-do-capacity-planning/)\n* [How Back Market SREs prepared for Black Friday](https://medium.com/back-market-engineering/how-back-market-sres-prepared-for-black-friday-5f017f343408)\n\n## Service Level Agreement\n* [If It's in the Cloud, Get It on Paper: Cloud Computing Contract Issues](http://er.educause.edu/articles/2010/6/if-its-in-the-cloud-get-it-on-paper-cloud-computing-contract-issues)\n* [Service Level Agreements in the Cloud: Who cares?](http://www.wired.com/insights/2011/12/service-level-agreements-in-the-cloud-who-cares/)\n* [SysAdvent- Day 20 - How to set and monitor SLAs](https://sysadvent.blogspot.com/2016/12/day-20-how-to-set-and-monitor-slas.html)\n* [SLOs, SLIs, SLAs, oh my - CRE life lessons](https://cloudplatform.googleblog.com/2017/01/availability-part-deux--CRE-life-lessons.html)\n* [Service Levels and Error Budgets](https://www.usenix.org/conference/srecon16/program/presentation/jones)\n* [(Un)Reliability Budgets - Finding Balance between Innovation and Reliability](https://www.usenix.org/system/files/login/articles/login_aug15_06_roth.pdf)\n* [The Calculus of Service Availability](https://queue.acm.org/detail.cfm?id=3096459\u0026__s=dnkxuaws9pogqdnxmx8i)\n* 
[Availability Calculator: Calculate how much downtime should be permitted in your SLA](https://dastergon.github.io/availability-calculator/)\n* [Standardize cloud SLA availability with numerical performance data](https://www.ibm.com/developerworks/cloud/library/cl-SLAloadbalance-numanalysis/)\n* [Best practices to develop SLAs for cloud computing](https://www.ibm.com/developerworks/cloud/library/cl-slastandards/)\n* [A Practical Guide to SLAs](https://www.catchpoint.com/blog/sla-management-guide/)\n* [Building good SLOs - CRE life lessons](https://cloudplatform.googleblog.com/2017/10/building-good-SLOs-CRE-life-lessons.html)\n* [No Grumpy Humans and Other Site Reliability Engineering Lessons from Google](https://thenewstack.io/sre-lessons-google-no-grumpy-humans/)\n* [Consequences of SLO violations — CRE life lessons](https://cloudplatform.googleblog.com/2018/01/consequences-of-SLO-violations-CRE-life-lessons.html)\n* [Service Level Objectives in Practice](https://medium.com/@jerub/service-level-objectives-in-practice-ed1200502d5)\n* [SRE Consensus Building](https://medium.com/@jerub/sre-consensus-building-36ad5d2e470b)\n* [An example escalation policy — CRE life lessons](https://cloudplatform.googleblog.com/2018/01/an-example-escalation-policy-CRE-life-lessons.html)\n* [Error Budget Calculator](https://dastergon.gr/error-budget-calculator/)\n* [Understanding error budget overspend - part one - CRE life lessons](https://cloudplatform.googleblog.com/2018/06/understanding-error-budget-overspend-cre-life-lessons.html)\n* [Good housekeeping for error budgets - part two - CRE life lessons](https://cloudplatform.googleblog.com/2018/06/cre-life-lessons-good-housekeeping-for-error-budgets.html)\n* [SRE fundamentals: SLIs, SLAs and SLOs](https://cloudplatform.googleblog.com/2018/07/sre-fundamentals-slis-slas-and-slos.html)\n* [SLOs \u0026 You: A Guide To Service Level Objectives](https://www.circonus.com/2018/07/a-guide-to-service-level-objectives/)\n* [Earning Our Wings: 
Stories and Findings From Operating a Large-scale Concourse Deployment](https://medium.com/concourse-ci/earning-our-wings-a0c307fa73e6)\n* [Nines are Not Enough: Meaningful Metrics for Clouds](https://ai.google/research/pubs/pub48033)\n* [How many nines is my storage system?](https://medium.com/@jamesacowling/how-many-nines-is-my-storage-system-7d16e852d56d)\n* [Don't follow the sun.](https://lethain.com/dont-follow-the-sun/)\n* [The Tyranny of the SLA](https://www.youtube.com/watch?v=4cPqLuIXBnw)\n* [Backblaze Durability is 99.999999999% — And Why It Doesn’t Matter](https://www.backblaze.com/blog/cloud-storage-durability/)\n* [DevOpsDays Chicago 2019 - The Art of SLOs](https://youtu.be/Dfnbw5dJQ5I)\n* [The Art of SLOs Workshop Materials](https://cre.page.link/art-of-slos)\n* [How to Include Latency in SLO-Based Alerting](https://grafana.com/blog/2019/11/27/kubecon-recap-how-to-include-latency-in-slo-based-alerting/)\n* [Succeeding With Service Level Objectives](https://www.squadcast.com/blog/succeeding-with-service-level-objectives)\n* [Putting customers first with SLIs and SLOs](https://medium.com/the-telegraph-engineering/putting-customers-first-with-slis-and-slos-15352f9b6cbc)\n* [SRE Leadership: Have Tiered SLAs](https://medium.com/site-reliability-engineering-leadership/sre-tip-have-tiered-slas-2c432ffe46a)\n* [How SLOs Enable Fast, Reliable Application Delivery](https://www.blameless.com/blog/how-slos-enable-fast-reliable-application-delivery)\n* [The Tail at Scale](https://billduncan.org/the-tail-at-scale/)\n* [The Tail at Scale Revisited](https://billduncan.org/the-tail-at-scale-revisited/)\n* [Defining SLOs for services with dependencies](https://cloud.google.com/blog/products/gcp/defining-slos-for-services-with-dependencies-cre-life-lessons)\n* [Service Level Disagreements](https://blog.b3k.us/2009/07/15/service-level-disagreements.html)\n* [How We Use Sloth to do SLO Monitoring and Alerting with 
Prometheus](https://mattermost.com/blog/sloth-for-slo-monitoring-and-alerting-with-prometheus/)\n* [SLI Deep Dive](https://medium.com/site-reliability-engineering-leadership/sli-deep-dive-cae92bd90a79)\n* [Measuring Reliability in GCP: Step By Step SLO creation guide using Cloud Operation Sandbox](https://medium.com/google-cloud/measuring-reliability-in-gcp-step-by-step-slo-creation-guide-using-cloud-operation-sandbox-99043bd0e70f)\n* [SLO tracker](https://slotracker.com/)\n* [SLO Alerting for Mortals](https://ervinbarta.com/2021/10/19/slo-alerting-for-mortals/)\n* [SRE methods and climate change](https://bpetit.nce.re/2021/03/sre-methods-and-climate-change/)\n* [What made SLOs so messy (and what we can do about it)](https://medium.com/lightstephq/what-made-slos-so-messy-and-what-we-can-do-about-it-89be415a80b3)\n* [SLICK: Adopting SLOs for improved reliability](https://engineering.fb.com/2021/12/13/production-engineering/slick/)\n* [Calculating composite SLA](https://alexewerlof.medium.com/calculating-composite-sla-d855eaf2c655)\n* [Best practices for setting SLOs and SLIs for modern, complex systems](https://newrelic.com/blog/best-practices/best-practices-for-setting-slos-and-slis-for-modern-complex-systems)\n\n## Performance\n* [Performance Checklists for SREs](https://www.brendangregg.com/blog/2016-05-04/srecon2016-perf-checklists-for-sres.html)\n* [South Bay SRE Meetup - Netflix Cloud Performance Team](https://youtu.be/uQ0flQOtQEA)\n* [Software Performance Analysis Guided By SLOs](https://medium.com/dm03514-tech-blog/sre-performance-analysis-tuning-methodology-using-a-simple-http-webserver-in-go-d475460f27ca)\n* [A framework for pragmatic performance engineering](https://mterwill.com/posts/framework-for-performance-engineering/)\n\n## Programming\n* [Go Language for Ops and Site Reliability Engineering](http://www.oreilly.com/pub/e/2712)\n* [Go for SREs using 
Python](https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_hamilton.pdf)\n* [Operability in Go](https://speakerdeck.com/ianschenck/operability-in-go)\n* [Go Reliability and Durability at Dropbox](https://www.youtube.com/watch?v=5doOcaMXx08)\n\n## Misc Articles\n* [What is SRE (Site Reliability Engineering)?](https://www.oreilly.com/ideas/what-is-sre-site-reliability-engineering)\n* [Here’s How Google Makes Sure It (Almost) Never Goes Down](http://www.wired.com/2016/04/google-ensures-services-almost-never-go/)\n* [Are site reliability engineers the next data scientists?](http://techcrunch.com/2016/03/02/are-site-reliability-engineers-the-next-data-scientists/)\n* [Site Reliability Engineers: \"solving the most interesting problems\"](http://googleresearch.blogspot.gr/2012/07/site-reliability-engineers-solving-most.html)\n* [Site Reliability Engineers: the \"world’s most intense pit crew\"](http://googleforstudents.blogspot.gr/2012/06/site-reliability-engineers-worlds-most.html)\n* [Site reliability engineering kicks rote tasks out of IT ops](http://searchitoperations.techtarget.com/feature/Site-reliability-engineering-kicks-rote-tasks-out-of-IT-ops)\n* [Notes on Site Reliability Engineering](http://danluu.com/google-sre-book/)\n* [Adventures in SRE-land: Welcome to Google Mission Control](https://cloudplatform.googleblog.com/2016/07/adventures-in-SRE-land-welcome-to-Google-Mission-Control.html)\n* [Book Review: Site Reliability Engineering - How Google Runs Production Systems](https://www.infoq.com/articles/site-reliability-engineering)\n* [Site Reliability Engineers: “We solve cooler problems”](https://www.google.com/about/careers/stories/site-reliability-engineering-profile-google/)\n* [SREcon17: Brave new world of site reliability engineering](http://www.networkworld.com/article/3182827/cloud-computing/srecon17-brave-new-world-of-site-reliability-engineering.html)\n* [Open AWS guide](https://github.com/open-guides/og-aws)\n* 
[Commentary on Site Reliability Engineering](https://medium.com/@jerub/commentary-on-site-reliability-engineering-9ba9e1be2a8c)\n* [Site Reliability Engineering: 4 Things to Know](https://www.networkcomputing.com/data-centers/site-reliability-engineering-4-things-know/888724300)\n* [Looking for SRE Success? Then Find the Intrapreneurs!](https://www.linkedin.com/pulse/looking-sre-success-find-intrapreneurs-josh-gilliland/)\n* [What Team Structure is Right for DevOps to Flourish?](http://web.devopstopologies.com/)\n* [Injured on Vacation? Applying Principles from Site Reliability Engineering to a Travel Emergency](https://www.sidewalksafari.com/2018/12/sre-in-a-travel-emergency.html)\n* [Building blameless working environment](https://sobolevn.me/2018/12/blameless-environment)\n* [SRE Adoption Report](https://techbeacon.com/devops/how-accenture-retrofitted-site-reliability-engineering)\n* [SREs: The Happiest – and Highest Paid – in the Industry](https://devops.com/sres-the-happiest-and-highest-paid-in-the-industry/)\n* [The Role of Site Reliability Engineering, Today and Tomorrow](https://thenewstack.io/the-role-of-site-reliability-engineering-today-and-tomorrow/)\n* [SRE as a Lifestyle Choice](https://medium.com/@bellmar/sre-as-a-lifestyle-choice-de9f5a82d73d)\n* [SRECon EMEA 2019 Recap](https://speakerdeck.com/dastergon/srecon-emea-2019-recap-sre-muc-meetup)\n* [Life of an SRE at Google - JC van Winkel](https://www.youtube.com/watch?v=7Oe8mYPBZmw)\n* [Site Reliability Engineering for Native Mobile Apps - Abhijith Krishnappa](https://www.infoq.com/articles/site-reliability-engineering-mobile-apps/) - Case study: Halodoc adaptation of SRE principles for Native Mobile Apps\n* [SRE Best Practices by InfraCloud](https://www.infracloud.io/blogs/sre-best-practices/)\n\n## Real-time Messaging\n* [#sre channel at Hangops Slack](https://hangops.slack.com/) - Discussion of Site Reliability Engineering generally.\n* [#incident_response channel at Hangops 
Slack](https://hangops.slack.com/) - Discussion about Incident Response.\n* [USENIX SREcon Slack](https://usenix-srecon.slack.com)\n\n## Blogs\n* [Brendan Gregg's Blog](http://www.brendangregg.com/blog/index.html) - Highly Technical Blog Posts About Systems Internals, Performance and SRE.\n* [Everything Sysadmin](http://everythingsysadmin.com/) - Blog Posts About SysAdmin/DevOps/SRE by Tom Limoncelli.\n* [High Scalability](http://highscalability.com/) - Technical Blog Posts About Systems Architecture.\n* [rachelbythebay](https://rachelbythebay.com/w/) - Technical Blog Posts.\n* [Susan J. Fowler](http://www.susanjfowler.com/blog/) - Various blog posts about SRE, Software Engineering and Microservices.\n* [SysAdvent](https://sysadvent.blogspot.com) - One article for each day of December, ending on the 25th article.\n* [Stephen Thorne's Blog](https://medium.com/@jerub) - Blog Posts About SRE\n* [Increment](https://increment.com/) - A digital magazine about how teams build and operate software systems at scale.\n* [GopherSRE](http://www.gophersre.com/) - Blog Posts about Go and SRE.\n* [Cindy Sridharan](https://medium.com/@copyconstruct) - Blog posts about distributed systems and their management.\n* [Blameless Blog](https://www.blameless.com/blog/) - Blog posts about SRE culture and practices.\n* [Resilience Roundup](https://ResilienceRoundup.com) - Weekly analysis of Resilience Engineering and Human Factors research designed for software systems\n* [Squadcast Blog](https://www.squadcast.com/blog) - Blog posts about SRE best practices, reliability, on-call and incident management.\n* [FireHydrant Blog](https://www.firehydrant.io/blog) - Posts about complex systems, incident response, and SRE best practices.\n* [Rootly Blog](https://www.rootly.io/blog) - Incident management best practices and guides.\n* [incident.io Blog](https://www.incident.io/blog) - Guides, advice and resources on incident management and response.\n* [Logit.io Blog](https://logit.io/blog) - 
Resources on log management, SRE and DevOps.\n\n## Newsletters\n* [DevOpsLinks](https://faun.dev) - A weekly newsletter about SRE, SysAdmin and DevOps news, tools, tutorials and opinions.\n* [KubeWeekly](https://kubeweekly.io/) - The weekly newsletter for all things Kubernetes. KubeWeekly is curated by Bob Killen, Chris Short, Craig Box, Kim McMahon and Michael Hausenblas\n* [SRE Weekly](https://sreweekly.com/) - Weekly Site Reliability Newsletter.\n* [O’Reilly Systems Engineering and Operations Newsletter](http://www.oreilly.com/webops-perf/newsletter.html) - Weekly systems engineering and operations news and insights from industry insiders.\n* [ChaosEngineering.news](https://chaosengineering.news/) - Chaos Engineering newsletter. All things Chaos Engineering, directly to your inbox!\n* [Monitoring Weekly](https://monitoring.love/) - What's new in monitoring? Curated monitoring articles to your inbox each week.\n* [Observability news](https://o11y.news/) - Updates around observability (o11y) with a special focus on open source.\n\n## Conferences \u0026 Meetups\n* [SRECon Conferences](https://www.usenix.org/conferences/byname/925) - The Official SRE Conference.\n* [LISA Conferences](https://www.usenix.org/conferences/byname/5) - Prominent Conference About SysAdmin/DevOps/SRE.\n* [SRE Tech Talks](https://developers.google.com/events/sre/) - SRE Talks Hosted by Google.\n* [South Bay Site Reliability Engineering (Sunnyvale, CA) Meetup](https://www.meetup.com/South-Bay-Site-Reliability-Engineering/) - A Group For Individuals Who Tackle Reliability Challenges For Web-Scale Systems.\n* [San Francisco Reliability Engineering](https://www.meetup.com/San-Francisco-Reliability-Engineering/) - A Group Of People Who Are Passionate About Reliable, Performant Software Systems.\n* [Site Reliability Engineering Munich, Germany](https://www.meetup.com/Site-Reliability-Engineering-Munich/) - SRE Meetup in the greater area of Oktoberfest city.\n* [ADDO - All Day 
DevOps](https://www.alldaydevops.com/) - A 24-hour conference that is completely online and free.\n* [Site Reliability Engineering Paris, France](https://www.meetup.com/Site-Reliability-Engineering-Paris/) - SRE Meetup in the city of light.\n* [Site Reliability Engineering India](https://www.meetup.com/site-reliability-enggineering/) - SRE Meetup India\n\n## Twitter\n* [Google SRE Twitter Account](https://twitter.com/googlesre) - Google's SRE Twitter Account.\n* [SREBook](https://twitter.com/SREBook) - The Official Twitter Account of Site Reliability Engineering Book.\n* [SREcon](https://twitter.com/SREcon) - SRECon's Official Twitter Account.\n* [SREWorkbook](https://twitter.com/SREWorkbook) - The Official Twitter Account of Site Reliability Workbook.\n* [The SRE Dev](https://twitter.com/The_SRE_Dev) - SRE-related Posts from [dev.to](https://dev.to).\n* [Twitter SRE](https://twitter.com/TwitterSRE) - The Official Twitter Account of Twitter's SRE team.\n* [Twitter SRE Weekly](https://twitter.com/SREWeekly) - The Official Twitter Account of SRE Weekly Newsletter.\n* [USENIX Association](https://twitter.com/usenix) - The Official USENIX Twitter Account.\n\n## SRE Tools\n* [Awesome SRE Tools](https://github.com/SquadcastHub/awesome-sre-tools) - A curated list of Site Reliability and Production Engineering tools\n* [List of Continuous Integration services](https://github.com/ligurio/awesome-ci)\n* [SRE cheat sheet](https://github.com/shibumi/SRE-cheat-sheet) - A cheat sheet for Site Reliability Engineering principles and numbers\n\n## Podcasts\n* [Blameless / Resilience in Action](https://podcasts.apple.com/us/podcast/resilience-in-action/id1506828506)\n* [Google SRE Prodcast](https://sre.google/prodcast)\n* [o11y Observability Podcast](https://www.honeycomb.io/usecase/o11ycast/)\n* [On Call Nightmares (retired)](https://podcasts.apple.com/us/podcast/on-call-nightmares-podcast/id1447430839)\n* [Making of the SRE 
Omelette](https://open.spotify.com/show/1KxLVUduNdDRAiOw8BB32J)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdastergon%2Fawesome-sre","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdastergon%2Fawesome-sre","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdastergon%2Fawesome-sre/lists"}