Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-sre
A curated list of awesome Site Reliability and Production Engineering resources.
https://github.com/zeroc0d3lab/awesome-sre
Last synced: 2 days ago
Culture
- The difference between Site Reliability Engineering, System Administration, and DevOps
- How Google Does Planet-Scale Engineering for Planet-Scale Infra
- Keys To SRE by Ben Treynor
- Notes from Production Engineering by Pedro Canahuati
- PostOps: Recovery from Operations
- Case Study: Adopting SRE Principles at StackOverflow
- Site Reliability Engineering at Dropbox - Video
- SRE@Google: Thousands of DevOps Since 2004
- Transactional System Administration Is Killing Us and Must be Stopped
- A hierarchy of SRE needs
- PostOps: A Non-Surgical Tale of Software, Fragility, and Reliability
- SRE: An incomplete guide to cultural Narnia - [[Video]](https://www.youtube.com/watch?v=__wypEhdcrQ&t=0s)
- Toil: A Word Every Engineer Should Know
- Engineering Reliability into Web Sites: Google SRE
- Putting Together Great SRE Teams
- Work at Google: Meet our Production Engineers for Site Reliability Hangout on Air
- DEVOPS & SRE AMA - Building High Performance Organizations
- John Allspaw's AMA on Incident Analysis and Postmortems
- Part 1 - 59-sre-ii-with-paul-newson
- How SysAdmins Devalue Themselves
- The Softer Side of DevOps
- SRE, noun. See also: confidence, trust.
- We are the Google Site Reliability Engineering team. Ask us Anything!
- The Cloudcast #301: SRE and Infrastructure Operations (Podcast)
- SE-Radio Episode 276: Björn Rabenstein on Site Reliability Engineering
- Microservices, DevOps and Production Complexity
- Evolution or Rebellion? The rise of Site Reliability Engineers (SRE)
- SRE in the Small and in the Large
- SBSRE Meetup: Different SRE roles and challenges (Netflix)
- Panel: Who/What Is SRE?
- Hope Is Not a Strategy
- Tenets of SRE
- Site Reliability Engineering Demystified
- Is Site Reliability Engineering the True ‘Ops’ in DevOps?
- SRE vs. DevOps vs. Cloud Native: The Server Cage Match
- Building the SRE Culture at LinkedIn
- Splicing SRE DNA Sequences in the Biggest Software Company on the Planet
- Building Blocks for Site Reliability At Google
- Beyond Google SRE: What is Site Reliability Engineering like at Medium?
- A crash course in LinkedIn's global site operations
- Google’s Site Reliability Engineering with Todd Underwood
- What is Site Reliability Engineering? (VMware)
- Understanding Site Reliability Engineering through Movies and Books
- Part 1 - makeup-of-successful-geographically-distributed-sre-teams--p0
- Tech Leadership in SRE
- The human scalability of "DevOps"
- Podcast: Site Reliability Management with Mike Hiraga
- How a cat inspired system reliability at Knowlarity
- Getting Started with Site Reliability Engineering
- "Practical Applications of the Dickerson Pyramid" by Nat Welch
- Interview with Betsy Beyer, Stephen Thorne of Google
- Less Risk Through Greater Humanity - Dave Rensin
- Getting Started with SRE - Stephen Thorne, Google
- Building Successful SRE in Large Enterprises
- Solving Reliability Fears with Site Reliability Engineering
- How to Avoid the 5 SRE Implementation Traps that Catch Even the Best Teams
- The Modern Site Reliability Workbench on Top of OCI
- SRE in the Third Age
- About SRE and how (not) to apply it
- Transitioning a typical engineering ops team into an SRE powerhouse
- Making a Lion Bulletproof: SRE in Banking
- Identifying and tracking toil using SRE principles
- Meeting reliability challenges with SRE principles
- The SRE I Aspire to Be
- Taming Operational Load with VMware CRE
- SRE Cultural Values
- Making operational work more visible
- Site Reliability Engineering at Salesforce
- A History of Site Reliability Engineering at Uber
- Site Reliability Engineering with Stephen Weinberg
- Intelligent Site Reliability Engineering – A Machine Learning Perspective
- Love DevOps? Wait 'till you meet SRE
- Site Reliability Engineers — Keeping Google up and running 24/7
- SRE vs. DevOps: competing standards or close friends?
- Reliability Engineering – The Essential Discipline for Complex Systems
- We are the Google Site Reliability team. We make Google’s websites work. Ask us Anything!
- GOTO 2017 • Site Reliability Engineering at Google • Christof Leng
- Are we there yet? Thoughts on assessing an SRE team’s maturity
- SRE vs. DevOps: What’s the Difference Between Them?
- How SREs find the landmines in a service - CRE life lessons
- Making the most of an SRE service takeover - CRE life lessons
Books
- Practical Linux Infrastructure
- Observability Engineering: Achieving Production Excellence
- The Checklist Manifesto: How to Get Things Right
- Microservices in Production - Standard Principles and Requirements
- Monitoring Distributed Systems: Case Studies from Google's SRE Teams
- Chaos Engineering: Building Confidence in System Behavior through Experiment
- What is SRE?
- 97 Things Every SRE Should Know
- Four Steps to Creating Effective Game Day Tests
- The Linux Programming Interface
- Real-World SRE
Monitoring & Observability & Alerting
- How to Monitor the SRE Golden Signals
- Site Reliability Engineering Best Practices for Data Pipelines
- A Working Theory-of-Monitoring
- The Evolution of Monitoring Systems at Google - Tony Rippy
- Monitoring without Infrastructure @ Airbnb
- Observability at Uber Engineering: Past, Present, Future
- The 4 Golden Signals of API Health and Performance in Cloud-Native Applications
- My Philosophy on Alerting by Rob Ewaschuk
- Time To Detect - Netflix
- Building Twitter’s Next-Gen Alerting System
- An introduction to monitoring and alerting with timeseries at scale, with Prometheus
- Detecting outliers and anomalies in realtime at Datadog
- Monitoring in a DevOps World
- Monitoring Your Monitoring’s Monitoring
- Observability: the new wave or buzzword?
- Principles of Monitoring Microservices
- GitOps Part 3 - Observability
- Applied Alerting Philosophy
- Observations on Observability
- Deploys: It's Not Actually About Fridays
- Elastic Observability in SRE and Incident Response
- Error Budget Policy - Part 1 - Adoption at Expedia Group
- Error Budget Policy - Part 2 - Practices at Expedia Group
On-Call
- SRE@Xero: Managing Incidents Part I
- SRE@Xero: Managing Incidents Part II
- Checklists: a stupidly simple but valuable operational gift
- Being an On-Call Engineer: A Google SRE Perspective
- Inside Atlassian: how our site reliability engineers do incident management
- Inside Atlassian: how IT & SRE use ChatOps to run incident management
- SysAdvent - Day 6 - No More On-Call Martyrs
- Automating Your Oncall: Open Sourcing Fossor and Ascii Etch
- Project STAR*: Streamlining Our On-Call Process
- How Your Systems Keep Running Day After Day - John Allspaw
- On Call Rotations: How Best to Wake Devs Up in the Middle of the Night
- Understanding The Role Of The Incident Manager On-Call (IMOC)
- 3 Ways to Minimize the Impact of High Severity Incidents
- Advice to Management Teams While Enrolling Changes to On-Call Systems
- Sustainable On-Call
- Incidents, fixes, and the day after
- How to write a status page update
- PagerDuty Incident Response Handbook
- Better On-Call the SRE way
- Managing Incidents at Monzo
- MTTR is dead, long live CIRT
- Incident insights from NASA, NTSB, and the CDC
- How to avoid On-Call Burnout the SRE Way
- My week shadowing a GitLab Site Reliability Engineer
- How our production team runs the weekly on-call handover
- Incident response, programs and you(r startup)
- An Incident Command Training Handbook
- Shrinking the time to mitigate production incidents
- Incident writeup as sociological storytelling
- Naming names in incident writeups
- Building On-Call Culture at GitHub
- How we (Monzo) respond to incidents
- How we’ve evolved on-call at Monzo
- Code Yellow: When Operations Isn’t Perfect
- Atlassian Incident Handbook
Post-Mortem
- Embracing Feedback
- A Tale of Postmortems
- SysAdvent - Day 1 - Why You Need a Postmortem Process
- Writing Your First Postmortem
- How to Write Great Outage Post-Mortems
- Postmortem Action Items: Plan the Work and Work the Plan
- Social Issues In Postmortems
- Google Has an Official Process in Place for Learning From Failure--and It's Absolutely Brilliant
- re:Work - Postmortem discussion template
- Post-mortems to the rescue
- "It's dead, Jim": How we write an incident postmortem
- Our incident postmortem template
- Learn out of mistakes. Postmortems to the rescue.
- Inhumanity of Root Cause Analysis
Capacity Planning
- How Back Market SREs prepared for Black Friday
- Capacity Planning
- SouthBay SRE: Cloud Capacity Planning
- Intent-based Capacity Planning and Autoscaling with Kubernetes
- How do you do Capacity Planning
Education
- Panel: Educating SRE
- From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams
- The Systems Engineering Side of Site Reliability Engineering
- So you want to be a Site Reliability Engineer?
- Spiraling Ops Debt & the SRE Coding Imperative
- So you want to be an SRE?
- Career Profiles/Site Reliability Engineer
- Incident Management Training: Wheel of Misfortune
- Site Un-Reliability Engineering [Video Series]
- The Ultimate Guide to Structuring a 90-Day Onboarding Plan
- How to Get Into SRE
- Do you have an SRE team yet? How to start and assess your journey
- How SRE teams are organized, and how to get started
- Why SRE Documents Matter
- Designing distributed systems using NALSD flashcards
Hiring
Reliability
- The Realities of the Job of Delivering Reliability
- Embracing Failure: Fault-Injection and Service Reliability
- 10 Years of Crashing Google
- SRE Hour: Tech Talks by Box & Yelp
- Simplicity: A Prerequisite for Reliability
- The Two Sides to Google Infrastructure for Everyone Else
- How Embracing Continuous Release Reduced Change Complexity
- Making "Push On Green" a Reality
- BeyondCorp: A New Approach to Enterprise Security
- Brainstorming Failure by Jeff Smith
- Dickerson's Hierarchy of Reliability
- The Morning Paper on Operability
- Resilience Engineering: Learning to Embrace Failure
- Scaling Reliability at Twitter: So You Want to Add a 9
- Principles Of Chaos Engineering
- How Google Backs Up The Internet Along With Exabytes Of Other Data
- Performance, Scalability, And High Availability: 3 Key Infrastructure Adaptability Requirements
- Part 1 - production-environment-at-google-part-2-610884268aaa
- Every Day Is Monday in Operations
- Under the Hood: Ensuring Site Reliability
- Designing reliable systems with cloud infrastructure (Google Cloud Next '17)
- A Google SRE explores GitHub reliability with BigQuery
- The Network is Reliable
- Are You Load Balancing Wrong?
- Google: A Collection Of Best Practices For Production Services
- Canary Analysis Service
- Progressive Service Architecture At Auth0
- Google Cloud Production Guideline
- production readiness
- Trust By Design: The Fusion of Operational Maturity and Risk Modeling
- PID Loops and the Art of Keeping Systems Stable
- Are you ready for production? - [Slides](https://speakerdeck.com/rakyll/are-you-ready-for-production)
- Production Checklist for Web Apps on Kubernetes
- Finding a problem at the bottom of the Google stack
- Rethinking Task Size in SRE
- How maintenance windows affect your error budget
- The Production Readiness Spectrum
- Generic mitigations
- How we’re building a production readiness review process at Grafana Labs
- Resiliency Planning for High-Traffic Events
- Using Fault Injection Testing to Improve DoorDash Reliability
- How we break things at Twitter: failure testing
- Push our limits - reliability testing at Twitter
- The infrastructure behind Twitter: efficiency and optimization
- The Infrastructure Behind Twitter: Scale
- How release canaries can save your bacon - CRE life lessons
- Know thy enemy: how to prioritize and communicate risks - CRE life lessons
- How to avoid a self-inflicted DDoS Attack - CRE life lessons
- CRE life lessons: What is a dark launch, and what does it do for me?
Service Level Agreement
- SysAdvent - Day 20 - How to set and monitor SLAs
- Service Levels and Error Budgets
- (Un)Reliability Budgets - Finding Balance between Innovation and Reliability
- No Grumpy Humans and Other Site Reliability Engineering Lessons from Google
- Service Level Objectives in Practice
- SRE Consensus Building
- Error Budget Calculator
- Understanding error budget overspend - part one - CRE life lessons
- Good housekeeping for error budgets - part two - CRE life lessons
- SRE fundamentals: SLIs, SLAs and SLOs
- Earning Our Wings: Stories and Findings From Operating a Large-scale Concourse Deployment
- How many nines is my storage system?
- Don't follow the sun.
- The Tyranny of the SLA
- Backblaze Durability is 99.999999999% — And Why It Doesn’t Matter
- How to Include Latency in SLO-Based Alerting
- Succeeding With Service Level Objectives
- Putting customers first with SLIs and SLOs
- SRE Leadership: Have Tiered SLAs
- How SLOs Enable Fast, Reliable Application Delivery
- The Tail at Scale
- The Tail at Scale Revisited
- Service Level Disagreements
- How We Use Sloth to do SLO Monitoring and Alerting with Prometheus
- SLI Deep Dive
- Measuring Reliability in GCP: Step By Step SLO creation guide using Cloud Operation Sandbox
- SLO tracker
- SLO Alerting for Mortals
- SRE methods and climate change
- What made SLOs so messy (and what we can do about it)
- SLICK: Adopting SLOs for improved reliability
- Calculating composite SLA
- Best practices for setting SLOs and SLIs for modern, complex systems
- Service Level Agreements in the Cloud: Who cares?
- Best practices to develop SLAs for cloud computing
- Building good SLOs - CRE life lessons
Performance
Programming
Misc Articles
- Site Reliability Engineers: "solving the most interesting problems"
- Notes on Site Reliability Engineering
- SREcon17: Brave new world of site reliability engineering
- Commentary on Site Reliability Engineering
- Injured on Vacation? Applying Principles from Site Reliability Engineering to a Travel Emergency
- Building blameless working environment
- SREs: The Happiest – and Highest Paid – in the Industry
- The Role of Site Reliability Engineering, Today and Tomorrow
- SRECon EMEA 2019 Recap
- Life of an SRE at Google - JC van Winkel
- Site Reliability Engineering for Native Mobile Apps - Abhijith Krishnappa - Case study: Halodoc adaptation of SRE principles for Native Mobile Apps
- SRE Best Practices by InfraCloud
Blogs
- rachelbythebay - Technical Blog Posts.
- Stephen Thorne's Blog - Blog Posts About SRE
- Increment - A digital magazine about how teams build and operate software systems at scale.
- GopherSRE - Blog Posts about Go and SRE.
- Squadcast Blog - Blog posts about SRE best practices, reliability, on-call and incident management.
- Rootly Blog - Incident management best practices and guides.
Newsletters
- SRE Weekly - Weekly Site Reliability Newsletter.
- ChaosEngineering.news - Chaos Engineering newsletter. All things Chaos Engineering, directly to your inbox!
- Monitoring Weekly - What's new in monitoring? Curated monitoring articles to your inbox each week.
- Observability news - Updates around observability (o11y) with a special focus on open source.
Conferences & Meetups
- SRECon Conferences - The Official SRE Conference.
- LISA Conferences - Prominent Conference About SysAdmin/DevOps/SRE.
- South Bay Site Reliability Engineering (Sunnyvale, CA) Meetup - A Group For Individuals Who Tackle Reliability Challenges For Web-Scale Systems.
- San Francisco Reliability Engineering - A Group Of People Who Are Passionate About Reliable, Performant Software Systems.
- Site Reliability Engineering Munich, Germany - SRE Meetup in the greater area of Oktoberfest city.
- ADDO - All Day DevOps - A 24 hour conference that is completely online and free.
- Site Reliability Engineering Paris, France - SRE Meetup in the city of light.
- Site Reliability Engineering India - SRE Meetup India.
Twitter
- Google SRE Twitter Account - Google's SRE Twitter Account.
- SREBook - The Official Twitter Account of Site Reliability Engineering Book.
- SREcon - SRECon's Official Twitter Account.
- SREWorkbook - The Official Twitter Account of Site Reliability Workbook.
- The SRE Dev - SRE-related Posts from [dev.to](https://dev.to).
- Twitter SRE - The Official Twitter Account of Twitter's SRE team.
- Twitter SRE Weekly - The Official Twitter Account of SRE Weekly Newsletter.
- USENIX Association - The Official USENIX Twitter Account.
Podcasts
Real-time Messaging
- #incident_response channel at Hangops Slack - Discussion about Incident Response.