SRE
Site reliability engineering (SRE) is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.
- GitHub: https://github.com/topics/sre
- Wikipedia: https://en.wikipedia.org/wiki/Site_reliability_engineering
- Aliases: site-reliability-engineering
- Last updated: 2026-06-25 00:25:51 UTC
https://github.com/qainsights/performance-engineers-devops
This repository helps performance testers and engineers who want to dive into the DevOps and SRE world.
aws-devops chaos chaos-engineering devops docker engineering engineers kubernetes linux microsoft performance performance-engineers-devops rancher roadmap site-reliability-engineering sre testing
Last synced: 05 May 2025
https://github.com/adhorn/aws-fis-templates-cdk
Collection of AWS Fault Injection Simulator (FIS) experiment templates deployable via the AWS CDK
amazon-web-services automation aws aws-fis cdk-examples cdk-library chaos-engineering chaos-testing devops-tools sre testing
Last synced: 16 May 2025
https://github.com/bjarneo/rip
Rest in peace(s) - HTTP/UDP load testing tool
ddos go golang http learning-by-doing load-testing rip security-tools sre sre-infra udp-flood
Last synced: 25 Apr 2025
https://github.com/ari-hacks/command-line-cheat-sheet
📝 A place to quickly look up commands (bash, vim, git, AWS, Docker, Terraform, Ansible, kubectl)
ansible aws bash command-line devops docker git k8s kubectl kubernetes sre terraform vim
Last synced: 10 Apr 2026
https://github.com/alexkroman/ollychat
Create custom DevOps AI agents that understand and manage your infrastructure.
agent agents ai ai-agent-framework ai-agents ai-agents-framework llm observability observability-data prometheus sre
Last synced: 16 May 2025
https://github.com/tedilabs/terraform-aws-account
🌳 A sustainable Terraform Package which creates Account & IAM resources on AWS
aws aws-iam devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 13 Feb 2026
https://github.com/microsoft/sqlcallstackresolver
A sample tool for users of Microsoft SQL Server to aid in troubleshooting otherwise difficult to diagnose issues. Provided AS-IS - see SUPPORT.md.
azuresql azuresqldb azuresqlmanagedinstance callstack debugging debugging-symbol msdia140 pdb pdb-files sqlserver sqlserver-2017 sqlserver-2019 sqlserver-2022 sre symbols tool xevent xevents
Last synced: 07 May 2025
https://github.com/blacklane/kiev
A set of tools to do distributed logging for Ruby web applications
distributed-tracing elk-stack logging ruby sre
Last synced: 04 Apr 2025
https://github.com/loftwah/loftwahs-cheatsheet
My own personal tech cheatsheet. This covers the stuff I use quite regularly.
bash devops hacktoberfest linux nodejs python sre typescript
Last synced: 20 Jun 2025
https://github.com/huseynovvusal/blamebot
AI on-call agent that detects deploy failures, explains what broke, pages the responsible team, and rolls back automatically.
ai-agent devops hackathon incident-management nextjs postmortem redis slack-bot sre upstash vercel
Last synced: 09 May 2026
https://github.com/tedilabs/terraform-aws-container
🌳 A sustainable Terraform Package which creates resources for Container Services on AWS
aws aws-ecr aws-eks devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules type-module
Last synced: 23 Feb 2026
https://github.com/k8sgpt-ai/docs
Documentation for K8sGPT
ai chatgpt docs kubernetes sre
Last synced: 06 Apr 2025
https://github.com/nobl9/sloctl
A command line tool to cast SLO spells 🪄
cli go golang nobl9 reliability slo sre
Last synced: 27 Feb 2026
https://github.com/getyourguide/istio-config-validator
go121 istio sre validation virtualservice
Last synced: 11 Apr 2025
https://github.com/FluidifyAI/Regen
Open-source incident management: alerts, on-call, AI post-mortems. Self-hosted alternative to PagerDuty & incident.io. Works with Prometheus, Grafana, Datadog, Slack, and Teams. Free forever, BYO-AI.
ai alerting devops grafana incident-management observability on-call open-source pagerduty-alternative prometheus self-hosted slack sre
Last synced: 28 May 2026
https://github.com/ory/jobs
Want to build the next generation identity stack? You've come to the right place!
go hiring jobs kubernetes open-source opensource ory react sre
Last synced: 17 Mar 2025
https://github.com/icco/postmortems
Postmortem metadata from danluu/post-mortems.
hacktoberfest postmortem-metadata sre
Last synced: 21 Mar 2025
https://github.com/sitectl/cuttle
Blue Box SRE Operations Platform
ansible bastion bluebox elk operations sensu sre
Last synced: 11 Apr 2025
https://github.com/ramizpolic/sre-playground
A set of Site Reliability Engineering notes & challenges
cicd cloud guide infrastructure site-reliability-engineer sre tasks
Last synced: 14 Apr 2025
https://github.com/seveas/herd
Massively parallel ssh client
cli orchestration sre ssh sysadmin system-administration
Last synced: 25 Jun 2025
https://github.com/fkie-cad/logprep
Log data preprocessing, generation, and shipping in Python
etl kafka log logdata loggenerator logshipper opensearch preprocessing python soar sre
Last synced: 02 Mar 2026
https://github.com/alexewerlof/slc
A simple service level calculator
error-budget servicelevels sla sli slo sre
Last synced: 03 Apr 2026
https://github.com/last9/openmetrics-registry
Do more with your metrics
exporter hcl modules open-metrics openmetrics prometheus registry sre
Last synced: 15 May 2026
https://github.com/enola-dev/enola
Enola 🕵🏾♀️ Holmes was an SRE.
graph graphviz mermaid modeling rdf semantic-web sre visualization
Last synced: 16 Jun 2025
https://github.com/bobek/masscan_as_a_service
masscan as a service
audit bare-metal cloud containers git-scraping masscan phabricator security security-scanner security-tools sre
Last synced: 25 Jan 2026
https://github.com/lwindolf/multi-status
Aggregator PWA for status pages of online services. Know which of your 3rd party SaaS/PaaS are having issues right now.
cloud devops monitoring paas pwa saas sre
Last synced: 11 Apr 2025
https://github.com/keycloak/keycloak-sre-sig
Keycloak's Site Reliability Engineers Special Interest Group (Keycloak SRE SIG): To improve the lives of people running and operating Keycloak
Last synced: 12 Apr 2025
https://github.com/dxcfg/dxcfg
Configuration as code for the masses
configuration deno denoland deployment-automation devops iaac jkcfg kubernetes kubernetes-deployment sre
Last synced: 24 Jan 2026
https://github.com/tedilabs/terraform-aws-network
🌳 A sustainable Terraform Package which creates VPC resources (VPC, Subnet, NACL, NAT Gateway, Route Table) on AWS
aws aws-vpc devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 15 Apr 2025
https://github.com/ntt-dkiku/chaos-eater
An LLM-based system that fully automates Chaos Engineering (ASE 2025, NIER track)
aiops chaos-engineering chaos-mesh k6 k8s kubernetes langchain large-language-models llm-agents llm-application llm-apps microservices software-engineering sre
Last synced: 02 Apr 2026
https://github.com/nobl9/terraform-provider-nobl9
Terraform provider for Nobl9
google-slo metrics monitoring nobl9 observability openslo reliability reliability-engineering slo sre
Last synced: 01 Apr 2026
https://github.com/stacksimplify/terraform-on-google-kubernetes-engine
GCP GKE Terraform on Google Kubernetes Engine with DevOps, SRE 40 Real-World Demos
devops gateway-api gcp gke gke-cluster gke-terraform google google-cloud google-cloud-platform google-kubernetes google-kubernetes-engine iac kubernetes-cluster kubernetes-deployment kubernetes-service kubernetes-service-account sre terraform
Last synced: 21 Apr 2025
https://github.com/immobiliare/collectd-haproxy-plugin
Collectd plugin to pull metrics from HAProxy instances
collectd collectd-plugin grafana haproxy metrics monitoring sre
Last synced: 01 Apr 2025
https://github.com/kjkuan/Runbook.md
Write Bash executable runbooks in Markdown.
bash devops devops-tools literate-programming markdown operations ops playbook runbook sre sre-automation task-runner
Last synced: 01 May 2025
https://github.com/dynatrace-oss/customersuccess
Open source solutions that help you level up your observability game with Dynatrace.
adoption ai automation dashboards dynatrace intelligence notebooks observability obsolescence software sre value workflows
Last synced: 07 Jan 2026
https://github.com/grafana/xk6-chaos
xk6 extension for running chaos experiments with k6 💣
chaos chaos-engineering k6-extension reliability sre testing xk6
Last synced: 01 Oct 2025
https://github.com/tedilabs/terraform-aws-load-balancer
🌳 A sustainable Terraform Package which creates resources for Load Balancers on AWS
aws aws-alb aws-clb aws-elb aws-load-balancer aws-nlb devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 14 Mar 2026
https://github.com/ghostinthewires/Team-Handbook-Template
An employee / team handbook template
devops devops-team devops-tools devopsteam handbook process sre team-devops template
Last synced: 10 May 2025
https://github.com/operate-first/operations
The sig-operations repository.
site-reliability-engineering sre
Last synced: 16 Jan 2026
https://github.com/k8sgpt-ai/community
Community Management for K8sGPT
devops kubernetes openai sre tooling
Last synced: 15 Apr 2025
https://github.com/nathanielvarona/pritunl-client-github-action
Establish automated secure Pritunl VPN connections with Pritunl Client in GitHub Actions, supporting OpenVPN and WireGuard.
cicd devops github-actions hacktoberfest openvpn pritunl pritunl-vpn sre vpn-client vpn-server wireguard
Last synced: 10 Mar 2026
https://github.com/ocheops/paths-in-tech
For People finding it hard in tech
career-development career-guide career-path careercoach ceh computer-science devops product sre ui ui-design ux-design
Last synced: 18 Mar 2026
https://github.com/be-next/awesome-performance-engineering
A curated, opinionated collection of tools and resources dedicated to Performance Engineering, covering both Observability and Performance Testing.
awesome awesome-list devops load-testing monitoring observability performance performance-engineering performance-testing sre
Last synced: 08 Mar 2026
https://github.com/build5nines/terraform-quickstart-templates
Terraform Quickstart Templates
azure cloud devops hashicorp hcl iac microsoft microsoft-azure microsoftazure sre terraform terraform-modules terraform-templates
Last synced: 11 Apr 2025
https://github.com/googlecloudplatform/reliable-app-platforms
An MVP of a platform for delivering reliable applications on Google Cloud
gke google-cloud kubernetes reliability slos sre terraform
Last synced: 20 Oct 2025
https://github.com/microsoft/tdslib
Open implementation of the TDS protocol (version 7.4) in managed C# code.
Last synced: 17 Aug 2025
https://github.com/devopsext/sre
Golang SRE framework for logs, metrics, traces and events. It supports: Jaeger, Prometheus, DataDog, Opentelemetry, NewRelic, Grafana
events logs metrics observability sre traces
Last synced: 12 Jan 2026
https://github.com/tedilabs/terraform-aws-security
🌳 A sustainable Terraform Package which creates Security resources on AWS
aws aws-access-analyzer aws-config devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 25 Feb 2026
https://github.com/tedilabs/terraform-aws-domain
🌳 A sustainable Terraform Package which creates resources for Domain Services on AWS
aws aws-route53 devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 15 Apr 2025
https://github.com/diogopms/monit-docker
Monit is a free, open source utility for managing and monitoring processes, programs, files, directories, and filesystems on a UNIX system. Monit conducts automatic maintenance and repair and can execute meaningful causal actions in error situations.
devops docker kubernetes monit monitoring sre status
Last synced: 25 Oct 2025
https://github.com/rootlyhq/terraform-provider-rootly
Terraform provider for Rootly - manage incident management, on-call schedules, workflows, and alerts as code
devops go golang hashicorp iac incident-management incident-response infrastructure-as-code on-call rootly site-reliability-engineering sre terraform terraform-provider
Last synced: 11 Mar 2026
https://github.com/wpjunior/multi-burn-rate-calculator
Calculator to view detection time using error budget consumption rates, based on lessons from the Site Reliability Engineering Workbook
Last synced: 17 Mar 2026
https://github.com/aptible/unpage
Unpage is the open source framework for building SRE agents with infrastructure context and secure access to any dev tool.
agent agentic-workflow agents ai-agent ai-sre aiops automation devops dspy incident-response incident-response-tooling mcp monitoring observability site-reliability-engineering sre sre-agent
Last synced: 08 Sep 2025
https://github.com/angelopoerio/oom-notifier
Notify about OOM-killed processes, reporting the full command line
devops kubernetes linux observability rust site-reliability-engineering sre
Last synced: 17 Jan 2026
https://github.com/Ramsbaby/openclaw-self-healing
AI-powered self-healing system for OpenClaw Gateway • 4-tier autonomous recovery • macOS & Linux
ai-agent artificial-intelligence automation bash claude-ai claude-code crash-recovery devops homelab launchd macos monitoring observability openclaw reliability self-healing sre watchdog
Last synced: 19 Feb 2026
https://github.com/dpogorzelski/speedrun
Control your compute fleet at scale
automation cli cloud command-execution devops gcp go google-cloud sre sysadmin
Last synced: 12 Mar 2026
https://github.com/qainsights/performance-engineers-clubhouse
Join Performance Engineers Clubhouse 🏡
clubhouse devops performance performance-monitoring performance-testing sre testing
Last synced: 08 Jan 2026
https://github.com/lucasepe/kctx
A Kubernetes context engine for humans and AI agents.
agent-workflows cloud-native context-engineering developer-experience golang kubernetes platform-engineering sre
Last synced: 24 Jun 2026
https://github.com/luan78zaoha/kaldi-timit-sre-ivector
Develop speaker recognition model based on i-vector using TIMIT database
chinese i-vector kaldi speaker-recognition speaker-verification sre
Last synced: 11 Mar 2025
https://github.com/tedilabs/terraform-aws-data
🌳 A sustainable Terraform Package which creates resources for Data Services on AWS
aws aws-athena devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 03 Oct 2025
https://github.com/last9/last9-integrations
Sample applications of supported integrations by Last9 Products
integrations last9 reliability-engineering sre timeseries-database
Last synced: 28 Apr 2025
https://github.com/runvoy/runvoy
Serverless command runner
admin-tool cli cloudcomputing containers devops fargate golang serverless sre terraform
Last synced: 08 Mar 2026
https://github.com/tedilabs/k8s-repository
♻️ Repository for Reusable Kubernetes App Manifests with Kustomize
devops gitops hacktoberfest k8s kubernetes kustomize lang-yaml sre tedilabs
Last synced: 19 Oct 2025
https://github.com/nudgebee/nudgebee
AI-driven incident management and observability for Kubernetes, AWS, Azure, and GCP — LLM-powered RCA, runbook automation, and cost optimization.
agentic ai-agents aiops aws azure cost-optimization devops finops gcp golang helm incident-management kubernetes llm multi-cloud nextjs observability root-cause-analysis runbook-automation sre
Last synced: 11 Jun 2026
https://github.com/rsionnach/nthlayer
Generate the complete reliability stack from a service spec in 5 minutes. Dashboards, alerts, SLOs, PagerDuty - zero toil.
alerts devops grafana monitoring observability pagerduty prometheus python slo sre
Last synced: 18 Jan 2026
https://github.com/dingus-technology/dingus
Identify and squash bugs in your code with Dingus!
ai bugs deployment devops docker grafana infrastructure k8s kubernetes llm logging loki metrics monitoring openai prometheus python sre
Last synced: 12 Apr 2026
https://github.com/dkorunic/axfr2hosts
Fetches one or more DNS zones via AXFR and dumps in Unix hosts format for local use
bind bind9 bind9-dns dns dns-server domain linux networking security sre sysops unix zone
Last synced: 12 Apr 2025
https://github.com/bjarneo/gecho
Gecho - an HTTP request echo debugging service
debugging devops echo golang http http-server request sre
Last synced: 25 Apr 2025
https://github.com/tedilabs/terraform-aws-db
🌳 A sustainable Terraform Package which creates resources for Databases on AWS
aws aws-db aws-elasticache aws-rds devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 15 Apr 2025
https://github.com/shantoroy/site-reliability-engineering-101
This GitHub repository contains a comprehensive tutorial on Site Reliability Engineering (SRE), covering topics such as SLAs, SLOs, SLIs, Chaos Engineering, monitoring, alerting, and much more. It also includes bonus content on SRE best practices. Follow along with the #100daysofSRE challenge and improve your reliability engineering skills.
100daysofcode alerting automation chaos-engineering devops devsecops monitoring reliability-engineering service-level-agreement service-level-indicator service-level-objective site-reliability-engineering sre
Last synced: 27 Mar 2026
https://github.com/todd-dsm/mac-ops
QnD Automation to build a MacBook Pro for DevOps
customizable devops devops-tools macbook-configuration macbook-setup macos sre
Last synced: 13 Apr 2025
https://github.com/chatwoot/faultline
An open-source AI agent for infrastructure debugging.
Last synced: 24 Feb 2026
https://github.com/tedilabs/terraform-aws-observability
🌳 A sustainable Terraform Package which creates resources for Observability Services on AWS
aws aws-cloudawtch-logs aws-cloudwatch aws-logs devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules type-module
Last synced: 02 Mar 2026
https://github.com/fault-project/fault-cli
Build Exciting Applications Your Users Can Rely On
chaos-engineering reliability-engineering resilience-engineering sre
Last synced: 29 May 2026
https://github.com/Lethe044/hermes-incident-commander
Autonomous SRE agent built on Hermes - detects, heals, and learns from production incidents. Uses Memory + Skills + Cron + Gateway + Subagents + Atropos RL.
atropos autonomous-agents devops hermes-agent incident-response llm-agent nous-research sre
Last synced: 05 May 2026
https://github.com/xe-nvdk/terraform-recipes
This is the repo where I save #Terraform recipes, mostly posted on cduser.com
devops iaac infrastructure-as-code sre terraform
Last synced: 11 Apr 2025
https://github.com/gopatchy/bkl
Layered Configuration Language
configuration deployment devops json k8s kubernetes sre toml yaml
Last synced: 17 Jan 2026
https://github.com/avivl/cloud-sre-agent
An autonomous SRE agent that monitors cloud logs across multiple platforms, leveraging AI models from various providers to detect anomalies, perform root cause analysis, and automate remediation by creating GitHub Pull Requests.
ai-agents ai-ops automation aws cloud devops gcp gemini-ai google-cloud incident-response llm log-analysis log-monitoring platform-engineering python resilience sre vertex-ai
Last synced: 09 Mar 2026
https://github.com/tedilabs/terraform-aws-firewall
🌳 A sustainable Terraform Package which creates resources for Firewall Services on AWS
aws aws-firewall aws-waf devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 21 Jan 2026
https://github.com/input-output-hk/devshell-capsules
Space Capsules for the Modern DevShell
Last synced: 13 Oct 2025
https://github.com/christiangalsterer/mongodb-driver-prometheus-exporter
A prometheus exporter exposing metrics for the official MongoDB Node.js driver.
grafana grafana-dashboard metrics mongodb monitoring node-js nodejs prometheus prometheus-exporter sre typescript
Last synced: 15 Mar 2026
https://github.com/last9/last9-cdk
Last9 CDK
observability prometheus prometheus-metrics python sre
Last synced: 28 Apr 2025
https://github.com/woodprogrammer/postgresql-connection-manager
This is a project to manage PostgreSQL connections via cgroup v2
cgroups devops pg postgresql sre
Last synced: 28 Apr 2025
https://github.com/tedilabs/terraform-aws-secret
🌳 A sustainable Terraform Package which creates Secret resources on AWS
aws aws-kms aws-parameter-store aws-secrets-manager aws-ssm-parameter-store devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 03 Aug 2025
https://github.com/apiaryio/ivy
A Node.js queue library focused on easy, yet flexible task execution.
Last synced: 30 Jul 2025
https://github.com/fluxninja/aperture-go
SDK to interact with Aperture Agent
concurrency-limiter flow-control rate-limiter sdk sre
Last synced: 14 Oct 2025
https://github.com/tedilabs/terraform-aws-vpc-connectivity
🌳 A sustainable Terraform Package which creates VPC Connectivity resources (Private Link, Client VPN, Site-to-Site VPN, DX, VPC Lattice) on AWS
aws aws-client-vpn aws-direct-connect aws-dx aws-site-to-site-vpn aws-vpc aws-vpc-lattice aws-vpc-private-link aws-vpn devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 24 Oct 2025
https://github.com/antolius/deployments-and-disasters
A tabletop RPG for practicing incident management.
Last synced: 05 May 2025