SRE
Site reliability engineering (SRE) is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.
- GitHub: https://github.com/topics/sre
- Wikipedia: https://en.wikipedia.org/wiki/Site_reliability_engineering
- Aliases: site-reliability-engineering
- Last updated: 2026-06-25 00:25:51 UTC
https://github.com/tedilabs/.github
📣 Default community health files for @tedilabs organization on GitHub
devops github hacktoberfest sre tedilabs
Last synced: 15 Apr 2025
https://github.com/tedilabs/github
♥️ The best way to manage GitHub organization in @tedilabs
devops github github-organization github-repository github-team hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-github terraform-module terraform-modules
Last synced: 15 Apr 2025
https://github.com/madetech/productionisation
The Made Tech Productionisation Checklist for Software Projects
Last synced: 12 Apr 2025
https://github.com/tedilabs/terraform-aws-misc
🌳 A sustainable Terraform Package which creates MISC resources on AWS
aws devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 28 Oct 2025
https://github.com/apiaryio/docker-base-images
Base docker images for Apiary applications
Last synced: 26 Jun 2025
https://github.com/misterzurg/tbank-sre
🏦 TBank backend academy SRE course.
helm kubernetes kubernetes-operator minikube oncall oncall-prober oncall-sla sre t-bank
Last synced: 14 Oct 2025
https://github.com/linhng98/mess-around
playground to demonstrate many awesome devops tools, enforce gitops pattern, build scalable and sustainable application cluster
devops homelab kubernetes mess-around sre
Last synced: 17 Jan 2026
https://github.com/tedilabs/terraform-aws-cloudfront
🌳 A sustainable Terraform Package which creates CloudFront resources on AWS
aws aws-cloudfront aws-network devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 15 Apr 2025
https://github.com/arun0009/pulse
Batteries-included Spring Boot observability starter. Cardinality firewall, timeout-budget propagation, SLO-as-code, async context, PII masking, error fingerprints - one dependency, zero agents
cardinality distributed-tracing micrometer observability opentelemetry slo spring-boot-starter sre structured-logging
Last synced: 30 May 2026
https://github.com/hom3chuk/psqlrc-helpers
A pack of psql helper commands for maintaining PostgreSQL
cheatsheet dba devops devops-tools maintenance performance performance-analysis postgres postgresql psql psql-client psqlrc sre
Last synced: 27 Oct 2025
https://github.com/tedilabs/terraform-aws-ml
🌳 A sustainable Terraform Package which creates Machine Learning resources on AWS
aws devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 16 Feb 2026
https://github.com/hamidgholami/k8s-lab
Kubernetes Laboratory
cncf devops devops-tools k3s k3s-architecture k3s-cluster k3s-minicluster k8s k8s-cluster k8s-learn kubernetes kubernetes-cluster kubernetes-labs kubernetes-learning sre
Last synced: 01 May 2025
https://github.com/diptochakrabarty/learn_devops_with_projects
Learn DevOps through practical projects. Covers tech stacks including k8s, Ansible, Docker, Python, and more.
ansible devops golang hacktoberfest kubernetes python sre
Last synced: 13 Jun 2025
https://github.com/k-krew/omen
A lightweight, declarative chaos engineering operator for Kubernetes
chaos-engineering chaos-testing controller-runtime fault-injection golang helm-chart kubebuilder kubernetes kubernetes-operator reliability sre
Last synced: 26 Apr 2026
https://github.com/powerhome/keess
Keep secrets and configmaps synchronized across clusters and namespaces
Last synced: 04 Mar 2026
https://github.com/ajinkyakadam/systemhealthai
An AI SRE for triaging system health
agents ai aiops devops devops-tools llm llmops mlops observability sre
Last synced: 03 Nov 2025
https://github.com/christiangalsterer/node-postgres-prometheus-exporter
A prometheus exporter for node-postgres
grafana grafana-dashboards metrics monitoring node-js node-postgres nodejs pg postgres postgresql prometheus prometheus-exporter sre
Last synced: 10 Apr 2025
https://github.com/ohmydevops/devops-culture-or-tools
Slides for the talk "DevOps: Culture or Tools?" given at the 2nd programmers' meetup at the Mashhad Innovation Factory
agile devops devops-handbook devopsdays sre
Last synced: 18 Feb 2026
https://github.com/kubeshop/fuse-releases
Platform Engineering Copilot. AI-powered expertise with deep domain knowledge at your fingertips.
ai cli copilot devops platform-engineering sre
Last synced: 24 Jan 2026
https://github.com/skyzyx/engineering-for-site-reliability
Overall map of topics to cover for my “Engineering for Site Reliability” blog series.
ci-cd cicd devops docker security site-reliability site-reliability-engineering sre terraform
Last synced: 25 Mar 2025
https://github.com/christiangalsterer/kafkajs-prometheus-exporter
A prometheus exporter exposing metrics for KafkaJS
grafana grafana-dashboard kafka kafkajs metrics monitoring node-js nodejs prometheus prometheus-exporter sre
Last synced: 28 Jul 2025
https://github.com/samgiles/health
Async healthchecks for Golang applications supporting liveness and readiness checks
golang healthcheck k8s kubernetes microservice sre
Last synced: 14 Jan 2026
https://github.com/tedilabs/terraform-aws-organization
🌳 A sustainable Terraform Package to manage Organization resources on AWS
aws aws-organization aws-ram aws-sso devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules type-module
Last synced: 08 Apr 2026
https://github.com/guilt/chaossquirrel
Like Netflix's Chaos Monkey, packaged to run standalone.
chaos-monkey reliability-engineering sre
Last synced: 12 Aug 2025
https://github.com/meysam81/prometheus-command-timer
Run any command, time its execution and push the metrics to Prometheus Pushgateway
cli-tool command-line command-wrapper container-metrics cross-platform devops docker execution-time golang job-monitoring kubernetes metrics metrics-collector monitoring observability performance-monitoring prometheus pushgateway sre time-tracking
Last synced: 17 Apr 2026
https://github.com/prologic/prologic
Hiya 👋 I'm James Mills, a Senior SRE / DevOps, formerly Software Engineer, and enthusiastic Gopher (Golang) programmer. I love open source and contributing back; unfortunately, recent events have led me to self-host more of my own projects and data. Please read on! 🙇♂️
devops golang open-source software software-engineering sre
Last synced: 13 Apr 2025
https://github.com/lawouach/ebpf-2021-talk
Code for my talk at the eBPF 2021 conference
devops ebpf reliability reliably sre
Last synced: 12 Apr 2025
https://github.com/tedilabs/terraform-github-modules
🌳 A sustainable Terraform Package which manages all things on GitHub
devops github hacktoberfest hcl2 lang-hcl sre tedilabs terraform terraform-module terraform-modules type-module
Last synced: 05 Apr 2026
https://github.com/thotischner/observability-mcp
Unified observability gateway for AI agents — one MCP server for Prometheus, Loki, and any backend, with cross-signal anomaly detection and a built-in Web UI.
ai-agents anomaly-detection anthropic claude gateway helm kubernetes llm loki mcp mcp-server model-context-protocol monitoring observability prometheus sre
Last synced: 11 Jun 2026
https://github.com/certwatch-app/cw-agent
SSL/TLS certificate monitoring agent for Kubernetes and on-prem infrastructure. Scan certificates, detect expiration, validate chains, and sync to CertWatch cloud.
certificate cli cloud-native devops golang kubernetes monitoring security sre ssl tls
Last synced: 13 Jan 2026
https://github.com/johndeere/work-tracker
Observe and protect your Java web application.
elasticsearch java java11 java8 mdc metadata observability spring spring-boot sre
Last synced: 04 Oct 2025
https://github.com/mstryoda/sre-ai-agent
An autonomous Kubernetes troubleshooting and healing agent powered by AI Agents and LLMs
agent ai kubernetes llm python sre troubleshooting
Last synced: 13 Oct 2025
https://github.com/persys-dev/persys-cloud
Community Driven Cloud Automation :)
automation cloud cluster golang kubernetes pipelines platform sre
Last synced: 08 Apr 2026
https://github.com/tedilabs/terraform-aws-messaging
🌳 A sustainable Terraform Package which creates resources for Messaging Services (EventBridge, MSK, SNS, SQS) on AWS
aws aws-eventbridge aws-msk aws-sns aws-sqs devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 19 Sep 2025
https://github.com/tedilabs/terraform-aws-ec2
🌳 A sustainable Terraform Package which creates resources for EC2 Services on AWS
aws aws-ec2 devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules type-module
Last synced: 27 Feb 2026
https://github.com/aligoren/sre-book-tr
Turkish translation of the Google SRE book. Prepared to bring Site Reliability Engineering principles and practices to the Turkish-speaking technical community.
site-reliability-engineering sre turkish
Last synced: 07 Feb 2026
https://github.com/tedilabs/terraform-aws-lambda
🌳 A sustainable Terraform Package which creates Lambda & Step Functions resources on AWS
aws aws-lambda aws-sfn aws-step-functions devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 08 Mar 2026
https://github.com/spechtlabs/tka
Zero-friction Kubernetes access using Tailscale and ephemeral service accounts
access-control authentication kubernetes sre tailscale
Last synced: 18 May 2026
https://github.com/beztebya666/k8s-view
Fast, self-hosted Kubernetes web UI for multi-cluster ops — stream pod logs, exec, port-forward, edit YAML with live diff, rollout history & one-click rollback, Prometheus + metrics-server charts. Single Go binary, no agents, scales to 150k+ objects per cluster.
data-visualization devops devops-tools docker k8s k8s-cluster k8s-dashboard k8s-ui k8s-view kubernetes kubernetes-dashboard kubernetes-ui kubernetes-view linux monitoring sre
Last synced: 01 Jun 2026
https://github.com/projecthelena/warden
Open-source uptime monitoring built in Go. Multi-zone checks, status pages, unlimited team members — the production-grade upgrade from Uptime Kuma.
devops docker go golang monitoring open-source self-hosted sre status-page uptime uptime-kuma-alternative uptime-monitor
Last synced: 02 Apr 2026
https://github.com/mattermost/ponos
A ChatOps SRE toil elimination tool
chatops sre sre-team toil-elimination
Last synced: 14 Jan 2026
https://github.com/tedilabs/terraform-aws-ipam
🌳 A sustainable Terraform Package which creates IPAM resources (IPAM, Elastic IP, Prefix List) on AWS
aws aws-eip aws-elastic-ip aws-ipam aws-prefix-list aws-vpc devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 02 Aug 2025
https://github.com/jayvdb/sre-tools
Helpers for sre_parse, transforming regexes
python3 regex regular-expressions sre sre-parse
Last synced: 19 Aug 2025
https://github.com/timothystiles/buster
A Go Package for running Go package CICD pipelines in Go...
cd ci ci-cd cicd continous-deployment continous-integration devops github-actions go golang infrastructure platform-engineering sre
Last synced: 10 Apr 2025
https://github.com/vacovsky/poolse
Control health checks and toggle upstream node status in load balancers with ease.
application-monitoring devops devops-tools f5-health-monitor go golang health-check healthcheck load-balancer nginx-proxy proxy site-reliability-engineering sre
Last synced: 26 Mar 2025
https://github.com/k8sgpt-ai/charts
Helm Charts for K8sGPT
devops kubernetes openai sre tooling
Last synced: 11 Mar 2026
https://github.com/priyanshujain/infragpt
InfraGPT is an AI SRE Copilot for the Cloud that provides infrastructure management agents through Slack integration. The system consists of multiple services that work together to deliver intelligent DevOps workflows.
artificial-intelligence google-cloud-platform infragpt infrastructure sre terraform
Last synced: 28 Jun 2025
https://github.com/ctsrc/mdrun
Runs command-line pipelines embedded in Markdown and CommonMark documents. Keeps your authored docs up to date. Even usable as an alternative to IPython notebooks.
abstract-syntax-tree ast commonmark computer-science data-science devops iac infrastructure-as-code markdown qa quality-assurance software-development software-engineering sre technical-writing writing-tool
Last synced: 02 Aug 2025
https://github.com/ayazhankadessova/grafana-prometheus
Prometheus-based Grafana dashboard featuring a latency chart, CPU usage gauge, request rates table, and node host metrics, using PromLabs' public server.
grafana monitoring prometheus sre
Last synced: 22 Jul 2025
https://github.com/gorillati/guias
Installation and configuration guides for servers and services in a data center, based on open source technologies and free software.
configuration-management cpd datacenter free-software guias linux-server manuals open-source server services sre sre-infra sysadmin system-administration
Last synced: 11 Mar 2025
https://github.com/fusakla/coordinator
Tool to coordinate on-call, incident and maintenance management
alerting communication coordination dashboard devops oncall sre
Last synced: 22 Apr 2026
https://github.com/abhishekpanda0620/eol-check
A CLI tool to check the End-Of-Life (EOL) status of your development environment and project dependencies.
cli-tool devops end-of-life eol nodejs security sre typescript
Last synced: 13 Jan 2026
https://github.com/tedilabs/terraform-aws-cost
🌳 A sustainable Terraform Package which creates resources for Cost on AWS
aws aws-billing aws-budget aws-cost aws-cur devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 16 Jun 2025
https://github.com/dynatrace/obslab-release-validation
Use Grafana k6, Dynatrace business events, workflows and site reliability guardian to validate software releases
automation demo dynatrace grafana-k6 k6 load-testing obslab openfeature release-validation site-reliability-engineering site-reliability-guardian sre workflow
Last synced: 11 Jul 2025
https://github.com/dknauss/wordpress-runbook-template
WordPress operations runbook template: production procedures for deployment, maintenance, backup, incident response, and recovery.
incident-response operations runbook sre wordpress wordpress-security
Last synced: 01 Apr 2026
https://github.com/purbon/kafka-interesting-stories
Compilation of public incident/interesting/horror stories related to Kafka operations
incidents kafka post-mortem production-engineering sre
Last synced: 18 Mar 2026
https://github.com/richwrd/postgres-ha-cluster-lab
Bachelor's Thesis project focusing on the quantitative analysis and implementation of a PostgreSQL HA cluster with Patroni, etcd, and Pgpool-II.
database-administration disaster-recovery docker docker-compose etcd failover high-availability iac metrics open-source patroni pgpool-ii postgresql sre streaming-replication
Last synced: 24 May 2026
https://github.com/johndeere/outstanding
A Java concurrent collection for in-progress work
java java-collections java8 sre
Last synced: 14 Jan 2026
https://github.com/fguisso/sre-checker
Simple server status checker
monitoring monitoring-tool server-status sre sre-checker
Last synced: 07 Mar 2026
https://github.com/amaurybsouza/devops-deep-dive
DevOps week of the Linux Tips channel - Ansible, Kubernetes, Docker and AWS.
ansible automation aws bash cloud devops devops-tools linux playbook sre terraform
Last synced: 02 Apr 2026
https://github.com/sourcehawk/triagent
An agent driven incident investigation platform
agentic incident-analysis incident-investigation investigation-tool sre
Last synced: 18 Jun 2026
https://github.com/shmokmt/tfhk
Terraform Housekeeper. A utility tool that removes blocks left over from refactoring, such as moved blocks.
devops iac infrastructure-as-code sre terraform
Last synced: 15 May 2026
https://github.com/eabykov/sre
Reliability is not the absence of failures. It is the ability of a system, a team, and a person to get back up together after a fall, to rethink, rebuild, and move on, with new rules of the game in which human vulnerability is not a threat but part of the equation.
chaos-testing error-budget incident monitoring mttd mttm mttr postmortem reliability sla sli slo sre stamp
Last synced: 19 Jan 2026
https://github.com/amaurybsouza/aws-solutions-architect-associate
AWS Certified Solutions Architect - Associate (SAA-C02) Exam Notes
aws aws-ec2 aws-lambda certificate devops engeneering infrastructure-as-code solutions-architect sre tech terraform
Last synced: 01 Apr 2025
https://github.com/tedilabs/terraform-okta-modules
🌳 A sustainable Terraform Package which manages all things on Okta
devops hacktoberfest hcl2 iac lang-hcl okta sre tedilabs terraform terraform-module terraform-modules terraform-okta
Last synced: 29 Jan 2026
https://github.com/russlank/backup-cleanup
A lightweight SQL Server backup cleanup utility that safely removes expired FULL, DIFF, and LOG backup files according to configurable Grandfather-Father-Son (GFS) retention rules.
backup backup-cleanup backup-retention cli devops gfs golang linux mssql sql-server sre windows
Last synced: 24 May 2026
https://github.com/alivzh/rahbia-live-coding
In the RahBia Live Coding Series, we'll walk through a complete DevOps journey from start to finish. Together, we'll cover every step, from initial server configuration to final production-ready service deployment. The series is hosted by Mr. AhmadRafiee.
argocd ceph cicd docker elk gitops grafana haproxy linux observability openstack prometheus sre sre-team terraform traefik
Last synced: 10 Apr 2025
https://github.com/revengai/reai-cutter
RevEng.AI Cutter Plugin
artificial-intelligence cutter machine-learning radare2 radare2-plugin reverse-engineering rizin sre
Last synced: 11 Jul 2025
https://github.com/oguzhan-yilmaz/auto-blackbox-exporter
SSL Certificate Expiry alerts for existing K8s Ingress hosts — install with Helm or ArgoCD
alertmanager alerts argocd blackbox-exporter gitops grafana grafana-dashboard helm helm-chart kubernetes monitoring prometheus prometheus-exporter sre ssl-certificate ssl-certificate-expired-check
Last synced: 06 May 2026
https://github.com/amaurybsouza/portfolio
Helps global projects improve security posture while optimizing costs and ensuring business continuity. I'm a dedicated Cloud Security Engineer committed to safeguarding cloud environments and fostering a DevSecOps culture.
ansible aws cicd cloud devops devops-team devops-tools git github gitlab gitops infrastructure-as-code kubernetes sre terraform
Last synced: 12 May 2025
https://github.com/apurvabhandari/interview-que-for-devops
Interview Questions for DevOps / SRE
cloud container devops hacktoberfest sre technology
Last synced: 29 Jun 2025
https://github.com/amaurybsouza/my-github-actions
🪐🤖🚀 An awesome list of useful GitHub Actions with workflow examples and real-world cases for you to use on a daily basis.
actions ansible aws azure azure-devops cicd cloud devops devops-pipeline devops-tools github github-actions gitlab infrastructure-as-code kubernetes-deployment pipeline sre terraform
Last synced: 11 Apr 2026
https://github.com/gdagil/vmprober
Network connectivity and service availability prober with WAL-backed metrics export to VictoriaMetrics
devops dns-probe golang grpc-probe health-check http-probe icmp infrastructure metrics monitoring network-monitoring observability probing prometheus sre tcp-probes victoriametrics
Last synced: 13 Jan 2026
https://github.com/lukaspustina/usereport-rs
Collect system information for the first 60 seconds of a performance analysis
analysis cli performance sre statistics
Last synced: 03 Aug 2025
https://github.com/ohmydevops/now-status
A social network of people's status! This is a simple project to organize my mind for interviews. Wish me luck.
Last synced: 31 Mar 2025
https://github.com/exospherehost/ai-reliability-standards
Architectural standards and best practices for building reliable AI Agents and LLM workflows. Defining the framework for AI Reliability Engineering (AIRE).
ai ai-agents ai-reliability aiops durable-execution enterprise evals evaluation observability reliability-engineering sre
Last synced: 15 Feb 2026
https://github.com/tedilabs/helm-charts
♻️ Repository for Reusable Kubernetes Helm Charts
devops gitops hacktoberfest helm helm-charts k8s kubernetes lang-yaml sre tedilabs
Last synced: 01 Mar 2026
https://github.com/rajatguptarg/samantha
Bot for SRE and DevOps
ansible automation bot ci-cd devops slackapi slackbot sre
Last synced: 19 Jan 2026
https://github.com/andrewaylett/self-throttle
Helps clients to not overwhelm the services they call
Last synced: 05 Feb 2026
https://github.com/rootly-ai-labs/gmcq-benchmark
Evaluation benchmark for language models to understand code to close pull requests.
ai benchmark evals evaluation-metrics llm sre
Last synced: 25 Feb 2026
https://github.com/mathisve/aiosre
All In One SRE Docker Container
aws docker hacktoberfest kubernetes sre
Last synced: 26 Feb 2026
https://github.com/anqorithm/fastapi-helm-chart
This repository contains a Helm chart for a FastAPI application to be deployed on OpenShift clusters with minimal effort with customizable configurations.
automation cicd deployment devops docker fastapi helm k8s kubernetes oc openshift poetry python sre
Last synced: 12 Feb 2026
https://github.com/getbettr/www
Source code for my personal homepage.
advent-of-code adventure devops journal kubernetes rust sre technology
Last synced: 14 Feb 2026
https://github.com/ehsaniara/delay-box
This tool simplifies the workflow by removing the need to write code to handle Redis and Kafka development complexities. It manages these tasks for you through straightforward REST calls.
delayed-job devops distributed-systems kafka redis scheduled-tasks sre
Last synced: 01 Mar 2026
https://github.com/geekxflood/prometheus-inventory-manager
PRIM exports to CSV all Prometheus metrics and rules set on your Prometheus instance
monitoring observability prometheus reporting sre
Last synced: 04 Mar 2026
https://github.com/mattyopon/faultray
Zero-risk infrastructure chaos simulation — 5 engines, 2000+ scenarios, 3-Layer availability proof. No production fault injection.
availability chaos-engineering devops infrastructure python resilience simulation sre
Last synced: 02 Apr 2026
https://github.com/conallob/mcp-ssh-wingman
MCP Server for providing read only access to your shell prompt
automation debugging devops llm mcp mcp-server pair-programming platform-engineering shell sre tmux
Last synced: 03 Apr 2026
https://github.com/sergkondr/skondrashov-zsh-theme
My superminimalistic theme for oh-my-zsh
devops oh-my-zsh oh-my-zsh-theme shell-theme sre zsh zsh-theme
Last synced: 23 Apr 2026
https://github.com/lucasepe/resto
A minimalist CLI REST client that calls APIs, waits for conditions, and retries intelligently.
command-line devops expression-evaluator jq kubernetes rest-client retry sre tooling
Last synced: 27 Apr 2026
https://github.com/meysam81/liveness-check
Kubernetes-native health checker that automatically finds and verifies your latest pods are ready before considering deployments successful - perfect for preview environments
ci-cd cli-tool container-native cross-platform devops docker golang health-check http-client kubernetes kubernetes-health liveness-probe microservices monitoring preview-environments readiness-probe single-binary site-reliability sre zero-dependencies
Last synced: 29 Apr 2026