SRE
Site reliability engineering (SRE) is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.
- GitHub: https://github.com/topics/sre
- Wikipedia: https://en.wikipedia.org/wiki/Site_reliability_engineering
- Aliases: site-reliability-engineering
- Last updated: 2026-06-25 00:25:51 UTC
https://github.com/tedilabs/terraform-aws-quicksight
🌳 A sustainable Terraform Package to manage QuickSight resources on AWS
aws aws-data aws-quicksight devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 19 May 2026
https://github.com/fsaintjacques/survivalkit
A survival kit is a package of basic tools and supplies prepared in advance as an aid to survival in an emergency.
c health-check healthcheck logger monitoring sre
Last synced: 21 Mar 2025
https://github.com/cleancloud-io/scan-action
GitHub Action for CleanCloud — read-only cloud hygiene scanner for AWS, Azure and GCP
aws azure cloud cloud-computing cost-optimization devops finops hygiene sre
Last synced: 07 Apr 2026
https://github.com/srujantata/opentelemetry-observability
OpenTelemetry Collector on Kubernetes: unified traces (Jaeger), metrics (Prometheus + Grafana), logs (Loki), auto-instrumentation, exemplar linking
grafana loki observability opentelemetry prometheus slo sre tempo
Last synced: 28 Jun 2026
https://github.com/akintunero/netdiag
Stdlib-only CLI for SRE on-call: traceroute, DNS, ping, TLS, VPN checks, and incident presets. JSON output, stable exit codes.
automation cli command-line-tools devops dns json network-diagnostics networking on-call python sre sysadmin traceroute troubleshooting vpn
Last synced: 21 Jun 2026
https://github.com/goseind/schablone
Template repository for app infrastructure based on SRE principles
actions azure cicd helm kubernetes sre terraform
Last synced: 05 Apr 2026
https://github.com/korchasa/severin
PoC server chat agent
agent agentic-ai devops llm sre
Last synced: 31 Jan 2026
https://github.com/altikva/spero
Self-healing supervision agent for Linux hosts and Kubernetes
agentic-ai aiops anomaly-detection asyncio automation devops fastapi homelab incident-response kubernetes llm monitoring observability python self-healing self-hosted sre ssh supervision
Last synced: 21 Jun 2026
https://github.com/mikeshobes718/eks-orchestrator
Python CLI to manage EKS cluster/nodegroup lifecycle, RBAC, addons, and GitOps-safe rollouts with dry-run plans.
cli devops eks gitops helm kubectl kubernetes python sre
Last synced: 17 Mar 2026
https://github.com/mikeshobes718/cluster-admin-toolkit
SRE toolkit for day‑2 ops: nodes/deployments views, health, logs/events, rollout status, cordon/drain, restart workflows.
cli kubectl kubernetes observability operations sre
Last synced: 17 Mar 2026
https://github.com/assafdori/tcdp
This repository tracks my progress while taking part in the TCDP course.
Last synced: 07 Apr 2026
https://github.com/knaeckebrothero/kubernetes-cluster-project
This project focuses on setting up and managing a Kubernetes Cluster, because who doesn't want one?
deployment devops kubernetes kubernetes-cluster sre
Last synced: 05 Apr 2025
https://github.com/ishantanu/gcp-status-exporter
A Prometheus Exporter for generating metrics for GCP Service Status and Incidents 🚀
gcp opentelemetry prometheus-exporter sre
Last synced: 10 May 2026
https://github.com/abd-ulbasit/goplatform
Learning-focused Kubernetes operator (kubebuilder v4) that provisions in-cluster databases, caches, and queues via a custom Application CRD — with tier-based observability and drift detection.
cloud-native cnpg controller-runtime crd devops go golang kubebuilder kubernetes kubernetes-operator operator platform-engineering postgresql prometheus rabbitmq redis sre
Last synced: 10 Jun 2026
https://gitlab.com/ek-it/guias
Guides for installing and configuring servers and services in a data center, based on open source technologies and free software.
Last synced: 11 Mar 2025
https://github.com/pezzos/pezzos
Some info about me 🤓
curriculum-vitae cv devops sre
Last synced: 11 Feb 2026
https://github.com/deldotore-r/deldotore-r
🌐 Cloud Infrastructure & DevOps | AWS, Terraform, GitHub Actions & Linux Automation. Retired officer applying technical rigor and discipline to Systems Engineering.
airflow automation aws cloud-computing devops docker github-actions grafana infrastructure-as-code kubernetes linux prometheus shell-script sre terraform
Last synced: 16 Apr 2026
https://github.com/cpanato/cpanato
aws azure bots devops gcp go kubernetes sre
Last synced: 17 Feb 2026
https://github.com/anshul619/devops-sre
debugging devops docker hld kubernetes new-relic observability pager-duty sre zookeeper
Last synced: 17 May 2026
https://github.com/nalediym/tiger-mom-protocol
An operating system for builders who start more than they finish. Personal protocol for AI-assisted development with structured nudges, celebration loops, and the six-step shipping workflow.
ai-workflow claude-code developer-productivity opencode productivity-tools skills sre
Last synced: 23 Jun 2026
https://github.com/ziad-hsn/cpra
CPRA is a high-performance infrastructure monitoring system designed for platform teams managing large-scale microservice architectures. Built on Entity-Component-System (ECS) architecture and queueing theory principles, CPRA handles 1,000,000+ concurrent health checks with automatic worker pool scaling to meet SLO targets.
concurrency devops golang observability self-hosted sre uptime-monitor
Last synced: 07 Mar 2026
https://github.com/apolzek/shared
Collection of proofs of concept (PoCs) created to explore ideas and test technologies
devops devops-tools laboratory proof-of-concept sre
Last synced: 17 Jan 2026
https://github.com/debaghtk/opsfordevs
DevOps bootcamp material that I have taught at previous companies
bootcamp devops operations sre
Last synced: 15 Feb 2026
https://github.com/tadiusfrank2001/managing_complex_cloud_systems_cs181y
Build and maintain a backend service using AWS to host websites
aws backend-services cloud-services networking relational-databases security sre system-design
Last synced: 23 Jun 2026
https://github.com/viniciushammett/n8n-devops-lab
n8n lab for DevOps/SRE: incident automation, SLO digests, and webhooks, running on Docker (Postgres+Redis, queue mode).
automation cron devops docker docker-compose incident-management n8n postgres redis sre webhook workflows
Last synced: 16 Apr 2026
https://github.com/jesioo/dx
Developer-first CLI/TUI that turns your README and a single file into a next-gen DX
cli cli-app developer-tools devops dx enterprise internal-tools markdown onboarding remote-control scripting sre terminal terminal-based tui workflows
Last synced: 18 Apr 2026
https://github.com/detectviz/tech-docs-notes
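This project is a notes hub for organizing and translating documentation, white papers, and research materials on various open source technical tools.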
Last synced: 10 May 2026
https://github.com/jnbdz/site-reliability-engineer-quickstarts
🦾 Site Reliability Engineer | Quickstarts 🦾
quickstart quickstarts site-reliability-engineering site-reliability-engineering-sre sre
Last synced: 03 Mar 2026
https://github.com/cloudnativeworks/elchi
Elchi is a React + TypeScript web interface for managing Envoy Proxy–based L4/L7 traffic. It provides visual configuration and control of xDS resources, routing, clusters, filters, security (mTLS/WAF), and observability integrations without directly editing Envoy YAML.
cloud-native delta-xds devops envoy envoy-proxy load-balancer proxy service-mesh sre traffic-management ui xds
Last synced: 11 May 2026
https://github.com/dc-tec/openbao-observability
OpenBao observability reference architecture with metrics, logs, dashboards, alerts, fixtures, and runbooks.
alloy grafana loki observability openbao sre
Last synced: 02 Jun 2026
https://github.com/rrabelloo/homebrew-formae
Unofficial Homebrew tap for Formae, a modern Infrastructure-as-Code platform.
devops iac infrastructure-as-code platform-engineering sre
Last synced: 04 Mar 2026
https://github.com/zondatw/remote-cmder
remote cmder
command debugging-tool devops remote sre
Last synced: 17 Jan 2026
https://github.com/guibes/runbook-operator
A cloud-native Kubernetes operator that automatically generates and manages runbook documentation from PrometheusRule configurations with multiple output formats.
alerting automation cloud-native devops documentation gitops incident-response kubernetes monitoring operator prometheus runbooks sre
Last synced: 17 May 2026
https://github.com/fedekau/mercado-libre-sre
An API for centralizing and caching queries to other Mercado Libre APIs.
Last synced: 17 May 2026
https://github.com/tedilabs/terraform-tfe-modules
🌳 A sustainable Terraform Package to manage all things on Terraform Enterprise (Terraform Cloud)
devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-cloud terraform-enterprise terraform-module terraform-modules tfe type-module
Last synced: 05 Mar 2026
https://github.com/wesllen-lima/cronko
Self-hosted cron job monitoring. Know when your jobs stop running.
cron devops docker heartbeat hono monitoring nextjs postgresql self-hosted sqlite sre tailwindcss uptime
Last synced: 25 Jun 2026
https://github.com/pallaprolus/kube-foresight
Predictive Resource Optimizer for Kubernetes — identifies over-provisioned deployments and generates right-sizing patches
aiops cli cloud-cost cost-optimization devops finops k8s kubectl kubernetes observability prometheus python resource-optimization right-sizing sre
Last synced: 12 Jun 2026
https://github.com/aronmilenait/aronmilenait.github.io
My blog and portfolio as a Software Developer transitioning to DevOps and SRE.
blog devops linux portfolio software-development sre
Last synced: 21 Jun 2025
https://github.com/guptaprakhariitr/vigil
Self-hosted on-call engineer: point it at your logs, it finds the cited root cause and (if you let it) opens a fix PR. Your infra, your engine (Claude/Cursor/API/local), your autonomy dial.
cli devops incident-response llm observability root-cause-analysis rust self-hosted sre
Last synced: 25 Jun 2026
https://github.com/tedilabs/terraform-aws-ecs
🌳 A sustainable Terraform Package which creates resources for ECS Services on AWS
aws aws-ecs devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules type-module
Last synced: 03 Apr 2026
https://github.com/soakes/soakes
GitHub profile README
ansible automation bgp cloud-engineering devops devsecops docker gitops go infrastructure kubernetes linux network-security operator-tooling platform-engineering python site-reliability sre terraform
Last synced: 03 Jun 2026
https://github.com/omaciasd/sre
SRE Challenges.
aws devops docker kubernetes linux python sre vagrant
Last synced: 10 Apr 2026
https://github.com/amsa-2425-gei-udl/laboratoris
Material aimed at students who wish to broaden their knowledge of systems administration and virtualization.
devops labs sre sys-admin teaching-materials
Last synced: 11 Apr 2025
https://github.com/jojees/project-genesis
Project Genesis is a comprehensive, hands-on learning initiative designed to build and manage a tangible, multi-service application within a modern DevOps ecosystem. This project serves as a real-world sandbox, demonstrating best practices across various disciplines, including DevOps, Site Reliability Engineering (SRE), DevSecOps, and FinDevOps.
cicd devops docker gitops grafana high-availability kubernetes microservices-architecture observability postgres prometheus rabbitmq redis sre
Last synced: 04 Apr 2026
https://github.com/dbalucas/kb_dbalucas
This repository contains all my know-how and is a list of quick, searchable commands for administering databases and the other types of systems I work with.
cloud dba ddl dml k8s linux postgresql sql sre
Last synced: 04 Apr 2026
https://github.com/peopledoc/jarvis
SRE toolbox
approved-public ghec-mig-migrated sre team-sre
Last synced: 04 Apr 2025
https://github.com/labrats-work/infra.cloud-platform
Multi-cloud Kubernetes platform with GitOps automation. Cloud-agnostic platform layer with provider-specific implementations.
ansible cloud-native devops flux fluxcd gitops hetzner infrastructure-as-code kubernetes kustomize multi-cloud platform-engineering prometheus sre terraform
Last synced: 05 Apr 2026
https://github.com/viniciushammett/airbyte-openmetrics-exporter
A lightweight OpenMetrics exporter for Airbyte OSS sync and operational visibility.
airbyte airbyte-oss datadog docker exporter grafana kubernetes metrics monitoring observability openmetrics platform-engineering postgresql prometheus python sre
Last synced: 04 Jun 2026
https://github.com/omers/sre-devops-tools
Tools and useful sources for SRE and DevOps
awsome awsome-list data devops monitoring sre tools
Last synced: 20 Apr 2026
https://github.com/deimosfr/mytechnotebook
My Tech Notebook
coding database dev devops kubernetes sre technology
Last synced: 09 May 2026
https://github.com/guilledipa/praetor
A Go-based configuration management tool for real-time, mTLS-secured infrastructure orchestration via Message Brokers and gRPC.
automation configuration-management devops golang grpc-go mtls nats-jetstream orchestration sre system-administration
Last synced: 14 Jun 2026
https://github.com/adanb13/cirdan
AI infrastructure cartographer and operations daemon — Graphify for live infrastructure. Fingerprints, graphs, and watches Docker/Kubernetes/AWS/IaC for AI agents via CLI + MCP.
agent-skills ai-agents aiops claude-code codex cursor devops docker gemini-cli incident-response infrastructure knowledge-graph kubernetes mcp mcp-server observability sre terraform
Last synced: 17 Jun 2026
https://github.com/donchanee/metricops
Prometheus metric governance CLI for self-hosted operators
cli cost-optimization golang grafana metrics observability prometheus sre
Last synced: 22 Apr 2026
https://github.com/admbahm/pushbadger
Deterministic git diff analyzer that maps changed files to risk areas using path-based heuristics. Fast, reproducible, no AI.
cli code-review developer-tools git go risk-analysis sre static-analysis
Last synced: 23 Apr 2026
https://github.com/tty47/axectl
DevOps/SRE set of tools
devops go golang infrastructure infrastructure-as-code infrastructure-management sre tooling tools
Last synced: 08 Apr 2025
https://github.com/jrhrmsll/tsgen
tsgen is a small Go program that simulates HTTP request faults and shows how Prometheus alerts based on multiwindow, multi-burn-rate alerting work.
golang grafana monitoring prometheus sre
Last synced: 22 Feb 2026
https://github.com/cheesebanana/yellowstack
Real-time Python script runner with scheduling, logging, and OpenAI-assisted debugging
automation aws devops flask job-scheduler openai python rest-api scheduler scripts sre
Last synced: 12 Jun 2025
https://github.com/robson-teixeira/jaeger-opentelemetry-tracking
Repository for the Alura platform course "Rastreamento: fazendo tracing com Jaeger e OpenTelemetry" (Tracking: tracing with Jaeger and OpenTelemetry).
alura container docker grafana grafana-loki jaeger java jdk nginx opentelemetry postgresql prometheus rastreamento redis spring sre tracing
Last synced: 12 Apr 2026
https://github.com/foxj77/autonomous-monitor
Kubernetes namespace monitor — continuous deterministic checks, JSON findings to Kafka
github-actions golang hacktoberfest kafka kubernetes monitoring namespace operator prometheus sre
Last synced: 18 Jun 2026
https://github.com/thanhnguyxn/alert-alchemy
🧪 CLI incident-response simulator: brew fixes from alerts using realistic logs, metrics & traces (offline).
chaos-engineering cli debugging devops game incident-response learning monitoring observability oncall postmortem python rich runbooks simulation site-reliability-engineering sre terminal typer yaml
Last synced: 13 Jan 2026
https://github.com/safoorsafdar/safoorsafdarcom
source code for the personal website safoorsafdar.com
azure cloud-architect devops docker kubernetes observability personal-website prometheus sre
Last synced: 18 Jan 2026
https://github.com/philyuchkoff/howtheysre
A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
Last synced: 11 Mar 2025
https://github.com/timyiu478/sadservers
Notes on sad servers
devops linux sre troubleshooting
Last synced: 03 Jul 2025
https://github.com/oluwatobi-roie/sre-diskmonitor
Monitor disk usage on a MySQL server and auto-reset binary logs safely when space runs low.
automation bash cronjob devops diskmonitoring mysql server-maintenance sre
Last synced: 27 Apr 2026
https://github.com/dcaffese-cypher/observability-security-stack
Production-ready observability & security toolkit: OpenTelemetry Collector (master/agent), Prometheus, Loki, Grafana dashboards, Wazuh and Zabbix utilities, plus retention/TSDB maintenance scripts.
ansible devops grafana infrastructure logging loki monitoring observability opentelemetry prometheus security sre tracing wazuh zabbix
Last synced: 11 Apr 2026
https://github.com/robliberti/aws-terraform-ansible-1
Terraform + Ansible lab demonstrating a secure public-frontend/private-backend architecture on AWS, featuring reverse proxying, idempotent configuration, and post-deploy health checks.
ansible aws devops infrastructure-as-code nginx reverse-proxy sre terraform
Last synced: 12 Apr 2026
https://github.com/benfradjselim/ruptura
Predictive failure detection engine for cloud-native infrastructure. Rupture Index™ detects divergence hours early — adaptive ensemble of 5 models, 8 composite signals, automated K8s remediation.
cloud-native go kubernetes mlops observability opentelemetry predictive-analytics prometheus rupture-detection sre
Last synced: 28 May 2026
https://github.com/deas/ka0s
Building Chaos around LitmusChaos on Kubernetes
chaos-engineering flux2 kubernetes litmuschaos sre
Last synced: 15 Mar 2025
https://github.com/ops-talks/farm
The Full Stack Platform Engineer
devportal platform-engineering sre
Last synced: 29 Apr 2026
https://github.com/shinagawa-web/pgincident
A live terminal dashboard for the first 30 seconds of a PostgreSQL incident — connections, locks, long queries, and idle transactions at a glance.
bubbletea cli devops go incident-response postgres postgresql sre terminal tui
Last synced: 27 May 2026
https://github.com/davidmalko87/jira-confluence-full-instance-backup
Automated full-instance backup for Jira & Confluence Cloud (Standard plan). Jenkins + CLI/menu; encrypted 7-Zip archives to GCS, S3, Azure or local; notify via Slack, Teams, email or webhook.
7zip atlassian atlassian-cloud automation azure backup cli confluence devops disaster-recovery gcs jenkins jira python s3 sre
Last synced: 27 May 2026
https://github.com/opscart/opscart-k8s-watcher
Kubernetes security awareness and troubleshooting tool featuring CIS Benchmark scoring, environment-aware analysis (PROD vs DEV), and actionable recommendations. Not for compliance auditing - use kube-bench for official CIS compliance.
cis-benchmark cloud-native cluster-monitoring devops devsecops kubernetes-security platform-engineering resource-optimization sre troubleshooting
Last synced: 21 Feb 2026
https://github.com/cjcsecurity/claude-tabletop
Project-aware tabletop exercise generator + AAR drafter for Claude Code
after-action-report ai-agents ai-tools anthropic claude-code claude-skill claude-skills cybersecurity devops incident-response runbook security-tooling security-training sre tabletop-exercise tabletop-exercises ttx
Last synced: 08 Jun 2026
https://github.com/bienkma/bienkma.github.io
bienkma's information
bienkma devops infra infrastructure loadbalancing sre system systemadmin
Last synced: 16 Jan 2026
https://github.com/lukemurraynz/drasi-aks-sre-agent
Azure SRE Agent - Blueprint for Drasi on AKS
Last synced: 20 Jun 2026
https://github.com/powerhome/pac-quota-controller
PAC Resource Sharing Validation Webhook
Last synced: 16 Jan 2026
https://github.com/monim279/ai-powered-devops
🤖 Discover AI tools and techniques to optimize DevOps processes through practical challenges and hands-on projects in this comprehensive 10-day course.
agent agentic-ai cicd cloud devops devops-platform devops-workflow engineering-productivity environment-manager go grafana hacktoberfest llmops mcp microsoft-teams prometheus self-hosted sre
Last synced: 29 Apr 2026
https://github.com/itsfoss0/writeups
Write-ups about my homelab and postmortems for incidents and outages in it
devops incident-management incident-response kubernetes sre
Last synced: 13 Apr 2026
https://github.com/doctolib/terraform-provider-elkaliases
Elasticsearch indexes provider for terraform
criticality-tier0 github-actions managed-by-terraform sre
Last synced: 30 Apr 2026
https://github.com/tedilabs/github-required-actions
♥️ The best way to manage GitHub Actions Required Workflows in @tedilabs
devops github github-actions hacktoberfest sre tedilabs
Last synced: 27 Mar 2025
https://github.com/awcodify/awesome-monitoring
This repository is a curated collection of valuable monitoring tools, resources, and best practices for developers, sysadmins, and DevOps professionals. It covers various aspects of monitoring, including infrastructure, applications, logs, networks, cloud, and Kubernetes.
alerting devops infrastructure logging logs metrics monitoring sre sysadmin
Last synced: 22 Feb 2026
https://github.com/swenyai/sweny
AI-powered engineering workflows — Learn from any source, Act through any tool, Report through any channel
ai ai-agent automation claude devops github-action observability sre triage typescript
Last synced: 11 Mar 2026
https://github.com/cakmoel/resilio
Professional technology-agnostic load testing suite built for performance engineering and durability auditing. Implements research-based methodologies (Jain, 1991) and ISO 25010 standards to validate speed, endurance, and scalability across any backend stack.
apachebench benchmarking devops-tools endurance-testing load-testing performance-testing quality-assurance reliability-engineering scalability sre stress-testing tech-agnostic
Last synced: 13 Apr 2026
https://github.com/codreum/terraform-aws-dns-monitoring-pro
Production-grade Route 53 DNS observability (Pro). Templates-only repo; module delivered via Codreum private Terraform registry.
aws cloudwatch cloudwatch-alarms cloudwatch-logs commercial contributor-insights dashboards dns dns-monitoring dnsci dnsciz incident-response infrastructure-as-code observability reliability route53 saas sre terraform terraform-templates
Last synced: 05 Feb 2026
https://github.com/rogerchappel/runbooklint
Local-first Markdown runbook linter for safer operational procedures
cli devops incident-response lint markdown runbook sre
Last synced: 26 May 2026
https://github.com/rtmuller/observability-reliability-lab
Hands-on lab demonstrating observability reliability patterns: meta-monitoring, chaos scenarios, Watchdog, absent() rules.
alertmanager chaos-engineering grafana meta-monitoring observability prometheus sre
Last synced: 29 May 2026
https://github.com/subhamay-bhattacharyya-tf/terraform-google-module-template
✅ A reusable, opinionated Terraform module template for provisioning and managing Google Cloud Platform resources — designed for consistency, scalability, and best practices.
cloud-infrastructure devops gcp google-cloud iac infrastructure-as-code module-template platform-engineering sre terraform terraform-module terraform-template
Last synced: 02 May 2026
https://github.com/juanfranciscocis/devprobe_tesis
DevProbe is a progressive web application that provides a platform for Site Reliability Engineers to monitor their websites. The app is built with Ionic, Angular, and Firebase.
angular gemini gemini-api ionic ionic-framework reliability-engineering site site-reliability-engineering site-reliability-engineering-sre sre sre-team typescript
Last synced: 10 Apr 2026
https://github.com/eon01/observabilitywithprometheusandgrafanacompaniontoolkit
Observability with Prometheus and Grafana - The Companion Toolkit
alertmanager devops docker docker-swarm grafana kubernetes metrics monitoring monitoring-as-code monitoring-tool observability prometheus prometheus-client prometheus-exporter prometheus-operator prometheus-pushgateway promql pushgateway sre
Last synced: 10 Apr 2026
https://github.com/christiangalsterer/pg-promise-prometheus-exporter
A prometheus exporter for pg-promise
grafana-dashboard metrics monitoring node-js nodejs observability pg-promise postgres postgresql prometheus prometheus-exporter sre
Last synced: 14 Jun 2025
https://github.com/jebinjeb/k8s-evacuator
Advanced Kubernetes node evacuation tool with safe batching, workload-aware draining, and progressive (controlled) pod eviction.
cloud-native cluster-operations devops k8s kubectl kubernetes node-drain platform-engineering pod-eviction sre statefulset
Last synced: 03 May 2026