SRE
Site reliability engineering (SRE) is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.
- GitHub: https://github.com/topics/sre
- Wikipedia: https://en.wikipedia.org/wiki/Site_reliability_engineering
- Aliases: site-reliability-engineering
- Last updated: 2026-03-25 00:29:22 UTC
https://github.com/toolsascode/homebrew-tap
Homebrew library repository.
brew cloud devops golang gotemplate homebrew homebrew-tap it pipeline platform sre
Last synced: 27 Feb 2026
https://github.com/letusdevops/learngo
A 30-day Golang roadmap for DevOps, with exercises.
Last synced: 13 Jun 2025
https://github.com/amaurybsouza/terraform-aws-ec2-ssh
Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable compute capacity in the cloud. It is one of the core services offered by Amazon Web Services (AWS) and provides a wide range of features and capabilities.
aws aws-ec2 devops devops-tools github github-actions infrastructure-as-code infrstructure sre terraform terraform-aws terraform-managed terraform-modules terraform-provider
Last synced: 21 Jan 2026
https://github.com/greenblade29/loglense
AI-Powered Log Analysis for the Command Line
ai-analysis ai-powered anthropic artificial-intelligence cli-tool debugging devops gemini llm log-analysis machine-learning ollama openai python root-cause-analysis sre troubleshooting
Last synced: 11 Oct 2025
https://github.com/dvilaverde/k8s-countermeasures
Kubernetes operator deploying run-books as code.
automation countermeasure devops golang k8s kubernetes operator operator-sdk prod-support runbooks sre
Last synced: 15 Jan 2026
https://github.com/imgautamm/srerepo
SRE Assessment Repo
dataengineering docker postgres python sre
Last synced: 30 Apr 2025
https://github.com/tedilabs/terraform-http-modules
🌳 A sustainable Terraform Package that manages useful data modules via the HTTP provider
devops hacktoberfest hcl2 http lang-hcl sre tedilabs terraform terraform-module terraform-modules
Last synced: 11 Jun 2025
https://github.com/tiagotartari/observability-dotnet-opentelemetry-first-steps
This project demonstrates how to implement observability in .NET applications using OpenTelemetry.
dotnet dotnet8 logs metrics observability opentelemetry opentelemetry-collector opentelemetry-dotnet sre traces
Last synced: 20 Jan 2026
https://github.com/pfrederiksen/blast-radius
Local-first AWS dependency graph CLI to understand blast radius before making changes
aws aws-sdk-go-v2 cli cloudops devops golang observability sre
Last synced: 25 Jan 2026
https://github.com/apiaryio/example-intersphinx-repo
This repository demonstrates using Intersphinx with indexes exported in a Docker volume
Last synced: 26 Jun 2025
https://github.com/konstruktoid/disruella
A very small digitalized primate responsible for randomly preventing something from continuing as usual or as expected.
chaos-engineering hacktoberfest high-availability python-black python3 resilience sre systemd test-automation
Last synced: 16 Feb 2026
https://github.com/toolsascode/gomodeler
Go Modeler is a small CLI and library that brings the powerful features of Go templates into a simplified form.
ci cloud devops github-actions infra it pipeline platform sre
Last synced: 13 Oct 2025
https://github.com/shakibamoshiri/dq
Debug Docker quickly using Docker Query
debugging devops-tools docker nodejs sre
Last synced: 19 Jun 2025
https://github.com/mkozjak/mkozjak.github.io
My business website.
aws cloud consultancy devops engineer gcp kubernetes programmer services sre terraform
Last synced: 12 Aug 2025
https://github.com/dingus-technology/DINGUS
Identify and solve bugs in your code by talking to your logs!
ai bugs deployment devops docker grafana infrastructure llm logging loki metrics monitoring openai prometheus python sre
Last synced: 31 Dec 2025
https://github.com/rvhoney/sre-learning-notes
SRE & DevOps roadmap with Obsidian-friendly notes and projects.
automation cloud devops infrastructure learning monitoring networking obsidian projects roadmap security sre study-notes version-control
Last synced: 05 Oct 2025
https://github.com/sanjaysv18/att-website-monitoring-
🚀 Full-stack monitoring with Prometheus & Grafana | Docker-based infrastructure monitoring | SRE practices demonstration
devops docker docker-compose grafana monitoring observability prometheus sre
Last synced: 06 Oct 2025
https://github.com/99icar/devops
A fully autonomous, AI-powered DevOps platform for managing cloud infrastructure across multiple providers, with AWS and GitHub integration, powered by OpenAI's Agents SDK.
book containers devops-course docker hacktoberfest jenkins kubernetes leanpub prometheus sql sre terraform vagrant yaml
Last synced: 27 Mar 2025
https://github.com/loneexpert/stashfin_learning
api-testing automation automation-testing java restassured sre
Last synced: 07 Oct 2025
https://github.com/moneycat-inc/otel-ops-pack
Hardened OpenTelemetry Collector ops pack for Windows: day-2 tooling, deterministic canary, safe change control, chaos drills, audit evidence.
observability opentelemetry ops signoz sre windows
Last synced: 07 Oct 2025
https://github.com/cloudputation/iterator
Automate infrastructure management with observability
automation devops golang infrastructure-as-code sre
Last synced: 27 Jan 2026
https://github.com/felippemozer/go-devops-sre-udemy
Solving DevOps and SRE use-case problems with the Go programming language, from the Udemy course "Programação Go para DevOps e SREs" (Go Programming for DevOps and SREs)
devops golang sre udemy-course
Last synced: 14 Jan 2026
https://github.com/logan-bobo/user_infomation_api
A RESTful API built with Python and Flask allowing management of user information such as first name, last name and email through CRUD operations against a persistent database.
api aws cloud database devops flask python rest-api sql sre
Last synced: 20 Aug 2025
https://github.com/ranching-farm/kubectl-addon
Kubectl addon for connecting Kubernetes clusters to ranching.farm - an AI-powered Kubernetes management platform. Simplify cluster operations and get intelligent assistance for common tasks.
ai-assistant ai-assisted cluster-management devops helm k8s krew krew-plugin kubectl kubectl-commands kubectl-plugin kubectl-plugins kubernetes kustomize ranching-farm sre
Last synced: 19 Jan 2026
https://github.com/aliariff/argus
devops grafana influxdb monitoring performance python sre webpagetest
Last synced: 05 Oct 2025
https://github.com/cloud-automation-portfolio/cloud-monitoring-automation
IaC and scripts to deploy secure, multi-cloud logging, monitoring, and alerting. Integrates AWS CloudWatch, Azure Monitor/Sentinel, and Kubernetes (Prometheus/Alertmanager/Grafana) into a single signed Alert Hub (API Gateway + Lambda) for ChatOps delivery. Uses GitHub OIDC for CI with policy and security gates.
alertmanager api-gateway aws azure chatops cloudwatch github-actions grafana iac kubernetes lambda log-analytics observability oidc prometheus security sentinel sre terraform
Last synced: 16 Aug 2025
https://github.com/viniciushammett/Golang-DevOps-SRE-Aplicado
Project developed to practice DevOps/SRE through applied, day-to-day study using the Go language
Last synced: 15 Aug 2025
https://github.com/jwalsh/observex-demo
ObserveX demo of an Internal Developer Platform (IDP)
devops distributed-systems distributed-tracing guile-scheme idp internal-developer-platform observability platform-engineering simulation sre
Last synced: 13 Mar 2026
https://github.com/ranching-farm/k8s-agent
Kubernetes agent for deploying ranching.farm directly into your cluster. Connect your K8s deployment to our AI-powered management platform with a single line of code.
ai-assistant ai-assisted cluster-management devops helm k8s kubectl kubernetes kustomize ranching-farm sre
Last synced: 03 Feb 2026
https://github.com/mrwogu/portguard
Port Monitoring Health Check Service
devops go golang health-check http-service kubernetes load-balancer monitoring port-monitoring sre systemd tcp
Last synced: 27 Jan 2026
https://github.com/aronmilenait/homelab
A personal homelab for experimenting with DevOps, SRE, and Linux tools on physical hardware.
Last synced: 10 Aug 2025
https://github.com/aswinbennyofficial/sre-exercises
Building a resilient backend project from scratch with Dockerization and CI/CD, incorporating Site Reliability Engineering (SRE) principles. Demonstrates best practices in modular architecture, logging, database migration, and CI/CD pipelines for automated testing and deployment.
api backend docker go go-chi golang pgx postgres rest-api sre student-management-system vyper-config yaml-configuration zerolog
Last synced: 12 Mar 2025
https://github.com/cloudon-one/opensearch-monitoring
Reusable OpenSearch Monitoring configs
aws aws-lambda sre sre-terraform-managed
Last synced: 30 Dec 2025
https://github.com/toolsascode/scoop-bucket
Scoop bucket for official GoModeler CLI
cli cloud devops golang gotemplate scoop sre
Last synced: 20 Oct 2025
https://github.com/briancain/cats-as-a-service
A helper repo used during role-playing-based incident-response training.
cat cats dnd incident-response roleplay sre sre-infrastructure
Last synced: 28 Jan 2026
https://github.com/tedilabs/github-required-actions
♥️ The best way to manage GitHub Actions Required Workflows in @tedilabs
devops github github-actions hacktoberfest sre tedilabs
Last synced: 27 Mar 2025
https://github.com/ricoberger/ricoberger.de
Personal website with links to my LinkedIn, Xing, Twitter, Github and Medium profile.
cloud-native github gopher hacker linkedin medium site-reliability-engineer sre twitter xing
Last synced: 22 Feb 2025
https://github.com/gma1k/gma1k
Introducing
automation cloud devops kubernetes linux platform-engineering security sre
Last synced: 06 Mar 2026
https://github.com/tedilabs/terraform-tfe-modules
🌳 A sustainable Terraform Package to manage everything on Terraform Enterprise (Terraform Cloud)
devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-cloud terraform-enterprise terraform-module terraform-modules tfe type-module
Last synced: 05 Mar 2026
https://github.com/curiouslearner/cache_sniper
A small utility to detect page caching on CDNs
cache cache-invalidation devops-tools rust rust-lang sre
Last synced: 28 Oct 2025
https://github.com/tswcbyy1107/ansible-k8s
Deploy Kubernetes (k8s) with Ansible
ansible calico containerd docker flannel k8s sre
Last synced: 02 Mar 2025
https://github.com/rebound-how/rebound
The open source toolbox for resilient operations
agentic-ai ai chaos-engineering chaostoolkit devops mcp-server reliability-engineering reliability-tools resilience resilience-testing sre
Last synced: 08 Jul 2025
https://github.com/usekarma/adage-fabric
A universal, governance-first pattern for unifying event streams with Kafka, ClickHouse, and Grafana.
adage clickhouse event-streaming fabric governance grafana kafka observability sre
Last synced: 16 Sep 2025
https://github.com/fedekau/mercado-libre-sre
An API that centralizes and caches queries to other Mercado Libre APIs.
Last synced: 28 Oct 2025
https://github.com/debaghtk/opsfordevs
DevOps bootcamp material I have taught at previous companies
bootcamp devops operations sre
Last synced: 15 Feb 2026
https://github.com/viniciushammett/n8n-devops-lab
n8n lab for DevOps/SRE: incident automation, SLO digests, and webhooks, running on Docker (Postgres + Redis, queue mode).
automation cron devops docker docker-compose incident-management n8n postgres redis sre webhook workflows
Last synced: 09 Aug 2025
https://github.com/korchasa/severin
PoC server chat agent
agent agentic-ai devops llm sre
Last synced: 31 Jan 2026
https://github.com/mikeshobes718/eks-orchestrator
Python CLI to manage EKS cluster/nodegroup lifecycle, RBAC, addons, and GitOps-safe rollouts with dry-run plans.
cli devops eks gitops helm kubectl kubernetes python sre
Last synced: 17 Mar 2026
https://github.com/mikeshobes718/cluster-admin-toolkit
SRE toolkit for day‑2 ops: nodes/deployments views, health, logs/events, rollout status, cordon/drain, restart workflows.
cli kubectl kubernetes observability operations sre
Last synced: 17 Mar 2026
https://github.com/eon01/observabilitywithprometheusandgrafanacompaniontoolkit
Observability with Prometheus and Grafana - The Companion Toolkit
alertmanager devops docker docker-swarm grafana kubernetes metrics monitoring monitoring-as-code monitoring-tool observability prometheus prometheus-client prometheus-exporter prometheus-operator prometheus-pushgateway promql pushgateway sre
Last synced: 30 Dec 2025
https://github.com/monim279/ai-powered-devops
🤖 Discover AI tools and techniques to optimize DevOps processes through practical challenges and hands-on projects in this comprehensive 10-day course.
agent agentic-ai cicd cloud devops devops-platform devops-workflow engineering-productivity environment-manager go grafana hacktoberfest llmops mcp microsoft-teams prometheus self-hosted sre
Last synced: 24 Sep 2025
https://github.com/mariano-tp/github-observability-demo
Prometheus + Grafana + an exporter for GitHub metrics. CI with GitHub Actions.
devops docker-compose github-actions grafana observability portfolio prometheus sre
Last synced: 24 Sep 2025
https://github.com/knaeckebrothero/kubernetes-cluster-project
This project focuses on setting up and managing a Kubernetes Cluster, because who doesn't want one?
deployment devops kubernetes kubernetes-cluster sre
Last synced: 05 Apr 2025
https://github.com/goseind/schablone
Template repository for app infrastructure based on SRE principles
actions azure cicd helm kubernetes sre terraform
Last synced: 30 Dec 2025
https://github.com/oguzhan-yilmaz/auto-blackbox-exporter
SSL Certificate Expiry alerts for existing K8s Ingress hosts — Auto generate a Prometheus ScrapeConfig for blackbox exporter — Install with Helm or ArgoCD
alertmanager alerts argocd blackbox-exporter gitops grafana grafana-dashboard helm helm-chart kubernetes monitoring prometheus prometheus-exporter sre ssl-certificate ssl-certificate-expired-check
Last synced: 24 Dec 2025
https://gitlab.com/ek-it/guias
Installation and configuration guides for servers and services in a data center, based on open-source technologies and free software.
Last synced: 11 Mar 2025
https://github.com/fsaintjacques/survivalkit
A survival kit is a package of basic tools and supplies prepared in advance as an aid to survival in an emergency.
c health-check healthcheck logger monitoring sre
Last synced: 21 Mar 2025
https://github.com/zondatw/remote-cmder
remote cmder
command debugging-tool devops remote sre
Last synced: 17 Jan 2026
https://github.com/dbalucas/kb_dbalucas
This repository contains all my know-how: a list of quick, searchable commands for administering databases and other types of systems I work with.
cloud dba ddl dml k8s linux postgresql sql sre
Last synced: 30 Dec 2025
https://github.com/thanhnguyxn/alert-alchemy
🧪 CLI incident-response simulator: brew fixes from alerts using realistic logs, metrics & traces (offline).
chaos-engineering cli debugging devops game incident-response learning monitoring observability oncall postmortem python rich runbooks simulation site-reliability-engineering sre terminal typer yaml
Last synced: 13 Jan 2026
https://github.com/oluwatobi-roie/sre-diskmonitor
Monitor disk usage on a MySQL server and auto-reset binary logs safely when space runs low.
automation bash cronjob devops diskmonitoring mysql server-maintenance sre
Last synced: 02 Jul 2025
https://github.com/cpanato/cpanato
aws azure bots devops gcp go kubernetes sre
Last synced: 17 Feb 2026
https://github.com/deas/ka0s
Building Chaos around LitmusChaos on Kubernetes
chaos-engineering flux2 kubernetes litmuschaos sre
Last synced: 15 Mar 2025
https://github.com/roybidani/sre-lab-infra
🚀 Complete SRE Training Environment - Production-grade infrastructure with Kubernetes, Prometheus, Grafana, and advanced SRE practices for hands-on learning
aws chaos-engineering devops grafana kubernetes monitoring prometheus sre terraform training
Last synced: 30 Dec 2025
https://github.com/jojees/project-genesis
Project Genesis is a comprehensive, hands-on learning initiative designed to build and manage a tangible, multi-service application within a modern DevOps ecosystem. This project serves as a real-world sandbox, demonstrating best practices across various disciplines, including DevOps, Site Reliability Engineering (SRE), DevSecOps, and FinDevOps.
cicd devops docker gitops grafana high-availability kubernetes microservices-architecture observability postgres prometheus rabbitmq redis sre
Last synced: 30 Dec 2025
https://github.com/leehmdev/gke-gitops-observability-lab
End-to-end GKE GitOps & Observability lab using Terraform, Helm, Argo CD, Prometheus, and Grafana
argocd devops gitops gke grafana helm kubernetes prometheus sre terraform
Last synced: 23 Nov 2025
https://github.com/rafaellimatecnologia-cloud/local-first-ai-service
Local-first AI service with deterministic routing, deadline enforcement, and graceful degradation
deterministic edge-ai latency local-first observability python reliability sre
Last synced: 13 Jan 2026
https://github.com/pezzos/pezzos
Some info about me 🤓
curriculum-vitae cv devops sre
Last synced: 11 Feb 2026
https://github.com/ramesh-852000/devops-practices-and-interview-prep
A collection of DevOps practices, scripts, interview questions, and real-world examples covering Linux, Jenkins, AWS, Kubernetes, Docker, Ansible, Terraform, CI/CD pipelines, Monitoring, and Cloud Platforms.
ansible aws azure cloud devops docker elastic gcp interview-questions jenkins kubernetes linux nosql prometheus sql sre terraform
Last synced: 30 Dec 2025
https://github.com/tty47/axectl
DevOps/SRE set of tools
devops go golang infrastructure infrastructure-as-code infrastructure-management sre tooling tools
Last synced: 08 Apr 2025
https://github.com/ziad-hsn/cpra
CPRA is a high-performance infrastructure monitoring system designed for platform teams managing large-scale microservice architectures. Built on Entity-Component-System (ECS) architecture and queueing theory principles, CPRA handles 1,000,000+ concurrent health checks with automatic worker pool scaling to meet SLO targets.
concurrency devops golang observability self-hosted sre uptime-monitor
Last synced: 07 Mar 2026
https://github.com/felipe-veas/handling-production-incidents
Runbooks, processes, and guidelines for effectively managing production incidents
documentation incident-management reliability runbooks sre
Last synced: 10 Mar 2026
https://github.com/felipe-veas/felipe-veas
SRE-DevOps Engineer specializing in Kubernetes, Terraform, cloud infrastructure, and observability platforms
aws cloud-engineering devops gcp gitops kubernetes observability sre terraform
Last synced: 10 Mar 2026
https://github.com/guibes/runbook-operator
A cloud-native Kubernetes operator that automatically generates and manages runbook documentation from PrometheusRule configurations with multiple output formats.
alerting automation cloud-native devops documentation gitops incident-response kubernetes monitoring operator prometheus runbooks sre
Last synced: 23 Jun 2025
https://github.com/juanfranciscocis/devprobe_tesis
DevProbe is a progressive web application that provides a platform for Site Reliability Engineers to monitor their websites. The app is built with , IONIC, Angular and Firebase.
angular gemini gemini-api ionic ionic-framework reliability-engineering site site-reliability-engineering site-reliability-engineering-sre sre sre-team typescript
Last synced: 01 Apr 2025
https://github.com/macbre/http-shadow
Compares HTTP responses from two different backends
Last synced: 20 Jul 2025
https://github.com/bienkma/bienkma.github.io
bienkma's information
bienkma devops infra infrastructure loadbalancing sre system systemadmin
Last synced: 16 Jan 2026
https://github.com/timyiu478/sadservers
Notes on SadServers
devops linux sre troubleshooting
Last synced: 03 Jul 2025
https://github.com/apolzek/shared
A collection of proofs of concept (PoCs) created to explore ideas and test technologies
devops devops-tools laboratory proof-of-concept sre
Last synced: 17 Jan 2026
https://github.com/itsfoss0/writeups
Write-ups about my homelab, plus postmortems for incidents and outages in it
devops incident-management incident-response kubernetes sre
Last synced: 08 Apr 2025
https://github.com/powerhome/pac-quota-controller
PAC Resource Sharing Validation Webhook
Last synced: 16 Jan 2026
https://github.com/swenyai/sweny
AI-powered engineering workflows — Learn from any source, Act through any tool, Report through any channel
ai ai-agent automation claude devops github-action observability sre triage typescript
Last synced: 11 Mar 2026
https://github.com/opscart/opscart-k8s-watcher
Kubernetes security awareness and troubleshooting tool featuring CIS Benchmark scoring, environment-aware analysis (PROD vs DEV), and actionable recommendations. Not for compliance auditing - use kube-bench for official CIS compliance.
cis-benchmark cloud-native cluster-monitoring devops devsecops kubernetes-security platform-engineering resource-optimization sre troubleshooting
Last synced: 21 Feb 2026
https://github.com/timthepost/sidecar
Silly Shell System Sidecar
ansi devops devops-tools linux shell sre sysadmin terminal-based
Last synced: 31 Oct 2025
https://github.com/trafik255/platform-engineering-starter-kit
A complete AWS platform engineering reference architecture using Terraform, ECS Fargate, ALB, CloudWatch, and GitHub Actions.
alb aws aws-ecs aws-vpc cicd cloudwatch devops ecs fastapi infrastructure-as-code microservices observability platform-engineering sre starter-kit terraform
Last synced: 27 Nov 2025
https://github.com/Terraform-Tutorials/terraform-aws-autoscaling
A basic test of coding Auto Scaling with Terraform on AWS
aws aws-autoscaling aws-ec2 cloud devops devops-tools ec2 infrastructure-as-code sre tech terraform terraform-modules vpc
Last synced: 10 Mar 2025