SRE
Site reliability engineering (SRE) is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.
- GitHub: https://github.com/topics/sre
- Wikipedia: https://en.wikipedia.org/wiki/Site_reliability_engineering
- Aliases: site-reliability-engineering
- Last updated: 2026-03-24 00:29:21 UTC
https://github.com/prologic/prologic
Hiya 👋 I'm James Mills, a Senior SRE / DevOps engineer, formerly a Software Engineer, and an enthusiastic Gopher (Golang) programmer. I love open source and contributing back; unfortunately, recent events have led me to self-host more of my own projects and data. Please read on! 🙇♂️
devops golang open-source software software-engineering sre
Last synced: 13 Apr 2025
https://github.com/christiangalsterer/kafkajs-prometheus-exporter
A Prometheus exporter exposing metrics for KafkaJS
grafana grafana-dashboard kafka kafkajs metrics monitoring node-js nodejs prometheus prometheus-exporter sre
Last synced: 28 Jul 2025
https://github.com/ohmydevops/devops-culture-or-tools
Presentation file for the talk "DevOps: Culture or Tools?" given at the second programmers' meetup of the Mashhad Innovation Factory
agile devops devops-handbook devopsdays sre
Last synced: 18 Feb 2026
https://github.com/tedilabs/terraform-aws-organization
🌳 A sustainable Terraform Package to manage Organization resources on AWS
aws aws-organization aws-ram aws-sso devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules type-module
Last synced: 12 Feb 2026
https://github.com/johndeere/work-tracker
Observe and protect your Java web application.
elasticsearch java java11 java8 mdc metadata observability spring spring-boot sre
Last synced: 04 Oct 2025
https://github.com/guilt/chaossquirrel
Like Netflix's Chaos Monkey, packaged to run standalone.
chaos-monkey reliability-engineering sre
Last synced: 12 Aug 2025
https://github.com/tedilabs/terraform-aws-ec2
🌳 A sustainable Terraform Package which creates resources for EC2 Services on AWS
aws aws-ec2 devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules type-module
Last synced: 27 Feb 2026
https://github.com/timothystiles/buster
A Go package for running Go package CI/CD pipelines in Go...
cd ci ci-cd cicd continous-deployment continous-integration devops github-actions go golang infrastructure platform-engineering sre
Last synced: 10 Apr 2025
https://github.com/k8sgpt-ai/charts
Helm Charts for K8sGPT
devops kubernetes openai sre tooling
Last synced: 11 Mar 2026
https://github.com/tedilabs/terraform-github-modules
🌳 A sustainable Terraform Package which manages all things on GitHub
devops github hacktoberfest hcl2 lang-hcl sre tedilabs terraform terraform-module terraform-modules
Last synced: 01 Mar 2026
https://github.com/tedilabs/terraform-aws-ipam
🌳 A sustainable Terraform Package which creates IPAM resources (IPAM, Elastic IP, Prefix List) on AWS
aws aws-eip aws-elastic-ip aws-ipam aws-prefix-list aws-vpc devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 02 Aug 2025
https://github.com/jayvdb/sre-tools
Helpers for sre_parse, transforming regexes
python3 regex regular-expressions sre sre-parse
Last synced: 19 Aug 2025
https://github.com/mattermost/ponos
A ChatOps SRE toil elimination tool
chatops sre sre-team toil-elimination
Last synced: 14 Jan 2026
https://github.com/aligoren/sre-book-tr
Turkish translation of the Google SRE book. It was prepared to bring Site Reliability Engineering principles and practices to the Turkish-speaking technical community.
site-reliability-engineering sre turkish
Last synced: 07 Feb 2026
https://github.com/tedilabs/terraform-aws-lambda
🌳 A sustainable Terraform Package which creates Lambda & Step Functions resources on AWS
aws aws-lambda aws-sfn aws-step-functions devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 08 Mar 2026
https://github.com/mstryoda/sre-ai-agent
An autonomous Kubernetes troubleshooting and healing agent powered by AI Agents and LLMs
agent ai kubernetes llm python sre troubleshooting
Last synced: 13 Oct 2025
https://github.com/tedilabs/terraform-aws-messaging
🌳 A sustainable Terraform Package which creates resources for Messaging Services (EventBridge, MSK, SNS, SQS) on AWS
aws aws-eventbridge aws-msk aws-sns aws-sqs devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 19 Sep 2025
https://github.com/persys-dev/persys-cloud
Community-driven cloud provider :)
automation cloud cluster golang kubernetes pipelines platform sre
Last synced: 14 Apr 2025
https://github.com/vacovsky/poolse
Control health checks and toggle upstream node status in load balancers with ease.
application-monitoring devops devops-tools f5-health-monitor go golang health-check healthcheck load-balancer nginx-proxy proxy site-reliability-engineering sre
Last synced: 26 Mar 2025
https://github.com/amaurybsouza/devops-deep-dive
DevOps week of the Linux Tips channel: Ansible, Kubernetes, Docker, and AWS.
ansible automation aws bash cloud devops devops-tools linux playbook sre terraform
Last synced: 30 Dec 2025
https://github.com/eabykov/sre
Reliability is not the absence of failures. It is the ability of the system, the team, and the individual to get back up together after a fall, to rethink, rebuild, and move on, with new rules of the game in which human vulnerability is not a threat but part of the equation.
chaos-testing error-budget incident monitoring mttd mttm mttr postmortem reliability sla sli slo sre stamp
Last synced: 19 Jan 2026
https://github.com/fusakla/coordinator
Tool to coordinate on-call, incident and maintenance management
alerting communication coordination dashboard devops oncall sre
Last synced: 06 Mar 2025
https://github.com/gorillati/guias
Installation and configuration guides for servers and services in a data center, based on open-source technologies and free software.
configuration-management cpd datacenter free-software guias linux-server manuals open-source server services sre sre-infra sysadmin system-administration
Last synced: 11 Mar 2025
https://github.com/johndeere/outstanding
A Java concurrent collection for in-progress work
java java-collections java8 sre
Last synced: 14 Jan 2026
https://github.com/dynatrace/obslab-release-validation
Use Grafana k6, Dynatrace business events, workflows, and the Site Reliability Guardian to validate software releases
automation demo dynatrace grafana-k6 k6 load-testing obslab openfeature release-validation site-reliability-engineering site-reliability-guardian sre workflow
Last synced: 11 Jul 2025
https://github.com/tedilabs/terraform-okta-modules
🌳 A sustainable Terraform Package which manages all things on Okta
devops hacktoberfest hcl2 iac lang-hcl okta sre tedilabs terraform terraform-module terraform-modules terraform-okta
Last synced: 29 Jan 2026
https://github.com/priyanshujain/infragpt
InfraGPT is an AI SRE Copilot for the Cloud that provides infrastructure management agents through Slack integration. The system consists of multiple services that work together to deliver intelligent DevOps workflows.
artificial-intelligence google-cloud-platform infragpt infrastructure sre terraform
Last synced: 28 Jun 2025
https://github.com/ctsrc/mdrun
Runs command-line pipelines embedded in Markdown and CommonMark documents. Keeps your authored docs up to date. Even usable as an alternative to IPython notebooks.
abstract-syntax-tree ast commonmark computer-science data-science devops iac infrastructure-as-code markdown qa quality-assurance software-development software-engineering sre technical-writing writing-tool
Last synced: 02 Aug 2025
https://github.com/purbon/kafka-interesting-stories
Compilation of public incident/interesting/horror stories related to Kafka operations
incidents kafka post-mortem production-engineering sre
Last synced: 18 Mar 2026
https://github.com/fguisso/sre-checker
Simple server status checker
monitoring monitoring-tool server-status sre sre-checker
Last synced: 07 Mar 2026
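A "simple server status checker" usually boils down to one bounded HTTP request per endpoint. The sketch below is not sre-checker's code; the `check` function, the 5-second default timeout, and the rule that any 2xx/3xx response means "up" are assumptions made for illustration, using only the Python standard library.

```python
import sys
import urllib.error
import urllib.request


def check(url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Return (is_up, detail) for an HTTP endpoint; 2xx/3xx counts as up (assumed rule)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400, str(resp.status)
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        # HTTPError subclasses URLError, so it must be caught first.
        return False, str(exc.code)
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # DNS failure, connection refused, timeout, or a malformed URL.
        return False, str(getattr(exc, "reason", exc))


if __name__ == "__main__":
    for target in sys.argv[1:]:
        up, detail = check(target)
        print(f"{'UP' if up else 'DOWN'} {target} ({detail})")
```

Keeping the timeout explicit matters more than the status rule: a checker without one can hang on a half-open connection and report nothing at all.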
https://github.com/ayazhankadessova/grafana-prometheus
Prometheus-based Grafana dashboard featuring a latency chart, a CPU usage gauge, a request-rates table, and node host metrics, using PromLabs' public server.
grafana monitoring prometheus sre
Last synced: 22 Jul 2025
https://github.com/abhishekpanda0620/eol-check
A CLI tool to check the End-Of-Life (EOL) status of your development environment and project dependencies.
cli-tool devops end-of-life eol nodejs security sre typescript
Last synced: 13 Jan 2026
https://github.com/tedilabs/terraform-aws-cost
🌳 A sustainable Terraform Package which creates resources for Cost on AWS
aws aws-billing aws-budget aws-cost aws-cur devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 16 Jun 2025
https://github.com/amaurybsouza/aws-solutions-architect-associate
AWS Certified Solutions Architect - Associate (SAA-C02) Exam Notes
aws aws-ec2 aws-lambda certificate devops engeneering infrastructure-as-code solutions-architect sre tech terraform
Last synced: 01 Apr 2025
https://github.com/apurvabhandari/interview-que-for-devops
Interview Questions for DevOps / SRE
cloud container devops hacktoberfest sre technology
Last synced: 29 Jun 2025
https://github.com/amaurybsouza/portfolio
Helps global projects improve security posture while optimizing costs and ensuring business continuity. I'm a dedicated Cloud Security Engineer committed to safeguarding cloud environments and fostering a DevSecOps culture.
ansible aws cicd cloud devops devops-team devops-tools git github gitlab gitops infrastructure-as-code kubernetes sre terraform
Last synced: 12 May 2025
https://github.com/gdagil/vmprober
Network connectivity and service availability prober with WAL-backed metrics export to VictoriaMetrics
devops dns-probe golang grpc-probe health-check http-probe icmp infrastructure metrics monitoring network-monitoring observability probing prometheus sre tcp-probes victoriametrics
Last synced: 13 Jan 2026
https://github.com/lukaspustina/usereport-rs
Collect system information for the first 60 seconds of a performance analysis
analysis cli performance sre statistics
Last synced: 03 Aug 2025
https://github.com/ohmydevops/now-status
A social network of people's statuses! This is a simple project for organizing my mind for interviews. Wish me luck.
Last synced: 31 Mar 2025
https://github.com/shmokmt/tfhk
Terraform Housekeeper: a utility tool that removes refactoring blocks, such as moved blocks.
devops iac infrastructure-as-code sre terraform
Last synced: 13 Jul 2025
https://github.com/exospherehost/ai-reliability-standards
Architectural standards and best practices for building reliable AI Agents and LLM workflows. Defining the framework for AI Reliability Engineering (AIRE).
ai ai-agents ai-reliability aiops durable-execution enterprise evals evaluation observability reliability-engineering sre
Last synced: 15 Feb 2026
https://github.com/newrelic-experimental/nr1-command-center-v2
Consolidated view of incidents, anomalies, and issues across all accessible accounts
alerts anomalies issues nrai nrlabs nrlabs-viz ops sre
Last synced: 28 Nov 2025
https://github.com/amaurybsouza/my-github-actions
🪐🤖🚀 An awesome list of useful GitHub Actions, with workflow examples and real-world use cases for everyday use.
actions ansible aws azure azure-devops cicd cloud devops devops-pipeline devops-tools github github-actions gitlab infrastructure-as-code kubernetes-deployment pipeline sre terraform
Last synced: 30 Dec 2025
https://github.com/revengai/reai-cutter
RevEng.AI Cutter Plugin
artificial-intelligence cutter machine-learning radare2 radare2-plugin reverse-engineering rizin sre
Last synced: 11 Jul 2025
https://github.com/tedilabs/helm-charts
♻️ Repository for Reusable Kubernetes Helm Charts
devops gitops hacktoberfest helm helm-charts k8s kubernetes lang-yaml sre tedilabs
Last synced: 01 Mar 2026
https://github.com/rajatguptarg/samantha
Bot for SRE and DevOps
ansible automation bot ci-cd devops slackapi slackbot sre
Last synced: 19 Jan 2026
https://github.com/andrewaylett/self-throttle
Helps clients to not overwhelm the services they call
Last synced: 05 Feb 2026
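One well-known way for a client to avoid overwhelming the services it calls is the client-side adaptive throttling described in Google's SRE book (chapter "Handling Overload"): the client rejects requests locally with probability max(0, (requests − K × accepts) / (requests + 1)). The sketch below implements that formula; it is not taken from self-throttle, whose actual mechanism may differ. The class and method names are hypothetical, K = 2 is the value the book suggests, and the book's trailing two-minute window is omitted here for brevity (the counters never decay).

```python
import random
from dataclasses import dataclass, field


@dataclass
class AdaptiveThrottle:
    """Client-side adaptive throttling (simplified: counters never decay)."""

    k: float = 2.0       # multiplier; higher K tolerates more backend rejections
    requests: int = 0    # attempts made by the application layer (incl. local rejects)
    accepts: int = 0     # attempts the backend actually accepted
    rng: random.Random = field(default_factory=random.Random)

    def reject_probability(self) -> float:
        return max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))

    def allow(self) -> bool:
        """Decide whether to send a request; the attempt is counted either way."""
        p = self.reject_probability()
        self.requests += 1
        return self.rng.random() >= p

    def record_accept(self) -> None:
        """Call when the backend accepted (did not reject) a request."""
        self.accepts += 1
```

Usage follows the pattern `if throttle.allow(): send(); if accepted: throttle.record_accept()`. While the backend accepts nearly everything the rejection probability stays at zero; once it starts refusing, the client sheds load locally instead of adding retries to an already overloaded service.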
https://github.com/alivzh/rahbia-live-coding
In the RahBia Live Coding Series, we'll walk through a complete DevOps journey from start to finish. Together, we'll cover every step, from initial server configuration to the final production-ready deployment of the service that Mr. AhmadRafiee is hosting.
argocd ceph cicd docker elk gitops grafana haproxy linux observability openstack prometheus sre sre-team terraform traefik
Last synced: 10 Apr 2025
https://github.com/rootly-ai-labs/gmcq-benchmark
An evaluation benchmark for language models' ability to understand code well enough to close pull requests.
ai benchmark evals evaluation-metrics llm sre
Last synced: 25 Feb 2026
https://github.com/mathisve/aiosre
All In One SRE Docker Container
aws docker hacktoberfest kubernetes sre
Last synced: 26 Feb 2026
https://github.com/anqorithm/fastapi-helm-chart
This repository contains a Helm chart for deploying a FastAPI application on OpenShift clusters with minimal effort and customizable configuration.
automation cicd deployment devops docker fastapi helm k8s kubernetes oc openshift poetry python sre
Last synced: 12 Feb 2026
https://github.com/usmanmern/semester-4
Semester4 Books Repo - GCUF SE: Access study materials for Computer Networking, OS, Design and Algorithm, DBMS, and Software Requirement Engineering. Excel in your studies! 📚
computer-networking operating-system os sre
Last synced: 02 Mar 2025
https://github.com/getbettr/www
Source code for my personal homepage.
advent-of-code adventure devops journal kubernetes rust sre technology
Last synced: 14 Feb 2026
https://github.com/ehsaniara/delay-box
This tool simplifies the workflow by removing the need to write code to handle Redis and Kafka development complexities. It manages these tasks for you through straightforward REST calls.
delayed-job devops distributed-systems kafka redis scheduled-tasks sre
Last synced: 01 Mar 2026
https://github.com/geekxflood/prometheus-inventory-manager
PRIM exports all Prometheus metrics and rules set on your Prometheus instance to CSV.
monitoring observability prometheus reporting sre
Last synced: 04 Mar 2026
https://github.com/maestre3d/k8s-microservices-sample
A sample platform using Kubernetes (K8s) to manage a set of container-based microservices clusters and web clients written in Java, Golang, Elixir, Rust, Javascript (+ NodeJS) and Python.
elixir golang java javascript kubernetes microservices pyhton rust sre
Last synced: 02 Apr 2025
https://github.com/omarmfathy219/k8s-stuck-pod-cleaner
A lightweight, automated solution to resolve one of the most common operational issues in Kubernetes: pods stuck in Terminating state.
cron-job devops helm-chart k8s kubernetes kubernetes-automation kubernetes-operator pods sre stuck-pods terminating
Last synced: 15 Oct 2025
https://github.com/lucasepe/resto
A minimalist CLI REST client that calls APIs, waits for conditions, and retries intelligently.
command-line devops expression-evaluator jq kubernetes rest-client retry sre tooling
Last synced: 23 Jun 2025
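"Waits for conditions, and retries intelligently" typically means polling a call until a predicate holds, with exponential backoff and jitter between attempts. The following is a generic sketch of that pattern, not resto's implementation; the `retry_until` name, its parameters, and the defaults (5 attempts, 0.5 s base delay, 10 s cap) are hypothetical, and treating every exception as retryable is a simplification.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_until(
    call: Callable[[], T],
    condition: Callable[[T], bool],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` until `condition(result)` is true, backing off between attempts.

    Exceptions raised by `call` count as failed attempts (a simplification).
    Raises TimeoutError if the condition is not met within `max_attempts`.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            result = call()
            if condition(result):
                return result
        except Exception as exc:
            last_error = exc
        if attempt < max_attempts - 1:
            # Exponential backoff with "full jitter": sleep a random time
            # between 0 and the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    raise TimeoutError(f"condition not met after {max_attempts} attempts") from last_error
```

Jitter spreads retries from many clients over time, so a recovering API is not hit by synchronized waves of requests; the injectable `sleep` parameter exists only to make the sketch easy to test.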
https://github.com/sredevopsorg/.github
Site Reliability Engineering (SRE), DevOps, DevSecOps, Cloud Native, Linux, AI, ML, Open Source, and Platform Engineering in Spanish, Portuguese (Brazil), and English
community devops kubernetes linux open-source organization platform-engineering site-reliability-engineering sre
Last synced: 18 Jan 2026
https://github.com/katavinanguyen/data-center-staffing-optimization-simulator
Simulates incident handling in data centers using Python and SimPy. Analyze how staffing levels, shift timing, and triage rules affect SLA compliance, resolution time, and backlog size.
critical-infrastructure data-center discrete-event-simulation incident-management noc operations-research python simpy simulation sla-monitoring sre staffing-optimization
Last synced: 28 Jul 2025
https://github.com/apiaryio/heroku-datadog-drain
Funnel metrics from multiple Heroku apps into Datadog using statsd
Last synced: 20 Jan 2026
https://github.com/centerdevice/ceres-lambda
SRE Tool for CenterDevice - AWS Lambda Functions
Last synced: 10 Aug 2025
https://github.com/tedilabs/terraform-aws-vpn
🌳 A sustainable Terraform Package which creates VPN resources (Client VPN, Site-to-Site VPN) on AWS
aws aws-client-vpn aws-site-to-site-vpn aws-vpn devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 09 Apr 2025
https://github.com/arun0009/go-resilience-mock
Chaos engineering in a box. A high-performance mock server to test your API's resilience against latency, failures, and resource exhaustion
chaos-engineering cpu-stress fault-injection go golang http-mock mock-server observability prometheus resilience-testing sre
Last synced: 13 Jan 2026
https://github.com/efcloud/sre-docker-digger
Docker image for a small tool that checks connectivity.
docker docker-image infrastructure sre
Last synced: 11 Mar 2025
https://github.com/aruizeac/k8s-microservices-sample
A sample platform using Kubernetes (K8s) to manage a set of container-based microservices clusters and web clients written in Java, Golang, Elixir, Rust, Javascript (+ NodeJS) and Python.
elixir golang java javascript kubernetes microservices pyhton rust sre
Last synced: 22 Jun 2025
https://github.com/boshu2/12-factor-agentops
DevOps + SRE principles for operating LLM applications reliably at scale. Complementary to 12-Factor Agents for building
12-factor agent-orchestration agentops agents ai-agents ai-agents-framework ai-operations argocd context-engineering devops flux gitops infrastructure-as-code kubernetes kyverno llm openshift platform-engineering production-operations sre
Last synced: 05 Nov 2025
https://github.com/jnbdz/sre-quickstarts
Software Reverse Engineering (SRE) Quickstarts!
disassembler linux quickstart quickstarts reverse-engineering software-analysis software-reverse-engineering sre
Last synced: 28 Feb 2025
https://github.com/meysam81/liveness-check
Kubernetes-native health checker that automatically finds and verifies your latest pods are ready before considering deployments successful - perfect for preview environments
ci-cd cli-tool container-native cross-platform devops docker golang health-check http-client kubernetes kubernetes-health liveness-probe microservices monitoring preview-environments readiness-probe single-binary site-reliability sre zero-dependencies
Last synced: 28 Oct 2025
https://github.com/admodev/my-dockerfiles
Dockerfiles I use on a daily basis. Useful for SRE and DevOps engineers.
devops docker dockerfile dockerfiles engineering image images sre
Last synced: 26 Aug 2025
https://github.com/woodprogrammer/skript
A shell script wrapper written in Python
Last synced: 30 Mar 2025
https://github.com/kmadsdev/devops-challenge
PicPay Jr Devops Challenge Solution
api cors devops docker go html-css javascript microservices python redis sre
Last synced: 13 Jan 2026
https://github.com/glasnostic/helm-charts
Glasnostic Helm Chart Repository
control devops helm-charts k8s kubernetes sre
Last synced: 17 Jan 2026
https://github.com/brunopadz/memcached-ok
A simple way to test the connection to memcached
infrastructure memcached site-reliability-engineering sre
Last synced: 05 Oct 2025
https://github.com/meysam81/prometheus-command-timer
Run any command, time its execution and push the metrics to Prometheus Pushgateway
cli-tool command-line command-wrapper container-metrics cross-platform devops docker execution-time golang job-monitoring kubernetes metrics metrics-collector monitoring observability performance-monitoring prometheus pushgateway sre time-tracking
Last synced: 10 Mar 2025
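The idea behind prometheus-command-timer (time a command, then hand the result to the Pushgateway) can be sketched with the standard library: run the command, measure wall-clock time, and render the result in the Prometheus text exposition format, which is the body the Pushgateway accepts on `/metrics/job/<job_name>`. This is not the tool's Go code; the metric names are hypothetical, and the sketch stops short of the HTTP push so it stays self-contained.

```python
import subprocess
import sys
import time


def time_command(argv: list[str]) -> tuple[int, float]:
    """Run argv to completion and return (exit_code, wall-clock seconds)."""
    start = time.monotonic()  # monotonic clock: unaffected by system clock changes
    completed = subprocess.run(argv)
    return completed.returncode, time.monotonic() - start


def render_metrics(exit_code: int, duration: float) -> str:
    """Render the results in the Prometheus text exposition format."""
    lines = [
        "# TYPE command_duration_seconds gauge",
        f"command_duration_seconds {duration:.6f}",
        "# TYPE command_exit_code gauge",
        f"command_exit_code {exit_code}",
        "# TYPE command_last_run_timestamp_seconds gauge",
        f"command_last_run_timestamp_seconds {time.time():.3f}",
    ]
    return "\n".join(lines) + "\n"  # the format requires a trailing newline


if __name__ == "__main__":
    code, secs = time_command(sys.argv[1:] or [sys.executable, "-c", "pass"])
    # A real tool would POST this body to <pushgateway>/metrics/job/<job>.
    print(render_metrics(code, secs), end="")
```

Exporting the exit code and a last-run timestamp alongside the duration lets an alert catch both failing runs and jobs that silently stopped running.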
https://github.com/lucasloureiror/slh
Service Level Helper is a CLI tool for calculating Service Level related metrics like SLO, SLA, Error Budgets and probing frequency.
availability devops golang sla slo sre
Last synced: 06 Feb 2026
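The core error-budget arithmetic that Service Level Helper works with is error budget = (1 − SLO) × window: a 99.9% availability SLO over 30 days allows 0.001 × 43,200 minutes = 43.2 minutes of unavailability. A minimal sketch of that calculation follows; it is not slh's code, the function names are hypothetical, and it covers only time-based availability SLOs (not SLA terms or probing frequency).

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed unavailability for an availability SLO (e.g. 0.999) over a window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return window * (1 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> float:
    """Fraction of the error budget still unspent (negative once it is exhausted)."""
    return 1 - downtime / error_budget(slo, window)
```

For example, `error_budget(0.999, timedelta(days=30))` is 43 minutes 12 seconds, and after 10.8 minutes of downtime in that window, `budget_remaining` reports 0.75 of the budget left.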
https://github.com/meysam81/healthchecks-client
🏥 A production-ready CLI tool for monitoring HTTP endpoints and automatically reporting success/failure to healthchecks.io. Single binary & cross-platform.
alerting cli devops golang golang-cli healthcheck healthchecks http-monitoring infrastructure-monitoring microservices-monitoring monitoring observability ping-monitoring production-monitoring service-health service-monitoring site-reliability-engineering sre status-monitoring uptime-monitoring
Last synced: 07 Aug 2025
https://github.com/toolsascode/gomodeler-action
GitHub Action for GoModeler
ci cloud devops github-actions golang gomodeler gotemplate pltaform sre summary template
Last synced: 04 Sep 2025
https://github.com/sergkondr/fake-web-service
fake web service for testing purposes
golang kubernetes sre testing web-service
Last synced: 02 Mar 2026
https://github.com/cantrellr/ultimate-k8s-toolbox
🛠️ Comprehensive Kubernetes administration workstation with 50+ pre-installed tools. Deploy a fully-equipped debugging pod directly into your cluster. Air-gapped ready.
air-gapped cloud-native debugging devops helm helm-chart k8s k9s kubectl kubernetes mongosh offline platform-engineering sre toolbox troubleshooting
Last synced: 13 Jan 2026
https://github.com/learn-software-engineering/website
Learn-Software.com Website
blog devops github-pages golang hugo kubernetes platform-engineering programming python site-reliability-engineering software software-engineering sre website
Last synced: 02 Nov 2025
https://github.com/nusnewob/kube-changejob
A Kubernetes operator that triggers Jobs when specific Kubernetes resources change
automation controller-runtime crd devops golang jobs kubernetes kubernetes-operator sre
Last synced: 16 Jan 2026
https://github.com/kintsdev/automountify
Automountify is a Go-based CLI tool to format, mount disks, and update /etc/fstab for persistent mounting
Last synced: 27 Mar 2025
https://github.com/meysam81/terraform-modules
automation ci-cd cloud-infrastructure cloud-native deployment-automation devops github-actions github-repositories github-workflow iac infrastructure-as-code infrastructure-management kubernetes platform-engineering site-reliability-engineering sre terraform terraform-best-practices terraform-configuration terraform-modules
Last synced: 03 Apr 2025
https://github.com/rmkraus/demo-ansible-monitoring
Demo Builder - Automated Issue Remediation with Zabbix + Ansible
Last synced: 21 Aug 2025
https://github.com/chukwuemekaaham/cloud-gcp-projects
Google Cloud Platform Projects, Workshop Training and Skill Badge
anthos big-data-analytics case-study cloud-identity cloud-infrastructure cloudbuild data-engineering devsecops gcp grafana-dashboard landing-zone migration mlops prometheus service-account spinnaker sre terraform vpn
Last synced: 27 Feb 2025
https://github.com/itsfoss0/alx-backend
Backend Engineering concepts, projects and resources at ALX Africa
alx-africa alx-backend backend backend-api sre
Last synced: 09 Oct 2025