SRE
Site reliability engineering (SRE) is a set of principles and practices that applies aspects of software engineering to infrastructure and operations problems. Its main goal is to create scalable and highly reliable software systems. SRE is closely related to DevOps, a set of practices that combines software development and IT operations, and has also been described as a specific implementation of DevOps.
- GitHub: https://github.com/topics/sre
- Wikipedia: https://en.wikipedia.org/wiki/Site_reliability_engineering
- Aliases: site-reliability-engineering
- Last updated: 2026-06-30 00:25:53 UTC
https://github.com/suhasramanand/predictive-reliability-platform
End-to-end predictive reliability platform with anomaly detection, auto-remediation, and comprehensive observability for microservices
anomaly-detection auto-remediation chaos-engineering devops docker fastapi grafana kubernetes microservices monitoring observability predictive-maintenance prometheus python react site-reliability sre typescript
Last synced: 08 Apr 2026
https://github.com/mrwogu/portguard
Port Monitoring Health Check Service
devops go golang health-check http-service kubernetes load-balancer monitoring port-monitoring sre systemd tcp
Last synced: 27 Jan 2026
https://github.com/mizcausevic-dev/kinetic-gain-operator-console
Mission-control operator console for the Kinetic Gain Protocol Suite — interactive topology mesh, configurable SRE operator dashboard, audit-stream visualization, PDF export. Deploys to console.kineticgain.com.
ai-governance audit-stream dataviz kinetic-gain kinetic-gain-protocol-suite operator-console react sre topology typescript vite
Last synced: 01 Jun 2026
https://github.com/mizcausevic-dev/slo-budget-tracker
SLO + error-budget tracker for Python services. FastAPI middleware, Prometheus exporter, multi-window burn-rate alerts. Part of the Platform Reliability Stack.
asgi burn-rate error-budget fastapi monitoring prometheus python reliability slo sre
Last synced: 01 Jun 2026
https://github.com/mizcausevic-dev/request-shadow-rs
Async request mirroring with sampling, divergence detection, and structured response diffs. The SRE primitive for safe migrations. Part of the Platform Reliability Stack.
async diff migration mirror reliability rust shadow sre tokio
Last synced: 01 Jun 2026
https://github.com/mizcausevic-dev/mcp-reliability-toolkit
MCP server exposing SLO math + reliability config recipes. Compute burn rate, size rate limiters, pick breaker thresholds, get drop-in Python and Rust configs back. Part of the Platform Reliability Stack.
circuit-breaker claude kinetic-gain mcp model-context-protocol rate-limiter reliability slo sre typescript
Last synced: 01 Jun 2026
https://github.com/mizcausevic-dev/latency-budget-enforcer
Go policy engine for latency budget enforcement, dependency drag review, tail-latency breaches, and operator-facing service-path response planning
backend go golang governance latency net-http observability performance-engineering platform-engineering policy-engine sre
Last synced: 01 Jun 2026
https://github.com/mizcausevic-dev/agent-canary
Progressive rollout, shadow mode, and auto-rollback for AI agents. Sticky-percent routing with promote/rollback gates driven by real metrics. Platform engineering reliability for the agent era.
ai-agents canary deployment feature-flags platform-engineering progressive-rollout python reliability shadow-deployment sre
Last synced: 01 Jun 2026
https://github.com/mizcausevic-dev/rate-limit-shield
Production-grade rate limiting, circuit breaking, and retry shaping for LLM APIs. Token bucket + breaker + jittered backoff with HTTP 429 / Retry-After awareness.
anthropic circuit-breaker llm llmops openai python rate-limiting reliability retry-policy sre
Last synced: 01 Jun 2026
https://github.com/mizcausevic-dev/observability-incident-command-api
TypeScript API for incident severity analysis, escalation routing, responder visibility, and operational incident-command workflows.
backend express incident-response nodejs openapi platform-engineering sre typescript
Last synced: 01 Jun 2026
https://github.com/mizcausevic-dev/grpc-mesh-shadow
Typed gRPC shadow traffic client. Mirrors requests from a stable primary to an under-test candidate; diffs responses asynchronously; returns the primary to your caller. Sampling, timeouts, pluggable sinks. bufconn-tested.
ai-governance canary golang grpc platform-engineering protobuf service-mesh shadow-traffic sre
Last synced: 01 Jun 2026
https://github.com/rbryce90/linux-time-machine
Local-first Linux observability with historical scrubbing, semantic journald search, and an MCP server for Claude-driven investigation. Go + SQLite + Ollama embeddings.
bubbletea embeddings golang linux local-first mcp model-context-protocol observability ollama rag sre systems-monitoring time-series tui
Last synced: 22 May 2026
https://github.com/akintunero/netdiag
Stdlib-only CLI for SRE on-call: traceroute, DNS, ping, TLS, VPN checks, and incident presets. JSON output, stable exit codes.
automation cli command-line-tools devops dns json network-diagnostics networking on-call python sre sysadmin traceroute troubleshooting vpn
Last synced: 21 Jun 2026
https://github.com/toolsascode/scoop-bucket
Scoop bucket for the official GoModeler CLI
cli cloud devops golang gotemplate scoop sre
Last synced: 20 Oct 2025
https://github.com/volkv/server-pulse
Lightweight Linux server monitoring with Telegram alerts. CPU, RAM, disk, load, Docker, OOM. Pure bash, systemd timer, no daemon.
alerting bash dedicated-server devops disk-space docker homelab linux-monitoring monitoring oom-killer self-hosted server-monitoring shell-script sre systemd telegram-alerts telegram-bot vps
Last synced: 21 Jun 2026
https://github.com/ranching-farm/k8s-agent
Kubernetes agent for deploying ranching.farm directly into your cluster. Connect your K8s deployment to our AI-powered management platform with a single line of code.
ai-assistant ai-assisted cluster-management devops helm k8s kubectl kubernetes kustomize ranching-farm sre
Last synced: 03 Feb 2026
https://github.com/briancain/cats-as-a-service
This is a helper repo used during role-playing-based incident training.
cat cats dnd incident-response roleplay sre sre-infrastructure
Last synced: 28 Jan 2026
https://github.com/aliariff/argus
Tool to export WebPageTest results into InfluxDB.
devops grafana influxdb monitoring performance python sre webpagetest
Last synced: 18 Apr 2026
https://github.com/ramesh-852000/devops-practices-and-interview-prep
A collection of DevOps practices, scripts, interview questions, and real-world examples covering Linux, Jenkins, AWS, Kubernetes, Docker, Ansible, Terraform, CI/CD pipelines, Monitoring, and Cloud Platforms.
ansible aws azure cloud devops docker elastic gcp interview-questions jenkins kubernetes linux nosql prometheus sql sre terraform
Last synced: 04 Apr 2026
https://github.com/curiouslearner/cache_sniper
A small utility to detect page caching on CDNs
cache cache-invalidation devops-tools rust rust-lang sre
Last synced: 28 Oct 2025
https://github.com/macbre/http-shadow
Compares HTTP responses from two different backends
Last synced: 20 Jul 2025
https://github.com/tedilabs/terraform-aws-quicksight
🌳 A sustainable Terraform Package to manage QuickSight resources on AWS
aws aws-data aws-quicksight devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules
Last synced: 19 May 2026
https://github.com/fsaintjacques/survivalkit
A survival kit is a package of basic tools and supplies prepared in advance as an aid to survival in an emergency.
c health-check healthcheck logger monitoring sre
Last synced: 21 Mar 2025
https://github.com/cleancloud-io/scan-action
GitHub Action for CleanCloud — read-only cloud hygiene scanner for AWS, Azure and GCP
aws azure cloud cloud-computing cost-optimization devops finops hygiene sre
Last synced: 07 Apr 2026
https://github.com/altikva/spero
Self-healing supervision agent for Linux hosts and Kubernetes
agentic-ai aiops anomaly-detection asyncio automation devops fastapi homelab incident-response kubernetes llm monitoring observability python self-healing self-hosted sre ssh supervision
Last synced: 21 Jun 2026