An open API service indexing awesome lists of open source software.

SRE

Site reliability engineering (SRE) is a set of principles and practices that applies aspects of software engineering to infrastructure and operations problems. Its main goal is to create scalable and highly reliable software systems. SRE is closely related to DevOps, a set of practices that combines software development and IT operations, and has also been described as a specific implementation of DevOps.

https://github.com/suhasramanand/predictive-reliability-platform

End-to-end predictive reliability platform with anomaly detection, auto-remediation, and comprehensive observability for microservices

anomaly-detection auto-remediation chaos-engineering devops docker fastapi grafana kubernetes microservices monitoring observability predictive-maintenance prometheus python react site-reliability sre typescript

Last synced: 08 Apr 2026

https://github.com/mizcausevic-dev/kinetic-gain-operator-console

Mission-control operator console for the Kinetic Gain Protocol Suite — interactive topology mesh, configurable SRE operator dashboard, audit-stream visualization, PDF export. Deploys to console.kineticgain.com.

ai-governance audit-stream dataviz kinetic-gain kinetic-gain-protocol-suite operator-console react sre topology typescript vite

Last synced: 01 Jun 2026

https://github.com/mizcausevic-dev/slo-budget-tracker

SLO + error-budget tracker for Python services. FastAPI middleware, Prometheus exporter, multi-window burn-rate alerts. Part of the Platform Reliability Stack.

asgi burn-rate error-budget fastapi monitoring prometheus python reliability slo sre

Last synced: 01 Jun 2026
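
To illustrate the multi-window burn-rate alerting named in the slo-budget-tracker description above, here is a minimal Python sketch. It is not taken from that repository: the function names, the 99.9% SLO target, and the 14.4 threshold (a commonly cited value for a 1-hour/5-minute window pair) are assumptions chosen for illustration.

```python
# Hypothetical helpers for illustration only; not the slo-budget-tracker API.

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is being spent.

    1.0 means errors arrive at exactly the rate the SLO allows;
    2.0 means the budget would run out in half the SLO period.
    """
    budget = 1.0 - slo_target  # allowed fraction of failed requests
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_ratio: float,
                short_window_ratio: float,
                slo_target: float = 0.999,   # assumed SLO
                threshold: float = 14.4) -> bool:  # assumed threshold
    """Alert only when both windows burn faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)
```

With these assumed defaults, a 2% error ratio in both windows corresponds to a burn rate of about 20 and would page, while a 2% long-window ratio paired with a 0.1% short-window ratio would not, because the short window indicates the incident has already subsided.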

https://github.com/mizcausevic-dev/request-shadow-rs

Async request mirroring with sampling, divergence detection, and structured response diffs. The SRE primitive for safe migrations. Part of the Platform Reliability Stack.

async diff migration mirror reliability rust shadow sre tokio

Last synced: 01 Jun 2026

https://github.com/mizcausevic-dev/mcp-reliability-toolkit

MCP server exposing SLO math + reliability config recipes. Compute burn rate, size rate limiters, pick breaker thresholds, get drop-in Python and Rust configs back. Part of the Platform Reliability Stack.

circuit-breaker claude kinetic-gain mcp model-context-protocol rate-limiter reliability slo sre typescript

Last synced: 01 Jun 2026

https://github.com/mizcausevic-dev/latency-budget-enforcer

Go policy engine for latency budget enforcement, dependency drag review, tail-latency breaches, and operator-facing service-path response planning

backend go golang governance latency net-http observability performance-engineering platform-engineering policy-engine sre

Last synced: 01 Jun 2026

https://github.com/mizcausevic-dev/agent-canary

Progressive rollout, shadow mode, and auto-rollback for AI agents. Sticky-percent routing with promote/rollback gates driven by real metrics. Platform engineering reliability for the agent era.

ai-agents canary deployment feature-flags platform-engineering progressive-rollout python reliability shadow-deployment sre

Last synced: 01 Jun 2026
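
The sticky-percent routing mentioned in the agent-canary description can be sketched as follows. This is one common way to get sticky assignment (hashing a stable key into 100 buckets) and is an assumption, not necessarily how that project implements it; `bucket_for`, `route`, and the salt value are hypothetical names.

```python
# Hypothetical sketch of sticky percentage routing; not the agent-canary API.
import hashlib


def bucket_for(key: str, salt: str = "rollout-1") -> int:
    """Map a stable key (for example a user or session ID) to a bucket 0..99.

    Hashing makes the assignment deterministic, so the same key always
    lands in the same bucket for a given salt.
    """
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(key: str, canary_percent: int) -> str:
    """Send the key to the canary if its bucket falls below the rollout percent."""
    return "canary" if bucket_for(key) < canary_percent else "stable"
```

Because a key's bucket never changes, raising `canary_percent` only adds keys to the canary group; a key already routed to the canary at 10% stays there at 50%, and setting the percent to 0 acts as a rollback.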

https://github.com/mizcausevic-dev/rate-limit-shield

Production-grade rate limiting, circuit breaking, and retry shaping for LLM APIs. Token bucket + breaker + jittered backoff with HTTP 429 / Retry-After awareness.

anthropic circuit-breaker llm llmops openai python rate-limiting reliability retry-policy sre

Last synced: 01 Jun 2026
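
As a rough illustration of two of the techniques named in the rate-limit-shield description, a token bucket and jittered backoff that respects Retry-After, here is a minimal Python sketch. It is not that library's API: the class, function, and parameter names are hypothetical, and "full jitter" is one of several jitter strategies, chosen here as an assumption.

```python
# Hypothetical sketch for illustration only; not the rate-limit-shield API.
import random
import time
from typing import Callable, Optional


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = capacity      # start full
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  retry_after: Optional[float] = None,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter exponential backoff, in seconds.

    If the server sent a Retry-After value (already parsed to seconds),
    it is treated as a lower bound on the delay.
    """
    delay = rng() * min(cap, base * (2 ** attempt))
    if retry_after is not None:
        delay = max(delay, retry_after)
    return delay
```

A caller would typically check `allow()` before sending a request and, on an HTTP 429 response, sleep for `backoff_delay(attempt, retry_after=...)` before retrying.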

https://github.com/mizcausevic-dev/observability-incident-command-api

TypeScript API for incident severity analysis, escalation routing, responder visibility, and operational incident-command workflows.

backend express incident-response nodejs openapi platform-engineering sre typescript

Last synced: 01 Jun 2026

https://github.com/mizcausevic-dev/grpc-mesh-shadow

Typed gRPC shadow traffic client. Mirrors requests from a stable primary to an under-test candidate; diffs responses asynchronously; returns the primary to your caller. Sampling, timeouts, pluggable sinks. bufconn-tested.

ai-governance canary golang grpc platform-engineering protobuf service-mesh shadow-traffic sre

Last synced: 01 Jun 2026

https://github.com/rbryce90/linux-time-machine

Local-first Linux observability with historical scrubbing, semantic journald search, and an MCP server for Claude-driven investigation. Go + SQLite + Ollama embeddings.

bubbletea embeddings golang linux local-first mcp model-context-protocol observability ollama rag sre systems-monitoring time-series tui

Last synced: 22 May 2026

https://github.com/akintunero/netdiag

Stdlib-only CLI for SRE on-call: traceroute, DNS, ping, TLS, VPN checks, and incident presets. JSON output, stable exit codes.

automation cli command-line-tools devops dns json network-diagnostics networking on-call python sre sysadmin traceroute troubleshooting vpn

Last synced: 21 Jun 2026

https://github.com/toolsascode/scoop-bucket

Scoop bucket for official GoModeler CLI

cli cloud devops golang gotemplate scoop sre

Last synced: 20 Oct 2025

https://github.com/volkv/server-pulse

Lightweight Linux server monitoring with Telegram alerts. CPU, RAM, disk, load, Docker, OOM. Pure bash, systemd timer, no daemon.

alerting bash dedicated-server devops disk-space docker homelab linux-monitoring monitoring oom-killer self-hosted server-monitoring shell-script sre systemd telegram-alerts telegram-bot vps

Last synced: 21 Jun 2026

https://github.com/ranching-farm/k8s-agent

Kubernetes agent for deploying ranching.farm directly into your cluster. Connect your K8s deployment to our AI-powered management platform with a single line of code.

ai-assistant ai-assisted cluster-management devops helm k8s kubectl kubernetes kustomize ranching-farm sre

Last synced: 03 Feb 2026

https://github.com/briancain/cats-as-a-service

A helper repo used during role-playing-based incident training.

cat cats dnd incident-response roleplay sre sre-infrastructure

Last synced: 28 Jan 2026

https://github.com/aliariff/argus

Tool to export WebPageTest results into InfluxDB.

devops grafana influxdb monitoring performance python sre webpagetest

Last synced: 18 Apr 2026

https://github.com/ramesh-852000/devops-practices-and-interview-prep

A collection of DevOps practices, scripts, interview questions, and real-world examples covering Linux, Jenkins, AWS, Kubernetes, Docker, Ansible, Terraform, CI/CD pipelines, Monitoring, and Cloud Platforms.

ansible aws azure cloud devops docker elastic gcp interview-questions jenkins kubernetes linux nosql prometheus sql sre terraform

Last synced: 04 Apr 2026

https://github.com/curiouslearner/cache_sniper

A small utility to detect page caching on CDNs

cache cache-invalidation devops-tools rust rust-lang sre

Last synced: 28 Oct 2025

https://github.com/apiaryio/blackhole

App returning HTTP code 429

sre

Last synced: 26 Jun 2025

https://github.com/macbre/http-shadow

Compares HTTP responses from two different backends

sre sus

Last synced: 20 Jul 2025

https://github.com/fsaintjacques/survivalkit

A survival kit is a package of basic tools and supplies prepared in advance as an aid to survival in an emergency.

c health-check healthcheck logger monitoring sre

Last synced: 21 Mar 2025

https://github.com/cleancloud-io/scan-action

GitHub Action for CleanCloud — read-only cloud hygiene scanner for AWS, Azure and GCP

aws azure cloud cloud-computing cost-optimization devops finops hygiene sre

Last synced: 07 Apr 2026