An open API service indexing awesome lists of open source software.

SRE

Site reliability engineering (SRE) is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.

https://github.com/newrelic-experimental/nr1-command-center-v2

Consolidated view of incidents, anomalies, and issues across all accessible accounts

alerts anomalies issues nrai nrlabs nrlabs-viz ops sre

Last synced: 08 Jun 2026

https://github.com/usmanmern/semester-4

Semester4 Books Repo - GCUF SE: Access study materials for Computer Networking, OS, Design and Algorithm, DBMS, and Software Requirement Engineering. Excel in your studies! 📚

computer-networking operating-system os sre

Last synced: 10 May 2026

https://github.com/itsfoss0/alx-backend

Backend Engineering concepts, projects and resources at ALX Africa

alx-africa alx-backend backend backend-api sre

Last synced: 09 Oct 2025

https://github.com/centerdevice/ceres

SRE Tool for CenterDevice

cli sre

Last synced: 09 Apr 2025

https://github.com/efcloud/sre-docker-digger

Docker image for a small tool that checks connectivity.

docker docker-image infrastructure sre

Last synced: 11 Mar 2025

https://github.com/glasnostic/helm-charts

Glasnostic Helm Chart Repository

control devops helm-charts k8s kubernetes sre

Last synced: 17 Jan 2026

https://github.com/walmartdigital/hirings

Job offers at Walmart Chile

hiring jobs sre

Last synced: 17 Jan 2026

https://github.com/vsingh55/homelab-ops

A production-grade Hybrid Cloud Platform spanning On-Prem (Proxmox) and GCP. Engineered with Terraform, Ansible, K3s, and WireGuard Mesh to demonstrate Zero-Trust networking, FinOps, and SRE principles.

ansible automation devops finops gcp gitops grafana hybrid-cloud infrastructure-as-code kubernetes observability platform-engineering proxmox self-hosted sre terraform wireguard zero-trust

Last synced: 11 Apr 2026

https://github.com/inxbit/prismtty

Fast terminal output highlighter focused on network devices and Unix systems

ansi chromaterm cisco cli devops fortinet juniper network-tools networking rust sre ssh sysadmin terminal terminal-ui

Last synced: 30 May 2026

https://github.com/dingus-technology/DINGUS

Identify and solve bugs in your code by talking to your logs!

ai bugs deployment devops docker grafana infrastructure llm logging loki metrics monitoring openai prometheus python sre

Last synced: 31 Dec 2025

https://github.com/maestre3d/k8s-microservices-sample

A sample platform using Kubernetes (K8s) to manage a set of container-based microservice clusters and web clients written in Java, Golang, Elixir, Rust, JavaScript (+ Node.js), and Python.

elixir golang java javascript kubernetes microservices pyhton rust sre

Last synced: 02 Apr 2025

https://github.com/apiaryio/heroku-datadog-drain

Funnel metrics from multiple Heroku apps into DataDog using statsd

deprecated sre

Last synced: 20 Jan 2026

https://github.com/letusdevops/learngo

30-day Golang roadmap for DevOps, with exercises.

devops golang roadmap sre

Last synced: 21 Apr 2026

https://github.com/sredevopsorg/.github

Site Reliability Engineering (SRE), DevOps, DevSecOps, Cloud Native, Linux, AI, ML, Open Source, and Platform Engineering in Spanish, Portuguese (Brazil), and English

community devops kubernetes linux open-source organization platform-engineering site-reliability-engineering sre

Last synced: 18 Jan 2026

https://github.com/sharanch/inkwell-complete

Microservices blogging platform — Go services, React frontend, Kubernetes (Minikube), GitOps with ArgoCD, CI/CD via GitHub Actions. SRE/DevOps portfolio project.

argocd devops docker github-actions golang istio kubernetes microservices portfolio postgresql react redis sre

Last synced: 30 May 2026

https://github.com/arun0009/go-resilience-mock

Chaos engineering in a box. A high-performance mock server to test your API's resilience against latency, failures, and resource exhaustion

chaos-engineering cpu-stress fault-injection go golang http-mock mock-server observability prometheus resilience-testing sre

Last synced: 13 Jan 2026

https://github.com/cantrellr/ultimate-k8s-toolbox

🛠️ Comprehensive Kubernetes administration workstation with 50+ pre-installed tools. Deploy a fully-equipped debugging pod directly into your cluster. Air-gapped ready.

air-gapped cloud-native debugging devops helm helm-chart k8s k9s kubectl kubernetes mongosh offline platform-engineering sre toolbox troubleshooting

Last synced: 13 Jan 2026

https://github.com/admodev/my-dockerfiles

Dockerfiles I use on a daily basis. Useful for SRE and DevOps engineers.

devops docker dockerfile dockerfiles engineering image images sre

Last synced: 26 Aug 2025

https://github.com/brunopadz/memcached-ok

Simple way to test connection to memcached

infrastructure memcached site-reliability-engineering sre

Last synced: 05 Oct 2025

https://github.com/rmkraus/demo-ansible-monitoring

Demo Builder - Automated Issue Remediation with Zabbix + Ansible

ansible demo sre zabbix

Last synced: 21 Aug 2025

https://github.com/centerdevice/ceres-lambda

SRE Tool for CenterDevice - AWS Lambda Functions

aws lambda ops serverless sre

Last synced: 18 May 2026

https://github.com/charles-adedotun/kubepulse

Intelligent Kubernetes health monitoring with AI-powered diagnostics, predictive analytics, and auto-remediation

ai ai-agents automation claude cloud-native devops go kubernetes monitoring observability react sre typescript

Last synced: 14 Apr 2026

https://github.com/amaurybsouza/terraform-aws-ec2-ssh

Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable compute capacity in the cloud. It is one of the core services offered by Amazon Web Services (AWS) and provides a wide range of features and capabilities.

aws aws-ec2 devops devops-tools github github-actions infrastructure-as-code infrstructure sre terraform terraform-aws terraform-managed terraform-modules terraform-provider

Last synced: 21 Jan 2026

https://github.com/cyanheads/devops-status-mcp-server

Check vendor status pages, inspect SSL/TLS certificates, verify DNS propagation, and get incident-response playbooks via MCP. STDIO or Streamable HTTP.

ai-agents ai-tools cyanheads devops mcp mcp-server model-context-protocol monitoring sre status statuspage typescript

Last synced: 20 Jun 2026

https://github.com/katavinanguyen/data-center-staffing-optimization-simulator

Simulates incident handling in data centers using Python and SimPy. Analyze how staffing levels, shift timing, and triage rules affect SLA compliance, resolution time, and backlog size.

critical-infrastructure data-center discrete-event-simulation incident-management noc operations-research python simpy simulation sla-monitoring sre staffing-optimization

Last synced: 28 Jul 2025

https://github.com/konstruktoid/disruella

A very small digitalized primate responsible for randomly preventing something from continuing as usual or as expected.

chaos-engineering hacktoberfest high-availability python-black python3 resilience sre systemd test-automation

Last synced: 16 Feb 2026

https://github.com/toolsascode/gomodeler

Go Modeler is a small CLI and Library that brings the powerful features of the golang template into a simplified form.

ci cloud devops github-actions infra it pipeline platform sre

Last synced: 13 Oct 2025

https://github.com/omarmfathy219/k8s-stuck-pod-cleaner

A lightweight, automated solution to resolve one of the most common operational issues in Kubernetes: pods stuck in Terminating state.

cron-job devops helm-chart k8s kubernetes kubernetes-automation kubernetes-operator pods sre stuck-pods terminating

Last synced: 15 Oct 2025

https://github.com/woodprogrammer/skript

A shell script wrapper written in Python

bash python shell sre

Last synced: 14 Apr 2026

https://github.com/aruizeac/k8s-microservices-sample

A sample platform using Kubernetes (K8s) to manage a set of container-based microservice clusters and web clients written in Java, Golang, Elixir, Rust, JavaScript (+ Node.js), and Python.

elixir golang java javascript kubernetes microservices pyhton rust sre

Last synced: 08 Apr 2026

https://github.com/tiagotartari/observability-dotnet-opentelemetry-first-steps

This project demonstrates how to implement observability in .NET applications using OpenTelemetry.

dotnet dotnet8 logs metrics observability opentelemetry opentelemetry-collector opentelemetry-dotnet sre traces

Last synced: 20 Jan 2026

https://github.com/pfrederiksen/blast-radius

Local-first AWS dependency graph CLI to understand blast radius before changes

aws aws-sdk-go-v2 cli cloudops devops golang observability sre

Last synced: 25 Jan 2026

https://github.com/apiaryio/example-intersphinx-repo

This repository demonstrates using Intersphinx with indexes exported in a Docker volume

sre

Last synced: 26 Jun 2025

https://github.com/lucasloureiror/slh

Service Level Helper is a CLI tool for calculating Service Level related metrics like SLO, SLA, Error Budgets and probing frequency.

availability devops golang sla slo sre

Last synced: 06 Feb 2026

https://github.com/sergkondr/fake-web-service

Fake web service for testing purposes

golang kubernetes sre testing web-service

Last synced: 02 Mar 2026

https://github.com/nudgebee/nudgebee-docs

Public documentation for NudgeBee — AI-powered Kubernetes operations platform. Built with Docusaurus 3, published at docs.nudgebee.com.

devops documentation docusaurus kubernetes nudgebee sre

Last synced: 17 May 2026

https://github.com/toolsascode/protomagic

ProtoMagic is a CLI that helps convert database tables into Protocol Buffers files (.proto).

api cloud dev developer devops golang grpc opensource proto protobuf software sre

Last synced: 17 May 2026

https://github.com/ops4life/claude4ops

Production-ready DevOps superpowers for everyone. Streamline and automate complex workflows across AWS, GCP, Azure, and Kubernetes.

ai-devops anthropic aws azure cicd claude claude-code devops gcp helm iac incident-management infrastructure-as-code kubernetes monitoring observability platform-engineering sre terraform

Last synced: 05 Jun 2026

https://github.com/bigg01/claude-ci-agent

Autonomous Claude Code CI agent in a rootless Podman sandbox — GitLab CI & GitHub Actions, two personalities (read-write Agent, read-only Advisor), OpenTelemetry audit to Elastic, Helm chart for AKS/OpenShift.

ai-agent anthropic cert-manager ci-cd claude-code devops elasticsearch github-actions gitlab-ci helm kubernetes llm llm-gateway observability openbao openshift opentelemetry podman rootless sre

Last synced: 26 Jun 2026

https://github.com/kintsdev/automountify

Automountify is a Go-based CLI tool to format, mount disks, and update /etc/fstab for persistent mounting

go golang sre ubuntu

Last synced: 27 Mar 2025

https://github.com/tedilabs/terraform-http-modules

🌳 A sustainable Terraform package which manages useful data modules via the HTTP provider

devops hacktoberfest hcl2 http lang-hcl sre tedilabs terraform terraform-module terraform-modules

Last synced: 06 Jun 2026

https://github.com/clear-route/vault-client-count-exporter

A dead-simple Prometheus exporter to monitor Vault's client count for the entire cluster and each namespace

adoption prometheus sre vault

Last synced: 26 Jun 2026

https://github.com/imgautamm/srerepo

SRE Assessment Repo

dataengineering docker postgres python sre

Last synced: 30 Apr 2025

https://github.com/tedilabs/terraform-aws-vpn

🌳 A sustainable Terraform package which creates VPN resources (Client VPN, Site-to-Site VPN) on AWS

aws aws-client-vpn aws-site-to-site-vpn aws-vpn devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules

Last synced: 29 Apr 2026

https://github.com/polsebas/agente-admin-observabilidad

Automatic alert analysis system built with the Agno Framework and the Grafana Stack. Includes ObservabilityTeam (WatchdogAgent, TriageAgent, ReportAgent) and Quick Commands for real-time observability.

agno ai alerting devops grafana loki multi-agent observability prometheus sre tempo

Last synced: 02 May 2026

https://github.com/manishklach/ai-host-observability

Linux host observability toolkit for AI/GPU infrastructure, exposing Prometheus metrics for memory pressure, RDMA/NIC health, PCIe/VFIO, NUMA, GPUs, and kernel events.

ai-infrastructure ai-ops gpu gpu-monitoring infiniband kernel linux linux-monitoring mlx5 node-exporter numa nvidia observability pcie performance-engineering prometheus rdma rdma-monitoring sre vfio

Last synced: 09 Jun 2026

https://github.com/23seriy/devops-ai-workflows

Curated collection of AI-agent workflows, prompts & rules for DevOps/SRE — Kubernetes debugging, AWS audits, Terraform plan reviews, CI/CD triage, Dockerfile reviews, secrets scanning & incident response. Works with Windsurf, Cursor, Claude Code or any LLM.

ai-agent ai-workflows aws chatops cicd cursor devops docker incident-response kubernetes llm observability platform-engineering prompts security sre terraform windsurf

Last synced: 03 Jun 2026

https://github.com/shakibamoshiri/dq

Debug docker quickly using Docker Query

debugging devops-tools docker nodejs sre

Last synced: 09 May 2026

https://github.com/oaslananka-lab/mcp-ssh-tool

Production-grade MCP SSH automation server for secure remote command, file, tunnel, service, metrics, and policy-controlled operations over stdio/HTTP, with npm distribution, MCP Registry metadata, and ChatGPT app readiness.

automation chatgpt claude codex cursor devops infrastructure mcp mcp-server model-context-protocol nodejs npm-package openai remote-automation security sre ssh ssh-client typescript vscode

Last synced: 13 May 2026

https://github.com/nusnewob/kube-changejob

A Kubernetes operator that triggers Jobs when specific Kubernetes resources change

automation controller-runtime crd devops golang jobs kubernetes kubernetes-operator sre

Last synced: 16 Jan 2026

https://github.com/emdneto/playground

Scripts and some random stuff for study

aws k8s kafka sre terraform

Last synced: 12 Apr 2026

https://github.com/marchenkovit/brewfile

One-command MacBook Pro M3 setup — Homebrew packages, casks, VS Code extensions, shell config, macOS defaults, kubectl contexts. Idempotent install.sh skips apps already installed manually.

apple-silicon automation aws brewfile devops dotfiles homebrew idempotent installer jetbrains kubernetes m3 mac-setup macbook macos setup sre terraform vscode zsh

Last synced: 13 May 2026

https://github.com/excoriate/tfgenctl

tfgenctl is a CLI for generating Terraform code, for lazy folks like me.

cli devops ecs example sre tooling

Last synced: 30 Mar 2025

https://github.com/suhasramanand/predictive-reliability-platform

End-to-end predictive reliability platform with anomaly detection, auto-remediation, and comprehensive observability for microservices

anomaly-detection auto-remediation chaos-engineering devops docker fastapi grafana kubernetes microservices monitoring observability predictive-maintenance prometheus python react site-reliability sre typescript

Last synced: 08 Apr 2026

https://github.com/miare-ir/sreinterview

This repo holds the material for the technical step of our SRE interview process.

ansible celery django interview miare sre

Last synced: 06 May 2026

https://github.com/mizcausevic-dev/kinetic-gain-operator-console

Mission-control operator console for the Kinetic Gain Protocol Suite — interactive topology mesh, configurable SRE operator dashboard, audit-stream visualization, PDF export. Deploys to console.kineticgain.com.

ai-governance audit-stream dataviz kinetic-gain kinetic-gain-protocol-suite operator-console react sre topology typescript vite

Last synced: 01 Jun 2026

https://github.com/mizcausevic-dev/slo-budget-tracker

SLO + error-budget tracker for Python services. FastAPI middleware, Prometheus exporter, multi-window burn-rate alerts. Part of the Platform Reliability Stack.

asgi burn-rate error-budget fastapi monitoring prometheus python reliability slo sre

Last synced: 01 Jun 2026

https://github.com/mizcausevic-dev/request-shadow-rs

Async request mirroring with sampling, divergence detection, and structured response diffs. The SRE primitive for safe migrations. Part of the Platform Reliability Stack.

async diff migration mirror reliability rust shadow sre tokio

Last synced: 01 Jun 2026

https://github.com/mizcausevic-dev/mcp-reliability-toolkit

MCP server exposing SLO math + reliability config recipes. Compute burn rate, size rate limiters, pick breaker thresholds, get drop-in Python and Rust configs back. Part of the Platform Reliability Stack.

circuit-breaker claude kinetic-gain mcp model-context-protocol rate-limiter reliability slo sre typescript

Last synced: 01 Jun 2026

https://github.com/mizcausevic-dev/latency-budget-enforcer

Go policy engine for latency budget enforcement, dependency drag review, tail-latency breaches, and operator-facing service-path response planning

backend go golang governance latency net-http observability performance-engineering platform-engineering policy-engine sre

Last synced: 01 Jun 2026

https://github.com/mizcausevic-dev/agent-canary

Progressive rollout, shadow mode, and auto-rollback for AI agents. Sticky-percent routing with promote/rollback gates driven by real metrics. Platform engineering reliability for the agent era.

ai-agents canary deployment feature-flags platform-engineering progressive-rollout python reliability shadow-deployment sre

Last synced: 01 Jun 2026

https://github.com/mizcausevic-dev/rate-limit-shield

Production-grade rate limiting, circuit breaking, and retry shaping for LLM APIs. Token bucket + breaker + jittered backoff with HTTP 429 / Retry-After awareness.

anthropic circuit-breaker llm llmops openai python rate-limiting reliability retry-policy sre

Last synced: 01 Jun 2026

https://github.com/mizcausevic-dev/observability-incident-command-api

TypeScript API for incident severity analysis, escalation routing, responder visibility, and operational incident-command workflows.

backend express incident-response nodejs openapi platform-engineering sre typescript

Last synced: 01 Jun 2026

https://github.com/mizcausevic-dev/grpc-mesh-shadow

Typed gRPC shadow traffic client. Mirrors requests from a stable primary to an under-test candidate; diffs responses asynchronously; returns the primary to your caller. Sampling, timeouts, pluggable sinks. bufconn-tested.

ai-governance canary golang grpc platform-engineering protobuf service-mesh shadow-traffic sre

Last synced: 01 Jun 2026

https://github.com/ricoberger/ricoberger.de

Personal website with links to my LinkedIn, Xing, Twitter, GitHub, and Medium profiles.

cloud-native github gopher hacker linkedin medium site-reliability-engineer sre twitter xing

Last synced: 17 May 2026

https://github.com/rbryce90/linux-time-machine

Local-first Linux observability with historical scrubbing, semantic journald search, and an MCP server for Claude-driven investigation. Go + SQLite + Ollama embeddings.

bubbletea embeddings golang linux local-first mcp model-context-protocol observability ollama rag sre systems-monitoring time-series tui

Last synced: 22 May 2026

https://github.com/vfolgosa/bifrost-proxy

A lightweight Layer-7 Kafka proxy. Route traffic across clusters with port-based routing, SASL passthrough, and autonomous failover.

devops failover golang kafka kafka-proxy load-balancing proxy sre

Last synced: 20 Jun 2026

https://github.com/toolsascode/scoop-bucket

Scoop bucket for official GoModeler CLI

cli cloud devops golang gotemplate scoop sre

Last synced: 20 Oct 2025

https://github.com/volkv/server-pulse

Lightweight Linux server monitoring with Telegram alerts. CPU, RAM, disk, load, Docker, OOM. Pure bash, systemd timer, no daemon.

alerting bash dedicated-server devops disk-space docker homelab linux-monitoring monitoring oom-killer self-hosted server-monitoring shell-script sre systemd telegram-alerts telegram-bot vps

Last synced: 21 Jun 2026

https://github.com/ranching-farm/k8s-agent

Kubernetes agent for deploying ranching.farm directly into your cluster. Connect your K8s deployment to our AI-powered management platform with a single line of code.

ai-assistant ai-assisted cluster-management devops helm k8s kubectl kubernetes kustomize ranching-farm sre

Last synced: 03 Feb 2026

https://github.com/briancain/cats-as-a-service

This is a helper repo used during role-playing-based incident training.

cat cats dnd incident-response roleplay sre sre-infrastructure

Last synced: 28 Jan 2026

https://github.com/aliariff/argus

Tool to export WebPageTest results into InfluxDB.

devops grafana influxdb monitoring performance python sre webpagetest

Last synced: 18 Apr 2026

https://github.com/ramesh-852000/devops-practices-and-interview-prep

A collection of DevOps practices, scripts, interview questions, and real-world examples covering Linux, Jenkins, AWS, Kubernetes, Docker, Ansible, Terraform, CI/CD pipelines, Monitoring, and Cloud Platforms.

ansible aws azure cloud devops docker elastic gcp interview-questions jenkins kubernetes linux nosql prometheus sql sre terraform

Last synced: 04 Apr 2026

https://github.com/felipe-veas/handling-production-incidents

Runbooks, processes, and guidelines for effectively managing production incidents

documentation incident-management reliability runbooks sre

Last synced: 10 Mar 2026

https://github.com/curiouslearner/cache_sniper

A small utility to detect page caching on CDNs

cache cache-invalidation devops-tools rust rust-lang sre

Last synced: 28 Oct 2025

https://github.com/apiaryio/blackhole

App returning HTTP code 429

sre

Last synced: 26 Jun 2025

https://github.com/macbre/http-shadow

Compares HTTP responses from two different backends

sre sus

Last synced: 20 Jul 2025
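
Several entries in this listing (for example slh and slo-budget-tracker) deal with SLOs, error budgets, and burn rates. As a rough illustration of the arithmetic involved, here is a minimal Python sketch. The function names `error_budget_minutes` and `burn_rate` are hypothetical and are not taken from any listed project, and the 30-day window is an assumption chosen for the example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    `slo` is a fraction such as 0.999 (hypothetical helper, for illustration).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """Ratio of the observed error rate to the rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly on schedule;
    values above 1.0 mean it will run out before the window ends.
    """
    budget_ratio = 1.0 - slo
    if budget_ratio <= 0.0:
        raise ValueError("a 100% SLO leaves no error budget")
    return observed_error_ratio / budget_ratio


if __name__ == "__main__":
    # A 99.9% SLO over 30 days leaves roughly 43.2 minutes of budget.
    print(round(error_budget_minutes(0.999), 1))
    # Failing 0.2% of requests against a 99.9% SLO burns budget about 2x too fast.
    print(round(burn_rate(0.002, 0.999), 2))
```

Real tools typically evaluate burn rate over several windows at once; this sketch only shows the single-window calculation.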