An open API service indexing awesome lists of open source software.

SRE

Site reliability engineering (SRE) is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.

https://github.com/fsaintjacques/survivalkit

A survival kit is a package of basic tools and supplies prepared in advance as an aid to survival in an emergency.

c health-check healthcheck logger monitoring sre

Last synced: 21 Mar 2025

https://github.com/cleancloud-io/scan-action

GitHub Action for CleanCloud — read-only cloud hygiene scanner for AWS, Azure and GCP

aws azure cloud cloud-computing cost-optimization devops finops hygiene sre

Last synced: 07 Apr 2026

https://github.com/srujantata/opentelemetry-observability

OpenTelemetry Collector on Kubernetes: unified traces (Jaeger), metrics (Prometheus + Grafana), logs (Loki), auto-instrumentation, exemplar linking

grafana loki observability opentelemetry prometheus slo sre tempo

Last synced: 28 Jun 2026

https://github.com/akintunero/netdiag

Stdlib-only CLI for SRE on-call: traceroute, DNS, ping, TLS, VPN checks, and incident presets. JSON output, stable exit codes.

automation cli command-line-tools devops dns json network-diagnostics networking on-call python sre sysadmin traceroute troubleshooting vpn

Last synced: 21 Jun 2026

https://github.com/goseind/schablone

Template repository for app infrastructure based on SRE principles

actions azure cicd helm kubernetes sre terraform

Last synced: 05 Apr 2026

https://github.com/korchasa/severin

PoC server chat agent

agent agentic-ai devops llm sre

Last synced: 31 Jan 2026

https://github.com/mikeshobes718/eks-orchestrator

Python CLI to manage EKS cluster/nodegroup lifecycle, RBAC, addons, and GitOps-safe rollouts with dry-run plans.

cli devops eks gitops helm kubectl kubernetes python sre

Last synced: 17 Mar 2026

https://github.com/mikeshobes718/cluster-admin-toolkit

SRE toolkit for day‑2 ops: nodes/deployments views, health, logs/events, rollout status, cordon/drain, restart workflows.

cli kubectl kubernetes observability operations sre

Last synced: 17 Mar 2026

https://github.com/assafdori/tcdp

This repository tracks my progress while taking part in the TCDP course.

devops sre

Last synced: 07 Apr 2026

https://github.com/adriantunez/adriantunez.cloud

Personal site of an AWS cloud architect, platform engineer, and SRE enthusiast willing to openly share his learnings

aws blog blowfish giscuss hugo sre website

Last synced: 18 May 2026

https://github.com/knaeckebrothero/kubernetes-cluster-project

This project focuses on setting up and managing a Kubernetes Cluster, because who doesn't want one?

deployment devops kubernetes kubernetes-cluster sre

Last synced: 05 Apr 2025

https://github.com/ishantanu/gcp-status-exporter

A Prometheus Exporter for generating metrics for GCP Service Status and Incidents :rocket:

gcp opentelemetry prometheus-exporter sre

Last synced: 10 May 2026

https://github.com/abd-ulbasit/goplatform

Learning-focused Kubernetes operator (kubebuilder v4) that provisions in-cluster databases, caches, and queues via a custom Application CRD — with tier-based observability and drift detection.

cloud-native cnpg controller-runtime crd devops go golang kubebuilder kubernetes kubernetes-operator operator platform-engineering postgresql prometheus rabbitmq redis sre

Last synced: 10 Jun 2026

https://gitlab.com/ek-it/guias

Guides for installing and configuring servers and services in a data center, based on open source technologies and free software.

devops linux sre sysadmin

Last synced: 11 Mar 2025

https://github.com/pezzos/pezzos

Some info about me 🤓

curriculum-vitae cv devops sre

Last synced: 11 Feb 2026

https://github.com/deldotore-r/deldotore-r

🌐 Cloud Infrastructure & DevOps | AWS, Terraform, GitHub Actions & Linux Automation. Retired officer applying technical rigor and discipline to systems engineering.

airflow automation aws cloud-computing devops docker github-actions grafana infrastructure-as-code kubernetes linux prometheus shell-script sre terraform

Last synced: 16 Apr 2026

https://github.com/rrey/back-of-the-envelope

System design exercises

sre system-design

Last synced: 22 Jun 2026

https://github.com/nalediym/tiger-mom-protocol

An operating system for builders who start more than they finish. Personal protocol for AI-assisted development with structured nudges, celebration loops, and the six-step shipping workflow.

ai-workflow claude-code developer-productivity opencode productivity-tools skills sre

Last synced: 23 Jun 2026

https://github.com/ziad-hsn/cpra

CPRA is a high-performance infrastructure monitoring system designed for platform teams managing large-scale microservice architectures. Built on Entity-Component-System (ECS) architecture and queueing theory principles, CPRA handles 1,000,000+ concurrent health checks with automatic worker pool scaling to meet SLO targets.

concurrency devops golang observability self-hosted sre uptime-monitor

Last synced: 07 Mar 2026

https://github.com/apolzek/shared

Collection of proofs of concept (PoCs) created to explore ideas and test technologies

devops devops-tools laboratory proof-of-concept sre

Last synced: 17 Jan 2026

https://github.com/debaghtk/opsfordevs

DevOps bootcamp material that I have taught at previous companies

bootcamp devops operations sre

Last synced: 15 Feb 2026

https://github.com/viniciushammett/n8n-devops-lab

n8n lab for DevOps/SRE: incident automation, SLO digests, and webhooks, running in Docker (Postgres + Redis, queue mode).

automation cron devops docker docker-compose incident-management n8n postgres redis sre webhook workflows

Last synced: 16 Apr 2026

https://github.com/jesioo/dx

Developer-first CLI/TUI that turns your README and a single file into a next-gen DX

cli cli-app developer-tools devops dx enterprise internal-tools markdown onboarding remote-control scripting sre terminal terminal-based tui workflows

Last synced: 18 Apr 2026

https://github.com/detectviz/tech-docs-notes

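This project is a notes hub for organizing and translating documentation, white papers, and research materials on various open source technical tools.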

docs grafana guide sre

Last synced: 10 May 2026

https://github.com/jnbdz/site-reliability-engineer-quickstarts

:mechanical_arm: Site Reliability Engineer | Quickstarts :mechanical_arm:

quickstart quickstarts site-reliability-engineering site-reliability-engineering-sre sre

Last synced: 03 Mar 2026

https://github.com/cloudnativeworks/elchi

Elchi is a React + TypeScript web interface for managing Envoy Proxy–based L4/L7 traffic. It provides visual configuration and control of xDS resources, routing, clusters, filters, security (mTLS/WAF), and observability integrations without directly editing Envoy YAML.

cloud-native delta-xds devops envoy envoy-proxy load-balancer proxy service-mesh sre traffic-management ui xds

Last synced: 11 May 2026

https://github.com/dc-tec/openbao-observability

OpenBao observability reference architecture with metrics, logs, dashboards, alerts, fixtures, and runbooks.

alloy grafana loki observability openbao sre

Last synced: 02 Jun 2026

https://github.com/rrabelloo/homebrew-formae

Unofficial Homebrew tap for Formae, a modern Infrastructure-as-Code platform.

devops iac infrastructure-as-code platform-engineering sre

Last synced: 04 Mar 2026

https://github.com/guibes/runbook-operator

A cloud-native Kubernetes operator that automatically generates and manages runbook documentation from PrometheusRule configurations with multiple output formats.

alerting automation cloud-native devops documentation gitops incident-response kubernetes monitoring operator prometheus runbooks sre

Last synced: 17 May 2026

https://github.com/fedekau/mercado-libre-sre

An API for centralizing and caching queries to other Mercado Libre APIs.

api mercadolibre nodejs sre

Last synced: 17 May 2026

https://github.com/tedilabs/terraform-tfe-modules

🌳 A sustainable Terraform Package to manage everything on Terraform Enterprise (Terraform Cloud)

devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-cloud terraform-enterprise terraform-module terraform-modules tfe type-module

Last synced: 05 Mar 2026

https://github.com/wesllen-lima/cronko

Self-hosted cron job monitoring. Know when your jobs stop running.

cron devops docker heartbeat hono monitoring nextjs postgresql self-hosted sqlite sre tailwindcss uptime

Last synced: 25 Jun 2026

https://github.com/pallaprolus/kube-foresight

Predictive Resource Optimizer for Kubernetes — identifies over-provisioned deployments and generates right-sizing patches

aiops cli cloud-cost cost-optimization devops finops k8s kubectl kubernetes observability prometheus python resource-optimization right-sizing sre

Last synced: 12 Jun 2026

https://github.com/aronmilenait/aronmilenait.github.io

My blog and portfolio as a Software Developer transitioning to DevOps and SRE.

blog devops linux portfolio software-development sre

Last synced: 21 Jun 2025

https://github.com/guptaprakhariitr/vigil

Self-hosted on-call engineer: point it at your logs, it finds the cited root cause and (if you let it) opens a fix PR. Your infra, your engine (Claude/Cursor/API/local), your autonomy dial.

cli devops incident-response llm observability root-cause-analysis rust self-hosted sre

Last synced: 25 Jun 2026

https://github.com/tedilabs/terraform-aws-ecs

🌳 A sustainable Terraform Package which creates resources for ECS Services on AWS

aws aws-ecs devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules type-module

Last synced: 03 Apr 2026

https://github.com/amsa-2425-gei-udl/laboratoris

Material aimed at students who want to broaden their knowledge of systems administration and virtualization.

devops labs sre sys-admin teaching-materials

Last synced: 11 Apr 2025

https://github.com/jojees/project-genesis

Project Genesis is a comprehensive, hands-on learning initiative designed to build and manage a tangible, multi-service application within a modern DevOps ecosystem. This project serves as a real-world sandbox, demonstrating best practices across various disciplines, including DevOps, Site Reliability Engineering (SRE), DevSecOps, and FinDevOps.

cicd devops docker gitops grafana high-availability kubernetes microservices-architecture observability postgres prometheus rabbitmq redis sre

Last synced: 04 Apr 2026

https://github.com/dbalucas/kb_dbalucas

This repository contains all my know-how and is a quick, searchable list of commands for administering databases and the other types of systems I work with.

cloud dba ddl dml k8s linux postgresql sql sre

Last synced: 04 Apr 2026

https://github.com/labrats-work/infra.cloud-platform

Multi-cloud Kubernetes platform with GitOps automation. Cloud-agnostic platform layer with provider-specific implementations.

ansible cloud-native devops flux fluxcd gitops hetzner infrastructure-as-code kubernetes kustomize multi-cloud platform-engineering prometheus sre terraform

Last synced: 05 Apr 2026

https://github.com/omers/sre-devops-tools

Tools and useful sources for SRE and DevOps

awsome awsome-list data devops monitoring sre tools

Last synced: 20 Apr 2026

https://github.com/moemoe89/ansible-ji2

📓 Ansible project for my Medium story material

ansibe aws cloud compute-engine devops docker ec2 gcp haproxy sre

Last synced: 21 Apr 2026

https://github.com/guilledipa/praetor

A Go-based configuration management tool for real-time, mTLS-secured infrastructure orchestration via Message Brokers and gRPC.

automation configuration-management devops golang grpc-go mtls nats-jetstream orchestration sre system-administration

Last synced: 14 Jun 2026

https://github.com/adanb13/cirdan

AI infrastructure cartographer and operations daemon — Graphify for live infrastructure. Fingerprints, graphs, and watches Docker/Kubernetes/AWS/IaC for AI agents via CLI + MCP.

agent-skills ai-agents aiops claude-code codex cursor devops docker gemini-cli incident-response infrastructure knowledge-graph kubernetes mcp mcp-server observability sre terraform

Last synced: 17 Jun 2026

https://github.com/sixtusagbo/alx-system_engineering-devops

System engineering and DevOps training at ALX Holberton school

api automation back-end bash ci-cd debugging devops mysql puppet python scripting shell sre sysadmin

Last synced: 20 Jan 2026

https://github.com/donchanee/metricops

Prometheus metric governance CLI for self-hosted operators

cli cost-optimization golang grafana metrics observability prometheus sre

Last synced: 22 Apr 2026

https://github.com/admbahm/pushbadger

Deterministic git diff analyzer that maps changed files to risk areas using path-based heuristics. Fast, reproducible, no AI.

cli code-review developer-tools git go risk-analysis sre static-analysis

Last synced: 23 Apr 2026

https://github.com/jrhrmsll/tsgen

tsgen is a small Go program that simulates HTTP request faults and shows how Prometheus alerts based on multiwindow, multi-burn-rate alerting work.

golang grafana monitoring prometheus sre

Last synced: 22 Feb 2026

https://github.com/cheesebanana/yellowstack

Real-time Python script runner with scheduling, logging, and OpenAI-assisted debugging

automation aws devops flask job-scheduler openai python rest-api scheduler scripts sre

Last synced: 12 Jun 2025

https://github.com/robson-teixeira/jaeger-opentelemetry-tracking

Repository for the course "Rastreamento: fazendo tracing com Jaeger e OpenTelemetry" (Tracking: tracing with Jaeger and OpenTelemetry) on the Alura platform.

alura container docker grafana grafana-loki jaeger java jdk nginx opentelemetry postgresql prometheus rastreamento redis spring sre tracing

Last synced: 12 Apr 2026

https://github.com/foxj77/autonomous-monitor

Kubernetes namespace monitor — continuous deterministic checks, JSON findings to Kafka

github-actions golang hacktoberfest kafka kubernetes monitoring namespace operator prometheus sre

Last synced: 18 Jun 2026

https://github.com/thanhnguyxn/alert-alchemy

🧪 CLI incident-response simulator: brew fixes from alerts using realistic logs, metrics & traces (offline).

chaos-engineering cli debugging devops game incident-response learning monitoring observability oncall postmortem python rich runbooks simulation site-reliability-engineering sre terminal typer yaml

Last synced: 13 Jan 2026

https://github.com/safoorsafdar/safoorsafdarcom

source code for the personal website safoorsafdar.com

azure cloud-architect devops docker kubernetes observability personal-website prometheus sre

Last synced: 18 Jan 2026

https://github.com/philyuchkoff/howtheysre

A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)

sre

Last synced: 11 Mar 2025

https://github.com/timyiu478/sadservers

Notes on SadServers

devops linux sre troubleshooting

Last synced: 03 Jul 2025

https://github.com/oluwatobi-roie/sre-diskmonitor

Monitor disk usage on a MySQL server and auto-reset binary logs safely when space runs low.

automation bash cronjob devops diskmonitoring mysql server-maintenance sre

Last synced: 27 Apr 2026

https://github.com/dcaffese-cypher/observability-security-stack

Production-ready observability & security toolkit: OpenTelemetry Collector (master/agent), Prometheus, Loki, Grafana dashboards, Wazuh and Zabbix utilities, plus retention/TSDB maintenance scripts.

ansible devops grafana infrastructure logging loki monitoring observability opentelemetry prometheus security sre tracing wazuh zabbix

Last synced: 11 Apr 2026

https://github.com/robliberti/aws-terraform-ansible-1

Terraform + Ansible lab demonstrating a secure public-frontend/private-backend architecture on AWS, featuring reverse proxying, idempotent configuration, and post-deploy health checks.

ansible aws devops infrastructure-as-code nginx reverse-proxy sre terraform

Last synced: 12 Apr 2026

https://github.com/benfradjselim/ruptura

Predictive failure detection engine for cloud-native infrastructure. Rupture Index™ detects divergence hours early — adaptive ensemble of 5 models, 8 composite signals, automated K8s remediation.

cloud-native go kubernetes mlops observability opentelemetry predictive-analytics prometheus rupture-detection sre

Last synced: 28 May 2026

https://github.com/deas/ka0s

Building Chaos around LitmusChaos on Kubernetes

chaos-engineering flux2 kubernetes litmuschaos sre

Last synced: 15 Mar 2025

https://github.com/ops-talks/farm

The Full Stack Platform Engineer

devportal platform-engineering sre

Last synced: 29 Apr 2026

https://github.com/shinagawa-web/pgincident

A live terminal dashboard for the first 30 seconds of a PostgreSQL incident — connections, locks, long queries, and idle transactions at a glance.

bubbletea cli devops go incident-response postgres postgresql sre terminal tui

Last synced: 27 May 2026

https://github.com/davidmalko87/jira-confluence-full-instance-backup

Automated full-instance backup for Jira & Confluence Cloud (Standard plan). Jenkins + CLI/menu; encrypted 7-Zip archives to GCS, S3, Azure or local; notify via Slack, Teams, email or webhook.

7zip atlassian atlassian-cloud automation azure backup cli confluence devops disaster-recovery gcs jenkins jira python s3 sre

Last synced: 27 May 2026

https://github.com/opscart/opscart-k8s-watcher

Kubernetes security awareness and troubleshooting tool featuring CIS Benchmark scoring, environment-aware analysis (PROD vs DEV), and actionable recommendations. Not for compliance auditing - use kube-bench for official CIS compliance.

cis-benchmark cloud-native cluster-monitoring devops devsecops kubernetes-security platform-engineering resource-optimization sre troubleshooting

Last synced: 21 Feb 2026

https://github.com/lukemurraynz/drasi-aks-sre-agent

Azure SRE Agent - Blueprint for Drasi on AKS

aks azure drasi sre

Last synced: 20 Jun 2026

https://github.com/powerhome/pac-quota-controller

PAC Resource Sharing Validation Webhook

kubernetes-controller pac sre

Last synced: 16 Jan 2026

https://github.com/pinepain/laravel-system-info

Set of tools to help maintain a Laravel application in Kubernetes

devops k8s laravel php sre

Last synced: 08 May 2026

https://github.com/monim279/ai-powered-devops

🤖 Discover AI tools and techniques to optimize DevOps processes through practical challenges and hands-on projects in this comprehensive 10-day course.

agent agentic-ai cicd cloud devops devops-platform devops-workflow engineering-productivity environment-manager go grafana hacktoberfest llmops mcp microsoft-teams prometheus self-hosted sre

Last synced: 29 Apr 2026

https://github.com/itsfoss0/writeups

Write-ups about my homelab and postmortems for its incidents and outages

devops incident-management incident-response kubernetes sre

Last synced: 13 Apr 2026

https://github.com/doctolib/terraform-provider-elkaliases

Elasticsearch indexes provider for Terraform

criticality-tier0 github-actions managed-by-terraform sre

Last synced: 30 Apr 2026

https://github.com/tedilabs/github-required-actions

♥️ The best way to manage GitHub Actions Required Workflows in @tedilabs

devops github github-actions hacktoberfest sre tedilabs

Last synced: 27 Mar 2025

https://github.com/awcodify/awesome-monitoring

This repository is a curated collection of valuable monitoring tools, resources, and best practices for developers, sysadmins, and DevOps professionals. It covers various aspects of monitoring, including infrastructure, applications, logs, networks, cloud, and Kubernetes.

alerting devops infrastructure logging logs metrics monitoring sre sysadmin

Last synced: 22 Feb 2026

https://github.com/swenyai/sweny

AI-powered engineering workflows — Learn from any source, Act through any tool, Report through any channel

ai ai-agent automation claude devops github-action observability sre triage typescript

Last synced: 11 Mar 2026

https://github.com/cakmoel/resilio

Professional technology-agnostic load testing suite built for performance engineering and durability auditing. Implements research-based methodologies (Jain, 1991) and ISO 25010 standards to validate speed, endurance, and scalability across any backend stack.

apachebench benchmarking devops-tools endurance-testing load-testing performance-testing quality-assurance reliability-engineering scalability sre stress-testing tech-agnostic

Last synced: 13 Apr 2026

https://github.com/rogerchappel/runbooklint

Local-first Markdown runbook linter for safer operational procedures

cli devops incident-response lint markdown runbook sre

Last synced: 26 May 2026

https://github.com/rtmuller/observability-reliability-lab

Hands-on lab demonstrating observability reliability patterns: meta-monitoring, chaos scenarios, Watchdog, absent() rules.

alertmanager chaos-engineering grafana meta-monitoring observability prometheus sre

Last synced: 29 May 2026

https://github.com/subhamay-bhattacharyya-tf/terraform-google-module-template

✅ A reusable, opinionated Terraform module template for provisioning and managing Google Cloud Platform resources — designed for consistency, scalability, and best practices.

cloud-infrastructure devops gcp google-cloud iac infrastructure-as-code module-template platform-engineering sre terraform terraform-module terraform-template

Last synced: 02 May 2026

https://github.com/juanfranciscocis/devprobe_tesis

DevProbe is a progressive web application that provides a platform for Site Reliability Engineers to monitor their websites. The app is built with Ionic, Angular, and Firebase.

angular gemini gemini-api ionic ionic-framework reliability-engineering site site-reliability-engineering site-reliability-engineering-sre sre sre-team typescript

Last synced: 10 Apr 2026

https://github.com/jebinjeb/k8s-evacuator

Advanced Kubernetes node evacuation tool with safe batching, workload-aware draining, and progressive (controlled) pod eviction.

cloud-native cluster-operations devops k8s kubectl kubernetes node-drain platform-engineering pod-eviction sre statefulset

Last synced: 03 May 2026