An open API service indexing awesome lists of open source software.

SRE

Site reliability engineering (SRE) is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.

https://github.com/excoriate/terragrunt-ref-arch

Highly scalable and available reference architecture for Terragrunt.

cli devops ecs example sre tooling

Last synced: 28 Apr 2025

https://github.com/adhorn/aws-fis-templates-cdk

Collection of AWS Fault Injection Simulator (FIS) experiment templates deployable via the AWS CDK

amazon-web-services automation aws aws-fis cdk-examples cdk-library chaos-engineering chaos-testing devops-tools sre testing

Last synced: 16 May 2025

https://github.com/bjarneo/rip

Rest in peace(s) - HTTP/UDP load testing tool

ddos go golang http learning-by-doing load-testing rip security-tools sre sre-infra udp-flood

Last synced: 25 Apr 2025

https://github.com/ari-hacks/command-line-cheat-sheet

📝 A place to quickly look up commands (bash, vim, git, AWS, Docker, Terraform, Ansible, kubectl)

ansible aws bash command-line devops docker git k8s kubectl kubernetes sre terraform vim

Last synced: 10 Apr 2026

https://github.com/alexkroman/ollychat

Create custom DevOps AI agents that understand and manage your infrastructure.

agent agents ai ai-agent-framework ai-agents ai-agents-framework llm observability observability-data prometheus sre

Last synced: 16 May 2025

https://github.com/tedilabs/terraform-aws-account

🌳 A sustainable Terraform Package which creates Account & IAM resources on AWS

aws aws-iam devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules

Last synced: 13 Feb 2026

https://github.com/microsoft/sqlcallstackresolver

A sample tool for users of Microsoft SQL Server to aid in troubleshooting otherwise difficult-to-diagnose issues. Provided AS-IS - see SUPPORT.md.

azuresql azuresqldb azuresqlmanagedinstance callstack debugging debugging-symbol msdia140 pdb pdb-files sqlserver sqlserver-2017 sqlserver-2019 sqlserver-2022 sre symbols tool xevent xevents

Last synced: 07 May 2025

https://github.com/microsoft/SQLCallStackResolver

A sample tool for users of Microsoft SQL Server to aid in troubleshooting otherwise difficult-to-diagnose issues. Provided AS-IS - see SUPPORT.md.

azuresql azuresqldb azuresqlmanagedinstance callstack debugging debugging-symbol msdia140 pdb pdb-files sqlserver sqlserver-2017 sqlserver-2019 sqlserver-2022 sre symbols tool xevent xevents

Last synced: 08 Apr 2025

https://github.com/blacklane/kiev

A set of tools to do distributed logging for Ruby web applications

distributed-tracing elk-stack logging ruby sre

Last synced: 04 Apr 2025

https://github.com/loftwah/loftwahs-cheatsheet

My own personal tech cheatsheet. This covers the stuff I use quite regularly.

bash devops hacktoberfest linux nodejs python sre typescript

Last synced: 20 Jun 2025

https://github.com/huseynovvusal/blamebot

AI on-call agent that detects deploy failures, explains what broke, pages the responsible team, and rolls back automatically.

ai-agent devops hackathon incident-management nextjs postmortem redis slack-bot sre upstash vercel

Last synced: 09 May 2026

https://github.com/tedilabs/terraform-aws-container

🌳 A sustainable Terraform Package which creates resources for Container Services on AWS

aws aws-ecr aws-eks devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules type-module

Last synced: 23 Feb 2026

https://github.com/k8sgpt-ai/docs

Documentation for K8sGPT

ai chatgpt docs kubernetes sre

Last synced: 06 Apr 2025

https://github.com/nobl9/sloctl

A command line tool to cast SLO spells 🪄

cli go golang nobl9 reliability slo sre

Last synced: 27 Feb 2026

https://github.com/FluidifyAI/Regen

Open-source incident management: alerts, on-call, AI post-mortems. Self-hosted alternative to PagerDuty & incident.io. Works with Prometheus, Grafana, Datadog, Slack, and Teams. Free forever, BYO-AI.

ai alerting devops grafana incident-management observability on-call open-source pagerduty-alternative prometheus self-hosted slack sre

Last synced: 28 May 2026

https://github.com/ory/jobs

Want to build the next generation identity stack? You've come to the right place!

go hiring jobs kubernetes open-source opensource ory react sre

Last synced: 17 Mar 2025

https://github.com/icco/postmortems

Postmortem metadata from danluu/post-mortems.

hacktoberfest postmortem-metadata sre

Last synced: 21 Mar 2025

https://github.com/sitectl/cuttle

Blue Box SRE Operations Platform

ansible bastion bluebox elk operations sensu sre

Last synced: 11 Apr 2025

https://github.com/apiaryio/heroku-datadog-drain-golang

Funnel metrics from multiple Heroku apps into Datadog using StatsD.

datadog golang heroku metrics sre statsd

Last synced: 03 Oct 2025

https://github.com/excoriate/go-terradagger

TerraDagger is a Go package for managing your infrastructure-as-code through containers.

cli devops ecs example sre tooling

Last synced: 13 Apr 2025

https://github.com/ramizpolic/sre-playground

A set of Site Reliability Engineering notes & challenges

cicd cloud guide infrastructure site-reliability-engineer sre tasks

Last synced: 14 Apr 2025

https://github.com/seveas/herd

Massively parallel ssh client

cli orchestration sre ssh sysadmin system-administration

Last synced: 25 Jun 2025

https://github.com/fkie-cad/logprep

Log data preprocessing, generation, and shipping in Python

etl kafka log logdata loggenerator logshipper opensearch preprocessing python soar sre

Last synced: 02 Mar 2026

https://github.com/alexewerlof/slc

A simple service level calculator

error-budget servicelevels sla sli slo sre

Last synced: 03 Apr 2026

https://github.com/enola-dev/enola

Enola 🕵🏾‍♀️ Holmes was an SRE.

graph graphviz mermaid modeling rdf semantic-web sre visualization

Last synced: 16 Jun 2025

https://github.com/Excoriate/go-terradagger

TerraDagger is a Go package for managing your infrastructure-as-code through containers.

cli devops ecs example sre tooling

Last synced: 21 Apr 2025

https://github.com/lwindolf/multi-status

Aggregator PWA for status pages of online services. Know which of your 3rd party SaaS/PaaS are having issues right now.

cloud devops monitoring paas pwa saas sre

Last synced: 11 Apr 2025

https://github.com/keycloak/keycloak-sre-sig

Keycloak's Site Reliability Engineers Special Interest Group (Keycloak SRE SIG): To improve the lives of people running and operating Keycloak

keycloak sig sre

Last synced: 12 Apr 2025

https://github.com/tedilabs/terraform-aws-network

🌳 A sustainable Terraform Package which creates VPC resources (VPC, Subnet, NACL, NAT Gateway, Route Table) on AWS

aws aws-vpc devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules

Last synced: 15 Apr 2025

https://github.com/immobiliare/collectd-haproxy-plugin

Collectd plugin to pull metrics from HAProxy instances

collectd collectd-plugin grafana haproxy metrics monitoring sre

Last synced: 01 Apr 2025

https://github.com/better-sre/config

Config files, Dockerfiles, and Taskfiles for developers.

awesome-taskfile docker dotfiles flutter go-task golang python rust sre taskfile

Last synced: 02 May 2025

https://github.com/hequan2017/raptor

Raptor operations platform. The project is finished and will no longer be updated.

cmdb devops gin go golang jenkins rbac sre vue

Last synced: 07 May 2025

https://github.com/dynatrace-oss/customersuccess

Open source solutions that help you level up your observability game with Dynatrace.

adoption ai automation dashboards dynatrace intelligence notebooks observability obsolescence software sre value workflows

Last synced: 07 Jan 2026

https://github.com/grafana/xk6-chaos

xk6 extension for running chaos experiments with k6 💣

chaos chaos-engineering k6-extension reliability sre testing xk6

Last synced: 01 Oct 2025

https://github.com/operate-first/operations

The sig-operations repository.

site-reliability-engineering sre

Last synced: 16 Jan 2026

https://github.com/k8sgpt-ai/community

Community Management for K8sGPT

devops kubernetes openai sre tooling

Last synced: 15 Apr 2025

https://github.com/nathanielvarona/pritunl-client-github-action

Establish automated secure Pritunl VPN connections with Pritunl Client in GitHub Actions, supporting OpenVPN and WireGuard.

cicd devops github-actions hacktoberfest openvpn pritunl pritunl-vpn sre vpn-client vpn-server wireguard

Last synced: 10 Mar 2026

https://github.com/anjakammer/devops-and-sre

An online course @ HTW Berlin

devops gitops operations sre

Last synced: 21 Jan 2026

https://github.com/be-next/awesome-performance-engineering

A curated, opinionated collection of tools and resources dedicated to Performance Engineering, covering both Observability and Performance Testing.

awesome awesome-list devops load-testing monitoring observability performance performance-engineering performance-testing sre

Last synced: 08 Mar 2026

https://github.com/googlecloudplatform/reliable-app-platforms

An MVP of a platform for delivering reliable applications on Google Cloud

gke google-cloud kubernetes reliability slos sre terraform

Last synced: 20 Oct 2025

https://github.com/microsoft/tdslib

Open implementation of the TDS protocol (version 7.4) in managed C# code.

dotnet sqlserver sre tds

Last synced: 17 Aug 2025

https://github.com/devopsext/sre

Golang SRE framework for logs, metrics, traces, and events. It supports Jaeger, Prometheus, Datadog, OpenTelemetry, New Relic, and Grafana.

events logs metrics observability sre traces

Last synced: 12 Jan 2026

https://github.com/tedilabs/terraform-aws-domain

🌳 A sustainable Terraform Package which creates resources for Domain Services on AWS

aws aws-route53 devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules

Last synced: 15 Apr 2025

https://github.com/diogopms/monit-docker

Monit is a free, open source utility for managing and monitoring processes, programs, files, directories, and filesystems on a UNIX system. Monit conducts automatic maintenance and repair and can execute meaningful causal actions in error situations.

devops docker kubernetes monit monitoring sre status

Last synced: 25 Oct 2025

https://github.com/butuzov/todayilearned

Because I Can't Trust My Memory

bash go jupyter linux python sre

Last synced: 17 Mar 2025

https://github.com/rootlyhq/terraform-provider-rootly

Terraform provider for Rootly - manage incident management, on-call schedules, workflows, and alerts as code

devops go golang hashicorp iac incident-management incident-response infrastructure-as-code on-call rootly site-reliability-engineering sre terraform terraform-provider

Last synced: 11 Mar 2026

https://github.com/wpjunior/multi-burn-rate-calculator

Calculator to view detection time using error budget consumption rates, based on lessons from Site Reliability Engineering Workbook

error-budget sli slo sre

Last synced: 17 Mar 2026

https://github.com/aptible/unpage

Unpage is the open source framework for building SRE agents with infrastructure context and secure access to any dev tool.

agent agentic-workflow agents ai-agent ai-sre aiops automation devops dspy incident-response incident-response-tooling mcp monitoring observability site-reliability-engineering sre sre-agent

Last synced: 08 Sep 2025

https://github.com/angelopoerio/oom-notifier

Notifies about OOM-killed processes, reporting the full command line

devops kubernetes linux observability rust site-reliability-engineering sre

Last synced: 17 Jan 2026

https://github.com/luan78zaoha/kaldi-timit-sre-ivector

Develop a speaker recognition model based on i-vectors using the TIMIT database

chinese i-vector kaldi speaker-recognition speaker-verification sre

Last synced: 11 Mar 2025

https://github.com/tedilabs/terraform-aws-data

🌳 A sustainable Terraform Package which creates resources for Data Services on AWS

aws aws-athena devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules

Last synced: 03 Oct 2025

https://github.com/last9/last9-integrations

Sample applications of supported integrations by Last9 Products

integrations last9 reliability-engineering sre timeseries-database

Last synced: 28 Apr 2025

https://github.com/quzanh1130/multi_metrics_to_compare_images

Comparing two images by using 9 metrics: VIFP, PSNR, SSIM, FSIM, RMSE, ISSM, SRE, SAM, UIQ.

compare-image fsim issm psnr rmse sam sre ssim uiq vifp

Last synced: 28 Oct 2025

https://github.com/tedilabs/k8s-repository

♻️ Repository for Reusable Kubernetes App Manifests with Kustomize

devops gitops hacktoberfest k8s kubernetes kustomize lang-yaml sre tedilabs

Last synced: 19 Oct 2025

https://github.com/nudgebee/nudgebee

AI-driven incident management and observability for Kubernetes, AWS, Azure, and GCP — LLM-powered RCA, runbook automation, and cost optimization.

agentic ai-agents aiops aws azure cost-optimization devops finops gcp golang helm incident-management kubernetes llm multi-cloud nextjs observability root-cause-analysis runbook-automation sre

Last synced: 11 Jun 2026

https://github.com/rsionnach/nthlayer

Generate the complete reliability stack from a service spec in 5 minutes. Dashboards, alerts, SLOs, PagerDuty - zero toil.

alerts devops grafana monitoring observability pagerduty prometheus python slo sre

Last synced: 18 Jan 2026

https://github.com/dkorunic/axfr2hosts

Fetches one or more DNS zones via AXFR and dumps in Unix hosts format for local use

bind bind9 bind9-dns dns dns-server domain linux networking security sre sysops unix zone

Last synced: 12 Apr 2025

https://github.com/bjarneo/gecho

Gecho - an HTTP request echo debugging service

debugging devops echo golang http http-server request sre

Last synced: 25 Apr 2025

https://github.com/tedilabs/terraform-aws-db

🌳 A sustainable Terraform Package which creates resources for Databases on AWS

aws aws-db aws-elasticache aws-rds devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules

Last synced: 15 Apr 2025

https://github.com/shantoroy/site-reliability-engineering-101

This GitHub repository contains a comprehensive tutorial on Site Reliability Engineering (SRE), covering topics such as SLAs, SLOs, SLIs, Chaos Engineering, monitoring, alerting, and much more. It also includes bonus content on SRE best practices. Follow along with the #100daysofSRE challenge and improve your reliability engineering skills.

100daysofcode alerting automation chaos-engineering devops devsecops monitoring reliability-engineering service-level-agreement service-level-indicator service-level-objective site-reliability-engineering sre

Last synced: 27 Mar 2026

https://github.com/todd-dsm/mac-ops

QnD Automation to build a MacBook Pro for DevOps

customizable devops devops-tools macbook-configuration macbook-setup macos sre

Last synced: 13 Apr 2025

https://github.com/chatwoot/faultline

An open-source AI agent for infrastructure debugging.

ai ai-agents ai-sre sre

Last synced: 24 Feb 2026

https://github.com/fault-project/fault-cli

Build Exciting Applications Your Users Can Rely On

chaos-engineering reliability-engineering resilience-engineering sre

Last synced: 29 May 2026

https://github.com/guilhem/devops-training

DevOps culture training

agile cloud devops hugo lean reveal-js sre

Last synced: 19 Mar 2025

https://github.com/Lethe044/hermes-incident-commander

Autonomous SRE agent built on Hermes - detects, heals, and learns from production incidents. Uses Memory + Skills + Cron + Gateway + Subagents + Atropos RL.

atropos autonomous-agents devops hermes-agent incident-response llm-agent nous-research sre

Last synced: 05 May 2026

https://github.com/xe-nvdk/terraform-recipes

This is the repo where I save #Terraform recipes, mostly posted on cduser.com

devops iaac infrastructure-as-code sre terraform

Last synced: 11 Apr 2025

https://github.com/gopatchy/bkl

Layered Configuration Language

configuration deployment devops json k8s kubernetes sre toml yaml

Last synced: 17 Jan 2026

https://github.com/avivl/cloud-sre-agent

An autonomous SRE agent that monitors cloud logs across multiple platforms, leveraging AI models from various providers to detect anomalies, perform root cause analysis, and automate remediation by creating GitHub Pull Requests.

ai-agents ai-ops automation aws cloud devops gcp gemini-ai google-cloud incident-response llm log-analysis log-monitoring platform-engineering python resilience sre vertex-ai

Last synced: 09 Mar 2026

https://github.com/tedilabs/terraform-aws-firewall

🌳 A sustainable Terraform Package which creates resources for Firewall Services on AWS

aws aws-firewall aws-waf devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules

Last synced: 21 Jan 2026

https://github.com/input-output-hk/devshell-capsules

Space Capsules for the Modern DevShell

devshell sre

Last synced: 13 Oct 2025

https://github.com/woodprogrammer/postgresql-connection-manager

This is a project to manage PostgreSQL connections via cgroup v2

cgroups devops pg postgresql sre

Last synced: 28 Apr 2025

https://github.com/apiaryio/ivy

A Node.js queue library focused on easy, yet flexible task execution.

sre

Last synced: 30 Jul 2025

https://github.com/fluxninja/aperture-go

SDK to interact with Aperture Agent

concurrency-limiter flow-control rate-limiter sdk sre

Last synced: 14 Oct 2025

https://github.com/excoriate/daggerx

DaggerX is a Go package 📦 that helps you stay DRY while developing Dagger modules.

cli devops ecs example sre tooling

Last synced: 03 Jul 2025

https://github.com/tedilabs/terraform-aws-vpc-connectivity

🌳 A sustainable Terraform Package which creates VPC Connectivity resources (Private Link, Client VPN, Site-to-Site VPN, DX, VPC Lattice) on AWS

aws aws-client-vpn aws-direct-connect aws-dx aws-site-to-site-vpn aws-vpc aws-vpc-lattice aws-vpc-private-link aws-vpn devops hacktoberfest hcl2 iac lang-hcl sre tedilabs terraform terraform-aws terraform-module terraform-modules

Last synced: 24 Oct 2025

https://github.com/antolius/deployments-and-disasters

A tabletop RPG for practicing incident management.

rpg sre training

Last synced: 05 May 2025