Projects in Awesome Lists tagged with reliability-engineering
A curated list of projects in awesome lists tagged with reliability-engineering.
https://github.com/litmuschaos/litmus
Litmus helps SREs and developers practice chaos engineering in a cloud-native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes are at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q
chaos-engineering chaos-experiments chaos-testing chaoshub cloud-native cncf devops fault-injection fault-simulation golang google-summer-of-code hacktoberfest k8s kubernetes lfx litmuschaos operator-sdk reliability-engineering resilience-testing site-reliability-engineering
Last synced: 23 Oct 2025
https://github.com/bregman-arie/sre-checklist
A checklist for anyone practicing Site Reliability Engineering
automation checklist gitops kubernetes reliability-engineering sre terraform
Last synced: 15 May 2025
https://github.com/awslabs/aws-well-architected-labs
Hands-on labs and code to help you learn, measure, and build using architectural best practices.
aws cost lab reliability reliability-engineering resilience resiliency security well-architected wellarchitected
Last synced: 09 Jan 2026
https://github.com/chaostoolkit/chaostoolkit
Chaos Engineering Toolkit & Orchestration for Developers
automation chaos-engineering chaostoolkit devops-tools reliability reliability-engineering resiliency sre
Last synced: 14 May 2025
https://github.com/Azure/Mission-Critical
This repository provides a design methodology and approach to building highly reliable applications on Microsoft Azure for mission-critical workloads.
azure business-critical mission-critical reliability-engineering safety-critical
Last synced: 22 Jul 2025
https://github.com/azure/mission-critical
This repository provides a design methodology and approach to building highly reliable applications on Microsoft Azure for mission-critical workloads.
azure business-critical mission-critical reliability-engineering safety-critical
Last synced: 28 Sep 2025
https://github.com/artilleryio/chaos-lambda
Serverless chaos monkey for AWS (runs on AWS Lambda) ☁️ 💥
aws chaos-monkey fault-tolerance reliability-engineering
Last synced: 22 Jul 2025
https://github.com/mikeroyal/openshift-guide
OpenShift Guide. Learn about the Red Hat OpenShift Container Platform, Data Science, Code Ready Containers, Podman, Buildah, and Kubernetes.
active-directory btrfs chaos-engineering container-image container-security deploy-tool hybrid-cloud kubernetes kubernetes-cluster kubevirt multicloud openshift openshift-ansible openshift-cluster openshift-dedicated openshift-deployment openshift4 reliability-engineering site-reliability-engineering systemctl
Last synced: 28 Oct 2025
https://github.com/rakhimov/scram
Probabilistic Risk Analysis Tool (fault tree analysis, event tree analysis, etc.)
bdd c-plus-plus cpp17 event-tree fault-tree fta pra psa python qt5 reliability-engineering risk-analysis zbdd
Last synced: 11 May 2025
https://github.com/temperlang/temper
A programming language for libraries translated to all the others
distributed-systems interoperability programming-language reliability-engineering translation
Last synced: 25 Apr 2026
https://github.com/grafana/k6-docs
The k6 documentation website.
devops docs hacktoberfest k6 performance-testing reliability-engineering
Last synced: 15 May 2025
https://github.com/alphagov/paas-cf
GOV.UK PaaS - Cloud Foundry
cloud-foundry concourse paas reliability-engineering
Last synced: 08 May 2025
https://github.com/chaostoolkit/chaostoolkit-lib
The Chaos Toolkit core library
chaos-engineering chaostoolkit chaostoolkit-core reliability-engineering
Last synced: 05 Apr 2025
https://github.com/theodesp/stable-systems-checklist
An opinionated list of attributes and policies that need to be met in order to establish a stable software system.
architecture continuous-delivery continuous-integration fault-tolerance reliability-engineering security
Last synced: 07 Jan 2026
https://github.com/alphagov/terraform-provider-concourse
A terraform provider for Concourse
concourse reliability-engineering terraform-provider
Last synced: 08 May 2025
https://github.com/dastergon/sreworkbook-templates-md
A collection of templates ported from the SRE Workbook
devops error-budget reliability-engineering site-reliability site-reliability-engineering sla sli slo templates
Last synced: 03 Apr 2025
https://github.com/alphagov/paas-docker-cloudfoundry-tools
cloud-foundry docker paas reliability-engineering
Last synced: 27 Jun 2025
https://github.com/alphagov/puppet-aptly
No longer maintained: Puppet module for aptly
govuk puppet reliability-engineering
Last synced: 30 Sep 2025
https://github.com/nobl9/terraform-provider-nobl9
Terraform provider for Nobl9
google-slo metrics monitoring nobl9 observability openslo reliability reliability-engineering slo sre
Last synced: 01 Apr 2026
https://github.com/alphagov/paas-billing
A Go application for generating billing data from cloudfoundry events
cloud-foundry paas reliability-engineering
Last synced: 18 Jun 2025
https://github.com/alphagov/paas-cf-conduit
cloud-foundry paas reliability-engineering
Last synced: 08 May 2025
https://github.com/alphagov/paas-admin
Administration tool for GOV.UK PaaS
cloud-foundry node paas reliability-engineering typescript webpack
Last synced: 08 May 2025
https://github.com/alphagov/paas-aiven-broker
A service broker to provide Aiven Elasticsearch and InfluxDB services to Cloud Foundry users
aiven aws broker cloud-foundry paas reliability-engineering
Last synced: 12 Jul 2025
https://github.com/last9/last9-integrations
Sample applications of supported integrations by Last9 Products
integrations last9 reliability-engineering sre timeseries-database
Last synced: 28 Apr 2025
https://github.com/alphagov/paas-bootstrap
Bootstrap a VPC with BOSH and Concourse to run PaaS
bosh concourse paas reliability-engineering
Last synced: 16 Jun 2025
https://github.com/alphagov/paas-tech-docs
Technical documentation for GOV.UK PaaS
documentation paas reliability-engineering tech-docs-template
Last synced: 08 May 2025
https://github.com/shantoroy/site-reliability-engineering-101
This GitHub repository contains a comprehensive tutorial on Site Reliability Engineering (SRE), covering topics such as SLAs, SLOs, SLIs, Chaos Engineering, monitoring, alerting, and much more. It also includes bonus content on SRE best practices. Follow along with the #100daysofSRE challenge and improve your reliability engineering skills.
100daysofcode alerting automation chaos-engineering devops devsecops monitoring reliability-engineering service-level-agreement service-level-indicator service-level-objective site-reliability-engineering sre
Last synced: 27 Mar 2026
https://github.com/dastergon/error-budget-calculator
Calculate the tolerable downtime of your service
reliability-engineering service-level-agreement service-level-objective site-reliability-engineering
Last synced: 18 Feb 2026
https://github.com/louisaslett/reliabilitytheory
ReliabilityTheory R package: Tools for structural reliability analysis
Last synced: 12 Apr 2025
https://github.com/alphagov/paas-team-manual
GOV.UK PaaS team manual
documentation paas reliability-engineering
Last synced: 08 May 2025
https://github.com/guilt/chaossquirrel
Like Netflix's Chaos Monkey, packaged to run standalone.
chaos-monkey reliability-engineering sre
Last synced: 12 Aug 2025
https://github.com/alphagov/paas-elasticache-broker
A CloudFoundry service broker for AWS Elasticache Redis services
aws broker cloud-foundry paas reliability-engineering
Last synced: 08 May 2025
https://github.com/alphagov/paas-s3-broker
An Open Service Broker API-compatible service broker for AWS S3
aws broker cloud-foundry paas reliability-engineering s3
Last synced: 08 May 2025
https://github.com/alphagov/zendesk-scripts
Various scripts in various languages to interact with GDS Zendesk
Last synced: 06 Oct 2025
https://github.com/govuk-paas/paas-elasticache-broker
A CloudFoundry service broker for AWS Elasticache Redis services
aws broker cloud-foundry paas reliability-engineering
Last synced: 07 Oct 2025
https://github.com/alphagov/paas-release-ci
Central release CI repository
bosh concourse paas reliability-engineering
Last synced: 08 May 2025
https://github.com/carlescg/faulttreetutorial
This is a tutorial for the package FaultTree from openreliability.com
fault-tree reliability-engineering risk-analysis risk-models tutorial
Last synced: 13 Jul 2025
https://github.com/sbittla/gatling-javafaker-maven
Generating realistic test data or simulating load with authentic, dynamic data using the Gatling framework and JavaFaker
data-sampling datageneration gatling gatling-example gatling-frontline gatling-plugin gatling-simulations javafaker load-testing loadtesting maven open-source opensource performance-monitoring performance-optimization performance-testing reliability-engineering scala scalability test-data-generator
Last synced: 23 Feb 2026
https://github.com/heiderjeffer/misalignment-between-ownership-and-contribution-affects-system-reliability
Research Proposals RP
archtecture data-analysis data-collection nvivo-software python qualitative-analysis quantative-analysis reliability-engineering software-engineering
Last synced: 23 Feb 2026
https://github.com/alphagov/paas-prometheus-charts
generate SVG charts for PromQL queries
Last synced: 06 Jul 2025
https://github.com/exospherehost/ai-reliability-standards
Architectural standards and best practices for building reliable AI Agents and LLM workflows. Defining the framework for AI Reliability Engineering (AIRE).
ai ai-agents ai-reliability aiops durable-execution enterprise evals evaluation observability reliability-engineering sre
Last synced: 15 Feb 2026
https://github.com/alphagov/white-chapel-building-map
🏢 🌐 Maps of the GDS office in the White Chapel Building
Last synced: 08 May 2025
https://github.com/govuk-paas/paas-prometheus-charts
generate SVG charts for PromQL queries
Last synced: 11 Oct 2025
https://github.com/alphagov/paas-service-broker-base
Provides a base for building new service brokers
broker cloud-foundry paas reliability-engineering
Last synced: 08 May 2025
https://github.com/exospherehost/claudeye
Watchtower for Claude Code & Agents SDK - replay sessions, run custom evals, debug agent traces. Uncover, Understand, and Utilize
ai-agents claude claude-agents-sdk claude-code cli eval eval-framework npm observability reliability reliability-engineering
Last synced: 01 Mar 2026
https://github.com/alphagov/paas-sqs-broker
An Open Service Broker API-compatible service broker for AWS SQS
aws broker cloud-foundry paas reliability-engineering sqs
Last synced: 08 May 2025
https://github.com/thalesgroup/statistical-reliability-ml
This package provides implementations of Monte Carlo methods to estimate the probability of failure of neural networks under noisy inputs.
monte-carlo-methods neural-networks reliability-engineering statistical-analysis
Last synced: 17 Mar 2025
https://github.com/1mb-dev/autobreaker
Adaptive circuit breaker for Go with percentage-based thresholds that automatically adjust to traffic patterns. Zero dependencies, <100ns overhead.
adaptive-threshold circuit-breaker distributed-systems fault-tolerance go golang microservices observability reliability-engineering resilience
Last synced: 24 Feb 2026
https://github.com/alphagov/paas-auditor
Stores Cloud Foundry audit events in a Postgres database
cloud-foundry paas reliability-engineering security
Last synced: 08 May 2025
https://github.com/alphagov/paas-metrics-collector-cw
PaaS metrics collector to CloudWatch
cloudwatch paas reliability-engineering
Last synced: 08 May 2025
https://github.com/openpra-org/inverse-canopy
An inverse estimation technique for back-fitting conditional/functional event probability distributions in an event tree to match target end-state frequencies.
event-tree fault-tree pra probabilistic-risk-assessment probability-distribution reliability reliability-engineering risk-assessment
Last synced: 11 Apr 2026
https://github.com/alphagov/paas-observability-release
A repository for the observability prototype for GOV.UK PaaS
bosh observability paas reliability-engineering
Last synced: 08 May 2025
https://github.com/alphagov/paas-accounts
cloud-foundry paas reliability-engineering
Last synced: 06 Jul 2025
https://github.com/alphagov/paas-rds-metric-collector
A small application that connects to all RDS instances hosted on GOV.UK PaaS, gathers metrics, and pushes them to Loggregator.
aws cloud-foundry metrics paas reliability-engineering
Last synced: 19 Sep 2025
https://github.com/alphagov/paas-s3-broker-boshrelease
aws bosh broker cloud-foundry paas reliability-engineering
Last synced: 08 Jul 2025
https://github.com/sunilp/agentic-ai
A practical field guide to building reliable, evaluable, and production-grade agentic AI systems
agent-architecture agent-evaluation agentic-ai ai-agents ai-engineering ai-safety artificial-intelligence book evaluation field-guide generative-ai human-in-the-loop large-language-models llm multi-agent-systems production-ai python reliability-engineering
Last synced: 02 Apr 2026
https://github.com/alphagov/paas-log-cache-adapter
cloud-foundry metrics-exporter paas prometheus reliability-engineering
Last synced: 08 May 2025
https://github.com/jsabo/aws-lambda-failure-flag-app
This project demonstrates how to integrate Gremlin Failure Flags into an AWS Lambda function, enabling you to simulate injected latency and exceptions while measuring processing performance. It’s a serverless demo for testing resiliency and fault injection in real-world cloud environments.
aws chaos-engineering gremlin lambda performance-testing reliability-engineering
Last synced: 08 Jul 2025
https://github.com/alphagov/paas-db-admin-boshrelease
A Bosh release for running administrative operations against Postgres
bosh paas reliability-engineering
Last synced: 08 May 2025
https://github.com/alphagov/paas-submit
🔥 Firebreak Project ℹ️
node paas reliability-engineering typescript webpack
Last synced: 08 May 2025
https://github.com/alphagov/paas-sqs-broker-boshrelease
aws bosh broker cloud-foundry paas reliability-engineering
Last synced: 26 Jun 2025
https://github.com/chirag2203/softwarereliability_gwo_ieee
Research paper on "Comparative analysis of software reliability using GWO and ML", published by IEEE at the IATMSI conference.
gwo-optimization-algorithm ieee ml python reliability-engineering software-reliability
Last synced: 07 Sep 2025
https://github.com/alphagov/paas-cdn-broker-boshrelease
A bosh release for 18F's CDN broker
aws bosh broker cloud-foundry paas reliability-engineering
Last synced: 08 May 2025
https://github.com/allenpandas/se4ml-toolkit
Research tools for the intersection of artificial intelligence and computer security 🔧 SE4ML: Security for Machine Learning. This repository is a toolkit for the security, robustness, and reliability of machine learning.
ai-security aisecurity machine-learning reinforcement-learning reliability-engineering robustness security software-engineering software-testing tool toolkit
Last synced: 07 Aug 2025
https://github.com/alphagov/paas-rubbernecker
A summary of stuff in PivotalTracker
non-platform-tools paas reliability-engineering
Last synced: 31 Aug 2025
https://github.com/mosher-labs/helm-charts
🚀 This repository serves as a centralized collection of Helm charts for deploying and managing Kubernetes applications. 🎯
axes devops helm helm-charts infrastructure-as-code k8s kubernetes mosher-labs reliability-engineering viking
Last synced: 26 Feb 2025
https://github.com/mosher-labs/basic-repo-template
🚀 This repository serves as a basic template for creating new repositories. It's designed to be a foundation for structure and organization. 🎯
axes devops infrastructure-as-code mosher-labs reliability-engineering templates viking
Last synced: 05 Mar 2026
https://github.com/juanfranciscocis/devprobe_tesis
DevProbe is a progressive web application that provides a platform for Site Reliability Engineers to monitor their websites. The app is built with Ionic, Angular, and Firebase.
angular gemini gemini-api ionic ionic-framework reliability-engineering site site-reliability-engineering site-reliability-engineering-sre sre sre-team typescript
Last synced: 10 Apr 2026
https://github.com/texasbe2trill/constellation-engine
A dependency graph–driven system for reasoning about failure propagation, blast radius, and architectural risk in complex systems.
ai-reasoning architecture good-first-issue graph-theory python reliability-engineering systems-engineering
Last synced: 26 Jan 2026
https://github.com/alphagov/paas-deployments
node paas reliability-engineering typescript
Last synced: 14 Jul 2025
https://github.com/pedroliman/uniconf
University reliability software
reliability reliability-engineering shiny shinyapps
Last synced: 01 Apr 2025
https://github.com/mosher-labs/ansible-node-setup
🚀 This repo provides Ansible playbooks and roles designed to configure and manage nodes for lightweight Kubernetes clusters using K3s. 🎯
ansible axes devops infrastructure-as-code mosher-labs reliability-engineering viking
Last synced: 16 Jan 2026
https://github.com/vsamidurai/outage-reports
Curated list of technical outage/incident reports
cloud learning reliability-engineering sharing-is-caring
Last synced: 02 Feb 2026
https://github.com/alphagov/paas-elasticache-broker-boshrelease
aws bosh broker cloud-foundry paas reliability-engineering
Last synced: 05 Jul 2025
https://github.com/mosher-labs/.github
⚡⚔️🌩️ Welcome to Mosher Labs! Combining Scandinavian heritage and cutting-edge cloud technologies, we deliver precision-crafted solutions with AWS and Infrastructure as Code. ⚡⚔️🌩️
automation aws axes cicd-pipelines cloud-computing cloud-native devops homelab infrastructure-as-code mosher-labs reliability-engineering viking
Last synced: 25 Apr 2026
https://github.com/cakmoel/resilio
Professional technology-agnostic load testing suite built for performance engineering and durability auditing. Implements research-based methodologies (Jain, 1991) and ISO 25010 standards to validate speed, endurance, and scalability across any backend stack.
apachebench benchmarking devops-tools endurance-testing load-testing performance-testing quality-assurance reliability-engineering scalability sre stress-testing tech-agnostic
Last synced: 13 Apr 2026
https://github.com/kretski/orac-nt-ssd-thermal-sdk
Deterministic Physics-based Thermal Control SDK for NVMe Controllers. Extending NAND lifetime by 31.6% using ORAC-NT Vitality modeling and Arrhenius-based BER optimization.
nand-flash-memory nvme reliability-engineering ssd-controller thermal-management
Last synced: 06 Apr 2026
https://github.com/heiderjeffer/https-gitlab.inf.unibz.it-heider.jeffer-vv1718_ayo
MSc in Computer Science UNIBZ. Free University of Bozen-Bolzano
bash-script java reliability-engineering shell-script statistics testing
Last synced: 02 Apr 2025
https://github.com/rebound-how/rebound
The open source toolbox for resilient operations
agentic-ai ai chaos-engineering chaostoolkit devops mcp-server reliability-engineering reliability-tools resilience resilience-testing sre
Last synced: 08 Jul 2025
https://github.com/vnykmshr/autobreaker
Adaptive circuit breaker for Go with percentage-based thresholds that automatically adjust to traffic patterns. Zero dependencies, <100ns overhead.
adaptive-threshold circuit-breaker distributed-systems fault-tolerance go golang microservices observability reliability-engineering resilience
Last synced: 23 Jan 2026
https://github.com/mosher-labs/basic-helm-charts-template
🚀 This repo provides a clean, minimal starting point for creating and managing Helm charts for Kubernetes applications. 🎯
axes devops helm helm-chart helm-charts infrastructure-as-code k8s kubernetes mosher-labs reliability-engineering viking
Last synced: 29 Jan 2026