https://github.com/dastergon/awesome-chaos-engineering

# Awesome Chaos Engineering [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

A curated list of awesome [Chaos Engineering](http://principlesofchaos.org/) resources.

#### What is Chaos Engineering?
> Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. - [Principles Of Chaos Engineering](http://principlesofchaos.org/) website.

## Contents
- [Culture](#culture)
- [Books](#books)
- [Education](#education)
- [Notable Tools](#notable-tools)
- [Retired Tools](#retired-tools)
- [Cloud Services](#cloud-services)
- [Papers](#papers)
- [Gamedays](#gamedays)
- [Blogs & Newsletters](#blogs--newsletters)
- [Podcasts](#podcasts)
- [Conferences & Meetups](#conferences--meetups)
- [Forums](#forums)
- [Contributing](#contributing)

## Culture
* [Principles Of Chaos Engineering](http://principlesofchaos.org/)
* [Chaos Community](http://chaos.community/)
* [Chaos Engineering](https://www.infoq.com/articles/chaos-engineering)
* [O'Reilly Velocity San Jose 2017: Precision Chaos](https://www.youtube.com/watch?v=C11LNUEaHuo)
* [The Discipline of Chaos Engineering](https://www.gremlin.com/blog/the-discipline-of-chaos-engineering/)
* [Chaos Monkey for Fun and Profit](https://sharpend.io/chaos-monkey-for-fun-and-profit/)
* [Fault Injection in Production: Making the case for resilience testing](https://queue.acm.org/detail.cfm?id=2353017)
* [Lord of Chaos - Becoming a Chaos Engineer](https://vimeo.com/groups/jz2016/videos/181925286)
* [Chaos testing - Preventing failure by instigation](http://www.cakesolutions.net/teamblogs/chaos-testing-preventing-failure-by-instiga)
* [Orchestrated Chaos](https://docs.google.com/presentation/d/1zzHS3qoPGzwsSna5-uk3Xt7LW_3Fr6ag8JDkeyrKwL4/edit#slide=id.p)
* Choose your own adventure: Chaos Engineering - [Video](https://www.infoq.com/presentations/adopt-chaos-engineering) & [Slides](https://www.slideshare.net/NoraJones1/choose-your-own-adventure-qcon-2017-1)
* [AMA Chaos Engineering + DiRT](http://pages.catchpoint.com/AMA-Chaos-DiRT.html)
* [SRECON17: Principles of Chaos Engineering](https://www.usenix.org/conference/srecon17americas/program/presentation/rosenthal)
* [Chaos & Intuition Engineering at Netflix](https://www.youtube.com/watch?v=Q4nniyAarbs)
* [Mastering Chaos - A Netflix Guide to Microservices](https://www.youtube.com/watch?v=CZ3wIuvmHeM)
* [Too big to test: Breaking a production brokerage platform without causing financial devastation](https://conferences.oreilly.com/velocity/devops-web-performance-ny-2015/public/schedule/detail/45012)
* [Inside Azure Search: Chaos Engineering](https://azure.microsoft.com/en-us/blog/inside-azure-search-chaos-engineering/)
* [Netflix, the Simian Army, and the culture of freedom and responsibility](https://devops.com/netflix-the-simian-army-and-the-culture-of-freedom-and-responsibility/)
* [FIT: Failure Injection Testing](https://medium.com/netflix-techblog/fit-failure-injection-testing-35d8e2a9bb2)
* [The Netflix Simian Army](https://medium.com/netflix-techblog/the-netflix-simian-army-16e57fbab116)
* [Automated Failure Testing](https://medium.com/netflix-techblog/automated-failure-testing-86c1b8bc841f)
* [The Verification of a Distributed System by Caitie McCaffrey](http://queue.acm.org/detail.cfm?ref=rss&id=2889274)
* [The Journey to Chaos Engineering begins with a single step - Bruce Wong and James Burns (Twilio)](https://www.youtube.com/watch?v=rKAo2wANiHM)
* [Chaos Engineering by Lorin Hochstein](https://www.youtube.com/watch?v=vq4QZ4_YDok)
* [Aaron Rinehart - ChaoSlingr: Introducing Security based Chaos Testing](https://www.youtube.com/watch?v=BLRb-E0G5zk)
* [Chaos Engineering - Casey Rosenthal](https://www.youtube.com/watch?v=6OIOpx_dVFY)
* The Road to Chaos - Velocity 2017 - [Video](https://www.youtube.com/watch?v=FCZVAZaXIjs) & [Slides](https://github.com/norajones/Presentations/blob/master/The%20Road%20To%20Chaos%20-%20Velocity%202017.pdf)
* [How Netflix DDoS’d Itself To Help Protect the Entire Internet](https://www.wired.com/story/netflix-ddos-attack)
* [10 Years of Crashing Google](https://www.usenix.org/conference/lisa15/conference-program/presentation/krishnan)
* [Weathering the Unexpected](http://queue.acm.org/detail.cfm?id=2371516)
* [SRECON17: Breaking Things on Purpose](https://youtu.be/h_-shm0SL08)
* [PuppetConf 2016: Chaos Patterns - Architecting for Failure in Distributed Systems](https://youtu.be/V3P35N_HXNQ)
* [Ship More, Sink Less - Changing Chaos Engineering and Distributed Tracing](https://youtu.be/nr2KWbyWAmA)
* [Cloudcast - Discipline of Chaos Engineering](http://www.thecloudcast.net/2017/05/the-cloudcast-299-discipline-of-chaos.html)
* [Software Engineering Daily - Failure Injection with Kolton Andrus podcast](https://softwareengineeringdaily.com/2017/03/29/failure-injection-with-kolton-andrus/)
* [Responding to Failures in Playback Features with Haley Tucker podcast](https://www.infoq.com/podcasts/netflix-haley-tucker?utm_campaign=infoq_content&utm_source=twitter&utm_medium=feed&utm_term=architecture-design)
* ["Antics, drift, and chaos" by Lorin Hochstein](https://youtu.be/SM2uXpmyJmA)
* [re:invent 2017: Nora Jones Describes Why We Need More Chaos - Chaos Engineering, That Is](https://youtu.be/rgfww8tLM0A)
* [Failure Friday: Four Years On](https://www.pagerduty.com/blog/failure-fridays-four-years/)
* [Monkeys & Lemurs and Locusts, Oh my!](https://www.slideshare.net/zgrinch/monkeys-lemurs-and-locusts-oh-my)
* [Practical Chaos Engineering](https://youtu.be/Yn4tYxqzFVU)
* [Chaos Day in the Met Office Cloud](https://www.cloudreach.com/fr/blog/training-cloud-operations-teams-met-office/)
* [Cloud Native and Chaos Engineering](https://medium.com/chaosiq/cloud-native-and-chaos-engineering-20842ee2fa8a)
* [Chaos Engineering with Kolton Andrus](https://softwareengineeringdaily.com/2018/02/02/chaos-engineering-with-kolton-andrus/)
* [Chaos Engineering: the history, principles, and practice](https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice/)
* [Embracing the Chaos of Chaos Engineering](https://blog.codeship.com/embracing-the-chaos-of-chaos-engineering/)
* [Designing Services for Resilience: Netflix Lessons](https://www.infoq.com/presentations/netflix-microservices-resiliency)
* [Chaos Engineering: A cheat sheet](https://www.techrepublic.com/article/chaos-engineering-a-cheat-sheet/)
* [How to convince your boss and make them say “Yes!” to Chaos Engineering?](https://medium.com/@crochefolle/how-to-convince-your-boss-to-make-them-say-yes-to-chaos-engineering-796ba119bd7)
* [Why the World Needs More Resilient Systems](https://www.infoq.com/news/2018/03/resilient-systems-chaos-engineer)
* [Chaos Architecture](https://www.infoq.com/presentations/chaos-architecture-mindset)
* [Gremlin’s Tammy Bütow on the Business Side of Chaos Engineering](https://thenewstack.io/gremlins-tammy-butow-on-the-business-side-of-chaos-engineering/)
* [Kubernetes Chaos Engineering: Lessons Learned](https://learnk8s.io/blog/kubernetes-chaos-engineering-lessons-learned)
* [Chaos Engineering: managing complexity by breaking things](https://hub.packtpub.com/chaos-engineering-managing-complexity-by-breaking-things/)
* [Podcast: Database Chaos with Tammy Butow](https://softwareengineeringdaily.com/2018/04/10/database-chaos-with-tammy-butow/)
* [LinkedOut: A Request-Level Failure Injection Framework](https://engineering.linkedin.com/blog/2018/05/linkedout--a-request-level-failure-injection-framework)
* [GOTO 2018 - Breaking Things on Purpose - Kolton Andrus](https://youtu.be/S89ox7oQn8s)
* [Why should Chaos be part of your Distributed Systems Engineering?](https://medium.com/@bbideep/why-should-chaos-be-part-of-your-distributed-systems-engineering-5bcb21497660)
* [Brian Holt - Chaos Monkeys in Your Browser What Chaos Engineering Means For the Front End](https://www.youtube.com/watch?v=A4_rRj-4Mv0)
* [Chaos Engineering: Why the World Needs More Resilient Systems](https://www.youtube.com/watch?time_continue=242&v=Khqf0XltR_M)
* QCon Beijing 2017: The Practice of Failure Management and Fault Injection at Alibaba E-Commerce Platforms - [Video](http://www.infoq.com/cn/presentations/ali-electricity-supplier-fault-management-and-fault-drills-practice) & [Transcript](http://jm.taobao.org/2017/06/22/20170622/) (both in Chinese)
* [Orchestrating Chaos using Grab's Experimentation Platform](https://engineering.grab.com/chaos-engineering)
* [Breaking to Learn: Chaos Engineering Explained](https://blog.newrelic.com/engineering/chaos-engineering-explained/)
* [Chaos Engineering Traps](https://medium.com/@njones_18523/chaos-engineering-traps-e3486c526059)
* [Chaos Engineering - The Art of Breaking Things Purposefully](https://medium.com/@adhorn/chaos-engineering-ab0cc9fbd12a)
* [Disasterpiece Theater: Slack’s process for approachable Chaos Engineering](https://slack.engineering/disasterpiece-theater-slacks-process-for-approachable-chaos-engineering-3434422afb54)
* [Taming chaos: Preparing for your next incident](https://www.oreilly.com/ideas/taming-chaos-preparing-for-your-next-incident)
* [The Future of Chaos Engineering w/ Conde Nast](https://www.youtube.com/watch?v=RqM2sMt11Bw)
* [Chaos Engineering For People Systems w/ Dave Rensin of Google](https://www.youtube.com/watch?v=sn6wokyCZSA)
* [Performing chaos engineering in a serverless world (AWS re:Invent 2019 CMY301)](https://www.youtube.com/watch?v=vbyjpMeYitA)
* [Building Confidence in Healthcare Systems through Chaos Engineering](https://www.infoq.com/presentations/cerner-resiliency)
* [Break Your App before Someone Else Does](https://www.infoq.com/presentations/test-android-apk/)
* [Preparing for Traffic Spikes with Chaos Engineering](https://www.bigmarker.com/gremlin/Preparing-for-Traffic-Spikes-with-Chaos-Engineering)
* [Automating Chaos Engineering GameDays with Terraform](https://www.youtube.com/watch?v=NOOgKNbW0gk)
* [Postmortem Culture: Learning from failure](https://www.youtube.com/watch?v=JtLrlDNdJzg&feature=youtu.be)
* [Problem Detection by John Allspaw](https://www.youtube.com/watch?v=NxctiGRI2y8)
* [New Paradigms for the Next Era of Security](https://www.rsaconference.com/industry-topics/webcast/35-new-paradigms-for-the-next-era-of-security)
* [Cloud-Native Chaos Engineering](https://dev.to/umamukkara/chaos-engineering-for-cloud-native-systems-2fjn)
* [Building resilient services at Prime Video with chaos engineering](https://aws.amazon.com/blogs/opensource/building-resilient-services-at-prime-video-with-chaos-engineering/)
* [Making Chaos Part of Kubernetes/OpenShift Performance and Scalability Tests](https://www.openshift.com/blog/making-chaos-part-of-kubernetes/openshift-performance-and-scalability-tests)
* [Lucky Lotto, chaos engineering but for teams](https://danlebrero.com/2021/06/30/cto-dairy-lucky-lotto-chaos-engineering-for-teams/)
* [Using Fault Injection Testing to Improve DoorDash Reliability](https://doordash.engineering/2022/04/25/using-fault-injection-testing-to-improve-doordash-reliability/)
* [Chaos Engineering At Ant Group](https://medium.com/@monkeysuzie/chaos-engineering-at-ant-group-30c15cb6ab69)

## Books
* [Chaos Engineering: Building Confidence in System Behavior through Experiment](http://www.oreilly.com/webops-perf/free/chaos-engineering.csp)
* [Site Reliability Engineering: How Google Runs Production Systems](https://landing.google.com/sre/book.html)
* [The Practice Of Cloud System Administration: Designing and Operating Large Distributed Systems](http://the-cloud-book.com/)
* [Antifragile Systems and Teams](http://www.oreilly.com/webops-perf/free/antifragile-systems-and-teams.csp)
* [The InfoQ eMag: Chaos Engineering](https://www.infoq.com/minibooks/emag-chaos-engineering)
* [Learning Chaos Engineering](http://shop.oreilly.com/product/0636920251897.do)
* [Chaos Engineering: System Resilience in Practice](https://www.oreilly.com/library/view/chaos-engineering/9781492043850/)
* [Chaos Engineering: Crash test your applications](https://www.manning.com/books/chaos-engineering)
* [Security Chaos Engineering: Gaining Confidence in Resilience and Safety at Speed and Scale](https://www.oreilly.com/library/view/security-chaos-engineering/9781492080350/)
* [Chaos Engineering Observability](https://www.humio.com/resources/reports/chaos-observability/)

## Education
* A Chaos Engineering Bootcamp for O'Reilly Velocity 2017 - [Slides](https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp) & [Source code](https://github.com/tammybutow/chaos_engineering_bootcamp)
* [Your First Chaos Experiment](https://www.gremlin.com/community/tutorials/your-first-chaos-experiment)
* [Chaos Engineering 101](https://sharpend.io/chaos-engineering-101/)
* [A Primer on Automating Chaos](https://www.gremlin.com/community/tutorials/a-primer-on-automating-chaos)
* [Intro to Chaos Engineering](https://www.youtube.com/watch?v=qHykK5pFRW4)
* [Learn the basics of the Chaos Toolkit](https://www.katacoda.com/chaostoolkit/courses/01-chaostoolkit-getting-started)
* [Build System Confidence with Chaos Engineering](https://medium.com/chaosiq/improve-your-cloud-native-devops-flow-with-chaos-engineering-dc32836c2d9a)
* [How we break things at Twitter: failure testing](https://blog.twitter.com/engineering/en_us/a/2015/how-we-break-things-at-twitter-failure-testing.html)
* [Run Chaos Experiments Without Risking Your Job](https://blog.loadmill.com/run-chaos-experiments-without-risking-your-job-2c8a5f4b0bfc)
* [A Guide to Your First Chaos Day](https://victorops.com/blog/a-guide-to-your-first-chaos-day)
* [Planning Your Own Chaos Day](https://www.gremlin.com/community/tutorials/planning-your-own-chaos-day/)
* [How To Install Distributed Tensorflow on GCP and Perform Chaos Engineering Experiments](https://www.gremlin.com/community/tutorials/how-to-install-distributed-tensorflow-on-gcp-and-perform-chaos-engineering-experiments/)
* [Monitoring Your Chaos Experiments](https://www.brighttalk.com/webcast/15087/316835)
* [Increasing the Resilience of APIs with Chaos Engineering](https://www.infoq.com/news/2018/05/gremlin-api-chaos)
* [3 key steps for running chaos engineering experiments](https://www.infoworld.com/article/3268017/devops/3-key-steps-for-running-chaos-engineering-experiments.html)
* [Exploring Multi-level Weaknesses using Automated Chaos Experiments](https://medium.com/chaosiq/exploring-multi-level-weaknesses-using-automated-chaos-experiments-aa30f0605ce)
* [Chaos Monkey Guide for Engineers](https://www.gremlin.com/chaos-monkey/)
* [Chaos Engineering for Serverless](https://www.youtube.com/playlist?list=PL70SCo-0vujiQkPAOGuZP-kNZZkzcPVKD)
* [Network Fire Drills with Chaos Engineering](https://speakerdeck.com/homingli/network-automation-meetup-network-fire-drills-with-chaos-engineering)
* [DevOps Foundations: Chaos Engineering](https://www.linkedin.com/learning/devops-foundations-chaos-engineering/)
* [Resilience Engineering: Short Course](http://csel.org.ohio-state.edu/ResilienceEngineering.html)
* [The Chaos Engineering Collection](https://medium.com/@adhorn/the-chaos-engineering-collection-5e188d6a90e2)
* [Pentester Academy](https://www.pentesteracademy.com/onlinelabs)
* [Consul and Chaos Engineering](https://learn.hashicorp.com/tutorials/consul/introduction-chaos-engineering?in=consul/resiliency)

## Notable Tools
* [Chaos Monkey](https://github.com/Netflix/chaosmonkey) - A resiliency tool that helps applications tolerate random instance failures.
* [orchestrator](https://github.com/github/orchestrator) - MySQL replication topology management and HA.
* [kube-monkey](https://github.com/asobti/kube-monkey) - An implementation of Netflix's Chaos Monkey for Kubernetes clusters.
* [Gremlin Inc.](https://www.gremlin.com/) - Failure as a Service.
* [Chaos Toolkit](https://github.com/chaostoolkit/chaostoolkit) - A chaos engineering toolkit to help you build confidence in your software system.
* [steadybit](https://www.steadybit.com/) - A Chaos Engineering platform (SaaS or On-Prem) with auto discovery features, different attack types, user management and many more.
* [PowerfulSeal](https://github.com/bloomberg/powerfulseal) - Adds chaos to your Kubernetes clusters, so that you can detect problems in your systems as early as possible. It kills targeted pods and takes VMs up and down.
* [drax](https://github.com/dcos-labs/drax) - DC/OS Resilience Automated Xenodiagnosis tool. It helps to test DC/OS deployments by applying a Chaos Monkey-inspired, proactive and invasive testing approach.
* [WireMock](http://wiremock.org/) - API mocking (Service Virtualization) which enables modeling real world faults and delays.
* [MockLab](http://get.mocklab.io/) - API mocking (Service Virtualization) as a service which enables modeling real world faults and delays.
* [Pod-Reaper](https://github.com/target/pod-reaper) - A rules based pod killing container. Pod-Reaper was designed to kill pods that meet specific conditions that can be used for Chaos testing in Kubernetes.
* [Muxy](https://github.com/mefellows/muxy/) - A chaos testing tool for simulating real-world distributed system failures.
* [Toxiproxy](https://github.com/Shopify/toxiproxy) - A TCP proxy to simulate network and system conditions for chaos and resiliency testing.
* Chaos engineering for Docker:
  * [Pumba](https://github.com/gaia-adm/pumba) - Chaos testing and network emulation for Docker containers (and clusters).
  * [Blockade](https://github.com/worstcase/blockade) - Docker-based utility for testing network failures and partitions in distributed applications.
* [chaos-lambda](https://github.com/bbc/chaos-lambda) - Randomly terminate ASG instances during business hours.
* [Namazu](https://github.com/osrg/namazu) - Programmable fuzzy scheduler for testing distributed systems.
* [Chaos Monkey for Spring Boot](https://codecentric.github.io/chaos-monkey-spring-boot/) - Injects latencies, exceptions, and terminations into Spring Boot applications.
* [Byte-Monkey](https://github.com/mrwilson/byte-monkey) - Bytecode-level fault injection for the JVM. It works by instrumenting application code on the fly to deliberately introduce faults like exceptions and latency.
* [GomJabbar](https://github.com/outbrain/GomJabbar) - Chaos Monkey for your private cloud.
* [Turbulence](https://github.com/cppforlife/turbulence-release) - Tool focused on BOSH environments capable of stressing VMs, manipulating network traffic, and more. It is very similar to Gremlin.
* [chaosblade](https://github.com/chaosblade-io/chaosblade) - An Easy to Use and Powerful Chaos Engineering Toolkit.
* [KubeInvaders](https://github.com/lucky-sideburn/KubeInvaders) - Gamified Chaos Engineering tool for Kubernetes clusters.
* [Cthulhu](https://github.com/xmatters/cthulhu-chaos-testing) - Chaos Engineering tool that helps evaluate the resiliency of microservice systems by simulating various disaster scenarios against a target infrastructure in a data-driven manner.
* [VMware Mangle](https://vmware.github.io/mangle/) - Orchestrating Chaos Engineering.
* [Byteman](https://byteman.jboss.org/) - A Swiss Army Knife for Byte Code Manipulation.
* [Litmus](https://github.com/litmuschaos/litmus) - Framework for Kubernetes environments that enables users to run test suites, capture logs, generate reports and perform chaos tests.
* [Perses](https://github.com/nicolasmanic/perses) - A project to cause (controlled) destruction to a JVM application.
* [ChaosKube](https://github.com/linki/chaoskube) - chaoskube periodically kills random pods in your Kubernetes cluster.
* [Chaos Mesh](https://github.com/chaos-mesh/chaos-mesh) - Chaos Mesh is a cloud-native Chaos Engineering platform that orchestrates chaos on Kubernetes environments.
* [failure-lambda](https://github.com/gunnargrosch/failure-lambda) - A small Node module for injecting failure into AWS Lambda using latency, exception, statuscode or diskspace.
* [aws-chaos-scripts](https://github.com/adhorn/aws-chaos-scripts) - Collection of Python scripts to run failure injection on AWS infrastructure.
* [chaos-ssm-documents](https://github.com/adhorn/chaos-ssm-documents) - Collection of AWS SSM Documents to perform Chaos Engineering experiments.
* [aws-lambda-chaos-injection](https://github.com/adhorn/aws-lambda-chaos-injection) - A library injecting chaos into AWS Lambda. It offers simple python decorators to do delay, exception and statusCode injection and a Class to add delay to any 3rd party dependencies.
* [chaos-dingo](https://github.com/jmspring/chaos-dingo) - A tool to mess with Azure services using the Azure NodeJS SDK.
* [Chaos HTTP Proxy](https://github.com/bouncestorage/chaos-http-proxy) - Introduce failures into HTTP requests via a proxy server.
* [Chaos Lemur](https://github.com/strepsirrhini-army/chaos-lemur) - A self-hostable application to randomly destroy virtual machines in a BOSH-managed environment.
* [Simoorg](https://github.com/linkedin/simoorg) - LinkedIn’s very own failure inducer framework.
* [react-chaos](https://github.com/jchiatt/react-chaos) - A chaos engineering tool for your React apps.
* [vue-chaos](https://github.com/aviadhahami/vue-chaos) - A chaos engineering tool for your Vue apps.
* [Chaos Engine](https://github.com/ThalesGroup/chaos-engine) - A tool designed to intermittently destroy or degrade application resources running in cloud-based infrastructure. [Documentation](https://thalesgroup.github.io/chaos-engine/).
* [kubedoom](https://github.com/storax/kubedoom) - Kill Kubernetes pods by playing Id's DOOM.
* [kubethanos](https://github.com/berkay-dincer/kubethanos) - Kills half of your randomly selected Kubernetes pods.
* [go-fault](https://github.com/github/go-fault) - Fault injection middleware in Go.
* [Proofdock's Chaos Engineering Platform](https://proofdock.io) - A chaos engineering platform that seamlessly integrates in Azure DevOps and has a focus on the Azure cloud platform.
* [Pystol](https://www.pystol.org/docs) - Pystol is a fault injection platform allowing users to execute fault injection Actions in cloud-native environments in a controlled and prescribed way.
* [AWSSSMChaosRunner](https://github.com/amzn/awsssmchaosrunner) - Amazon's light-weight open-source library for chaos engineering on AWS. It can be used for [EC2](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html), [ECS (with EC2 launch type)](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/getting-started-ecs-ec2.html) and [Fargate](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/getting-started-fargate.html).
* [Kraken](https://github.com/cloud-bulldozer/kraken) - Chaos and resiliency testing tool for Kubernetes and OpenShift.
* [kube-burner](https://github.com/cloud-bulldozer/kube-burner) - A tool aimed at stressing Kubernetes clusters by creating or deleting a high quantity of objects.
* [Chaos Experimentation Framework](https://github.com/lyft/clutch) - An extensible platform for infrastructure management including Chaos Engineering.
* [NetHavoc](https://www.cavisson.com/nethavoc-resilience-testing-solution/) - A Chaos Engineering Tool for Linux, K8s, Windows, PCF, Cloud, and Containers for injecting Resource, Infrastructure, Network, and Application failures.
* [gorm-sqlchaos](https://github.com/u2386/gorm-sqlchaos) - A runtime SQL manipulator for your Golang applications based on gorm.
* [Chaos Frontend Toolkit](https://chaos-frontend-toolkit.web.app/) - A set of tools to apply Chaos Engineering to the frontend.
* [Mitigant](https://mitigant.io/) - A continuous security verification platform that builds confidence in cloud security posture by leveraging security chaos engineering.
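
Several of the tools above (for example aws-lambda-chaos-injection, failure-lambda, and Chaos Monkey for Spring Boot) are described as injecting latency and exceptions into application code. As a rough illustration of that general idea, and not the API of any tool listed here, the Python sketch below wraps a function so that every call is slowed down and a configurable fraction of calls fails. The names `inject_fault`, `InjectedFault`, and `with_fallback` are hypothetical and exist only for this example.

```python
import functools
import random
import time


class InjectedFault(RuntimeError):
    """Raised in place of a real failure when the experiment fires."""


def inject_fault(error_rate=0.0, latency_s=0.0, rng=None, sleep=time.sleep):
    """Decorator factory (hypothetical, for illustration only).

    error_rate -- probability in [0, 1] that a call raises InjectedFault
    latency_s  -- extra latency added to every call, in seconds
    rng        -- random.Random instance; pass a seeded one for repeatable runs
    sleep      -- sleep function, injectable so tests need not wait
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    if rng is None:
        rng = random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulate a slow dependency
            if rng.random() < error_rate:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


def with_fallback(func, fallback):
    """Call func(); return fallback if an injected fault fires."""
    try:
        return func()
    except InjectedFault:
        return fallback


if __name__ == "__main__":
    # Hypothetical dependency call: 30% of calls fail, all are 50 ms slower.
    @inject_fault(error_rate=0.3, latency_s=0.05, rng=random.Random(42))
    def fetch_profile():
        return "live profile"

    results = [with_fallback(fetch_profile, "cached profile") for _ in range(10)]
    print(results)
```

The point of such an experiment is the hypothesis around it: with the fault active, callers should still get a usable answer (here, the cached value). Real tools add the parts this sketch leaves out, such as targeting, scheduling, a kill switch, and limiting the blast radius.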

## Retired Tools
* [The Simian Army](https://github.com/Netflix/SimianArmy) - A suite of tools for keeping your cloud operating in top form.
* [ChaoSlingr](https://github.com/Optum/ChaoSlingr) - Introducing Security Chaos Engineering. ChaoSlingr focuses primarily on the experimentation on AWS Infrastructure to proactively instrument system security failure through experimentation.

## Cloud Services
* [Testing Amazon Aurora Using Fault Injection Queries](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/AuroraMySQL.Managing.html#AuroraMySQL.Managing.FaultInjectionQueries)
* [Azure Chaos Studio](https://aka.ms/azurechaosstudio) - A managed fault injection service for Azure applications. See also [Azure Fault Analysis Service](https://docs.microsoft.com/azure/service-fabric/service-fabric-testability-overview) for Azure Service Fabric applications.
* [Security Chaos Engineering for Cloud Services](https://medium.com/@run2obtain/from-resilience-to-dependability-security-chaos-engineering-for-cloud-services-9c6d6d152ed2)

## Papers
* [Maelstrom: Mitigating Datacenter-level Disasters by Draining Interdependent Traffic Safely and Efficiently](https://www.usenix.org/system/files/osdi18-veeraraghavan.pdf)
* [Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf)
* [Automating Failure Testing Research at Internet Scale](https://people.ucsc.edu/~palvaro/fit-ldfi.pdf)
* [Principles of Antifragile Software](https://arxiv.org/abs/1404.3056)
* [Why is random testing effective for partition tolerance bugs?](https://dl.acm.org/citation.cfm?id=3177123.3158134)
* [Chaos Engineering](https://arxiv.org/abs/1702.05843)
* [A Platform for Automating Chaos Experiments](https://arxiv.org/abs/1702.05849)
* [A Chaos Engineering System for Live Analysis and Falsification of Exception-handling in the JVM](https://arxiv.org/abs/1805.05246)
* [TripleAgent: Monitoring, Perturbation And Failure-obliviousness for Automated Resilience Improvement in Java Applications](https://arxiv.org/abs/1812.10706)
* [Lineage-driven Fault Injection](https://dl.acm.org/citation.cfm?id=2723711)
* [Antifragility is a Fragile Concept](https://www.linkedin.com/pulse/antifragility-fragile-concept-casey-rosenthal/)
* [Chaos Engineering Security](https://jaxenter.com/chaos-engineering-security-163358.html)
* [Security Chaos Engineering: A new paradigm for cybersecurity](https://opensource.com/article/18/1/new-paradigm-cybersecurity)
* [Security Challenges around Chaos Engineering](https://www.conjur.org/blog/security-challenges-around-chaos-engineering/)
* [CloudStrike: Security Chaos Engineering for Cloud Services](https://www.researchgate.net/publication/335922038_Security_Chaos_Engineering_for_Cloud_Services)
* [Observability and Chaos Engineering on System Calls for Containerized Applications in Docker](https://arxiv.org/abs/1907.13039)
* [Maximizing Error Injection Realism for Chaos Engineering with System Calls](https://arxiv.org/abs/2006.04444)
* [Chaos Engineering of Ethereum Blockchain Clients](https://arxiv.org/abs/2111.00221)

## Gamedays
* [Target: What is a Gameday?](https://tech.target.com/2019/05/09/chaos-engineering-at-Target.html) - Chaos Gamedays experience by Target.
* [Codecentric: Chaos Engineering Gamedays](https://blog.codecentric.de/en/2018/08/chaos-engineering-gameday/) - Chaos Gamedays by Codecentric.
* [New Relic: How to run a Gameday?](https://blog.newrelic.com/engineering/how-to-run-a-game-day/) - Chaos Gamedays experience by New Relic.
* [Dius: Gamedays resources](https://dius.com.au/resources/game-day/) - Resources for getting started with GameDay and Chaos Engineering.
* [Gremlin: Gamedays](https://www.gremlin.com/gameday/) - Resources for getting started with GameDay and Chaos Engineering.
* [Gremlin: What is a Chaos Day?](https://www.gremlin.com/community/tutorials/planning-your-own-chaos-day/#what-is-a-chaos-day) - What a Chaos Day is, according to Gremlin.
* [Gremlin: Why run a Chaos Day?](https://www.gremlin.com/community/tutorials/planning-your-own-chaos-day/#why-run-a-chaos-day) - Reasons to run Gamedays, according to Gremlin.
* [Gremlin: How to run a Gameday?](https://www.gremlin.com/community/tutorials/how-to-run-a-gameday/) - Methodology to run Gamedays, according to Gremlin.
* [Gremlin: Breaking DynamoDB](https://www.gremlin.com/community/tutorials/gremlin-gameday-breaking-dynamodb/) - Example of a Gameday with DynamoDB by Gremlin.
* [Gremlin: Introduction to Gameday](https://www.gremlin.com/community/tutorials/introduction-to-gamedays/) - What a Gameday is, according to Gremlin.
* [Gremlin: Planning your own Chaos Day](https://www.gremlin.com/community/tutorials/planning-your-own-chaos-day/) - Guide to planning a Chaos Day by Gremlin.
* [Gremlin: Inside Gremlin 2019 Gremlin Gamedays Roadmap](https://www.gremlin.com/community/tutorials/inside-gremlin-2019-gremlin-gamedays-roadmap/) - Chaos Gamedays experience by Gremlin.
* [Gremlin: What I learned running the Chaos Lab with Kafka](https://www.gremlin.com/community/tutorials/what-i-learned-running-the-chaos-lab-kafka-breaks/) - Example of a Gameday with Kafka by Gremlin.
* [Chaos Toolkit: Chaos Engineering with Humans in the loop](https://medium.com/chaos-toolkit/chaos-engineering-with-humans-in-the-loop-f4854900b1eb) - Article about Chaos Gamedays.
* [GoCardless: All fun and games until you start with Gamedays](https://gocardless.com/blog/game-days-at-gc/) - Article about Chaos Gamedays.
* [InfoQ: Gamedays - Achieving Resilience through Chaos Engineering](https://www.infoq.com/presentations/gameday-chaos-engineering) - InfoQ Presentation with experiences about Chaos Gamedays.

## Blogs & Newsletters
* [Netflix Technology Blog](https://medium.com/@NetflixTechBlog) - Learn more about how Netflix designs, builds, and operates our systems and engineering organizations.
* [Production Ready](https://tinyletter.com/production-ready) - A mailing list about building resilient infrastructure and tools.
* [SRE Weekly](https://sreweekly.com/) - Weekly Site Reliability Newsletter.
* [Site Reliability Engineering resources](https://github.com/dastergon/awesome-sre) - A curated list of awesome Site Reliability and Production Engineering resources.
* [SysAdvent](https://sysadvent.blogspot.com) - One article for each day of December, ending on the 25th article.
* [Gremlin Blog](https://blog.gremlininc.com) - Blogs on Chaos Engineering from Gremlin Inc.
* [O’Reilly Systems Engineering and Operations Newsletter](http://www.oreilly.com/webops-perf/newsletter.html) - Weekly systems engineering and operations news and insights from industry insiders.
* [LaunchDarkly Blog](http://blog.launchdarkly.com/) - Continuous delivery and feature flags blog.
* [Verica](https://www.verica.io/) - Chaos engineering, security chaos engineering and continuous verification.
* [Proofdock](https://medium.com/proofdock) - Reliability, resilience and chaos engineering with a focus on MS Azure.
* [LitmusChaos Blog](https://dev.to/t/litmuschaos/latest) - Blogs on Chaos Engineering from LitmusChaos.
* [ChaosEngineering.news](https://chaosengineering.news/) - Chaos Engineering newsletter. All things chaos engineering, directly to your inbox!
* [Chaos Mesh Blog](https://chaos-mesh.org/blog) - Blogs on Chaos Engineering from Chaos Mesh.
* [Chaos Experimentation Framework](https://eng.lyft.com/chaos-experimentation-an-open-source-framework-built-on-top-of-envoy-proxy-df87519ed681) - Chaos Experimentation, an open-source framework built on top of Envoy Proxy.
* [Squadcast](https://squadcast.com/blog) - Blog on Site Reliability Engineering.
* [steadybit Blog](https://www.steadybit.com/blog) - Blogs on Chaos Engineering, Resilience, SRE and OPS from steadybit.

## Podcasts
* [Break Things On Purpose](https://podcasts.apple.com/us/podcast/break-things-on-purpose/id1460542551) - Monthly podcast about Chaos Engineering presented by Gremlin Inc. Also available on Spotify, Google Play, and Stitcher.

## Conferences & Meetups
* [Chaos Carnival](https://chaoscarnival.io/) - A global two-day virtual conference for Cloud Native Chaos Engineering.
* [Chaos Conf](https://chaosconf.splashthat.com/) - A day of Chaos Engineering demos, expert advice, and connecting with peers who are putting chaos into practice at their companies.
* [SRECon Conferences](https://www.usenix.org/conferences/byname/925) - The official SRE conference.
* [LISA Conferences](https://www.usenix.org/conferences/byname/5) - Prominent conference about SysAdmin/DevOps/SRE.
* [O'Reilly Velocity Conference](https://conferences.oreilly.com/velocity/) - Prominent conference about Systems Engineering/DevOps/SRE.
* [Chaos Engineering Community Meetup Group](https://www.meetup.com/Chaos-Engineering-Community/) - Bay Area Meetup group for Chaos Engineers.
* [London Chaos Engineering Community](https://www.meetup.com/London-Chaos-Engineering-Community/) - London Area Meetup group for Chaos Engineers.
* [Stockholm Chaos Engineering Meetup](https://www.meetup.com/Stockholm-Chaos-Engineering-Community/) - Stockholm Meetup group for Chaos Engineers.
* [Chaos Engineering Community](https://www.meetup.com/pro/chaos/) - A collection of meetups across the globe about Chaos Engineering.
* [Conf42.com: Chaos Engineering](https://conf42.com) - Chaos Engineering for practitioners and adopters - London UK, 23 Jan 2020.
* [Kubernetes Chaos Engineering Meetup Group India](https://www.meetup.com/Kubernetes-Chaos-Engineering-Meetup-Group-India/) - India Meetup group for Chaos Engineers.

## Forums
* [Chaos Community Google Group](https://groups.google.com/forum/#!forum/chaos-community)
* [Chaos Engineering LinkedIn Group](https://www.linkedin.com/groups/7057761)
* [Chaos Engineering Slack Community](https://gremlin.com/community)
* [CNCF Chaos Engineering Working Group](https://groups.google.com/forum/#!forum/chaoseng-wg)
* CNCF Chaos Engineering Working Group Slack: `#chaosengineering` channel on slack.cncf.io
* [CNCF Chaos Engineering Working Group GitHub](https://github.com/chaoseng/wg-chaoseng)
* [Chaos Toolkit Slack Community](https://join.chaostoolkit.org)
* [Litmus Chaos Engineering Slack Community](https://slack.litmuschaos.io/)

## Contributing

Please take a look at the [contribution guidelines](CONTRIBUTING.md) first. Contributions are always welcome!