https://github.com/datadog/chaos-controller
:monkey: :fire: Datadog Failure Injection System for Kubernetes
chaos chaos-engineering chaos-monkey k8s kubernetes sre
- Host: GitHub
- URL: https://github.com/datadog/chaos-controller
- Owner: DataDog
- License: apache-2.0
- Created: 2019-03-15T12:13:57.000Z (over 6 years ago)
- Default Branch: main
- Last Pushed: 2025-03-19T14:48:39.000Z (7 months ago)
- Last Synced: 2025-04-01T09:19:43.377Z (6 months ago)
- Topics: chaos, chaos-engineering, chaos-monkey, k8s, kubernetes, sre
- Language: C
- Homepage:
- Size: 88.3 MB
- Stars: 193
- Watchers: 10
- Forks: 32
- Open Issues: 10
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS
README
**Oldest Kubernetes version supported: 1.16**
> :warning: **Kubernetes version 1.20.x is not supported!** _This [Kubernetes issue](https://github.com/kubernetes/kubernetes/issues/97288) prevents the controller from running properly on Kubernetes 1.20.0-1.20.4. Earlier versions of Kubernetes as well as 1.20.5 and later are still supported._
# Datadog Chaos Controller
> *:bomb: Disclaimer :bomb:*
>
> _The Chaos Controller allows you to disrupt your Kubernetes infrastructure through various means including but not limited to: bringing down resources you have provisioned and preventing critical data from being transmitted between resources. The use of Chaos Controller on your production system is done at your own discretion and risk._

The Chaos Controller is a Kubernetes controller with which you can inject various systemic failures, at scale, and without caring about the implementation details of your Kubernetes infrastructure. It was created with a specific mindset answering Datadog's internal needs:

* 🐇 **Be fast and operate at scale**
  * At Datadog, we run experiments that inject failures into, and clean them up from, thousands of targets within a few minutes.
* 🚑 **Be safe and operate in highly disrupted environments**
  * The controller is built to limit the blast radius of failures and to recover by itself in catastrophic scenarios.
* 💡 **Be smart and operate in various technical environments**
  * With Kubernetes, all environments are built differently.
  * Whatever your cluster configuration and implementation choices, the controller is able to inject failures by relying on low-level Linux kernel features such as cgroups, tc or even eBPF.
* 🪙 **Be simple and operate at low cost**
  * Most of the time, your Chaos Engineering platform is waiting and doing nothing.
  * We built this project so it uses resources only when it is really doing something:
    * No DaemonSet or any always-running processes on your nodes for injection, and no reserved resources when they are not needed.
    * Injection pods are created only when needed, killed once the experiment is done, and built to be evicted if necessary to free resources.
    * A single long-running pod, the controller, and nothing else!

## Getting Started
> :bulb: Read the [latest release quick installation guide](https://github.com/DataDog/chaos-controller/releases/latest) and the [configuration guide](docs/configuration.md) to learn how to deploy the controller.

Disruptions are built as short-lived resources which should be manually created and removed once your experiments are done. They should not be part of any application deployment. The `Disruption` resource is **immutable**: once applied, you can't edit it. To change the disruption definition, delete the existing resource and re-create it.

Getting started is as simple as creating a Kubernetes resource:

```yaml
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: node-failure
  namespace: chaos-demo # it must be in the same namespace as targeted resources
spec:
  selector: # a label selector used to target resources
    app: demo-curl
  count: 1 # the number of resources to target, can be a percentage if you suffix with "%", e.g. `count: 50%`
  duration: 1h # the amount of time before your disruption automatically terminates itself, for safety
  nodeFailure: # trigger a kernel panic on the target node
    shutdown: false # do not force the node to be kept down
```

To disrupt your cluster, run `kubectl apply -f .yaml`. You can clean up the disruption with `kubectl delete -f .yaml`. For your safety, we recommend you get started with the `dry-run` mode enabled.

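
As a sketch of a more cautious first run, the variant below uses the percentage form of `count` and turns on dry-run mode. It assumes the `dryRun` field described in the features guide is available in your controller version; the resource name `node-failure-dry-run` and the shorter `duration` are hypothetical choices made for this example.

```yaml
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: node-failure-dry-run # hypothetical name, chosen for this example
  namespace: chaos-demo # same namespace as the targeted resources
spec:
  dryRun: true # assumed field: walk through the injection workflow without applying the failure
  selector:
    app: demo-curl
  count: 50% # percentage form of count: target half of the matching resources
  duration: 10m # hypothetical shorter safety timeout for a first test
  nodeFailure:
    shutdown: false
```

Because the `Disruption` resource is immutable, switching from this dry run to a real injection means deleting the resource and applying a new one without `dryRun`.
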
> :open_book: The [features guide](docs/features.md) details all the features of the Chaos Controller.
> :open_book: The [examples guide](docs/examples.md) contains a list of various disruption files that you can use.
> Check out [Chaosli](./cli/chaosli/README.md) if you want some help understanding/creating disruption configurations.
## Chaos Scheduling
> New feature in `8.0.0`

The Chaos Controller now supports disruption scheduling, which lets you automate and test system resilience consistently. Instead of manual creation and deletion, use `DisruptionCron` to regularly disrupt long-lived Kubernetes resources like `Deployments` and `StatefulSets`.

### Example:
```yaml
apiVersion: chaos.datadoghq.com/v1beta1
kind: DisruptionCron
metadata:
  name: node-failure
  namespace: chaos-demo
spec:
  schedule: "*/15 * * * *" # every 15 minutes
  targetResource:
    kind: deployment
    name: demo-curl
  disruptionTemplate:
    count: 1
    duration: 1h
    nodeFailure:
      shutdown: false
```

To schedule a disruption in your cluster, run `kubectl apply -f .yaml`. To stop it, run `kubectl delete -f .yaml`.

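
Since `DisruptionCron` is also meant for `StatefulSets`, a schedule against one might look like the sketch below. It assumes that `targetResource.kind` accepts `statefulset` in the same way it accepts `deployment`; the names `weekday-node-failure` and `demo-db` are hypothetical, and the DisruptionCron guide remains the reference for the exact fields.

```yaml
apiVersion: chaos.datadoghq.com/v1beta1
kind: DisruptionCron
metadata:
  name: weekday-node-failure # hypothetical name, chosen for this example
  namespace: chaos-demo
spec:
  schedule: "0 9 * * 1-5" # standard cron syntax: 09:00, Monday through Friday
  targetResource:
    kind: statefulset # assumed value, mirroring `deployment` in the example above
    name: demo-db # hypothetical StatefulSet name
  disruptionTemplate:
    count: 1
    duration: 30m # hypothetical: each scheduled disruption ends well before the next run
    nodeFailure:
      shutdown: false
```

Keeping `duration` shorter than the interval between runs is a design choice in this sketch, so that one scheduled disruption has finished before the next is created.
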
> :mag_right: Check out [DisruptionCron guide](docs/disruption_cron.md) for more detailed information on how to schedule disruptions.
## Contributing
Chaos Engineering is necessarily different from system to system. We encourage you to try out this tool, and extend it for your own use cases. If you want to run the source code locally to make and test implementation changes, visit the [Contributing Doc](CONTRIBUTING.md). By the way, we welcome Pull Requests.
## Useful Links
- [Examples of disruptions](docs/examples.md)
- [General design](docs/design.md)
- [Reported metrics](docs/metrics_events.md)
- [FAQ](docs/faq.md)