{"id":13451611,"url":"https://github.com/atomix/chaos-controller","last_synced_at":"2025-04-07T16:31:55.247Z","repository":{"id":77333487,"uuid":"165804501","full_name":"atomix/chaos-controller","owner":"atomix","description":"Chaos controller for Kubernetes","archived":false,"fork":false,"pushed_at":"2019-03-25T22:35:03.000Z","size":16172,"stargazers_count":55,"open_issues_count":2,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-03T05:05:11.630Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/atomix.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-01-15T07:22:52.000Z","updated_at":"2024-12-06T15:29:59.000Z","dependencies_parsed_at":"2023-04-10T09:02:20.225Z","dependency_job_id":null,"html_url":"https://github.com/atomix/chaos-controller","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/atomix%2Fchaos-controller","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/atomix%2Fchaos-controller/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/atomix%2Fchaos-controller/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/atomix%2Fchaos-controller/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/atomix","download_url":"https://codeload.github.com/atomix/chaos-controller/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https
://github.com","kind":"github","repositories_count":247687978,"owners_count":20979574,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T07:00:56.980Z","updated_at":"2025-04-07T16:31:50.232Z","avatar_url":"https://github.com/atomix.png","language":"Go","funding_links":[],"categories":["others","Operators vs Controllers","Go"],"sub_categories":["Chaos / Loading Testing"],"readme":"# Chaos Controller\n\nThe Chaos Controller provides a controller for chaos testing in [Kubernetes][Kubernetes]\nand supports a rich set of failure scenarios. It relies on Linux and Docker\nfunctionality to inject network partitions and stress nodes.\n\n* [Setup](#setup)\n  * [Helm](#helm)\n  * [Manual setup](#manual-setup)\n* [Usage](#usage)\n  * [Scheduling](#scheduling)\n  * [Selectors](#selectors)\n  * [Crash monkey](#crash-monkey)\n  * [Partition monkey](#partition-monkey)\n  * [Stress monkey](#stress-monkey)\n* [Architecture](#architecture)\n  * [Controller](#controller)\n    * [Crash controller](#crash-controller)\n    * [Partition controller](#partition-controller)\n    * [Stress controller](#stress-controller)\n  * [Workers](#workers)\n    * [Crash workers](#crash-workers)\n    * [Partition workers](#partition-workers)\n    * [Stress workers](#stress-workers)\n\n## Setup\n\n### Helm\n\nA [Helm][Helm] chart is provided for setting up the controller. 
To deploy the\ncontroller, use `helm install helm` from the project root:\n\n```\nhelm install helm\n```\n\nWhen the chart is installed, the following custom resources will be added to\nthe cluster:\n* `ChaosMonkey`\n* `Crash`\n* `NetworkPartition`\n* `Stress`\n\nThe `ChaosMonkey` resource is the primary resource provided by the controller.\nThe remaining custom resources are used by the controller to inject specific\nfailures into pods.\n\nThe chart supports overrides for both the [controller](#controller) and\n[workers](#workers). The controller is deployed as a `Deployment`, and\nthe workers as a `DaemonSet`.\n\n### Manual setup\n\nBefore running the controller, register the custom resources:\n\n```\n$ kubectl create -f deploy/chaosmonkey.yaml\n$ kubectl create -f deploy/crash.yaml\n$ kubectl create -f deploy/networkpartition.yaml\n$ kubectl create -f deploy/stress.yaml\n```\n\nSet up RBAC and deploy the controller:\n\n```\n$ kubectl create -f deploy/service_account.yaml\n$ kubectl create -f deploy/role.yaml\n$ kubectl create -f deploy/role_binding.yaml\n$ kubectl create -f deploy/controller.yaml\n```\n\nDeploy the workers:\n\n```\n$ kubectl create -f deploy/workers.yaml\n```\n\nExample `ChaosMonkey` resources can be found in the `example` directory:\n\n```\n$ kubectl create -f example/crash_monkey.yaml\n$ kubectl create -f example/partition_monkey.yaml\n$ kubectl create -f example/stress_monkey.yaml\n```\n\n## Usage\n\nThe chaos controller provides a full suite of tools for chaos testing, injecting\na variety of failures into nodes, k8s pods, and networks. 
Each monkey\nplays a specific role in injecting failures into the cluster:\n\n```yaml\napiVersion: chaos.atomix.io/v1alpha1\nkind: ChaosMonkey\nmetadata:\n  name: crash-monkey\nspec:\n  rateSeconds: 60\n  jitter: .5\n  crash:\n    crashStrategy:\n      type: Container\n```\n\n### Scheduling\n\nThe scheduling of periodic `ChaosMonkey` executions can be managed by providing a\n_rate_ and _period_ for which the fault occurs:\n* `rateSeconds` - the number of seconds to wait between monkey runs\n* `periodSeconds` - the number of seconds for which to run a monkey, e.g. the amount of\ntime for which to partition the network or stress a node\n* `jitter` - the amount of jitter to apply to the rate\n\n### Selectors\n\nSpecific sets of pods can be selected using pod names, labels, or match expressions\nspecified in the configured `selector`:\n* `matchPods` - a list of pod names on which to match\n* `matchLabels` - a map of label names and values on which to match pods\n* `matchExpressions` - label match expressions on which to match pods\n\nSelector options can be added on a per-monkey basis:\n\n```yaml\napiVersion: chaos.atomix.io/v1alpha1\nkind: ChaosMonkey\nmetadata:\n  name: crash-monkey\nspec:\n  crash:\n    crashStrategy:\n      type: Pod\n  selector:\n    matchPods:\n    - pod-1\n    - pod-2\n    - pod-3\n    matchLabels:\n      group: raft\n    matchExpressions:\n    - key: group\n      operator: In\n      values:\n      - raft\n      - data\n```\n\nEach monkey type has a custom configuration provided by a named field for the\nmonkey type:\n* [`crash`](#crash-monkey)\n* [`partition`](#partition-monkey)\n* [`stress`](#stress-monkey)\n\n### Crash monkey\n\nThe crash monkey can be used to inject node crashes into the cluster. 
To configure a\ncrash monkey, use the `crash` configuration:\n\n```yaml\napiVersion: chaos.atomix.io/v1alpha1\nkind: ChaosMonkey\nmetadata:\n  name: crash-monkey\nspec:\n  rateSeconds: 60\n  jitter: .5\n  crash:\n    crashStrategy:\n      type: Container\n```\n\nThe `crash` configuration supports a `crashStrategy` with the following options:\n* `Container` - kills the process running inside the container\n* `Pod` - deletes the `Pod` using the Kubernetes API\n\n### Partition monkey\n\nThe partition monkey can be used to cut off network communication between a set of\npods. To use the partition monkey, selected pods must have `iptables` installed.\nTo configure a partition monkey, use the `partition` configuration:\n\n```yaml\napiVersion: chaos.atomix.io/v1alpha1\nkind: ChaosMonkey\nmetadata:\n  name: partition-isolate-monkey\nspec:\n  rateSeconds: 600\n  periodSeconds: 120\n  partition:\n    partitionStrategy:\n      type: Isolate\n```\n\nThe `partition` configuration supports a `partitionStrategy` with the following options:\n* `Isolate` - isolates a single random node in the cluster from all other nodes\n* `Halves` - splits the cluster into two halves\n* `Bridge` - splits the cluster into two halves with a single bridge node able to\ncommunicate with each half (for testing consensus)\n\n### Stress monkey\n\nThe stress monkey uses a variety of tools to simulate stress on nodes and on the \nnetwork. 
To configure a stress monkey, use the `stress` configuration:\n\n```yaml\napiVersion: chaos.atomix.io/v1alpha1\nkind: ChaosMonkey\nmetadata:\n  name: stress-cpu-monkey\nspec:\n  rateSeconds: 300\n  periodSeconds: 300\n  stress:\n    stressStrategy:\n      type: All\n    cpu:\n      workers: 2\n```\n\nThe `stress` configuration supports a `stressStrategy` with the following options:\n* `Random` - applies stress options to a random pod\n* `All` - applies stress options to all pods in the cluster\n\nThe stress monkey supports a variety of types of stress using the\n[stress](https://linux.die.net/man/1/stress) tool:\n* `cpu` - spawns `cpu.workers` workers spinning on `sqrt()`\n* `io` - spawns `io.workers` workers spinning on `sync()`\n* `memory` - spawns `memory.workers` workers spinning on `malloc()`/`free()`\n* `hdd` - spawns `hdd.workers` workers spinning on `write()`/`unlink()`\n\n```yaml\napiVersion: chaos.atomix.io/v1alpha1\nkind: ChaosMonkey\nmetadata:\n  name: stress-all-monkey\nspec:\n  rateSeconds: 300\n  periodSeconds: 300\n  stress:\n    stressStrategy:\n      type: Random\n    cpu:\n      workers: 2\n    io:\n      workers: 2\n    memory:\n      workers: 4\n    hdd:\n      workers: 1\n```\n\nAdditionally, network latency can be injected using the stress monkey via\n[traffic control](http://man7.org/linux/man-pages/man8/tc-netem.8.html) by providing\na `network` stress configuration:\n* `latencyMilliseconds` - the amount of latency to inject in milliseconds\n* `jitter` - the jitter to apply to the latency\n* `correlation` - the correlation to apply to the latency\n* `distribution` - the delay distribution, either `normal`, `pareto`, or `paretonormal`\n\n```yaml\napiVersion: chaos.atomix.io/v1alpha1\nkind: ChaosMonkey\nmetadata:\n  name: stress-network-monkey\nspec:\n  rateSeconds: 300\n  periodSeconds: 60\n  stress:\n    stressStrategy:\n      type: All\n    network:\n      latencyMilliseconds: 500\n      jitter: .5\n      correlation: .25\n```\n\n## 
Architecture\n\nThe controller consists of two independent components which run as containers in\na k8s cluster.\n\n### Controller\n\nThe controller is the component responsible for monitoring the creation/deletion of\n`ChaosMonkey` resources, scheduling executions, and distributing tasks to [workers](#workers).\nThe controller typically runs as a `Deployment`. When multiple replicas are run, only a\nsingle replica will control the cluster at any given time.\n\nWhen `ChaosMonkey` resources are created in the k8s cluster, the controller receives a\nnotification and, in response, schedules a periodic background task to execute the monkey\nhandler. The periodic task is configured based on the monkey configuration. When a monkey\nhandler is executed, the controller filters pods using the monkey's configured selectors\nand passes the pods to the handler for execution. Monkey handlers then assign tasks to\nspecific [workers](#workers) to carry out the specified chaos function.\n\n![Controller](https://i.ibb.co/d76BDZW/Controller-Worker.png)\n\n#### Crash controller\n\nThe crash controller assigns tasks to workers via `Crash` resources. The `Crash` resource\nwill indicate the `podName` of the pod to crash and the `crashStrategy` with which to crash\nthe pod.\n\n#### Partition controller\n\nThe partition controller assigns tasks to workers according to the configured\n`partitionStrategy` and uses the `NetworkPartition` resource to communicate details of the\nnetwork partition to the workers. After determining the set of routes to cut off between pods,\nthe controller creates a `NetworkPartition` for each source/destination pair.\n\n![Network Partition](https://i.ibb.co/J7DX1v5/Network-Partition.png)\n\n#### Stress controller\n\nThe stress controller assigns tasks to workers via `Stress` resources. 
The `Stress` resource\nwill indicate the `podName` of the pod to stress and the mechanisms with which to stress the\npod.\n\n### Workers\n\nWorkers are the components responsible for injecting failures on specific k8s nodes.\nLike the [controller](#controller), workers are a type of resource controller, but rather\nthan managing the high-level `ChaosMonkey` resources used to randomly inject failures,\nworkers provide for the injection of pod-level failures in response to the creation of\nresources like `Crash`, `NetworkPartition`, or `Stress`.\n\nIn order to ensure a worker is assigned to each node and can inject failures into the OS,\nworkers must be run in a `DaemonSet` and granted `privileged` access to k8s nodes.\n\n#### Crash workers\n\nCrash workers monitor k8s for the creation of `Crash` resources. When a `Crash` resource is\ndetected, if the `podName` contained in the `Crash` belongs to the node on which the worker\nis running, the worker executes the crash. This ensures only one node attempts to execute\na crash regardless of the method by which the crash is performed.\n\nThe method of execution of the crash depends on the configured `crashStrategy`. If the `Pod`\nstrategy is used, the worker simply deletes the pod via the Kubernetes API. If the `Container`\nstrategy is indicated, the worker locates the pod's container(s) via the Docker API and\nkills the containers directly.\n\n#### Partition workers\n\nPartition workers monitor k8s for the creation of `NetworkPartition` resources. Each\n`NetworkPartition` represents a link between two pods to be cut off while the resource is\nrunning. The worker configures the pod indicated by `podName` to drop packets from the\nconfigured `sourceName`. 
`NetworkPartition` resources may be in one of four _phases_:\n* `started` indicates the resource has been created but the pods are not yet partitioned\n* `running` indicates the pod has been partitioned\n* `stopped` indicates the partition has been stopped but the physical communication has not\nyet been restored\n* `complete` indicates communication between the pod and the source has been restored\n\nWhen a worker receives notification of a `NetworkPartition` in the `started` phase, if the\npod is running on the worker's node, the worker cuts off communication between the pod and\nthe source by locating the virtual network interface for the pod and adding firewall rules\nto the host to drop packets received on the pod's virtual interface from the source IP.\nTo restore communication with the source, the worker simply deletes the added firewall rules.\n\n#### Stress workers\n\nStress workers monitor k8s for the creation of `Stress` resources. When a `Stress` resource\nis detected, the worker on the node to which the stressed pod is assigned may perform several\ntasks to stress the desired pod.\n\nFor I/O, CPU, memory, and HDD stress, the worker will create a container in the pod's\nnamespace to execute the [stress tool](https://linux.die.net/man/1/stress). For each\nconfigured stress option, a separate container will be created to stress the pod.\nWhen the `Stress` resource is `stopped`, all stress containers will be stopped.\n\nFor network stress, the host is configured using the [traffic control](https://linux.die.net/man/8/tc)\nutility to control the pod's virtual interfaces. 
When the `Stress` resource is `stopped`,\nthe traffic control rule is deleted.\n\n[Kubernetes]: https://kubernetes.io/\n[Helm]: https://helm.sh\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fatomix%2Fchaos-controller","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fatomix%2Fchaos-controller","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fatomix%2Fchaos-controller/lists"}