{"id":18374527,"url":"https://github.com/massix/chaos-monkey","last_synced_at":"2026-05-20T05:01:32.654Z","repository":{"id":253118770,"uuid":"818328279","full_name":"massix/chaos-monkey","owner":"massix","description":"My attempt at writing a chaos-monkey for Kubernetes in Go using the client-go","archived":false,"fork":false,"pushed_at":"2024-08-14T13:27:22.000Z","size":4092,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-11T03:58:30.162Z","etag":null,"topics":["chaos-engineering","chaos-monkey","golang","kubernetes","operators"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/massix.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-21T15:47:33.000Z","updated_at":"2024-08-14T13:29:38.000Z","dependencies_parsed_at":"2024-08-14T16:04:02.321Z","dependency_job_id":"0dc469c5-bf24-4fe7-a387-917be972087a","html_url":"https://github.com/massix/chaos-monkey","commit_stats":null,"previous_names":["massix/chaos-monkey"],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/massix/chaos-monkey","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/massix%2Fchaos-monkey","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/massix%2Fchaos-monkey/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/massix%2Fchaos-monkey/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/massix%2Fchaos-monkey/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/massix","download_url":"https://codeload.github.com/massix/chaos-monkey/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/massix%2Fchaos-monkey/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278846350,"owners_count":26056090,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-07T02:00:06.786Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chaos-engineering","chaos-monkey","golang","kubernetes","operators"],"created_at":"2024-11-06T00:14:56.914Z","updated_at":"2025-10-07T20:54:43.830Z","avatar_url":"https://github.com/massix.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Chaos Monkey\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./assets/cm-nobg.png\" width=\"400px\" /\u003e\n\u003c/div\u003e\n\n[Golang](https://go.dev) implementation of the ideas of [Netflix's Chaos Monkey](https://netflix.github.io/chaosmonkey/) natively for [Kubernetes](https://kubernetes.io) clusters.\n\nFor this small project I have decided not to use the official [Operator Framework for Golang](https://sdk.operatorframework.io/docs/building-operators/golang/tutorial/),\nmainly because I wanted to familiarize with the core concepts of CRDs and 
Watchers with Golang\nbefore venturing further. In the future I might want to migrate to using the Operator Framework.\n\n## Architecture\nThe architecture of the Chaos Monkey is fairly simple and all fits in a single Pod.\nAs you can imagine, we rely heavily on [Kubernetes' API](https://kubernetes.io/docs/reference/using-api/api-concepts/) to react based on what happens inside the cluster.\n\nFour main components are part of the current architecture.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./assets/cm-architecture.png\" width=\"600px\" /\u003e\n\u003c/div\u003e\n\n### Namespace Watcher\nThe code for the `NamespaceWatcher` can be found [here](./internal/watcher/namespace.go).\n\nIts role is to constantly monitor the changes in the Namespaces of the cluster, and start\nthe CRD Watchers for those Namespaces. We start the watch by passing `ResourceVersion: \"\"`\nto the Kubernetes API, which means that the first events we receive are synthetic\n`ADDED` events that help us rebuild the current state of the cluster. 
After that, we react to both\nthe `ADDED` and the `DELETED` events accordingly.\n\nBasically, it spawns a new [goroutine](https://go.dev/tour/concurrency/1) with a [CRD Watcher](#crd-watcher) every time a new namespace is\ndetected and it stops the corresponding goroutine when a namespace is deleted.\n\nThe Namespace Watcher can be [configured](#configuration) to either monitor all namespaces by default (with an\nopt-out strategy) or to monitor only the namespaces which carry the label\n`cm.massix.github.io/namespace=\"true\"`.\n\nCheck the [Configuration](#configuration) section for more details.\n\n### CRD Watcher\nWe make use of a [Custom Resource Definition (CRD)](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) in order to trigger the Chaos Monkey.\nThe CRD is defined using the [OpenAPI](https://www.openapis.org/) specification, which you can find [here](./crds/chaosmonkey-configuration.yaml).\n\nFollowing the schema, this is a valid custom resource which can be created inside\na namespace:\n\n```yaml\napiVersion: cm.massix.github.io/v1\nkind: ChaosMonkeyConfiguration\nmetadata:\n  name: chaosmonkey-nginx\n  namespace: target\nspec:\n  enabled: true\n  minReplicas: 0\n  maxReplicas: 9\n  timeout: 10s\n  deployment:\n    name: nginx\n  scalingMode: killPod\n```\n\nThe CRD is **namespaced**, meaning that it **must** reside inside a Namespace and cannot be\ncreated at cluster-level.\n\nThe CRD Watcher, similarly to the [namespace one](#namespace-watcher), reacts to the\n`ADDED` and `DELETED` events accordingly, creating and stopping goroutines, but it also\nreacts to the `MODIFIED` event, making it possible to modify a configuration while the\nMonkey is running.\n\nDepending on the value of the `scalingMode` field, the CRD watcher will either create a\n[DeploymentWatcher](#deployment-watcher) or a [PodWatcher](#pod-watcher). The difference between\nthe two is described in their respective sections, but in short: the 
DeploymentWatcher\noperates by modifying the `spec.replicas` field of the Deployment, using the\n`deployments/scale` APIs, while the PodWatcher simply deletes a random pod matching the\n`spec.selector` value of the targeted Deployment.\n\nAs of now, three values are supported by the `scalingMode` field:\n* `randomScale`, which creates a [DeploymentWatcher](#deployment-watcher) that randomly changes the number of replicas of the given deployment;\n* `killPod`, which creates a [PodWatcher](#pod-watcher) that randomly kills a pod;\n* `antiPressure`, which creates an [AntiPressureWatcher](#antipressure-watcher).\n\n### Deployment Watcher\nThis is where the fun begins: the Deployment Watcher is responsible for creating the\nChaos inside the cluster. The watcher is associated with a specific deployment (see the\nexample CRD above), and at regular intervals, specified by the `spec.timeout` field\nof the CRD, it scales the deployment up or down. This allows us to test both the case\nwhere there are fewer replicas than we need and the case where there are more\nreplicas than the cluster can probably handle.\n\nAll the fields in the CRDs are mandatory and **must** be set. Some simple\nvalidations are done by Kubernetes itself, through the\n[OpenAPI Schema](./crds/chaosmonkey-configuration.yaml), and some other validations\nare done in the code.\n\n### Pod Watcher\nThis is another point where the fun begins. The Pod Watcher is responsible for\ncreating the Chaos inside the cluster. The watcher is associated with a specific\n`spec.selector` field, and at regular intervals, specified by the `spec.timeout` field\nof the CRD, it will randomly kill a pod matching that selector.\n\nThe Pod Watcher **ignores** the `maxReplicas` and `minReplicas` fields of the CRD,\nthus generating real chaos inside the cluster.\n\n### AntiPressure Watcher\nThis is another point where the fun begins. 
The AntiPressure Watcher is responsible\nfor creating Chaos inside the cluster by detecting which pod of a given container\nis using the most CPU and simply killing it. It works as the opposite of a classic\n[Horizontal Pod Autoscaler](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/); for this reason it is often referred to as `antiHPA` in the code.\n\n**WARNING**: for the AntiPressure Watcher to work, your cluster **must** have a\n[metrics server](https://github.com/kubernetes-sigs/metrics-server) installed; it often comes preinstalled on Cloud providers.\nIf you want to install it locally, please refer to the [terraform configuration](./main.tf) included\nin the project itself.\n\n## Deployment inside a Kubernetes Cluster\nIn order to be able to deploy the ChaosMonkey inside a Kubernetes cluster you **must**\nfirst create a [ServiceAccount](https://kubernetes.io/docs/concepts/security/service-accounts/),\nfollowed by a [ClusterRole](https://kubernetes.io/docs/reference/access-authn-authz/rbac/)\nand bind the two together with a [ClusterRoleBinding](https://kubernetes.io/docs/reference/access-authn-authz/rbac/#rolebinding-and-clusterrolebinding).\n\nAfter that you need to install the CRD contained in this repository:\n\n    kubectl apply -f ./crds/chaosmonkey-configuration.yaml\n\nThen you can create a classic [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/),\njust remember to use your newly created ServiceAccount.\n\nFollowing is an example of the manifests you *should* create for the cluster:\n\n```yaml\nkind: Namespace\napiVersion: v1\nmetadata:\n  name: chaosmonkey\n---\nkind: ServiceAccount\napiVersion: v1\nmetadata:\n  name: chaosmonkey\n  namespace: chaosmonkey\n---\nkind: ClusterRole\napiVersion: rbac.authorization.k8s.io/v1\nmetadata:\n  name: chaosmonkey\nrules:\n  - verbs: [\"watch\"]\n    resources: [\"namespaces\"]\n    apiGroups: [\"*\"]\n  - verbs: [\"patch\", \"get\", \"scale\", 
\"update\"]\n    resources: [\"deployments\"]\n    apiGroups: [\"*\"]\n  - verbs: [\"list\", \"patch\", \"watch\"]\n    resources: [\"chaosmonkeyconfigurations\"]\n    apiGroups: [\"*\"]\n  - verbs: [\"update\"]\n    resources: [\"deployments/scale\"]\n    apiGroups: [\"apps\"]\n  - verbs: [\"watch\", \"delete\"]\n    resources: [\"pods\"]\n    apiGroups: [\"*\"]\n  - verbs: [\"create\", \"patch\"]\n    resources: [\"events\"]\n    apiGroups: [\"*\"]\n  - verbs: [\"get\"]\n    resources: [\"pods\"]\n    apiGroups: [\"metrics.k8s.io\"]\n---\nkind: ClusterRoleBinding\napiVersion: rbac.authorization.k8s.io/v1\nmetadata:\n  name: chaosmonkey-binding\nsubjects:\n  - kind: ServiceAccount\n    name: chaosmonkey\n    namespace: chaosmonkey\nroleRef:\n  kind: ClusterRole\n  apiGroup: rbac.authorization.k8s.io\n  name: chaosmonkey\n---\nkind: Deployment\napiVersion: apps/v1\nmetadata:\n  name: chaosmonkey\n  namespace: chaosmonkey\nspec:\n  # some fields omitted for clarity\n  template:\n    spec:\n      serviceAccountName: chaosmonkey\n```\n\n## A note on CRD\nThe CRD defines multiple versions of the APIs (at the moment two versions are supported:\n`v1alpha1` and `v1`). 
You should **always** use the latest version available (`v1`), but\nthere is a conversion endpoint in case you are still using the older version of the API.\n\nThe only caveat is that if you **need** to use the conversion Webhook, you **must** install the\nchaosmonkey in a namespace named `chaosmonkey` and create a service named `chaos-monkey`\nfor it.\n\nIf in doubt, do not use the older version of the API.\n\n## Configuration\nThere are some configurable parts of the ChaosMonkey (on top of what the [CRD](./crds/chaosmonkey-configuration.yaml)\nalready permits of course).\n\n**Minimum Log Level**: this is configurable using the environment variable `CHAOSMONKEY_LOGLEVEL`.\nIt accepts the following self-explanatory values: `trace`, `debug`, `info`, `warn`, `error`,\n`critical` or `panic` and it sets the minimum log level for all the logging of the ChaosMonkey.\n\nThe value is not case-sensitive; invalid or empty values will make ChaosMonkey default to\nthe `info` level.\n\n**Default Behavior**: this is used to configure the way the [Namespace Watcher](#namespace-watcher) should\nbehave with regard to additions and modifications of namespaces and it uses the environment\nvariable `CHAOSMONKEY_BEHAVIOR`. It currently accepts two values: `AllowAll` or `DenyAll`\n(not case-sensitive).\n\nSetting it to `AllowAll` means that by default all namespaces are monitored. If\nyou want to opt a namespace out you **must** create a new label in the\nmetadata of the namespace: `cm.massix.github.io/namespace=\"false\"`; this will\nmake the Watcher ignore that namespace. All values which are not the string\n`false` will cause the Watcher to take that namespace into account.\n\nSetting it to `DenyAll` means that by default all namespaces are ignored. If\nyou want to opt a namespace in you **must** create a new label in\nthe metadata of the namespace: `cm.massix.github.io/namespace=\"true\"`; this will\nmake the Watcher take that namespace into account. 
All values which are not\nthe string `true` will cause the Watcher to ignore that namespace.\n\nInjecting an incorrect value or no value at all will have ChaosMonkey use its\ndefault behavior: `AllowAll`.\n\n**Watchers Timeout**: not to be confused with the timeout provided by the [CRD](#crd-watcher), this is merely a\ntechnical value: it is the timeout for the `watch` method in Kubernetes. The default value\nis 48 hours, which should be good for whatever kind of cluster you are running, but\nif you want to increase or decrease it you have three different environment\nvariables you can use:\n- `CHAOSMONKEY_NS_TIMEOUT` to configure the timeout for the [Namespace Watcher](#namespace-watcher)\n- `CHAOSMONKEY_CRD_TIMEOUT` to configure the timeout for the [CRD Watcher](#crd-watcher)\n- `CHAOSMONKEY_POD_TIMEOUT` to configure the timeout for the [Pod Watcher](#pod-watcher).\n\nThe three environment variables expect a string following the specification of the [`time.ParseDuration`](https://pkg.go.dev/time#ParseDuration)\nfunction of Golang. 
Failure in parsing a value will have Chaos Monkey use the\ndefault timeout of 48 hours.\n\nIt is recommended not to touch these values unless you know what you are doing (spoiler: I do not\nknow what I am doing most of the time).\n\n## Observability\nThe Chaos Monkey has two observability endpoints available, both exposed by the HTTP server\nrunning on port 9000 (not configurable).\n\n### Prometheus\nThe Chaos Monkey exposes some metrics using the [Prometheus](https://prometheus.io/) library and format; the metrics are all\navailable under the `/metrics` endpoint.\n\nThis is an _evolving_ list of metrics currently exposed; for more details please take a look\nin the code under the corresponding service (all the services in the [watcher folder](./internal/watcher/) expose\nsome sort of metrics).\n\nAll the metrics use the prefix `chaos_monkey` which, for readability, is not repeated in the\ntable below.\n\n| Name                                   | Description                             | Type      |\n|----------------------------------------|-----------------------------------------|-----------|\n| nswatcher_events                       | events handled by the nswatcher         | Counter   |\n| nswatcher_event_duration               | duration of each event in microseconds  | Histogram |\n| nswatcher_cmc_spawned                  | crd services spawned                    | Counter   |\n| nswatcher_cmc_active                   | currently active crd                    | Gauge     |\n| nswatcher_restarts                     | timeouts happened from K8S APIs         | Counter   |\n| crdwatcher_events                      | events handled by the crd watcher       | Counter   |\n| crdwatcher_pw_spawned                  | PodWatchers spawned                     | Counter   |\n| crdwatcher_pw_active                   | PodWatchers currently active            | Gauge     |\n| crdwatcher_dw_spawned                  | DeploymentWatchers spawned              | Counter  
 |\n| crdwatcher_dw_active                   | DeploymentWatchers active               | Gauge     |\n| crdwatcher_restarts                    | timeouts happened from K8S APIs         | Counter   |\n| podwatcher_pods_added                  | Pods having been added to the list      | Counter   |\n| podwatcher_pods_removed                | Pods having been removed from the list  | Counter   |\n| podwatcher_pods_killed                 | Pods having been killed                 | Counter   |\n| podwatcher_pods_active                 | Pods currently being targeted           | Gauge     |\n| podwatcher_restarts                    | timeouts happened from K8S APIs         | Counter   |\n| deploymentwatcher_deployments_rescaled | deployments having been rescaled        | Counter   |\n| deploymentwatcher_random_distribution  | random distribution of deployments      | Histogram |\n| deploymentwatcher_last_scale           | last value used to scale the deployment | Gauge     |\n\nIn the [Makefile](./Makefile) there is also a target `deploy-monitoring` used to deploy a very\nbare bone and simple monitoring stack which includes your classic Prometheus and Grafana, with\nno persistence enabled.  The Grafana will be loaded with three dashboards:\n- `node-exporter-full` to have some live statistics about your locally running K8S cluster;\n- `kube-state-metrics-v2` to have some statistics about the internals of K8S, useful to monitor how the ChaosMonkey is behaving;\n- `chaos-monkey`, for which the source is available [here](./assets/grafana-dashboard.json) and exploits some of the metrics of the table above.\n\n### Health Endpoint\nOn top of Prometheus, there is also an endpoint available at `/health`, which gives some very\nbasic information about the state of the Chaos Monkey. 
It can be used in Kubernetes for the\n[liveness and readiness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).\n\n## Development\nAll contributions are welcome, of course. Feel free to open an issue or submit a\npull request. If you want to develop and test locally, you need to install:\n- [Golang](https://go.dev) version 1.22 (it should probably work with older versions too, but this is the version I have used)\n- [Docker](https://www.docker.com), latest version but any version will be fine\n- [Terraform](https://www.terraform.io) at least version 1.8.5\n- [Kind](https://github.com/kubernetes-sigs/kind) at least version 0.5.0\n- [Kubectl](https://kubernetes.io/docs/tasks/tools/) at least version 1.30.1\n\n### Unit Tests\nThe project includes a wide variety of unit tests, which use the `fake` client\nof Kubernetes included in the `client-go` library. The problem is that when testing\nwith mocks, most of the time you end up testing the mocks and not the code. That's\nthe reason why there are also some [integration tests](#integration-tests) included.\n\nFor the future, I have plans to completely rewrite the way the tests are run, create\nmore _pure_ functions and test those functions in the unit tests, and let the\n[integration tests](#integration-tests) do the rest. If you want to help me out in reaching this goal, feel\nfree to open a pull request!\n\n### Integration Tests\nThese tests should cover the basic functionalities of the Chaos Monkey in a local\nKubernetes cluster. 
The script file is [here](./tests/kubetest.sh) and before launching\nit you should create the Kubernetes cluster locally, using the included [Terraform](./main.tf) configuration.\n\nIt should be as easy as launching:\n\n    $ make cluster-test\n    $ ./tests/kubetest.sh\n\nYou can also activate more verbose logging for the tests with:\n\n    TEST_DEBUG=true ./tests/kubetest.sh\n\n## Contributions\nAll kinds of contributions are welcome, simply open a pull request or an issue!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmassix%2Fchaos-monkey","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmassix%2Fchaos-monkey","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmassix%2Fchaos-monkey/lists"}