https://github.com/k-krew/omen
A lightweight, declarative chaos engineering operator for Kubernetes
https://github.com/k-krew/omen
chaos-engineering chaos-testing controller-runtime fault-injection golang helm-chart kubebuilder kubernetes kubernetes-operator reliability sre
Last synced: 3 days ago
JSON representation
A lightweight, declarative chaos engineering operator for Kubernetes
- Host: GitHub
- URL: https://github.com/k-krew/omen
- Owner: k-krew
- License: apache-2.0
- Created: 2026-03-26T02:37:31.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-04-12T20:04:07.000Z (17 days ago)
- Last Synced: 2026-04-12T22:08:43.270Z (16 days ago)
- Topics: chaos-engineering, chaos-testing, controller-runtime, fault-injection, golang, helm-chart, kubebuilder, kubernetes, kubernetes-operator, reliability, sre
- Language: Go
- Homepage:
- Size: 158 KB
- Stars: 6
- Watchers: 0
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
- Roadmap: ROADMAP.md
- Agents: AGENTS.md
Awesome Lists containing this project
README


# Omen
A lightweight Kubernetes chaos engineering operator with transparent target selection and optional manual approval.
## Overview
Omen lets you declaratively define chaos experiments against your workloads. Each run:
1. Selects a fixed set of target pods (preview)
2. Optionally waits for manual approval
3. Executes the chaos action against those exact targets
4. Records per-target results and a summary
Two CRDs are provided:
- **Experiment** — defines the schedule, target selector, action, safety limits, and approval policy
- **ExperimentRun** — a single execution instance created by the controller, holding the target preview, approval state, and results
## Roadmap
Curious about what's coming next? Check out our [Roadmap](ROADMAP.md) to see our plans for advanced target filtering, ChatOps integrations, and more!
## Breaking Changes in v0.3.0
Version 0.3.0 introduces a major architectural shift to an **opt-in sidecar model** for network chaos, replacing the old ephemeral containers approach.
- **Namespace Opt-in Required:** You must explicitly label target namespaces with `chaos.kreicer.dev/enabled=true`.
- **Sidecar Injection:** A mutating webhook now automatically injects the `omen-agent` sidecar into pods in enabled namespaces.
- **Removed Flags:** The `ProtectedNamespaces` CLI flag and Helm value have been completely removed in favor of the new opt-in label system.
## Install via Helm
```bash
helm install omen oci://ghcr.io/k-krew/charts/omen \
--namespace omen-system \
--create-namespace \
--version
```
To customise the installation:
```bash
helm install omen oci://ghcr.io/k-krew/charts/omen \
--namespace omen-system \
--create-namespace \
--version \
--set manager.leaderElect=true \
--set resources.limits.memory=256Mi \
--set manager.agentImage="ghcr.io/k-krew/omen-agent:" \
--set manager.agentPort=9999
```
### Controller flags
| Flag | Default | Description |
|---|---|---|
| `--webhook-timeout` | `10s` | Timeout for outgoing approval webhook HTTP requests. |
| `--leader-elect` | `false` | Enable leader election for HA deployments. |
| `--metrics-bind-address` | `0` | Address for the metrics endpoint (`0` disables it). |
| `--health-probe-bind-address` | `:8081` | Address for liveness/readiness probes. |
| `--agent-image` | `ghcr.io/k-krew/omen-agent:v0.3.1` | Container image injected as the `omen-agent` sidecar into target pods. |
| `--agent-port` | `9999` | Port the agent sidecar listens on. Change if it conflicts with application ports. |
## Examples
Ready-to-apply YAML manifests live in the [`examples/`](examples/) directory:
| File | Description |
|---|---|
| [`delete-pod-once.yaml`](examples/delete-pod-once.yaml) | One-shot pod deletion, fixed count |
| [`delete-pod-percent.yaml`](examples/delete-pod-percent.yaml) | One-shot pod deletion, percentage-based |
| [`delete-pod-repeat-approval.yaml`](examples/delete-pod-repeat-approval.yaml) | Recurring deletion with manual approval and webhook notification |
| [`network-fault-latency.yaml`](examples/network-fault-latency.yaml) | Inject 100ms latency + 10ms jitter for 5 minutes |
| [`network-fault-packet-loss.yaml`](examples/network-fault-packet-loss.yaml) | Drop 30% of packets for 3 minutes |
| [`network-fault-blackhole.yaml`](examples/network-fault-blackhole.yaml) | Complete network blackhole (100% packet loss) with approval gate |
To approve a pending run:
```bash
kubectl patch experimentrun \
--type=merge \
-p '{"spec":{"approved":true}}'
```
## Action Types
### `delete_pod`
Deletes the selected pods. Supports `force: true` for immediate deletion (grace period 0).
### `network_fault`
Injects network chaos into target pods using Linux Traffic Control (`tc netem`). The controller sends HTTP requests to the `omen-agent` sidecar running inside each target pod, which applies and removes the fault. The fault is automatically rolled back after the configured `duration`.
**Prerequisite:** The target namespace must be labeled `chaos.kreicer.dev/enabled=true` so that the sidecar is injected (see [Architecture](#architecture) below).
**Parameters (`spec.action.networkFault`):**
| Field | Type | Description |
|---|---|---|
| `latency` | duration | Fixed delay added to outgoing packets (e.g., `100ms`). |
| `jitter` | duration | Random variation on top of latency (e.g., `10ms`). Requires `latency`. |
| `packetLoss` | integer (1-100) | Percentage of packets to drop. Set to `100` for a full blackhole. |
| `duration` | duration | How long to hold the fault before automatic rollback. Defaults to `5m`. |
At least one of `latency` or `packetLoss` must be set.
## Architecture
Omen uses an **opt-in sidecar model**. Chaos is only allowed in namespaces explicitly labeled with `chaos.kreicer.dev/enabled=true`. A Mutating Webhook automatically injects the `omen-agent` sidecar into all new pods in these namespaces.
```
kubectl label namespace chaos.kreicer.dev/enabled=true
```
The controller then:
1. Selects targets only from pods that live in labeled namespaces.
2. For `delete_pod`: deletes the pod via the Kubernetes API.
3. For `network_fault`: sends an HTTP `POST /network-fault` to the agent sidecar inside the pod to apply `tc` rules, then `DELETE /network-fault` after the duration to roll back.
### Security
The `omen-agent` sidecar requires the `NET_ADMIN` Linux capability to run `tc` commands. This means namespaces used for network chaos must allow it via Pod Security Admission:
```bash
kubectl label namespace \
pod-security.kubernetes.io/enforce=baseline
```
Communication between the controller and agents is authenticated with a shared token (generated by Helm and stored in a Kubernetes Secret). The token is automatically injected into each agent sidecar as `OMEN_SECRET_TOKEN` by the mutating webhook.
A `NetworkPolicy` is shipped with the Helm chart that restricts ingress to agent sidecars so only the controller pod can reach them.
### Pre-flight registry check
On startup, the controller performs a TCP connectivity check to the agent image registry. If the registry is unreachable (e.g., in an air-gapped cluster without proper registry credentials), sidecar injection is **disabled** automatically so that user pods are never blocked by `ImagePullBackOff`. A warning is logged:
```
WARNING: agent image registry is not reachable — sidecar injection will be disabled
```
## Safety: Pod-level Opt-out
Individual pods can be excluded from all chaos experiments by adding the annotation `chaos.kreicer.dev/ignore: "true"`. Annotated pods are neither injected with the agent sidecar nor selected as targets.
```bash
kubectl annotate pod chaos.kreicer.dev/ignore=true
```
Or in the pod template:
```yaml
metadata:
annotations:
chaos.kreicer.dev/ignore: "true"
```
Experiment-level protection is also available via `spec.safety.denyNamespaces`:
```yaml
spec:
safety:
denyNamespaces:
- my-critical-namespace
```
## Observability
Every phase transition of an `ExperimentRun` emits a standard Kubernetes Event on the object:
```bash
kubectl describe experimentrun
```
Events use `Normal` type for successful transitions (`PreviewGenerated`, `Approved`, `Running`, `Completed`) and `Warning` for failure states (`Failed`, `Expired`).
The `TOTAL` column in `kubectl get expruns` is populated as soon as targets are selected during the `PreviewGenerated` phase, so you can see how many pods will be affected before the run executes.
## Safe Deletion
`Experiment` objects carry a finalizer (`chaos.omen.com/finalizer`). When an `Experiment` is deleted, the controller first deletes all owned `ExperimentRun`s and waits for them to be removed before releasing the finalizer.
`ExperimentRun`s executing a `network_fault` action carry an additional finalizer (`chaos.omen.com/network-fault`). Before the run object is removed, the controller sends `DELETE /network-fault` to the agent in all targets where the fault was still active, ensuring the network is restored even if the experiment is aborted mid-flight.
## Dry Run
Set `dryRun: true` on the `Experiment` to preview target selection without executing any action. For `delete_pod`, no pods are deleted. For `network_fault`, no HTTP requests are sent to the agent. Results are recorded as `Success` in both cases.
## Run locally (against Kind or Minikube)
### Prerequisites
- Go 1.26+
- `kubebuilder` v4
- `kubectl` pointing at a local cluster
```bash
# Install CRDs
GOTOOLCHAIN=local make install
# Run the controller locally (uses ~/.kube/config)
GOTOOLCHAIN=local make run
```
The controller reads `POD_NAMESPACE` to exclude its own pods from target selection. Set it when running locally:
```bash
POD_NAMESPACE=omen-system GOTOOLCHAIN=local make run
```
## Development
```bash
# Regenerate CRDs and RBAC after editing types
GOTOOLCHAIN=local make manifests generate
# Build the binary
GOTOOLCHAIN=local make build
# Run tests (requires setup-envtest)
go install sigs.k8s.io/controller-runtime/tools/setup-envtest@latest
export KUBEBUILDER_ASSETS=$(setup-envtest use --print path)
go test ./... -v
```