https://github.com/cloudnative-pg/chaos-testing
Chaos testing project to enhance CloudnativePG's resilience and fault tolerance
- Host: GitHub
- URL: https://github.com/cloudnative-pg/chaos-testing
- Owner: cloudnative-pg
- License: apache-2.0
- Created: 2025-09-23T05:24:29.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-12-11T14:11:14.000Z (4 months ago)
- Last Synced: 2026-01-30T17:49:47.635Z (about 2 months ago)
- Language: Shell
- Size: 316 KB
- Stars: 7
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: CODEOWNERS
- Governance: GOVERNANCE.md
# CloudNativePG Chaos Testing with Jepsen

Production-ready Jepsen and Litmus chaos automation for CloudNativePG (CNPG) clusters.
---
## 🚀 Quick Start
**Want to run chaos testing immediately?** Follow these streamlined steps:
0. **Clone this repo** → Get the chaos experiments and scripts (section 0)
1. **Setup cluster** → Bootstrap CNPG Playground (section 1)
2. **Install CNPG** → Deploy operator + sample cluster (section 2)
3. **Install Litmus** → Install operator, experiments, and RBAC (sections 3, 3.5, 3.6)
4. **Smoke-test chaos** → Run the quick pod-delete check without monitoring (section 4)
5. **Add monitoring** → Install Prometheus for probe validation (section 5; required before section 6 with probes enabled)
6. **Run Jepsen** → Full consistency testing layered on chaos (section 6)
**First-time users:** Use section 4 as a smoke test without Prometheus, then return to section 5 to install monitoring before running the Jepsen workflow in section 6.
---
## ✅ Prerequisites
- Linux/macOS shell with `bash`, `git`, `curl`, `jq`, and internet access.
- Container + Kubernetes tooling: Docker **or** Podman, the [Kind CLI](https://kind.sigs.k8s.io/) tool, `kubectl`, `helm`, the [`kubectl cnpg` plugin](https://cloudnative-pg.io/documentation/current/kubectl-plugin/) binary, and the [`cmctl` utility](https://cert-manager.io/docs/reference/cmctl/) for cert-manager.
- Install the CNPG plugin using kubectl krew (recommended):
```bash
# Install or update to the latest version
kubectl krew update
kubectl krew install cnpg || kubectl krew upgrade cnpg
kubectl cnpg version
```
> **Alternative installation methods:**
>
> - For Debian/Ubuntu: Download `.deb` from [releases page](https://github.com/cloudnative-pg/cloudnative-pg/releases)
> - For RHEL/Fedora: Download `.rpm` from [releases page](https://github.com/cloudnative-pg/cloudnative-pg/releases)
> - See [official installation docs](https://cloudnative-pg.io/documentation/current/kubectl-plugin) for all methods
- Optional but recommended: `kubectx`, `stern`, `kubectl-view-secret` (see the [CNPG Playground README](https://github.com/cloudnative-pg/cnpg-playground#prerequisites) for a complete list).
- **Disk Space:** Minimum **30GB** free disk space recommended:
  - Kind cluster nodes: ~5GB
  - Container images: ~5GB (first run with image pull)
  - Prometheus/MongoDB storage: ~10GB
  - Jepsen results + logs: ~5GB
  - Buffer for growth: ~5GB
- Sufficient local resources for a multi-node Kind cluster (≈8 CPUs / 12 GB RAM) and permission to run port-forwards.
Once the tooling is present, everything else is managed via repository scripts and Helm charts.
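
The tooling list above can be checked before starting. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper, and the command names (`kind`, `cmctl`, and so on) are assumed to match the binaries installed by the tools listed above.

```bash
# Print every command from the argument list that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools named in the prerequisites (binary names are assumptions).
missing="$(missing_tools bash git curl jq kind kubectl helm cmctl)"

# Docker or Podman: either one satisfies the container runtime requirement.
command -v docker >/dev/null 2>&1 || command -v podman >/dev/null 2>&1 \
  || missing="$missing docker-or-podman"

if [ -n "$missing" ]; then
  echo "Missing tools:" $missing >&2
else
  echo "All required tools found."
fi
```

The `kubectl cnpg` plugin is not covered by this check; `kubectl cnpg version`, shown earlier, confirms it separately.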
---
## ⚡ Setup and Configuration
> Follow these sections in order; each references the authoritative upstream documentation to keep this README concise.
### 0. Clone the Chaos Testing Repository
**First, clone this repository to access the chaos experiments and scripts:**
```bash
git clone https://github.com/cloudnative-pg/chaos-testing.git
cd chaos-testing
```
All subsequent commands reference files in this repository (experiments, scripts, monitoring configs).
### 1. Bootstrap the CNPG Playground
The upstream documentation covers prerequisites and networking in detail. Follow the setup instructions in the CNPG Playground README (listed under References).
Deploy the `cnpg-playground` project in a parallel folder to `chaos-testing`:
```bash
cd ..
git clone https://github.com/cloudnative-pg/cnpg-playground.git
cd cnpg-playground
./scripts/setup.sh eu # creates kind-k8s-eu cluster
```
Follow the instructions on the screen. In particular, make sure that you:
1. export the `KUBECONFIG` variable, as described
2. set the correct context for kubectl
For example:
```bash
export KUBECONFIG=/k8s/kube-config.yaml
kubectl config use-context kind-k8s-eu
```
If unsure, type:
```bash
./scripts/info.sh # displays contexts and access information
```
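
A wrong context is an easy mistake when several Kind clusters exist. As a small sketch (the helper name `check_context` is hypothetical), the current context can be compared against the expected one before running any later step:

```bash
# Succeed only when kubectl's current context equals the expected one.
# The default, kind-k8s-eu, is the context used throughout this README.
check_context() {
  local expected="${1:-kind-k8s-eu}" current
  current="$(kubectl config current-context 2>/dev/null)" || {
    echo "kubectl has no current context (is KUBECONFIG exported?)" >&2
    return 2
  }
  if [ "$current" != "$expected" ]; then
    echo "current context is '$current', expected '$expected'" >&2
    return 1
  fi
}
```

Calling `check_context` with no argument checks for `kind-k8s-eu`; pass another name to check a different cluster.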
### 2. Install CloudNativePG and Create the PostgreSQL Cluster
With the Kind cluster running, install the operator using the **kubectl cnpg plugin** as recommended in the [CloudNativePG Installation & Upgrades guide](https://cloudnative-pg.io/documentation/current/installation_upgrade/). This approach ensures you get the latest stable operator version:
**In the `cnpg-playground` folder:**
```bash
# Install the latest operator version using the kubectl cnpg plugin
kubectl cnpg install generate --control-plane | \
kubectl --context kind-k8s-eu apply -f - --server-side
# Verify the controller rollout
kubectl --context kind-k8s-eu rollout status deployment \
-n cnpg-system cnpg-controller-manager
```
**In the `chaos-testing` folder:**
```bash
cd ../chaos-testing
# Create the pg-eu PostgreSQL cluster for chaos testing
kubectl apply -f clusters/pg-eu-cluster.yaml
# Verify cluster is ready (this will watch until healthy)
kubectl get cluster pg-eu -w # Wait until status shows "Cluster in healthy state"
# Press Ctrl+C when you see: pg-eu 3 3 ready XX m
```
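
For unattended runs, the interactive watch above can be replaced with a polling loop. This is a sketch under assumptions: `retry_until` and `cluster_ready` are hypothetical helpers, and the readiness check reads the Cluster's `.status.readyInstances` field, which is assumed to be the source of the ready count shown in the `kubectl get cluster` output.

```bash
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry_until() {
  local attempts="$1" delay="$2" n=1
  shift 2
  while [ "$n" -le "$attempts" ]; do
    "$@" && return 0
    n=$((n + 1))
    [ "$n" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Succeed when the named Cluster reports the wanted number of ready instances.
cluster_ready() {
  local name="$1" want="$2" ready
  ready="$(kubectl get cluster "$name" -o jsonpath='{.status.readyInstances}' 2>/dev/null)"
  [ "$ready" = "$want" ]
}

# Usage (poll every 10s for up to 5 minutes):
#   retry_until 30 10 cluster_ready pg-eu 3 || echo "pg-eu not ready" >&2
```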
### 3. Install Litmus Chaos
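Litmus 3.x separates the operator (the `litmus-core` chart) from the ChaosCenter UI (the `litmus` chart). The commands below install `litmus-core`, which provides the operator and CRDs used in this guide; the experiment definitions and RBAC follow in sections 3.5 and 3.6: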
```bash
# Add Litmus Helm repository
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
# Install litmus-core (operator + CRDs)
helm upgrade --install litmus-core litmuschaos/litmus-core \
--namespace litmus --create-namespace \
--wait --timeout 10m
# Verify CRDs are installed
kubectl get crd chaosengines.litmuschaos.io chaosexperiments.litmuschaos.io chaosresults.litmuschaos.io
# Verify operator is running
kubectl -n litmus get deploy litmus
kubectl -n litmus wait --for=condition=Available deployment/litmus --timeout=5m
```
### 3.5. Install ChaosExperiment Definitions
The ChaosEngine requires ChaosExperiment resources to exist before it can run. Install the `pod-delete` experiment:
```bash
# Install from Chaos Hub (has namespace: default hardcoded, so override it)
# The URL is quoted so that shells such as zsh do not treat "?" as a glob character
kubectl apply --namespace=litmus -f 'https://hub.litmuschaos.io/api/chaos/master?file=faults/kubernetes/pod-delete/fault.yaml'
# Verify experiment is installed
kubectl -n litmus get chaosexperiments
# Should show: pod-delete
```
### 3.6. Configure RBAC for Chaos Experiments
Apply the RBAC configuration and verify the service account has correct permissions:
```bash
# Apply RBAC (ServiceAccount, ClusterRole, ClusterRoleBinding)
kubectl apply -f litmus-rbac.yaml
# Verify the ServiceAccount exists in litmus namespace
kubectl -n litmus get serviceaccount litmus-admin
# Verify the ClusterRoleBinding points to correct namespace
kubectl get clusterrolebinding litmus-admin -o jsonpath='{.subjects[0].namespace}'
# Should output: litmus (not default)
# Test permissions (optional)
kubectl auth can-i delete pods --as=system:serviceaccount:litmus:litmus-admin -n default
# Should output: yes
```
> **Important:** The `litmus-rbac.yaml` ClusterRoleBinding must reference `namespace: litmus` in the subjects section. If you see errors like `"litmus-admin" cannot get resource "chaosengines"`, verify the namespace matches where the ServiceAccount exists.
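
The permission test above can be extended to several verb/resource pairs in one pass. The sketch below is illustrative only: `check_sa_permissions` is a hypothetical helper, and the two pairs in the usage comment are the only ones this README mentions (`delete pods` and `get chaosengines`); the full set required by `litmus-rbac.yaml` is not listed here.

```bash
# Check "verb resource" pairs for a ServiceAccount with `kubectl auth can-i`.
# Prints each denied pair and returns 1 if any pair is denied.
check_sa_permissions() {
  local sa="$1" ns="$2" denied=0 pair verb resource answer
  shift 2
  for pair in "$@"; do
    verb="${pair%% *}"      # text before the first space
    resource="${pair#* }"   # text after the first space
    # can-i exits non-zero on "no", so the exit status is ignored here
    answer="$(kubectl auth can-i "$verb" "$resource" --as="$sa" -n "$ns" 2>/dev/null || true)"
    if [ "$answer" != "yes" ]; then
      echo "denied: $verb $resource"
      denied=1
    fi
  done
  return "$denied"
}

# Usage:
#   check_sa_permissions system:serviceaccount:litmus:litmus-admin default "delete pods"
#   check_sa_permissions system:serviceaccount:litmus:litmus-admin litmus "get chaosengines"
```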
### 4. (Optional) Test Chaos Without Monitoring
Before setting up the full monitoring stack, you can verify chaos mechanics work independently:
```bash
# Apply the probe-free chaos engine (no Prometheus dependency)
kubectl apply -f experiments/cnpg-jepsen-chaos-noprobes.yaml
# Watch the chaos runner pod start (refreshes every 2s)
# Press Ctrl+C once you see the runner pod appear
watch -n2 'kubectl -n litmus get pods | grep cnpg-jepsen-chaos-noprobes-runner'
# Monitor CNPG pod deletions in real-time
bash scripts/monitor-cnpg-pods.sh pg-eu default litmus kind-k8s-eu
# Wait for chaos runner pod to be created, then check logs
kubectl -n litmus wait --for=condition=ready pod -l chaos-runner-name=cnpg-jepsen-chaos-noprobes --timeout=60s && \
runner_pod=$(kubectl -n litmus get pods -l chaos-runner-name=cnpg-jepsen-chaos-noprobes -o jsonpath='{.items[0].metadata.name}') && \
kubectl -n litmus logs -f "$runner_pod"
# After completion, check the result (the ChaosResult is named <engine>-<experiment>, not after the engine alone)
kubectl -n litmus get chaosresult cnpg-jepsen-chaos-noprobes-pod-delete -o jsonpath='{.status.experimentStatus.verdict}'
# Should output: Pass (if probes are disabled) or Error (if Prometheus probes enabled but Prometheus not installed)
# Clean up for next test
kubectl -n litmus delete chaosengine cnpg-jepsen-chaos-noprobes
```
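
When the verdict check is scripted, it helps to turn the verdict string into an exit status. The mapping below is a sketch: `verdict_status` is a hypothetical helper, and only `Pass` and `Error` appear in this README; `Fail` and any in-progress value are assumptions about what Litmus may report.

```bash
# Map a ChaosResult verdict string to an exit status:
#   Pass -> 0, Fail -> 1, anything else (Error, in-progress, empty) -> 2
verdict_status() {
  case "$1" in
    Pass) return 0 ;;
    Fail) return 1 ;;
    *)    return 2 ;;
  esac
}

# Usage (requires a finished run):
#   verdict="$(kubectl -n litmus get chaosresult cnpg-jepsen-chaos-noprobes-pod-delete \
#     -o jsonpath='{.status.experimentStatus.verdict}')"
#   verdict_status "$verdict" || echo "chaos run did not pass: ${verdict:-none}" >&2
```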
**What to observe:**
- The runner pod starts and creates an experiment pod (`pod-delete-xxxxx`)
- CNPG primary pods are deleted every 60 seconds
- CNPG automatically promotes a replica to primary after each deletion
- Deleted pods are recreated by the CloudNativePG operator (CNPG manages instance pods directly and does not use StatefulSets)
- The experiment runs for 10 minutes (TOTAL_CHAOS_DURATION=600)
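
The numbers above give a quick upper bound on how many primary deletions to expect: total duration divided by the interval between deletions. A small worked example follows; `max_chaos_iterations` is a hypothetical helper, and the real count may be lower because each iteration also spends time on the deletion itself.

```bash
# Upper bound on chaos iterations: duration / interval (integer division).
max_chaos_iterations() {
  local duration="$1" interval="$2"
  if [ "$interval" -le 0 ]; then
    echo "interval must be positive" >&2
    return 1
  fi
  echo $(( duration / interval ))
}

max_chaos_iterations 600 60   # prints 10
```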
> **Note:** Keep using `experiments/cnpg-jepsen-chaos-noprobes.yaml` until Section 5 installs Prometheus/Grafana. Once monitoring is online, switch to `experiments/cnpg-jepsen-chaos.yaml` (probes enabled) for full observability.
### 5. Configure monitoring (Prometheus + Grafana)
The **cnpg-playground** provides a built-in monitoring stack with Prometheus and Grafana. From the cnpg-playground directory:
```bash
cd ../cnpg-playground
./monitoring/setup.sh eu
```
This script installs:
- **Prometheus Operator** (in `prometheus-operator` namespace)
- **Grafana Operator** with the official CloudNativePG dashboard (in `grafana` namespace)
- Auto-configured for the `kind-k8s-eu` cluster
Once installation completes, create the PodMonitor to expose CNPG metrics:
```bash
# Switch back to chaos-testing directory
cd ../chaos-testing
# Apply CNPG PodMonitor
kubectl apply -f monitoring/podmonitor-pg-eu.yaml
# Verify PodMonitor
kubectl get podmonitor pg-eu -o wide
# Verify Prometheus is scraping CNPG metrics
kubectl -n prometheus-operator port-forward svc/prometheus-operated 9090:9090 &
sleep 3 # give the background port-forward time to start listening
curl -s --data-urlencode 'query=sum(cnpg_collector_up{cluster="pg-eu"})' "http://localhost:9090/api/v1/query"
```
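
The query returns JSON, which is easier to assert on once the sample value is extracted. The following sketch uses `jq` (already a prerequisite); `prom_first_value` is a hypothetical helper, and it assumes the standard Prometheus instant-query response shape with a `status` field and a `data.result` array.

```bash
# Read a Prometheus instant-query response on stdin and print the value of
# the first sample. Exits non-zero if the status is not "success" or the
# result set is empty.
prom_first_value() {
  jq -er 'select(.status == "success") | .data.result[0].value[1]'
}

# Usage (with the port-forward from above still running):
#   curl -s --data-urlencode 'query=sum(cnpg_collector_up{cluster="pg-eu"})' \
#     "http://localhost:9090/api/v1/query" | prom_first_value
```

For a healthy three-instance `pg-eu` cluster the printed value would be expected to match the instance count, though that depends on every instance being scraped.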
**Access Grafana dashboard:**
```bash
kubectl -n grafana port-forward svc/grafana-service 3000:3000
# Open http://localhost:3000 with:
# Username: admin
# Password: admin (you'll be prompted to change on first login)
```
The official CloudNativePG dashboard is pre-configured and available at: **Home → Dashboards → grafana → CloudNativePG**
> **Note:** If you recreate the `pg-eu` cluster, reapply the PodMonitor so Prometheus resumes scraping: `kubectl apply -f monitoring/podmonitor-pg-eu.yaml`
> ✅ **Required before section 6 (when probes are enabled):** Complete this monitoring setup so the Prometheus probes defined in `experiments/cnpg-jepsen-chaos.yaml` can succeed.
#### Dependency on cnpg-playground
This project relies on cnpg-playground's monitoring implementation. Be aware of the following dependencies:
**What we depend on**:
- Script: `/path/to/cnpg-playground/monitoring/setup.sh`
- Namespace: `prometheus-operator`
- Service: `prometheus-operated` (created by Prometheus Operator for CR named `prometheus`)
- Port: `9090` (Prometheus default)
**If cnpg-playground monitoring changes**, you may need to update:
- Prometheus endpoint in `experiments/cnpg-jepsen-chaos.yaml` (lines 89, 132, 148)
- Service check in `.github/workflows/chaos-test-full.yml` (line 57)
- Service check in `scripts/run-jepsen-chaos-test.sh` (line 279)
**Troubleshooting**: If probes fail with connection errors:
```bash
# Verify the Prometheus service exists
kubectl -n prometheus-operator get svc
# If service name changed, update all probe endpoints
# in experiments/cnpg-jepsen-chaos.yaml
```
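
If the service name, namespace, or port does change, the probe endpoint has to be rebuilt from those three values. As a sketch, a Service is reachable inside the cluster at `http://<service>.<namespace>.svc:<port>`; `service_url` is a hypothetical helper, and the exact endpoint string in `experiments/cnpg-jepsen-chaos.yaml` is not shown in this README and may use a longer DNS form.

```bash
# Compose the in-cluster URL of a Service from its name, namespace, and port.
service_url() {
  local svc="$1" ns="$2" port="$3"
  printf 'http://%s.%s.svc:%s\n' "$svc" "$ns" "$port"
}

service_url prometheus-operated prometheus-operator 9090
# prints http://prometheus-operated.prometheus-operator.svc:9090
```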
### 6. Run the Jepsen chaos test
```bash
./scripts/run-jepsen-chaos-test.sh pg-eu app 600
```
This script deploys Jepsen (`jepsenpg` image), applies the Litmus ChaosEngine (primary pod delete), monitors logs, collects results, and cleans up transient resources automatically. It exits on its own when the run finishes, so no manual interruption is needed.
**Prerequisites before running the script:**
- Section 5 completed (Prometheus/Grafana running) so probes succeed.
- Chaos workflow validated (run `experiments/cnpg-jepsen-chaos.yaml` once manually if you need to confirm Litmus + CNPG wiring).
- Docker registry access to pull `ardentperf/jepsenpg` image (or pre-pulled into cluster).
- `kubectl` context pointing to the playground cluster with sufficient resources.
- **Increase max open files limit** if needed (required for Jepsen on some systems):
```bash
ulimit -n 65536
```
> This may need to be configured in your container runtime or Kind cluster configuration if running in a containerized environment.
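
Whether the limit needs raising can be checked first. A minimal sketch, with `fd_limit_ok` as a hypothetical helper and 65536 taken from the command above:

```bash
# Succeed when the soft open-files limit is at least the required value.
fd_limit_ok() {
  local required="$1" current
  current="$(ulimit -n)"
  [ "$current" = "unlimited" ] || [ "$current" -ge "$required" ]
}

# Usage:
#   fd_limit_ok 65536 || ulimit -n 65536
```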
**Script knobs:**
- `LITMUS_NAMESPACE` (default `litmus`) – set if you installed Litmus in a different namespace.
- `PROMETHEUS_NAMESPACE` (default `prometheus-operator`) – used to auto-detect the Prometheus service backing Litmus probes.
- `JEPSEN_IMAGE` is pinned to `ardentperf/jepsenpg@sha256:4a3644d9484de3144ad2ea300e1b66568b53d85a87bf12aa64b00661a82311ac` for reproducibility. Update this digest only after verifying upstream releases.
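
The namespace knobs read like environment variables with defaults. The sketch below shows how such defaults are commonly expressed in shell; it is an assumption about the mechanism, not an excerpt from `scripts/run-jepsen-chaos-test.sh`, and `print_knobs` is a hypothetical helper.

```bash
# Environment value wins when set; otherwise the documented default applies.
print_knobs() {
  echo "LITMUS_NAMESPACE=${LITMUS_NAMESPACE:-litmus}"
  echo "PROMETHEUS_NAMESPACE=${PROMETHEUS_NAMESPACE:-prometheus-operator}"
}

print_knobs
```

Under that assumption, an override would be passed on the command line, for example `LITMUS_NAMESPACE=chaos ./scripts/run-jepsen-chaos-test.sh pg-eu app 600`.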
### 7. Inspect test results
- All test results are stored under `logs/jepsen-chaos-/`.
- Quick validation commands:
```bash
# Check Litmus chaos verdict (note: use -n litmus, not -n default)
kubectl -n litmus get chaosresult cnpg-jepsen-chaos-pod-delete \
-o jsonpath='{.status.experimentStatus.verdict}'
# View full chaos result details
kubectl -n litmus get chaosresult cnpg-jepsen-chaos-pod-delete -o yaml
# Check probe results (if Prometheus was installed)
kubectl -n litmus get chaosresult cnpg-jepsen-chaos-pod-delete \
-o jsonpath='{.status.probeStatuses}' | jq
```
- Archive `history.edn` and `chaos-results/chaosresult.yaml` for analysis or reporting.
---
## 📦 Results & logs
- Each run creates a folder under `logs/jepsen-chaos-/`.
- Key files:
  - `results/history.edn` → Jepsen operation history.
  - `results/chaos-results/chaosresult.yaml` → Litmus verdict + probe output.
- Quick checks:
```bash
# Chaos results (note: namespace is 'litmus' by default)
kubectl -n litmus get chaosresult cnpg-jepsen-chaos-pod-delete \
-o jsonpath='{.status.experimentStatus.verdict}'
```
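
To locate the newest run folder and confirm the key files were collected, something like the following can be used. It is a sketch: `latest_run_dir` and `missing_results` are hypothetical helpers, and picking the "latest" folder assumes the folder-name suffix sorts chronologically (for example a timestamp).

```bash
# Print the last run directory under a logs root in lexical order.
# Returns 1 when no run directory exists.
latest_run_dir() {
  local root="${1:-logs}" dir latest=""
  for dir in "$root"/jepsen-chaos-*/; do
    [ -d "$dir" ] || continue   # skips the literal pattern when nothing matches
    latest="${dir%/}"
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Print each key result file that is missing from a run directory.
missing_results() {
  local run="$1" f
  for f in results/history.edn results/chaos-results/chaosresult.yaml; do
    [ -f "$run/$f" ] || printf '%s\n' "$f"
  done
}

# Usage:
#   run="$(latest_run_dir logs)" && missing_results "$run"
```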
---
## 🔗 References & more docs
- CNPG Playground: https://github.com/cloudnative-pg/cnpg-playground
- CloudNativePG Installation & Upgrades: https://cloudnative-pg.io/documentation/current/installation_upgrade/
- Litmus Helm chart: https://github.com/litmuschaos/litmus-helm/
- kube-prometheus-stack: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
- CNPG Grafana dashboards: https://github.com/cloudnative-pg/grafana-dashboards
- License: Apache 2.0 (see `LICENSE`).
---
## 🔧 Monitoring and Observability Tools
### Real-time Monitoring Script
Watch CNPG pods, chaos engines, and cluster events during experiments:
```bash
# Monitor pod deletions and failovers in real-time
bash scripts/monitor-cnpg-pods.sh
# Example
bash scripts/monitor-cnpg-pods.sh pg-eu default litmus kind-k8s-eu
```
**What it shows:**
- CNPG pod status with role labels (primary/replica)
- Active ChaosEngines in the chaos namespace
- Recent Kubernetes events (pod deletions, promotions, etc.)
- Updates every 2 seconds
## 📚 Additional Resources
- **CNPG Documentation:**
- **Litmus Documentation:**
- **Jepsen Documentation:**
- **PostgreSQL High Availability:**
---
Follow the sections above to execute chaos tests. Review the logs for analysis, and consult the `/archive` directory for additional documentation if needed.