https://github.com/jebinjeb/k8s-evacuator

Advanced Kubernetes node evacuation tool with safe batching, workload-aware draining, and progressive (controlled) pod eviction.

cloud-native cluster-operations devops k8s kubectl kubernetes node-drain platform-engineering pod-eviction sre statefulset



Host: GitHub
URL: https://github.com/jebinjeb/k8s-evacuator
Owner: jebinjeb
Created: 2026-03-21T07:08:08.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-03-23T06:39:12.000Z (4 months ago)
Last Synced: 2026-03-24T03:16:07.974Z (4 months ago)
Topics: cloud-native, cluster-operations, devops, k8s, kubectl, kubernetes, node-drain, platform-engineering, pod-eviction, sre, statefulset
Language: Python
Homepage: https://github.com/jebinjeb/k8s-evacuation#readme
Size: 35.2 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# 🚀 Kubernetes Node Evacuator

![Python](https://img.shields.io/badge/python-3.8%2B-blue)

![Kubernetes](https://img.shields.io/badge/kubernetes-compatible-green)

![License](https://img.shields.io/badge/license-MIT-blue)

![Status](https://img.shields.io/badge/status-active-success)

![Contributions](https://img.shields.io/badge/contributions-welcome-brightgreen)

![Maintenance](https://img.shields.io/badge/maintained-yes-green)

> 🧠 A smarter, workload-aware alternative to kubectl drain using the Kubernetes Eviction API

A Python-based tool to safely evacuate pods from a Kubernetes node, with **intelligent pod ordering, batch support, StatefulSet handling**, optional **Prometheus metrics**, and live CLI progress tracking.

---

## ✨ Why not just `kubectl drain`?

While `kubectl drain` is safe, it is **not workload-aware**:

- Evicts pods in a mostly **flat / unordered way**

- No control over **batching strategies**

- No awareness of **resource impact per workload**

- Limited visibility into **progress and recovery**

👉 This tool solves these problems with **controlled, observable, and intelligent evacuation**.

---

## ⚙️ Eviction Strategy (Production-Safe)

This tool uses the **Kubernetes Eviction API** — the same mechanism as `kubectl drain`.

### ✅ Guarantees

- ✅ Graceful pod eviction (no abrupt termination)  

- ✅ Respects **PodDisruptionBudgets (PDB)**  

- ✅ Waits for **replacement pods to become Ready**  

- ✅ Ensures workloads reach **desired state before continuing**  

- ❌ Does **NOT** force delete pods by default  
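
The "wait until the workload is back at its desired state" guarantee can be pictured as a polling loop with a timeout. The following is only a minimal sketch: `wait_for_desired_state`, its parameters, and the injected `get_ready_replicas` callable are hypothetical names for illustration, not the tool's actual code.

```python
import time
from typing import Callable


def wait_for_desired_state(
    get_ready_replicas: Callable[[], int],
    desired_replicas: int,
    timeout_s: float = 300.0,
    poll_interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until enough replicas are Ready, or give up after timeout_s.

    get_ready_replicas is a hypothetical callable; in a real tool it would
    read the workload's ready replica count from the Kubernetes API.
    """
    deadline = clock() + timeout_s
    while True:
        # Success as soon as the workload is back at (or above) its target.
        if get_ready_replicas() >= desired_replicas:
            return True
        # Stop waiting once the deadline has passed.
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

Injecting `sleep` and `clock` keeps the loop easy to exercise without a cluster; a caller would only move on to the next batch when the function returns `True`.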

---

### ⚙️ Behavior

- Uses **`policy/v1` Eviction API**

- Retries when blocked by PDB (`429 Too Many Requests`)

- Supports configurable fallback strategies:

  - Graceful delete

  - Force delete *(optional, last resort)*
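
The retry behavior described above can be sketched as exponential backoff around a single eviction call. This is an illustration under assumptions: `EvictionBlocked` is a hypothetical stand-in for an API error carrying an HTTP status (with the official Python client, the `evict` callable could wrap `CoreV1Api.create_namespaced_pod_eviction`, which raises `ApiException` with a `.status` attribute), and the delay values are arbitrary. The fallback strategies are not shown.

```python
import time
from typing import Callable


class EvictionBlocked(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"eviction failed with HTTP {status}")
        self.status = status


def evict_with_retry(
    evict: Callable[[], None],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call evict(); on HTTP 429 (blocked by a PDB) back off and try again.

    Returns True when the eviction was accepted, False when every attempt
    was blocked. Any other error is re-raised to the caller.
    """
    delay = base_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            evict()
            return True
        except EvictionBlocked as err:
            if err.status != 429:
                raise  # not a PDB block, so retrying is unlikely to help
            if attempt == max_attempts:
                return False
            sleep(delay)
            # Double the wait each time, up to a fixed ceiling.
            delay = min(delay * 2, max_delay_s)
    return False
```

Returning `False` instead of raising leaves the choice of fallback (graceful delete, force delete, or abort) to the caller.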

---

> ⚠️ Designed for **zero-downtime or minimal-disruption operations** in production environments.

---

## Features

- **Safe pod evacuation** without modifying Deployment/StatefulSet specs.  

- **Pod-aware batching**: one-by-one, fixed-size batch, or all-at-once.  

- **StatefulSet support**: pods evicted in ordinal order to preserve stability. 

- **Grouping strategies**:

  - `owner` (default for workloads) → evacuates pods workload by workload.

  - `spread` → evicts pods from multiple workloads evenly across batches to minimize impact per workload.

    - Supports controlled execution via `--max-batches` (process N batches and exit).

- **Pre and post-checks**: waits for workloads to reach **desired state**.  

- **Excludes**: DaemonSets, Jobs, completed/failed pods, mirror pods.  

- **Optional metrics**: push per-pod progress and status to Prometheus Pushgateway.  

- **Live progress tracking** in CLI.  

- Fully configurable **timeout** and **retry** logic.

- **Dry-run mode** for safe validation.
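
To make the difference between the `owner` and `spread` grouping strategies concrete, here is a small sketch of how pods might be ordered and then cut into batches. The `(owner, pod name)` tuple model and the function names are simplifications invented for this example; the tool's real data structures may differ.

```python
from collections import defaultdict
from itertools import zip_longest
from typing import Dict, Iterable, List, Tuple

# (owner name, pod name): a simplified, hypothetical pod model.
Pod = Tuple[str, str]


def _group_by_owner(pods: Iterable[Pod]) -> Dict[str, List[Pod]]:
    groups: Dict[str, List[Pod]] = defaultdict(list)
    for pod in pods:
        groups[pod[0]].append(pod)
    return groups


def order_by_owner(pods: Iterable[Pod]) -> List[Pod]:
    """owner strategy: finish one workload before starting the next."""
    groups = _group_by_owner(pods)
    return [pod for owner in groups for pod in groups[owner]]


def order_by_spread(pods: Iterable[Pod]) -> List[Pod]:
    """spread strategy: take one pod per workload in turn (round-robin)."""
    groups = _group_by_owner(pods)
    rounds = zip_longest(*groups.values())
    return [pod for round_ in rounds for pod in round_ if pod is not None]


def make_batches(ordered: List[Pod], batch_size: int) -> List[List[Pod]]:
    """Cut the ordered pods into fixed-size batches (1 means one-by-one)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


pods = [("web", "web-1"), ("db", "db-0"), ("web", "web-2"), ("web", "web-3"), ("db", "db-1")]
spread_batches = make_batches(order_by_spread(pods), batch_size=2)
```

With this input, each of the first two spread batches holds one `web` pod and one `db` pod, while the `owner` ordering would drain all `web` pods before touching `db`. Limiting a run to N batches, as `--max-batches` does, would amount to slicing the batch list.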

---

## TODO

- **Rollback / Retry**: If eviction fails midway, optionally roll back already evicted pods or retry safely.

- **Advanced Dry-Run Simulation**: Check if pods can actually be scheduled on other nodes before eviction.

  - NodeSelector / Affinity rules

  - Taints & tolerations

  - Available CPU / memory

- Consider per-workload max-unavailable in spread mode to avoid evicting all replicas at once.

## 🚧 Experimental / In-Progress Features

The following features are under development and may evolve:

### 🔹 `--batch-size 0` (Dynamic Batching)

Automatically calculates a safe batch size based on workload size.

- Adapts eviction speed to the number of replicas

- Prevents over-eviction of small workloads

- Designed to maintain minimum availability during evacuation
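
The README does not state how the dynamic batch size is computed, so the following is only one possible heuristic, shown to illustrate the idea of scaling the batch with the replica count. The function name, the `min_available_ratio` parameter, and its default of 0.75 are all assumptions.

```python
import math


def dynamic_batch_size(
    replicas: int,
    pods_on_node: int,
    min_available_ratio: float = 0.75,
) -> int:
    """Hypothetical heuristic: evict only what keeps a share of replicas up.

    replicas            total desired replicas of the workload
    pods_on_node        replicas currently on the node being evacuated
    min_available_ratio fraction of replicas that should stay running
    """
    if replicas < 1 or pods_on_node < 1:
        return 0
    must_stay = math.ceil(replicas * min_available_ratio)
    # Always allow at least one eviction, otherwise small workloads
    # (for example a single replica) could never leave the node.
    allowed = max(1, replicas - must_stay)
    return min(allowed, pods_on_node)
```

Under these assumptions a 10-replica workload with 4 pods on the node is drained 2 pods at a time, while a 2-replica workload is drained one pod at a time.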

---

### ⚡ `--evict-all-safe` (Fast Path Eviction)

Evicts all pods in a workload at once when considered safe.

- Useful for stateless or highly replicated workloads

- Skips batching for faster node evacuation

- Automatically avoids unsafe scenarios (e.g., StatefulSets)

---

### 🛡️ `--respect-pdb` (PDB-Aware Eviction)

Adjusts eviction behavior based on PodDisruptionBudgets.

- Uses Kubernetes PDB limits to determine safe eviction count

- Prevents disruption beyond allowed thresholds

- Aligns evacuation strategy with cluster safety policies
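
One way to picture PDB-aware batching is to treat each budget's `status.disruptionsAllowed` value as a counter and only pick pods while every matching budget still has room. The sketch below is illustrative only: the `Pdb` and `Pod` dataclasses and `select_evictable` are hypothetical, all objects are assumed to live in one namespace, and only `matchLabels` selectors are handled (`matchExpressions` are ignored).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Pdb:
    match_labels: Dict[str, str]
    disruptions_allowed: int  # mirrors status.disruptionsAllowed


@dataclass
class Pod:
    name: str
    labels: Dict[str, str]


def matches(pdb: Pdb, pod: Pod) -> bool:
    """True when every selector label is present on the pod."""
    return all(pod.labels.get(k) == v for k, v in pdb.match_labels.items())


def select_evictable(pods: List[Pod], pdbs: List[Pdb], batch_size: int) -> List[Pod]:
    """Pick up to batch_size pods without exceeding any matching budget."""
    remaining = [pdb.disruptions_allowed for pdb in pdbs]
    chosen: List[Pod] = []
    for pod in pods:
        if len(chosen) >= batch_size:
            break
        covering = [i for i, pdb in enumerate(pdbs) if matches(pdb, pod)]
        # Skip the pod if any budget that covers it is already used up.
        if all(remaining[i] > 0 for i in covering):
            for i in covering:
                remaining[i] -= 1
            chosen.append(pod)
    return chosen
```

Pods skipped here would stay on the node until a later pass, once replacements are Ready and the budget allows further disruptions. The Eviction API still enforces the budget server-side, so a selection like this only reduces the number of `429` responses.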

---

> ⚠️ These features are experimental and may change in future releases.

## 📦 Installation

### 🔹 Prerequisites

- Python **3.8+**

- Access to a Kubernetes cluster

- `kubeconfig` configured (e.g., `~/.kube/config`), or running inside a cluster (in-cluster config)

---

### 🔹 Install Dependencies

```bash
pip install kubernetes prometheus-client
```

### 🔹 Run the Tool

```bash
python k8s_evacuator.py --node <node-name>
```

## 🏗️ Architecture

```mermaid
flowchart TD
    A[User CLI] --> B[Evacuator Engine]
    B --> C[Cordon Node]
    B --> D[Fetch Pods on Node]
    D --> E[Filter Pods]
    E --> F{Grouping Strategy}
    F -->|owner| G[Group by Workload]
    F -->|spread| H[Spread Across Workloads]
    G --> I[Sort Pods]
    H --> I
    I --> J{Batching Strategy}
    J -->|fixed| K[Static Batching]
    J -->|dynamic| L[Dynamic Batch Calculation]
    J -->|evict-all-safe| M[Fast Path Eviction]
    K --> N[Eviction Loop]
    L --> N
    M --> N
    N --> O[Evict via Eviction API]
    O -->|PDB Block 429| P[Retry / Backoff]
    P --> O
    O --> Q[Scheduler]
    Q --> R[New Pod Placement]
    R --> S[Replacement Pods]
    S --> T[Readiness Check]
    T --> U[Desired State Validation]
    U -->|Not Ready| N
    U -->|Ready| V[Next Batch]
    B --> W[Prometheus Pushgateway]
    W --> X[Metrics]
    style A fill:#f9f,stroke:#333
    style B fill:#bbf,stroke:#333
    style O fill:#fbb,stroke:#333
```
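
As an illustration of the "Filter Pods" stage, the exclusions listed under Features (DaemonSets, Jobs, completed/failed pods, mirror pods) could be expressed as a single predicate. This sketch assumes pods are plain dictionaries shaped like the Kubernetes API's JSON; the function names are hypothetical and not taken from the tool's source.

```python
from typing import Any, Dict, List

# Annotation the kubelet sets on mirror pods (static pods shown in the API).
MIRROR_ANNOTATION = "kubernetes.io/config.mirror"
SKIPPED_OWNER_KINDS = {"DaemonSet", "Job"}
FINISHED_PHASES = {"Succeeded", "Failed"}


def is_evacuation_candidate(pod: Dict[str, Any]) -> bool:
    """Return True when the pod should be evicted during evacuation."""
    metadata = pod.get("metadata", {})
    # Mirror pods are managed by the kubelet, not the API server.
    if MIRROR_ANNOTATION in (metadata.get("annotations") or {}):
        return False
    # Completed or failed pods have nothing left to move.
    if pod.get("status", {}).get("phase") in FINISHED_PHASES:
        return False
    # DaemonSet pods would be recreated on the same node; Job pods are skipped.
    owners = metadata.get("ownerReferences") or []
    return not any(owner.get("kind") in SKIPPED_OWNER_KINDS for owner in owners)


def filter_pods(pods: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the pods that the evacuation loop should process."""
    return [pod for pod in pods if is_evacuation_candidate(pod)]
```

Keeping the filter as a pure function over already-fetched pods makes it straightforward to reuse in dry-run mode.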

## 📊 Metrics (Prometheus Pushgateway)

This tool can optionally push per-pod evacuation metrics to a Prometheus Pushgateway if `prometheus_client` is installed and the `--pushgateway` flag is provided.

### Metric

| Metric Name | Labels | Description |
|-------------|--------|-------------|
| `evacuation_pod_status` | `pod`, `namespace`, `status` | Tracks the status of each pod during evacuation. `status` can be: `evicted`, `ready`, or `failed`. |

### Status Labels

- **evicted** → Pod eviction has been initiated.  

- **ready** → Replacement pod is running and ready.  

- **failed** → Pod eviction failed or replacement pod did not become ready.  

### Usage

```bash
python k8s_evacuator.py --node <node-name> --pushgateway http://pushgateway.example.com:9091
```

## 📊 Metrics TODO

### 🚀 1. Pod Movement Metric

Tracks how pods move between nodes during evacuation.

#### Metric Definition

| Metric Name | Labels | Description |
|------------|--------|------------|
| `evacuation_pod_movement` | `old_pod`, `old_node`, `new_pod`, `new_node`, `namespace`, `status` | Tracks pod migration from source node to destination node |

#### Label Details

| Label | Description | Example |
|------|------------|--------|
| `old_pod` | Original pod name | `nginx-abc123` |
| `old_node` | Source node (hostname) | `k3s-lab-worker` |
| `new_pod` | New pod name after rescheduling | `nginx-xyz789` |
| `new_node` | Destination node | `k3s-lab-worker-2` |
| `namespace` | Kubernetes namespace | `default` |
| `status` | Movement result | `moved`, `failed` |

#### Examples

| Scenario | Metric |
|--------|--------|
| Deployment Pod | `evacuation_pod_movement{old_pod="nginx-abc", old_node="node1", new_pod="nginx-def", new_node="node2", namespace="default", status="moved"} 1` |
| StatefulSet Pod | `evacuation_pod_movement{old_pod="mysql-0", old_node="node1", new_pod="mysql-0", new_node="node2", namespace="db", status="moved"} 1` |
| Failed Evacuation | `evacuation_pod_movement{old_pod="redis-abc", old_node="node1", new_pod="unknown", new_node="unknown", namespace="cache", status="failed"} 1` |

---

### 📦 2. Pod Status Metric

Tracks the lifecycle of pods during evacuation.

#### Metric Definition

| Metric Name | Labels | Description |
|------------|--------|------------|
| `evacuation_pod_status` | `pod`, `namespace`, `status` | Tracks pod state transitions during evacuation |

#### Label Details

| Label | Description | Example |
|------|------------|--------|
| `pod` | Pod name | `nginx-abc123` |
| `namespace` | Kubernetes namespace | `default` |
| `status` | Pod state | `evicted`, `ready`, `failed` |

#### Examples

| Status | Metric |
|-------|--------|
| Evicted | `evacuation_pod_status{pod="nginx-abc", namespace="default", status="evicted"} 1` |
| Ready | `evacuation_pod_status{pod="nginx-def", namespace="default", status="ready"} 1` |

---

### ⏱️ 3. Pod Rescheduling Duration (Optional)

Tracks the time taken for pods to become ready after eviction.

#### Metric Definition

| Metric Name | Labels | Description |
|------------|--------|------------|
| `evacuation_pod_reschedule_duration_seconds` | `pod`, `namespace` | Time taken for pod rescheduling |

#### Example

| Metric Type | Example |
|------------|--------|
| Bucket | `evacuation_pod_reschedule_duration_seconds_bucket{pod="nginx", namespace="default", le="5"} 1` |
| Sum | `evacuation_pod_reschedule_duration_seconds_sum{pod="nginx", namespace="default"} 3.2` |
| Count | `evacuation_pod_reschedule_duration_seconds_count{pod="nginx", namespace="default"} 1` |

---

### 🔍 Useful Prometheus Queries

| Use Case | Query |
|--------|------|
| Pods moved from a node | `evacuation_pod_movement{old_node="k3s-lab-worker", status="moved"}` |
| Distribution across nodes | `sum by (new_node) (evacuation_pod_movement{status="moved"})` |
| Failed evacuations | `evacuation_pod_movement{status="failed"}` |
| Total evicted pods | `count(evacuation_pod_status{status="evicted"})` |

---
