https://github.com/sudo-kraken/k3s-cluster-maintenance
🚀 Automated K3s cluster maintenance with Ansible - safely maintain nodes one at a time with health checks, Longhorn support, and master node protection. Sequential processing ensures cluster stability during updates and reboots.
ansible ansible-playbook bash bash-script bash-scripting bashrc k3s k3s-cluster kubernetes
- Host: GitHub
- URL: https://github.com/sudo-kraken/k3s-cluster-maintenance
- Owner: sudo-kraken
- License: mit
- Created: 2025-06-28T10:30:02.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-28T11:56:34.000Z (about 1 year ago)
- Last Synced: 2025-06-28T12:22:35.858Z (about 1 year ago)
- Topics: ansible, ansible-playbook, bash, bash-script, bash-scripting, bashrc, k3s, k3s-cluster, kubernetes
- Language: Shell
- Homepage:
- Size: 20.5 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
# K3s Cluster Maintenance
_A modular Ansible role and playbook that performs automated operating system patching and system maintenance on K3s cluster nodes with zero-downtime semantics. Designed for local runs or CI runners._
## Contents
- [Overview](#overview)
- [Architecture at a glance](#architecture-at-a-glance)
  - [Role structure](#role-structure)
  - [Group variables](#group-variables)
- [Features](#features)
- [Prerequisites](#prerequisites)
- [Quick start](#quick-start)
- [Configuration](#configuration)
  - [Role variables](#role-variables)
  - [Inventory structure](#inventory-structure)
  - [Repository contents](#repository-contents)
  - [Tag reference](#tag-reference)
- [Health](#health)
- [Endpoint](#endpoint)
- [Production notes](#production-notes)
- [Development](#development)
- [Troubleshooting](#troubleshooting)
- [Licence](#licence)
- [Security](#security)
- [Contributing](#contributing)
- [Support](#support)
- [Disclaimer](#disclaimer)
## Overview
Enterprise-grade automation for K3s clusters that safely applies system updates, security patches and package upgrades across master and worker nodes without impacting availability. Operations are orchestrated through a production-ready Ansible role that handles draining, reboots and post-update restoration.
## Architecture at a glance
- Modular Ansible role with `maintenance.yml` as the entry point
- Sequential node processing for zero-downtime
- Smart detection to skip when no updates are available
- Longhorn-aware storage health checks and recovery waits
- Robust reboot handling with adaptive wait logic
- Group-based configuration via `group_vars`
### Role structure
```
roles/
└── k3s_node_maintenance/
    ├── tasks/
    │   ├── main.yml                 # Main task orchestration
    │   ├── prerequisites.yml        # Pre-flight checks
    │   ├── package_checks.yml       # Update detection
    │   ├── cluster_preparation.yml  # Node draining
    │   ├── package_updates.yml      # OS updates
    │   ├── debian_updates.yml       # Debian/Ubuntu specific
    │   ├── redhat_updates.yml       # RHEL/CentOS specific
    │   ├── reboot_handling.yml      # Reboot coordination
    │   └── cluster_restoration.yml  # Node restoration
    ├── defaults/
    │   └── main.yml                 # Default variables
    ├── handlers/
    │   └── main.yml                 # Event handlers
    └── meta/
        └── main.yml                 # Role metadata
```
### Group variables
```
group_vars/
├── k3s_masters/main.yml # Master-specific settings
├── k3s_workers/main.yml # Worker-specific settings
├── os_debian/main.yml # Debian/Ubuntu settings
└── os_redhat/main.yml # RHEL/CentOS settings
```
## Features
- Automated OS patching: system updates, security patches and package upgrades
- Zero-downtime operations via safe, sequential node handling
- Intelligent detection that exits early when no updates are required
- Health monitoring across nodes, control plane and storage
- Native Longhorn integration with volume health verification and recovery waits
- Control plane safety with quorum-aware master handling
- Smart reboot management that adapts to node boot speeds
- Enterprise-ready modular role for scalability and customisation
## Prerequisites
- K3s cluster, single or multi-node
- Ansible 2.9 or newer, tested with 2.14.x
- kubectl configured for your cluster
- SSH access to all nodes with key-based authentication
- `kubernetes.core` Ansible collection
- Python Kubernetes client for API operations
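The command-line side of the list above can be checked before a run. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper that prints every command it cannot find on `PATH`, and the tool names in the example comment are assumptions about what a typical control machine needs.

```bash
#!/bin/sh
# Hypothetical helper (not shipped with the role): print each required
# command that is not found on PATH. The caller supplies the tool list.
missing_tools() {
    for tool in "$@"; do
        # `command -v` is the POSIX way to test whether a command exists
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example invocation on the control machine (tool list is an assumption):
#   missing_tools ansible-playbook ansible-galaxy kubectl ssh python3
```

An empty result means every listed command was found. The helper does not verify versions, the `kubernetes.core` collection or the Python Kubernetes client.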
## Quick start
Run maintenance using simple Ansible commands:
```bash
# Update all worker nodes
ansible-playbook -i hosts.yml maintenance.yml --limit k3s_workers
# Update all master nodes
ansible-playbook -i hosts.yml maintenance.yml --limit k3s_masters
# Update a specific node
ansible-playbook -i hosts.yml maintenance.yml --limit node-01
# Update the entire cluster
ansible-playbook -i hosts.yml maintenance.yml
```
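If you prefer to drive the one-node-at-a-time order yourself, the per-node `--limit` command above can be wrapped in a small loop. This is a sketch under assumptions: `maintain_sequentially` and the `RUNNER` variable are hypothetical names, not something the repository provides, and the role may already serialise nodes on its own.

```bash
#!/bin/sh
# Hypothetical wrapper: run the playbook against one node at a time and
# stop at the first failure. RUNNER is an illustrative override so the
# loop can be exercised without Ansible (for example RUNNER=true).
RUNNER="${RUNNER:-ansible-playbook}"

maintain_sequentially() {
    for node in "$@"; do
        echo "==> maintaining $node"
        # --limit restricts the run to a single inventory host
        "$RUNNER" -i hosts.yml maintenance.yml --limit "$node" || {
            echo "maintenance failed on $node; stopping" >&2
            return 1
        }
    done
}

# Example: maintain_sequentially worker-01 worker-02
```

Stopping on the first failure leaves the remaining nodes untouched, so a problem on one node can be investigated before any other node is taken out of service.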
## Configuration
### Role variables
Customise behaviour through group variables.
```yaml
# group_vars/k3s_masters/main.yml
k3s_node_maintenance_drain_timeout: 600
k3s_node_maintenance_wait_timeout: 1800
k3s_node_maintenance_skip_drain: true # Masters are not drained
# group_vars/k3s_workers/main.yml
k3s_node_maintenance_drain_timeout: 300
k3s_node_maintenance_wait_timeout: 600
k3s_node_maintenance_skip_drain: false
# group_vars/os_debian/main.yml
k3s_node_maintenance_package_manager: apt
k3s_node_maintenance_cache_valid_time: 3600
# group_vars/os_redhat/main.yml
k3s_node_maintenance_package_manager: dnf
k3s_node_maintenance_needs_restarting_available: true
```
### Inventory structure
Define your cluster in `hosts.yml`:
```yaml
all:
  children:
    k3s_cluster:
      children:
        k3s_masters:
          hosts:
            master-01:
              ansible_host: 10.0.0.100
            master-02:
              ansible_host: 10.0.0.101
            master-03:
              ansible_host: 10.0.0.102
        k3s_workers:
          hosts:
            worker-01:
              ansible_host: 10.0.0.150
            worker-02:
              ansible_host: 10.0.0.151
    os_debian:
      hosts:
        master-01:
        worker-01:
    os_redhat:
      hosts:
        master-02:
        master-03:
        worker-02:
```
### Repository contents
| File | Description |
|------|-------------|
| `maintenance.yml` | Main playbook that applies the `k3s_node_maintenance` role |
| `hosts.yml.example` | Example inventory with group structure |
| `ansible.cfg` | Ansible configuration |
| `roles/` | Modular role architecture |
| `group_vars/` | Node type and OS-specific variables |
| `requirements.txt` | Python dependencies |
### Tag reference
| Tag | Description | Use case |
|-----|-------------|----------|
| `prerequisites` | Pre-flight checks | Validate environment setup |
| `check_updates` | Package update detection | See what updates are available |
| `prepare` | Cluster preparation | Cordon and drain nodes only |
| `packages` | All package operations | Package management only |
| `updates` | Package installation | Install updates only |
| `reboot` | Reboot coordination | Reboot handling only |
| `restore` | Cluster restoration | Uncordon and restore scheduling |
| `resume` | Manual recovery | Resume after failures including restore |
| `uncordon` | Node uncordoning | Restore node scheduling only |
| `debian` | Debian or Ubuntu only | OS-specific operations |
| `redhat` | RHEL or CentOS only | OS-specific operations |
| `longhorn` | Longhorn operations | Storage-specific tasks |
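A mistyped tag typically matches no tasks rather than raising an error, so it can help to check tag names against the table above before a run. The sketch below is illustrative only: `validate_tags` and `KNOWN_TAGS` are hypothetical names, and the tag list is copied from the table, not read from the playbook.

```bash
#!/bin/sh
# Tags copied from the tag reference table; keep in sync by hand.
KNOWN_TAGS="prerequisites check_updates prepare packages updates reboot restore resume uncordon debian redhat longhorn"

# Hypothetical helper: succeed only if every tag in a comma-separated
# list appears in KNOWN_TAGS.
validate_tags() {
    # Split the comma-separated argument into words
    for tag in $(printf '%s' "$1" | tr ',' ' '); do
        case " $KNOWN_TAGS " in
            *" $tag "*) ;;   # known tag, keep checking
            *) echo "unknown tag: $tag" >&2; return 1 ;;
        esac
    done
}

# Example:
#   validate_tags "prepare,updates" && \
#     ansible-playbook -i hosts.yml maintenance.yml --tags "prepare,updates"
```

`ansible-playbook -i hosts.yml maintenance.yml --list-tags` remains the authoritative source for the tags the playbook actually defines.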
## Health
- Pre-flight validation of cluster prerequisites and connectivity
- Node readiness checks before and after maintenance
- Control plane validation for API server and etcd on masters
- Longhorn volume health checks and recovery waits when available
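The node readiness check can be approximated by hand from `kubectl` output. This is a minimal sketch, not the role's own implementation: `nodes_needing_attention` is a hypothetical helper that filters the STATUS column of `kubectl get nodes --no-headers`.

```bash
#!/bin/sh
# Hypothetical helper: read `kubectl get nodes --no-headers` output on
# stdin and print the name of every node whose STATUS column is not
# exactly "Ready". Cordoned nodes ("Ready,SchedulingDisabled") are
# reported as well.
nodes_needing_attention() {
    awk '$2 != "Ready" { print $1 }'
}

# Example: kubectl get nodes --no-headers | nodes_needing_attention
```

Empty output means every node is both ready and schedulable, which is the state you would expect before starting and after finishing a maintenance run.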
## Endpoint
This project is an Ansible automation, not a network service.
- Primary entry point: `maintenance.yml`
- Invoke with `ansible-playbook -i hosts.yml maintenance.yml` and the tags or limits that fit your scenario
## Production notes
- Process nodes sequentially to preserve availability
- Keep timeouts conservative to match your node boot and image pull times
- Use `check_updates` to avoid unnecessary work when no updates are available
- When using Longhorn, allow time for degraded volumes to become healthy before proceeding
- Keep `k3s_node_maintenance_skip_drain` set appropriately for masters to protect quorum
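The quorum concern in the last point comes down to simple arithmetic. The sketch below assumes the masters run K3s with embedded etcd, where a cluster of N members needs floor(N/2) + 1 of them available; the function names are hypothetical and not part of the role.

```bash
#!/bin/sh
# Quorum size for an etcd cluster of N members: floor(N/2) + 1.
etcd_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can be offline at once while quorum is kept.
etcd_fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Example: etcd_fault_tolerance 3   # three masters
```

With the three masters from the example inventory, quorum is two members, so only one master can be down at a time. That is why masters are handled strictly one after another.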
## Development
```bash
# 1) Clone
git clone https://github.com/sudo-kraken/k3s-cluster-maintenance.git
cd k3s-cluster-maintenance
# 2) Install Python deps
pip install -r requirements.txt
# 3) Install Ansible collections
ansible-galaxy collection install kubernetes.core
# or from the file if present
ansible-galaxy collection install -r collections/requirements.yml
# 4) Configure inventory
cp hosts.yml.example hosts.yml
# edit hosts.yml with your cluster details
# 5) Test connectivity
ansible all -i hosts.yml -m ping
```
## Troubleshooting
- List installed packages on every node (to see pending updates, run the playbook with `--tags check_updates`)
```bash
ansible all -i hosts.yml -m package_facts
```
- Check cluster health
```bash
kubectl get nodes
kubectl get pods --all-namespaces
```
- Verify Longhorn status if applicable
```bash
kubectl get pods -n longhorn-system
```
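Pod status alone does not show whether Longhorn volumes have finished rebuilding. As a rough sketch, volume health can be read from the Longhorn volume resources. This assumes the `volumes.longhorn.io` resource and its `.status.robustness` field, which you should confirm against your Longhorn version; `unhealthy_volumes` is a hypothetical helper.

```bash
#!/bin/sh
# Hypothetical helper: read "name<TAB>robustness" lines on stdin and
# print every volume whose robustness is not "healthy".
unhealthy_volumes() {
    awk -F '\t' '$2 != "healthy" { print $1 " (" $2 ")" }'
}

# Example (resource and field names are assumptions, see above):
#   kubectl -n longhorn-system get volumes.longhorn.io \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.robustness}{"\n"}{end}' \
#     | unhealthy_volumes
```

Volumes that are not attached to any workload may report a value other than `healthy`, so treat the output as a prompt for a closer look, not as a hard failure.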
Common issues
- No updates needed
Normal behaviour. The role skips maintenance when no packages need updating.
- Node not ready after maintenance
```bash
kubectl get nodes
kubectl uncordon <node-name>
```
- Ansible connection issues
```bash
ansible all -i hosts.yml -m ping
ssh user@node-ip
```
Debug mode
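```bash
# Verbose output for diagnosing task failures
ansible-playbook -i hosts.yml maintenance.yml -vvv
# List the tags defined in the playbook
ansible-playbook -i hosts.yml maintenance.yml --list-tags
# Dry-run the update detection without changing anything
ansible-playbook -i hosts.yml maintenance.yml --tags check_updates --check
# Resume a single node after a failed run
ansible-playbook -i hosts.yml maintenance.yml --limit node-01 --tags resume
```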
## Licence
This project is licensed under the MIT Licence. See the [LICENSE](LICENSE) file for details.
## Security
If you discover a security issue, please review and follow the guidance in [SECURITY.md](SECURITY.md), or open a private security-focused issue with minimal details and request a secure contact channel.
## Contributing
Feel free to open issues or submit pull requests if you have suggestions or improvements.
See [CONTRIBUTING.md](CONTRIBUTING.md)
## Support
Open an [issue](https://github.com/sudo-kraken/k3s-cluster-maintenance/issues) with as much detail as possible, including your Ansible version, distribution details and relevant playbook output.
## Disclaimer
This tool performs maintenance operations on your Kubernetes cluster. Always:
- Test in a non-production environment first
- Ensure you have recent backups
- Review the role tasks before deployment
- Monitor the process during execution
Use at your own risk. I am not responsible for any damage or data loss.