https://github.com/baizeai/kcover
🧯 Kubernetes coverage for fault awareness and recovery, works for any LLMOps, MLOps, AI workloads.
- Host: GitHub
- URL: https://github.com/baizeai/kcover
- Owner: BaizeAI
- License: apache-2.0
- Created: 2024-07-30T02:36:28.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-12-18T23:56:35.000Z (6 months ago)
- Last Synced: 2025-12-21T23:11:01.635Z (6 months ago)
- Topics: kubeflow, kubernetes, kubernetes-controller, llm, llmops, mlops, nvidia-gpu, pytorchjob, tfjob, xid-error
- Language: Go
- Homepage: https://baizeai.github.io/talks/2024-08-21-kubecon-hk/#/1
- Size: 62.5 KB
- Stars: 33
- Watchers: 1
- Forks: 3
- Open Issues: 15
Metadata Files:
- Readme: README.md
- License: LICENSE
# kcover - Kubernetes Coverage for Fault Awareness and Recovery
Welcome to `kcover`, a Kubernetes solution designed to enhance the reliability and resilience of large-scale AI workloads by providing fault awareness and robust instant recovery mechanisms.
## Features
- **Fault Awareness**: Detect and respond to hardware, network, and software failures dynamically.
- **Instant Recovery**: Quickly restore operations without manual intervention, minimizing downtime and ensuring continuous training and service availability.
- **Scalability**: Designed for large-scale environments, handling complexities of distributed AI workloads.
## Getting Started
### Prerequisites
You need a running Kubernetes cluster (version 1.19 or later) and the Helm CLI on the machine you install from.
### Installation
Install `kcover` using Helm:
```shell
helm repo add baizeai https://baizeai.github.io/charts
helm install kcover baizeai/kcover --namespace kcover-system --create-namespace
```
### Configuration
Configure `kcover` to monitor specific Kubernetes resources by labeling them:
```shell
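# Replace <job-name> with the PyTorchJob to monitor; kubectl label needs a resource name, a selector, or --all.
kubectl label pytorchjobs <job-name> kcover.io/cascading-recovery=true
kubectl label pytorchjobs <job-name> kcover.io/need-recovery=true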
```
`kcover` and `agent` read the current node name from the `NODE_NAME`
environment variable. Helm templates inject this automatically from
`spec.nodeName`. The legacy `FAST_RECOVERY_NODE_NAME` variable is still read in
code for backward compatibility during migration, but new deployments should use
`NODE_NAME` only.
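The lookup described above can be sketched in Go as follows. This is a minimal illustration, not the project's actual code: the function name `resolveNodeName` is hypothetical, and the precedence (`NODE_NAME` first, then the legacy variable) is an assumption based on the migration note.

```go
package main

import (
	"errors"
	"fmt"
)

// resolveNodeName is a hypothetical helper. It prefers NODE_NAME and falls
// back to the legacy FAST_RECOVERY_NODE_NAME (assumed precedence).
// getenv is injected so the lookup can be exercised without touching the
// real process environment; pass os.Getenv in production code.
func resolveNodeName(getenv func(string) string) (string, error) {
	if name := getenv("NODE_NAME"); name != "" {
		return name, nil
	}
	if name := getenv("FAST_RECOVERY_NODE_NAME"); name != "" {
		return name, nil
	}
	return "", errors.New("neither NODE_NAME nor FAST_RECOVERY_NODE_NAME is set")
}

func main() {
	// Simulate a deployment that still only sets the legacy variable.
	env := map[string]string{"FAST_RECOVERY_NODE_NAME": "node-a"}
	name, err := resolveNodeName(func(key string) string { return env[key] })
	fmt.Println(name, err)
}
```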
## Agent Config
The agent supports loading its runtime configuration from a YAML file mounted
from a ConfigMap. The Helm chart creates a default ConfigMap automatically, and
you can also point the agent to an existing user-managed ConfigMap.
The only runtime flag kept by the agent is `--config`, which points to the
mounted configuration file. Operational settings such as `interval`, `vendor`,
and all `metaX` thresholds are now read from the config file only.
Default chart-managed config:
```yaml
agent:
  config:
    data:
      vendor: 1
      interval: 5
      metaX:
        hcaIDs:
          - mlx5_0
          - mlx5_1
        day2CheckTime: "10:00"
        gpuNum: 8
        temperature: 85
        eccMaxCount: 64
        ntpMaxOffsetMillis: 10
```
The default vendor is Nvidia (`vendor: 1`). To switch the agent to MetaX,
set `agent.config.data.vendor` to `2`. MetaX-specific day2 checks and
preflight report collection are enabled automatically for the MetaX vendor.
Install with MetaX enabled:
```shell
helm install kcover baizeai/kcover \
  --namespace kcover-system \
  --create-namespace \
  --set agent.config.data.vendor=2
```
Switch an existing release to MetaX:
```shell
helm upgrade kcover baizeai/kcover \
  --namespace kcover-system \
  --reuse-values \
  --set agent.config.data.vendor=2
```
If your MetaX nodes require HCA checks, set the HCA IDs as chart values too:
```shell
helm upgrade kcover baizeai/kcover \
  --namespace kcover-system \
  --reuse-values \
  --set agent.config.data.vendor=2 \
  --set-json 'agent.config.data.metaX.hcaIDs=["mlx5_0","mlx5_1"]'
```
If `metaX.hcaIDs` is set, the agent runs `ibv_devinfo` and requires every
listed `hca_id` to have `state: PORT_ACTIVE (...)`.
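To illustrate what this check involves, here is a minimal Go sketch that scans `ibv_devinfo` text output. It is not the agent's actual implementation: the function name `inactiveHCAs` is hypothetical, and treating an HCA as healthy only when every port it reports is `PORT_ACTIVE` is an assumption about how multi-port devices are handled.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// inactiveHCAs is a hypothetical helper. It returns the wanted HCA IDs that
// are missing from the ibv_devinfo output or that report at least one port
// whose state is not PORT_ACTIVE (assumed policy for multi-port devices).
func inactiveHCAs(devinfo string, wanted []string) []string {
	active := map[string]bool{} // hca_id -> every port seen so far is active
	current := ""
	sc := bufio.NewScanner(strings.NewReader(devinfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue
		}
		switch fields[0] {
		case "hca_id:":
			current = fields[1]
		case "state:":
			if current == "" {
				continue
			}
			ok := fields[1] == "PORT_ACTIVE"
			if prev, seen := active[current]; seen {
				ok = ok && prev
			}
			active[current] = ok
		}
	}
	var bad []string
	for _, id := range wanted {
		if !active[id] { // false for both missing and inactive devices
			bad = append(bad, id)
		}
	}
	return bad
}

func main() {
	// Abbreviated sample in the shape ibv_devinfo prints.
	sample := "hca_id:\tmlx5_0\n\t\tport:\t1\n\t\t\tstate:\t\t\tPORT_ACTIVE (4)\n" +
		"hca_id:\tmlx5_1\n\t\tport:\t1\n\t\t\tstate:\t\t\tPORT_DOWN (1)\n"
	fmt.Println(inactiveHCAs(sample, []string{"mlx5_0", "mlx5_1"}))
}
```

An empty result means every configured HCA passed the check.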
Use a user-defined ConfigMap:
```yaml
agent:
  config:
    existingConfigMap: my-agent-config
    path: /etc/kcover-agent/config.yaml
```
## Usage
Once installed, `kcover` will automatically monitor the labeled resources for any signs of failures and perform recovery actions as specified in the configuration.
## Preflight Slow Node Detection
- The collector expects one preflight report per node.
- `workload_size` is required in the report so the manager can determine the
expected report count and batch count.
- Each report must contain exactly `min(workload_size - 1, 5)` logical batch
slots, although fail-fast nodes may skip pairwise batch parsing entirely.
- For the common 16-node topology, this usually means 16 reports and 15
possible pairings, but the current manager-side aggregation only consumes up
to 5 batches per report.
- Nodes that fail `gpu_check` or `storage_check` are marked abnormal directly
and excluded from pairwise slow-node intersection.
- Pairwise slow-node detection marks a node as slow only when its node IP
appears in failed observations across every effective batch considered by the
aggregation logic.
- Agent-side node events carry a compacted preflight payload rather than the
raw host report. The compacted payload keeps only manager-required fields:
report identity plus per-batch `batch_idx`, `pair`, `self_ip`, `status`, and
performance fields needed for bus-bandwidth threshold evaluation.
- Incomplete report collections no longer wait forever. The controller expires
stale job aggregations after the controller flag
`--preflight-report-collection-timeout` and emits a warning event describing
how many reports were received.
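The batch-count rule and the pairwise intersection above can be sketched in Go as follows. This is an illustration under stated assumptions, not the manager's real code: the names `expectedBatchSlots` and `slowNodes` are hypothetical, the floor at zero for tiny workloads is an assumption, and the input is assumed to already exclude nodes that failed `gpu_check` or `storage_check`.

```go
package main

import (
	"fmt"
	"sort"
)

// maxBatches is the per-report cap stated in the text.
const maxBatches = 5

// expectedBatchSlots returns min(workloadSize-1, 5). Flooring at zero for
// workloads of one node or fewer is an assumption.
func expectedBatchSlots(workloadSize int) int {
	n := workloadSize - 1
	if n < 0 {
		n = 0
	}
	if n > maxBatches {
		n = maxBatches
	}
	return n
}

// slowNodes returns the node IPs that appear among the failed observations
// of every batch. failedByBatch[i] lists the IPs seen in failed pairs of
// batch i; only effective batches are expected in the input.
func slowNodes(failedByBatch [][]string) []string {
	if len(failedByBatch) == 0 {
		return nil
	}
	counts := map[string]int{}
	for _, batch := range failedByBatch {
		seen := map[string]bool{} // count each IP at most once per batch
		for _, ip := range batch {
			if !seen[ip] {
				seen[ip] = true
				counts[ip]++
			}
		}
	}
	var slow []string
	for ip, c := range counts {
		if c == len(failedByBatch) {
			slow = append(slow, ip)
		}
	}
	sort.Strings(slow) // stable output order
	return slow
}

func main() {
	fmt.Println(expectedBatchSlots(16))
	failed := [][]string{
		{"10.0.0.3", "10.0.0.7"},
		{"10.0.0.3", "10.0.0.9"},
		{"10.0.0.3", "10.0.0.7"},
	}
	fmt.Println(slowNodes(failed))
}
```

In the sample, `10.0.0.3` fails with a different partner in every batch and is the only IP in the intersection, while its partners each miss at least one batch and are not marked slow.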
Supported compacted report threshold field:
```yaml
node_check_busbw_threshold_gbps: "5"
```
Controller timeout example:
```yaml
controller:
  args:
    - --preflight-report-collection-timeout=30m
```
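The expiry behavior behind this flag might look roughly like the Go sketch below. Everything here is illustrative: the `aggregation` type and `expireStale` function are hypothetical names, and measuring the timeout from the first received report is an assumption, since the text does not say which timestamp the controller uses.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// aggregation is a hypothetical record of the reports collected for one job.
type aggregation struct {
	firstSeen time.Time // when the first report arrived (assumed reference point)
	received  int
	expected  int
}

// expireStale drops incomplete aggregations older than timeout and returns
// one warning message per expired job, sorted for stable output.
func expireStale(jobs map[string]*aggregation, now time.Time, timeout time.Duration) []string {
	var warnings []string
	for name, agg := range jobs {
		if agg.received < agg.expected && now.Sub(agg.firstSeen) >= timeout {
			warnings = append(warnings, fmt.Sprintf(
				"job %s: preflight report collection timed out after %s, received %d of %d reports",
				name, timeout, agg.received, agg.expected))
			delete(jobs, name) // deleting during range is safe in Go
		}
	}
	sort.Strings(warnings)
	return warnings
}

func main() {
	timeout, _ := time.ParseDuration("30m")
	start := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	jobs := map[string]*aggregation{
		"job-a": {firstSeen: start, received: 14, expected: 16},
		"job-b": {firstSeen: start.Add(20 * time.Minute), received: 3, expected: 16},
	}
	// At start+31m only job-a has exceeded the 30m timeout.
	for _, w := range expireStale(jobs, start.Add(31*time.Minute), timeout) {
		fmt.Println(w)
	}
}
```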
Controller leader election can also be toggled from chart values. Keep it
enabled for multi-replica or HA deployments. Disable it only when you want a
single controller instance to bypass Lease lock acquisition.
```yaml
controller:
  leaderElection:
    enabled: false
```
## Image Build Notes
The MetaX utility `mx-smi` is extracted into a dedicated image so that the
agent image no longer needs to reference the full `maca-pytorch` runtime
directly.
- Extracted image: `ghcr.io/baizeai/mx-smi:v0.2`
- Agent base runtime: `ubuntu:24.04`
- Agent build arg: `MX_SMI_IMAGE=ghcr.io/baizeai/mx-smi:v0.2`
Build and push the extracted `mx-smi` image:
```shell
make image-mx-smi
```
Build and push the agent image with the extracted `mx-smi` image injected:
```shell
make image-agent
```
If you need to build manually, use:
```shell
docker build -f docker/mx-smi.Dockerfile -t ghcr.io/baizeai/mx-smi:v0.2 .
docker build -f docker/agent.Dockerfile --build-arg MX_SMI_IMAGE=ghcr.io/baizeai/mx-smi:v0.2 -t ghcr.io/baizeai/kcover-agent:<tag> .
```