https://github.com/baizeai/kcover
🧯 Kubernetes coverage for fault awareness and recovery, works for any LLMOps, MLOps, AI workloads.
https://github.com/baizeai/kcover
kubeflow kubernetes kubernetes-controller llm llmops mlops nvidia-gpu pytorchjob tfjob xid-error
Last synced: about 2 months ago
JSON representation
🧯 Kubernetes coverage for fault awareness and recovery, works for any LLMOps, MLOps, AI workloads.
- Host: GitHub
- URL: https://github.com/baizeai/kcover
- Owner: BaizeAI
- License: apache-2.0
- Created: 2024-07-30T02:36:28.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-12-18T23:56:35.000Z (3 months ago)
- Last Synced: 2025-12-21T23:11:01.635Z (3 months ago)
- Topics: kubeflow, kubernetes, kubernetes-controller, llm, llmops, mlops, nvidia-gpu, pytorchjob, tfjob, xid-error
- Language: Go
- Homepage: https://baizeai.github.io/talks/2024-08-21-kubecon-hk/#/1
- Size: 62.5 KB
- Stars: 33
- Watchers: 1
- Forks: 3
- Open Issues: 15
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# kcover - Kubernetes Coverage for Fault Awareness and Recovery
Welcome to `kcover`, a Kubernetes solution designed to enhance the reliability and resilience of large-scale AI workloads by providing fault awareness and robust instant recovery mechanisms.
## Features
- **Fault Awareness**: Detect and respond to hardware, network, and software failures dynamically.
- **Instant Recovery**: Quickly restore operations without manual intervention, minimizing downtime and ensuring continuous training and service availability.
- **Scalability**: Designed for large-scale environments, handling complexities of distributed AI workloads.
## Getting Started
### Prerequisites
Ensure you have Kubernetes and Helm installed on your cluster. `kcover` is compatible with Kubernetes versions 1.19 and above.
### Installation
Install `kcover` using Helm:
```shell
helm repo add baizeai https://baizeai.github.io/charts
helm install kcover baizeai/kcover --namespace kcover-system --create-namespace
```
### Configuration
Configure `kcover` to monitor specific Kubernetes resources by labeling them:
```shell
kubectl label pytorchjobs kcover.io/cascading-recovery=true
kubectl label pytorchjobs kcover.io/need-recovery=true
```
## Usage
Once installed, `kcover` will automatically monitor the labeled resources for any signs of failures and perform recovery actions as specified in the configuration.