https://github.com/baizeai/kcover

🧯 Kubernetes coverage for fault awareness and recovery, works for any LLMOps, MLOps, AI workloads.
https://github.com/baizeai/kcover

kubeflow kubernetes kubernetes-controller llm llmops mlops nvidia-gpu pytorchjob tfjob xid-error

Last synced: about 2 months ago
JSON representation

🧯 Kubernetes coverage for fault awareness and recovery, works for any LLMOps, MLOps, AI workloads.

Host: GitHub
URL: https://github.com/baizeai/kcover
Owner: BaizeAI
License: apache-2.0
Created: 2024-07-30T02:36:28.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-12-18T23:56:35.000Z (3 months ago)
Last Synced: 2025-12-21T23:11:01.635Z (3 months ago)
Topics: kubeflow, kubernetes, kubernetes-controller, llm, llmops, mlops, nvidia-gpu, pytorchjob, tfjob, xid-error
Language: Go
Homepage: https://baizeai.github.io/talks/2024-08-21-kubecon-hk/#/1
Size: 62.5 KB
Stars: 33
Watchers: 1
Forks: 3
Open Issues: 15
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# kcover - Kubernetes Coverage for Fault Awareness and Recovery

Welcome to `kcover`, a Kubernetes solution designed to enhance the reliability and resilience of large-scale AI workloads by providing fault awareness and robust instant recovery mechanisms.

## Features

- **Fault Awareness**: Detect and respond to hardware, network, and software failures dynamically.
- **Instant Recovery**: Quickly restore operations without manual intervention, minimizing downtime and ensuring continuous training and service availability.
- **Scalability**: Designed for large-scale environments, handling complexities of distributed AI workloads.

## Getting Started

### Prerequisites

Ensure you have Kubernetes and Helm installed on your cluster. `kcover` is compatible with Kubernetes versions 1.19 and above.

### Installation

Install `kcover` using Helm:

```shell
helm repo add baizeai https://baizeai.github.io/charts
helm install kcover baizeai/kcover --namespace kcover-system --create-namespace
```

### Configuration

Configure `kcover` to monitor specific Kubernetes resources by labeling them:

```shell
kubectl label pytorchjobs kcover.io/cascading-recovery=true
kubectl label pytorchjobs kcover.io/need-recovery=true
```

## Usage

Once installed, `kcover` will automatically monitor the labeled resources for any signs of failures and perform recovery actions as specified in the configuration.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/baizeai/kcover

Awesome Lists containing this project

README