{"id":36443918,"url":"https://github.com/baizeai/kcover","last_synced_at":"2026-05-28T04:01:12.183Z","repository":{"id":253390014,"uuid":"835522481","full_name":"BaizeAI/kcover","owner":"BaizeAI","description":"🧯 Kubernetes coverage for fault awareness and recovery, works for any LLMOps, MLOps, AI workloads.","archived":false,"fork":false,"pushed_at":"2025-12-18T23:56:35.000Z","size":64,"stargazers_count":33,"open_issues_count":15,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-21T23:11:01.635Z","etag":null,"topics":["kubeflow","kubernetes","kubernetes-controller","llm","llmops","mlops","nvidia-gpu","pytorchjob","tfjob","xid-error"],"latest_commit_sha":null,"homepage":"https://baizeai.github.io/talks/2024-08-21-kubecon-hk/#/1","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BaizeAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-07-30T02:36:28.000Z","updated_at":"2025-09-10T07:36:59.000Z","dependencies_parsed_at":"2024-08-16T12:13:30.236Z","dependency_job_id":"546715c6-0822-4cbc-b597-0dc7ef95fec1","html_url":"https://github.com/BaizeAI/kcover","commit_stats":null,"previous_names":["baizeai/kcover"],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/BaizeAI/kcover","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BaizeAI%2Fkcover","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BaizeAI%2Fkcover/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BaizeAI%2Fkcover/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BaizeAI%2Fkcover/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BaizeAI","download_url":"https://codeload.github.com/BaizeAI/kcover/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BaizeAI%2Fkcover/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28324840,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-11T18:42:50.174Z","status":"ssl_error","status_checked_at":"2026-01-11T18:39:13.842Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["kubeflow","kubernetes","kubernetes-controller","llm","llmops","mlops","nvidia-gpu","pytorchjob","tfjob","xid-error"],"created_at":"2026-01-11T22:02:18.570Z","updated_at":"2026-05-28T04:01:12.177Z","avatar_url":"https://github.com/BaizeAI.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# kcover - Kubernetes Coverage for Fault Awareness and Recovery\n\nWelcome to `kcover`, a Kubernetes solution designed to enhance the reliability and resilience of large-scale AI workloads by providing fault awareness and robust instant recovery mechanisms.\n\n## Features\n\n- **Fault Awareness**: Detect and respond to hardware, network, and software failures dynamically.\n- **Instant Recovery**: Quickly restore operations without manual intervention, minimizing downtime and ensuring continuous training and service availability.\n- **Scalability**: Designed for large-scale environments, handling complexities of distributed AI workloads.\n\n## Getting Started\n\n### Prerequisites\n\nEnsure you have Kubernetes and Helm installed on your cluster. `kcover` is compatible with Kubernetes versions 1.19 and above.\n\n### Installation\n\nInstall `kcover` using Helm:\n\n```shell\nhelm repo add baizeai https://baizeai.github.io/charts\nhelm install kcover baizeai/kcover --namespace kcover-system --create-namespace\n```\n\n### Configuration\n\nConfigure `kcover` to monitor specific Kubernetes resources by labeling them:\n\n```shell\nkubectl label pytorchjobs \u003cjob-name\u003e kcover.io/cascading-recovery=true\nkubectl label pytorchjobs \u003cjob-name\u003e kcover.io/need-recovery=true\n```\n\n`kcover` and `agent` read the current node name from the `NODE_NAME`\nenvironment variable. Helm templates inject this automatically from\n`spec.nodeName`. The legacy `FAST_RECOVERY_NODE_NAME` variable is still read in\ncode for backward compatibility during migration, but new deployments should use\n`NODE_NAME` only.\n\n## Agent Config\n\nThe agent supports loading its runtime configuration from a YAML file mounted\nfrom a ConfigMap. The Helm chart creates a default ConfigMap automatically, and\nyou can also point the agent to an existing user-managed ConfigMap.\n\nThe only runtime flag kept by the agent is `--config`, which points to the\nmounted configuration file. Business settings such as `interval`, `vendor`, and\nall `metaX` thresholds are now read from the config file only.\n\nDefault chart-managed config:\n\n```yaml\nagent:\n  config:\n    data:\n      vendor: 1\n      interval: 5\n      metaX:\n        hcaIDs:\n          - mlx5_0\n          - mlx5_1\n        day2CheckTime: \"10:00\"\n        gpuNum: 8\n        temperature: 85\n        eccMaxCount: 64\n        ntpMaxOffsetMillis: 10\n```\n\nThe default vendor is Nvidia (`vendor: 1`). To switch the agent to MetaX,\nset `agent.config.data.vendor` to `2`. MetaX-specific day2 checks and\npreflight report collection are enabled automatically for the MetaX vendor.\n\nInstall with MetaX enabled:\n\n```shell\nhelm install kcover baizeai/kcover \\\n  --namespace kcover-system \\\n  --create-namespace \\\n  --set agent.config.data.vendor=2\n```\n\nSwitch an existing release to MetaX:\n\n```shell\nhelm upgrade kcover baizeai/kcover \\\n  --namespace kcover-system \\\n  --reuse-values \\\n  --set agent.config.data.vendor=2\n```\n\nIf your MetaX nodes require HCA checks, set the HCA IDs as chart values too:\n\n```shell\nhelm upgrade kcover baizeai/kcover \\\n  --namespace kcover-system \\\n  --reuse-values \\\n  --set agent.config.data.vendor=2 \\\n  --set-json 'agent.config.data.metaX.hcaIDs=[\"mlx5_0\",\"mlx5_1\"]'\n```\n\nIf `metaX.hcaIDs` is set, the agent runs `ibv_devinfo` and requires every\nlisted `hca_id` to have `state: PORT_ACTIVE (...)`.\n\nUse a user-defined ConfigMap:\n\n```yaml\nagent:\n  config:\n    existingConfigMap: my-agent-config\n    path: /etc/kcover-agent/config.yaml\n```\n\n## Usage\n\nOnce installed, `kcover` will automatically monitor the labeled resources for any signs of failures and perform recovery actions as specified in the configuration.\n\n## Preflight Slow Node Detection\n\n- The collector expects one preflight report per node.\n- `workload_size` is required in the report so the manager can determine the\n  expected report count and batch count.\n- Each report must contain exactly `min(workload_size - 1, 5)` logical batch\n  slots, although fail-fast nodes may skip pairwise batch parsing entirely.\n- For the common 16-node topology, this usually means 16 reports and 15\n  possible pairings, but the current manager-side aggregation only consumes up\n  to 5 batches per report.\n- Nodes that fail `gpu_check` or `storage_check` are marked abnormal directly\n  and excluded from pairwise slow-node intersection.\n- Pairwise slow-node detection marks a node as slow only when its node IP\n  appears in failed observations across every effective batch considered by the\n  aggregation logic.\n- Agent-side node events carry a compacted preflight payload rather than the\n  raw host report. The compacted payload keeps only manager-required fields:\n  report identity plus per-batch `batch_idx`, `pair`, `self_ip`, `status`, and\n  performance fields needed for bus-bandwidth threshold evaluation.\n- Incomplete report collections no longer wait forever. The controller expires\n  stale job aggregations after the controller flag\n  `--preflight-report-collection-timeout` and emits a warning event describing\n  how many reports were received.\n\nSupported compacted report threshold field:\n\n```yaml\nnode_check_busbw_threshold_gbps: \"5\"\n```\n\nController timeout example:\n\n```yaml\ncontroller:\n  args:\n    - --preflight-report-collection-timeout=30m\n```\n\nController leader election can also be toggled from chart values. Keep it\nenabled for multi-replica or HA deployments. Disable it only when you want a\nsingle controller instance to bypass Lease lock acquisition.\n\n```yaml\ncontroller:\n  leaderElection:\n    enabled: false\n```\n\n## Image Build Notes\n\nThe MetaX utility `mx-smi` is extracted into a dedicated image so that the\nagent image no longer needs to reference the full `maca-pytorch` runtime\ndirectly.\n\n- Extracted image: `ghcr.io/baizeai/mx-smi:v0.2`\n- Agent base runtime: `ubuntu:24.04`\n- Agent build arg: `MX_SMI_IMAGE=ghcr.io/baizeai/mx-smi:v0.2`\n\nBuild and push the extracted `mx-smi` image:\n\n```shell\nmake image-mx-smi\n```\n\nBuild and push the agent image with the extracted `mx-smi` image injected:\n\n```shell\nmake image-agent\n```\n\nIf you need to build manually, use:\n\n```shell\ndocker build -f docker/mx-smi.Dockerfile -t ghcr.io/baizeai/mx-smi:v0.2 .\ndocker build -f docker/agent.Dockerfile --build-arg MX_SMI_IMAGE=ghcr.io/baizeai/mx-smi:v0.2 -t ghcr.io/baizeai/kcover-agent:\u003ctag\u003e .\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaizeai%2Fkcover","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbaizeai%2Fkcover","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaizeai%2Fkcover/lists"}