{"id":24626782,"url":"https://github.com/withlin/k8s-ai-infra","last_synced_at":"2026-04-13T11:02:21.451Z","repository":{"id":274117638,"uuid":"921968716","full_name":"withlin/k8s-ai-infra","owner":"withlin","description":null,"archived":false,"fork":false,"pushed_at":"2025-01-25T01:36:54.000Z","size":21,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-20T00:19:58.853Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/withlin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-25T01:15:32.000Z","updated_at":"2025-02-27T07:38:25.000Z","dependencies_parsed_at":"2025-01-25T02:21:44.124Z","dependency_job_id":"5a177de6-d3a6-4419-ac65-62122b799a9c","html_url":"https://github.com/withlin/k8s-ai-infra","commit_stats":null,"previous_names":["withlin/k8s-ai-infra"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/withlin/k8s-ai-infra","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/withlin%2Fk8s-ai-infra","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/withlin%2Fk8s-ai-infra/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/withlin%2Fk8s-ai-infra/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/withlin%2Fk8s-ai-infra/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/withlin","download_url":"https://codeload.github.com/withlin/k8s-ai-infra/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/withlin%2Fk8s-ai-infra/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31749765,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-13T09:16:15.125Z","status":"ssl_error","status_checked_at":"2026-04-13T09:16:05.023Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-25T04:49:51.538Z","updated_at":"2026-04-13T11:02:21.434Z","avatar_url":"https://github.com/withlin.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# K8s AI Infrastructure\n\n\u003cdiv align=\"center\"\u003e\n\n![Kubernetes Version](https://img.shields.io/badge/Kubernetes-1.20+-blue?logo=kubernetes)\n![NVIDIA GPU](https://img.shields.io/badge/GPU-A100%2FA800-green?logo=nvidia)\n![InfiniBand](https://img.shields.io/badge/Network-InfiniBand-orange?logo=nvidia)\n![License](https://img.shields.io/badge/License-Apache%202.0-blue)\n\n[English](README.md) | [中文文档](README_CN.md)\n\n\u003c/div\u003e\n\nHigh-performance AI training infrastructure deployment solution for Kubernetes clusters, optimized for NVIDIA A100/A800 GPU clusters with InfiniBand networking.\n\n## ✨ Features\n\n- 🚀 **High Performance**: Optimized for NVIDIA A100/A800 GPU clusters\n- 🌐 **Advanced Networking**: InfiniBand support with RDMA\n- 📊 **Comprehensive Monitoring**: GPU and network metrics tracking\n- 🔄 **Automated Deployment**: Streamlined setup process\n- 🛡️ **Production Ready**: Enterprise-grade security and stability\n\n## 🏗️ System Architecture\n\n```mermaid\ngraph TB\n    subgraph \"Physical Network\"\n        B[Bond4]\n        IB[InfiniBand Network]\n        lan0[LAN0] --\u003e B\n        lan1[LAN1] --\u003e B\n        lan2[LAN2] --\u003e IB\n        lan3[LAN3] --\u003e IB\n        lan4[LAN4] --\u003e IB\n        lan5[LAN5] --\u003e IB\n    end\n\n    subgraph \"Network Control Plane\"\n        NO[NVIDIA Network Operator]\n        VPC[VPC CNI]\n        MC[Multus CNI]\n        SRIOV[SR-IOV Device Plugin]\n        RDMA[RDMA Device Plugin]\n        \n        NO --\u003e VPC\n        NO --\u003e MC\n        NO --\u003e SRIOV\n        NO --\u003e RDMA\n    end\n\n    subgraph \"Pod Networking\"\n        P1[AI Training Pod]\n        eth0[eth0]\n        rdma[RDMA Interface]\n        \n        P1 --\u003e eth0\n        P1 --\u003e rdma\n        eth0 --\u003e B\n        rdma --\u003e IB\n    end\n\n    subgraph \"Monitoring System\"\n        PM[Prometheus]\n        GF[Grafana]\n        PM --\u003e GF\n    end\n```\n\n## 🚀 Quick Start\n\n### Prerequisites\n\n- Kubernetes 1.20+\n- NVIDIA A100/A800 GPUs\n- Mellanox InfiniBand NICs\n- Helm 3.0+\n\n### Installation\n\n1. Configure network environment:\n```bash\n./scripts/setup-network.sh\n```\n\n2. Deploy NVIDIA Network Operator:\n```bash\n./scripts/deploy-network-operator.sh\n```\n\n3. Verify deployment:\n```bash\n./scripts/test-network.sh\n```\n\n## 📚 Documentation\n\n- [Network Architecture](docs/network-architecture.md)\n- [Ray Cluster Setup](docs/ray-cluster.md)\n- [Monitoring Guide](docs/monitoring.md)\n- [Performance Tuning](docs/performance-tuning.md)\n\n## 🛠️ Components\n\n### Network Infrastructure\n- Bond4 configuration for management traffic\n- InfiniBand network for high-speed data transfer\n- RDMA support for direct memory access\n- SR-IOV for network virtualization\n\n### Monitoring Stack\n- Prometheus for metrics collection\n- Grafana for visualization\n- Custom exporters for GPU and network metrics\n- Comprehensive alerting rules\n\n### Ray Integration\n- Distributed training support\n- GPU-aware scheduling\n- NCCL optimization\n- Topology-aware placement\n\n## 📊 Performance\n\n- NVLink: Up to 600 GB/s bidirectional bandwidth\n- InfiniBand: Up to 200 Gb/s network speed\n- RDMA: Ultra-low latency communication\n- GPUDirect: Optimized GPU-to-GPU transfer\n\n## 🤝 Contributing\n\nContributions are welcome! Please read our [Contributing Guidelines](CONTRIBUTING.md) for details.\n\n## 📝 License\n\nThis project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details. ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwithlin%2Fk8s-ai-infra","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwithlin%2Fk8s-ai-infra","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwithlin%2Fk8s-ai-infra/lists"}