https://github.com/withlin/k8s-ai-infra


Host: GitHub
URL: https://github.com/withlin/k8s-ai-infra
Owner: withlin
Created: 2025-01-25T01:15:32.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-01-25T01:36:54.000Z (over 1 year ago)
Last Synced: 2025-03-20T00:19:58.853Z (about 1 year ago)
Language: Shell
Size: 20.5 KB
Stars: 1
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

# K8s AI Infrastructure



![Kubernetes Version](https://img.shields.io/badge/Kubernetes-1.20+-blue?logo=kubernetes)

![NVIDIA GPU](https://img.shields.io/badge/GPU-A100%2FA800-green?logo=nvidia)

![InfiniBand](https://img.shields.io/badge/Network-InfiniBand-orange?logo=nvidia)

![License](https://img.shields.io/badge/License-Apache%202.0-blue)

[English](README.md) | [中文文档](README_CN.md)



This repository deploys high-performance AI training infrastructure on Kubernetes. It targets NVIDIA A100/A800 GPU clusters connected with InfiniBand networking.

## ✨ Features

- 🚀 **High Performance**: Optimized for NVIDIA A100/A800 GPU clusters

- 🌐 **Advanced Networking**: InfiniBand support with RDMA

- 📊 **Comprehensive Monitoring**: GPU and network metrics tracking

- 🔄 **Automated Deployment**: Streamlined setup process

- 🛡️ **Production Ready**: Enterprise-grade security and stability

## 🏗️ System Architecture

```mermaid
graph TB
    subgraph "Physical Network"
        B[Bond4]
        IB[InfiniBand Network]
        lan0[LAN0] --> B
        lan1[LAN1] --> B
        lan2[LAN2] --> IB
        lan3[LAN3] --> IB
        lan4[LAN4] --> IB
        lan5[LAN5] --> IB
    end
    subgraph "Network Control Plane"
        NO[NVIDIA Network Operator]
        VPC[VPC CNI]
        MC[Multus CNI]
        SRIOV[SR-IOV Device Plugin]
        RDMA[RDMA Device Plugin]

        NO --> VPC
        NO --> MC
        NO --> SRIOV
        NO --> RDMA
    end
    subgraph "Pod Networking"
        P1[AI Training Pod]
        eth0[eth0]
        rdma[RDMA Interface]

        P1 --> eth0
        P1 --> rdma
        eth0 --> B
        rdma --> IB
    end
    subgraph "Monitoring System"
        PM[Prometheus]
        GF[Grafana]
        PM --> GF
    end
```
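
To make the "Pod Networking" part of the diagram more concrete, here is a minimal sketch of how a training pod might request a secondary RDMA interface next to its default `eth0`. It assumes Multus selects the secondary network through the `k8s.v1.cni.cncf.io/networks` annotation. The network name `ib-net`, the resource name `rdma/ib_shared`, and the image are hypothetical placeholders, not values taken from this repository.

```bash
#!/usr/bin/env bash
# Sketch only: prints a pod manifest to stdout; nothing is applied to a cluster.
# The network name, resource name, and image below are hypothetical placeholders.

render_training_pod() {
  local name="$1"
  local network="${2:-ib-net}"                # hypothetical NetworkAttachmentDefinition
  local rdma_resource="${3:-rdma/ib_shared}"  # hypothetical device-plugin resource

  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
  annotations:
    # Multus attaches the secondary (RDMA) network named here.
    k8s.v1.cni.cncf.io/networks: ${network}
spec:
  containers:
    - name: trainer
      image: example.com/trainer:latest  # hypothetical image
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]  # lets the process pin memory for RDMA
      resources:
        limits:
          ${rdma_resource}: 1
EOF
}
```

Piping the output into `kubectl apply -f -` would create the pod, for example `render_training_pod demo | kubectl apply -f -`. Check the names your cluster actually exposes before doing so.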

## 🚀 Quick Start

### Prerequisites

- Kubernetes 1.20+

- NVIDIA A100/A800 GPUs

- Mellanox InfiniBand NICs

- Helm 3.0+
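
The prerequisites above can be checked before installing. The following is a rough sketch, not part of this repository: the function names are made up for illustration, `version_ge` relies on `sort -V` (GNU coreutils), and the hardware checks assume you run them on a GPU node where `nvidia-smi`, `ibstat`, and `lspci` are installed.

```bash
#!/usr/bin/env bash
# Sketch of a prerequisite check; it only reads local state.

# version_ge A B: succeeds when version A >= version B (needs sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# require_cmd NAME: reports whether a command is available on PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1 found"
  else
    echo "missing: $1"
    return 1
  fi
}

# check_prereqs: returns non-zero if any tool or the NIC is missing.
check_prereqs() {
  local failed=0 cmd
  for cmd in kubectl helm nvidia-smi ibstat; do
    require_cmd "$cmd" || failed=1
  done
  # Mellanox NICs are listed by vendor name in lspci output.
  if command -v lspci >/dev/null 2>&1 && ! lspci | grep -qi mellanox; then
    echo "missing: no Mellanox device in lspci output"
    failed=1
  fi
  return "$failed"
}
```

A version check could then look like `version_ge "$(helm version --short | sed 's/^v//; s/+.*//')" 3.0`, which strips the leading `v` and the build suffix before comparing.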

### Installation

1. Configure network environment:

```bash
./scripts/setup-network.sh
```

2. Deploy NVIDIA Network Operator:

```bash
./scripts/deploy-network-operator.sh
```

3. Verify deployment:

```bash
./scripts/test-network.sh
```
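
The three steps are order-dependent, so it can help to run them from one wrapper that stops at the first failure. This is a small sketch under that assumption. `run_steps` is a hypothetical helper, not a script shipped in this repository.

```bash
#!/usr/bin/env bash
# Sketch: run the given commands in order and stop at the first failure.

run_steps() {
  local step
  for step in "$@"; do
    echo "==> ${step}"
    if ! "$step"; then
      echo "step failed: ${step}" >&2
      return 1
    fi
  done
  echo "all steps completed"
}

# Usage, from the repository root:
#   run_steps ./scripts/setup-network.sh \
#             ./scripts/deploy-network-operator.sh \
#             ./scripts/test-network.sh
```

Stopping early avoids deploying the operator on top of a half-configured network, and the non-zero return code makes the wrapper usable in CI.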

## 📚 Documentation

- [Network Architecture](docs/network-architecture.md)

- [Ray Cluster Setup](docs/ray-cluster.md)

- [Monitoring Guide](docs/monitoring.md)

- [Performance Tuning](docs/performance-tuning.md)

## 🛠️ Components

### Network Infrastructure

- Bond4 configuration for management traffic

- InfiniBand network for high-speed data transfer

- RDMA support for direct memory access

- SR-IOV for network virtualization

### Monitoring Stack

- Prometheus for metrics collection

- Grafana for visualization

- Custom exporters for GPU and network metrics

- Comprehensive alerting rules
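
As a quick way to spot-check an exporter without opening Grafana, the sketch below filters Prometheus text-format output for samples above a limit. The metric name `gpu_utilization_percent` and the exporter URL in the usage note are hypothetical, since the metric names of the custom exporters are not listed here. It also assumes samples carry no trailing timestamp.

```bash
#!/usr/bin/env bash
# Sketch: print samples of one metric whose value exceeds a limit.
# Reads Prometheus text exposition format on stdin.
# Assumes each sample line ends with the value (no timestamp field).

# over_threshold METRIC LIMIT
over_threshold() {
  awk -v metric="$1" -v limit="$2" '
    /^#/ { next }                 # skip HELP and TYPE comment lines
    {
      name = $1
      sub(/[{].*/, "", name)      # drop the label set, keep the metric name
      if (name == metric && $NF + 0 > limit + 0) print $0
    }
  '
}
```

For example, `curl -s http://exporter.example:9400/metrics | over_threshold gpu_utilization_percent 90` would list the GPUs currently above 90.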

### Ray Integration

- Distributed training support

- GPU-aware scheduling

- NCCL optimization

- Topology-aware placement
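
NCCL is tuned mostly through environment variables. The sketch below shows a plausible starting point for a pod that has `eth0` plus an InfiniBand interface, as in the architecture diagram. The variables are standard NCCL settings, but the `mlx5` device prefix and the helper name `nccl_ib_env` are assumptions. Confirm device names with `ibstat` on your nodes.

```bash
#!/usr/bin/env bash
# Sketch: NCCL settings for a pod with eth0 plus an InfiniBand interface.
# Device names are assumptions; verify them on your own nodes.

nccl_ib_env() {
  local hca_prefix="${1:-mlx5}"    # assumed HCA name prefix for Mellanox NICs
  local bootstrap_if="${2:-eth0}"  # interface used for NCCL bootstrap traffic

  export NCCL_IB_DISABLE=0                   # keep the InfiniBand transport enabled
  export NCCL_IB_HCA="$hca_prefix"           # restrict NCCL to matching HCAs
  export NCCL_SOCKET_IFNAME="$bootstrap_if"  # socket interface for setup traffic
  export NCCL_DEBUG=INFO                     # log which transport NCCL selects
}
```

Call `nccl_ib_env` in the container entrypoint before starting the training process. With `NCCL_DEBUG=INFO`, the startup log should show whether the InfiniBand transport was actually picked.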

## 📊 Performance

- NVLink: Up to 600 GB/s bidirectional bandwidth per GPU on A100 (400 GB/s on A800)

- InfiniBand: Up to 200 Gb/s network speed

- RDMA: Ultra-low latency communication

- GPUDirect: Optimized GPU-to-GPU transfer
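
Note that the NVLink figure is in gigabytes per second while the InfiniBand figure is in gigabits per second. The sketch below converts between the two and estimates an ideal transfer time. The 10 GB payload is a made-up example, and the estimate ignores protocol overhead.

```bash
#!/usr/bin/env bash
# Sketch: convert link speed to bytes and estimate an ideal transfer time.

# gbit_to_gbyte GBITS: Gb/s -> GB/s (8 bits per byte).
gbit_to_gbyte() {
  awk -v g="$1" 'BEGIN { printf "%g\n", g / 8 }'
}

# transfer_seconds SIZE_GB LINK_GBITS: ideal time, ignoring protocol overhead.
transfer_seconds() {
  awk -v size="$1" -v g="$2" 'BEGIN { printf "%.2f\n", size / (g / 8) }'
}

gbit_to_gbyte 200        # prints 25
transfer_seconds 10 200  # prints 0.40
```

So a single 200 Gb/s link moves at most 25 GB/s, well below the intra-node NVLink bandwidth, which is why cross-node traffic is usually the part worth tuning.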

## 🤝 Contributing

Contributions are welcome! Please read our [Contributing Guidelines](CONTRIBUTING.md) for details.

## 📝 License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
