https://github.com/nvidia/nvidia-terraform-modules
Infrastructure as code for GPU accelerated managed Kubernetes clusters.
- Host: GitHub
- URL: https://github.com/nvidia/nvidia-terraform-modules
- Owner: NVIDIA
- License: apache-2.0
- Created: 2023-07-06T13:26:55.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-03T22:51:31.000Z (24 days ago)
- Last Synced: 2024-12-26T22:15:35.974Z (about 22 hours ago)
- Topics: gpu, kubernetes, nvidia, terraform
- Language: HCL
- Homepage: https://nvidia.com
- Size: 122 KB
- Stars: 49
- Watchers: 6
- Forks: 18
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# NVIDIA Terraform Kubernetes Modules
## Objective
NVIDIA Terraform Modules is a reference architecture for creating managed Kubernetes clusters on cloud service providers (CSPs) with the NVIDIA GPU Operator and the NVIDIA NIM Operator. All of the components listed below have been tested together successfully.
## Life Cycle
When a new version of NVIDIA Terraform Modules is released, the previous release enters maintenance support and receives only patch release updates. All earlier releases enter end-of-life (EOL); they are no longer supported and do not receive patch updates.
| Release | Status |
| :-----: | :--------------:|
| [24.11.0](https://github.com/NVIDIA/cloud-native-stack/releases/tag/v24.11.0) | Generally Available |
| [0.7.0](https://github.com/NVIDIA/nvidia-terraform-modules/releases/tag/0.7.0) | Maintenance |

## Support Matrix
The Kubernetes clusters provisioned by the modules in this repository provide tested and certified versions of Kubernetes, the NVIDIA GPU Operator, and the NVIDIA NIM Operator.
If your application does not require a specific version of Kubernetes, we recommend using the latest available version. We also recommend planning to upgrade your Kubernetes version at least every six months.
Each CSP sets its own end-of-life dates for the Kubernetes versions it supports. For more information, see:
- [Amazon EKS release calendar](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar)
- [Azure AKS release calendar](https://learn.microsoft.com/en-us/azure/aks/supported-kubernetes-versions?tabs=azure-cli#aks-kubernetes-release-calendar)
- [GCP GKE release calendar](https://cloud.google.com/kubernetes-engine/docs/release-schedule#schedule_for_static_no-channel_versions)

### NVIDIA Terraform Modules 24.11.0 Release
| Component | K8s 1.31 | K8s 1.30 | K8s 1.29 |
| :--------- | :-------- | :------- | :------- |
| Platforms | Amazon EKS<br>Azure AKS<br>Google GKE | Amazon EKS<br>Azure AKS<br>Google GKE | Amazon EKS<br>Azure AKS<br>Google GKE |
| Supported OS | Ubuntu 22.04 LTS | Ubuntu 22.04 LTS | Ubuntu 22.04 LTS |
| containerd | EKS: 1.7.12<br>AKS: 1.7.23-1<br>GKE: 1.7.22 | EKS: 1.7.12<br>AKS: 1.7.23-1<br>GKE: 1.7.22 | EKS: 1.7.12<br>AKS: 1.7.23-1<br>GKE: 1.7.22 |
| NVIDIA Container Toolkit | 1.17.0 | 1.17.0 | 1.17.0 |
| CNI | CSP dependent | CSP dependent | CSP dependent |
| NVIDIA GPU Operator | 24.9.0 | 24.9.0 | 24.9.0 |
| NVIDIA Data Center Driver | 550.127.05 | 550.127.05 | 550.127.05 |
| NVIDIA NIM Operator | 1.0.0 | 1.0.0 | 1.0.0 |
| Helm | 3.16.2 | 3.16.2 | 3.16.2 |

## Getting Started
This repository provides infrastructure as code for GPU-accelerated managed Kubernetes clusters. The modules automate the deployment of GPU-enabled Kubernetes clusters on Amazon EKS, Azure AKS, and Google GKE.
Terraform is an infrastructure as code tool that these modules use to automate the deployment of Kubernetes clusters with the add-ons required to enable NVIDIA GPUs. This repository contains Terraform [modules](https://developer.hashicorp.com/terraform/tutorials/modules/module), which are sets of Terraform configuration files ready for deployment. The modules can be incorporated into existing Terraform-managed infrastructure or used to set up new infrastructure from scratch. To learn more about Terraform, see the HashiCorp [introduction to infrastructure as code](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/infrastructure-as-code).
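As a sketch of consuming one of these modules from an existing configuration, the block below references the `eks` subdirectory of this repository through Terraform's generic Git module source syntax. The module label `gpu_eks_cluster` is hypothetical, pinning `ref` to the `0.7.0` tag is an assumption based on the release listed in the Life Cycle table, and the module's input variables are deliberately left out; take the required inputs from the EKS module README.

```hcl
# Hypothetical root-module configuration that consumes the EKS module
# from this repository. The "//eks" suffix selects the subdirectory and
# "?ref=" pins a Git tag (0.7.0 is assumed here; use the release you need).
module "gpu_eks_cluster" {
  source = "git::https://github.com/NVIDIA/nvidia-terraform-modules.git//eks?ref=0.7.0"

  # Set the module's required input variables here.
  # Their names and defaults are documented in ./eks/README.md.
}
```

Running `terraform init` downloads the module source into `.terraform/modules`, after which `terraform plan` and `terraform apply` work as usual. The same pattern should apply to the `aks` and `gke` subdirectories.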
Download the Terraform CLI from the [Terraform downloads page](https://developer.hashicorp.com/terraform/downloads).
### Usage
Clone the repository:

```shell
# Clone the modules and change into the repository root
git clone https://github.com/NVIDIA/nvidia-terraform-modules.git
cd nvidia-terraform-modules
```

#### Provision a GPU-Enabled Kubernetes Cluster
- Create an [EKS Cluster](./eks/README.md)
- Create an [AKS Cluster](./aks/README.md)
- Create a [GKE Cluster](./gke/README.md)

### State Management
These modules do not configure state management for the Terraform state file they generate. Deleting the state file (`terraform.tfstate`) removes Terraform's record of the cloud resources it created, which could leave those resources to be deleted manually. We strongly encourage you to [configure remote state](https://developer.hashicorp.com/terraform/language/state/remote). See the [Terraform documentation](https://developer.hashicorp.com/terraform/language/state) for more information.
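A minimal sketch of remote state for the EKS case, assuming an S3 bucket already exists: the bucket name, key, and region below are placeholders, not values from this repository. AKS and GKE users would typically use the `azurerm` or `gcs` backend instead, which take different arguments.

```hcl
# Store terraform.tfstate in an existing S3 bucket instead of on local disk.
# All three values are placeholders; backend blocks cannot reference variables.
terraform {
  backend "s3" {
    bucket = "my-terraform-state-bucket" # hypothetical, must already exist
    key    = "nvidia-gpu-eks/terraform.tfstate"
    region = "us-west-2"
  }
}
```

After adding or changing a backend block, run `terraform init` again; if local state already exists, Terraform can copy it to the new backend (for example with `terraform init -migrate-state`).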
## Contributing
Pull requests are welcome! Please see our [contribution guidelines](./CONTRIBUTING.md).
## Getting Help or Providing Feedback
Please open an [issue](https://github.com/NVIDIA/nvidia-terraform-modules/issues) on the GitHub project for any questions. Your feedback is appreciated.
## Useful Links
- [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html)
- [NVIDIA NIM Operator](https://docs.nvidia.com/nim-operator/latest/index.html)
- [NVIDIA GPU Cloud (NGC)](https://catalog.ngc.nvidia.com/)