{"id":36688805,"url":"https://github.com/converged-computing/aks-infiniband-install","last_synced_at":"2026-01-12T11:17:57.872Z","repository":{"id":253333189,"uuid":"842690306","full_name":"converged-computing/aks-infiniband-install","owner":"converged-computing","description":"Example prototype for installing Infiniband on an AKS cluster","archived":false,"fork":false,"pushed_at":"2025-04-20T23:29:06.000Z","size":37,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-05T07:25:16.035Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/converged-computing.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":"COPYRIGHT","agents":null,"dco":null,"cla":null}},"created_at":"2024-08-14T21:37:24.000Z","updated_at":"2025-04-20T23:29:09.000Z","dependencies_parsed_at":"2025-09-05T07:19:37.881Z","dependency_job_id":"eb94fa46-4cf7-4363-affb-213cf8e1c067","html_url":"https://github.com/converged-computing/aks-infiniband-install","commit_stats":null,"previous_names":["converged-computing/aks-infiniband-install"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/converged-computing/aks-infiniband-install","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/converged-computing%2Faks-infiniband-install","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/converged-computing%2Faks-infiniband-install/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/converged-computing%2Faks-infiniband-install/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/converged-computing%2Faks-infiniband-install/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/converged-computing","download_url":"https://codeload.github.com/converged-computing/aks-infiniband-install/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/converged-computing%2Faks-infiniband-install/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28338970,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-12T10:58:46.209Z","status":"ssl_error","status_checked_at":"2026-01-12T10:58:42.742Z","response_time":98,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-12T11:17:55.850Z","updated_at":"2026-01-12T11:17:57.865Z","avatar_url":"https://github.com/converged-computing.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AKS Infiniband Installer\n\n[![DOI](https://zenodo.org/badge/842690306.svg)](https://doi.org/10.5281/zenodo.15253170)\n\nWe are trying to get Infiniband working on AKS, and this small series of steps will help.\nWe are using the build here to install the drivers to the nodes, and then the [Mellanox/k8s-rdma-shared-dev-plugin](https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/tree/master/deployment/k8s) to provide a CNI to enable Infiniband on the pods.\nThe directories are organized by OS and driver version, since it matters. If you need to install to Usernetes on a node already running the driver, jump down to [install usernetes](#install-usernetes).\n\n## 1. Build Image\n\nIn practice, we built only the latest (22.04) drivers and got that to work on older devices by way of customizing flags. The other directories (aside from ubuntu22.04) are provided\nfor example, but we have not used them. \n\n - [ubuntu22.04](ubuntu22.04): will build for `MLNX_OFED_LINUX-24.04-0.7.0.0-ubuntu22.04-x86_64` (Connect-X 4 and 5)\n\nYou'll need to have the driver that matches your node version. Nvidia has disabled allowing wget / curl of the newer files so you'll need to agree to their license agreement and download it [from this page](https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/) and put the iso in the respective folder you want to build from. Note that we follow the instructions [here](https://docs.nvidia.com/networking/display/mlnxofedv461000/installing+mellanox+ofed) to install it with the daemonset. Then update in the Dockerfile:\n\n1. The base image to use (e.g., ubuntu:22.04) \n2. The `COPY` directive to copy the ISO into the directory\n3. The [driver-installation.yaml](driver-installation.yaml) or [driver-installation-with-gpu.yaml](driver-installation-with-gpu.yaml) that references it\n\nBuilding the image:\n\n```bash\ndocker build -t ghcr.io/converged-computing/aks-infiniband-install:ubuntu-22.04 ubuntu22.04\n```\n\n## 2. Cluster Setup\n\nWhen you create your cluster, you need to do the following.\n\n```bash\n# Enable Infiniband for your AKS cluster \naz feature register --name AKSInfinibandSupport --namespace Microsoft.ContainerService\n\n# Check the status\naz feature list -o table --query \"[?contains(name, 'Microsoft.ContainerService/AKSInfinibandSupport')].{Name:name,State:properties.state}\"\n\n# Register when ready\naz provider register --namespace Microsoft.ContainerService\n```\n\nSome additional notes - you need an AKS nodepool with RDMA-capable skus.\n\n## 3. Node Init\n\nNote that if you shell into a node (install `kubectl node-shell`) if you install `ibverbs-utils` and do `ibv_devices` it will be empty. Let's try to install infiniband next, and we will use a container that is also built with ubuntu 22.04 drivers. I was originally looking at [https://github.com/Mellanox/ib-kubernetes](https://github.com/Mellanox/ib-kubernetes) but opted for this approach instead. You can just do but then I switched to the approach we have here. Let's first install the drivers:\n\n```bash\n# Regular without gpu (ubuntu 22.04 full)\nkubectl apply -f ./driver-installation.yaml\n\n# GPU (without doing a host check)\nkubectl apply -f ./driver-installation-with-gpu.yaml\n```\n\nWhen they are done, here is how to check that it was successful - this isn't perfect but it works. Basically we want to see that the ib0 device is up.\n\n```bash\nfor pod in $(kubectl get pods -o json | jq -r .items[].metadata.name)\ndo\n   kubectl exec -it $pod -- nsenter -t 1 -m /usr/sbin/ip link | grep 'ib0:'\ndone\n```\n\nThat should equal the number of nodes. \nApply the daemonset to make it available to pods:\n\n```bash\nkubectl apply -k ./daemonset/\n```\n\nHere is a quick test:\n\n```bash\n# First node\nkubectl node-shell aks-userpool-14173555-vmss000000\nibv_rc_pingpong\n\n# Second node\nkubectl node-shell aks-userpool-14173555-vmss000001 \nibv_rc_pingpong aks-userpool-14173555-vmss000000\n```\n\nYou can get a test environment in [test](test).\nNote that the [ucx perftest](https://github.com/openucx/ucx/tree/master?tab=readme-ov-file#ucx-performance-test) I have found useful.\nWe will add examples with HPC applications (or a link to a repository with them) if requested.\n\n## Install Usernetes\n\nFor Usernetes you should be running on host machines that already have drivers. Given the setup, you should see `/dev/infiniband` already in the container. We don't need to install drivers, but we do (should) install the driver installer for Kubernetes to do this properly.\n\n```bash\nkubectl apply -k ./daemonset-usernetes/\n\n# Check\nkubectl logs -n kube-system rdma-shared-dp-ds-hghzj\n```\n\n## License\n\nHPCIC DevTools is distributed under the terms of the MIT license.\nAll new contributions must be made under this license.\n\nSee [LICENSE](https://github.com/converged-computing/cloud-select/blob/main/LICENSE),\n[COPYRIGHT](https://github.com/converged-computing/cloud-select/blob/main/COPYRIGHT), and\n[NOTICE](https://github.com/converged-computing/cloud-select/blob/main/NOTICE) for details.\n\nSPDX-License-Identifier: (MIT)\n\nLLNL-CODE- 842614\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconverged-computing%2Faks-infiniband-install","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fconverged-computing%2Faks-infiniband-install","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconverged-computing%2Faks-infiniband-install/lists"}