{"id":13937832,"url":"https://github.com/Langhalsdino/Kubernetes-GPU-Guide","last_synced_at":"2025-07-20T00:31:12.035Z","repository":{"id":46704663,"uuid":"93916919","full_name":"Langhalsdino/Kubernetes-GPU-Guide","owner":"Langhalsdino","description":"This guide should help fellow researchers and hobbyists to easily automate and accelerate there deep leaning training with their own Kubernetes GPU cluster.","archived":false,"fork":false,"pushed_at":"2022-10-03T08:53:48.000Z","size":441,"stargazers_count":816,"open_issues_count":2,"forks_count":115,"subscribers_count":40,"default_branch":"master","last_synced_at":"2024-11-27T06:36:44.544Z","etag":null,"topics":["cluster","deep-learning","distributed-systems","gpu","gpu-computing","guide","kubernetes","kubernetes-cluster","kubernetes-gpu-cluster","kubernetes-setup","worker-nodes"],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Langhalsdino.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-06-10T05:01:01.000Z","updated_at":"2024-11-26T18:27:10.000Z","dependencies_parsed_at":"2023-01-19T02:45:46.043Z","dependency_job_id":null,"html_url":"https://github.com/Langhalsdino/Kubernetes-GPU-Guide","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Langhalsdino/Kubernetes-GPU-Guide","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Langhalsdino%2FKubernetes-GPU-Guide","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Langhalsdino%2FKubernetes-GPU-Guide/tags","releases_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/Langhalsdino%2FKubernetes-GPU-Guide/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Langhalsdino%2FKubernetes-GPU-Guide/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Langhalsdino","download_url":"https://codeload.github.com/Langhalsdino/Kubernetes-GPU-Guide/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Langhalsdino%2FKubernetes-GPU-Guide/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266048494,"owners_count":23868738,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cluster","deep-learning","distributed-systems","gpu","gpu-computing","guide","kubernetes","kubernetes-cluster","kubernetes-gpu-cluster","kubernetes-setup","worker-nodes"],"created_at":"2024-08-07T23:03:56.790Z","updated_at":"2025-07-20T00:31:11.379Z","avatar_url":"https://github.com/Langhalsdino.png","language":"Shell","funding_links":[],"categories":["Roadmap","Shell","Featured On"],"sub_categories":["[Deep Learning](#deep-learning)"],"readme":"# How to automate deep learning training with Kubernetes GPU-cluster\n\nThis guide should help fellow researchers and hobbyists to easily automate and accelerate their deep learning training with their own Kubernetes GPU cluster.\u003cbr/\u003e\nTo that end, I will explain how to easily set up a GPU cluster on multiple Ubuntu 16.04 bare metal servers and provide some useful scripts and .yaml files that do the entire setup for you.\n\nBy the way: If you need a Kubernetes GPU-cluster for other 
reasons, this guide might be helpful to you as well.\n\n**Why did I write this guide?**\u003cbr/\u003e\nI worked as an intern for the startup [understand.ai](https://understand.ai) and noticed the hassle of first designing a machine learning algorithm locally and then bringing it to the cloud for training with different parameters and datasets.\u003cbr/\u003e\nThe second part, bringing it to the cloud for extensive training, always takes longer than expected, is frustrating and usually involves a lot of pitfalls.\n\nFor this reason I decided to work on this problem and make the second part effortless, easy and quick.\u003cbr/\u003e\nThe result of this work is this handy guide, which describes how everyone can set up their own Kubernetes GPU cluster to accelerate their work.\n\n**The new process for deep learning researchers:**\u003cbr/\u003e\nAutomated deep learning training with a Kubernetes GPU cluster significantly improves the process of bringing your algorithm to the cloud for training.\n\nThis illustration visualizes the new workflow, which involves only two simple steps:\u003cbr/\u003e\n![My inspiration for the project, designed by Langhalsdino.](resources/description.jpg?raw=true \"My inspiration for the project\")\n\n**Disclaimer**\u003cbr/\u003e\nBe aware that the following sections might be opinionated. Kubernetes is an evolving, fast-paced environment, which means this guide will probably be outdated at times, depending on the author's spare time and individual contributions. 
For this reason, contributions are highly appreciated.\n\n## Table of Contents\n\n  * [Quick Kubernetes revive](#quick-kubernetes-revive)\n  * [Rough overview on the structure of the cluster](#rough-overview-on-the-structure-of-the-cluster)\n  * [Initiate nodes](#initiate-nodes)\n    - [Constraints of my setup](#constraints-of-my-setup)\n    - [Setup instructions](#setup-instructions)\n        - [Use fast setup script](#fast-track---setup-script)\n        - [Manual step-by-step instructions](#detailed-step-by-step-instructions)\n  * [How to build your GPU container](#how-to-build-your-gpu-container)\n  * [Some helpful commands](#some-helpful-commands)\n  * [Common issues](#common-issues)\n  * [Acknowledgements](#acknowledgements)\n  * [License](#license)\n\n## Quick Kubernetes revive\n\n**These articles might be helpful if you need to refresh your Kubernetes knowledge:**\n\n  * [Introduction to Kubernetes by DigitalOcean](https://www.digitalocean.com/community/tutorials/an-introduction-to-kubernetes)\n  * [Kubernetes concepts](https://kubernetes.io/docs/concepts/)\n  * [Kubernetes by example](http://kubernetesbyexample.com/)\n  * [Kubernetes basics - interactive tutorial](https://kubernetes.io/docs/tutorials/kubernetes-basics/)\n\n## Rough overview on the structure of the cluster\nThe main idea is to have a small CPU-only master node that controls a cluster of GPU worker nodes.\n![Rough overview on the structure of the cluster, designed by Langhalsdino](resources/System-overview.jpg?raw=true \"Rough overview\")\n\n## Initiate nodes\nBefore we can use the cluster, we first have to initialize it. 
\u003cbr/\u003e\nTo do so, each node has to be initialized manually and joined to the cluster.\n\n### Constraints of my setup\nThese are the constraints of my setup. In some places they are tighter than necessary, but this is my setup and it worked for me 😒  \n\n**Master**\n\n+ Ubuntu 16.04\n+ SSH access with sudo user\n+ Internet access\n+ ufw deactivated (not recommended, but for ease of use)\n+ Enabled Ports (udp and tcp)\n    - 6443, 443, 8080\n    - 30000-32767 (only if your apps need them)\n    - These will be used to access services from outside of the cluster\n\n**Worker**\n\n+ Ubuntu 16.04\n+ SSH access with sudo user\n+ Internet access\n+ ufw deactivated (not recommended, but for ease of use)\n+ Enabled Ports (udp and tcp)\n    - 6443, 443\n\n### Setup instructions\nThese instructions cover my experience on Ubuntu 16.04 and may or may not transfer to other operating systems.\n\nI have created two scripts that fully initialize the master and worker nodes as described below. If you want to take the fast track, just use them. Otherwise, I recommend reading the step-by-step instructions.\n\n\u003ch4\u003eFast Track - Setup script\u003c/h4\u003e\nOK, let's take the fast track. 
Copy the corresponding scripts to your master and workers.\u003cbr/\u003e\nAlso make sure that your setup meets my constraints.\n\n**MASTER NODE**\n\nExecute the initialization script and remember the token 😉 \u003cbr/\u003e\nThe token will look like this: ```--token f38242.e7f3XXXXXXXXe231e```.\n\n```\nchmod +x init-master.sh\nsudo ./init-master.sh \u003cIP-of-master\u003e\n```\n\n**WORKER NODE**\n\nExecute the initialization script with the correct token and IP of your master.\u003cbr/\u003e\nThe port is usually ```6443```.\n\n```\nchmod +x init-worker.sh\nsudo ./init-worker.sh \u003cToken-of-Master\u003e \u003cIP-of-master\u003e:\u003cPort\u003e\n```\n\n\u003ch4\u003eDetailed step by step instructions\u003c/h4\u003e\n\n**MASTER NODE**\n\n**1.** Add the Kubernetes repository to the package manager\n```\nsudo su -\napt-get update \u0026\u0026 apt-get install -y apt-transport-https\ncurl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -\ncat \u003c\u003cEOF \u003e/etc/apt/sources.list.d/kubernetes.list\ndeb http://apt.kubernetes.io/ kubernetes-xenial main\nEOF\napt-get update\nexit\n```\n\n**2.** Install docker-engine, kubelet, kubeadm, kubectl and kubernetes-cni\n\n```\nsudo apt-get install -y docker-engine\nsudo apt-get install -y kubelet kubeadm kubectl kubernetes-cni\nsudo groupadd docker\nsudo usermod -aG docker $USER\necho 'You might need to reboot / relogin to make docker work correctly'\n```\n\n**3.** Since we want to build a cluster that uses GPUs, we need to enable GPU acceleration on the master node.\nKeep in mind that this instruction may become obsolete or change completely in a later version of Kubernetes!\n\n**3.I**\nAdd GPU support to the kubeadm configuration while the cluster is not yet initialized.\n```\nsudo vim /etc/systemd/system/kubelet.service.d/\u003c\u003cNumber\u003e\u003e-kubeadm.conf\n```\nAppend the flag ```--feature-gates=\"Accelerators=true\"``` to ExecStart, so it looks like this:\n```\nExecStart=/usr/bin/kubelet 
$KUBELET_KUBECONFIG_ARGS [...] --feature-gates=\"Accelerators=true\"\n```\n\n**3.II** Restart kubelet\n```\nsudo systemctl daemon-reload\nsudo systemctl restart kubelet\n```\n\n**4.** Now we will initialize the master node.\u003cbr/\u003e\nFor this you will need the IP of your master node.\nThis step will also provide you with the credentials to add further worker nodes, so remember your token 😉 \u003cbr/\u003e\nThe token will look like this: ```--token f38242.e7f3XXXXXXXXe231e 130.211.XXX.XXX:6443```\n```\nsudo kubeadm init --apiserver-advertise-address=\u003cip-address\u003e\n```\n**5.** Since Kubernetes 1.6 switched from ABAC to RBAC for role management, we need to advertise the credentials of the user.\nYou will need to perform this step each time you log into the machine!\n```\nsudo cp /etc/kubernetes/admin.conf $HOME/\nsudo chown $(id -u):$(id -g) $HOME/admin.conf\nexport KUBECONFIG=$HOME/admin.conf\n```\n\n**6.** Install a network add-on so that your pods can communicate with each other. 
Kubernetes 1.6 has some requirements for the network add-on, among them:\n\n + CNI-based networks\n + RBAC support\n\nThis Google Sheet contains a comparison of suitable network add-ons (GoogleSheet-Network-Add-on-vergleich).\nI will use Weave Net from Weaveworks, just because of my personal preference ;)\n```\nkubectl apply -f https://git.io/weave-kube-1.6\n```\n**6.II** You are ready to go, maybe check your pods to confirm that everything is working ;)\n```\nkubectl get pods --all-namespaces\n```\n**N.** If you want to tear down your master, you will need to reset the master node\n```\nsudo kubeadm reset\n```\n\n**WORKER NODE**\n\nThe beginning should be familiar to you, which makes this process a lot faster ;)\n\n**1.** Add the Kubernetes repository to the package manager\n```\nsudo su -\napt-get update \u0026\u0026 apt-get install -y apt-transport-https\ncurl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -\ncat \u003c\u003cEOF \u003e/etc/apt/sources.list.d/kubernetes.list\ndeb http://apt.kubernetes.io/ kubernetes-xenial main\nEOF\napt-get update\nexit\n```\n\n**2.** Install docker-engine, kubelet, kubeadm, kubectl and kubernetes-cni\n\n```\nsudo apt-get install -y docker-engine\nsudo apt-get install -y kubelet kubeadm kubectl kubernetes-cni\nsudo groupadd docker\nsudo usermod -aG docker $USER\necho 'You might need to reboot / relogin to make docker work correctly'\n```\n\n**3.** Since we want to build a cluster that uses GPUs, we need to enable GPU acceleration on the worker nodes that have a GPU installed.\nKeep in mind that this instruction may become obsolete or change completely in a later version of Kubernetes!\n\n**3.I**\nAdd GPU support to the kubeadm configuration while the cluster is not yet initialized.\n```\nsudo vim /etc/systemd/system/kubelet.service.d/\u003c\u003cNumber\u003e\u003e-kubeadm.conf\n```\nAppend the flag ```--feature-gates=\"Accelerators=true\"``` to ExecStart, so it looks like this:\n```\nExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS [...] 
--feature-gates=\"Accelerators=true\"\n```\n\n**3.II** Restart kubelet\n```\nsudo systemctl daemon-reload\nsudo systemctl restart kubelet\n```\n\n**4.** Now we will add the worker to the cluster.\u003cbr/\u003e\nFor this you will need the token from your master node, so take a deep dive into your notes xD\n```\nsudo kubeadm join --token f38242.e7f3XXXXXXe231e 130.211.XXX.XXX:6443\n```\n**5.** Finished, check your nodes on your master and see if everything worked.\n```\nkubectl get nodes\n```\n**N.** If you want to tear down your worker node, you will need to remove the node from the cluster and reset the worker node.\n***On master:***   \n```\nkubectl delete node \u003cworker node name\u003e\n```\n***On worker node***\n```\nsudo kubeadm reset\n```\n\n**Client**\n\nIn order to control your cluster (i.e. your master) from your client, you will need to authenticate your client with the right user.\nThis guide won’t cover creating a separate user for the client; we will just copy the user from the master node.\u003cbr/\u003e\nThis will be easier, trust me 🤓 \u003cbr/\u003e\n[Instructions for adding a custom user will be added in the future]\n\n**1.** Install kubectl on your client. I have only tested it on my Mac, but Linux should work as well.\nI don’t know about Windows, but who cares about Windows anyway :D\u003cbr/\u003e\n**On Mac**\n```\nbrew install kubectl\n```\n**2.** Copy the admin authentication from the master to your client\n```\nscp uai@130.211.XXX.64:~/admin.conf ~/.kube/\n```\n**3.** Add the admin.conf configuration and credentials to the Kubernetes configuration. 
You will need to do this for every agent\n```\nexport KUBECONFIG=~/.kube/admin.conf\n```\nYou are ready to use kubectl on your local client.\n\n**3.II** You can test by listing all your pods\n```\nkubectl get pods --all-namespaces\n```\n\n\n**Install Kubernetes dashboard**\n\nThe Kubernetes dashboard is pretty beautiful and gives script kiddies like me access to a lot of functionality.\nIn order to use the dashboard you will need to have your client running; RBAC will enforce this 👮\n\n**You can perform these steps directly on the master or from your client**\n\n**1.** Check if the dashboard is already installed\n```kubectl get pods --all-namespaces | grep dashboard```\n\n**2.** If the dashboard isn’t installed, install it ;)\n```\nkubectl create -f https://git.io/kube-dashboard\n```\nIf this did not work, check whether the container defined in the .yaml [git.io/kube-dashboard](https://git.io/kube-dashboard) exists. (This bug cost me a lot of time)\n\nIn order to have access to your dashboard you will need to be authenticated with your client.\n\n**3.** Proxy the dashboard to your client\n```\nkubectl proxy\n```\n\n**4.** Access the dashboard within your browser by visiting\n[127.0.0.1:8001/ui](http://127.0.0.1:8001/ui)\n\n## How to build your GPU container\nThis section should help you get a Docker container running that needs GPU access.\n\nFor this guide I have chosen to build an example Docker container that uses the TensorFlow GPU binaries and can run TensorFlow programs in a Jupyter notebook.\n\nKeep in mind that this guide has been written for Kubernetes 1.6, so later changes to Kubernetes may break it.\n\n### Essential parts of .yml\nIn order to get your Nvidia GPU with CUDA running, you have to pass the Nvidia driver and CUDA libraries to your container.\nSo we will use hostPath to make them available to the Kubernetes pod.\nThe actual paths differ from machine to machine, since they are set by your Nvidia driver and CUDA installation.\n```\nvolumes:\n    - hostPath:\n        
path: /usr/lib/nvidia-375/bin\n      name: bin\n    - hostPath:\n        path: /usr/lib/nvidia-375\n      name: lib\n```\nMount the volumes with the driver and CUDA in the right directory for your container. These paths might differ, depending on the specific requirements of your container.\n```\nvolumeMounts:\n    - mountPath: /usr/local/nvidia/bin\n      name: bin\n    - mountPath: /usr/local/nvidia/lib\n      name: lib\n```\nTo tell Kubernetes that you need n GPUs, define your requirements here.\n```\nresources:\n    limits:\n        alpha.kubernetes.io/nvidia-gpu: 1\n```\nThat's it, this is everything you need to build your Kubernetes 1.6 container 😏\n\nOne note at the end that describes my overall experience:\u003cbr/\u003e\n**Kubernetes + Docker + Machine Learning + GPUs = Pure awesomeness**\n\n### Example GPU deployment\nMy example-gpu-deployment.yaml file describes two parts, a deployment and a service, since I want to make the Jupyter notebook available from the outside.\n\nRun kubectl create to make it available to the outside\n```\nkubectl create -f deployment.yaml\n```\n\nThe deployment.yaml file looks like this:\n```\n---\napiVersion: extensions/v1beta1\nkind: Deployment\nmetadata:\n  name: tf-jupyter\nspec:\n  replicas: 1\n  template:\n    metadata:\n      labels:\n        app: tf-jupyter\n    spec:\n      volumes:\n      - hostPath:\n          path: /usr/lib/nvidia-375/bin\n        name: bin\n      - hostPath:\n          path: /usr/lib/nvidia-375\n        name: lib\n      containers:\n      - name: tensorflow\n        image: tensorflow/tensorflow:0.11.0rc0-gpu\n        ports:\n        - containerPort: 8888\n        resources:\n          limits:\n            alpha.kubernetes.io/nvidia-gpu: 1\n        volumeMounts:\n        - mountPath: /usr/local/nvidia/bin\n          name: bin\n        - mountPath: /usr/local/nvidia/lib\n          name: lib\n---\napiVersion: v1\nkind: Service\nmetadata:\n  name: tf-jupyter-service\n  labels:\n    app: 
tf-jupyter\nspec:\n  selector:\n    app: tf-jupyter\n  ports:\n  - port: 8888\n    protocol: TCP\n    nodePort: 30061\n  type: LoadBalancer\n---\n```\n\n## Some helpful commands\n\n**Get commands** with basic output\n```\nkubectl get services                 # List all services in the namespace\nkubectl get pods --all-namespaces    # List all pods in all namespaces\nkubectl get pods -o wide             # List all pods in the namespace, with more details\nkubectl get deployment my-dep        # List a particular deployment\n```\n\n**Describe commands** with verbose output\n```\nkubectl describe nodes \u003cnode-name\u003e\nkubectl describe pods \u003cpod-name\u003e\n```\n\n**Deleting Resources**\n```\nkubectl delete -f ./pod.yaml                   # Delete a pod using the type and name specified in pod.yaml\nkubectl delete pod,service baz foo             # Delete pods and services with same names \"baz\" and \"foo\"\nkubectl delete pods,services -l name=\u003cLabel\u003e   # Delete pods and services with label name=\u003cLabel\u003e\nkubectl -n \u003cnamespace\u003e delete po,svc --all     # Delete all pods and services in namespace \u003cnamespace\u003e\n```\n\n**Get into the bash console** of one of your pods:\n\n```\nkubectl exec -it \u003cpod-name\u003e -- /bin/bash\n```\n\n## Common issues\n\nSome people contacted me about issues with their CUDA deployment related to the forwarding of drivers.\u003cbr\u003e\nIf the example-gpu-deployment.yaml is not working for you, I would recommend installing CUDA as described in more detail in the guide [Installing TensorFlow on Ubuntu](http://simonboehm.com/tech/2017/06/23/installingTensorFlow.html) and trying the example-gpu-deployment-nvidia-375-82.yaml.\nIt might be necessary to adjust the version number in the yaml file.\n\nIf you encounter another issue, feel free to open an issue on GitHub.\n\n## Acknowledgements\nThere are a lot of guides, GitHub repositories, issues and people out there who helped me a lot. 
\u003cbr/\u003e\nSo I want to thank everybody for their help.\u003cbr/\u003e\nSpecial thanks go to the startup [understand.ai](http://understand.ai) for their support.\n\n### Authors\n\n* **Frederic Tausch** - *Initial work* - [Langhalsdino](https://github.com/Langhalsdino)\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLanghalsdino%2FKubernetes-GPU-Guide","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FLanghalsdino%2FKubernetes-GPU-Guide","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLanghalsdino%2FKubernetes-GPU-Guide/lists"}