{"id":22347320,"url":"https://github.com/redhat-na-ssa/gpu-workshop","last_synced_at":"2025-07-30T04:33:11.434Z","repository":{"id":51188597,"uuid":"488343494","full_name":"redhat-na-ssa/gpu-workshop","owner":"redhat-na-ssa","description":"Using GPUs on Red Hat Platforms","archived":false,"fork":false,"pushed_at":"2023-09-14T19:22:05.000Z","size":1313,"stargazers_count":3,"open_issues_count":0,"forks_count":2,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-04-16T02:01:16.750Z","etag":null,"topics":["cuda","gpu","nvidia","opencl"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/redhat-na-ssa.png","metadata":{"files":{"readme":"README.adoc","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-05-03T19:49:28.000Z","updated_at":"2024-02-08T14:02:49.000Z","dependencies_parsed_at":"2023-02-13T03:45:51.942Z","dependency_job_id":null,"html_url":"https://github.com/redhat-na-ssa/gpu-workshop","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/redhat-na-ssa%2Fgpu-workshop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/redhat-na-ssa%2Fgpu-workshop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/redhat-na-ssa%2Fgpu-workshop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/redhat-na-ssa%2Fgpu-workshop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/redhat-na-ssa","download_url":"https://codeload.github.com/redhat-na-ssa/gpu-workshop/tar.gz/refs/h
eads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228088846,"owners_count":17867481,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","gpu","nvidia","opencl"],"created_at":"2024-12-04T10:08:59.560Z","updated_at":"2024-12-04T10:09:00.703Z","avatar_url":"https://github.com/redhat-na-ssa.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":":scrollbar:\n:data-uri:\n:toc2:\n:linkattrs:\n\n= Red Hat GPU Workshop \n\n:numbered:\n\n== Overview\nA workshop focused on using NVIDIA GPUs on Red Hat platforms. \n\n== Background\n\nGraphics Processing Units (GPUs) were originally invented to allow application developers to program 3D graphics accelerators \nto render photorealistic images in real time. The key is that GPUs accelerate matrix and vector math \noperations (dot product, cross product and matrix multiplication). It turns out these math operations are used in many applications \nbesides 3D graphics, including high-performance computing and machine learning. As a result, software libraries (e.g., 
CUDA) \nwere developed to allow non-graphics or general-purpose computing applications to take advantage of GPU hardware.\n\n[.float-group]\n--\n[.left]\n.Fixed\nimage::./images/skull.jpg[Fixed, 256, 256]\n\n[.left]\n.Programmable\nimage::./images/skullshaded.jpg[Programmable, 256, 256]\n--\n\nGPUs provide a programmable graphics pipeline for realistic image rendering.\n\n== RHEL 9.1\n\nThe following NVIDIA® software is required for GPU support.\n```\n\nNVIDIA® GPU drivers version 450.80.02 or higher.\nCUDA® Toolkit 11.2.\ncuDNN SDK 8.1.0.\n(Optional) TensorRT to improve latency and throughput for inference.\n```\n\nThe easy way to install the NVIDIA stack is to use the link:ansible/vm/README.adoc[Ansible-based provisioning].\n\nOnce the stack is built, running the `nvidia-smi` utility loads the kernel modules and verifies the driver installation.\n```\nnvidia-smi\n```\n\nExample output:\n```\n+-----------------------------------------------------------------------------+\n| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |\n|-------------------------------+----------------------+----------------------+\n| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |\n| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |\n|                               |                      |               MIG M. 
|\n|===============================+======================+======================|\n|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |\n| N/A   37C    P0    26W /  70W |      0MiB / 15360MiB |     10%      Default |\n|                               |                      |                  N/A |\n+-------------------------------+----------------------+----------------------+\n                                                                               \n+-----------------------------------------------------------------------------+\n| Processes:                                                                  |\n|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |\n|        ID   ID                                                   Usage      |\n|=============================================================================|\n|  No running processes found                                                 |\n+-----------------------------------------------------------------------------+\n```\n\n=== RHEL Testing\n\n==== Non-container app test\n\nThe system should now be ready to run a GPU workload. Run a simple test to validate the software stack.\n\nCreate a Python virtual environment and install TensorFlow.\n```\npython3 -m venv venv\nsource venv/bin/activate\npip install pip tensorflow -U\n```\n\nRun a simple TensorFlow test to confirm a GPU device is found.\n```\npython\n\n\u003e\u003e\u003e import tensorflow as tf\n\n\u003e\u003e\u003e tf.config.list_logical_devices()\n```\n\nSample output:\n```\nCreated device /job:localhost/replica:0/task:0/device:GPU:0 with 14644 MB \nmemory:  -\u003e device: 0, name: Tesla T4, pci bus id: 0001:00:00.0, compute capability: 7.5\n\n[LogicalDevice(name='/device:CPU:0', device_type='CPU'),\n LogicalDevice(name='/device:GPU:0', device_type='GPU')]\n```\n\nRun the script to test the TensorFlow devices.\n```\npython src/tf-test.py\n```\n\nCompare the CPU vs. 
GPU elapsed time in the output.\n```\n[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]\nMatrix Multiply Elapsed Time: {'CPU': 6.495161056518555, 'GPU': 0.9890825748443604}\n```\n\n== Additional RHEL Demos\n\n=== CUDA Demo\n```\ncd /usr/local/cuda-11.8/extras/demo_suite\n./nbody -benchmark -cpu\n./nbody -benchmark\n```\n\n=== Container Demos\n\n```\nsudo yum install nvidia-container-toolkit-base -y\n```\n\n- For RHEL 8.x, install the link:https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#podman[NVIDIA container toolkit].\n- For RHEL 9.1, https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-podman[follow these instructions] to install the NVIDIA container toolkit.\n\n=== Containerized app tests (verified for RHEL 9.2)\n\nThe `nvidia-smi` output should be similar to what was reported above.\n\n```\npodman run --rm --device nvidia.com/gpu=all --security-opt=label=disable nvcr.io/nvidia/cuda:11.3.0-devel-ubi8 nvidia-smi\n\npodman run --rm --device nvidia.com/gpu=all --security-opt=label=disable nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8\n```\n\n== OpenShift\n\n=== Install\nThe easy way to install the NVIDIA stack is to use the link:ansible/ocp/README.adoc[Ansible-based provisioning].\n\nWait until every pod reports a completed or running status. 
This can take several minutes.\n\n```\noc get pods -n nvidia-gpu-operator\n```\n\nThe daemonset pods build a driver for each node with a GPU.\n\n```\noc logs nvidia-driver-daemonset-410.84.202204112301-0-gf4t4  -n nvidia-gpu-operator  nvidia-driver-ctr --follow\n\nTue May 17 19:41:23 UTC 2022 Waiting for openshift-driver-toolkit-ctr container to build the precompiled driver ...\n```\n\nCheck the logs from one of the `nvidia-cuda-validator` pods.\n\n```\noc logs -n nvidia-gpu-operator nvidia-cuda-validator-qpqcg\n\n\ncuda workload validation is successful\n```\n\n=== GPU Test\n\n. Determine the name of the `gputest` pod:\n+\n-----\n$ POD=$(oc get pods --selector=deploymentconfig=gputest -n gputest --output=custom-columns=:.metadata.name --no-headers)\n-----\n\n. Connect to the `gputest` pod:\n+\n-----\n$ oc rsh ${POD} bash\n-----\n\n. Install the `tensorflow` module:\n+\n-----\n$ pip install tensorflow\n-----\n\n. Install `matplotlib`:\n+\n-----\n$ pip install matplotlib\n-----\n\n\n. Run a quick GPU test:\n\n.. Switch to the `python` interpreter:\n+\n-----\n$ python\n\n\nPython 3.8.10 (default, Mar 15 2022, 12:22:08) \n[GCC 9.4.0] on linux\n-----\n\n. At the Python prompt, import TensorFlow and list the physical devices:\n+\n-----\n\u003e\u003e\u003e import tensorflow as tf\n\u003e\u003e\u003e tf.config.list_physical_devices()\n[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]\n-----\n\n. Exit the Python shell:\n+\n-----\n\u003e\u003e\u003e exit()\n$\n-----\n\n=== GPU Pod test\n`oc apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/tests/gpu-pod.yaml`\n\n=== Jupyter Lab Web App\n\n. Determine the route to the `GPU Test` web app:\n+\n-----\n$ echo -en \"\\nhttp://$(oc get route gputest --template={{.spec.host}} -n gputest)\\n\"\n-----\n\n. In a new browser tab, navigate to the URL returned by the previous command.\n\n\n. 
Determine the `token` needed to authenticate into the jupyter web app:\n+\nFrom the log file of the pod, pick out the token:\n+\n-----\n$ oc logs ${POD} -n gputest | grep \"token=\" | head -n 1 | cut -d \"=\" -f2\n-----\n\n. Use the token to authenticate into the Jupyter Lab web app.\n\n. In Jupyter lab, clone the link:https://github.com/tensorflow/docs.git[tensorflow docs] examples and run the notebook at:  `docs/site/en/tutorials/keras/classification.ipynb`\n\n.. Error:\n+\n-----\n2023-01-24 19:44:26.632828: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA\nTo enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n2023-01-24 19:44:26.776592: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n2023-01-24 19:44:27.622397: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64\n2023-01-24 19:44:27.622486: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64\n2023-01-24 19:44:27.622497: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. 
If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n-----\n\n.. There doesn't seem to be a `TensorRT` image in quay.io/modh/cuda-notebooks.\n+\nTensorRT packages are available link:https://developer.nvidia.com/nvidia-tensorrt-7x-download[here].\n\n\n==== Jupyter/TensorFlow Example\n\n- Visit the route URL determined above.\n- Use the token to log in to Jupyter.\n- Open the `tensorflow-tutorials/classification.ipynb` notebook.\n- Run all of the cells.\n- The notebook should train, test and validate a machine learning model.\n\n=== GPU Dashboard (OCP v4.11+)\n\nInstall the GPU console plugin dashboard by following the link:https://docs.openshift.com/container-platform/4.11/monitoring/nvidia-gpu-admin-dashboard.html[OpenShift documentation].\n\n=== OpenDataHub (optional)\n\nCreate a new project for OpenDataHub.\n\nUsing the OpenShift web console, create an instance of the ODH operator in this project.\n\nCreate an ODH instance in your namespace.\n\nCreate the CUDA-enabled notebook image streams.\n```\noc apply -f https://raw.githubusercontent.com/red-hat-data-services/odh-manifests/master/jupyterhub/notebook-images/overlays/additional/tensorflow-notebook-imagestream.yaml \n```\n\n==== Custom Notebook Limits (Optional)\n\nConfigMaps set custom notebook resource limits such as the number of CPU cores,\nmemory and GPUs. This is necessary for the Jupyter pod to get scheduled\non a GPU node. 
\n\nApply the following ConfigMap before launching the JupyterHub server.\n```\noc apply -f src/jupyterhub-notebook-sizes.yml\n```\n\nFrom within Jupyter, clone the following repo:\n\nlink:https://github.com/tensorflow/docs.git[TensorFlow examples]\n\nThese TensorFlow notebook examples should run:\n\n- `docs/site/en/tutorials/keras/classification.ipynb`\n- `docs/site/en/tutorials/quickstart/beginner.ipynb`\n- `docs/site/en/tutorials/quickstart/advanced.ipynb`\n\n== DIY Grafana GPU Dashboard\n\nCreate a token for the Grafana service account.\n```\noc create token grafana-serviceaccount --duration=2000h -n models\n```\n\nEdit `grafana-data-source.yaml` (replace \u003cnamespace\u003e and \u003cservice-account-token\u003e), then create the data source.\n```\noc create -f grafana-data-source.yaml\n```\n\nImport the sample link:https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/[DCGM exporter dashboard] (`grafana/NVIDIA_DCGM_Exporter_Dashboard.json`).\n\nimage::images/prometheus.png[]\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fredhat-na-ssa%2Fgpu-workshop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fredhat-na-ssa%2Fgpu-workshop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fredhat-na-ssa%2Fgpu-workshop/lists"}