{"id":34660129,"url":"https://github.com/linode/lke-ai-conformance","last_synced_at":"2026-03-15T15:55:19.615Z","repository":{"id":322358621,"uuid":"1088879837","full_name":"linode/lke-ai-conformance","owner":"linode","description":"The CNCF Kubernetes AI Conformance defines a set of additional capabilities, APIs, and configurations that a Kubernetes cluster MUST offer, on top of standard CNCF Kubernetes Conformance, to reliably and efficiently run AI/ML workloads.","archived":false,"fork":false,"pushed_at":"2025-11-04T19:01:13.000Z","size":139,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-11T01:36:16.638Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/linode.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-03T15:28:13.000Z","updated_at":"2025-12-05T09:42:16.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/linode/lke-ai-conformance","commit_stats":null,"previous_names":["linode/lke-ai-conformance"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/linode/lke-ai-conformance","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linode%2Flke-ai-conformance","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linode%2Flke-ai-conformance/tags","releases_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linode%2Flke-ai-conformance/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linode%2Flke-ai-conformance/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/linode","download_url":"https://codeload.github.com/linode/lke-ai-conformance/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linode%2Flke-ai-conformance/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30546141,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-15T15:03:43.933Z","status":"ssl_error","status_checked_at":"2026-03-15T15:03:37.630Z","response_time":61,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-12-24T18:52:01.562Z","updated_at":"2026-03-15T15:55:19.603Z","avatar_url":"https://github.com/linode.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# lke-ai-conformance\n\n\u003cimg src=\"https://github.com/linode/lke-ai-conformance/blob/main/images/AkamaiCloud-Horizontal.png\"\u003e\n\u003cimg src=\"https://github.com/linode/lke-ai-conformance/blob/main/images/AkamaiCloud-HorizontalWhite.png\"\u003e\n\n\nAkamai Cloud LKE (Linode Kubernetes Engine) CNCF Kubernetes 1.34 AI Conformance\n\nCNCF Kubernetes AI 
Conformance\n\nhttps://docs.google.com/document/d/1hXoSdh9FEs13Yde8DivCYjjXyxa7j4J8erjZPEGWuzc/edit?tab=t.0\n\n\n## Deploy LKE with APP Platform\n\n* https://www.linode.com/docs/guides/deploy-llm-for-ai-inferencing-on-apl/#provision-an-lke-cluster\n\n* https://techdocs.akamai.com/app-platform/docs/lke-automatic-install\n\n## Install the NVIDIA GPU Operator\n\n```\nhelm repo add nvidia https://helm.ngc.nvidia.com/nvidia\nhelm repo update\nhelm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --version=v25.10.0 --set dcgmExporter.serviceMonitor.enabled=true\n```\n\n\n## Verify GPU Operator\n\n````\nkubectl get pods -n gpu-operator\n````\n\n````\nNAME                                                              READY   STATUS      RESTARTS   AGE\ngpu-feature-discovery-gtfsf                                       1/1     Running     0          72s\ngpu-feature-discovery-xs9wz                                       1/1     Running     0          72s\ngpu-feature-discovery-z5qv2                                       1/1     Running     0          71s\ngpu-operator-1762185269-node-feature-discovery-gc-8f698864lm44t   1/1     Running     0          85s\ngpu-operator-1762185269-node-feature-discovery-master-86bbqzcxv   1/1     Running     0          85s\ngpu-operator-1762185269-node-feature-discovery-worker-7hkl2       1/1     Running     0          85s\ngpu-operator-1762185269-node-feature-discovery-worker-dg8n5       1/1     Running     0          85s\ngpu-operator-1762185269-node-feature-discovery-worker-ljv8j       1/1     Running     0          85s\ngpu-operator-1762185269-node-feature-discovery-worker-n4v7b       1/1     Running     0          85s\ngpu-operator-1762185269-node-feature-discovery-worker-nvcn4       1/1     Running     0          85s\ngpu-operator-1762185269-node-feature-discovery-worker-x6rft       1/1     Running     0          85s\ngpu-operator-596944c879-5ngk4                                     1/1     Running     
0          85s\nnvidia-container-toolkit-daemonset-4zqn8                          1/1     Running     0          71s\nnvidia-container-toolkit-daemonset-b6j69                          1/1     Running     0          76s\nnvidia-container-toolkit-daemonset-sm4sl                          1/1     Running     0          76s\nnvidia-cuda-validator-4llgh                                       0/1     Completed   0          65s\nnvidia-cuda-validator-4ntfc                                       0/1     Completed   0          46s\nnvidia-cuda-validator-w65bs                                       0/1     Completed   0          47s\nnvidia-dcgm-exporter-46gdf                                        1/1     Running     0          73s\nnvidia-dcgm-exporter-gswfz                                        1/1     Running     0          73s\nnvidia-dcgm-exporter-qd2xz                                        1/1     Running     0          71s\nnvidia-device-plugin-daemonset-ptg8v                              1/1     Running     0          74s\nnvidia-device-plugin-daemonset-v929s                              1/1     Running     0          72s\nnvidia-device-plugin-daemonset-wq7jc                              1/1     Running     0          74s\nnvidia-operator-validator-5465v                                   1/1     Running     0          71s\nnvidia-operator-validator-7c54k                                   1/1     Running     0          75s\nnvidia-operator-validator-zs8cg                                   1/1     Running     0          75s\n````\n\n## GPU detected and labeled\n````\nkubectl get node -o json | jq '.items[].metadata.labels'\n````\n\n````\n \"kubernetes.io/arch\": \"amd64\",\n  \"kubernetes.io/hostname\": \"lke529760-766837-0a634ab30000\",\n  \"kubernetes.io/os\": \"linux\",\n  \"lke.linode.com/pool-id\": \"766837\",\n  \"node.k8s.linode.com/host-uuid\": \"40917c1f174b1ffae34d73e4febe42e76f5541ea\",\n  
\"node.kubernetes.io/instance-type\": \"g2-gpu-rtx4000a1-m\",\n  \"nvidia.com/cuda.driver-version.full\": \"580.95.05\",\n  \"nvidia.com/cuda.driver-version.major\": \"580\",\n  \"nvidia.com/cuda.driver-version.minor\": \"95\",\n  \"nvidia.com/cuda.driver-version.revision\": \"05\",\n  \"nvidia.com/cuda.driver.major\": \"580\",\n  \"nvidia.com/cuda.driver.minor\": \"95\",\n  \"nvidia.com/cuda.driver.rev\": \"05\",\n  \"nvidia.com/cuda.runtime-version.full\": \"13.0\",\n  \"nvidia.com/cuda.runtime-version.major\": \"13\",\n  \"nvidia.com/cuda.runtime-version.minor\": \"0\",\n  \"nvidia.com/cuda.runtime.major\": \"13\",\n  \"nvidia.com/cuda.runtime.minor\": \"0\",\n  \"nvidia.com/gfd.timestamp\": \"1762185321\",\n  \"nvidia.com/gpu-driver-upgrade-state\": \"upgrade-done\",\n  \"nvidia.com/gpu.compute.major\": \"8\",\n  \"nvidia.com/gpu.compute.minor\": \"9\",\n  \"nvidia.com/gpu.count\": \"1\",\n  \"nvidia.com/gpu.deploy.container-toolkit\": \"true\",\n  \"nvidia.com/gpu.deploy.dcgm\": \"true\",\n  \"nvidia.com/gpu.deploy.dcgm-exporter\": \"true\",\n  \"nvidia.com/gpu.deploy.device-plugin\": \"true\",\n  \"nvidia.com/gpu.deploy.driver\": \"pre-installed\",\n  \"nvidia.com/gpu.deploy.gpu-feature-discovery\": \"true\",\n  \"nvidia.com/gpu.deploy.node-status-exporter\": \"true\",\n  \"nvidia.com/gpu.deploy.operator-validator\": \"true\",\n  \"nvidia.com/gpu.family\": \"ampere\",\n  \"nvidia.com/gpu.machine\": \"Compute-Instance\",\n  \"nvidia.com/gpu.memory\": \"20475\",\n  \"nvidia.com/gpu.mode\": \"graphics\",\n  \"nvidia.com/gpu.present\": \"true\",\n  \"nvidia.com/gpu.product\": \"NVIDIA-RTX-4000-Ada-Generation\",\n  \"nvidia.com/gpu.replicas\": \"1\",\n  \"nvidia.com/gpu.sharing-strategy\": \"none\",\n  \"nvidia.com/mig.capable\": \"false\",\n  \"nvidia.com/mig.strategy\": \"single\",\n  \"nvidia.com/mps.capable\": \"false\",\n  \"nvidia.com/vgpu.present\": \"false\",\n  \"topology.kubernetes.io/region\": \"us-ord\",\n  \"topology.linode.com/region\": 
\"us-ord\"\n````\n\n\n\n## DRA Support\n\n### Install Nvidia DRA driver\n````\nhelm repo add nvidia https://helm.ngc.nvidia.com/nvidia \u0026\u0026 helm repo update\nhelm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --namespace nvidia-dra-driver-gpu --create-namespace --values manifests/dra-driver-values.yaml\n````\n\n````\nkubectl get deviceclasses\n````\n\n````\nNAME                                        AGE\ncompute-domain-daemon.nvidia.com            4m32s\ncompute-domain-default-channel.nvidia.com   4m32s\ngpu.nvidia.com                              4m32s\nmig.nvidia.com                              4m32s\n````\n\n\n### Test DRA Resource Claims\n\n````\nkubectl apply -f manifests/resource-claim-template.yaml\nkubectl apply -f manifests/dra-deployment.yaml\n````\n\n````\nkubectl get resourceclaims \nkubectl get pods\n````\n\nShow DRA GPU Example\n````\nkubectl logs -f -lapp=dra-gpu-example\n````\n````\nTue Nov  4 17:24:18 UTC 2025\nGPU 0: NVIDIA RTX 4000 Ada Generation (UUID: GPU-9d332149-daef-d8e3-1168-b247fd392cbe)\n````\n\nShow Resource Claim\n````\nkubectl get resourceclaims\n````\n````\nNAME                                                STATE                AGE\ndra-gpu-example-7b9b75dbb9-prgkm-single-gpu-mwr6m   allocated,reserved   74m\n````\n\n## AI Inference\n\n### Gateway API in Istio\n\nhttps://istio.io/latest/docs/setup/getting-started/#gateway-api\n\n````\nkubectl get crd gateways.gateway.networking.k8s.io \u0026\u003e /dev/null || \\\n{ kubectl kustomize \"github.com/kubernetes-sigs/gateway-api/config/crd?ref=v1.3.0\" | kubectl apply -f -; }\n````\n\nPull down Istio files for demo\n````\ncurl -L https://git.io/getLatestIstio | ISTIO_VERSION=1.27.3 sh -\n````\n\nReset istio to use Gateway API\n\n````\nistio-1.27.3/bin/istioctl install --set profile=minimal -y\n````\n\nApply sample app to show Gateway API operational\n````\nkubectl apply -f istio-1.27.3/samples/bookinfo/platform/kube/bookinfo.yaml\n````\n````\nkubectl get 
services\nkubectl get pods\nkubectl exec \"$(kubectl get pod -l app=ratings -o jsonpath='{.items[0].metadata.name}')\" -c ratings -- curl -sS productpage:9080/productpage | grep -o \"\u003ctitle\u003e.*\u003c/title\u003e\"\n````\n````\n\u003ctitle\u003eSimple Bookstore App\u003c/title\u003e\n````\n\nApply bookinfo gateway\n````\nkubectl apply -f istio-1.27.3/samples/bookinfo/gateway-api/bookinfo-gateway.yaml\n````\n````\nkubectl get gateway bookinfo-gateway\n````\n````\nNAME               CLASS   ADDRESS          PROGRAMMED   AGE\nbookinfo-gateway   istio   172.234.211.21   True         34m\n````\n\n## Gang Scheduling\n\nInstall Kueue\n\n### Create namespace for kueue install\n\n````\nkubectl create ns kueue-system\n````\n\n### Install via helm\n````\nhelm install kueue oci://registry.k8s.io/kueue/charts/kueue --version=0.14.1 --namespace kueue-system --wait --timeout 300s\n````\n\nWait until Ready\n````\nkubectl get deployments -n kueue-system\n````\n\n````\nNAME                       READY   UP-TO-DATE   AVAILABLE   AGE\nkueue-controller-manager   1/1     1            1           63s\n````\n\n````\nkubectl get pods -n kueue-system\n````\n\n````\nNAME                                        READY   STATUS    RESTARTS      AGE\nkueue-controller-manager-59957d6f7d-5sp8v   1/1     Running   2 (74s ago)   78s\n````\n\nInstall Resource Flavor\n````\nkubectl create namespace team-a\nkubectl create namespace team-b\nkubectl apply -f manifests/resource-flavor.yaml\n````\n\nInstall Cluster Queue\n````\nkubectl apply -f manifests/cluster-queue.yaml\n````\n\nInstall Local Queue\n````\nkubectl apply -f manifests/local-queue.yaml\n````\n\nCreate jobs\n````\nkubectl create -f manifests/job-team-b.yaml\nkubectl create -f manifests/job-team-a.yaml\n````\n\n````\nkubectl get jobs -A\n````\n\n````\nNAMESPACE   NAME                      STATUS     COMPLETIONS   DURATION   AGE\nteam-a      sample-job-team-a-4k2df   Complete   3/3           16s        18s\nteam-b      sample-job-team-b-696zj   Complete  
 3/3           13s        32s\n````\n\n## Cluster Autoscaling\n\nhttps://techdocs.akamai.com/cloud-computing/docs/manage-nodes-and-node-pools#autoscale-automatically-resize-node-pools\n\n\n## Pod Autoscaling\n\nCreate Prometheus Rule\n\n```shell\nkubectl apply -f manifests/prometheus-rule.yaml\n```\n\nInstall Prometheus Adapter\n\n````\nkubectl label servicemonitor -n gpu-operator gpu-operator prometheus=system\nkubectl label servicemonitor -n gpu-operator nvidia-dcgm-exporter prometheus=system\nhelm install prometheus-adapter -n monitoring prometheus-community/prometheus-adapter --set prometheus.url=\"http://po-prometheus.monitoring.svc.cluster.local\"\n````\n\n### Test Kubernetes Custom Metrics Endpoint\n\nCreate a deployment\n\n```shell\nkubectl apply -f manifests/cuda-deployment.yaml\n```\n\nCheck for the metrics exposed by DCGM Exporter such as `DCGM_FI_DEV_GPU_UTIL`\n\n```\nkubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep cuda_gpu\n```\n\n### Create HPA to Scale deployment based on GPU Utilization\n\n```shell\nkubectl apply -f manifests/hpa.yaml\n```\n\nSimulate load on GPU\n\n```shell\nkubectl exec -it deployment/cuda -- bash\n````\n\n````\nfor (( c=1; c\u003c=5000; c++ )); do ./vectorAdd; done \u0026 \\\nfor (( c=1; c\u003c=5000; c++ )); do ./vectorAdd; done \u0026 \\\nfor (( c=1; c\u003c=5000; c++ )); do ./vectorAdd; done \u0026 \\\nfor (( c=1; c\u003c=5000; c++ )); do ./vectorAdd; done \u0026 \\\nfor (( c=1; c\u003c=5000; c++ )); do ./vectorAdd; done \u0026 \\\nfor (( c=1; c\u003c=5000; c++ )); do ./vectorAdd; done \u0026 \\\nfor (( c=1; c\u003c=5000; c++ )); do ./vectorAdd; done \u0026 \\\nfor (( c=1; c\u003c=5000; c++ )); do ./vectorAdd; done \u0026\n````\n\nMonitor HPA and pod replicas\n````\nkubectl get hpa\n````\n````\nNAME   REFERENCE         TARGETS   MINPODS   MAXPODS   REPLICAS   AGE\ncuda   Deployment/cuda   9/5       1         3         1          3m20s\n````\n\n````\nkubectl get pod -lapp=cuda\n````\n````\nNAME            
        READY   STATUS    RESTARTS   AGE\ncuda-75454ffb9f-gv5nc   1/1     Running   0          5m3s\n````\n\n````\nkubectl get hpa\n````\n\n````\nNAME   REFERENCE         TARGETS   MINPODS   MAXPODS   REPLICAS   AGE\ncuda   Deployment/cuda   6/5       1         3         3          4m2s\n````\n\n````\nkubectl get pod -lapp=cuda\n````\n````\nNAME                    READY   STATUS    RESTARTS   AGE\ncuda-75454ffb9f-2v6fx   1/1     Running   0          70s\ncuda-75454ffb9f-7z29g   1/1     Running   0          100s\ncuda-75454ffb9f-gv5nc   1/1     Running   0          5m3s\n````\n\n## Accelerator Metrics\n\nShow the DCGM metrics are installed and available\n\n````\nkubectl -n gpu-operator get svc\n````\n````\nNAME                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE\ngpu-operator           ClusterIP   10.128.231.34   \u003cnone\u003e        8080/TCP   9h\nnvidia-dcgm-exporter   ClusterIP   10.128.20.33    \u003cnone\u003e        9400/TCP   9h\n````\n\n````\nkubectl port-forward service/nvidia-dcgm-exporter 9400:9400 -n gpu-operator\n````\n\n````\ncurl 127.0.0.1:9400/metrics\n````\n\n````\nHandling connection for 9400\n# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).\n# TYPE DCGM_FI_DEV_SM_CLOCK gauge\nDCGM_FI_DEV_SM_CLOCK{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 2310\n# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).\n# TYPE DCGM_FI_DEV_MEM_CLOCK gauge\nDCGM_FI_DEV_MEM_CLOCK{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada 
Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 9001\n# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).\n# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge\nDCGM_FI_DEV_MEMORY_TEMP{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 0\n# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).\n# TYPE DCGM_FI_DEV_GPU_TEMP gauge\nDCGM_FI_DEV_GPU_TEMP{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 48\n# HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).\n# TYPE DCGM_FI_DEV_POWER_USAGE gauge\nDCGM_FI_DEV_POWER_USAGE{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 40.979000\n# HELP DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION Total energy consumption since boot (in mJ).\n# TYPE DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION counter\nDCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada 
Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 455497156\n# HELP DCGM_FI_DEV_PCIE_REPLAY_COUNTER Total number of PCIe retries.\n# TYPE DCGM_FI_DEV_PCIE_REPLAY_COUNTER counter\nDCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 0\n# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).\n# TYPE DCGM_FI_DEV_GPU_UTIL gauge\nDCGM_FI_DEV_GPU_UTIL{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 0\n# HELP DCGM_FI_DEV_MEM_COPY_UTIL Memory utilization (in %).\n# TYPE DCGM_FI_DEV_MEM_COPY_UTIL gauge\nDCGM_FI_DEV_MEM_COPY_UTIL{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 0\n# HELP DCGM_FI_DEV_ENC_UTIL Encoder utilization (in %).\n# TYPE DCGM_FI_DEV_ENC_UTIL gauge\nDCGM_FI_DEV_ENC_UTIL{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 0\n# HELP 
DCGM_FI_DEV_DEC_UTIL Decoder utilization (in %).\n# TYPE DCGM_FI_DEV_DEC_UTIL gauge\nDCGM_FI_DEV_DEC_UTIL{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 0\n# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.\n# TYPE DCGM_FI_DEV_XID_ERRORS gauge\nDCGM_FI_DEV_XID_ERRORS{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",err_code=\"0\",err_msg=\"No Error\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 0\n# HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB).\n# TYPE DCGM_FI_DEV_FB_FREE gauge\nDCGM_FI_DEV_FB_FREE{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 20015\n# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).\n# TYPE DCGM_FI_DEV_FB_USED gauge\nDCGM_FI_DEV_FB_USED{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 1\n# HELP DCGM_FI_DEV_FB_RESERVED Framebuffer memory reserved (in MiB).\n# TYPE DCGM_FI_DEV_FB_RESERVED 
gauge\nDCGM_FI_DEV_FB_RESERVED{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 457\n# HELP DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS Number of remapped rows for uncorrectable errors\n# TYPE DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS counter\nDCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 0\n# HELP DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS Number of remapped rows for correctable errors\n# TYPE DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS counter\nDCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 0\n# HELP DCGM_FI_DEV_ROW_REMAP_FAILURE Whether remapping of rows has failed\n# TYPE DCGM_FI_DEV_ROW_REMAP_FAILURE gauge\nDCGM_FI_DEV_ROW_REMAP_FAILURE{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 0\n# HELP DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL Total number of NVLink bandwidth counters for all lanes.\n# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL 
counter\nDCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 0\n# HELP DCGM_FI_DEV_VGPU_LICENSE_STATUS vGPU License status\n# TYPE DCGM_FI_DEV_VGPU_LICENSE_STATUS gauge\nDCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 0\n# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active.\n# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge\nDCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 0.004656\n# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active.\n# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge\nDCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 0.000000\n# HELP DCGM_FI_PROF_DRAM_ACTIVE Ratio of cycles the device memory interface is active sending or receiving data.\n# TYPE DCGM_FI_PROF_DRAM_ACTIVE 
gauge\nDCGM_FI_PROF_DRAM_ACTIVE{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 0.003074\n# HELP DCGM_FI_PROF_PCIE_TX_BYTES The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.\n# TYPE DCGM_FI_PROF_PCIE_TX_BYTES gauge\nDCGM_FI_PROF_PCIE_TX_BYTES{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 78398480\n# HELP DCGM_FI_PROF_PCIE_RX_BYTES The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.\n# TYPE DCGM_FI_PROF_PCIE_RX_BYTES gauge\nDCGM_FI_PROF_PCIE_RX_BYTES{gpu=\"0\",UUID=\"GPU-a8394b51-c086-568a-9f1a-c1e1c934c44c\",pci_bus_id=\"00000000:00:02.0\",device=\"nvidia0\",modelName=\"NVIDIA RTX 4000 Ada Generation\",Hostname=\"lke529760-766837-0a634ab30000\",DCGM_FI_DRIVER_VERSION=\"580.95.05\",container=\"cuda\",namespace=\"default\",pod=\"cuda-75454ffb9f-8ljq7\",pod_uid=\"\"} 45987793\n````\n\n## AI Service Metrics\n\nPrometheus is installable and usable on the LKE cluster\n\nhttps://www.linode.com/docs/guides/deploy-prometheus-operator-with-grafana-on-lke/\n\n\n## Secure Accelerator Access\n\nRun kubernetes e2e DRA test suite.\n\nCreate two Pods, each is allocated an accelerator resource. 
Execute a command in one Pod that attempts to access the other Pod’s\naccelerator; the attempt should be denied.\n\n**Step 1**: Build the Kubernetes DRA e2e tests\n\n```\nmkdir k8s.io \u0026\u0026 cd k8s.io \u0026\u0026 \\\ngit clone https://github.com/kubernetes/kubernetes \u0026\u0026 cd kubernetes\n```\n```\ngit checkout v1.34.1\n```\n```\nmake WHAT=\"ginkgo k8s.io/kubernetes/test/e2e/e2e.test\"\n```\n\n**Step 2**: Run the multi-container access test\n\n```\nKUBECONFIG=/Users/xxx/.kube/config _output/bin/ginkgo -v -focus='must map configs and devices to the right containers' ./test/e2e\n```\n\n```\n  I1104 12:42:43.374995   58787 e2e.go:109] Starting e2e run \"41597bc2-8046-4015-b863-8b98a1109aea\" on Ginkgo node 1\nRunning Suite: Kubernetes e2e suite - /Users/srust/ws/github.com/linode/lke-ai-conformance/k8s.io/kubernetes/test/e2e\n=====================================================================================================================\nRandom Seed: 1762278140 - will randomize all specs\n\nWill run 1 of 7137 specs\n------------------------------\n[sig-node] [DRA] kubelet [Feature:DynamicResourceAllocation] must map configs and devices to the right containers [sig-node, DRA, Feature:DynamicResourceAllocation]\n/Users/.../k8s.io/kubernetes/test/e2e/dra/dra.go:180\n  STEP: Creating a kubernetes client @ 11/04/25 12:42:44.127\n  STEP: Building a namespace api object, basename dra @ 11/04/25 12:42:44.128\n  STEP: Waiting for a default service account to be provisioned in namespace @ 11/04/25 12:42:44.575\n  STEP: Waiting for kube-root-ca.crt to be provisioned in namespace @ 11/04/25 12:42:44.649\n  STEP: selecting nodes @ 11/04/25 12:42:44.726\n  STEP: deploying driver dra-8441.k8s.io on nodes [lke530427-767898-51df6b5f0000] @ 11/04/25 12:42:44.776\n  STEP: wait for plugin registration @ 11/04/25 12:42:47.302\n  STEP: creating *v1.DeviceClass dra-8441-class @ 11/04/25 12:42:49.304\n  STEP: creating *v1.DeviceClass dra-8441-class0 @ 11/04/25 12:42:49.351\n  STEP: creating *v1.DeviceClass dra-8441-class1 @ 11/04/25 
12:42:49.395\n  STEP: creating *v1.DeviceClass dra-8441-class2 @ 11/04/25 12:42:49.44\n  STEP: creating *v1.DeviceClass dra-8441-class3 @ 11/04/25 12:42:49.49\n  STEP: creating *v1.DeviceClass dra-8441-class4 @ 11/04/25 12:42:49.529\n  STEP: creating *v1.DeviceClass dra-8441-class5 @ 11/04/25 12:42:49.571\n  STEP: creating *v1.ResourceClaim all @ 11/04/25 12:42:49.608\n  STEP: creating *v1.ResourceClaim container0 @ 11/04/25 12:42:49.654\n  STEP: creating *v1.ResourceClaim container1 @ 11/04/25 12:42:49.698\n  STEP: creating *v1.Pod tester-1 @ 11/04/25 12:42:49.747\n  STEP: delete pods and claims @ 11/04/25 12:42:56.398\n  STEP: deleting *v1.Pod dra-8441/tester-1 @ 11/04/25 12:42:56.438\n  STEP: deleting *v1.ResourceClaim dra-8441/all @ 11/04/25 12:43:00.667\n  STEP: deleting *v1.ResourceClaim dra-8441/container0 @ 11/04/25 12:43:00.716\n  STEP: deleting *v1.ResourceClaim dra-8441/container1 @ 11/04/25 12:43:00.759\n  STEP: waiting for resources on lke530427-767898-51df6b5f0000 to be unprepared @ 11/04/25 12:43:00.807\n  STEP: waiting for claims to be deallocated and deleted @ 11/04/25 12:43:00.808\n  STEP: scaling down driver proxy pods for dra-8441.k8s.io @ 11/04/25 12:43:01.569\n  STEP: Waiting for ResourceSlices of driver dra-8441.k8s.io to be removed... @ 11/04/25 12:43:02.025\n  STEP: Destroying namespace \"dra-8441\" for this suite. @ 11/04/25 12:43:02.23\n\nRan 1 of 7137 Specs in 18.904 seconds\nSUCCESS! 
-- 1 Passed | 0 Failed | 0 Pending | 7136 Skipped\nPASS\n\nGinkgo ran 1 suite in 42.3242365s\nTest Suite Passed\n```\n\n## Robust Controller\n\n### Install KubeRay Operator\n\nAdd the KubeRay Helm repo and install the kuberay-operator\n\n```shell\nhelm repo add kuberay https://ray-project.github.io/kuberay-helm/\nhelm repo update\nhelm install kuberay-operator kuberay/kuberay-operator --version 1.4.2\n```\n\nVerify that the kuberay-operator pod is running and in Ready state\n\n```shell\nkubectl get pod -lapp.kubernetes.io/name=kuberay-operator\n```\n\n```shell\nNAME                               READY   STATUS    RESTARTS   AGE\nkuberay-operator-87c45b7f8-czg6d   1/1     Running   0          22s\n```\n\n### Install RayCluster\n\nInstall the RayCluster using the following Helm chart\n\n```shell\nhelm install raycluster kuberay/ray-cluster --version 1.4.2\n```\n\n\nVerify that the RayCluster resources are in Ready state\n\n```shell\nkubectl get rayclusters\n```\n\n```shell\nNAME                 DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE\nraycluster-kuberay   1                 1                   2      3G       0      ready    4m18s\n```\n\n### Run KubeRay Job\n\nDeploy a Ray job that runs a Modin workload\n```shell\nkubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-job.modin.yaml\n```\n\n```\nkubectl logs -l=job-name=rayjob-sample\n```\n```\n2025-11-03 18:05:10,467\tINFO worker.py:1554 -- Using address 10.2.2.187:6379 set in the environment variable RAY_ADDRESS\n2025-11-03 18:05:10,467\tINFO worker.py:1694 -- Connecting to existing Ray cluster at address: 10.2.2.187:6379...\n2025-11-03 18:05:10,480\tINFO worker.py:1879 -- Connected to Ray cluster. View the dashboard at 10.2.2.187:8265\nModin Engine: Ray\nFutureWarning: DataFrame.applymap has been deprecated. 
Use DataFrame.map instead.\nTime to compute isnull: 0.12875682900630636\nTime to compute rounded_trip_distance: 0.4448743859975366\n2025-11-03 18:05:46,335\tSUCC cli.py:65 -- -----------------------------------\n2025-11-03 18:05:46,335\tSUCC cli.py:66 -- Job 'rayjob-sample-84rdk' succeeded\n2025-11-03 18:05:46,335\tSUCC cli.py:67 -- -----------------------------------\n\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinode%2Flke-ai-conformance","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flinode%2Flke-ai-conformance","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinode%2Flke-ai-conformance/lists"}