{"id":23257600,"url":"https://github.com/victorgu-github/gpu-metrics-eks-cloudwatch","last_synced_at":"2026-01-19T06:01:32.968Z","repository":{"id":158246542,"uuid":"633927773","full_name":"victorgu-github/GPU-metrics-EKS-Cloudwatch","owner":"victorgu-github","description":null,"archived":false,"fork":false,"pushed_at":"2023-04-28T16:24:17.000Z","size":839,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-06T04:43:06.454Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/victorgu-github.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-28T15:49:15.000Z","updated_at":"2023-04-28T15:49:45.000Z","dependencies_parsed_at":null,"dependency_job_id":"ff1ca7dd-8c77-4f67-bc87-0ed9d919156b","html_url":"https://github.com/victorgu-github/GPU-metrics-EKS-Cloudwatch","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/victorgu-github/GPU-metrics-EKS-Cloudwatch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/victorgu-github%2FGPU-metrics-EKS-Cloudwatch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/victorgu-github%2FGPU-metrics-EKS-Cloudwatch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/victorgu-github%2FGPU-metrics-EKS-Cloudwatch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/victorgu-github%2FGP
U-metrics-EKS-Cloudwatch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/victorgu-github","download_url":"https://codeload.github.com/victorgu-github/GPU-metrics-EKS-Cloudwatch/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/victorgu-github%2FGPU-metrics-EKS-Cloudwatch/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28562233,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-19T03:31:16.861Z","status":"ssl_error","status_checked_at":"2026-01-19T03:31:15.069Z","response_time":67,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-19T12:37:48.892Z","updated_at":"2026-01-19T06:01:32.953Z","avatar_url":"https://github.com/victorgu-github.png","language":null,"readme":"# GPU-metrics-EKS-Cloudwatch\nThe demand for GPU instances is increasing due to their use in various Machine Learning applications such as neural network training, complex simulations, and other related tasks. However, this demand brings challenges for customers, such as optimizing costs and increasing efficiency, as well as the need to allocate GPU usage costs among the different parties on a platform solution. 
To address these challenges, specific data is necessary, and this blog post aims to provide guidance on how to collect this data.\n\nThe main focus of this blog post is to obtain utilization metrics at the container, pod, or namespace level. We will provide details on how to set up container-based GPU metrics and demonstrate how they are integrated into Amazon Elastic Kubernetes Service (EKS) using CloudWatch as an example.\n\n## Solution Overview\nIn summary, our approach involves creating an Amazon Elastic Kubernetes Service (EKS) cluster using a g4dn.2xlarge instance, although the approach applies to any NVIDIA-accelerated instance family. We will deploy the NVIDIA Data Center GPU Manager (DCGM) exporter, which exposes GPU metrics at an HTTP endpoint for monitoring solutions like CloudWatch. We will then use the Prometheus EKS deployment specification to collect the metrics and push them to Amazon CloudWatch Metrics. These metrics provide developers with data to identify optimization opportunities for their workloads. Moreover, we will demonstrate how the collected metrics can be used to allocate costs internally in platform setups where different teams run their workloads on the platform. \u003cbr\u003e\n![](img/image1.png)\nAlternatively, the collected metrics can be sent to Prometheus and viewed using Grafana. For this purpose, NVIDIA offers [Grafana dashboards](https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/) to visualize metrics from the DCGM exporter. Although this blog post focuses on sending the metrics to CloudWatch and viewing them in a CloudWatch dashboard, it is important to consider the [Metrics Insights limits](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-metrics-insights-limits.html). However, the DCGM exporter setup and the Prometheus scrape configuration described in this blog post can be reused for other monitoring solutions. 
For those interested in setting up Prometheus as part of an EKS cluster, this [documentation](https://docs.aws.amazon.com/eks/latest/userguide/prometheus.html) provides a starting point.\n\n## NVIDIA GPU Drivers\nThere are various methods to set up and configure the GPU nodes in your EKS cluster, and the deployment of the DCGM Exporter depends on your specific setup. If you use the [Amazon EKS optimized accelerated Amazon Linux AMIs](https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html) provided by Amazon, the NVIDIA drivers are already included in the machine image. In this case, you need to install the [NVIDIA device plugin for Kubernetes](https://github.com/NVIDIA/k8s-device-plugin) and then deploy the [DCGM Exporter](https://github.com/NVIDIA/dcgm-exporter) as an additional component into the cluster.\n\nNVIDIA recommends deploying the DCGM Exporter as part of the [GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html) instead of deploying it manually. The GPU Operator simplifies node setup by including various components, such as the NVIDIA device plugin for Kubernetes, to configure the nodes. NVIDIA explains how these components fit together in a blog post titled [NVIDIA GPU Operator: Simplifying GPU Management in Kubernetes](https://developer.nvidia.com/blog/nvidia-gpu-operator-simplifying-gpu-management-in-kubernetes/).\n\nHowever, in this blog post, we will use the \"Amazon EKS optimized accelerated Amazon Linux AMIs\" and deploy the DCGM exporter as a separate component without using the GPU Operator, because the optimized AMI already includes the components that the GPU Operator would otherwise deploy.\n\n## Prerequisites\nTo deploy the entire stack you will need the AWS command line interface (AWS CLI), helm, eksctl, and kubectl. You also need an EKS cluster with GPU nodes.\n\n## Deploy\n1. 
Attach the CloudWatchFullAccess policy to the IAM role of the nodes to authorize them to publish metrics.\n\n2. Add the NVIDIA Helm repository\n```bash\nhelm repo add gpu-helm-charts-2 \\\n  https://nvidia.github.io/dcgm-exporter/helm-charts \u0026\u0026 helm repo update\n```\n\n**Note:** Make sure that you add the repo gpu-helm-charts-2. Helm charts for the DCGM exporter are no longer updated in the repo gpu-helm-charts: the latest chart version in gpu-helm-charts ships dcgm-exporter 2.2.9-2.4.0, while chart version 3.1.3 in gpu-helm-charts-2 ships dcgm-exporter 3.1.6-3.1.3.\n\nAlso install the Prometheus Operator, which provides the ServiceMonitor custom resource in API version \"monitoring.coreos.com/v1\" that the chart depends on.\n\n3. Create the DCGM exporter ConfigMap and install the DCGM exporter.\n```bash\ncurl https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/main/etc/dcp-metrics-included.csv \u003e dcgm-metrics.csv \u0026\u0026 \\\nkubectl create namespace dcgm-exporter \u0026\u0026 \\\nkubectl create configmap metrics-config -n dcgm-exporter --from-file=dcgm-metrics.csv\n```\nYou can edit the dcgm-metrics.csv file if you want to add or remove metrics collected by the DCGM exporter. However, we will select a subset of metrics later as part of the CloudWatch Logs upload.\n\n4. 
Install the DCGM exporter into the EKS cluster\n```bash\nhelm install --wait --generate-name -n dcgm-exporter --version 3.1.3 --create-namespace gpu-helm-charts-2/dcgm-exporter \\\n--set config.name=metrics-config \\\n--set env[0].name=DCGM_EXPORTER_COLLECTORS \\\n--set env[0].value=/etc/dcgm-exporter/dcgm-metrics.csv\n```\nConfirm that the DCGM exporter pod is running; if you inspect the logs, the webserver should be up and running:\n```bash\nkubectl logs \u003cpod-name\u003e -n dcgm-exporter\ntime=\"2022-08-24T19:32:11Z\" level=info msg=\"Starting dcgm-exporter\"\ntime=\"2022-08-24T19:32:11Z\" level=info msg=\"DCGM successfully initialized!\"\ntime=\"2022-08-24T19:32:11Z\" level=info msg=\"Collecting DCP Metrics\"\ntime=\"2022-08-24T19:32:11Z\" level=info msg=\"No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcgm-metrics.csv\"\ntime=\"2022-08-24T19:32:11Z\" level=info msg=\"Kubernetes metrics collection enabled!\"\ntime=\"2022-08-24T19:32:11Z\" level=info msg=\"Pipeline starting\"\ntime=\"2022-08-24T19:32:11Z\" level=info msg=\"Starting webserver\"\n```\nYou can test that the DCGM exporter is working by forwarding the port locally to your developer machine like this:\n```bash\nkubectl port-forward -n dcgm-exporter \u003cpod-name\u003e 9500:9400\n```\nYou can then view the metrics by opening http://localhost:9500/metrics in your browser.\n\n5. 
Create the CloudWatch agent configuration to scrape the metrics and upload them to CloudWatch\n```bash\n# Download an example configuration:\ncurl -O https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/service/cwagent-prometheus/prometheus-eks.yaml\n```\nAdd a metric declaration for the DCGM metrics to the cwagent.json section of the configuration file. The configuration used here only captures GPU utilization as well as memory usage and power usage. If you want to add additional metrics, you can review the list of field identifiers in the [NVIDIA DCGM documentation](https://docs.nvidia.com/datacenter/dcgm/3.1/dcgm-api/dcgm-api-field-ids.html).\n\nThe pod_name label in the CloudWatch agent configuration is taken from the Kubernetes metadata and will therefore always equal the name of the DCGM exporter pod. The pod label, by contrast, is scraped from the dcgm-exporter and identifies the workload pod that is actually using the GPU; this is the value we are interested in and will forward to CloudWatch metrics. 
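\nAs a sketch, the metric_declaration entry in cwagent.json could look like the following; the label_matcher, dimensions, and metric selectors are assumptions derived from the scrape configuration in this post and from the DCGM field identifiers for GPU utilization, framebuffer memory, and power usage:\n```json\n\"metric_declaration\": [\n  {\n    \"source_labels\": [\"container_name\"],\n    \"label_matcher\": \"^exporter\",\n    \"dimensions\": [[\"ClusterName\", \"Namespace\", \"NodeName\", \"pod\"]],\n    \"metric_selectors\": [\n      \"^DCGM_FI_DEV_GPU_UTIL$\",\n      \"^DCGM_FI_DEV_FB_USED$\",\n      \"^DCGM_FI_DEV_POWER_USAGE$\"\n    ]\n  }\n]\n```\nThe dimensions determine how the CloudWatch agent groups the scraped Prometheus samples into CloudWatch metrics; using pod rather than pod_name as a dimension is what makes the metrics attributable to the workload pods.\n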
\nIn the ConfigMap for the Prometheus configuration, add a section for the DCGM exporter.\n```yaml\n    - job_name: 'kubernetes-pod-dcgm-exporter'\n      sample_limit: 10000\n      kubernetes_sd_configs:\n      - role: endpoints\n      relabel_configs:\n      - source_labels: [__meta_kubernetes_pod_container_name]\n        action: keep\n        regex: '^exporter.*$'\n      - source_labels: [__address__]\n        action: replace\n        regex: ([^:]+)(?::\\d+)?\n        replacement: ${1}:9400\n        target_label: __address__\n      - action: labelmap\n        regex: __meta_kubernetes_pod_label_(.+)\n      - action: replace\n        source_labels:\n        - __meta_kubernetes_namespace\n        target_label: Namespace\n      - source_labels: [__meta_kubernetes_pod_name]\n        action: replace\n        target_label: pod_name\n      - action: replace\n        source_labels:\n        - __meta_kubernetes_pod_container_name\n        target_label: container_name\n      - action: replace\n        source_labels:\n        - __meta_kubernetes_pod_controller_name\n        target_label: pod_controller_name\n      - action: replace\n        source_labels:\n        - __meta_kubernetes_pod_controller_kind\n        target_label: pod_controller_kind\n      - action: replace\n        source_labels:\n        - __meta_kubernetes_pod_phase\n        target_label: pod_phase\n      - action: replace\n        source_labels:\n        - __meta_kubernetes_pod_node_name\n        target_label: NodeName\n```\nThe regex for __meta_kubernetes_pod_container_name matches the container name inside the DCGM exporter pod as part of the Prometheus service discovery mechanism.\nPlease note that this configuration is intended to demonstrate GPU metric collection and therefore collects only these metrics and no others.\nIf you have already set up the CloudWatch agent, you can merge the provided scrape configuration into your existing configuration.\n\n6. 
Deploy the YAML from step 5.\n```bash\nkubectl apply -f prometheus-eks.yaml\n```\nConfirm that the CloudWatch agent Prometheus pod is running:\n```bash\nNAMESPACE         NAME                                READY STATUS  RESTARTS AGE\namazon-cloudwatch cwagent-prometheus-56c448885d-v2k9l 1/1   Running \n```\n\n7. Confirm metrics in the CloudWatch console.\nNavigate to the AWS CloudWatch console. Go to All metrics; under the custom namespaces you should see a new entry for ContainerInsights/Prometheus. If you graph the metrics, you should see the per-container metrics captured via DCGM.\n![](img/image2.png)\n![](img/image3.png)\n\n## Example Dashboards\nInside CloudWatch you can create dashboards that help you identify optimization potential for your containers:\n![](img/image4.png)\nIn the example dashboard above you can see the overall GPU utilization for your cluster in the top left corner. In the top right corner, you can see the GPU utilization for every GPU, grouped by Instance \u0026 GPU. The middle graph shows GPU utilization per pod; the pod named dcgmproftester2 has low GPU utilization. In this specific case, the pod has requested two GPUs but is only using one, blocking the second GPU for other workloads without using it. ","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvictorgu-github%2Fgpu-metrics-eks-cloudwatch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvictorgu-github%2Fgpu-metrics-eks-cloudwatch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvictorgu-github%2Fgpu-metrics-eks-cloudwatch/lists"}