{"id":13716488,"url":"https://github.com/NVIDIA/dcgm-exporter","last_synced_at":"2025-05-07T05:33:14.869Z","repository":{"id":37462197,"uuid":"395039200","full_name":"NVIDIA/dcgm-exporter","owner":"NVIDIA","description":"NVIDIA GPU metrics exporter for Prometheus leveraging DCGM","archived":false,"fork":false,"pushed_at":"2025-04-16T17:54:51.000Z","size":4168,"stargazers_count":1155,"open_issues_count":108,"forks_count":185,"subscribers_count":18,"default_branch":"main","last_synced_at":"2025-04-23T16:08:55.340Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVIDIA.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"security.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-08-11T15:40:37.000Z","updated_at":"2025-04-23T15:56:06.000Z","dependencies_parsed_at":"2023-12-08T18:25:41.475Z","dependency_job_id":"e3ac33fd-d62e-4c50-9878-11c1a8027b86","html_url":"https://github.com/NVIDIA/dcgm-exporter","commit_stats":null,"previous_names":[],"tags_count":33,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fdcgm-exporter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fdcgm-exporter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fdcgm-exporter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fdcgm-exporter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owne
rs/NVIDIA","download_url":"https://codeload.github.com/NVIDIA/dcgm-exporter/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252823500,"owners_count":21809705,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T00:01:10.971Z","updated_at":"2025-05-07T05:33:14.858Z","avatar_url":"https://github.com/NVIDIA.png","language":"Go","funding_links":[],"categories":["Monitoring","11. GPU Observability","Go"],"sub_categories":["Prometheus Based","Cost \u0026 Usage Tracking"],"readme":"# DCGM-Exporter\n\nThis repository contains the DCGM-Exporter project. 
It exposes GPU metrics to [Prometheus](https://prometheus.io/), leveraging [NVIDIA DCGM](https://developer.nvidia.com/dcgm).\n\n### Documentation\n\nOfficial documentation for DCGM-Exporter can be found on [docs.nvidia.com](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html).\n\n### Quickstart\n\nTo gather metrics on a GPU node, start the `dcgm-exporter` container:\n\n```shell\ndocker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:4.2.0-4.1.0-ubuntu22.04\ncurl localhost:9400/metrics\n# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).\n# TYPE DCGM_FI_DEV_SM_CLOCK gauge\n# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).\n# TYPE DCGM_FI_DEV_MEM_CLOCK gauge\n# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).\n# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge\n...\nDCGM_FI_DEV_SM_CLOCK{gpu=\"0\", UUID=\"GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52\"} 139\nDCGM_FI_DEV_MEM_CLOCK{gpu=\"0\", UUID=\"GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52\"} 405\nDCGM_FI_DEV_MEMORY_TEMP{gpu=\"0\", UUID=\"GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52\"} 9223372036854775794\n...\n```\n\n### Quickstart on Kubernetes\n\nNote: Consider using the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) rather than DCGM-Exporter directly.\n\nEnsure you have already set up your cluster with the [default runtime as NVIDIA](https://github.com/NVIDIA/nvidia-container-runtime#docker-engine-setup).\n\nThe recommended way to install DCGM-Exporter is to use the Helm chart. First, add the chart repository:\n\n```shell\nhelm repo add gpu-helm-charts \\\n  https://nvidia.github.io/dcgm-exporter/helm-charts\n```\n\nUpdate the repo:\n\n```shell\nhelm repo update\n```\n\nAnd install the chart:\n\n```shell\nhelm install \\\n    --generate-name \\\n    gpu-helm-charts/dcgm-exporter\n```\n\nOnce the `dcgm-exporter` pod is deployed, you can use port forwarding to obtain metrics quickly:\n\n```shell\nkubectl create -f 
https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml\n\n# Let's get the output of the first matching pod:\nNAME=$(kubectl get pods -l \"app.kubernetes.io/name=dcgm-exporter\" \\\n                         -o \"jsonpath={ .items[0].metadata.name}\")\n\nkubectl port-forward \"$NAME\" 8080:9400 \u0026\n\ncurl -sL http://127.0.0.1:8080/metrics\n# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).\n# TYPE DCGM_FI_DEV_SM_CLOCK gauge\n# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).\n# TYPE DCGM_FI_DEV_MEM_CLOCK gauge\n# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).\n# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge\n...\nDCGM_FI_DEV_SM_CLOCK{gpu=\"0\", UUID=\"GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52\",container=\"\",namespace=\"\",pod=\"\"} 139\nDCGM_FI_DEV_MEM_CLOCK{gpu=\"0\", UUID=\"GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52\",container=\"\",namespace=\"\",pod=\"\"} 405\nDCGM_FI_DEV_MEMORY_TEMP{gpu=\"0\", UUID=\"GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52\",container=\"\",namespace=\"\",pod=\"\"} 9223372036854775794\n...\n\n```\n\nTo integrate DCGM-Exporter with Prometheus and Grafana, see the full instructions in the [user guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/).\n`dcgm-exporter` is deployed as part of the GPU Operator. To get started integrating with Prometheus, check the Operator [user guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#gpu-telemetry).\n\n### TLS and Basic Auth\n\nDCGM-Exporter supports TLS and basic auth through [exporter-toolkit](https://github.com/prometheus/exporter-toolkit). To enable TLS and/or basic auth, pass the `--web-config-file` CLI flag as follows:\n\n```shell\ndcgm-exporter --web-config-file=web-config.yaml\n```\n\nA sample `web-config.yaml` file can be fetched from the [exporter-toolkit repository](https://github.com/prometheus/exporter-toolkit/blob/master/docs/web-config.yml). 
The reference for the `web-config.yaml` file format can be found in the [docs](https://github.com/prometheus/exporter-toolkit/blob/master/docs/web-configuration.md).\n\n### How to include HPC jobs in metric labels\n\nDCGM-Exporter can include High-Performance Computing (HPC) job information in its metric labels. To achieve this, HPC environment administrators must configure their HPC environment to generate files that map GPUs to HPC jobs.\n\n#### File Conventions\n\nThese mapping files follow a specific format:\n\n* Each file is named after a unique GPU ID (e.g., 0, 1, 2, etc.).\n* Each line in the file contains the IDs of jobs that run on the corresponding GPU.\n\n#### Enabling HPC Job Mapping on DCGM-Exporter\n\nTo enable GPU-to-job mapping on the DCGM-Exporter side, run DCGM-Exporter with the `--hpc-job-mapping-dir` command-line parameter, pointing to the directory where the HPC cluster creates job mapping files. Alternatively, set the environment variable `DCGM_HPC_JOB_MAPPING_DIR` to achieve the same result.\n\n### Building from Source\n\nTo build `dcgm-exporter`, ensure you have the following:\n\n* [Golang \u003e= 1.22 installed](https://golang.org/)\n* [DCGM installed](https://developer.nvidia.com/dcgm)\n* A Linux machine with a DCGM-compatible GPU\n\n```shell\ngit clone https://github.com/NVIDIA/dcgm-exporter.git\ncd dcgm-exporter\nmake binary\nsudo make install\n...\ndcgm-exporter \u0026\ncurl localhost:9400/metrics\n# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).\n# TYPE DCGM_FI_DEV_SM_CLOCK gauge\n# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).\n# TYPE DCGM_FI_DEV_MEM_CLOCK gauge\n# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).\n# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge\n...\nDCGM_FI_DEV_SM_CLOCK{gpu=\"0\", UUID=\"GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52\"} 139\nDCGM_FI_DEV_MEM_CLOCK{gpu=\"0\", UUID=\"GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52\"} 405\nDCGM_FI_DEV_MEMORY_TEMP{gpu=\"0\", 
UUID=\"GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52\"} 9223372036854775794\n...\n```\n\n### Changing Metrics\n\nWith `dcgm-exporter` you can configure which fields are collected by specifying a custom CSV file.\nThe default CSV file is located at `etc/default-counters.csv` in the repository and is copied to `/etc/dcgm-exporter/default-counters.csv` on your system or container.\n\nThe layout and format of this file are as follows:\n\n```\n# Format\n# If line starts with a '#' it is considered a comment\n# DCGM FIELD, Prometheus metric type, help message\n\n# Clocks\nDCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).\nDCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).\n```\n\nA custom CSV file can be specified using the `-f` or `--collectors` option as follows:\n\n```shell\ndcgm-exporter -f /tmp/custom-collectors.csv\n```\n\nNotes:\n\n* Always make sure each entry has 2 commas (',')\n* The complete list of counters that can be collected can be found in the DCGM API reference manual: \u003chttps://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html\u003e\n\n### What about a Grafana Dashboard?\n\nYou can find the official NVIDIA DCGM-Exporter dashboard here: \u003chttps://grafana.com/grafana/dashboards/12239\u003e\n\nYou will also find the `json` file in this repo under `grafana/dcgm-exporter-dashboard.json`\n\nPull requests are accepted!\n\n### Building the containers\n\nThis project uses [docker buildx](https://docs.docker.com/buildx/working-with-buildx/) for multi-arch image creation. Follow the instructions on that page to get a working builder instance for creating these containers. 
Some other useful build options follow.\n\nBuild local images based on the machine architecture and make them available in 'docker images':\n\n```shell\nmake local\n```\n\nBuild the Ubuntu image and export it to 'docker images':\n\n```shell\nmake ubuntu22.04 PLATFORMS=linux/amd64 OUTPUT=type=docker\n```\n\nBuild and push the images to some other 'private_registry':\n\n```shell\nmake REGISTRY=\u003cprivate_registry\u003e push\n```\n\n## Issues and Contributing\n\n[Check out the Contributing document!](CONTRIBUTING.md)\n\n* Please let us know about problems by [filing a new issue](https://github.com/NVIDIA/dcgm-exporter/issues/new)\n* You can contribute by opening a [pull request](https://github.com/NVIDIA/dcgm-exporter)\n\n### Reporting Security Issues\n\nWe ask that all community members and users of DCGM Exporter follow the standard NVIDIA process for reporting security vulnerabilities. This process is documented at the [NVIDIA Product Security](https://www.nvidia.com/en-us/security/) website.\nFollowing the process will result in any needed CVE being created as well as appropriate notifications being communicated\nto the entire DCGM Exporter community. NVIDIA reserves the right to delete vulnerability reports until they're fixed.\n\nPlease refer to the policies listed there to answer questions related to reporting security issues.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNVIDIA%2Fdcgm-exporter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FNVIDIA%2Fdcgm-exporter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNVIDIA%2Fdcgm-exporter/lists"}