https://github.com/vuvietnguyenit/gpuxray
A lightweight GPU observability tool focused on per-process GPU metrics, with optional deep tracing powered by eBPF.
- Host: GitHub
- URL: https://github.com/vuvietnguyenit/gpuxray
- Owner: vuvietnguyenit
- License: mit
- Created: 2026-01-26T04:07:15.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-03-12T08:49:30.000Z (4 months ago)
- Last Synced: 2026-03-12T14:56:26.869Z (4 months ago)
- Topics: ebpf, gpu, gpu-monitoring, tracing
- Language: Go
- Homepage:
- Size: 438 KB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# GPUXRAY



An open-source observability tool for debugging GPU workloads on Linux servers.
It traces CUDA activity using eBPF and provides:
- per-process GPU metrics
- GPU memory leak detection
- Prometheus metrics for monitoring systems.
This tool is inspired by [pidstat](https://man7.org/linux/man-pages/man1/pidstat.1.html) but designed for GPU monitoring.
GPUXRAY is designed for AI/ML workloads running on GPU servers.
### Why use gpuxray?
- `gpuxray` provides GPU observability at the process level, which is not fully supported by [DCGM exporter](https://github.com/NVIDIA/dcgm-exporter).
- It leverages eBPF to trace CUDA activity from the kernel, enabling low-overhead, deep tracing of GPU workloads.
## Usecases
- Works well for tracing and collecting statistics from processes that use GPU resources through the CUDA API (e.g., ML jobs, AI workloads).
- Exposes Prometheus metrics for GPU resources associated with each PID.
- Convenient for detecting GPU memory leaks. It can show stack traces of leaked GPU memory blocks and identify the CUDA functions responsible for allocations that were not freed.
## Notice
- Currently, this tool only inspects PIDs that use the CUDA Driver API. Processes that use the CUDA Runtime API may be omitted.
- Requires Linux kernel version >= 5.6. Kernel versions in the 6.x series are recommended.
- Currently supports only the `amd64` CPU architecture.
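The kernel and architecture requirements above can be checked before installing. The following is a minimal sketch, not part of gpuxray: the function names are hypothetical, and it assumes that `amd64` corresponds to the `x86_64` machine name reported by `uname -m`.

```python
import platform
import re

MIN_KERNEL = (5, 6)              # minimum kernel version stated in the notice
SUPPORTED_MACHINES = {"x86_64"}  # `uname -m` name for the amd64 architecture


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '6.8.0-45-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_is_supported(release, minimum=MIN_KERNEL):
    """Return True when the release is at least the minimum (major, minor)."""
    return parse_kernel_release(release) >= minimum


def host_is_supported():
    """Check OS, CPU architecture and kernel version of the current host."""
    return (
        platform.system() == "Linux"
        and platform.machine() in SUPPORTED_MACHINES
        and kernel_is_supported(platform.release())
    )


if __name__ == "__main__":
    print("supported host:", host_is_supported())
```

Comparing `(major, minor)` tuples rather than strings avoids ordering mistakes such as treating `5.15` as older than `5.6`.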
## Table of Contents
- [GPUXRAY](#gpuxray)
- [Why use gpuxray?](#why-use-gpuxray)
- [Usecases](#usecases)
- [Notice](#notice)
- [Table of Contents](#table-of-contents)
- [Architecture](#architecture)
- [Install](#install)
- [Binary](#binary)
- [Docker](#docker)
- [Build from source](#build-from-source)
- [Quickstart](#quickstart)
- [Run GPU exporter](#run-gpu-exporter)
- [Memory statistics](#memory-statistics)
- [Show memory-leaked stacktraces](#show-memory-leaked-stacktraces)
- [GPUXRAY for Kubernetes](#gpuxray-for-kubernetes)
- [GPUXRAY for Docker](#gpuxray-for-docker)
- [Debugging](#debugging)
- [Contributing](#contributing)
## Architecture
GPUXRAY collects GPU information using multiple techniques:
1. **NVML** – retrieves GPU metrics per process
2. **eBPF** – traces CUDA calls
3. **Go exporter** – exposes metrics for Prometheus
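To illustrate the last step, the sketch below renders one per-process reading as a line in the Prometheus text exposition format. It is written in Python purely for illustration (the actual exporter is written in Go), and `escape_label_value` and `format_sample` are hypothetical helper names, not gpuxray functions.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name, labels, value):
    """Render a single sample as: name{key="value",...} value"""
    rendered = ",".join(
        f'{key}="{escape_label_value(str(val))}"'
        for key, val in sorted(labels.items())
    )
    return f"{name}{{{rendered}}} {value}"


if __name__ == "__main__":
    # Label names mirror those in the example exporter output in this README.
    print(format_sample(
        "gpu_process_active",
        {"pid": "3112", "comm": "gnome-shell", "gpu_index": "0"},
        1,
    ))
```

Escaping matters here because the `args` label carries a full command line, which may contain quotes or backslashes.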
## Install
### Binary
Install gpuxray easily with one command:
```sh
curl -s https://raw.githubusercontent.com/vuvietnguyenit/gpuxray/main/install.sh | sh
```
### Docker
```bash
docker pull ghcr.io/vuvietnguyenit/gpuxray:latest
```
### Build from source
```sh
git clone https://github.com/vuvietnguyenit/gpuxray
cd gpuxray
go build -o gpuxray
```
## Quickstart
### Run GPU exporter
Running the exporter exposes metrics related to processes using GPU resources on the server.
```sh
# Run from a root shell.
gpuxray mon
```
Metric definitions are available in [metrics.txt](./metrics.txt).
Example result:
```text
curl http://localhost:2112/metrics
...
# HELP gpu_free_memory_bytes Remaining available GPU memory for the process in bytes.
# TYPE gpu_free_memory_bytes gauge
gpu_free_memory_bytes{gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn"} 9.903734784e+09
# HELP gpu_process_active 1 for each process currently using a GPU.
# TYPE gpu_process_active gauge
gpu_process_active{args="/usr/bin/gnome-shell",comm="gnome-shell",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="3112"} 1
gpu_process_active{args="/usr/lib/xorg/Xorg vt1 -displayfd 3 -auth /run/user/120/gdm/Xauthority -nolisten tcp -background none -noreset -keeptty -novtswitch -verbose 3",comm="Xorg",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="2912"} 1
gpu_process_active{args="python -m src.models.classifier --train-file /shared_storage/ailab/intent-classifier/train/raw-click.gz --valid-file /shared_storage/ailab/intent-classifier/valid/raw-click.gz --model-name vinai/phobert-base-v2",comm="python",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="401948"} 1
# HELP gpu_process_sm_utilization_percent GPU SM utilisation of the process (0–100). Requires NVML r470+ drivers; returns 0 on older drivers.
# TYPE gpu_process_sm_utilization_percent gauge
gpu_process_sm_utilization_percent{args="/usr/bin/gnome-shell",comm="gnome-shell",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="3112"} 0
gpu_process_sm_utilization_percent{args="/usr/lib/xorg/Xorg vt1 -displayfd 3 -auth /run/user/120/gdm/Xauthority -nolisten tcp -background none -noreset -keeptty -novtswitch -verbose 3",comm="Xorg",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="2912"} 0
gpu_process_sm_utilization_percent{args="python -m src.models.classifier --train-file /shared_storage/ailab/intent-classifier/train/raw-click.gz --valid-file /shared_storage/ailab/intent-classifier/valid/raw-click.gz --model-name vinai/phobert-base-v2",comm="python",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="401948"} 86
# HELP gpu_process_used_memory_bytes GPU memory consumed by the process in bytes.
# TYPE gpu_process_used_memory_bytes gauge
gpu_process_used_memory_bytes{args="/usr/bin/gnome-shell",comm="gnome-shell",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="3112"} 1.1296768e+07
gpu_process_used_memory_bytes{args="/usr/lib/xorg/Xorg vt1 -displayfd 3 -auth /run/user/120/gdm/Xauthority -nolisten tcp -background none -noreset -keeptty -novtswitch -verbose 3",comm="Xorg",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="2912"} 1.0575872e+07
gpu_process_used_memory_bytes{args="python -m src.models.classifier --train-file /shared_storage/ailab/intent-classifier/train/raw-click.gz --valid-file /shared_storage/ailab/intent-classifier/valid/raw-click.gz --model-name vinai/phobert-base-v2",comm="python",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="401948"} 2.3716691968e+10
# HELP gpu_total_memory_bytes Total GPU memory available in bytes.
# TYPE gpu_total_memory_bytes gauge
gpu_total_memory_bytes{gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn"} 3.4190917632e+10
# HELP gpu_used_memory_bytes GPU memory currently allocated in bytes.
# TYPE gpu_used_memory_bytes gauge
gpu_used_memory_bytes{gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn"} 2.4287182848e+10
...
```
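The exporter output can be consumed by any Prometheus-compatible client. As an illustration, the following sketch parses the text format and computes each process's share of total GPU memory. It is a simplified parser written for this example (it ignores optional timestamps and does not unescape label values), the helper names are hypothetical, and the `args` label in the embedded sample is shortened for readability.

```python
import re

# Matches: metric_name{labels} value   (labels are optional)
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# Matches one key="value" pair, allowing escaped characters inside the value
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Excerpt of the example output above; the python process's args are shortened.
SAMPLE = '''\
# HELP gpu_process_used_memory_bytes GPU memory consumed by the process in bytes.
# TYPE gpu_process_used_memory_bytes gauge
gpu_process_used_memory_bytes{args="/usr/bin/gnome-shell",comm="gnome-shell",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="3112"} 1.1296768e+07
gpu_process_used_memory_bytes{args="python -m src.models.classifier",comm="python",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="401948"} 2.3716691968e+10
gpu_total_memory_bytes{gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn"} 3.4190917632e+10
'''


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # skip anything that is not a sample line
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def memory_share_by_pid(samples):
    """Map (gpu_index, pid) to the fraction of that GPU's total memory in use."""
    totals = {
        labels["gpu_index"]: value
        for name, labels, value in samples
        if name == "gpu_total_memory_bytes"
    }
    shares = {}
    for name, labels, value in samples:
        if name != "gpu_process_used_memory_bytes":
            continue
        total = totals.get(labels["gpu_index"])
        if total:
            shares[(labels["gpu_index"], labels["pid"])] = value / total
    return shares


if __name__ == "__main__":
    for (gpu_index, pid), share in memory_share_by_pid(parse_samples(SAMPLE)).items():
        print(f"gpu={gpu_index} pid={pid} share={share:.2%}")
```

In a real deployment the same ratio is more naturally expressed as a query in the monitoring system; the script form is mainly useful for ad hoc checks against `curl` output.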
### Memory statistics
This command reports GPU memory usage statistics for a process.
```sh
./gpuxray memtrace -p 2806854
```
Example result
```sh
# ./gpuxray memtrace -p 2806854
TIME      PID      USER  GPU  INUSE_MB  AL_CNT  FR_CNT  COMM
12:03:24  2806854  root  0    512 B     199     198     python3
12:03:29  2806854  root  0    512 B     402     401     python3
12:03:34  2806854  root  0    512 B     607     606     python3
12:03:39  2806854  root  0    1.00 KiB  802     800     python3
12:03:44  2806854  root  0    1.00 KiB  994     992     python3
12:03:49  2806854  root  0    2.00 KiB  1197    1193    python3
```
To see the meaning of each column, run `./gpuxray memtrace -h`.
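The table can also be post-processed to spot a growing gap between allocations and frees. The sketch below assumes that `AL_CNT` and `FR_CNT` are cumulative counts of allocation and free calls; that reading is an assumption based on the column names, so check `-h` for the authoritative definitions. The helper names are hypothetical, and the `MiB` and `GiB` units are assumed because only `B` and `KiB` appear in the sample.

```python
# Unit multipliers; only "B" and "KiB" appear in the sample, the rest are assumed.
UNIT_BYTES = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}

OUTPUT = """\
TIME PID USER GPU INUSE_MB AL_CNT FR_CNT COMM
12:03:24 2806854 root 0 512 B 199 198 python3
12:03:29 2806854 root 0 512 B 402 401 python3
12:03:34 2806854 root 0 512 B 607 606 python3
12:03:39 2806854 root 0 1.00 KiB 802 800 python3
12:03:44 2806854 root 0 1.00 KiB 994 992 python3
12:03:49 2806854 root 0 2.00 KiB 1197 1193 python3
"""


def parse_size(text):
    """Convert a size such as '512 B' or '1.00 KiB' to a byte count."""
    number, unit = text.split()
    return int(float(number) * UNIT_BYTES[unit])


def parse_rows(output):
    """Parse data rows, skipping the header line."""
    rows = []
    for line in output.splitlines()[1:]:
        parts = line.split()
        if len(parts) < 9:
            continue  # not a complete data row
        rows.append({
            "time": parts[0],
            "pid": int(parts[1]),
            # The in-use value contains a space, so take everything between
            # the first four and the last three columns.
            "in_use_bytes": parse_size(" ".join(parts[4:-3])),
            "allocs": int(parts[-3]),
            "frees": int(parts[-2]),
        })
    return rows


def outstanding(rows):
    """Allocations not yet matched by a free, per sample."""
    return [row["allocs"] - row["frees"] for row in rows]


if __name__ == "__main__":
    rows = parse_rows(OUTPUT)
    for row, gap in zip(rows, outstanding(rows)):
        print(row["time"], "outstanding:", gap, "in use:", row["in_use_bytes"], "bytes")
```

In the sample above, the in-use size in each row equals 512 bytes times the difference between the two counters, which is consistent with a steady leak of 512-byte blocks.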
### Show memory-leaked stacktraces
This command prints stack traces responsible for leaked GPU memory allocations.
```sh
./gpuxray memtrace -p 332361 -i 1 --print-stacks
```
Example result
```sh
# ./gpuxray memtrace -p 332361 -i 1 --print-stacks
2026-03-04T15:59:44+07:00
[1] PID: 332361 GPU: 0 StackID: 1908 Remaining Blocks: 1 TotalBytes: 512 B
#00 0x71263447d86e libcudart_static_5382377d5c772c9d197c0cda9fd9742ee6ad893c
#01 0x7126344491c3 libcudart_static_f74e2f2bcf2cf49bd1a61332e1d15bd1e748f9cf
#02 0x71263448d993 cudaMalloc
#03 0x712634420cde __pyx_f_13cupy_backends_4cuda_3api_7runtime_malloc(unsigned long, int)
2026-03-04T15:59:45+07:00
[1] PID: 332361 GPU: 0 StackID: 1908 Remaining Blocks: 1 TotalBytes: 512 B
#00 0x71263447d86e libcudart_static_5382377d5c772c9d197c0cda9fd9742ee6ad893c
#01 0x7126344491c3 libcudart_static_f74e2f2bcf2cf49bd1a61332e1d15bd1e748f9cf
#02 0x71263448d993 cudaMalloc
#03 0x712634420cde __pyx_f_13cupy_backends_4cuda_3api_7runtime_malloc(unsigned long, int)
2026-03-04T15:59:46+07:00
[1] PID: 332361 GPU: 0 StackID: 1908 Remaining Blocks: 2 TotalBytes: 1.00 KiB
#00 0x71263447d86e libcudart_static_5382377d5c772c9d197c0cda9fd9742ee6ad893c
#01 0x7126344491c3 libcudart_static_f74e2f2bcf2cf49bd1a61332e1d15bd1e748f9cf
#02 0x71263448d993 cudaMalloc
#03 0x712634420cde __pyx_f_13cupy_backends_4cuda_3api_7runtime_malloc(unsigned long, int)
^C2026/03/04 15:59:46 Received signal, exiting..
```
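One general way to reason about output like this is outstanding-allocation tracking: record every allocation with the stack that made it, drop the record when the address is freed, and group whatever remains by stack. The sketch below is a conceptual model of that technique only. It is not gpuxray's implementation, and the class and method names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Allocation:
    size: int       # bytes requested
    stack_id: int   # identifier of the call stack that allocated


class OutstandingAllocations:
    """Tracks allocations that have not yet been freed."""

    def __init__(self):
        self._live = {}  # device address -> Allocation

    def on_alloc(self, address, size, stack_id):
        self._live[address] = Allocation(size, stack_id)

    def on_free(self, address):
        # Frees of unknown addresses (e.g. allocated before tracing began)
        # are ignored.
        self._live.pop(address, None)

    def summary(self):
        """Map stack_id -> (remaining_blocks, total_bytes)."""
        by_stack = defaultdict(lambda: [0, 0])
        for alloc in self._live.values():
            entry = by_stack[alloc.stack_id]
            entry[0] += 1
            entry[1] += alloc.size
        return {stack_id: tuple(entry) for stack_id, entry in by_stack.items()}


if __name__ == "__main__":
    tracker = OutstandingAllocations()
    tracker.on_alloc(0x1000, 512, stack_id=1908)
    tracker.on_alloc(0x2000, 512, stack_id=1908)
    tracker.on_free(0x1000)
    print(tracker.summary())
```

Under this model, a stack whose remaining block count keeps rising across intervals, as StackID 1908 does in the output above, is a likely leak site.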
### GPUXRAY for Kubernetes
See [kubernetes.md](./docs/kubernetes.md).
### GPUXRAY for Docker
See [docker.md](./docs/docker.md).
## Debugging
See [debugging.md](./docs/debugging.md).
## Contributing
Contributions are welcome. Feel free to open issues or submit pull requests.