https://github.com/vuvietnguyenit/gpuxray

A lightweight GPU observability tool focused on per-process GPU metrics, with optional deep tracing powered by eBPF.
ebpf gpu gpu-monitoring tracing

Host: GitHub
URL: https://github.com/vuvietnguyenit/gpuxray
Owner: vuvietnguyenit
License: mit
Created: 2026-01-26T04:07:15.000Z (5 months ago)
Default Branch: main
Last Pushed: 2026-03-12T08:49:30.000Z (4 months ago)
Last Synced: 2026-03-12T14:56:26.869Z (4 months ago)
Topics: ebpf, gpu, gpu-monitoring, tracing
Language: Go
Homepage:
Size: 438 KB
Stars: 2
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# GPUXRAY

![GitHub release](https://img.shields.io/github/v/release/vuvietnguyenit/gpuxray)
![Go Version](https://img.shields.io/github/go-mod/go-version/vuvietnguyenit/gpuxray)
![License](https://img.shields.io/github/license/vuvietnguyenit/gpuxray?style=flat)

An open-source observability tool for debugging GPU workloads on Linux servers.

It traces CUDA activity using eBPF and provides:
- per-process GPU metrics
- GPU memory leak detection
- Prometheus metrics for monitoring systems.

This tool is inspired by [pidstat](https://man7.org/linux/man-pages/man1/pidstat.1.html) but designed for GPU monitoring.
GPUXRAY is designed for AI/ML workloads running on GPU servers.

### Why use gpuxray?
- `gpuxray` provides GPU observability at the process level, which is not fully supported by [DCGM exporter](https://github.com/NVIDIA/dcgm-exporter).
- It leverages eBPF to trace CUDA activity from the kernel, enabling low-overhead, deep tracing of GPU workloads.

## Usecases
- Traces and collects statistics from processes that use GPU resources through the CUDA API (e.g., ML jobs, AI workloads).
- Exposes Prometheus metrics for the GPU resources associated with each PID.
- Detects GPU memory leaks. It can show stack traces of leaked GPU memory blocks and identify the CUDA functions responsible for allocations that were not freed.

## Notice
- Currently, this tool only inspects PIDs that use the CUDA Driver API. Processes that use the CUDA Runtime API may be omitted.
- Requires Linux kernel version >= 5.6. Kernel versions in the 6.x series are recommended.
- Currently supports only the `amd64` CPU architecture.
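
The kernel requirement above can be checked before deploying. The following is a minimal sketch and not part of gpuxray: `kernelAtLeast` is a hypothetical helper name, and it assumes the release string has the usual `major.minor...` shape reported by `uname -r`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// kernelAtLeast reports whether a release string such as "5.15.0-91-generic"
// is at least the given major.minor version. (Hypothetical helper.)
func kernelAtLeast(release string, wantMajor, wantMinor int) (bool, error) {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("unexpected kernel release %q", release)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, fmt.Errorf("bad major version in %q: %w", release, err)
	}
	// The minor field may carry a suffix (for example "6-rc1"),
	// so keep only its leading digits.
	minorDigits := parts[1]
	for i, r := range minorDigits {
		if r < '0' || r > '9' {
			minorDigits = minorDigits[:i]
			break
		}
	}
	minor, err := strconv.Atoi(minorDigits)
	if err != nil {
		return false, fmt.Errorf("bad minor version in %q: %w", release, err)
	}
	if major != wantMajor {
		return major > wantMajor, nil
	}
	return minor >= wantMinor, nil
}

func main() {
	// Sample release strings; these are illustrative values only.
	for _, rel := range []string{"5.4.0-150-generic", "5.15.0-91-generic", "6.8.0"} {
		ok, err := kernelAtLeast(rel, 5, 6)
		fmt.Println(rel, ok, err)
	}
}
```

On Linux, the release string can be read from `/proc/sys/kernel/osrelease` and passed to the helper.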

## Table of Contents
- [GPUXRAY](#gpuxray)
  - [Why use gpuxray?](#why-use-gpuxray)
  - [Usecases](#usecases)
  - [Notice](#notice)
  - [Table of Contents](#table-of-contents)
  - [Architecture](#architecture)
  - [Install](#install)
    - [Binary](#binary)
    - [Docker](#docker)
    - [Build from source](#build-from-source)
  - [Quickstart](#quickstart)
    - [Run GPU exporter](#run-gpu-exporter)
    - [Memory statistics](#memory-statistics)
    - [Show memory-leaked stacktraces](#show-memory-leaked-stacktraces)
    - [GPUXRAY for Kubernetes](#gpuxray-for-kubernetes)
    - [GPUXRAY for Docker](#gpuxray-for-docker)
  - [Debugging](#debugging)
  - [Contributing](#contributing)

## Architecture

GPUXRAY combines three components to collect and expose GPU information:

1. **NVML** – retrieves GPU metrics per process
2. **eBPF** – traces CUDA calls
3. **Go exporter** – exposes metrics for Prometheus

## Install

### Binary
Install gpuxray easily with one command:
```sh
curl -s https://raw.githubusercontent.com/vuvietnguyenit/gpuxray/main/install.sh | sh
```
### Docker

```bash
docker pull ghcr.io/vuvietnguyenit/gpuxray:latest
```
### Build from source

```sh
git clone https://github.com/vuvietnguyenit/gpuxray
cd gpuxray
go build -o gpuxray
```

## Quickstart
### Run GPU exporter

Running the exporter exposes metrics related to processes using GPU resources on the server.
```sh
# Run as root
gpuxray mon
```
Metric definitions are available in: [metrics.txt](./metrics.txt)

Example result

```text
curl http://localhost:2112/metrics
...
# HELP gpu_free_memory_bytes Remaining available GPU memory for the process in bytes.
# TYPE gpu_free_memory_bytes gauge
gpu_free_memory_bytes{gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn"} 9.903734784e+09
# HELP gpu_process_active 1 for each process currently using a GPU.
# TYPE gpu_process_active gauge
gpu_process_active{args="/usr/bin/gnome-shell",comm="gnome-shell",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="3112"} 1
gpu_process_active{args="/usr/lib/xorg/Xorg vt1 -displayfd 3 -auth /run/user/120/gdm/Xauthority -nolisten tcp -background none -noreset -keeptty -novtswitch -verbose 3",comm="Xorg",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="2912"} 1
gpu_process_active{args="python -m src.models.classifier --train-file /shared_storage/ailab/intent-classifier/train/raw-click.gz --valid-file /shared_storage/ailab/intent-classifier/valid/raw-click.gz --model-name vinai/phobert-base-v2",comm="python",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="401948"} 1
# HELP gpu_process_sm_utilization_percent GPU SM utilisation of the process (0–100). Requires NVML r470+ drivers; returns 0 on older drivers.
# TYPE gpu_process_sm_utilization_percent gauge
gpu_process_sm_utilization_percent{args="/usr/bin/gnome-shell",comm="gnome-shell",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="3112"} 0
gpu_process_sm_utilization_percent{args="/usr/lib/xorg/Xorg vt1 -displayfd 3 -auth /run/user/120/gdm/Xauthority -nolisten tcp -background none -noreset -keeptty -novtswitch -verbose 3",comm="Xorg",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="2912"} 0
gpu_process_sm_utilization_percent{args="python -m src.models.classifier --train-file /shared_storage/ailab/intent-classifier/train/raw-click.gz --valid-file /shared_storage/ailab/intent-classifier/valid/raw-click.gz --model-name vinai/phobert-base-v2",comm="python",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="401948"} 86
# HELP gpu_process_used_memory_bytes GPU memory consumed by the process in bytes.
# TYPE gpu_process_used_memory_bytes gauge
gpu_process_used_memory_bytes{args="/usr/bin/gnome-shell",comm="gnome-shell",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="3112"} 1.1296768e+07
gpu_process_used_memory_bytes{args="/usr/lib/xorg/Xorg vt1 -displayfd 3 -auth /run/user/120/gdm/Xauthority -nolisten tcp -background none -noreset -keeptty -novtswitch -verbose 3",comm="Xorg",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="2912"} 1.0575872e+07
gpu_process_used_memory_bytes{args="python -m src.models.classifier --train-file /shared_storage/ailab/intent-classifier/train/raw-click.gz --valid-file /shared_storage/ailab/intent-classifier/valid/raw-click.gz --model-name vinai/phobert-base-v2",comm="python",gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn",pid="401948"} 2.3716691968e+10
# HELP gpu_total_memory_bytes Total GPU memory available in bytes.
# TYPE gpu_total_memory_bytes gauge
gpu_total_memory_bytes{gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn"} 3.4190917632e+10
# HELP gpu_used_memory_bytes GPU memory currently allocated in bytes.
# TYPE gpu_used_memory_bytes gauge
gpu_used_memory_bytes{gpu="GPU-47def375-4603-e5fa-82d3-c7cddc81e65a",gpu_index="0",hostname="gpu1.itim.vn"} 2.4287182848e+10
...
```
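
Output like the above can also be consumed without a Prometheus server. The following is a minimal sketch that extracts `gpu_process_used_memory_bytes` per PID from the text format. `parseSample` and `usedMemoryByPID` are hypothetical helpers, not part of gpuxray, and the parser deliberately handles only labelled samples without timestamps.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sample is one parsed line of Prometheus text exposition output.
type sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// parseSample parses a labelled line such as `metric{a="x",b="y"} 1.5`.
// Quoted label values may contain spaces, commas, and backslash escapes.
// Timestamps and unlabelled samples are not handled in this sketch.
func parseSample(line string) (sample, error) {
	s := sample{Labels: map[string]string{}}
	open := strings.IndexByte(line, '{')
	if open < 0 {
		return s, fmt.Errorf("no label set in %q", line)
	}
	s.Name = line[:open]
	i := open + 1
	for i < len(line) && line[i] != '}' {
		eq := strings.IndexByte(line[i:], '=')
		if eq < 0 || i+eq+1 >= len(line) || line[i+eq+1] != '"' {
			return s, fmt.Errorf("malformed labels in %q", line)
		}
		key := line[i : i+eq]
		j := i + eq + 2 // first byte after the opening quote
		var val strings.Builder
		for j < len(line) && line[j] != '"' {
			if line[j] == '\\' && j+1 < len(line) {
				j++ // skip the backslash and look at the escaped byte
				if line[j] == 'n' {
					val.WriteByte('\n')
					j++
					continue
				}
			}
			val.WriteByte(line[j])
			j++
		}
		if j >= len(line) {
			return s, fmt.Errorf("unterminated label value in %q", line)
		}
		s.Labels[key] = val.String()
		i = j + 1 // move past the closing quote
		if i < len(line) && line[i] == ',' {
			i++
		}
	}
	if i >= len(line) {
		return s, fmt.Errorf("missing '}' in %q", line)
	}
	v, err := strconv.ParseFloat(strings.TrimSpace(line[i+1:]), 64)
	s.Value = v
	return s, err
}

// usedMemoryByPID maps each pid label to its gpu_process_used_memory_bytes value.
func usedMemoryByPID(metrics string) (map[string]float64, error) {
	out := map[string]float64{}
	for _, line := range strings.Split(metrics, "\n") {
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, "gpu_process_used_memory_bytes{") {
			continue // skip comments, blank lines, and other metrics
		}
		s, err := parseSample(line)
		if err != nil {
			return nil, err
		}
		out[s.Labels["pid"]] = s.Value
	}
	return out, nil
}

func main() {
	// Two lines adapted from the example output, with labels trimmed for brevity.
	input := `# TYPE gpu_process_used_memory_bytes gauge
gpu_process_used_memory_bytes{args="/usr/bin/gnome-shell",comm="gnome-shell",gpu_index="0",pid="3112"} 1.1296768e+07
gpu_process_used_memory_bytes{args="/usr/lib/xorg/Xorg vt1 -displayfd 3",comm="Xorg",gpu_index="0",pid="2912"} 1.0575872e+07`
	used, err := usedMemoryByPID(input)
	if err != nil {
		panic(err)
	}
	fmt.Printf("pid 3112: %.0f bytes\n", used["3112"])
	fmt.Printf("pid 2912: %.0f bytes\n", used["2912"])
}
```

The hand-written label scanner is there because the `args` label contains spaces, so splitting the line on whitespace would not work.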

### Memory statistics

This command reports GPU memory usage statistics for a given process.

```sh
./gpuxray memtrace -p 2806854
```

Example result

```sh
# ./gpuxray memtrace -p 2806854
TIME      PID      USER  GPU  INUSE_MB  AL_CNT  FR_CNT  COMM
12:03:24  2806854  root  0    512 B     199     198     python3
12:03:29  2806854  root  0    512 B     402     401     python3
12:03:34  2806854  root  0    512 B     607     606     python3
12:03:39  2806854  root  0    1.00 KiB  802     800     python3
12:03:44  2806854  root  0    1.00 KiB  994     992     python3
12:03:49  2806854  root  0    2.00 KiB  1197    1193    python3
```

To see the meaning of each column, run `./gpuxray memtrace -h`.
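
One way to read this output is to track the gap between the two counters. The sketch below assumes `AL_CNT` counts allocations and `FR_CNT` counts frees. That reading is inferred from the column names, so confirm it with `-h`. `outstanding` is a hypothetical helper, not part of gpuxray.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// outstanding returns AL_CNT minus FR_CNT for one memtrace data row.
// Columns are read from the right because the in-use column ("512 B")
// contains a space. This assumes COMM itself contains no spaces.
func outstanding(row string) (int, error) {
	f := strings.Fields(row)
	if len(f) < 8 {
		return 0, fmt.Errorf("unexpected row %q", row)
	}
	allocs, err := strconv.Atoi(f[len(f)-3])
	if err != nil {
		return 0, fmt.Errorf("bad AL_CNT in %q: %w", row, err)
	}
	frees, err := strconv.Atoi(f[len(f)-2])
	if err != nil {
		return 0, fmt.Errorf("bad FR_CNT in %q: %w", row, err)
	}
	return allocs - frees, nil
}

func main() {
	// First and last data rows from the example output above.
	rows := []string{
		"12:03:24 2806854 root 0 512 B 199 198 python3",
		"12:03:49 2806854 root 0 2.00 KiB 1197 1193 python3",
	}
	for _, r := range rows {
		n, err := outstanding(r)
		if err != nil {
			panic(err)
		}
		fmt.Println(n) // prints 1, then 4
	}
}
```

In the example output the gap grows from 1 to 4 while in-use memory grows from 512 B to 2.00 KiB, which is the pattern a steady leak of 512-byte blocks would produce.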

### Show memory-leaked stacktraces

This command prints stack traces responsible for leaked GPU memory allocations.

```sh
./gpuxray memtrace -p 332361 -i 1 --print-stacks
```

Example result

```sh
# ./gpuxray memtrace -p 332361 -i 1 --print-stacks
2026-03-04T15:59:44+07:00
[1] PID: 332361 GPU: 0 StackID: 1908 Remaining Blocks: 1 TotalBytes: 512 B
#00 0x71263447d86e libcudart_static_5382377d5c772c9d197c0cda9fd9742ee6ad893c
#01 0x7126344491c3 libcudart_static_f74e2f2bcf2cf49bd1a61332e1d15bd1e748f9cf
#02 0x71263448d993 cudaMalloc
#03 0x712634420cde __pyx_f_13cupy_backends_4cuda_3api_7runtime_malloc(unsigned long, int)

2026-03-04T15:59:45+07:00
[1] PID: 332361 GPU: 0 StackID: 1908 Remaining Blocks: 1 TotalBytes: 512 B
#00 0x71263447d86e libcudart_static_5382377d5c772c9d197c0cda9fd9742ee6ad893c
#01 0x7126344491c3 libcudart_static_f74e2f2bcf2cf49bd1a61332e1d15bd1e748f9cf
#02 0x71263448d993 cudaMalloc
#03 0x712634420cde __pyx_f_13cupy_backends_4cuda_3api_7runtime_malloc(unsigned long, int)

2026-03-04T15:59:46+07:00
[1] PID: 332361 GPU: 0 StackID: 1908 Remaining Blocks: 2 TotalBytes: 1.00 KiB
#00 0x71263447d86e libcudart_static_5382377d5c772c9d197c0cda9fd9742ee6ad893c
#01 0x7126344491c3 libcudart_static_f74e2f2bcf2cf49bd1a61332e1d15bd1e748f9cf
#02 0x71263448d993 cudaMalloc
#03 0x712634420cde __pyx_f_13cupy_backends_4cuda_3api_7runtime_malloc(unsigned long, int)

^C2026/03/04 15:59:46 Received signal, exiting..
```
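
The `Remaining Blocks` and `TotalBytes` fields can be understood as a per-stack summary of allocations that have not been freed. The sketch below shows that bookkeeping in user space under stated assumptions. It is a general technique, not gpuxray's actual implementation, which this README does not describe. All type and method names (`tracker`, `onAlloc`, `onFree`, `summary`) are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// allocation records one live GPU allocation and the stack that made it.
type allocation struct {
	Size    uint64
	StackID int
}

// tracker holds allocations that were made but not yet freed,
// keyed by device pointer.
type tracker struct {
	live map[uint64]allocation
}

func newTracker() *tracker {
	return &tracker{live: map[uint64]allocation{}}
}

// onAlloc would be called for each observed allocation event.
func (t *tracker) onAlloc(ptr, size uint64, stackID int) {
	t.live[ptr] = allocation{Size: size, StackID: stackID}
}

// onFree would be called for each observed free event.
func (t *tracker) onFree(ptr uint64) {
	delete(t.live, ptr)
}

// stackSummary aggregates the outstanding allocations of one stack.
type stackSummary struct {
	StackID int
	Blocks  int
	Bytes   uint64
}

// summary groups live allocations by stack, largest total first.
func (t *tracker) summary() []stackSummary {
	byStack := map[int]*stackSummary{}
	for _, a := range t.live {
		s, ok := byStack[a.StackID]
		if !ok {
			s = &stackSummary{StackID: a.StackID}
			byStack[a.StackID] = s
		}
		s.Blocks++
		s.Bytes += a.Size
	}
	out := make([]stackSummary, 0, len(byStack))
	for _, s := range byStack {
		out = append(out, *s)
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Bytes != out[j].Bytes {
			return out[i].Bytes > out[j].Bytes
		}
		return out[i].StackID < out[j].StackID
	})
	return out
}

func main() {
	tr := newTracker()
	// Illustrative events: two unfreed 512-byte blocks from one stack,
	// and one block from another stack that is freed again.
	tr.onAlloc(0x1000, 512, 1908)
	tr.onAlloc(0x2000, 512, 1908)
	tr.onAlloc(0x3000, 256, 7)
	tr.onFree(0x3000)
	for _, s := range tr.summary() {
		fmt.Printf("StackID: %d Remaining Blocks: %d TotalBytes: %d\n", s.StackID, s.Blocks, s.Bytes)
	}
}
```

With these illustrative events, only stack 1908 remains in the summary, with two blocks totalling 1024 bytes, which mirrors the last snapshot in the example output.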

### GPUXRAY for Kubernetes
See [kubernetes.md](./docs/kubernetes.md).

### GPUXRAY for Docker
See [docker.md](./docs/docker.md).

## Debugging
See [debugging.md](./docs/debugging.md).

## Contributing
Contributions are welcome. Feel free to open issues or submit pull requests.
