https://github.com/vimalk78/my-gpu-exporter
- Host: GitHub
- URL: https://github.com/vimalk78/my-gpu-exporter
- Owner: vimalk78
- Created: 2025-11-20T08:26:15.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-11-26T11:11:52.000Z (7 months ago)
- Last Synced: 2025-12-25T20:12:57.340Z (6 months ago)
- Language: Go
- Size: 77.1 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
# My GPU Exporter
A Prometheus exporter that exposes **per-process GPU energy consumption** for Kubernetes workloads.
## Key Features
- ✅ **Hardware-Measured Energy** - Direct from GPU hardware (when available)
- ✅ **SM-Based Estimation** - Automatic fallback when time-slicing detected
- ✅ **Per-Process Attribution** - Accurate workload-level power consumption
- ✅ **Kubernetes Integration** - Automatic pod/namespace/container labeling
- ✅ **GPU Time-Slicing Support** - Each pod gets its own measured/estimated energy
- ✅ **Process Lifecycle Management** - Retains metrics after process exits
- ✅ **Prometheus Native** - Standard metrics and labels
## Energy Measurement: Hardware vs Estimation
This exporter prioritizes **hardware-measured energy** from DCGM but automatically falls back to **SM-based estimation** when needed.
### Hardware-Measured Energy (Default)
**When a single process runs per GPU:**
- ✅ Uses DCGM's per-process energy API (`dcgm.GetProcessInfo().EnergyConsumed`)
- ✅ Hardware telemetry from the GPU
- ✅ Actual energy consumed by each process
- ✅ Most accurate attribution
- ✅ Label: `energy_estimated="false"`
### SM-Based Estimation (Fallback)
**When time-slicing is detected and DCGM reports identical energy values (a DCGM bug):**
- ✅ Automatically detects the issue
- ✅ Estimates energy using SM utilization ratios
- ✅ Formula: `process_energy = gpu_energy × (process_sm_util / total_sm_util)`, where `gpu_energy` is the GPU's total energy over the same interval
- ✅ Logs estimation mode for transparency
- ✅ Label: `energy_estimated="true"`
- ℹ️ Enable/disable with `--enable-energy-estimation` (enabled by default)
**Why estimation is needed:** DCGM has a bug where all time-sliced processes report identical energy values (the GPU total). See [DCGM Time-Slicing Energy Bug](docs/DCGM-TIME-SLICING-ENERGY-BUG.md) for details.
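The SM-based split described in this section can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the exporter's actual code: `estimateEnergy`, its parameters, and the even split used when total SM utilization is zero are all hypothetical.

```go
package main

import "fmt"

// estimateEnergy splits gpuEnergyJoules across processes in proportion to
// each process's SM utilization. Hypothetical helper, not the exporter's API.
func estimateEnergy(gpuEnergyJoules float64, smUtil map[int]float64) map[int]float64 {
	total := 0.0
	for _, u := range smUtil {
		total += u
	}
	out := make(map[int]float64, len(smUtil))
	for pid, u := range smUtil {
		if total == 0 {
			// Assumption: with no measurable SM activity, split evenly.
			out[pid] = gpuEnergyJoules / float64(len(smUtil))
			continue
		}
		out[pid] = gpuEnergyJoules * u / total
	}
	return out
}

func main() {
	// Two processes at 60% and 40% SM utilization sharing 200 J.
	shares := estimateEnergy(200, map[int]float64{101: 0.6, 102: 0.4})
	fmt.Printf("pid 101: %.1f J, pid 102: %.1f J\n", shares[101], shares[102])
}
```

Because the shares are ratios of the same total, they always sum back to the GPU-level energy, which is what makes the per-process values additive.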
## Requirements
### Hardware
- NVIDIA GPU with Volta architecture or newer
- GPU must support DCGM per-process energy tracking
### Software
- NVIDIA Driver
- DCGM library (Data Center GPU Manager)
- **GPU Accounting Mode enabled**: `nvidia-smi -am 1` (must run as root)
- Kubernetes 1.20+ (for pod-resources API)
### Privileges
- Root access OR
- GPU accounting mode pre-enabled on all nodes
## Installation
### Kubernetes (Recommended)
**Note:** For OpenShift deployment, see [OpenShift Deployment Guide](docs/openshift-deployment.md).
1. **Create namespace and deploy:**
```bash
kubectl apply -f kubernetes/daemonset.yaml
```
The DaemonSet includes an init container that automatically enables GPU accounting mode.
2. **Verify deployment:**
```bash
kubectl -n gpu-monitoring get pods
kubectl -n gpu-monitoring logs -l app=my-gpu-exporter
```
3. **Check metrics:**
```bash
kubectl -n gpu-monitoring port-forward svc/my-gpu-exporter 9400:9400
curl http://localhost:9400/metrics
```
### Docker
```bash
# Build image
docker build -t my-gpu-exporter:latest .
# Run (requires GPU access and privileged mode)
docker run -d \
--name my-gpu-exporter \
--gpus all \
--privileged \
--pid=host \
--network=host \
-v /var/lib/kubelet/pod-resources:/var/lib/kubelet/pod-resources:ro \
-v /proc:/proc:ro \
my-gpu-exporter:latest
```
### Binary
```bash
# Build
make build
# Run (requires root for GPU access)
sudo ./my-gpu-exporter --log-level=info
```
## Configuration
### Command-line Flags
```bash
--dcgm-update-frequency=1s # DCGM sampling frequency
--process-scan-interval=10s # How often to scan for GPU processes
--kubernetes-enabled=true # Enable Kubernetes pod mapping
--pod-resources-socket=/var/lib/kubelet/pod-resources/kubelet.sock
--metric-retention=5m # Retain exited process metrics
--metric-prefix=my_gpu_process # Prometheus metric name prefix
--enable-energy-estimation=true # Enable SM-based estimation for time-slicing
--listen-address=:9400 # HTTP server address
--metrics-path=/metrics # Metrics endpoint path
--log-level=info # Log level (debug, info, warn, error)
```
## Metrics
### Per-Process Metrics
All per-process metrics include these labels:
- `pid` - Process ID
- `gpu` - GPU index
- `process_name` - Process executable name
- `pod` - Kubernetes pod name
- `namespace` - Kubernetes namespace
- `container` - Container name
- `container_id` - Container ID
#### Energy (Counter)
```prometheus
# Hardware-measured (single process)
my_gpu_process_energy_joules_total{...,energy_estimated="false"} 15234.5
# SM-based estimation (time-slicing with DCGM bug)
my_gpu_process_energy_joules_total{...,energy_estimated="true"} 8421.2
```
**Energy Source:**
- `energy_estimated="false"` - Hardware-measured from GPU (most accurate)
- `energy_estimated="true"` - SM-based estimation (automatic fallback for time-slicing)
**Query to filter by energy source:**
```promql
# Only hardware-measured energy
my_gpu_process_energy_joules_total{energy_estimated="false"}
# Only estimated energy
my_gpu_process_energy_joules_total{energy_estimated="true"}
```
**Usage:**
```promql
# Average power in Watts
rate(my_gpu_process_energy_joules_total[1m])
# Total energy consumed in last hour (Joules)
increase(my_gpu_process_energy_joules_total{pod="training-job"}[1h])
```
#### Utilization (Gauges)
```prometheus
my_gpu_process_sm_utilization_ratio{...} 0.85
my_gpu_process_memory_utilization_ratio{...} 0.72
```
Values are 0.0-1.0 (0%-100%).
#### Memory (Gauge)
```prometheus
my_gpu_process_memory_used_bytes{...} 8589934592
```
GPU memory used by process in bytes.
#### Lifecycle (Gauges)
```prometheus
my_gpu_process_start_time_seconds{...} 1699564800
my_gpu_process_active{...} 1 # 1=running, 0=exited
```
### GPU-Level Aggregation Metrics
These metrics aggregate per-process data at the GPU level, useful for validating time-slicing:
#### Total GPU Energy (Counter)
```prometheus
my_gpu_process_gpu_energy_joules_total{gpu="0"} 45234.5
```
Sum of energy consumed by all processes on this GPU. With time-slicing, this represents the total GPU energy distributed across multiple processes.
#### GPU Process Count (Gauge)
```prometheus
my_gpu_process_gpu_process_count{gpu="0"} 3
```
Number of active processes on this GPU. A value greater than 1 indicates that time-slicing is active.
**Usage:**
```promql
# Detect time-slicing (GPUs with multiple processes)
my_gpu_process_gpu_process_count > 1
# Total power per GPU (Watts)
rate(my_gpu_process_gpu_energy_joules_total[1m])
# Verify: GPU total should equal sum of per-process
rate(my_gpu_process_gpu_energy_joules_total{gpu="0"}[1m])
== on (gpu)
sum by (gpu) (rate(my_gpu_process_energy_joules_total{gpu="0"}[1m]))
```
## Example Queries
### Power Consumption
```promql
# Current power per pod (Watts)
rate(my_gpu_process_energy_joules_total{namespace="ml"}[1m])
# Total power across all pods
sum(rate(my_gpu_process_energy_joules_total[1m]))
# Power by namespace
sum by (namespace) (rate(my_gpu_process_energy_joules_total[5m]))
```
### Energy Accounting
```promql
# Energy consumed by pod in last hour (Joules)
increase(my_gpu_process_energy_joules_total{pod="training-job-123"}[1h])
# Convert to kWh (1 kWh = 3,600,000 J)
increase(my_gpu_process_energy_joules_total{pod="training-job-123"}[1h]) / 3600000
# Total energy cost (assuming $0.10/kWh)
(increase(my_gpu_process_energy_joules_total[1h]) / 3600000) * 0.10
```
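The unit conversion behind these queries is simple arithmetic: 1 kWh is 3,600,000 J. The sketch below works through it in Go with hypothetical helper names; the $0.10/kWh price is the same illustrative figure used in the query above.

```go
package main

import "fmt"

// joulesPerKWh is 1000 W * 3600 s.
const joulesPerKWh = 3.6e6

// joulesToKWh converts an energy amount from joules to kilowatt-hours.
func joulesToKWh(joules float64) float64 {
	return joules / joulesPerKWh
}

// energyCost returns the cost of the given energy at pricePerKWh.
func energyCost(joules, pricePerKWh float64) float64 {
	return joulesToKWh(joules) * pricePerKWh
}

func main() {
	// A process drawing a steady 200 W for one hour uses 200 * 3600 = 720000 J.
	joules := 200.0 * 3600.0
	fmt.Printf("%.3f kWh, cost $%.4f\n", joulesToKWh(joules), energyCost(joules, 0.10))
}
```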
### Efficiency
```promql
# Power efficiency (compute utilization per Watt)
my_gpu_process_sm_utilization_ratio / rate(my_gpu_process_energy_joules_total[1m])
# Most power-hungry pods
topk(10, rate(my_gpu_process_energy_joules_total[5m]))
```
### Active Processes
```promql
# Count of active GPU processes
sum(my_gpu_process_active)
# Active processes per namespace
sum by (namespace) (my_gpu_process_active)
```
## Time-Slicing Support
my-gpu-exporter **fully supports GPU time-slicing** with automatic detection and intelligent energy attribution:
### Features
1. **Automatic Detection**: Detects when multiple processes share a GPU
2. **Smart Energy Attribution**:
- **Hardware-measured** (preferred): Uses DCGM when values are differentiated
- **SM-based estimation** (fallback): Automatically applied when DCGM reports identical values (bug)
3. **Transparent Labeling**: `energy_estimated` label indicates measurement method
4. **Validation**: Detects and logs DCGM time-slicing bug
5. **Aggregation Metrics**: GPU-level totals for validation
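The "identical values" check in the list above could look roughly like the following. This is only a sketch: `looksLikeSharedTotal`, the tolerance, and the two-process minimum are assumptions, not the exporter's actual detection logic.

```go
package main

import (
	"fmt"
	"math"
)

// looksLikeSharedTotal reports whether every per-process energy reading on a
// GPU is (nearly) the same value, the symptom of the DCGM time-slicing bug.
// Hypothetical helper: the tolerance is an assumed parameter.
func looksLikeSharedTotal(energyJoules []float64, tolerance float64) bool {
	if len(energyJoules) < 2 {
		return false // a single process is not time-sliced
	}
	first := energyJoules[0]
	for _, e := range energyJoules[1:] {
		if math.Abs(e-first) > tolerance {
			return false
		}
	}
	return true
}

func main() {
	// Three processes all reporting the GPU total: suspicious.
	fmt.Println(looksLikeSharedTotal([]float64{245.3, 245.3, 245.3}, 1e-6))
	// Differentiated readings: hardware values can be used directly.
	fmt.Println(looksLikeSharedTotal([]float64{120.0, 80.0}, 1e-6))
}
```

When the check returns true, the fallback would switch that GPU's processes to SM-based estimation and label them `energy_estimated="true"`.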
### Testing Time-Slicing
See [Time-Slicing Testing Guide](docs/TIMESLICING-TEST.md) for comprehensive testing instructions.
Quick validation:
```bash
# Deploy test workload (3 pods sharing GPU)
kubectl apply -f timeslicing-test.yaml
# Check metrics show different energy per process
curl http://exporter:9400/metrics | grep energy_joules_total
# Verify process count > 1 (indicates time-slicing)
curl http://exporter:9400/metrics | grep gpu_process_count
```
### Logs
**When time-slicing is detected with proper DCGM values:**
```
INFO Time-slicing detected: multiple processes on same GPU gpu=0 process_count=3
DEBUG Time-slicing validation: energy values properly differentiated gpu=0 process_count=3
```
**When the DCGM bug is detected (identical values) and estimation is applied:**
```
INFO Time-slicing detected: multiple processes on same GPU gpu=0 process_count=3
INFO Applying SM-based energy estimation for time-sliced processes gpu=0 process_count=3
DEBUG Applied energy estimation pid=12345 pod=training-job sm_util=0.39 proportion=0.78 estimated_energy_J=245.3
```
**If estimation is disabled:**
```
WARN SUSPICIOUS: All time-sliced processes show identical energy values (estimation disabled) hint="Enable --enable-energy-estimation to use SM-based estimation"
```
## Comparison with dcgm-exporter
| Feature | dcgm-exporter | my-gpu-exporter |
|---------|---------------|-----------------|
| **Scope** | GPU-level | Process-level |
| **Power metric** | `DCGM_FI_DEV_POWER_USAGE` | `my_gpu_process_energy_joules_total` |
| **Time-slicing** | Duplicates same value | Smart attribution (measured or estimated) |
| **Time-slice detection** | No | Yes (automatic) |
| **DCGM bug detection** | No | Yes (with auto-fallback) |
| **Energy attribution** | N/A (whole GPU) | Hardware-measured (preferred), SM-estimated (fallback) |
| **Transparency** | N/A | `energy_estimated` label shows method |
| **Use case** | GPU monitoring | Workload cost attribution |
### Example with Time-Slicing
**dcgm-exporter** (both show 200W):
```prometheus
DCGM_FI_DEV_POWER_USAGE{gpu="0",exported_pod="pod-a"} 200
DCGM_FI_DEV_POWER_USAGE{gpu="0",exported_pod="pod-b"} 200
# Sum = 400W (wrong - GPU only uses 200W!)
```
**my-gpu-exporter** (intelligent attribution):
```prometheus
# Scenario 1: DCGM provides correct per-process values (hardware-measured)
my_gpu_process_energy_joules_total{gpu="0",pod="pod-a",energy_estimated="false"} 120
my_gpu_process_energy_joules_total{gpu="0",pod="pod-b",energy_estimated="false"} 80
# Scenario 2: DCGM bug detected, SM-based estimation applied
# (pod-a has 60% SM util, pod-b has 40% SM util, GPU power is 200W)
my_gpu_process_energy_joules_total{gpu="0",pod="pod-a",energy_estimated="true"} 120
my_gpu_process_energy_joules_total{gpu="0",pod="pod-b",energy_estimated="true"} 80
# GPU-level aggregation (always correct)
my_gpu_process_gpu_energy_joules_total{gpu="0"} 200
# Process count (indicates time-slicing)
my_gpu_process_gpu_process_count{gpu="0"} 2
```
## Troubleshooting
### No metrics appearing
1. **Check GPU accounting mode:**
```bash
nvidia-smi -q | grep "Accounting Mode"
# Should show: Enabled
```
If disabled:
```bash
sudo nvidia-smi -am 1
```
2. **Check DCGM is working:**
```bash
dcgmi discovery -l
```
3. **Check exporter logs:**
```bash
kubectl -n gpu-monitoring logs -l app=my-gpu-exporter
```
### Energy values are zero
- GPU accounting mode must be enabled **before** processes start
- Restart GPU workloads after enabling accounting mode
- Wait 3+ seconds after process starts for DCGM to collect data
### "Failed to get container ID"
- Exporter needs access to `/proc/<PID>/cgroup`
- Ensure `hostPID: true` in DaemonSet
- Ensure `/proc` volume is mounted
### "Failed to get pod info"
- Check kubelet pod-resources socket exists:
```bash
ls -la /var/lib/kubelet/pod-resources/kubelet.sock
```
- Ensure socket is mounted in container
- Check Kubernetes version (requires 1.20+)
### Processes not showing up
- Exporter **only tracks Kubernetes pods**, not system processes
- Verify process is running in a container:
```bash
cat /proc/<PID>/cgroup
```
## Architecture
```
┌─────────────────────────────────────┐
│ my-gpu-exporter │
├─────────────────────────────────────┤
│ │
│ ┌──────────────────────────────┐ │
│ │ NVML Process Discovery │ │
│ │ (GetComputeRunningProcs) │ │
│ └──────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────┐ │
│ │ DCGM Process Metrics │ │
│ │ GetProcessInfo() │ │
│ │ → EnergyConsumed (ACTUAL) │ │
│ └──────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────┐ │
│ │ Kubernetes Pod Mapper │ │
│ │ /proc/PID/cgroup │ │
│ │ + Pod Resources API │ │
│ └──────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────┐ │
│ │ Prometheus Exporter │ │
│ │ /metrics endpoint │ │
│ └──────────────────────────────┘ │
└─────────────────────────────────────┘
```
## Contributing
Contributions welcome! Please ensure:
- Code follows Go best practices
- Tests pass
- Documentation is updated
- No new estimation or approximation beyond the documented SM-based fallback (prefer actual measurements)
## License
[Add your license here]
## Acknowledgments
- NVIDIA DCGM team for per-process energy API
- Prometheus community
- Kubernetes sig-node for pod-resources API