{"id":34567349,"url":"https://github.com/vimalk78/my-gpu-exporter","last_synced_at":"2026-05-29T10:31:16.572Z","repository":{"id":326213526,"uuid":"1100400030","full_name":"vimalk78/my-gpu-exporter","owner":"vimalk78","description":null,"archived":false,"fork":false,"pushed_at":"2025-11-26T11:11:52.000Z","size":79,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-25T20:12:57.340Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vimalk78.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-20T08:26:15.000Z","updated_at":"2025-11-26T11:11:55.000Z","dependencies_parsed_at":null,"dependency_job_id":"0f5f4e93-f864-4fc3-9a87-e061f3e253de","html_url":"https://github.com/vimalk78/my-gpu-exporter","commit_stats":null,"previous_names":["vimalk78/my-gpu-exporter"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/vimalk78/my-gpu-exporter","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vimalk78%2Fmy-gpu-exporter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vimalk78%2Fmy-gpu-exporter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vimalk78%2Fmy-gpu-exporter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vimalk78%2Fmy-gpu-exporter/manifes
ts","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vimalk78","download_url":"https://codeload.github.com/vimalk78/my-gpu-exporter/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vimalk78%2Fmy-gpu-exporter/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33648530,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-29T02:00:06.066Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-12-24T09:10:57.854Z","updated_at":"2026-05-29T10:31:16.556Z","avatar_url":"https://github.com/vimalk78.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# My GPU Exporter\n\nA Prometheus exporter that exposes **per-process GPU energy consumption** for Kubernetes workloads.\n\n## Key Features\n\n- ✅ **Hardware-Measured Energy** - Direct from GPU hardware (when available)\n- ✅ **SM-Based Estimation** - Automatic fallback when time-slicing detected\n- ✅ **Per-Process Attribution** - Accurate workload-level power consumption\n- ✅ **Kubernetes Integration** - Automatic pod/namespace/container labeling\n- ✅ **GPU Time-Slicing Support** - Each pod gets its own measured/estimated energy\n- ✅ **Process Lifecycle Management** - Retains metrics after process exits\n- ✅ **Prometheus Native** - Standard metrics 
and labels\n\n## Energy Measurement: Hardware vs Estimation\n\nThis exporter prioritizes **hardware-measured energy** from DCGM but automatically falls back to **SM-based estimation** when needed.\n\n### Hardware-Measured Energy (Default)\n\n**When single process per GPU:**\n- ✅ Uses DCGM's per-process energy API (`dcgm.GetProcessInfo().EnergyConsumed`)\n- ✅ Hardware telemetry from the GPU\n- ✅ Actual energy consumed by each process\n- ✅ Most accurate attribution\n- ✅ Label: `energy_estimated=\"false\"`\n\n### SM-Based Estimation (Fallback)\n\n**When time-slicing detected with DCGM bug (identical energy values):**\n- ✅ Automatically detects the issue\n- ✅ Estimates energy using SM utilization ratios\n- ✅ Formula: `process_energy = gpu_power × (process_sm_util / total_sm_util)`\n- ✅ Logs estimation mode for transparency\n- ✅ Label: `energy_estimated=\"true\"`\n- ℹ️ Enable/disable with `--enable-energy-estimation` (enabled by default)\n\n**Why estimation is needed:** DCGM has a bug where all time-sliced processes report identical energy values (the GPU total). See [DCGM Time-Slicing Energy Bug](docs/DCGM-TIME-SLICING-ENERGY-BUG.md) for details.\n\n## Requirements\n\n### Hardware\n- NVIDIA GPU with Volta architecture or newer\n- GPU must support DCGM per-process energy tracking\n\n### Software\n- NVIDIA Driver\n- DCGM library (Data Center GPU Manager)\n- **GPU Accounting Mode enabled**: `nvidia-smi -am 1` (must run as root)\n- Kubernetes 1.20+ (for pod-resources API)\n\n### Privileges\n- Root access OR\n- GPU accounting mode pre-enabled on all nodes\n\n## Installation\n\n### Kubernetes (Recommended)\n\n**Note:** For OpenShift deployment, see [OpenShift Deployment Guide](docs/openshift-deployment.md).\n\n1. **Create namespace and deploy:**\n```bash\nkubectl apply -f kubernetes/daemonset.yaml\n```\n\nThe DaemonSet includes an init container that automatically enables GPU accounting mode.\n\n2. 
**Verify deployment:**\n```bash\nkubectl -n gpu-monitoring get pods\nkubectl -n gpu-monitoring logs -l app=my-gpu-exporter\n```\n\n3. **Check metrics:**\n```bash\nkubectl -n gpu-monitoring port-forward svc/my-gpu-exporter 9400:9400\ncurl http://localhost:9400/metrics\n```\n\n### Docker\n\n```bash\n# Build image\ndocker build -t my-gpu-exporter:latest .\n\n# Run (requires GPU access and privileged mode)\ndocker run -d \\\n  --name my-gpu-exporter \\\n  --gpus all \\\n  --privileged \\\n  --pid=host \\\n  --network=host \\\n  -v /var/lib/kubelet/pod-resources:/var/lib/kubelet/pod-resources:ro \\\n  -v /proc:/proc:ro \\\n  my-gpu-exporter:latest\n```\n\n### Binary\n\n```bash\n# Build\nmake build\n\n# Run (requires root for GPU access)\nsudo ./my-gpu-exporter --log-level=info\n```\n\n## Configuration\n\n### Command-line Flags\n\n```bash\n--dcgm-update-frequency=1s          # DCGM sampling frequency\n--process-scan-interval=10s         # How often to scan for GPU processes\n--kubernetes-enabled=true           # Enable Kubernetes pod mapping\n--pod-resources-socket=/var/lib/kubelet/pod-resources/kubelet.sock\n--metric-retention=5m               # Retain exited process metrics\n--metric-prefix=my_gpu_process      # Prometheus metric name prefix\n--enable-energy-estimation=true     # Enable SM-based estimation for time-slicing\n--listen-address=:9400              # HTTP server address\n--metrics-path=/metrics             # Metrics endpoint path\n--log-level=info                    # Log level (debug, info, warn, error)\n```\n\n## Metrics\n\n### Per-Process Metrics\n\nAll per-process metrics include these labels:\n- `pid` - Process ID\n- `gpu` - GPU index\n- `process_name` - Process executable name\n- `pod` - Kubernetes pod name\n- `namespace` - Kubernetes namespace\n- `container` - Container name\n- `container_id` - Container ID\n\n#### Energy (Counter)\n\n```prometheus\n# Hardware-measured (single 
process)\nmy_gpu_process_energy_joules_total{...,energy_estimated=\"false\"} 15234.5\n\n# SM-based estimation (time-slicing with DCGM bug)\nmy_gpu_process_energy_joules_total{...,energy_estimated=\"true\"} 8421.2\n```\n\n**Energy Source:**\n- `energy_estimated=\"false\"` - Hardware-measured from GPU (most accurate)\n- `energy_estimated=\"true\"` - SM-based estimation (automatic fallback for time-slicing)\n\n**Query to filter by energy source:**\n```promql\n# Only hardware-measured energy\nmy_gpu_process_energy_joules_total{energy_estimated=\"false\"}\n\n# Only estimated energy\nmy_gpu_process_energy_joules_total{energy_estimated=\"true\"}\n```\n\n**Usage:**\n```promql\n# Average power in Watts\nrate(my_gpu_process_energy_joules_total[1m])\n\n# Total energy consumed in last hour (Joules)\nincrease(my_gpu_process_energy_joules_total{pod=\"training-job\"}[1h])\n```\n\n#### Utilization (Gauges)\n\n```prometheus\nmy_gpu_process_sm_utilization_ratio{...} 0.85\nmy_gpu_process_memory_utilization_ratio{...} 0.72\n```\n\nValues are 0.0-1.0 (0%-100%).\n\n#### Memory (Gauge)\n\n```prometheus\nmy_gpu_process_memory_used_bytes{...} 8589934592\n```\n\nGPU memory used by process in bytes.\n\n#### Lifecycle (Gauges)\n\n```prometheus\nmy_gpu_process_start_time_seconds{...} 1699564800\nmy_gpu_process_active{...} 1  # 1=running, 0=exited\n```\n\n### GPU-Level Aggregation Metrics\n\nThese metrics aggregate per-process data at the GPU level, useful for validating time-slicing:\n\n#### Total GPU Energy (Counter)\n\n```prometheus\nmy_gpu_process_gpu_energy_joules_total{gpu=\"0\"} 45234.5\n```\n\nSum of energy consumed by all processes on this GPU. With time-slicing, this represents the total GPU energy distributed across multiple processes.\n\n#### GPU Process Count (Gauge)\n\n```prometheus\nmy_gpu_process_gpu_process_count{gpu=\"0\"} 3\n```\n\nNumber of active processes on this GPU. 
When `\u003e 1`, indicates time-slicing is active.\n\n**Usage:**\n```promql\n# Detect time-slicing (GPUs with multiple processes)\nmy_gpu_process_gpu_process_count \u003e 1\n\n# Total power per GPU (Watts)\nrate(my_gpu_process_gpu_energy_joules_total[1m])\n\n# Verify: GPU total should equal sum of per-process\nrate(my_gpu_process_gpu_energy_joules_total{gpu=\"0\"}[1m])\n==\nsum(rate(my_gpu_process_energy_joules_total{gpu=\"0\"}[1m]))\n```\n\n## Example Queries\n\n### Power Consumption\n\n```promql\n# Current power per pod (Watts)\nrate(my_gpu_process_energy_joules_total{namespace=\"ml\"}[1m])\n\n# Total power across all pods\nsum(rate(my_gpu_process_energy_joules_total[1m]))\n\n# Power by namespace\nsum by (namespace) (rate(my_gpu_process_energy_joules_total[5m]))\n```\n\n### Energy Accounting\n\n```promql\n# Energy consumed by pod in last hour (Joules)\nincrease(my_gpu_process_energy_joules_total{pod=\"training-job-123\"}[1h])\n\n# Convert to kWh\nincrease(my_gpu_process_energy_joules_total{pod=\"training-job-123\"}[1h]) / 3600000\n\n# Total energy cost (assuming $0.10/kWh)\n(increase(my_gpu_process_energy_joules_total[1h]) / 3600000) * 0.10\n```\n\n### Efficiency\n\n```promql\n# Power efficiency (compute utilization per Watt)\nmy_gpu_process_sm_utilization_ratio / rate(my_gpu_process_energy_joules_total[1m])\n\n# Most power-hungry pods\ntopk(10, rate(my_gpu_process_energy_joules_total[5m]))\n```\n\n### Active Processes\n\n```promql\n# Count of active GPU processes\nsum(my_gpu_process_active)\n\n# Active processes per namespace\nsum by (namespace) (my_gpu_process_active)\n```\n\n## Time-Slicing Support\n\nmy-gpu-exporter **fully supports GPU time-slicing** with automatic detection and intelligent energy attribution:\n\n### Features\n\n1. **Automatic Detection**: Detects when multiple processes share a GPU\n2. 
**Smart Energy Attribution**:\n   - **Hardware-measured** (preferred): Uses DCGM when values are differentiated\n   - **SM-based estimation** (fallback): Automatically applied when DCGM reports identical values (bug)\n3. **Transparent Labeling**: `energy_estimated` label indicates measurement method\n4. **Validation**: Detects and logs DCGM time-slicing bug\n5. **Aggregation Metrics**: GPU-level totals for validation\n\n### Testing Time-Slicing\n\nSee [Time-Slicing Testing Guide](docs/TIMESLICING-TEST.md) for comprehensive testing instructions.\n\nQuick validation:\n```bash\n# Deploy test workload (3 pods sharing GPU)\nkubectl apply -f timeslicing-test.yaml\n\n# Check metrics show different energy per process\ncurl http://exporter:9400/metrics | grep energy_joules_total\n\n# Verify process count \u003e 1 (indicates time-slicing)\ncurl http://exporter:9400/metrics | grep gpu_process_count\n```\n\n### Logs\n\n**When time-slicing detected with proper DCGM values:**\n```\nINFO Time-slicing detected: multiple processes on same GPU gpu=0 process_count=3\nDEBUG Time-slicing validation: energy values properly differentiated gpu=0 process_count=3\n```\n\n**When DCGM bug detected (identical values) and estimation is applied:**\n```\nINFO Time-slicing detected: multiple processes on same GPU gpu=0 process_count=3\nINFO Applying SM-based energy estimation for time-sliced processes gpu=0 process_count=3\nDEBUG Applied energy estimation pid=12345 pod=training-job sm_util=0.39 proportion=0.78 estimated_energy_J=245.3\n```\n\n**If estimation is disabled:**\n```\nWARN SUSPICIOUS: All time-sliced processes show identical energy values (estimation disabled) hint=\"Enable --enable-energy-estimation to use SM-based estimation\"\n```\n\n## Comparison with dcgm-exporter\n\n| Feature | dcgm-exporter | my-gpu-exporter |\n|---------|---------------|-----------------|\n| **Scope** | GPU-level | Process-level |\n| **Power metric** | `DCGM_FI_DEV_POWER_USAGE` | `my_gpu_process_energy_joules_total` 
|\n| **Time-slicing** | Duplicates same value | Smart attribution (measured or estimated) |\n| **Time-slice detection** | No | Yes (automatic) |\n| **DCGM bug detection** | No | Yes (with auto-fallback) |\n| **Energy attribution** | N/A (whole GPU) | Hardware-measured (preferred), SM-estimated (fallback) |\n| **Transparency** | N/A | `energy_estimated` label shows method |\n| **Use case** | GPU monitoring | Workload cost attribution |\n\n### Example with Time-Slicing\n\n**dcgm-exporter** (both show 200W):\n```prometheus\nDCGM_FI_DEV_POWER_USAGE{gpu=\"0\",exported_pod=\"pod-a\"} 200\nDCGM_FI_DEV_POWER_USAGE{gpu=\"0\",exported_pod=\"pod-b\"} 200\nSum = 400W (wrong - GPU only uses 200W!)\n```\n\n**my-gpu-exporter** (intelligent attribution):\n```prometheus\n# Scenario 1: DCGM provides correct per-process values (hardware-measured)\nmy_gpu_process_energy_joules_total{gpu=\"0\",pod=\"pod-a\",energy_estimated=\"false\"} 120\nmy_gpu_process_energy_joules_total{gpu=\"0\",pod=\"pod-b\",energy_estimated=\"false\"} 80\n\n# Scenario 2: DCGM bug detected, SM-based estimation applied\n# (pod-a has 60% SM util, pod-b has 40% SM util, GPU power is 200W)\nmy_gpu_process_energy_joules_total{gpu=\"0\",pod=\"pod-a\",energy_estimated=\"true\"} 120\nmy_gpu_process_energy_joules_total{gpu=\"0\",pod=\"pod-b\",energy_estimated=\"true\"} 80\n\n# GPU-level aggregation (always correct)\nmy_gpu_process_gpu_energy_joules_total{gpu=\"0\"} 200\n\n# Process count (indicates time-slicing)\nmy_gpu_process_gpu_process_count{gpu=\"0\"} 2\n```\n\n## Troubleshooting\n\n### No metrics appearing\n\n1. **Check GPU accounting mode:**\n```bash\nnvidia-smi -q | grep \"Accounting Mode\"\n# Should show: Enabled\n```\n\nIf disabled:\n```bash\nsudo nvidia-smi -am 1\n```\n\n2. **Check DCGM is working:**\n```bash\ndcgmi discovery -l\n```\n\n3. 
**Check exporter logs:**\n```bash\nkubectl -n gpu-monitoring logs -l app=my-gpu-exporter\n```\n\n### Energy values are zero\n\n- GPU accounting mode must be enabled **before** processes start\n- Restart GPU workloads after enabling accounting mode\n- Wait 3+ seconds after process starts for DCGM to collect data\n\n### \"Failed to get container ID\"\n\n- Exporter needs access to `/proc/\u003cpid\u003e/cgroup`\n- Ensure `hostPID: true` in DaemonSet\n- Ensure `/proc` volume is mounted\n\n### \"Failed to get pod info\"\n\n- Check kubelet pod-resources socket exists:\n```bash\nls -la /var/lib/kubelet/pod-resources/kubelet.sock\n```\n\n- Ensure socket is mounted in container\n- Check Kubernetes version (requires 1.20+)\n\n### Processes not showing up\n\n- Exporter **only tracks Kubernetes pods**, not system processes\n- Verify process is running in a container:\n```bash\ncat /proc/\u003cpid\u003e/cgroup\n```\n\n## Architecture\n\n```\n┌─────────────────────────────────────┐\n│       my-gpu-exporter               │\n├─────────────────────────────────────┤\n│                                     │\n│  ┌──────────────────────────────┐  │\n│  │   NVML Process Discovery     │  │\n│  │   (GetComputeRunningProcs)   │  │\n│  └──────────────────────────────┘  │\n│              │                      │\n│              ▼                      │\n│  ┌──────────────────────────────┐  │\n│  │   DCGM Process Metrics       │  │\n│  │   GetProcessInfo()           │  │\n│  │   → EnergyConsumed (ACTUAL)  │  │\n│  └──────────────────────────────┘  │\n│              │                      │\n│              ▼                      │\n│  ┌──────────────────────────────┐  │\n│  │   Kubernetes Pod Mapper      │  │\n│  │   /proc/PID/cgroup           │  │\n│  │   + Pod Resources API        │  │\n│  └──────────────────────────────┘  │\n│              │                      │\n│              ▼                      │\n│  ┌──────────────────────────────┐  │\n│  │   Prometheus Exporter        │  │\n│  │   
/metrics endpoint          │  │\n│  └──────────────────────────────┘  │\n└─────────────────────────────────────┘\n```\n\n## Contributing\n\nContributions welcome! Please ensure:\n- Code follows Go best practices\n- Tests pass\n- Documentation is updated\n- No introduction of estimation or approximation (use actual measurements only)\n\n## License\n\n[Add your license here]\n\n## Acknowledgments\n\n- NVIDIA DCGM team for per-process energy API\n- Prometheus community\n- Kubernetes sig-node for pod-resources API\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvimalk78%2Fmy-gpu-exporter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvimalk78%2Fmy-gpu-exporter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvimalk78%2Fmy-gpu-exporter/lists"}