{"id":48237249,"url":"https://github.com/kempnerinstitute/kempnerpulse","last_synced_at":"2026-04-15T04:01:35.650Z","repository":{"id":349204584,"uuid":"1196569875","full_name":"KempnerInstitute/kempnerpulse","owner":"KempnerInstitute","description":"KempnerPulse - real-time GPU monitoring dashboard for DCGM Prometheus metrics.","archived":false,"fork":false,"pushed_at":"2026-04-13T22:51:26.000Z","size":6356,"stargazers_count":7,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-14T00:28:39.970Z","etag":null,"topics":["gpu","gpu-computing","gpu-monitoring","gpu-utilization","nvidia","nvidia-gpu"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/KempnerInstitute.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-30T20:33:18.000Z","updated_at":"2026-04-13T22:19:39.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/KempnerInstitute/kempnerpulse","commit_stats":null,"previous_names":["kempnerinstitute/kempnerpulse"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/KempnerInstitute/kempnerpulse","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KempnerInstitute%2Fkempnerpulse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KempnerInstitute%2Fkempnerpulse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KempnerInstitute%2Fkempnerpulse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KempnerInstitute%2Fkempnerpulse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/KempnerInstitute","download_url":"https://codeload.github.com/KempnerInstitute/kempnerpulse/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KempnerInstitute%2Fkempnerpulse/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31825515,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-14T18:05:02.291Z","status":"online","status_checked_at":"2026-04-15T02:00:06.175Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gpu","gpu-computing","gpu-monitoring","gpu-utilization","nvidia","nvidia-gpu"],"created_at":"2026-04-04T20:03:17.260Z","updated_at":"2026-04-15T04:01:35.640Z","avatar_url":"https://github.com/KempnerInstitute.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# KempnerPulse\n\n[![PyPI](https://img.shields.io/pypi/v/kempnerpulse)](https://pypi.org/project/kempnerpulse/)\n\n\u003e `nvidia-smi` says 100% GPU utilization - but are your tensor cores even active? KempnerPulse shows what's *actually* happening.\n\nReal-time GPU monitoring dashboard for DCGM metrics. A single-file\nRich-based TUI that streams metrics from\n[dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter) via Prometheus HTTP\nor directly from `dcgmi dmon` for high-resolution profiling (~100 ms),\nand renders four interactive views in the terminal.\n\n![KempnerPulse Demo](https://raw.githubusercontent.com/KempnerInstitute/kempnerpulse/main/docs/images/kempner_pulse_screen_record.gif)\n\n## Features\n\n- **Fleet View** : All GPUs at a glance: utilization, memory, power,\n  temperature, PCIe/NVLink bandwidth, sparkline bars.\n- **Focus View** : Deep dive into one GPU with per-metric sparkline history.\n- **Plot View** : Stacked line charts across all GPUs.\n- **Job View** : Running GPU compute processes with per-GPU metrics.\n- **Real Utilization** : Weighted composite metric from SM active, tensor pipe,\n  DRAM active, and GR engine counters (customizable weights with presets for\n  AI/ML, HPC, and memory-bound workflows).\n- **Workload Classification** : 12-category status based on NVIDIA DCGM\n  profiling metric guidance (idle, tensor-heavy compute, memory-bound, I/O,\n  etc.).\n- **Health Monitoring** : Temperature, PCIe replay errors, and ECC errors\n  with color-coded alerts.\n- **SLURM/CUDA Aware** : Automatically detects `CUDA_VISIBLE_DEVICES`,\n  `SLURM_JOB_GPUS`, etc. to show only your allocated GPUs.\n- **Direct DCGM Backend** : `--backend dcgm` queries `dcgmi dmon` directly,\n  bypassing dcgm-exporter for true high-resolution sampling. Automatically\n  resolves physical GPU IDs inside SLURM cgroups.\n- **Zero Dependencies** beyond Python 3.9+ and `rich`.\n\n## Screenshots\n\n### Fleet View\n\nAll GPUs at a glance with utilization bars, memory, power, temperature, and bandwidth.\n\n![Fleet View](https://raw.githubusercontent.com/KempnerInstitute/kempnerpulse/main/docs/images/fleet_view.png)\n\n### Focus View\n\nDeep dive into a single GPU with per-metric sparkline history.\n\n![Focus View](https://raw.githubusercontent.com/KempnerInstitute/kempnerpulse/main/docs/images/focus_view.png)\n\n### Plot View\n\nStacked line charts across all GPUs.\n\n![Plot View](https://raw.githubusercontent.com/KempnerInstitute/kempnerpulse/main/docs/images/plot_view.png)\n\n### Job View\n\nRunning GPU compute processes with per-GPU metrics.\n\n![Job View](https://raw.githubusercontent.com/KempnerInstitute/kempnerpulse/main/docs/images/job_view.png)\n\n## Requirements\n\n- Linux with NVIDIA GPUs\n- [dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter) running and\n  exposing `/metrics` (default: `http://localhost:9400/metrics`) — or\n  `dcgmi` CLI available for `--backend dcgm`\n- Python \u003e= 3.9\n- `nvidia-smi` on the PATH (for hardware queries and process listing)\n\n\u003e **Note:** KempnerPulse currently supports NVIDIA datacenter GPUs (V100, A100, H100, H200, B200, B300).\n\u003e Grace-Hopper (GH200), Grace-Blackwell (GB200) and RTX support is planned but not yet tested.\n\u003e AMD GPUs are not supported.\n\n## Installation\n\n```bash\npip install kempnerpulse\n```\n\nOr install from source:\n\n```bash\npip install .\n```\n\n## Quick Start\n\n```bash\n# Default: connect to localhost:9400/metrics, show SLURM/CUDA-visible GPUs\nkempnerpulse\n\n# Explicit source and GPU selection\nkempnerpulse --source http://gpu-node:9400/metrics --gpus 0,1,2,3\n\n# Show all GPUs on the node\nkempnerpulse --show-all\n\n# Start in focus view for GPU 0\nkempnerpulse --focus-gpu 0\n\n# Use direct DCGM backend (bypasses Prometheus, higher resolution)\nkempnerpulse --backend dcgm\n\n# Use HPC weight preset\nkempnerpulse --hpc-weights\n\n# Custom weights (SM, Tensor, DRAM, GR; normalized automatically)\nkempnerpulse --weights 0.40,0.30,0.20,0.10\n\n# Export CSV (default columns) — only GPUs with your processes\nkempnerpulse --export \u003e metrics.csv\n\n# Export all columns\nkempnerpulse --export all \u003e metrics.csv\n\n# Export custom columns\nkempnerpulse --export gpu_id,real_util_pct,power_w,tensor_active_pct \u003e metrics.csv\n\n# Single snapshot export\nkempnerpulse --export --once\n```\n\n## Interactive Commands\n\n| Command       | Action                                      |\n|---------------|---------------------------------------------|\n| `:focus \u003cid\u003e` | Enter focused view for a specific GPU       |\n| `:plot`       | Enter plot view (line charts)   |\n| `:job`        | Enter job view (running GPU processes)      |\n| `:q`          | Return to fleet view (or exit if in fleet)  |\n| `:exit`       | Exit the dashboard                          |\n| `Ctrl+C`      | Exit the dashboard                          |\n| `Esc`         | Cancel an unfinished `:` command            |\n\n## CLI Reference\n\n| Flag | Type | Default | Description |\n|------|------|---------|-------------|\n| `--version` | | | Show version and exit. |\n| `--backend` | string | `prometheus` | Data source backend: `prometheus` (dcgm-exporter HTTP) or `dcgm` (dcgmi dmon direct). |\n| `--source URL` | string | `http://localhost:9400/metrics` | dcgm-exporter `/metrics` endpoint or a local text file (prometheus backend only). |\n| `--poll SECS` | float | `1.0` | Sampling/refresh interval in seconds. With `--backend dcgm`, drives a persistent `dcgmi` stream and is honored down to a 100ms floor (DCGM profiling counters refresh at ~10Hz internally; below 100ms most profiling rows would be blank). With `--backend prometheus`, must be `\u003e= 1.0` (dcgm-exporter scrapes profiling fields at ~30s, so sub-second values just duplicate samples). |\n| `--history N` | int | `120` | Number of samples kept for sparkline history. |\n| `--focus-gpu ID` | string | | Start in Focus View for the given GPU id (e.g. `0`). |\n| `--once` | flag | | Render a single snapshot and exit instead of running live. |\n| `--gpus IDS` | string | | Explicit GPU ids or ranges (`0,1` or `0-3`). Overrides SLURM/CUDA env vars. |\n| `--show-all` | flag | | Ignore SLURM/CUDA visibility env vars; show every GPU in the source. |\n| `--weights W` | 4 floats | `0.35,0.35,0.20,0.10` | Comma-separated Real Util weights: SM,TENSOR,DRAM,GR. Auto-normalized. |\n| `--ai-weights` | preset | | AI/LLM training preset `(0.35, 0.35, 0.20, 0.10)`. This is the default. |\n| `--hpc-weights` | preset | | HPC / mixed CUDA preset `(0.45, 0.15, 0.25, 0.15)`. |\n| `--mem-weights` | preset | | Memory-bound / bandwidth-heavy preset `(0.35, 0.10, 0.40, 0.15)`. |\n| `--export` | string | *(off)* | Output CSV to stdout. `--export` for default columns, `--export all` for every column, or `--export col1,col2,...` for a custom set. Rows are emitted for every GPU in the visibility set (`CUDA_VISIBLE_DEVICES` / `SLURM_JOB_GPUS` / `--gpus`), so you can start the recorder before your job launches. |\n\n### GPU Visibility Selection\n\nThe dashboard picks the first available source in this order:\n\n1. `--gpus` flag\n2. `CUDA_VISIBLE_DEVICES` env var\n3. `NVIDIA_VISIBLE_DEVICES` env var\n4. `SLURM_STEP_GPUS` env var\n5. `SLURM_JOB_GPUS` env var\n\nIf none are set, all GPUs on the node are shown. Use `--show-all` to\nexplicitly override all env vars. All GPU selections are filtered against\nGPUs accessible to the current process (as reported by `nvidia-smi`),\nwhich respects cgroup and container restrictions.\n\n## Weight Presets\n\n| Preset          | Flag             | SM    | Tensor | DRAM  | GR    | Best For |\n|-----------------|------------------|-------|--------|-------|-------|----------|\n| AI/ML (default) | `--ai-weights`   | 0.35  | 0.35   | 0.20  | 0.10  | DL training, LLM inference, transformers |\n| HPC             | `--hpc-weights`  | 0.45  | 0.15   | 0.25  | 0.15  | Scientific computing, mixed CUDA |\n| Memory-bound    | `--mem-weights`  | 0.35  | 0.10   | 0.40  | 0.15  | Bandwidth-heavy workloads, stencil codes |\n\nCustom: `--weights 0.40,0.30,0.20,0.10` (values are normalized automatically).\n\n## How It Works\n\nKempnerPulse reads GPU metrics via one of two backends: **Prometheus**\n(dcgm-exporter HTTP endpoint, ~30 s update interval for profiling fields) or\n**DCGM direct** (`dcgmi dmon`, configurable down to ~100 ms).\nIt computes a **Real Utilization** score as a weighted combination of four\nDCGM profiling counters:\n\n```\nReal Util = clamp(0, 100,\n              W_sm    × SM_ACTIVE\n            + W_tensor × TENSOR_ACTIVE\n            + W_dram   × DRAM_ACTIVE\n            + W_gr     × GR_ENGINE_ACTIVE)\n```\n\nThis gives a more accurate picture of GPU utilization than `nvidia-smi`'s\n`GPU-Util` alone, which only reports kernel-launch duty cycle.\n\n## Workload Classification\n\nEach GPU is classified into one of **12 categories** every refresh cycle,\nbased on thresholds from\n[NVIDIA's DCGM profiling metric guidance](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#profiling).\nCategories are evaluated in order and the first matching rule wins.\n\n| Status | Thresholds | Rationale |\n|--------|------------|-----------|\n| **idle** | Real Util \u003c 5 %, GR \u003c 5 %, DRAM \u003c 5 %, no I/O | Nothing running. |\n| **tensor-heavy compute** | Tensor ≥ 50 % and SM ≥ 60 % | DL training / large-scale inference. |\n| **tensor compute** | Tensor ≥ 15 % and SM ≥ 40 % | Mixed-precision, moderate tensor use. |\n| **FP64 / HPC compute** | FP64 ≥ 20 % and SM ≥ 50 % | Scientific double-precision workload. |\n| **I/O or data-loading** | Memcpy ≥ 40 % or PCIe ≥ 1 GB/s, SM \u003c 30 % | Heavy transfer; SMs idle. |\n| **memory-bound** | DRAM ≥ 50 % and SM \u003c 50 % | Bandwidth limited. |\n| **compute-heavy** | SM ≥ 80 % | Effective SM use (NVIDIA: ≥ 80 % needed). |\n| **compute-active** | SM ≥ 50 % | Moderate compute, no tensor dominance. |\n| **memory-active** | DRAM ≥ 40 % | Significant DRAM traffic. |\n| **busy, low SM use** | GR ≥ 40 % and SM \u003c 25 % | Overhead / sync / small kernels. |\n| **low utilization** | GR \u003c 15 %, SM \u003c 15 %, DRAM \u003c 15 % | Barely active. |\n| **mixed / moderate** | *(fallthrough)* | No single dominant pattern. |\n\nFull details, bottleneck color key, and NVIDIA reference points:\n[docs/classification.md](docs/classification.md)\n\n## Health Monitoring\n\n| Status | Condition | Meaning |\n|--------|-----------|---------|\n| **OK** | *(none of the below)* | Normal operation. |\n| **WARN** | PCIe replay rate \u003e 0/s | PCIe link retransmissions occurring. |\n| **HOT** | GPU or memory temp ≥ warning threshold | Approaching thermal throttling. |\n| **CRIT** | Row-remap failure \u003e 0 or uncorrectable remapped rows \u003e 0 | Hardware memory errors. Remove from production. |\n\nTemperature warning thresholds are per-model (A100: 93 °C, H100/H200: 95 °C,\nRTX 6000: 92 °C, default: 93 °C). Full threshold table:\n[docs/classification.md](docs/classification.md#temperature-thresholds-by-gpu-model)\n\n## CSV Export\n\nExport GPU metrics as CSV for offline analysis or terminal monitoring. Only GPUs\nwhere the current user has running compute processes are included.\n\n```bash\nkempnerpulse --export \u003e metrics.csv            # default columns\nkempnerpulse --export all \u003e metrics.csv        # all 34 columns\nkempnerpulse --export gpu_id,real_util_pct,power_w \u003e metrics.csv  # custom\nkempnerpulse --export --once                   # single snapshot\n```\n\nDefault columns: `timestamp, gpu_id, model, gpu_util_pct, mem_used_mib,\nreal_util_pct, sm_active_pct, tensor_active_pct, dram_active_pct`\n\nFull column reference and usage details:\n[docs/export.md](docs/export.md)\n\n## DCGM Metrics\n\nKempnerPulse consumes ~30 DCGM fields covering profiling counters, memory,\ntemperature, power, clocks, PCIe, NVLink, and error counters. The complete\nlist with descriptions and NVIDIA doc links:\n[docs/metrics.md](docs/metrics.md)\n\n## Performance Overhead\n\nKempnerPulse introduces minimal runtime overhead, using approximately 8.2% of a single CPU core on an AMD EPYC 9374F processor, with negligible memory usage (below the reporting resolution of `top`).\n\n## License\n\nMIT. See [LICENSE](LICENSE) for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkempnerinstitute%2Fkempnerpulse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkempnerinstitute%2Fkempnerpulse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkempnerinstitute%2Fkempnerpulse/lists"}