https://github.com/ccamacho/bench
https://github.com/ccamacho/bench
Last synced: 4 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/ccamacho/bench
- Owner: ccamacho
- Created: 2025-07-27T06:24:18.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-10-22T12:30:15.000Z (8 months ago)
- Last Synced: 2025-11-10T22:26:22.893Z (7 months ago)
- Language: Python
- Size: 1.26 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Instructions
The dockerfile in this repo is built and served in:
```
FROM ghcr.io/ccamacho/bench:latest
```
## Testing the bench container
```bash
oc new-project bench
```
We make sure we cleanup the environment first
```bash
# Begin Cleanup
oc delete job -n bench -l job-name=guidellm-benchmark --ignore-not-found
oc get pods -n bench --no-headers | grep guidellm-benchmark | awk '{print $1}' | xargs -r -n1 oc delete pod -n bench
oc delete secret hf-token-secret -n bench --ignore-not-found
# End Cleanup
# Make sure you have the HF key available
oc create secret generic hf-token-secret \
--from-file=token=$HOME/.keys/hf.key \
-n bench
```
Note: Make sure to update
`--target http://llm-d-inference-gateway-istio.llm-d.svc.cluster.local \`
with the actual endpoint you are testing in `guidellm-job.yml`.
VLLM_HOST=$(oc get route vllm -n my-vllm-runtime -o jsonpath='{.spec.host}')
curl http://$VLLM_HOST/health
Now let's run the job and fetch the results.
```bash
# Deploy the job
oc apply -f guidellm-job.yml
```
To get the logs and results run:
```bash
./fetch_logs.sh
```
This will give you both the json output and the logs locally,
now you can work on your results
## Debugging
```bash
kubectl get pods,svc -n istio-system
kubectl get pods,gateway -n llm-d
kubectl get pods,gateway -n llm-d-monitoring
```
# GPU Usage Plotting Script
A Python script to visualize GPU usage metrics from DCGM (Data Center GPU Manager) data files in Prometheus format.
## Features
- Parse DCGM metrics from Prometheus format files
- Plot GPU utilization, power usage, temperature, and memory utilization over time
- Support for multiple GPUs and hosts
- Interactive HTML plots using Plotly
- Combined dashboard view or individual metric plots
- Summary statistics for all metrics
## Installation
1. Install the required Python packages:
```bash
pip3 install -r requirements.txt
```
## Usage
### Basic Usage
Plot all GPU metrics in a combined dashboard:
```bash
python3 gpu_plot.py results-dcgm-20250731-092704.txt
```
### Specific Metrics
Plot only GPU utilization:
```bash
python3 gpu_plot.py results-dcgm-20250731-092704.txt --metric util
```
Plot only power usage:
```bash
python3 gpu_plot.py results-dcgm-20250731-092704.txt --metric power
```
Plot only temperature:
```bash
python3 gpu_plot.py results-dcgm-20250731-092704.txt --metric temp
```
Plot only memory utilization:
```bash
python3 gpu_plot.py results-dcgm-20250731-092704.txt --metric memory
```
### Custom Output
Specify a custom output file:
```bash
python3 gpu_plot.py results-dcgm-20250731-092704.txt --output my_gpu_metrics.html
```
## Data Format
The script expects DCGM metrics in Prometheus format with the following metrics:
- `DCGM_FI_DEV_GPU_UTIL`: GPU utilization (%)
- `DCGM_FI_DEV_POWER_USAGE`: Power usage (W)
- `DCGM_FI_DEV_GPU_TEMP`: GPU temperature (°C)
- `DCGM_FI_DEV_MEM_COPY_UTIL`: Memory utilization (%)
Example data format:
```
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-7bec0fc5...",device="nvidia0",modelName="NVIDIA L40S"} 85.5
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-7bec0fc5...",device="nvidia0",modelName="NVIDIA L40S"} 245.2
```
## Output
The script generates an interactive HTML file with:
- Time series plots for each metric
- Multiple GPUs shown with different colors
- Hover information with detailed values
- Summary statistics printed to console
## Examples
1. **Monitor GPU utilization during a benchmark:**
```bash
python3 gpu_plot.py benchmark_gpu_data.txt --metric util -o utilization_report.html
```
2. **Generate a complete GPU health dashboard:**
```bash
python3 gpu_plot.py monitoring_data.txt --metric all -o gpu_dashboard.html
```
3. **Check power consumption patterns:**
```bash
python3 gpu_plot.py power_monitoring.txt --metric power -o power_analysis.html
```
## Dependencies
- Python 3.6+
- plotly: Interactive plotting library
- pandas: Data manipulation
- numpy: Numerical operations
- kaleido: Static image export (optional)
## Similar to bench-plot
This script follows the same design patterns as the `bench-plot` script but is specifically tailored for GPU monitoring data:
- Uses Plotly for interactive visualizations
- Supports command-line arguments for different plot types
- Generates HTML output for easy sharing
- Provides summary statistics
- Handles multiple data series (GPUs) with distinct styling