https://github.com/nexusgpu/benchmark

TensorFusion Remote/Local vGPU Benchmark
https://github.com/nexusgpu/benchmark

Last synced: 12 months ago
JSON representation

TensorFusion Remote/Local vGPU Benchmark

Host: GitHub
URL: https://github.com/nexusgpu/benchmark
Owner: NexusGPU
License: apache-2.0
Created: 2025-05-09T09:34:36.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-07-03T16:54:18.000Z (12 months ago)
Last Synced: 2025-07-03T17:47:52.923Z (12 months ago)
Language: Smarty
Size: 17.6 KB
Stars: 4
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # TensorFusion Remote/Local vGPU Benchmark Helm Chart

This Helm chart deploys the TensorFusion Remote/Local vGPU Benchmark application, which includes a deployment for running the benchmark tests and a cronjob for automated testing.

## Benchmark Results

### TorchBenchmark Results (2025-07-06)

To run the TorchBenchmark tests:

```bash

cd benchmark

python3 test.py -k "test_${model_name}_eval_cuda" -t ${eval_times}

```

| Model | Native | NGPU Mode | Loss(NGPU) | Local | Loss(Local) | Same AZ | Loss(Same AZ) | Cross AZ | Loss(Cross AZ) |

|-------|---------|------------|------------|--------|-------------|---------|--------------|----------|----------------|

| basic_gnn_edgecnn | 41.15 s | 40.95 s | -0.49% | 43.48 s | 5.66% | 46.07 s | 11.96% | 54.97 s | 33.58% |

| BERT_pytorch | 249.02 s | 248.84 s | -0.07% | 251.26 s | 0.90% | 253.71 s | 1.88% | 261.62 s | 5.06% |

| basic_gnn_gcn | 15.05 s | 15.24 s | 1.26% | 19.63 s | 30.43% | 29.70 s | 97.34% | 64.39 s | 327.84% |

| basic_gnn_gin | 9.47 s | 9.53 s | 0.63% | 9.78 s | 3.27% | 12.66 s | 33.69% | 21.83 s | 130.52% |

| hf_Albert | 24.73 s | 24.00 s | -2.95% | 29.19 s | 18.03% | 39.19 s | 58.47% | 73.00 s | 195.19% |

| hf_Bart | 39.88 s | 38.68 s | -3.01% | 54.96 s | 37.81% | 94.17 s | 136.13% | 211.68 s | 430.79% |

| hf_Bert | 24.15 s | 24.35 s | 0.83% | 29.55 s | 22.36% | 42.00 s | 73.91% | 75.86 s | 214.12% |

| llama | 39.91 s | 41.20 s | 3.23% | 42.90 s | 7.49% | 45.80 s | 14.76% | 52.55 s | 31.67% |

| hf_distil_whisper | 170.61 s | 170.87 s | 0.15% | 172.16 s | 0.91% | 178.75 s | 4.77% | 189.45 s | 11.04% |

| hf_clip | 191.60 s | 191.70 s | 0.05% | 194.52 s | 1.52% | 197.51 s | 3.08% | 208.90 s | 9.03% |

| hf_Whisper | 58.98 s | 59.18 s | 0.34% | 63.50 s | 7.66% | 66.66 s | 13.02% | 72.63 s | 23.14% |

| **Average Loss** | - | - | **0.00%** | - | **12.37%** | - | **40.82%** | - | **128.36%** |

### MLPerf Results (2025-07-04)

To run the MLPerf benchmark:

```bash

mlcr run-mlperf,inference,_full,_r5.0-dev \

    --model=bert-99 \

    --implementation=reference \

    --framework=pytorch \

    --category=edge \

    --scenario=SingleStream \

    --execution_mode=valid \

    --device=cuda \

    --quiet --rerun

```

| Mode | Time | Loss |

|------|------|------|

| Native | 27.008 s | - |

| Local | 29.930 s | 10.82% |

| Same AZ | 33.341 s | 23.45% |

| Cross AZ | 41.597 s | 54.02% |

### Simulating AZ Latencies

To simulate different AZ (Availability Zone) network conditions, you can use the Linux Traffic Control (tc) tool to inject artificial network latency:

1. Inject network latency:

```bash

# For Same AZ simulation (0.3ms latency)

tc qdisc add dev lo root netem delay 0.3ms

# For Cross AZ simulation (1ms latency)

tc qdisc add dev lo root netem delay 1ms

```

2. Verify the latency:

```bash

ping target_host

```

3. Remove the artificial latency when done:

```bash

tc qdisc del dev lo root

```

## Prerequisites

- Kubernetes 1.19+

- Helm 3.2.0+

- PV provisioner support in the underlying infrastructure

- A GPU node with NVIDIA drivers installed

## Installing the Chart

To install the chart with the release name `my-release`:

```bash

helm install my-release ./helm/torchbench

```

The command deploys the benchmark application on the Kubernetes cluster with default configuration.

## Configuration

The following table lists the configurable parameters of the chart and their default values.

| Parameter | Description | Default |

|-----------|-------------|---------|

| `replicaCount` | Number of replicas | `1` |

| `image.repository` | Image repository | `crpi-wpzfqfci37r0ad3n.cn-hangzhou.personal.cr.aliyuncs.com/tensorfusionrobin/tensorfusionrobin` |

| `image.tag` | Image tag | `latest` |

| `image.pullPolicy` | Image pull policy | `Always` |

| `serviceAccount.create` | Create service account | `true` |

| `serviceAccount.name` | Service account name | `cronjob-sa` |

| `podAnnotations` | Pod annotations | See values.yaml |

| `podLabels` | Pod labels | See values.yaml |

| `resources` | Pod resource requests and limits | See values.yaml |

| `nodeSelector` | Node selector | `kubernetes.io/hostname: gpu-2` |

| `cronjob.schedule` | Cronjob schedule | `0 0 * * *` |

| `cronjob.concurrencyPolicy` | Cronjob concurrency policy | `Allow` |

| `cronjob.successfulJobsHistoryLimit` | Number of successful jobs to keep | `3` |

| `cronjob.failedJobsHistoryLimit` | Number of failed jobs to keep | `1` |

## Usage

### Running the Benchmark

The benchmark will run automatically according to the cronjob schedule. You can also manually trigger a benchmark run by:

1. Finding the cronjob:

```bash

kubectl get cronjob

```

2. Creating a job from the cronjob:

```bash

kubectl create job --from=cronjob/my-release-torchbench-test-runner manual-run

```

### Viewing Results

To view the benchmark results:

```bash

kubectl logs -l app=my-release-torchbench

```

### Customizing the Configuration

To customize the configuration, create a custom values file:

```bash

helm install my-release ./helm/torchbench -f custom-values.yaml

```

## Uninstalling the Chart

To uninstall/delete the deployment:

```bash

helm uninstall my-release

```

## Troubleshooting

If you encounter any issues:

1. Check the pod status:

```bash

kubectl get pods -l app=my-release-torchbench

```

2. Check the pod logs:

```bash

kubectl logs -l app=my-release-torchbench

```

3. Check the cronjob status:

```bash

kubectl get cronjob

kubectl get jobs

```

4. Check the service account:

```bash

kubectl get serviceaccount

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/nexusgpu/benchmark

Awesome Lists containing this project

README