https://github.com/nerdalert/vllm-bench-automation

Last synced: 8 months ago
JSON representation

Host: GitHub
URL: https://github.com/nerdalert/vllm-bench-automation
Owner: nerdalert
Created: 2025-05-30T19:50:11.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-06-30T05:29:07.000Z (12 months ago)
Last Synced: 2025-10-19T00:53:19.757Z (8 months ago)
Language: Shell
Size: 93.8 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README-minikube.md

Awesome Lists containing this project

README

## Minikube Deploy

The script `e2e-bench-control.sh` is very minikube focused since the original intent was e2e smoke-testing.

- Setup your cluster

```bash
minikube start \
--driver docker \
--container-runtime docker \
--gpus all \
--memory no-limit
```

Modify your `--values-file` node selector section and decode replicas to force all pods onto your node's GPUs. For example, on a node with 4xL40S you would use something like this:

```yaml
sampleApplication:
baseConfigMapRefName: basic-gpu-preset
model:
modelArtifactURI: hf://meta-llama/Llama-3.2-3B-Instruct
modelName: "meta-llama/Llama-3.2-3B-Instruct"
prefill:
replicas: 0
decode:
replicas: 4
redis:
enabled: false
modelservice:
epp:
defaultEnvVarsOverride:
- name: ENABLE_KVCACHE_AWARE_SCORER
value: "true"
- name: ENABLE_PREFIX_AWARE_SCORER
value: "true"
- name: ENABLE_LOAD_AWARE_SCORER
value: "true"
- name: ENABLE_SESSION_AWARE_SCORER
value: "false"
- name: PD_ENABLED
value: "false"
- name: PD_PROMPT_LEN_THRESHOLD
value: "10"
- name: PREFILL_ENABLE_KVCACHE_AWARE_SCORER
value: "false"
- name: PREFILL_ENABLE_LOAD_AWARE_SCORER
value: "false"
- name: PREFILL_ENABLE_PREFIX_AWARE_SCORER
value: "false"
- name: PREFILL_ENABLE_SESSION_AWARE_SCORER
value: "false"
prefill:
nodeSelector:
kubernetes.io/hostname: minikube
decode:
nodeSelector:
kubernetes.io/hostname: minikube

```

Use [slim values file](https://github.com/llm-d/llm-d-deployer/tree/main/quickstart/examples) deployments in the examples directory if you are using an NVIDIA L4 since it will not be able to load a Llama model.

Deploy with:

```bash
git clone https://github.com/llm-d/llm-d-deployer.git
cd quickstart
# no features
./llmd-installer.sh --minikube --values-file examples/no-features/slim/no-features-slim.yaml
# base (prefix scoring)
./llmd-installer.sh --minikube --values-file examples/base/base.yaml
# kvcache (kvcache aware scoring)
./llmd-installer.sh --minikube --values-file examples/kvcache/kvcache.yaml
```

### Run a deployment batch

Automate running multiple deployments in a batch with `e2e-bench-control.sh`. Example ENVs below will run two deployments with the input/output/request rates overriding the scripts defaults:

```yaml
ENV_DEPLOYMENT_VALUES_FILES="examples/no-features/slim/no-features-slim.yaml examples/base/slim/base-slim.yaml" \
ENV_BENCH_INPUT_LEN="512" \
ENV_BENCH_OUTPUT_LEN="1024" \
ENV_BENCH_REQUEST_RATES="5,10,inf" \
./e2e-bench-control.sh --model meta-llama/Llama-3.1-8B-Instruct
```

Example output [here](https://gist.github.com/nerdalert/d985a6ea3a6c416771900a78e98b64f8)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/nerdalert/vllm-bench-automation

Awesome Lists containing this project

README