https://github.com/outerbounds/vllm-ws-setup
- Host: GitHub
- URL: https://github.com/outerbounds/vllm-ws-setup
- Owner: outerbounds
- Created: 2025-08-08T07:31:45.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-08-09T02:19:01.000Z (10 months ago)
- Last Synced: 2025-08-09T04:10:52.270Z (10 months ago)
- Language: Dockerfile
- Size: 340 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## Step 1. Create a vLLM-enabled workstation
To run a 32B model, use a compute pool with a 4-GPU instance, such as `g5.12xlarge` on AWS.
Note the following when configuring the workstation:
1. Shared memory is set to 10GB because the default is insufficient for inter-process communication (IPC) across GPUs with vLLM.
2. Use an image that has NVIDIA GPU drivers installed. This repository contains an [example image](./Dockerfile) that pre-installs vLLM, PyTorch, and other dependencies. A public image is hosted at `docker.io/eddieob/vllm-flashinfer-metaflow` for demo purposes.
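
Once the workstation is up, it can help to confirm that the shared memory setting took effect before loading a model. The following is a minimal sketch, not part of this repository: `check_shm` is a hypothetical helper name, it assumes shared memory is mounted at `/dev/shm`, and it treats the 10GB setting as a 10 GiB threshold.

```bash
# check_shm (hypothetical helper): compare the size of a mount against a
# minimum in GiB. Defaults assume /dev/shm and the 10GB setting described above.
check_shm() {
  local min_gib="${1:-10}"
  local path="${2:-/dev/shm}"
  local size_kib size_gib
  # df -Pk prints POSIX-format output in 1024-byte blocks; column 2 is the total size.
  size_kib="$(df -Pk "$path" | awk 'NR==2 {print $2}')"
  size_kib="${size_kib:-0}"
  size_gib=$(( size_kib / 1024 / 1024 ))
  if [ "$size_gib" -lt "$min_gib" ]; then
    echo "WARN: $path is ${size_gib} GiB, below ${min_gib} GiB"
    return 1
  fi
  echo "OK: $path is ${size_gib} GiB"
}
```

Running `check_shm` with no arguments checks `/dev/shm` against 10 GiB. On an image with NVIDIA drivers, `nvidia-smi -L` lists the visible GPUs, which should show four devices on a 4-GPU instance.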


## Step 2. Run vLLM
The image mentioned in the previous section already has `vllm` installed.
If you opt to bring your own image, please ensure you have `vllm` installed in the active environment.
### Run the OpenAI-compatible server
Choose your model and [inference server parameters](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html).
```bash
vllm serve Qwen/Qwen3-32B --tensor-parallel-size 4
```
Pulling gated Hugging Face models requires setting the `HF_TOKEN` environment variable.
The initial load and model compilation can take around 10 minutes for larger models.
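
Because startup can take several minutes, a script that queries the server may want to wait until it is ready. The sketch below polls a health endpoint with `curl`. `wait_for_vllm` is a hypothetical helper name, the `/health` path and port 8000 are assumptions about the server's defaults, and the default of 120 attempts at 10-second intervals is an arbitrary budget of roughly 20 minutes.

```bash
# wait_for_vllm (hypothetical helper): poll BASE_URL/health until it responds
# with success or the attempt budget runs out.
wait_for_vllm() {
  local base_url="${1:-http://localhost:8000}"
  local max_attempts="${2:-120}"
  local delay="${3:-10}"
  local i=1
  while [ "$i" -le "$max_attempts" ]; do
    # -s: silent, -f: treat HTTP errors as failures, -o: discard the body.
    if curl -sf -o /dev/null --max-time 5 "${base_url}/health"; then
      echo "server ready after ${i} attempt(s)"
      return 0
    fi
    # Sleep only if another attempt remains.
    if [ "$i" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    i=$(( i + 1 ))
  done
  echo "server not ready after ${max_attempts} attempt(s)" >&2
  return 1
}
```

Calling `wait_for_vllm` in a second terminal after starting `vllm serve` blocks until the endpoint answers, then returns 0.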
### Query the server
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-32B",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
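
The server returns a JSON document. To print only the generated text, the response can be piped through `jq`. This is a sketch under two assumptions: `jq` is installed on the workstation, and the response follows the OpenAI chat completions schema, where the reply sits at `choices[0].message.content`. `extract_reply` is a hypothetical helper name.

```bash
# extract_reply (hypothetical helper): read a chat completions response on
# stdin and print the first choice's message text. Requires jq.
extract_reply() {
  jq -r '.choices[0].message.content'
}

# Usage (needs a running server), with the same curl command as above plus -s:
#   curl -s -X POST http://localhost:8000/v1/chat/completions ... | extract_reply
```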