https://github.com/avarga1/vllm-hb
vLLM-compatible inference runtime in pure Rust. Zero Python. Zero libtorch. CUDA via candle.
https://github.com/avarga1/vllm-hb
candle cuda inference llm openai-api rust tokio vllm
Last synced: 3 months ago
JSON representation
vLLM-compatible inference runtime in pure Rust. Zero Python. Zero libtorch. CUDA via candle.
- Host: GitHub
- URL: https://github.com/avarga1/vllm-hb
- Owner: avarga1
- License: other
- Created: 2026-04-07T04:09:58.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-04-07T21:18:49.000Z (3 months ago)
- Last Synced: 2026-04-07T22:00:16.639Z (3 months ago)
- Topics: candle, cuda, inference, llm, openai-api, rust, tokio, vllm
- Language: Rust
- Size: 218 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 4
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
README
# vllm-hb
**vLLM-compatible inference runtime in pure Rust.**
No Python. No libtorch. No C++ interop. CUDA via [cudarc](https://github.com/coreylowman/cudarc).
```
vllm-hb serve --model /path/to/Llama-3.1-8B-Instruct --port 8000
```
Drop-in replacement for the vLLM OpenAI-compatible server. Any client that
talks to vLLM talks to vllm-hb.
---
## Why
Python vLLM's scheduler runs under the GIL. At high concurrency the
scheduling loop becomes the bottleneck — GC pauses, dynamic dispatch, and
Python object overhead add up.
vllm-hb rewrites the hot path in Rust:
| Component | vLLM | vllm-hb |
|---|---|---|
| Scheduler | Python (GIL) | Rust (`parking_lot::Mutex`, zero-alloc hot path) |
| Tokenizer | Python wrapper | Rust (`tokenizers` crate — the canonical HuggingFace impl) |
| Model forward | Python → libtorch C++ | Rust → CUDA directly via `candle` |
| HTTP server | Python (uvicorn) | Rust (`axum` + `tokio`) |
| Runtime | CPython + pip | Single static binary |
---
## Features
- **OpenAI-compatible API** — `/v1/chat/completions` (streaming + non-streaming), `/v1/models`, `/health`
- **Streaming** — server-sent events, token-by-token
- **Top-p + temperature sampling** — proper nucleus filtering with `rand`
- **Auto chat template** — reads `tokenizer_config.json` and detects Llama-3 / ChatML / Mistral-v1
- **Auto EOS tokens** — reads from `config.json`, no hardcoded token IDs
- **Built-in benchmark** — `vllm-hb bench` measures TTFT, throughput, p50/p99
---
## Getting started
### Requirements
- Rust 1.75+ (`rustup install stable`)
- CUDA 12.x toolkit (`nvcc` in PATH, or `CUDA_ROOT` set)
- NVIDIA GPU (tested on V100, A100, RTX 4090)
### Build
```bash
git clone https://github.com/avarga1/vllm-hb
cd vllm-hb
cargo build --release
```
If `nvcc` is not in your default PATH:
```bash
export CUDA_ROOT=/path/to/cuda
export PATH=$CUDA_ROOT/bin:$PATH
cargo build --release
```
### Run
```bash
# Serve a local HuggingFace model directory
./target/release/vllm-hb serve \
--model /models/Meta-Llama-3.1-8B-Instruct \
--port 8000
# Test it
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 64,
"stream": false
}'
```
### Benchmark
```bash
# Start the server first, then in another terminal:
./target/release/vllm-hb bench \
--base-url http://localhost:8000 \
--n 100 \
--max-tokens 128
```
Example output (NousResearch/Hermes-3-Llama-3.1-8B, NVIDIA V100 32GB, 50 requests, 128 tokens):
```
vllm-hb benchmark
─────────────────────────────────────
server : http://localhost:8000
model : Hermes-3-Llama-3.1-8B
requests : 50
max_tokens : 128
Warmup complete. Running benchmark…
Results
─────────────────────────────────────
throughput : 37.0 tok/s
total tokens : 6400 (50 requests)
TTFT (time to first token)
mean : 3459 ms
p50 : 3458 ms
p99 : 3468 ms
End-to-end latency
p50 : 3458 ms
p99 : 3468 ms
```
> 37 tok/s ≈ 66 % of V100 HBM2 peak bandwidth utilised during single-sequence
> decode — no batching, no flash-attn. TTFT reflects full generation time in
> non-streaming mode; streaming TTFT (time to first token) is typically < 200 ms.
---
## Supported models
Any model in HuggingFace safetensors format with a supported architecture:
| Family | Status |
|---|---|
| Meta Llama 3 / 3.1 / 3.2 | ✅ |
| Mistral 7B / 7B-Instruct | ✅ |
| Mixtral 8x7B / 8x22B | 🔜 planned |
| Qwen 2.5 | 🔜 planned |
| Microsoft Phi-3 | 🔜 planned |
> Additional architectures are gated only by candle-transformers model
> implementations — contributions welcome.
---
## Architecture
```
HTTP request
│
▼
axum handler (src/server.rs)
│ tokenize + submit WorkItem
▼
WorkerHandle ──────► Worker task (src/worker.rs)
│ unbounded channel │
│ │ prefill → decode loop
│ │ sample() per step
│ │
◄───── GenerationEvent ◄┘ token stream back to handler
│
▼
SSE stream / JSON response
```
The `Worker` owns the `Engine` (src/model.rs) and serialises GPU work.
Multiple HTTP handlers can submit requests concurrently; the worker
processes them in order, keeping the GPU saturated.
Sampling lives in `src/sampling.rs` — temperature scaling, softmax, and
top-p nucleus filtering are pure functions over a logit slice.
---
## Roadmap
- [ ] Multi-sequence continuous batching (paged KV cache)
- [ ] Tensor parallelism (multi-GPU)
- [ ] Mistral / Qwen native architecture emitters
- [ ] `--dtype bf16` flag
- [ ] Prometheus metrics endpoint
- [ ] Docker image
---
## License
Apache 2.0