Nano vLLM
https://github.com/GeeeekExplorer/nano-vllm
- Host: GitHub
- URL: https://github.com/GeeeekExplorer/nano-vllm
- Owner: GeeeekExplorer
- License: MIT
- Created: 2025-06-09T16:22:14.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-06-27T10:51:00.000Z (3 months ago)
- Last Synced: 2025-06-27T11:44:38.920Z (3 months ago)
- Language: Python
- Size: 34.2 KB
- Stars: 4,266
- Watchers: 43
- Forks: 473
- Open Issues: 18
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-cuda-and-hpc - Nano-vLLM: A lightweight vLLM implementation built from scratch. (Frameworks)
- StarryDivineSky - GeeeekExplorer/nano-vllm
- awesome-llm-and-aigc - Nano-vLLM: A lightweight vLLM implementation built from scratch. (Summary)
- awesome - GeeeekExplorer/nano-vllm - Nano vLLM (Python)
README
# Nano-vLLM
A lightweight vLLM implementation built from scratch.
## Key Features
* 🚀 **Fast offline inference** - Comparable inference speeds to vLLM
* 📖 **Readable codebase** - Clean implementation in ~1,200 lines of Python code
* ⚡ **Optimization Suite** - Prefix caching, tensor parallelism, Torch compilation, CUDA graphs, etc. (see the sketch below)
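These optimizations are toggled through the `LLM` constructor flags shown in the Quick Start below. A minimal sketch, assuming vLLM-style flag semantics (where `enforce_eager=False` permits CUDA graph capture and `tensor_parallel_size` sets the GPU count); the exact behavior here is an assumption, not documented in this README:

```python
from nanovllm import LLM

# Sketch only: assuming vLLM-style semantics, enforce_eager=False
# lets the engine capture CUDA graphs / use compiled kernels, and
# tensor_parallel_size=2 shards the model across two GPUs.
llm = LLM(
    "/YOUR/MODEL/PATH",
    enforce_eager=False,
    tensor_parallel_size=2,
)
```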
## Installation
```bash
pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
```
## Manual Download
If you prefer to download the model weights manually, use the following command:
```bash
huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False
```
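The same download can also be done from Python with the `huggingface_hub` library (a standard helper from that library, not part of Nano-vLLM):

```python
import os
from huggingface_hub import snapshot_download

# Fetch the Qwen3-0.6B weights into a local directory; partially
# downloaded files are resumed automatically.
snapshot_download(
    repo_id="Qwen/Qwen3-0.6B",
    local_dir=os.path.expanduser("~/huggingface/Qwen3-0.6B/"),
)
```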
## Quick Start
See `example.py` for usage. The API mirrors vLLM's interface with minor differences in the `LLM.generate` method:
```python
from nanovllm import LLM, SamplingParams
llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Nano-vLLM."]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])
```
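For chat-tuned models such as Qwen3, raw prompts usually need the model's chat template applied first. A sketch using the `transformers` tokenizer (the exact approach in `example.py` may differ):

```python
from transformers import AutoTokenizer
from nanovllm import LLM, SamplingParams

path = "/YOUR/MODEL/PATH"
tokenizer = AutoTokenizer.from_pretrained(path)
llm = LLM(path, enforce_eager=True, tensor_parallel_size=1)

# Wrap the user message in the model's chat template so the model
# sees the role markers it was fine-tuned with.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello, Nano-vLLM."}],
    tokenize=False,
    add_generation_prompt=True,
)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0]["text"])
```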
## Benchmark
See `bench.py` for the benchmark script.
**Test Configuration:**
- Hardware: RTX 4070 Laptop (8GB)
- Model: Qwen3-0.6B
- Total Requests: 256 sequences
- Input Length: Randomly sampled between 100–1024 tokens
- Output Length: Randomly sampled between 100–1024 tokens

**Performance Results:**
| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|----------------|-------------|----------|-----------------------|
| vLLM | 133,966 | 98.37 | 1361.84 |
| Nano-vLLM | 133,966 | 93.41 | 1434.13 |
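The throughput column is simply total generated tokens divided by wall-clock time. A minimal sketch of that measurement (a simplification of what `bench.py` likely does; counting tokens by re-encoding the output text is an assumption):

```python
import time
from transformers import AutoTokenizer
from nanovllm import LLM, SamplingParams

path = "/YOUR/MODEL/PATH"
tokenizer = AutoTokenizer.from_pretrained(path)
llm = LLM(path, enforce_eager=True, tensor_parallel_size=1)

prompts = ["Hello, Nano-vLLM."] * 8
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)

# Time one batched generate() call and count the tokens produced.
start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

total_tokens = sum(len(tokenizer.encode(o["text"])) for o in outputs)
print(f"{total_tokens} tokens in {elapsed:.2f}s "
      f"-> {total_tokens / elapsed:.1f} tokens/s")
```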
## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=GeeeekExplorer/nano-vllm&type=Date)](https://www.star-history.com/#GeeeekExplorer/nano-vllm&Date)