Nano vLLM
https://github.com/GeeeekExplorer/nano-vllm
- Host: GitHub
- URL: https://github.com/GeeeekExplorer/nano-vllm
- Owner: GeeeekExplorer
- License: MIT
- Created: 2025-06-09T16:22:14.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-06-27T10:51:00.000Z (3 months ago)
- Last Synced: 2025-06-27T11:44:38.920Z (3 months ago)
- Language: Python
- Size: 34.2 KB
- Stars: 4,266
- Watchers: 43
- Forks: 473
- Open Issues: 18
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-cuda-and-hpc - Nano-vLLM: A lightweight vLLM implementation built from scratch. (Frameworks)
- StarryDivineSky - GeeeekExplorer/nano-vllm
- awesome-llm-and-aigc - Nano-vLLM: A lightweight vLLM implementation built from scratch. (Summary)
- awesome - GeeeekExplorer/nano-vllm - Nano vLLM (Python)
README
# Nano-vLLM
A lightweight vLLM implementation built from scratch.
## Key Features
* 🚀 **Fast offline inference** - Comparable inference speeds to vLLM
* 📖 **Readable codebase** - Clean implementation in ~1,200 lines of Python code
* ⚡ **Optimization Suite** - Prefix caching, tensor parallelism, Torch compilation, CUDA graphs, etc. (see the sketch below)
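These optimizations are toggled through the `LLM` constructor flags shown in the Quick Start below. A minimal sketch, assuming vLLM-style flag semantics (where `enforce_eager=False` permits CUDA graph capture and `tensor_parallel_size` sets the GPU count); the exact behavior here is an assumption, not documented in this README:

```python
from nanovllm import LLM

# Sketch only: assuming vLLM-style semantics, enforce_eager=False
# lets the engine capture CUDA graphs / use compiled kernels, and
# tensor_parallel_size=2 shards the model across two GPUs.
llm = LLM(
    "/YOUR/MODEL/PATH",
    enforce_eager=False,
    tensor_parallel_size=2,
)
```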
## Installation
```bash
pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
```
## Manual Download
If you prefer to download the model weights manually, use the following command:
```bash
huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False
```
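The same download can also be done from Python with the `huggingface_hub` library (a standard helper from that library, not part of Nano-vLLM):

```python
import os
from huggingface_hub import snapshot_download

# Fetch the Qwen3-0.6B weights into a local directory; partially
# downloaded files are resumed automatically.
snapshot_download(
    repo_id="Qwen/Qwen3-0.6B",
    local_dir=os.path.expanduser("~/huggingface/Qwen3-0.6B/"),
)
```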
## Quick Start
See `example.py` for usage. The API mirrors vLLM's interface with minor differences in the `LLM.generate` method:
```python
from nanovllm import LLM, SamplingParams
llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Nano-vLLM."]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])
```
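For chat-tuned models such as Qwen3, raw prompts usually need the model's chat template applied first. A sketch using the `transformers` tokenizer (the exact approach in `example.py` may differ):

```python
from transformers import AutoTokenizer
from nanovllm import LLM, SamplingParams

path = "/YOUR/MODEL/PATH"
tokenizer = AutoTokenizer.from_pretrained(path)
llm = LLM(path, enforce_eager=True, tensor_parallel_size=1)

# Wrap the user message in the model's chat template so the model
# sees the role markers it was fine-tuned with.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello, Nano-vLLM."}],
    tokenize=False,
    add_generation_prompt=True,
)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0]["text"])
```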
## Benchmark
See `bench.py` for the benchmark script.
**Test Configuration:**
- Hardware: RTX 4070 Laptop (8GB)
- Model: Qwen3-0.6B
- Total Requests: 256 sequences
- Input Length: Randomly sampled between 100–1024 tokens
- Output Length: Randomly sampled between 100–1024 tokens

**Performance Results:**
| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|----------------|-------------|----------|-----------------------|
| vLLM | 133,966 | 98.37 | 1361.84 |
| Nano-vLLM | 133,966 | 93.41 | 1434.13 |
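The throughput column is simply total generated tokens divided by wall-clock time. A minimal sketch of that measurement (a simplification of what `bench.py` likely does; counting tokens by re-encoding the output text is an assumption):

```python
import time
from transformers import AutoTokenizer
from nanovllm import LLM, SamplingParams

path = "/YOUR/MODEL/PATH"
tokenizer = AutoTokenizer.from_pretrained(path)
llm = LLM(path, enforce_eager=True, tensor_parallel_size=1)

prompts = ["Hello, Nano-vLLM."] * 8
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)

# Time one batched generate() call and count the tokens produced.
start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

total_tokens = sum(len(tokenizer.encode(o["text"])) for o in outputs)
print(f"{total_tokens} tokens in {elapsed:.2f}s "
      f"-> {total_tokens / elapsed:.1f} tokens/s")
```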
## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=GeeeekExplorer/nano-vllm&type=Date)](https://www.star-history.com/#GeeeekExplorer/nano-vllm&Date)