https://github.com/ifuryst/nanollmserve
🌱 A tiny, readable LLM serving engine with vLLM/SGLang-style features.
https://github.com/ifuryst/nanollmserve
ai-infra continuous-batching fastapi inference kv-cache llm llm-engine llm-inference llm-infra llm-serving lora openai-compatible paged-attention pytorch quantization sglang speculative-decoding teaching transformers vllm
Last synced: 7 days ago
JSON representation
🌱 A tiny, readable LLM serving engine with vLLM/SGLang-style features.
- Host: GitHub
- URL: https://github.com/ifuryst/nanollmserve
- Owner: iFurySt
- License: mit
- Created: 2026-05-16T04:09:53.000Z (27 days ago)
- Default Branch: main
- Last Pushed: 2026-05-28T06:29:07.000Z (15 days ago)
- Last Synced: 2026-05-28T08:14:43.308Z (15 days ago)
- Topics: ai-infra, continuous-batching, fastapi, inference, kv-cache, llm, llm-engine, llm-inference, llm-infra, llm-serving, lora, openai-compatible, paged-attention, pytorch, quantization, sglang, speculative-decoding, teaching, transformers, vllm
- Language: Python
- Homepage:
- Size: 273 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Codeowners: CODEOWNERS
- Security: SECURITY.md
- Agents: AGENTS.md
Awesome Lists containing this project
README
# nanoLLMServe
[](./README.md)
[](./README.zh-CN.md)
`nanoLLMServe` is a tiny, readable LLM serving engine.
The goal is to build a small vLLM/SGLang-style system that can actually run,
while keeping each production serving idea easy to inspect:
- OpenAI-compatible API
- KV cache decode
- batching and continuous batching
- block KV cache management
- prefix cache
- chunked prefill
- metrics, structured output, speculative decoding, LoRA, quantization, and distributed serving
It is not trying to be faster than vLLM. It is trying to make the serving stack
understandable.
## Roadmap
Milestones live in [`docs/exec-plans/active/milestones`](./docs/exec-plans/active/milestones).
Start with [`v0.0-naive-single-request`](./docs/exec-plans/active/milestones/v0.0-naive-single-request.md),
expose it through the [`v0.1 OpenAI-compatible server`](./docs/exec-plans/active/milestones/v0.1-openai-compatible-server.md),
then switch generation to [`v0.2 KV cache decode`](./docs/exec-plans/active/milestones/v0.2-kv-cache-decode.md).
## Quick Start
Install the package in a virtual environment:
```bash
uv venv --python python3
source .venv/bin/activate
uv pip install -e ".[dev]"
```
Generate text for one prompt:
```bash
nanollm-generate \
--model Qwen/Qwen3-1.7B \
--prompt "Explain KV cache in one sentence." \
--max-new-tokens 32 \
--temperature 0 \
--show-stats
```
On the A100 harness machine, use the cached local weights:
```bash
nanollm-generate \
--model /data2/nanoLLMServe/models/Qwen3-1.7B \
--local-files-only \
--prompt "Explain KV cache in one sentence." \
--max-new-tokens 32 \
--temperature 0 \
--show-stats
```
## v0.2 KV Cache Decode
The default generation path now runs one prefill pass over the prompt and then
uses Hugging Face `past_key_values` so each later decode step only receives the
last generated token.
Run the benchmark with a v0.0-style naive baseline comparison:
```bash
python benchmarks/benchmark_generate.py \
--model /data2/nanoLLMServe/models/Qwen3-1.7B \
--local-files-only \
--runs 3 \
--warmup 1
```
The JSON output includes `kv_cache_decode.mean_ttft_seconds`,
`kv_cache_decode.mean_tpot_seconds`, and a `comparison` section with elapsed and
TPOT speedup against the naive full-sequence loop.
## v0.3 Static Batching (Teaching-Scale)
You can benchmark fixed-size static batching with `--batch-size`:
```bash
python benchmarks/benchmark_generate.py \
--model /data2/nanoLLMServe/models/Qwen3-1.7B \
--local-files-only \
--batch-size 4 \
--runs 5 \
--warmup 2
```
The output will include `static_batch`, including batch elapsed time and mean
row-level token throughput for the fixed-size group.
## v0.4 Continuous Batching (Teaching-Scale)
When `--batch-size` is greater than 1, the benchmark now also runs a
teaching-scale continuous batching path. It admits requests over scheduler
steps, rebuilds the active batch each step, and reports active batch sizes:
```bash
python benchmarks/benchmark_generate.py \
--model /data2/nanoLLMServe/models/Qwen3-1.7B \
--local-files-only \
--batch-size 4 \
--runs 5 \
--warmup 2 \
--skip-naive-baseline
```
The output includes `continuous_batch.active_batch_sizes` and
`continuous_batch.mean_active_batch_size`. This milestone intentionally
recomputes full active rows; paged KV cache for dynamic rows belongs to v0.5.
## v0.5 Block KV Cache Manager
v0.5 adds allocator metadata for fixed-size KV blocks: a free block pool,
request-to-block tables, allocation/release hooks in generation, and observable
fragmentation metrics. It demonstrates the memory-management motivation behind
PagedAttention without adding a custom GPU kernel yet.
Run the synthetic fragmentation benchmark:
```bash
python benchmarks/benchmark_block_manager.py \
--block-size 16 \
--total-blocks 64 \
--request-tokens 9,17,33,5,41,12
```
The output compares block allocation against a contiguous fixed-slot baseline
and reports `internal_fragmentation_tokens`, `block_utilization`, and
`fragmentation_tokens_saved_vs_contiguous`.
## v0.6 Prefix Cache
v0.6 adds a teaching-scale prefix cache for the single-request generation path.
It hashes block-aligned prompt prefixes, stores sliced Hugging Face
`past_key_values`, tracks hit/miss/ref-count/LRU state, and lets later requests
with the same prefix prefill only the uncached suffix.
Run the repeated-prefix benchmark:
```bash
python benchmarks/benchmark_prefix_cache.py \
--model /data2/nanoLLMServe/models/Qwen3-1.7B \
--local-files-only \
--runs 3 \
--warmup 1
```
The output compares `no_prefix_cache` with `prefix_cache` and reports
`cache_hits`, `cache_misses`, `mean_ttft_seconds`, `mean_prefill_seconds`, and
TTFT/prefill speedup ratios.
## v0.7 Chunked Prefill
v0.7 adds a decode-first chunked prefill scheduler for mixed long/short prompt
workloads. Long prompts are split by `max_prefill_tokens_per_step`, while short
remaining prefills can run before the long prompt consumes another scheduler
step.
Run the mixed workload benchmark:
```bash
python benchmarks/benchmark_chunked_prefill.py \
--model /data2/nanoLLMServe/models/Qwen3-1.7B \
--local-files-only \
--max-prefill-tokens-per-step 64
```
The output compares an arrival-order monolithic prefill baseline with
`chunked_prefill`, including `prefill_tokens_per_step`,
`short_first_token_step`, and `short_time_to_first_token_speedup`.
## v0.1 OpenAI-Compatible Server
Serve one local or Hugging Face causal LM:
```bash
nanollm-serve \
--model /data2/nanoLLMServe/models/Qwen3-1.7B \
--served-model-name Qwen3-1.7B \
--local-files-only \
--host 127.0.0.1 \
--port 8000
```
List models:
```bash
curl http://127.0.0.1:8000/v1/models
```
Call the recommended Responses endpoint:
```bash
curl http://127.0.0.1:8000/v1/responses \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen3-1.7B",
"instructions": "Answer in one sentence.",
"input": "Explain KV cache.",
"max_output_tokens": 32,
"temperature": 0
}'
```
Call the chat completions endpoint:
```bash
curl http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen3-1.7B",
"messages": [{"role": "user", "content": "Explain KV cache in one sentence."}],
"max_tokens": 32,
"temperature": 0
}'
```
Streaming uses OpenAI-style server-sent events:
```bash
curl -N http://127.0.0.1:8000/v1/responses \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen3-1.7B",
"input": "KV cache is",
"max_output_tokens": 16,
"temperature": 0,
"stream": true
}'
```
## License
[MIT](LICENSE)