https://github.com/elinx/llm-mem-calculator
Interactive KV cache memory calculator for LLMs — supports MLA, GQA, hybrid attention, sliding window, and linear attention architectures. Estimate GPU memory for serving any model at any context length.
calculator gpu-memory kv-cache llm llm-serving vllm
- Host: GitHub
- URL: https://github.com/elinx/llm-mem-calculator
- Owner: elinx
- Created: 2026-06-06T12:49:48.000Z
- Default Branch: main
- Last Pushed: 2026-06-06T15:01:13.000Z
- Last Synced: 2026-06-06T15:08:15.138Z
- Topics: calculator, gpu-memory, kv-cache, llm, llm-serving, vllm
- Language: JavaScript
- Homepage: https://elinx.github.io/llm-mem-calculator/
- Size: 493 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# KV Cache Calculator
A web-based tool for estimating LLM KV cache memory requirements. Supports modern architectures including MLA, GQA, hybrid attention, sliding window, and linear attention models.
**Live Demo**: [elinx.github.io/llm-mem-calculator](https://elinx.github.io/llm-mem-calculator/)
## Calculator
Calculate KV cache size for a single model with customizable parameters — context length, batch size, KV precision, and more.
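
As a rough illustration of the kind of estimate involved, the commonly used formula for a standard GQA or MHA model multiplies two tensors (keys and values) by the layer count, KV head count, head dimension, context length, batch size, and bytes per element. The sketch below is not taken from this repository's source; the function name `kvCacheBytes` and its parameter names are hypothetical, and the example configuration is assumed to match Llama 3 8B (32 layers, 8 KV heads, head dimension 128).

```javascript
// Hypothetical helper for illustration; not the calculator's actual code.
// Standard GQA/MHA KV cache: one key tensor and one value tensor per layer.
function kvCacheBytes({ layers, kvHeads, headDim, seqLen, batch = 1, bytesPerElem = 2 }) {
  // The leading 2 accounts for keys plus values.
  return 2 * layers * kvHeads * headDim * seqLen * batch * bytesPerElem;
}

// Assumed Llama 3 8B-style config at 8K context, BF16 cache, batch size 1.
const bytes = kvCacheBytes({ layers: 32, kvHeads: 8, headDim: 128, seqLen: 8192 });
console.log((bytes / 2 ** 30).toFixed(2), "GiB"); // prints "1.00 GiB"
```

Under these assumptions each token costs 128 KiB of cache, so doubling the context length or batch size doubles the total.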

## Compare
Compare KV cache memory across multiple models side-by-side with an interactive chart.

## Supported Architectures
| Architecture | Example Models |
|---|---|
| Standard GQA | Qwen3, Llama 3.x, Qwen2.5, MiniMax M2.x |
| MLA (Multi-head Latent Attention) | DeepSeek V3, DeepSeek R1, Kimi K2.5/K2.6 |
| DSA+MLA (DeepSeek V4 Hybrid) | DeepSeek V4 Pro, DeepSeek V4 Flash, DeepSeek V3.2, GLM-5/5.1 |
| Mixed Full + Sliding Window | Gemma 4, Cohere Command, MiMo-V2.5 |
| Linear + Full Hybrid | Qwen3.5, Qwen3.6 |
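
The architectures in the table differ mainly in what is stored per token and per layer. The following is a minimal sketch of two of them under stated assumptions, not the calculator's own formulas: for MLA it assumes each layer caches one compressed latent vector plus a decoupled RoPE key per token, and for mixed full and sliding-window models it assumes sliding layers keep at most `window` tokens. All function and parameter names are hypothetical, and real serving engines may allocate memory differently (for example, in fixed-size blocks).

```javascript
// Hypothetical helpers for illustration; not the calculator's actual code.

// MLA: per token, each layer stores a compressed KV latent (kvLoraRank)
// plus a shared RoPE key (ropeDim) instead of full per-head K and V.
function mlaCacheBytes({ layers, kvLoraRank, ropeDim, seqLen, batch = 1, bytesPerElem = 2 }) {
  return layers * (kvLoraRank + ropeDim) * seqLen * batch * bytesPerElem;
}

// Mixed full + sliding window: full-attention layers grow with context,
// sliding-window layers are capped at the window size.
function mixedWindowCacheBytes({
  fullLayers, slidingLayers, window, kvHeads, headDim, seqLen, batch = 1, bytesPerElem = 2,
}) {
  const perTokenPerLayer = 2 * kvHeads * headDim * bytesPerElem; // keys + values
  const slidingTokens = Math.min(seqLen, window);
  return batch * perTokenPerLayer * (fullLayers * seqLen + slidingLayers * slidingTokens);
}

// Assumed DeepSeek V3-style values: 61 layers, latent rank 512, RoPE dim 64.
const perToken = mlaCacheBytes({ layers: 61, kvLoraRank: 512, ropeDim: 64, seqLen: 1 });
console.log(perToken, "bytes per token"); // prints "70272 bytes per token"
```

The MLA figure is independent of the number of attention heads, which is why these models need far less cache per token than a GQA model of similar size.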
## Features
- **Precision options**: BF16/FP16, FP8/INT8, FP4/INT4
- **Draft KV cache**: Account for MTP/draft model KV layers
- **Linear attention KV**: Include linear attention layer contributions
- **Context presets**: Quick-select from 1K to 1M tokens
- **Breakdown view**: Detailed per-layer KV cache breakdown
- **Formula display**: Shows the exact formula used for each model
- **Dark mode**: Toggle between light and dark themes
- **Chart export**: Download comparison charts as PNG or copy to clipboard
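
To make the precision and draft-layer options above concrete, here is a small sketch of how they could feed into an estimate. It is an assumption-laden illustration rather than the tool's implementation: the `BYTES_PER_ELEM` table, `kvBytesWithOptions`, and `formatBytes` are hypothetical names, draft/MTP layers are assumed to have the same KV shape as the main layers, and the per-block scale factors that quantized caches usually carry are ignored.

```javascript
// Hypothetical helpers for illustration; not the calculator's actual code.
// Nominal element sizes; quantization scale overhead is ignored here.
const BYTES_PER_ELEM = { bf16: 2, fp16: 2, fp8: 1, int8: 1, fp4: 0.5, int4: 0.5 };

function kvBytesWithOptions({
  layers, draftLayers = 0, kvHeads, headDim, seqLen, batch = 1, precision = "bf16",
}) {
  const bytesPerElem = BYTES_PER_ELEM[precision];
  if (bytesPerElem === undefined) {
    throw new Error(`unknown precision: ${precision}`);
  }
  // Assumption: draft/MTP layers cache K and V with the same shape as main layers.
  const totalLayers = layers + draftLayers;
  return 2 * totalLayers * kvHeads * headDim * seqLen * batch * bytesPerElem;
}

// Format a byte count using binary units.
function formatBytes(n) {
  const units = ["B", "KiB", "MiB", "GiB", "TiB"];
  let i = 0;
  while (n >= 1024 && i < units.length - 1) {
    n /= 1024;
    i += 1;
  }
  return `${n.toFixed(2)} ${units[i]}`;
}

const cfg = { layers: 32, kvHeads: 8, headDim: 128, seqLen: 8192 };
console.log(formatBytes(kvBytesWithOptions({ ...cfg, precision: "fp8" }))); // prints "512.00 MiB"
```

Halving the element size halves the estimate, while each extra draft layer adds the cost of one more main layer under the stated assumption.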
## Development
No build step required — just open `index.html` in a browser or serve the directory with any static file server.
```bash
# Quick local server
python3 -m http.server 8765
```
## License
MIT