https://github.com/back2matching/kvcache-bench

Benchmark every KV cache compression method on your GPU. One command, real numbers. Supports Ollama + llama.cpp.
https://github.com/back2matching/kvcache-bench

benchmark gpu inference kv-cache llama-cpp llm local-llm ollama quantization vram

Last synced: 12 days ago
JSON representation

Benchmark every KV cache compression method on your GPU. One command, real numbers. Supports Ollama + llama.cpp.

Host: GitHub
URL: https://github.com/back2matching/kvcache-bench
Owner: back2matching
License: other
Created: 2026-03-25T07:25:47.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-03-26T03:21:33.000Z (3 months ago)
Last Synced: 2026-03-26T11:52:43.443Z (3 months ago)
Topics: benchmark, gpu, inference, kv-cache, llama-cpp, llm, local-llm, ollama, quantization, vram
Language: Python
Size: 160 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # kvcache-bench

Benchmark every KV cache compression method on your GPU. One command, real numbers.

```

kvcache-bench --model qwen3.5:9b

```

```

| KV Type | Context | Prompt    | Gen tok/s | Prefill tok/s | VRAM +MB | Quality |

|---------|---------|-----------|-----------|---------------|----------|---------|

| f16     | 4096    | short     | 86.9      | 509.5         | +80      | PASS    |

| f16     | 16384   | short     | 78.8      | 784.6         | +316     | PASS    |

| q8_0    | 4096    | short     | 86.7      | 793.1         | +48      | PASS    |

| q8_0    | 16384   | short     | 87.2      | 741.9         | +219     | PASS    |

| q4_0    | 4096    | short     | 86.7      | 798.0         | +59      | PASS    |

| q4_0    | 16384   | short     | 86.7      | 522.7         | +156     | PASS    |

──────────────────────────────────────────────────

RECOMMENDATION

──────────────────────────────────────────────────

  Use q8_0 (8-bit KV cache)

  Speed: 87 tok/s (-0.1% vs f16)

  VRAM: saves 97 MB vs f16 (2x compression)

  Quality: negligible loss (+0.004 perplexity)

  Set: OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_FLASH_ATTENTION=1

```

*Real output from Qwen3.5-9B on RTX 4080 16GB.*

## Why

When you run a local LLM, the KV cache eats your VRAM. Ollama and llama.cpp support different KV cache quantization types (f16, q8_0, q4_0), but nobody tells you what the actual tradeoff is on YOUR hardware.

Current state of the world:

- You Google "ollama kv cache quantization" and find forum posts with conflicting advice

- You manually test each config, eyeball nvidia-smi, and guess

- No tool compares them systematically

kvcache-bench fixes this. It tests every KV cache type on your GPU and gives you a comparison table with speed, VRAM, and quality.

## Install

```bash

pip install kvcache-bench

```

## Usage

```bash

# Auto-detect your first model, test all KV types

kvcache-bench

# Specific model

kvcache-bench --model qwen3.5:9b

# Test at multiple context lengths (where KV savings matter most)

kvcache-bench --model llama3.1:8b --context 4096,8192,16384

# Include tool calling test

kvcache-bench --model qwen3.5:9b --prompts short,code,reasoning,tool_call

# Save results as JSON

kvcache-bench --model qwen3.5:9b --json results.json

# Just show GPU info

kvcache-bench --gpu

# List available models

kvcache-bench --list-models

```

## What It Tests

For each KV cache type (f16, q8_0, q4_0), it measures:

| Metric | How |

|--------|-----|

| **Generation speed** | Tokens per second during generation |

| **Prefill speed** | Tokens per second processing the prompt |

| **VRAM delta** | Extra VRAM used beyond model weights (measured via nvidia-smi) |

| **Quality** | Auto-checked against expected answers (Paris, code structure, reasoning) |

## How It Works

1. Detects your GPU and Ollama installation

2. For each KV cache type: restarts Ollama with `OLLAMA_KV_CACHE_TYPE=`, warms up the model, runs benchmark prompts

3. Measures VRAM before and during inference via nvidia-smi

4. Extracts timing from Ollama's API response (prompt_eval_duration, eval_duration)

5. Checks response quality with simple auto-graders

6. Produces a markdown table (and optional JSON)

## What the Research Says

Based on llama.cpp community benchmarks and our testing:

| KV Type | VRAM Savings | Perplexity Impact | Best For |

|---------|-------------|-------------------|----------|

| f16 | Baseline | None | When you have VRAM to spare |

| q8_0 | 2x | +0.004 (negligible) | **Default recommendation.** Free VRAM, zero quality cost. |

| q4_0 | 4x | +0.2 (noticeable) | When you need max context length or are VRAM-constrained |

The sweet spot for most users: **q8_0**. Halves your KV cache VRAM with essentially zero quality loss.

## Requirements

- Python 3.10+

- NVIDIA GPU with nvidia-smi

- Ollama installed and running

## Roadmap

- [ ] Mixed K/V types (q8 keys + q4 values)

- [ ] Context length sweep charts

- [ ] HuggingFace backend (vLLM, TGI)

- [ ] TurboQuant integration

- [ ] Multi-model matrix

- [ ] HuggingFace Spaces leaderboard

- [ ] Community result submissions

## License

Apache 2.0

## Related

- [turboquant](https://github.com/back2matching/turboquant) -- TurboQuant KV cache compression (sub-4-bit)

- [NVIDIA kvpress](https://github.com/NVIDIA/kvpress) -- KV cache eviction/pruning methods

- [llama.cpp](https://github.com/ggml-org/llama.cpp) -- Where KV cache quantization lives

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/back2matching/kvcache-bench

Awesome Lists containing this project

README