https://github.com/togethercomputer/saw-int4
Official implementation of Paper "System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving"
https://github.com/togethercomputer/saw-int4
Last synced: 2 months ago
JSON representation
Official implementation of Paper "System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving"
- Host: GitHub
- URL: https://github.com/togethercomputer/saw-int4
- Owner: togethercomputer
- License: mit
- Created: 2026-04-13T20:42:59.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-04-17T01:09:56.000Z (3 months ago)
- Last Synced: 2026-04-17T03:08:11.109Z (3 months ago)
- Language: Shell
- Size: 99.6 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# saw-int4
saw-int4 is the official implementation of
**<>**
This repository implements Block Diagonal Rotation (BDR) for KV-cache quantization, along with system-level optimizations that seamlessly integrate into SGLang. The resulting system achieves near-BF16 accuracy while preserving the end-to-end performance benefits of INT4.
## Contents
- [Introduction](#introduction)
- [How to run BDR](#how-to-run-bdr)
- [Get the code](#get-the-code)
- [Server requirements](#server-requirements)
- [Install BDR (sglang-fast-rotation)](#install-bdr-sglang-fast-rotation)
- [Run BDR](#run-bdr)
- [Quick demo (verify your install)](#quick-demo-verify-your-install)
- [Primary accuracy and throughput](#primary-accuracy-and-throughput)
- [Accuracy (primary)](#accuracy-primary)
- [Prepare](#prepare)
- [RUN-GPQA](#run-gpqa)
- [Accuracy results (primary)](#accuracy-results-primary)
- [Throughput and latency (primary)](#throughput-and-latency-primary)
- [Prepare (genai-bench)](#prepare-genai-bench)
- [Speed results (primary)](#speed-results-primary)
- [Ablation study (k-means, k-means + rotation)](#ablation-study-k-means-k-means--rotation)
- [Install sglang-kmeans](#install-sglang-kmeans)
- [KV calibration (ablation only)](#kv-calibration-ablation-only)
- [Ablation method matrix](#ablation-method-matrix)
- [Accuracy results (ablation)](#accuracy-results-ablation)
- [Repository layout](#repository-layout)
- [Full reproduction](#full-reproduction)
- [License](#license)
## Introduction
This work studies **4-bit KV-cache quantization** under **real serving constraints** such as paged memory layouts, regular memory access, and fused attention execution. Our primary method, **BDR (block-diagonal rotation)**, applies a **block-diagonal Hadamard rotation** to the KV cache before **token-wise INT4 KV-cache quantization**, implemented directly inside a **fork of [SGLang](https://github.com/sgl-project/sglang)**.
We ship two submodule branches on the same fork remote:
- **[third_party/sglang-fast-rotation](third_party/sglang-fast-rotation)** — **Our proposed BDR implementation:** fused block-diagonal rotation + INT4 KV-cache write. Use this fork for **both accuracy and throughput** on **BF16**, **INT4**, and **BDR** (the main paper numbers).
- **[third_party/sglang-kmeans](third_party/sglang-kmeans)** — **Ablation study for kmeans, kmeans+rotation:** KV dump, k-means centroids, and k-means + rotation variants. Not required to reproduce the core BDR vs BF16 vs INT4 story.
Pinned commits: [SUBMODULE_VERSIONS.md](SUBMODULE_VERSIONS.md).
## How to run BDR
This section covers everything needed to run BDR on **`third_party/sglang-fast-rotation`**: get the code, install, and launch a server.
### Get the code
```bash
git clone --recurse-submodules https://github.com/togethercomputer/saw-int4.git
cd saw-int4
```
If you cloned without submodules: `git submodule update --init third_party/sglang-fast-rotation`.
### Server requirements
The BDR implementation is built on top of the SGLang codebase and currently assumes the following setup:
- **MHA models only** — **MLA** and other non-MHA layouts are **not supported** for these KV / BDR settings.
- **Prefill backend:** **`fa3`**.
- **Decode backend:** **`triton`**.
### Install BDR
```bash
cd third_party/sglang-fast-rotation/python
pip install -e ".[all]"
pip install --no-build-isolation "git+https://github.com/Dao-AILab/fast-hadamard-transform.git"
```
### Run BDR
**BF16 KV (baseline)**
```bash
python -m sglang.launch_server \
--prefill-attention-backend fa3 \
--decode-attention-backend triton \
--model-path "Qwen/Qwen3-4B-Thinking-2507" \
--port 30000 \
--kv-cache-dtype auto
```
**Original INT4 KV**
```bash
python -m sglang.launch_server \
--prefill-attention-backend fa3 \
--decode-attention-backend triton \
--model-path "Qwen/Qwen3-4B-Thinking-2507" \
--port 30000 \
--kv-cache-dtype int4
```
**BDR (block diagnoal rotation on K)**
```bash
HADAMARD=1 HADAMARD_ORDER=128 python -m sglang.launch_server \
--prefill-attention-backend fa3 \
--decode-attention-backend triton \
--model-path "Qwen/Qwen3-4B-Thinking-2507" \
--port 30000 \
--kv-cache-dtype int4
```
For the full env variable reference, and the complete mode matrix, see [docs/bdr_env_vars.md](docs/bdr_env_vars.md).
### Quick demo (verify your install)
With the server running in **any** of the three modes above, run the smoke-test script from the repository root:
```bash
pip install openai # if not already installed
python scripts/bdr_smoke_test.py --port 30001 --model Qwen/Qwen3-4B-Thinking-2507
```
The script sends a **GPQA sample question** to the server and streams the response.
```
Server : http://0.0.0.0:30000/v1
Model : Qwen/Qwen3-4B-Thinking-2507
--- Prompt (GPQA sample) ---
Answer the following multiple choice question.....
...
--- Response ---
```
## Primary accuracy and throughput
**Accuracy** (simple-evals / GPQA) and **throughput** ([genai-bench](https://github.com/sgl-project/genai-bench)) both use **`third_party/sglang-fast-rotation`**; server setup is in [How to run BDR](#how-to-run-bdr). **Accuracy model:** **`Qwen/Qwen3-4B-Thinking-2507`**. **Throughput model:** **`Qwen/Qwen3-8B`** (override `MODEL_PATH` in scripts if you align checkpoints).
### Accuracy (primary)
#### Prepare
**Prerequisite (GPQA client):** **[openai/simple-evals](https://github.com/openai/simple-evals)** is included as a submodule at **`third_party/simple-evals`**.
```bash
git submodule update --init --checkout third_party/simple-evals
cd third_party/simple-evals
mkdir -p simple_evals
touch simple_evals/__init__.py
pip install openai pandas requests jinja2 tqdm numpy
```
Add a local model alias once in `third_party/simple-evals/simple_evals.py` inside the `models = { ... }` dictionary so `simple-evals` and set max_tokens=32768:
```python
"qwen3_4b": ChatCompletionSampler(
model="Qwen/Qwen3-4B-Thinking-2507",
system_message=OPENAI_SYSTEM_MESSAGE_API,
max_tokens=32768,
),
```
#### RUN-GPQA
With **simple-evals** installed and the SGLang server already up (start it in the desired mode from [Run BDR](#run-bdr), using **`Qwen/Qwen3-4B-Thinking-2507`** as the model), point the client at **`http://127.0.0.1:/v1`** and run GPQA:
```bash
cd third_party/simple-evals
export OPENAI_BASE_URL="http://127.0.0.1:30000/v1"
export OPENAI_API_KEY="dummy"
python -m simple-evals.simple_evals --model qwen3_4b --eval gpqa --n-repeats 3
```
#### Accuracy results (primary, temp=0.6, seq=32k and top=0.95)
| Model | Method | Benchmark | Score |
|-------|--------|-----------|-------|
| Qwen/Qwen3-4B-Thinking-2507 | BF16 | GPQA | 66.6667 |
| Qwen/Qwen3-4B-Thinking-2507 | INT4 | GPQA | 0 |
| Qwen/Qwen3-4B-Thinking-2507 | BDR (K-only) | GPQA | 65.8249 |
### Throughput and latency (primary)
Speed results use **sglang-fast-rotation** (fused INT4 + BDR kernels) with **`Qwen/Qwen3-8B`**, driven by **[genai-bench](https://github.com/sgl-project/genai-bench)** against the server’s OpenAI-compatible HTTP API. Helper: [scripts/run_genai_bench_example.sh](scripts/run_genai_bench_example.sh) (default `MODEL_PATH`). Full CLI, traffic scenarios, Excel/plots: [GenAI Bench docs](https://docs.sglang.ai/genai-bench/getting-started/) and [Run benchmark](https://docs.sglang.ai/genai-bench/user-guide/run-benchmark/).
#### Prepare (genai-bench)
**Prerequisite (throughput client):** install genai-bench (separate from the SGLang venv if you prefer):
```bash
pip install genai-bench
```
Optional (quieter HF logs during tokenizer load): `export TRANSFORMERS_VERBOSITY=error`. For Docker / dev installs, see the upstream [installation guide](https://docs.sglang.ai/genai-bench/getting-started/installation/).
**Terminal 1 — server** (example BF16 KV):
```bash
cd third_party/sglang-fast-rotation/python
python -m sglang.launch_server \
--prefill-attention-backend fa3 \
--decode-attention-backend triton \
--model-path "Qwen/Qwen3-8B" \
--port 30000 \
--kv-cache-dtype int4
```
**Terminal 2 — client** (after `pip install genai-bench`; matches ~256 input / 32 output tokens and concurrency 16 — see [traffic scenarios](https://docs.sglang.ai/genai-bench/user-guide/scenario-definition/)):
```bash
genai-bench benchmark --api-backend sglang \
--api-base "http://127.0.0.1:30000" \
--api-key "dummy" \
--api-model-name "Qwen/Qwen3-8B" \
--model-tokenizer "Qwen/Qwen3-8B" \
--task text-to-text \
--traffic-scenario "D(256,32)" \
--num-concurrency 16 \
--max-time-per-run 5 \
--max-requests-per-run 200 \
--server-engine "SGLang" \
--server-gpu-type "local" \
--server-version "custom" \
--server-gpu-count 1
```
Tune `--max-time-per-run`, `--max-requests-per-run`, `--num-concurrency`, and `--traffic-scenario` using `genai-bench benchmark --help` and the docs above. Label runs with accurate `--server-gpu-type` / `--server-version` when publishing numbers.
**Sweep BF16 vs INT4 vs BDR:** restart the server with the right env and `--kv-cache-dtype`, then rerun **genai-bench** with **identical** client flags.
| Config | Env | `--kv-cache-dtype` |
|--------|-----|-------------------|
| BF16 KV | `HADAMARD=0` | `auto` |
| INT4 KV | `HADAMARD=0` | `int4` |
| BDR + INT4 | `HADAMARD=1` `ROTATE_V=0` `HADAMARD_ORDER=128` | `int4` |
SGLang’s built-in `bench_serving` ([bench_serving](https://github.com/sgl-project/sglang/blob/main/docs/developer_guide/bench_serving.md)) is optional; this repo standardizes on **genai-bench** for comparable sweeps and reporting.
**Hub:** [eval_speed/](eval_speed/)
**Helper:** [scripts/run_genai_bench_example.sh](scripts/run_genai_bench_example.sh)
#### Speed results (primary)
Hardware: 1× H100 80 GB, TP=1. Model: `Qwen/Qwen3-8B`.
Client: [genai-bench](https://github.com/sgl-project/genai-bench). Metric definitions: [eval_speed/metrics.md](eval_speed/metrics.md).
**Short context — `D(256, 1024)` (256 input / 1024 output tokens)**
Cap: 5 min or 256 requests. Results: [eval_speed/results/20260416_203040/](eval_speed/results/20260416_203040/)
| KV config | Conc | output_tps(job) | mean_input_tps(req) | mean_output_tps(req) | mean_ttft(req) (ms) | E2E mean(req) (s) | E2E p75(req) (s) | E2E p90(req) (s) | total requests | Wall (s) |
|-----------|-----:|----------------:|--------------------:|---------------------:|--------------------:|------------------:|-----------------:|-----------------:|---------------:|---------:|
| BF16 | 32 | 3,795 | 1,573 | 122.1 | 196 | 8.57 | 8.60 | 8.62 | 256 | 69 |
| INT4 | 32 | 3,687 | 1,380 | 120.9 | 225 | 8.69 | 8.71 | 8.75 | 256 | 71 |
| INT4 + BDR (K-only, ord=128) | 32 | 3,689 | 1,379 | 120.2 | 226 | 8.74 | 8.74 | 8.76 | 256 | 71 |
| BF16 | 64 | 5,950 | 796 | 98.7 | 369 | 10.74 | 10.78 | 10.82 | 256 | 44 |
| INT4 | 64 | 6,371 | 774 | 105.0 | 370 | 10.11 | 10.16 | 10.20 | 256 | 41 |
| INT4 + BDR (K-only, ord=128) | 64 | 6,235 | 755 | 104.3 | 377 | 10.19 | 10.24 | 10.26 | 256 | 42 |
| BF16 | 128 | 8,410 | 455 | 71.8 | 657 | 14.92 | 15.00 | 15.11 | 256 | 31 |
| INT4 | 128 | 9,544 | 437 | 81.0 | 665 | 13.30 | 13.38 | 13.45 | 256 | 28 |
| INT4 + BDR (K-only, ord=128) | 128 | 9,350 | 458 | 80.1 | 655 | 13.43 | 13.51 | 13.60 | 256 | 28 |
| BF16 | 256 | 11,195 | 242 | 49.3 | 1,224 | 22.00 | 22.15 | 22.24 | 256 | 23 |
| INT4 | 256 | 11,624 | 225 | 51.1 | 1,237 | 21.25 | 21.50 | 21.57 | 256 | 23 |
| INT4 + BDR (K-only, ord=128) | 256 | 11,732 | 266 | 51.6 | 1,148 | 20.99 | 21.12 | 21.19 | 256 | 22 |
**Long context — `D(16384, 1024)` (16 384 input / 1024 output tokens)**
Cap: 20 min or 64–256 requests (varies by concurrency). Results: [eval_speed/results/20260416_214449/](eval_speed/results/20260416_214449/) (conc 8–64), [eval_speed/results/20260416_233035/](eval_speed/results/20260416_233035/) (conc 128)
| KV config | Conc | output_tps(job) | mean_input_tps(req) | mean_output_tps(req) | mean_ttft(req) (ms) | E2E mean(req) (s) | E2E p75(req) (s) | E2E p90(req) (s) | total requests | Wall (s) |
|-----------|-----:|----------------:|--------------------:|---------------------:|--------------------:|------------------:|-----------------:|-----------------:|---------------:|---------:|
| BF16 | 8 | 414 | 8,311 | 61.4 | 2,636 | 19.37 | 19.53 | 19.65 | 64 | 158 |
| INT4 | 8 | 458 | 8,391 | 69.2 | 2,631 | 17.50 | 17.67 | 17.77 | 64 | 143 |
| INT4 + BDR (K-only, ord=128) | 8 | 457 | 8,784 | 68.7 | 2,523 | 17.50 | 17.69 | 17.78 | 64 | 143 |
| BF16 | 16 | 481 | 4,413 | 36.7 | 5,104 | 33.14 | 33.48 | 33.65 | 64 | 136 |
| INT4 | 16 | 571 | 4,672 | 45.4 | 4,956 | 27.74 | 28.04 | 28.28 | 64 | 115 |
| INT4 + BDR (K-only, ord=128) | 16 | 568 | 4,083 | 44.8 | 4,875 | 27.94 | 28.30 | 28.54 | 64 | 116 |
| BF16 | 32 | 570 | 1,741 | 32.9 | 18,047 | 49.58 | 73.20 | 73.64 | 64 | 115 |
| INT4 | 32 | 618 | 2,147 | 25.4 | 9,568 | 50.45 | 51.11 | 51.49 | 64 | 106 |
| INT4 + BDR (K-only, ord=128) | 32 | 616 | 2,215 | 25.1 | 9,350 | 50.57 | 51.23 | 51.62 | 64 | 107 |
| BF16 | 64 | 471 | 806 | 32.7 | 44,798 | 76.91 | 112.33 | 113.22 | 64 | 139 |
| INT4 | 64 | 666 | 1,114 | 14.7 | 19,398 | 90.46 | 91.70 | 92.51 | 64 | 98 |
| INT4 + BDR (K-only, ord=128) | 64 | 663 | 1,150 | 14.4 | 18,371 | 90.78 | 92.06 | 92.83 | 64 | 99 |
| BF16 | 128 | 559 | 310 | 32.9 | 113,583 | 145.96 | 220.85 | 221.91 | 148 | 271 |
| INT4 | 128 | 701 | 527 | 12.3 | 57,654 | 142.19 | 208.11 | 210.82 | 153 | 224 |
| INT4 + BDR (K-only, ord=128) | 128 | 701 | 535 | 12.3 | 57,054 | 142.09 | 208.05 | 210.73 | 153 | 224 |
## Ablation study (k-means, k-means + rotation)
Use **`third_party/sglang-kmeans`**: KV dump for calibration, [tools/fit_kv_centroids.py](tools/fit_kv_centroids.py), then `SGLANG_KV_CENTROIDS_PATH` for **k-means + INT4** and **k-means + BDR** (optional `HADAMARD` / `ROTATE_V`). Accuracy still uses **simple-evals** from **`third_party/simple-evals`** ([Prepare](#prepare); run GPQA per upstream docs).
### Install sglang-kmeans
Not needed for primary BF16 / INT4 / BDR ([How to run BDR](#how-to-run-bdr)). Initialize the submodule (skipped by default), then install:
```bash
git submodule update --init third_party/sglang-kmeans
cd third_party/sglang-kmeans/python
pip install -e ".[all]"
pip install "flash-kmeans @ git+https://github.com/jindajia/flash-kmeans.git"
```
### KV calibration (ablation only)
Primary BF16 / INT4 / BDR does **not** need this step.
**1. Dump KV activations** — run from **sglang-kmeans** with a **BF16 KV cache** (`auto`) so dumps are in calibration space:
```bash
cd third_party/sglang-kmeans/python
export DUMP_KVCACHE=true
export DUMP_KVCACHE_TOKENS=512
export DUMP_KVCACHE_DIR=/path/to/kv_dumps
python -m sglang.launch_server \
--prefill-attention-backend fa3 \
--decode-attention-backend triton \
--model-path "Qwen/Qwen3-8B" \
--port 30000 \
--kv-cache-dtype auto
```
Drive enough traffic so each layer hits the threshold at least once. Files appear as `kv_calibration_layer_.pt` (dict with `k`, `v`, `indices` on CPU; see `triton_backend.py` in the submodule for selection logic).
**2. Fit centroids offline** — from the **repository root**:
```bash
python tools/fit_kv_centroids.py \
--dump-dir /path/to/kv_dumps \
--out-dir /path/to/centroids_out \
--n-clusters 16 \
--seed 0
```
This writes `k_layer_L_clusters__centers.pt` and `v_layer_L_clusters__centers.pt` per global layer `L`, shaped `(N, num_kv_heads_global * head_dim)`, for loading in the submodule.
**3. Run INT4 + k-means inference**
```bash
export N_CLUSTERS=16
export SGLANG_KV_CENTROIDS_PATH=/path/to/centroids_out
python -m sglang.launch_server \
--prefill-attention-backend fa3 \
--decode-attention-backend triton \
--model-path "Qwen/Qwen3-8B" \
--port 30000 \
--kv-cache-dtype int4
```
**K-means + BDR:** keep `SGLANG_KV_CENTROIDS_PATH`, set `HADAMARD=1`, optional `ROTATE_V`, and `HADAMARD_ORDER` consistent with head dimension (same as primary BDR).
### Ablation method matrix
| Method | `HADAMARD` | `ROTATE_V` | `HADAMARD_ORDER` | `--kv-cache-dtype` | `SGLANG_KV_CENTROIDS_PATH` | `N_CLUSTERS` |
|--------|------------|------------|------------------|---------------------|----------------------------|--------------|
| K-means + INT4 | `0` | `0` | n/a | `int4` | required | match files |
| K-means + BDR | `1` | `0` or `1` | set | `int4` | required | match files |
**K-means + INT4 example:**
```bash
cd third_party/sglang-kmeans/python
export OPENAI_API_KEY=dummy
export N_CLUSTERS=16
export SGLANG_KV_CENTROIDS_PATH=/path/to/centroids_out
export HADAMARD=0
export ROTATE_V=0
python -m sglang.launch_server \
--prefill-attention-backend fa3 \
--decode-attention-backend triton \
--model-path "Qwen/Qwen3-8B" --port 30000 --kv-cache-dtype int4
```
**K-means + BDR example:**
```bash
export HADAMARD=1
export ROTATE_V=0
export HADAMARD_ORDER=16
export N_CLUSTERS=16
export SGLANG_KV_CENTROIDS_PATH=/path/to/centroids_out
python -m sglang.launch_server \
--prefill-attention-backend fa3 \
--decode-attention-backend triton \
--model-path "Qwen/Qwen3-8B" --port 30000 --kv-cache-dtype int4
```
**Hub:** [eval_accuracy/](eval_accuracy/)
**Helper:** `CENTROIDS=/path/to/centroids_out ./scripts/run_eval_matrix.sh kmeans` or `kmeans_bdr`.
#### Accuracy results (ablation)
| Model | Method | Benchmark | Score |
|-------|--------|-----------|-------|
| — | K-means + INT4 | — | — |
| — | K-means + BDR | — | — |
Fill from [eval_accuracy/results/](eval_accuracy/results/).
## Repository layout
| Path | Role |
|------|------|
| [third_party/sglang-fast-rotation/](third_party/sglang-fast-rotation/) | **Primary** BF16 / INT4 / BDR — accuracy + speed |
| [third_party/sglang-kmeans/](third_party/sglang-kmeans/) | **Ablation** k-means KV + dump / centroids |
| [third_party/simple-evals/](third_party/simple-evals/) | **GPQA accuracy client** (openai/simple-evals submodule; no separate clone needed) |
| [docs/bdr_env_vars.md](docs/bdr_env_vars.md) | Full BDR env variable reference and mode matrix |
| [scripts/](scripts/) | `bdr_smoke_test.py` (install smoke test), `run_primary_eval_matrix.sh`, `run_eval_matrix.sh`, `run_genai_bench_example.sh`, `clone_submodules.sh` |
| [tools/](tools/) | `fit_kv_centroids.py` (ablation calibration) |
| [eval_primary/](eval_primary/) | Primary **accuracy** logs / tables |
| [eval_speed/](eval_speed/) | Primary **throughput** logs / tables |
| [eval_accuracy/](eval_accuracy/) | Ablation **accuracy** logs / tables |
## Full reproduction
Large raw bundles may live outside this repo.
- **Full reproduction bundle:** *TBD — add URL*
Submodule SHAs: [SUBMODULE_VERSIONS.md](SUBMODULE_VERSIONS.md).
## License
See [LICENSE](LICENSE).