{"id":49308725,"url":"https://github.com/togethercomputer/saw-int4","last_synced_at":"2026-04-26T11:03:08.491Z","repository":{"id":351852766,"uuid":"1209838820","full_name":"togethercomputer/saw-int4","owner":"togethercomputer","description":"Official implementation of Paper \"System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving\"","archived":false,"fork":false,"pushed_at":"2026-04-17T01:09:56.000Z","size":102,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-17T03:08:11.109Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/togethercomputer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-13T20:42:59.000Z","updated_at":"2026-04-17T01:01:08.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/togethercomputer/saw-int4","commit_stats":null,"previous_names":["togethercomputer/system-aware-4-bit-kv-cache-quantization","togethercomputer/sys-aware-kv-int4","togethercomputer/saw-int4"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/togethercomputer/saw-int4","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/togethercomputer%2Fsaw-int4","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/togethercomputer%2Fsaw-int4/tags","releases_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/togethercomputer%2Fsaw-int4/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/togethercomputer%2Fsaw-int4/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/togethercomputer","download_url":"https://codeload.github.com/togethercomputer/saw-int4/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/togethercomputer%2Fsaw-int4/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32294592,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-26T09:34:17.070Z","status":"ssl_error","status_checked_at":"2026-04-26T09:34:00.993Z","response_time":129,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-26T11:03:06.320Z","updated_at":"2026-04-26T11:03:08.469Z","avatar_url":"https://github.com/togethercomputer.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# saw-int4\n\nsaw-int4 is the official implementation of  \n**\u003c\u003cSAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving\u003e\u003e**\n\nThis repository implements Block Diagonal Rotation (BDR) for KV-cache quantization, along with system-level optimizations that seamlessly integrate into SGLang. 
The resulting system achieves near-BF16 accuracy while preserving the end-to-end performance benefits of INT4.\n\n## Contents\n\n- [Introduction](#introduction)\n- [How to run BDR](#how-to-run-bdr)\n  - [Get the code](#get-the-code)\n  - [Server requirements](#server-requirements)\n  - [Install BDR (sglang-fast-rotation)](#install-bdr-sglang-fast-rotation)\n  - [Run BDR](#run-bdr)\n  - [Quick demo (verify your install)](#quick-demo-verify-your-install)\n- [Primary accuracy and throughput](#primary-accuracy-and-throughput)\n  - [Accuracy (primary)](#accuracy-primary)\n    - [Prepare](#prepare)\n    - [RUN-GPQA](#run-gpqa)\n    - [Accuracy results (primary)](#accuracy-results-primary)\n  - [Throughput and latency (primary)](#throughput-and-latency-primary)\n    - [Prepare (genai-bench)](#prepare-genai-bench)\n    - [Speed results (primary)](#speed-results-primary)\n- [Ablation study (k-means, k-means + rotation)](#ablation-study-k-means-k-means--rotation)\n  - [Install sglang-kmeans](#install-sglang-kmeans)\n  - [KV calibration (ablation only)](#kv-calibration-ablation-only)\n  - [Ablation method matrix](#ablation-method-matrix)\n    - [Accuracy results (ablation)](#accuracy-results-ablation)\n- [Repository layout](#repository-layout)\n- [Full reproduction](#full-reproduction)\n- [License](#license)\n\n## Introduction\n\nThis work studies **4-bit KV-cache quantization** under **real serving constraints** such as paged memory layouts, regular memory access, and fused attention execution. Our primary method, **BDR (block-diagonal rotation)**, applies a **block-diagonal Hadamard rotation** to the KV cache before **token-wise INT4 KV-cache quantization**, implemented directly inside a **fork of [SGLang](https://github.com/sgl-project/sglang)**.\n\nWe ship two submodule branches on the same fork remote:\n\n- **[third_party/sglang-fast-rotation](third_party/sglang-fast-rotation)** — **Our proposed BDR implementation:** fused block-diagonal rotation + INT4 KV-cache write. 
Use this fork for **both accuracy and throughput** on **BF16**, **INT4**, and **BDR** (the main paper numbers).\n- **[third_party/sglang-kmeans](third_party/sglang-kmeans)** — **Ablation study for k-means and k-means + rotation:** KV dump, k-means centroids, and k-means + rotation variants. Not required to reproduce the core BDR vs BF16 vs INT4 story.\n\nPinned commits: [SUBMODULE_VERSIONS.md](SUBMODULE_VERSIONS.md).\n\n## How to run BDR\n\nThis section covers everything needed to run BDR on **`third_party/sglang-fast-rotation`**: get the code, install, and launch a server.\n\n### Get the code\n\n```bash\ngit clone --recurse-submodules https://github.com/togethercomputer/saw-int4.git\ncd saw-int4\n```\n\nIf you cloned without submodules: `git submodule update --init third_party/sglang-fast-rotation`.\n\n### Server requirements\n\n\nThe BDR implementation is built on top of the SGLang codebase and currently assumes the following setup:\n\n- **MHA models only** — **MLA** and other non-MHA layouts are **not supported** for these KV / BDR settings.\n- **Prefill backend:** **`fa3`**.\n- **Decode backend:** **`triton`**.\n\n### Install BDR (sglang-fast-rotation)\n\n```bash\ncd third_party/sglang-fast-rotation/python\npip install -e \".[all]\"\npip install --no-build-isolation \"git+https://github.com/Dao-AILab/fast-hadamard-transform.git\"\n```\n\n\n### Run BDR\n\n**BF16 KV (baseline)**\n```bash\npython -m sglang.launch_server \\\n  --prefill-attention-backend fa3 \\\n  --decode-attention-backend triton \\\n  --model-path \"Qwen/Qwen3-4B-Thinking-2507\" \\\n  --port 30000 \\\n  --kv-cache-dtype auto\n```\n\n**Original INT4 KV**\n```bash\npython -m sglang.launch_server \\\n  --prefill-attention-backend fa3 \\\n  --decode-attention-backend triton \\\n  --model-path \"Qwen/Qwen3-4B-Thinking-2507\" \\\n  --port 30000 \\\n  --kv-cache-dtype int4\n```\n\n**BDR (block-diagonal rotation on K)**\n```bash\nHADAMARD=1 HADAMARD_ORDER=128 python -m sglang.launch_server \\\n  --prefill-attention-backend fa3 \\\n  
--decode-attention-backend triton \\\n  --model-path \"Qwen/Qwen3-4B-Thinking-2507\" \\\n  --port 30000 \\\n  --kv-cache-dtype int4\n```\n\nFor the full environment-variable reference and the complete mode matrix, see [docs/bdr_env_vars.md](docs/bdr_env_vars.md).\n\n### Quick demo (verify your install)\n\nWith the server running in **any** of the three modes above, run the smoke-test script from the repository root:\n\n```bash\npip install openai   # if not already installed\npython scripts/bdr_smoke_test.py --port 30000 --model Qwen/Qwen3-4B-Thinking-2507\n```\n\nThe script sends a **GPQA sample question** to the server and streams the response.\n\n```\nServer : http://0.0.0.0:30000/v1\nModel  : Qwen/Qwen3-4B-Thinking-2507\n\n--- Prompt (GPQA sample) ---\nAnswer the following multiple choice question.....\n...\n\n--- Response ---\n\u003cmodel reasoning and answer streamed here\u003e\n```\n\n\n## Primary accuracy and throughput\n\n**Accuracy** (simple-evals / GPQA) and **throughput** ([genai-bench](https://github.com/sgl-project/genai-bench)) both use **`third_party/sglang-fast-rotation`**; server setup is in [How to run BDR](#how-to-run-bdr). **Accuracy model:** **`Qwen/Qwen3-4B-Thinking-2507`**. **Throughput model:** **`Qwen/Qwen3-8B`** (override `MODEL_PATH` in scripts if you align checkpoints).\n\n### Accuracy (primary)\n\n#### Prepare\n\n**Prerequisite (GPQA client):** **[openai/simple-evals](https://github.com/openai/simple-evals)** is included as a submodule at **`third_party/simple-evals`**.\n\n```bash\ngit submodule update --init --checkout third_party/simple-evals\ncd third_party/simple-evals\nmkdir -p simple_evals\ntouch simple_evals/__init__.py\npip install openai pandas requests jinja2 tqdm numpy\n```\n\nAdd a local model alias once in `third_party/simple-evals/simple_evals.py` inside the `models = { ... 
}` dictionary so that `simple-evals` can resolve `--model qwen3_4b`, and set `max_tokens=32768`:\n\n```python\n\"qwen3_4b\": ChatCompletionSampler(\n    model=\"Qwen/Qwen3-4B-Thinking-2507\",\n    system_message=OPENAI_SYSTEM_MESSAGE_API,\n    max_tokens=32768,\n),\n```\n\n#### RUN-GPQA\nWith **simple-evals** installed and the SGLang server already up (start it in the desired mode from [Run BDR](#run-bdr), using **`Qwen/Qwen3-4B-Thinking-2507`** as the model), point the client at **`http://127.0.0.1:\u003cport\u003e/v1`** and run GPQA:\n\n```bash\ncd third_party/simple-evals\nexport OPENAI_BASE_URL=\"http://127.0.0.1:30000/v1\"\nexport OPENAI_API_KEY=\"dummy\"\npython -m simple-evals.simple_evals --model qwen3_4b --eval gpqa --n-repeats 3\n```\n\n\n#### Accuracy results (primary, temp=0.6, seq=32k and top=0.95)\n\n| Model | Method | Benchmark | Score |\n|-------|--------|-----------|-------|\n| Qwen/Qwen3-4B-Thinking-2507 | BF16 | GPQA | 66.6667 |\n| Qwen/Qwen3-4B-Thinking-2507 | INT4 | GPQA | 0 |\n| Qwen/Qwen3-4B-Thinking-2507 | BDR (K-only) | GPQA | 65.8249 |\n\n\n### Throughput and latency (primary)\n\nSpeed results use **sglang-fast-rotation** (fused INT4 + BDR kernels) with **`Qwen/Qwen3-8B`**, driven by **[genai-bench](https://github.com/sgl-project/genai-bench)** against the server’s OpenAI-compatible HTTP API. Helper: [scripts/run_genai_bench_example.sh](scripts/run_genai_bench_example.sh) (default `MODEL_PATH`). Full CLI, traffic scenarios, Excel/plots: [GenAI Bench docs](https://docs.sglang.ai/genai-bench/getting-started/) and [Run benchmark](https://docs.sglang.ai/genai-bench/user-guide/run-benchmark/).\n\n#### Prepare (genai-bench)\n\n**Prerequisite (throughput client):** install genai-bench (separate from the SGLang venv if you prefer):\n\n```bash\npip install genai-bench\n```\n\nOptional (quieter HF logs during tokenizer load): `export TRANSFORMERS_VERBOSITY=error`. 
For Docker / dev installs, see the upstream [installation guide](https://docs.sglang.ai/genai-bench/getting-started/installation/).\n\n**Terminal 1 — server** (example INT4 KV; use `--kv-cache-dtype auto` for BF16 KV):\n\n```bash\ncd third_party/sglang-fast-rotation/python\npython -m sglang.launch_server \\\n  --prefill-attention-backend fa3 \\\n  --decode-attention-backend triton \\\n  --model-path \"Qwen/Qwen3-8B\" \\\n  --port 30000 \\\n  --kv-cache-dtype int4\n```\n\n**Terminal 2 — client** (after `pip install genai-bench`; matches ~256 input / 32 output tokens and concurrency 16 — see [traffic scenarios](https://docs.sglang.ai/genai-bench/user-guide/scenario-definition/)):\n\n```bash\ngenai-bench benchmark --api-backend sglang \\\n  --api-base \"http://127.0.0.1:30000\" \\\n  --api-key \"dummy\" \\\n  --api-model-name \"Qwen/Qwen3-8B\" \\\n  --model-tokenizer \"Qwen/Qwen3-8B\" \\\n  --task text-to-text \\\n  --traffic-scenario \"D(256,32)\" \\\n  --num-concurrency 16 \\\n  --max-time-per-run 5 \\\n  --max-requests-per-run 200 \\\n  --server-engine \"SGLang\" \\\n  --server-gpu-type \"local\" \\\n  --server-version \"custom\" \\\n  --server-gpu-count 1\n```\n\nTune `--max-time-per-run`, `--max-requests-per-run`, `--num-concurrency`, and `--traffic-scenario` using `genai-bench benchmark --help` and the docs above. 
Label runs with accurate `--server-gpu-type` / `--server-version` when publishing numbers.\n\n**Sweep BF16 vs INT4 vs BDR:** restart the server with the right env and `--kv-cache-dtype`, then rerun **genai-bench** with **identical** client flags.\n\n| Config | Env | `--kv-cache-dtype` |\n|--------|-----|-------------------|\n| BF16 KV | `HADAMARD=0` | `auto` |\n| INT4 KV | `HADAMARD=0` | `int4` |\n| BDR + INT4 | `HADAMARD=1` `ROTATE_V=0` `HADAMARD_ORDER=128` | `int4` |\n\nSGLang’s built-in `bench_serving` ([bench_serving](https://github.com/sgl-project/sglang/blob/main/docs/developer_guide/bench_serving.md)) is optional; this repo standardizes on **genai-bench** for comparable sweeps and reporting.\n\n**Hub:** [eval_speed/](eval_speed/)  \n**Helper:** [scripts/run_genai_bench_example.sh](scripts/run_genai_bench_example.sh)\n\n#### Speed results (primary)\n\nHardware: 1× H100 80 GB, TP=1. Model: `Qwen/Qwen3-8B`.  \nClient: [genai-bench](https://github.com/sgl-project/genai-bench). Metric definitions: [eval_speed/metrics.md](eval_speed/metrics.md).\n\n**Short context — `D(256, 1024)` (256 input / 1024 output tokens)**  \nCap: 5 min or 256 requests. 
Results: [eval_speed/results/20260416_203040/](eval_speed/results/20260416_203040/)\n\n| KV config | Conc | output_tps(job) | mean_input_tps(req) | mean_output_tps(req) | mean_ttft(req) (ms) | E2E mean(req) (s) | E2E p75(req) (s) | E2E p90(req) (s) | total requests | Wall (s) |\n|-----------|-----:|----------------:|--------------------:|---------------------:|--------------------:|------------------:|-----------------:|-----------------:|---------------:|---------:|\n| BF16      |  32 |  3,795 | 1,573 | 122.1 |    196 |  8.57 |  8.60 |  8.62 | 256 |  69 |\n| INT4      |  32 |  3,687 | 1,380 | 120.9 |    225 |  8.69 |  8.71 |  8.75 | 256 |  71 |\n| INT4 + BDR (K-only, ord=128) |  32 |  3,689 | 1,379 | 120.2 |    226 |  8.74 |  8.74 |  8.76 | 256 |  71 |\n| BF16      |  64 |  5,950 |   796 |  98.7 |    369 | 10.74 | 10.78 | 10.82 | 256 |  44 |\n| INT4      |  64 |  6,371 |   774 | 105.0 |    370 | 10.11 | 10.16 | 10.20 | 256 |  41 |\n| INT4 + BDR (K-only, ord=128) |  64 |  6,235 |   755 | 104.3 |    377 | 10.19 | 10.24 | 10.26 | 256 |  42 |\n| BF16      | 128 |  8,410 |   455 |  71.8 |    657 | 14.92 | 15.00 | 15.11 | 256 |  31 |\n| INT4      | 128 |  9,544 |   437 |  81.0 |    665 | 13.30 | 13.38 | 13.45 | 256 |  28 |\n| INT4 + BDR (K-only, ord=128) | 128 |  9,350 |   458 |  80.1 |    655 | 13.43 | 13.51 | 13.60 | 256 |  28 |\n| BF16      | 256 | 11,195 |   242 |  49.3 |  1,224 | 22.00 | 22.15 | 22.24 | 256 |  23 |\n| INT4      | 256 | 11,624 |   225 |  51.1 |  1,237 | 21.25 | 21.50 | 21.57 | 256 |  23 |\n| INT4 + BDR (K-only, ord=128) | 256 | 11,732 |   266 |  51.6 |  1,148 | 20.99 | 21.12 | 21.19 | 256 |  22 |\n\n**Long context — `D(16384, 1024)` (16 384 input / 1024 output tokens)**  \nCap: 20 min or 64–256 requests (varies by concurrency). 
Results: [eval_speed/results/20260416_214449/](eval_speed/results/20260416_214449/) (conc 8–64), [eval_speed/results/20260416_233035/](eval_speed/results/20260416_233035/) (conc 128)\n\n| KV config | Conc | output_tps(job) | mean_input_tps(req) | mean_output_tps(req) | mean_ttft(req) (ms) | E2E mean(req) (s) | E2E p75(req) (s) | E2E p90(req) (s) | total requests | Wall (s) |\n|-----------|-----:|----------------:|--------------------:|---------------------:|--------------------:|------------------:|-----------------:|-----------------:|---------------:|---------:|\n| BF16      |   8 |   414 |  8,311 | 61.4 |  2,636 | 19.37 | 19.53 | 19.65 | 64 | 158 |\n| INT4      |   8 |   458 |  8,391 | 69.2 |  2,631 | 17.50 | 17.67 | 17.77 | 64 | 143 |\n| INT4 + BDR (K-only, ord=128) |   8 |   457 |  8,784 | 68.7 |  2,523 | 17.50 | 17.69 | 17.78 | 64 | 143 |\n| BF16      |  16 |   481 |  4,413 | 36.7 |  5,104 | 33.14 | 33.48 | 33.65 | 64 | 136 |\n| INT4      |  16 |   571 |  4,672 | 45.4 |  4,956 | 27.74 | 28.04 | 28.28 | 64 | 115 |\n| INT4 + BDR (K-only, ord=128) |  16 |   568 |  4,083 | 44.8 |  4,875 | 27.94 | 28.30 | 28.54 | 64 | 116 |\n| BF16      |  32 |   570 |  1,741 | 32.9 | 18,047 | 49.58 | 73.20 | 73.64 | 64 | 115 |\n| INT4      |  32 |   618 |  2,147 | 25.4 |  9,568 | 50.45 | 51.11 | 51.49 | 64 | 106 |\n| INT4 + BDR (K-only, ord=128) |  32 |   616 |  2,215 | 25.1 |  9,350 | 50.57 | 51.23 | 51.62 | 64 | 107 |\n| BF16      |  64 |   471 |    806 | 32.7 | 44,798 | 76.91 | 112.33 | 113.22 | 64 | 139 |\n| INT4      |  64 |   666 |  1,114 | 14.7 | 19,398 | 90.46 | 91.70 | 92.51 | 64 |  98 |\n| INT4 + BDR (K-only, ord=128) |  64 |   663 |  1,150 | 14.4 | 18,371 | 90.78 | 92.06 | 92.83 | 64 |  99 |\n| BF16      | 128 |   559 |    310 | 32.9 | 113,583 | 145.96 | 220.85 | 221.91 | 148 | 271 |\n| INT4      | 128 |   701 |    527 | 12.3 |  57,654 | 142.19 | 208.11 | 210.82 | 153 | 224 |\n| INT4 + BDR (K-only, ord=128) | 128 |   701 |    535 | 12.3 |  57,054 | 142.09 | 208.05 | 
210.73 | 153 | 224 |\n\n## Ablation study (k-means, k-means + rotation)\n\nUse **`third_party/sglang-kmeans`**: KV dump for calibration, [tools/fit_kv_centroids.py](tools/fit_kv_centroids.py), then `SGLANG_KV_CENTROIDS_PATH` for **k-means + INT4** and **k-means + BDR** (optional `HADAMARD` / `ROTATE_V`). Accuracy still uses **simple-evals** from **`third_party/simple-evals`** ([Prepare](#prepare); run GPQA per upstream docs).\n\n### Install sglang-kmeans\n\nNot needed for primary BF16 / INT4 / BDR ([How to run BDR](#how-to-run-bdr)). Initialize the submodule (skipped by default), then install:\n\n```bash\ngit submodule update --init third_party/sglang-kmeans\ncd third_party/sglang-kmeans/python\npip install -e \".[all]\"\npip install \"flash-kmeans @ git+https://github.com/jindajia/flash-kmeans.git\"\n```\n\n### KV calibration (ablation only)\n\nPrimary BF16 / INT4 / BDR does **not** need this step.\n\n**1. Dump KV activations** — run from **sglang-kmeans** with a **BF16 KV cache** (`auto`) so dumps are in calibration space:\n\n```bash\ncd third_party/sglang-kmeans/python\n\nexport DUMP_KVCACHE=true\nexport DUMP_KVCACHE_TOKENS=512\nexport DUMP_KVCACHE_DIR=/path/to/kv_dumps\n\npython -m sglang.launch_server \\\n  --prefill-attention-backend fa3 \\\n  --decode-attention-backend triton \\\n  --model-path \"Qwen/Qwen3-8B\" \\\n  --port 30000 \\\n  --kv-cache-dtype auto\n```\n\nDrive enough traffic so each layer hits the threshold at least once. Files appear as `kv_calibration_layer_\u003clayer_id\u003e.pt` (dict with `k`, `v`, `indices` on CPU; see `triton_backend.py` in the submodule for selection logic).\n\n**2. 
Fit centroids offline** — from the **repository root**:\n\n```bash\npython tools/fit_kv_centroids.py \\\n  --dump-dir /path/to/kv_dumps \\\n  --out-dir /path/to/centroids_out \\\n  --n-clusters 16 \\\n  --seed 0\n```\n\nThis writes `k_layer_L_clusters_\u003cN\u003e_centers.pt` and `v_layer_L_clusters_\u003cN\u003e_centers.pt` per global layer `L`, shaped `(N, num_kv_heads_global * head_dim)`, for loading in the submodule.\n\n**3. Run INT4 + k-means inference**\n\n```bash\nexport N_CLUSTERS=16\nexport SGLANG_KV_CENTROIDS_PATH=/path/to/centroids_out\n\npython -m sglang.launch_server \\\n  --prefill-attention-backend fa3 \\\n  --decode-attention-backend triton \\\n  --model-path \"Qwen/Qwen3-8B\" \\\n  --port 30000 \\\n  --kv-cache-dtype int4\n```\n\n**K-means + BDR:** keep `SGLANG_KV_CENTROIDS_PATH`, set `HADAMARD=1`, optional `ROTATE_V`, and `HADAMARD_ORDER` consistent with head dimension (same as primary BDR).\n\n### Ablation method matrix\n\n| Method | `HADAMARD` | `ROTATE_V` | `HADAMARD_ORDER` | `--kv-cache-dtype` | `SGLANG_KV_CENTROIDS_PATH` | `N_CLUSTERS` |\n|--------|------------|------------|------------------|---------------------|----------------------------|--------------|\n| K-means + INT4 | `0` | `0` | n/a | `int4` | required | match files |\n| K-means + BDR | `1` | `0` or `1` | set | `int4` | required | match files |\n\n**K-means + INT4 example:**\n\n```bash\ncd third_party/sglang-kmeans/python\nexport OPENAI_API_KEY=dummy\nexport N_CLUSTERS=16\nexport SGLANG_KV_CENTROIDS_PATH=/path/to/centroids_out\nexport HADAMARD=0\nexport ROTATE_V=0\npython -m sglang.launch_server \\\n  --prefill-attention-backend fa3 \\\n  --decode-attention-backend triton \\\n  --model-path \"Qwen/Qwen3-8B\" --port 30000 --kv-cache-dtype int4\n```\n\n**K-means + BDR example:**\n\n```bash\nexport HADAMARD=1\nexport ROTATE_V=0\nexport HADAMARD_ORDER=16\nexport N_CLUSTERS=16\nexport SGLANG_KV_CENTROIDS_PATH=/path/to/centroids_out\npython -m sglang.launch_server \\\n  
--prefill-attention-backend fa3 \\\n  --decode-attention-backend triton \\\n  --model-path \"Qwen/Qwen3-8B\" --port 30000 --kv-cache-dtype int4\n```\n\n**Hub:** [eval_accuracy/](eval_accuracy/)  \n**Helper:** `CENTROIDS=/path/to/centroids_out ./scripts/run_eval_matrix.sh kmeans` or `kmeans_bdr`.\n\n#### Accuracy results (ablation)\n\n| Model | Method | Benchmark | Score |\n|-------|--------|-----------|-------|\n| — | K-means + INT4 | — | — |\n| — | K-means + BDR | — | — |\n\nFill from [eval_accuracy/results/](eval_accuracy/results/).\n\n## Repository layout\n\n| Path | Role |\n|------|------|\n| [third_party/sglang-fast-rotation/](third_party/sglang-fast-rotation/) | **Primary** BF16 / INT4 / BDR — accuracy + speed |\n| [third_party/sglang-kmeans/](third_party/sglang-kmeans/) | **Ablation** k-means KV + dump / centroids |\n| [third_party/simple-evals/](third_party/simple-evals/) | **GPQA accuracy client** (openai/simple-evals submodule; no separate clone needed) |\n| [docs/bdr_env_vars.md](docs/bdr_env_vars.md) | Full BDR env variable reference and mode matrix |\n| [scripts/](scripts/) | `bdr_smoke_test.py` (install smoke test), `run_primary_eval_matrix.sh`, `run_eval_matrix.sh`, `run_genai_bench_example.sh`, `clone_submodules.sh` |\n| [tools/](tools/) | `fit_kv_centroids.py` (ablation calibration) |\n| [eval_primary/](eval_primary/) | Primary **accuracy** logs / tables |\n| [eval_speed/](eval_speed/) | Primary **throughput** logs / tables |\n| [eval_accuracy/](eval_accuracy/) | Ablation **accuracy** logs / tables |\n\n## Full reproduction\n\nLarge raw bundles may live outside this repo.\n\n- **Full reproduction bundle:** *TBD — add URL*\n\nSubmodule SHAs: [SUBMODULE_VERSIONS.md](SUBMODULE_VERSIONS.md).\n\n## License\n\nSee 
[LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftogethercomputer%2Fsaw-int4","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftogethercomputer%2Fsaw-int4","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftogethercomputer%2Fsaw-int4/lists"}