https://github.com/iro96/turboquant-h

A more deep research about TurboQuant algorithms
https://github.com/iro96/turboquant-h

algorithms llm llm-quantization machine-learning turboquant

Last synced: 3 months ago
JSON representation

A more deep research about TurboQuant algorithms

Host: GitHub
URL: https://github.com/iro96/turboquant-h
Owner: Iro96
Created: 2026-04-06T14:44:52.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-04-06T23:05:36.000Z (3 months ago)
Last Synced: 2026-04-07T01:10:11.323Z (3 months ago)
Topics: algorithms, llm, llm-quantization, machine-learning, turboquant
Language: Python
Homepage:
Size: 40 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# TurboQuant-H

TurboQuant-H is a research-oriented benchmark harness for experimenting with hierarchical KV-cache compression. A stronger hybrid design that keeps the mathematical core of TurboQuant. This algorithm uses data-oblivious, two-stage vector quantization scheme for large language model key–value (KV) cache compression, combining random rotation plus scalar quantization with a 1-bit Quantized Johnson–Lindenstrauss (QJL) residual to preserve both mean-squared error (MSE) and inner-product fidelity.

## New Contributor Docs

Start with the guides in [`docs/`](./docs/README.md):

- [`docs/project-map.md`](./docs/project-map.md)
- [`docs/runtime-flow.md`](./docs/runtime-flow.md)
- [`docs/compression-guide.md`](./docs/compression-guide.md)

## Project Layout

## Quick Start

Run the original script name:

```bash
python turboquant_h_smollm_benchmark.py --prompt "Explain KV cache compression."
```

Or use the package CLI with `src` on `PYTHONPATH`:

```bash
$env:PYTHONPATH = "src"
python -m turboquant_h.cli --prompt "Explain KV cache compression."
```

If you install the project:

```bash
pip install -e .
turboquant-h-benchmark --prompt "Explain KV cache compression."
```

## Example Options

```bash
python turboquant_h_smollm_benchmark.py \
--model HuggingFaceTB/SmolLM-135M-Instruct \
--quant_bits_old 2 \
--rotation_mode hadamard \
--correction_type low_rank \
--prompt "Explain why KV cache compression matters in one paragraph."
```

## Module Guide

- `config.py`: benchmark configuration, validation, and result dataclasses.
- `benchmark.py`: model loading, generation loop, and end-to-end benchmark execution.
- `reporting.py`: terminal-friendly result formatting.
- `compression/packing.py`: bit-packing and unpacking utilities.
- `compression/rotation.py`: Hadamard-based rotation helpers.
- `compression/quantization.py`: tensor quantization and dequantization.
- `compression/correction.py`: low-rank and QJL-style residual correction logic.
- `compression/cache.py`: compressed KV-cache block, segment, and cache classes.
- `compression/attention.py`: compressed attention execution and model patching.

## Validation

The included tests are lightweight and focus on deterministic utilities:

```bash
python -m unittest discover -s tests
```

## Notes

- This is still a research harness, not a production CUDA implementation.
- The benchmark currently targets transformer attention modules with Llama-style projections.
- `TurboQuant-H_paper_draft.md` remains as supporting design context.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/iro96/turboquant-h

Awesome Lists containing this project

README