https://github.com/LeanModels/DFloat11

DFloat11: Lossless LLM Compression for Efficient GPU Inference
Host: GitHub
URL: https://github.com/LeanModels/DFloat11
Owner: LeanModels
License: apache-2.0
Created: 2025-04-15T16:39:29.000Z (over 1 year ago)
Default Branch: master
Last Pushed: 2025-11-24T09:46:56.000Z (8 months ago)
Last Synced: 2025-11-27T22:14:29.081Z (8 months ago)
Topics: compression, gpu, llm, lossless-compression-algorithm
Language: Python
Homepage:
Size: 66.4 KB
Stars: 564
Watchers: 12
Forks: 33
Open Issues: 21
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

awesome-ai-efficiency - DFloat11: Lossless Compression of LLMs and Diffusion Models for Efficient GPU Inference (Research Articles 📄)
README

# DFloat11: Lossless Compression of LLMs and Diffusion Models for Efficient GPU Inference

[![PyPI version](https://img.shields.io/pypi/v/dfloat11.svg?color=blue)](https://pypi.org/project/dfloat11/)

[![arXiv](https://img.shields.io/badge/arXiv-2504.11651-b31b1b.svg)](https://arxiv.org/abs/2504.11651)

[![Hugging Face](https://img.shields.io/badge/Model-%F0%9F%A4%97-yellow.svg)](https://huggingface.co/DFloat11)

**DFloat11** is a lossless compression framework that reduces the size of Large Language Models (LLMs) and diffusion models (e.g., FLUX.1 and Qwen-Image) by approximately **30%** while producing outputs that are **bit-for-bit identical** to those of the original model. It enables efficient GPU inference on resource-constrained hardware without sacrificing any accuracy.

## 📰 News

- [09/18/2025] Our research paper is accepted to NeurIPS 2025! Hope to see you at the San Diego Convention Center in December!

- [08/24/2025] Compression code released!

  * Reduce the size of any model by 30% with DFloat11 compression.

  * Get started here: [examples/compress_flux1](https://github.com/LeanModels/DFloat11/tree/master/examples/compress_flux1).

- [07/29/2025] Efficient CPU Offloading Now Supported!

  * Our latest update enables highly memory-efficient inference by keeping only one transformer block in GPU memory at a time. For example, CPU offloading reduces peak GPU memory for inference of **FLUX.1-Krea-dev from 17.5 to 9.8 GB, Qwen3-8B from 12.4 to 2.3 GB, and HiDream-I1-Full from 26.4 to 9.6 GB**.

  * An example of using CPU offloading with FLUX.1-Krea-dev-DF11 can be found [here](https://huggingface.co/DFloat11/FLUX.1-Krea-dev-DF11).

  * To enable CPU offloading, simply set `cpu_offload=True` when calling `DFloat11Model.from_pretrained(...)`.

- [05/23/2025] **Wan2.1** support is now live! [`DFloat11/Wan2.1-T2V-14B-Diffusers-DF11`](https://huggingface.co/DFloat11/Wan2.1-T2V-14B-Diffusers-DF11)

  * Text-to-video generation with DFloat11 *Wan2.1 14B* using only 24GB VRAM!

  * Get started here: [examples/wan2.1](https://github.com/LeanModels/DFloat11/tree/master/examples/wan2.1).

- [05/06/2025] **DFloat11 now supports [`FLUX.1-dev`](https://huggingface.co/black-forest-labs/FLUX.1-dev)**

  * 🖼️ Generate stunning text-to-image results on GPUs with **less than 24GB VRAM** --- no quality lost!

  * 📂 Get started here: [examples/flux.1](https://github.com/LeanModels/DFloat11/tree/master/examples/flux.1).

- [05/05/2025] The `dfloat11` pip package has been upgraded to `v0.2.0`! Run `pip install -U dfloat11[cuda12]` to upgrade to the latest version. We have made the following important changes:

  * We added support for Qwen 3, Gemma 3, and Phi 4!

  * The GPU decompression kernel is now 20-40% faster! We achieved this by improving thread occupancy and applying a number of other optimizations.

  * The DFloat11 models are now stored in safetensors format for better safety and loading performance.

  * When using a DFloat11 model, only the compressed model is downloaded, not the original model.

## 📦 Installation

Requires a CUDA-compatible GPU (with CUDA 12) and [PyTorch](https://pytorch.org/get-started/locally/) installed.

To install from PyPI:

```bash
pip install -U dfloat11[cuda12]
```

[Optional] To compile the GPU kernel and install locally:

```bash
nvcc -O3 -ptx dfloat11/decode.cu -o dfloat11/decode.ptx
pip install .[cuda12]
```

## 🔍 How It Works

DFloat11 compresses model weights using **Huffman coding** of BFloat16 exponent bits, combined with **hardware-aware algorithmic designs** that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are **decompressed just before matrix multiplications**, then **immediately discarded after use** to minimize memory footprint.
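To make the Huffman-coding step above more concrete, here is a minimal, CPU-only sketch in plain Python. It is not the DFloat11 GPU kernel or its storage format: the weights are synthetic Gaussian samples, BFloat16 values are obtained by truncating float32 (real conversions usually round), and the helper names `bf16_fields`, `huffman_code`, and `decode` are made up for this illustration. It also assumes that only the 8 exponent bits are entropy-coded, while the sign bit and the 7 mantissa bits are stored as-is.

```python
import heapq
import math
import random
import struct
from collections import Counter


def bf16_fields(x: float) -> tuple[int, int, int]:
    """Split a value into BFloat16 (sign, exponent, mantissa) by truncating float32."""
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    bits16 = bits32 >> 16  # BFloat16 keeps the top 16 bits of float32
    return bits16 >> 15, (bits16 >> 7) & 0xFF, bits16 & 0x7F


def huffman_code(freqs: Counter) -> dict[int, str]:
    """Build a prefix-free code; frequent symbols get shorter bit strings."""
    # Heap entries: (weight, unique tie breaker, {symbol: code built so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single distinct symbol still needs one bit
        return {sym: "0" for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (w0 + w1, tie, merged))
        tie += 1
    return heap[0][2]


def decode(bitstream: str, code: dict[int, str]) -> list[int]:
    """Invert the prefix-free code by reading bits until a codeword matches."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bitstream:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]  # synthetic stand-in
exponents = [bf16_fields(w)[1] for w in weights]

freqs = Counter(exponents)
code = huffman_code(freqs)
bitstream = "".join(code[e] for e in exponents)

n = len(exponents)
entropy = -sum((c / n) * math.log2(c / n) for c in freqs.values())
avg_exponent_bits = len(bitstream) / n
bits_per_weight = 1 + 7 + avg_exponent_bits  # assumed: sign + mantissa stay raw

assert decode(bitstream, code) == exponents  # lossless round trip
print(f"exponent entropy: {entropy:.2f} bits")
print(f"Huffman average : {avg_exponent_bits:.2f} bits")
print(f"bits per weight : {bits_per_weight:.2f} (vs 16 for BFloat16)")
```

On data like this, the average exponent code length lands well below 8 bits, which is where the size reduction comes from. The exact figure depends on the weight distribution, so treat the printed numbers as illustrative only.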

Key benefits:

* **No CPU decompression or host-device data transfer**: all operations are handled entirely on the GPU.

* **Decompression overhead is constant** per forward pass and **independent of batch size**, making DFloat11 increasingly efficient at larger batch sizes.

* DFloat11 is **much faster than CPU-offloading approaches**, enabling practical deployment in memory-constrained environments.

* At batch size = 1, inference is approximately 2× slower than the original BF16 model, but the performance gap narrows significantly with larger batches.

* The compression is **fully lossless**, guaranteeing that the model’s outputs are **bit-for-bit identical** to those of the original model.
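The batch-size point in the list above is an amortization argument: if decompression costs a fixed amount of time per forward pass, that cost is shared by every sequence in the batch. The toy model below sketches this under that assumption. The function name `overhead_per_sequence_ms` and the 20 ms figure are hypothetical and are not measurements of DFloat11.

```python
def overhead_per_sequence_ms(decompress_ms: float, batch_size: int) -> float:
    """Share of a fixed per-forward-pass decompression cost borne by one sequence."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return decompress_ms / batch_size


DECOMPRESS_MS = 20.0  # hypothetical fixed cost per forward pass, not a measurement

for batch_size in (1, 4, 16, 64):
    share = overhead_per_sequence_ms(DECOMPRESS_MS, batch_size)
    print(f"batch_size={batch_size:>2}: {share:.2f} ms of decompression per sequence")
```

The model ignores everything except the fixed decompression cost, so it only explains the direction of the trend, not actual latencies.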

## 🚀 Quick Start

1. Install the `dfloat11` pip package. See [Installation](#-installation).

2. Run the following code in Python, which automatically downloads the DFloat11 `Qwen3-8B` model and generates a response.

  ```python
  import torch
  from dfloat11 import DFloat11Model
  from transformers import AutoTokenizer

  model_id = "DFloat11/Qwen3-8B-DF11"

  model = DFloat11Model.from_pretrained(model_id, device_map="auto")

  tokenizer = AutoTokenizer.from_pretrained(model_id)
  tokenizer.pad_token = tokenizer.eos_token

  prompt = "Question: What is a binary tree and its applications? Answer:"
  inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)

  with torch.no_grad():
      output = model.generate(
          **inputs,
          max_new_tokens=256,
          do_sample=True,
      )

  print(tokenizer.batch_decode(output, skip_special_tokens=True))
  ```

3. Replace the `model_id` in the script above with any pre-compressed model in the [Model Hub](#-model-hub).

## 🏎️ Benchmarking Performance

To test the speed and memory consumption of a DFloat11 LLM during inference:

```bash
CUDA_VISIBLE_DEVICES=0 python inference.py \
  --model_name_or_path DFloat11/Qwen3-8B-DF11 \
  --prompt "Question: What is a binary tree and its applications? Answer:" \
  --num_tokens 512 \
  --batch_size 1
```

> 💡 **Tip**: If you specify multiple CUDA devices (e.g., `CUDA_VISIBLE_DEVICES=0,1`), the model will be automatically distributed across them using 🤗 Accelerate's `device_map="auto"`.

### Arguments

- `--model_name_or_path`: HuggingFace name or local path of the DFloat11 model (e.g., `DFloat11/Qwen3-8B-DF11`). See the [Model Hub](#-model-hub) section for a list of available DFloat11 models.

- `--bf16`: *(Optional)* Turn on this flag when passing a BFloat16 model to `--model_name_or_path`

- `--prompt`: Input prompt string for text generation

- `--num_tokens`: Number of new tokens to generate per sample

- `--batch_size`: Number of prompts to process in parallel

- `--seed`: *(Optional)* Random seed for reproducible results

### Output

The script prints:

- Generated responses

- Total decoding latency

- Tokens per second (throughput)

- GPU memory usage (allocated and peak)
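As a rough guide to reading these numbers, throughput can be derived from the latency, batch size, and token count. The helper below is a sketch of one common definition (generated tokens across the whole batch divided by wall-clock decoding time). The name `tokens_per_second` is hypothetical, and `inference.py` may compute its figure differently.

```python
def tokens_per_second(batch_size: int, num_tokens: int, latency_s: float) -> float:
    """Generated tokens across the whole batch divided by decoding latency."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return batch_size * num_tokens / latency_s


# Example with made-up numbers: one prompt, 512 new tokens, 16 seconds of decoding.
print(f"{tokens_per_second(1, 512, 16.0):.1f} tokens/s")
```

In PyTorch, allocated and peak GPU memory are commonly read with `torch.cuda.memory_allocated()` and `torch.cuda.max_memory_allocated()`; whether the script uses exactly these calls is an assumption.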

## 📚 Model Hub

| Model | DFloat11 Link |
|-------|---------------|
| Wan2.1 T2V 14B (see [examples/wan2.1](https://github.com/LeanModels/DFloat11/tree/master/examples/wan2.1)) | [DFloat11/Wan2.1-T2V-14B-Diffusers-DF11](https://huggingface.co/DFloat11/Wan2.1-T2V-14B-Diffusers-DF11) |
| FLUX.1 dev (see [examples/flux.1](https://github.com/LeanModels/DFloat11/tree/master/examples/flux.1)) | [DFloat11/FLUX.1-dev-DF11](https://huggingface.co/DFloat11/FLUX.1-dev-DF11) |
| Qwen 3 32B | [DFloat11/Qwen3-32B-DF11](https://huggingface.co/DFloat11/Qwen3-32B-DF11) |
| Qwen 3 14B | [DFloat11/Qwen3-14B-DF11](https://huggingface.co/DFloat11/Qwen3-14B-DF11) |
| Qwen 3 8B | [DFloat11/Qwen3-8B-DF11](https://huggingface.co/DFloat11/Qwen3-8B-DF11) |
| Qwen 3 4B | [DFloat11/Qwen3-4B-DF11](https://huggingface.co/DFloat11/Qwen3-4B-DF11) |
| Phi 4 Reasoning Plus | [DFloat11/Phi-4-reasoning-plus-DF11](https://huggingface.co/DFloat11/Phi-4-reasoning-plus-DF11) |
| Gemma 3 27B Instruct | [DFloat11/gemma-3-27b-it-DF11](https://huggingface.co/DFloat11/gemma-3-27b-it-DF11) |
| Gemma 3 12B Instruct | [DFloat11/gemma-3-12b-it-DF11](https://huggingface.co/DFloat11/gemma-3-12b-it-DF11) |
| Gemma 3 4B Instruct | [DFloat11/gemma-3-4b-it-DF11](https://huggingface.co/DFloat11/gemma-3-4b-it-DF11) |
| Llama 3.1 8B Instruct | [DFloat11/Llama-3.1-8B-Instruct-DF11](https://huggingface.co/DFloat11/Llama-3.1-8B-Instruct-DF11) |
| DeepSeek R1 Distill Qwen 32B | [DFloat11/DeepSeek-R1-Distill-Qwen-32B-DF11](https://huggingface.co/DFloat11/DeepSeek-R1-Distill-Qwen-32B-DF11) |
| DeepSeek R1 Distill Qwen 14B | [DFloat11/DeepSeek-R1-Distill-Qwen-14B-DF11](https://huggingface.co/DFloat11/DeepSeek-R1-Distill-Qwen-14B-DF11) |
| DeepSeek R1 Distill Qwen 7B | [DFloat11/DeepSeek-R1-Distill-Qwen-7B-DF11](https://huggingface.co/DFloat11/DeepSeek-R1-Distill-Qwen-7B-DF11) |
| DeepSeek R1 Distill Llama 8B | [DFloat11/DeepSeek-R1-Distill-Llama-8B-DF11](https://huggingface.co/DFloat11/DeepSeek-R1-Distill-Llama-8B-DF11) |
| ... | [Discover more models on our HF page!](https://huggingface.co/DFloat11) |

### How to Use a DFloat11 Model

1. Download a model using the HuggingFace command line tool:

  ```bash
  # First argument: DFloat11 model name.
  # --local-dir: local path to download the DFloat11 model to.
  huggingface-cli download \
    DFloat11/Llama-3.1-8B-Instruct-DF11 \
    --local-dir ./Llama-3.1-8B-Instruct-DF11
  ```

2. Run the following in Python to load the model and tokenizer:

  ```python
  from dfloat11 import DFloat11Model
  from transformers import AutoTokenizer

  model_path = "./Llama-3.1-8B-Instruct-DF11"

  model = DFloat11Model.from_pretrained(model_path, device_map="auto")
  tokenizer = AutoTokenizer.from_pretrained(model_path)
  ```
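For memory-constrained setups, the News section notes that CPU offloading is enabled by passing `cpu_offload=True` to `DFloat11Model.from_pretrained(...)`. A small wrapper like the one below is one possible way to make that switchable. The helper name `load_df11` is hypothetical, and which other keyword arguments are appropriate alongside `cpu_offload` depends on the model, so check the corresponding model card.

```python
def load_df11(model_path: str, cpu_offload: bool = False, **kwargs):
    """Load a DFloat11 model, optionally with CPU offloading (hypothetical helper)."""
    # Imported lazily so the helper can be defined on machines without dfloat11.
    from dfloat11 import DFloat11Model

    if cpu_offload:
        # Flag documented in the News section of this README.
        kwargs["cpu_offload"] = True
    # Any extra keyword arguments are forwarded unchanged.
    return DFloat11Model.from_pretrained(model_path, **kwargs)
```

A call would then look like `model = load_df11("./Llama-3.1-8B-Instruct-DF11", cpu_offload=True)`.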

## 🗜️ Compressing Models (BFloat16 → DFloat11)

The DFloat11 compression utility is exposed via the `compress_model` function.

Check [examples/compress_flux1](https://github.com/LeanModels/DFloat11/tree/master/examples/compress_flux1) for a detailed example on compressing the FLUX.1 model.
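Before compressing a model, it can help to estimate the expected size. The sketch below is back-of-the-envelope arithmetic only: it assumes an average of about 11 bits per weight after compression (as the name DFloat11 suggests) and ignores tensors that are not compressed as well as any metadata. The function name `estimate_size_gb` and the round parameter count are hypothetical.

```python
def estimate_size_gb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params = 8_000_000_000  # hypothetical round parameter count
bf16_gb = estimate_size_gb(params, 16)  # BFloat16 uses 16 bits per weight
df11_gb = estimate_size_gb(params, 11)  # assumed average after compression

print(f"BF16: {bf16_gb:.1f} GB, DFloat11 (estimated): {df11_gb:.1f} GB")
print(f"estimated size ratio: {df11_gb / bf16_gb:.1%}")
```

An 11/16 ratio corresponds to roughly 69% of the original size, in line with the approximately 30% reduction quoted above.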

## 🔗 Links

👉 Explore pre-compressed DFloat11 models ready to use on HuggingFace: **[https://huggingface.co/DFloat11](https://huggingface.co/DFloat11)**

📂 Official Code Repository: [https://github.com/LeanModels/DFloat11](https://github.com/LeanModels/DFloat11)

## 🧠 Contributions

This work is brought to you by the team at Rice University and [xMAD.ai](https://xmad.ai/).

The GPU kernel was designed and implemented by [Tianyi Zhang](https://github.com/tonyzhang617).

## 📚 Citation

If you found our work useful or interesting, please consider citing our paper:

```bibtex
@inproceedings{zhang2025,
  title={70\% Size, 100\% Accuracy: Lossless {LLM} Compression for Efficient {GPU} Inference via Dynamic-Length Float ({DF}loat11)},
  author={Tianyi Zhang and Mohsen Hariri and Shaochen Zhong and Vipin Chaudhary and Yang Sui and Xia Hu and Anshumali Shrivastava},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=xdNAVP7TGy}
}
```