Quantizing TinyLlama to 8-bit
https://github.com/11sshukla/model_quantization
- Host: GitHub
- URL: https://github.com/11sshukla/model_quantization
- Owner: 11SShukla
- Created: 2025-09-06T08:23:00.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-09-06T09:23:45.000Z (5 months ago)
- Last Synced: 2025-09-06T10:15:36.613Z (5 months ago)
- Topics: accelerator, bitsandbytes, touch, transformer
- Language: Python
- Homepage:
- Size: 9.77 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# model_quantization
## TinyLlama 8-bit Quantization Guide
## 📌 Introduction
Quantization is a technique used to reduce the memory footprint and improve inference speed of large language models (LLMs) by representing weights with lower precision (e.g., 8-bit integers instead of 16-bit floating point numbers).
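To make the idea concrete, here is a toy sketch of mapping a row of FP16 weights to int8 with a per-row absmax scale and back again. This is a simplified illustration only, not the exact LLM.int8() scheme used by `bitsandbytes` (which additionally handles outlier feature columns in higher precision):

```python
import torch

# A toy row of FP16 weights
w_fp16 = torch.tensor([0.42, -1.37, 0.05, 2.10, -0.88], dtype=torch.float16)

# Absmax quantization: scale so the largest magnitude maps to 127
scale = w_fp16.abs().max().float() / 127.0
w_int8 = torch.round(w_fp16.float() / scale).to(torch.int8)  # stored as 1 byte per weight
w_back = w_int8.float() * scale                              # dequantized for compute

print(w_int8)  # e.g. tensor([ 25, -83,   3, 127, -53], dtype=torch.int8)
print(w_back)  # close to the original values, up to a small rounding error
```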
In this project, we successfully **quantized TinyLlama-1.1B-Chat** from FP16 (16-bit floating point) to 8-bit using the `transformers` library and `bitsandbytes`.
This guide explains:
- Why quantization is important
- How to quantize TinyLlama to 8-bit
- How to **save and reuse** the quantized model
- How to evaluate performance (loss & perplexity)
- Why this approach is useful for others
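A minimal sketch of the loading and saving steps, assuming the Hugging Face model ID `TinyLlama/TinyLlama-1.1B-Chat-v1.0` and a CUDA-capable GPU (8-bit loading through `bitsandbytes` requires one):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed Hub model ID

# Ask transformers to load the weights in 8-bit via bitsandbytes
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU(s)
)

# Persist the quantized checkpoint so it can be reloaded without re-quantizing
model.save_pretrained("tinyllama-1.1b-chat-8bit")
tokenizer.save_pretrained("tinyllama-1.1b-chat-8bit")
```

Reloading the saved directory with `AutoModelForCausalLM.from_pretrained("tinyllama-1.1b-chat-8bit", device_map="auto")` skips re-quantization; note that serializing 8-bit `bitsandbytes` weights needs reasonably recent versions of `transformers` and `bitsandbytes`.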
---
## Why Quantization?
Quantization provides several key benefits:
- **Memory Efficiency**: An FP16 model needs 2 bytes of VRAM/RAM per weight; converting to 8-bit halves the weight memory, allowing larger models to fit on smaller GPUs (a rough estimate is sketched after this list).
- **Faster Inference**: 8-bit models often generate faster, since LLM inference is largely memory-bandwidth bound and each weight is 1 byte instead of 2.
- **Accessibility**: People with lower-end GPUs (e.g., 4 GB/6 GB VRAM) can run models that otherwise wouldn't fit.
- **Cost Efficiency**: Lower memory usage means cheaper cloud instances.
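As a rough back-of-the-envelope check of the savings (weights only, ignoring activations and the KV cache):

```python
params = 1.1e9                   # approximate TinyLlama-1.1B parameter count
fp16_gb = params * 2 / 1024**3   # 2 bytes per weight in FP16
int8_gb = params * 1 / 1024**3   # 1 byte per weight in INT8

print(f"FP16 weights: ~{fp16_gb:.1f} GB")  # ~2.0 GB
print(f"INT8 weights: ~{int8_gb:.1f} GB")  # ~1.0 GB
```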
**Tradeoff:** Quantization introduces a *small precision loss*, but for most inference/chat use cases the difference is negligible; one way to measure it is sketched below.
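A minimal sketch of measuring that loss via perplexity on a short held-out text, assuming the quantized checkpoint was saved to the hypothetical local directory `tinyllama-1.1b-chat-8bit` from the loading step above (run the same code against the FP16 model to compare):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "tinyllama-1.1b-chat-8bit"  # hypothetical local path from the saving step
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")
model.eval()

text = "Quantization reduces memory use with only a small drop in quality."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # With labels == input_ids the model returns the mean cross-entropy loss
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"loss = {loss.item():.3f}, perplexity = {torch.exp(loss).item():.2f}")
```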
---
## Requirements
### Install Dependencies
Make sure you have Python 3.9+, then install the dependencies:
```bash
pip install torch transformers bitsandbytes accelerate