# model_quantization

## TinyLlama 8-bit Quantization Guide

## 📌 Introduction
Quantization is a technique used to reduce the memory footprint and improve inference speed of large language models (LLMs) by representing weights with lower precision (e.g., 8-bit integers instead of 16-bit floating point numbers).

In this project, we successfully **quantized TinyLlama-1.1B-Chat** from FP16 (16-bit floating point) to 8-bit using the `transformers` library and `bitsandbytes`.

This guide explains:
- Why quantization is important
- How to quantize TinyLlama to 8-bit (see the sketch after this list)
- How to **save and reuse** the quantized model
- How to evaluate performance (loss & perplexity)
- Why this approach is useful for others
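
To make the core idea concrete, here is a minimal loading sketch using `transformers` and `bitsandbytes`. It is illustrative rather than the exact script in this repo, and it assumes a CUDA-capable GPU and the public `TinyLlama/TinyLlama-1.1B-Chat-v1.0` checkpoint on the Hugging Face Hub:

```python
# Minimal 8-bit loading sketch (assumes a CUDA GPU and the public TinyLlama chat checkpoint).
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights via bitsandbytes
    device_map="auto",  # place layers on the available GPU(s)
)

# Quick smoke test: generate a short reply from the quantized model.
inputs = tokenizer("What is quantization?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```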

---

## Why Quantization?
Quantization provides several key benefits:

- **Memory Efficiency:** 16-bit models require more VRAM/RAM. Converting to 8-bit roughly halves the memory requirement, allowing larger models to fit on smaller GPUs (a quick footprint check follows after this list).

- **Faster Inference:** 8-bit models can run faster because each weight occupies fewer bytes, reducing the memory bandwidth needed per forward pass.

- **Accessibility:** People with lower-end GPUs (e.g., 4 GB/6 GB VRAM) can run models that otherwise wouldn't fit.

- **Cost Efficiency:** Lower memory usage means cheaper cloud instances.

**Tradeoff:** Quantization introduces a *small precision loss*, but for most inference/chat use cases the difference is negligible.
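
To see the saving concretely, you can compare the memory footprint reported for an FP16 load against an 8-bit load. This is a rough sketch, not the project's benchmark script: it assumes the public `TinyLlama/TinyLlama-1.1B-Chat-v1.0` checkpoint, a CUDA GPU, and enough free memory to hold both copies at once, and exact numbers vary by setup:

```python
# Rough memory comparison: FP16 vs. 8-bit (assumes enough free memory for both copies).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

fp16_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
int8_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# get_memory_footprint() reports parameter + buffer memory in bytes.
print(f"FP16 footprint: {fp16_model.get_memory_footprint() / 1e9:.2f} GB")
print(f"INT8 footprint: {int8_model.get_memory_footprint() / 1e9:.2f} GB")
```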

---

## Requirements

### Install Dependencies
Make sure you have Python 3.9+, then install the required packages:

```bash
pip install torch transformers bitsandbytes accelerate
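
# Optional sanity check: 8-bit quantization with bitsandbytes requires a CUDA-capable
# GPU, so confirm PyTorch can see one and that bitsandbytes imports cleanly.
python -c "import torch, bitsandbytes; print(torch.cuda.is_available())"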