https://github.com/varun0157/quantisation
Experiments in quantisation consisting of quantisation from scratch, bitsandbytes, and llama.cpp. [Assignment 4 of Advanced Natural Language Processing, IIIT-H Monsoon '24]
- Host: GitHub
- URL: https://github.com/varun0157/quantisation
- Owner: Varun0157
- License: mit
- Created: 2024-11-10T08:43:43.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-11T16:20:25.000Z (about 1 year ago)
- Last Synced: 2025-03-31T05:45:58.641Z (11 months ago)
- Topics: bitsandbytes, llamacpp, natural-language-processing, quantisation
- Language: Python
- Homepage:
- Size: 395 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# quantisation
*Assignment 4* of *Advanced Natural Language Processing* (IIIT-Hyderabad, Monsoon '24)
Experiments in quantisation: quantisation from scratch (whole-model and selective) as well as `bitsandbytes` integration, covering 4-bit and 8-bit formats and `nf4` quantisation.
In addition, we deploy a model on a local device using `llama.cpp`, quantise it, and upload it to the Hugging Face Hub.
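As a rough, hedged illustration of the from-scratch approach (not the repository's actual implementation), a symmetric absmax 8-bit quantiser for a single weight tensor can be written in a few lines:

```python
import torch

def absmax_quantise(weights: torch.Tensor):
    """Symmetric 8-bit absmax quantisation of one tensor (illustrative only)."""
    scale = weights.abs().max().clamp(min=1e-8) / 127.0   # largest |w| maps to 127
    quantised = torch.clamp((weights / scale).round(), -127, 127).to(torch.int8)
    return quantised, scale

def absmax_dequantise(quantised: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return quantised.to(torch.float32) * scale

w = torch.randn(8, 8)
q, s = absmax_quantise(w)
print((w - absmax_dequantise(q, s)).abs().max())          # small reconstruction error
```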
## Custom Quantisation
### dependencies
Refer to the [env file](./docs/envs.yml) to install the dependencies using `conda`.
```sh
conda env create -f docs/envs.yml
```
### quantisation
Quantise `gpt-neo` with your method of choice:
```sh
python -m src.quantize --q_type <type>
```
Types include `custom_whole`, `custom_selective`, `bnb_4`, `bnb_8`, `bnb_nf4`
and `none`.
`custom_whole` takes a lot of memory during inference and may have to be run with the `--cpu` flag.
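For the `bitsandbytes` variants, the library does the heavy lifting. A minimal sketch of what the `bnb_nf4` option plausibly amounts to, assuming a small `gpt-neo` checkpoint (the model id and settings here are assumptions, not this repository's code):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Hypothetical configuration roughly corresponding to the bnb_nf4 option.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype during inference
)

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-125m",              # assumed checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```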
The quantised model is saved to `quantized`. To evaluate it, run the same command against the evaluate module:
```sh
python -m src.evaluate --q_type <type>
```
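`src.evaluate` presumably compares the quantised models on metrics such as perplexity and memory footprint. Purely as an illustration (this is not the repository's evaluation code, and the checkpoint and text are assumptions), a perplexity check for a causal LM can be done like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "EleutherAI/gpt-neo-125m"   # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

text = "Quantisation trades a little accuracy for a lot of memory."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Using the input ids as labels makes the model return the average
    # cross-entropy loss over the sequence.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {torch.exp(loss).item():.2f}")
```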
Quantised models can be found here: https://drive.google.com/drive/folders/1lHQnaPGtltS_SNNqdw4MLhvGHB0xKP1l?usp=sharing
## llama.cpp
**reference**: https://github.com/ggerganov/llama.cpp/discussions/2948
Set up the `llama.cpp` submodule stored in the [llama.cpp](./llama.cpp/) directory as below:
```sh
git submodule init
git submodule update
```
Change into the `llama.cpp` directory to build the executables:
```sh
cd llama.cpp
```
Build them by referring to the [upstream build documentation](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md). The commands that follow are written relative to the repository root.
> [!NOTE]
> `huggingface-hub` is required to download and upload models.
Download `hf-smol-135m` from Hugging Face to quantise:
```sh
python download.py
```
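`download.py` is not reproduced here; a minimal sketch of what it presumably does, assuming the `HuggingFaceTB/SmolLM-135M` checkpoint and the `hf-smol` target directory used by the next command:

```python
from huggingface_hub import snapshot_download

# Sketch only: the repo id and local directory are assumptions inferred from
# the surrounding commands, not taken from download.py itself.
snapshot_download(
    repo_id="HuggingFaceTB/SmolLM-135M",
    local_dir="hf-smol",
)
```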
Convert the downloaded model to GGUF, quantising it to `q8_0`, using `llama.cpp`:
```sh
python llama.cpp/convert_hf_to_gguf.py hf-smol \
--outfile hf-smol.gguf \
--outtype q8_0
```
Prompt the model with whatever input you want using the `llama-cli` executable:
```sh
./llama.cpp/build/bin/llama-cli -m hf-smol.gguf -p "What is life?"
```
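Here `-m` selects the GGUF model file and `-p` supplies the prompt; `llama-cli` also accepts `-n <tokens>` to cap the number of tokens generated.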
Optionally, upload the model to the Hugging Face Hub by modifying `upload.py` as required:
```sh
python upload.py
```
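`upload.py` is specific to the author's account. As a hedged sketch, an upload script along these lines should work; the destination repo id below is a placeholder you must replace with your own:

```python
from huggingface_hub import HfApi

# Placeholder destination: replace with your own namespace and repo name.
repo_id = "your-username/hf-smol-q8_0-gguf"

api = HfApi()
api.create_repo(repo_id=repo_id, exist_ok=True)
api.upload_file(
    path_or_fileobj="hf-smol.gguf",   # the GGUF produced above
    path_in_repo="hf-smol.gguf",
    repo_id=repo_id,
)
```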