https://github.com/elphinkuo/llamaqt.c
Clean C implementation for quantizing a llama2 model and running the quantized model
- Host: GitHub
- URL: https://github.com/elphinkuo/llamaqt.c
- Owner: elphinkuo
- License: apache-2.0
- Created: 2023-08-17T19:06:39.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-09-08T17:14:32.000Z (over 1 year ago)
- Last Synced: 2025-01-01T20:07:05.455Z (5 months ago)
- Topics: google-colab, large-language-models, quantization, quantization-algorithms, quantization-efficient-network
- Language: C
- Homepage:
- Size: 455 KB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# llama2qt.c
Clean C implementation for quantizing a llama2 model and running the quantized model. The code contains some modifications (mainly around quantization and running the quantized model) based on [llama2.c](https://github.com/karpathy/llama2.c) (Inference Llama 2 in one file of pure C) by Andrej Karpathy.
Simple instructions:
## 8-bit quantization, grouped per layer, without blocking:

    gcc -O3 -o quantize quantize_8bit.c -lm
    ./quantize {model_name}.bin
## Inference with 8-bit quantization:

    gcc -O3 -march=native runq.c -o runq -lm
    ./runq llama2_7b_8bit.bin -t {temperature} -p {top_p} -n {max_token} -i "{prompt}"
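The core of inference over quantized weights can be sketched as an int8-weight matrix-vector product that applies the stored scale once per output row. This is an assumed illustration of the idea, not the actual `runq.c` API:

```c
#include <stdint.h>

/* Hypothetical int8-weight x float-activation matvec:
 * accumulate in float, then dequantize with the per-tensor scale. */
void matvec_q8(float *out, const int8_t *w, float scale,
               const float *x, int rows, int cols) {
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (int c = 0; c < cols; c++) {
            acc += (float)w[r * cols + c] * x[c];
        }
        out[r] = acc * scale;  /* one multiply per row instead of per weight */
    }
}
```

Deferring the scale multiply to the end of each row is what makes quantized inference cheap: the inner loop touches only int8 weights and float activations.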
## 8-bit quantization, grouped by 64 * 64 blocks:

    gcc -O3 -o quantize quantize_8bit_64block.c -lm
## A quick test, using Google Colab:
[Open the 8-bit quantization demo in Colab](https://colab.research.google.com/github/elphinkuo/llamaqt.c/blob/master/quantization_8bit_demo.ipynb)
More details can be found in the [README.md](README.md).