Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/aahouzi/llama2-chatbot-cpu
A LLaMA2-7b chatbot with memory running on CPU, optimized using smooth quantization, 4-bit quantization, or Intel® Extension for PyTorch with bfloat16.
- Host: GitHub
- URL: https://github.com/aahouzi/llama2-chatbot-cpu
- Owner: aahouzi
- License: mit
- Created: 2023-08-10T13:32:14.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-02-27T08:52:56.000Z (10 months ago)
- Last Synced: 2024-10-18T23:15:54.589Z (3 months ago)
- Topics: 4-bit-cpu, bfloat16, chatbot, chatbot-memory, chatgpt, cpu, huggingface, int8, intel, ipex, langchain, llama, llama2, meta, meta-ai, neural-compression, numa, optimization, smooth-quantization, streamlit
- Language: Python
- Homepage:
- Size: 30.3 MB
- Stars: 13
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# LLaMA2 chatbot on CPU
## :monocle_face: Description
- This project is a Streamlit chatbot built with LangChain, deploying a **LLaMA2-7b-chat** model on **Intel® Server and Client CPUs**.
- The chatbot has a memory that **remembers every part of the conversation**, and it lets users optimize the model with **Intel® Extension for PyTorch (IPEX) in bfloat16 with graph mode**, **smooth quantization** (a quantization technique designed specifically for LLMs: [ArXiv link](https://arxiv.org/pdf/2211.10438.pdf)), or **4-bit quantization**. Users can expect **up to a 4.3x speed-up** compared to stock PyTorch in default mode.
- **IMPORTANT:** The CPU needs to support bfloat16 ops in order to use the bfloat16 optimization (see the quick check below). On top of the software optimizations, I also introduced hardware optimizations such as non-uniform memory access (NUMA) binding.
- Users need to **ask for access to LLaMA2** models by following this [link](https://huggingface.co/meta-llama#:~:text=Welcome%20to%20the%20official%20Hugging,processed%20within%201%2D2%20days). After getting approval from Meta, you can generate an authentication token from your HuggingFace account and use it to load the model.
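A minimal way to check bfloat16 support, assuming a Linux host: recent Intel® CPUs expose native bfloat16 through the `avx512_bf16` and `amx_bf16` CPU flags.

```bash
# Print any bf16 CPU flags from the first CPU entry;
# empty output means no native bfloat16 support
grep -m1 -o -E "avx512_bf16|amx_bf16" /proc/cpuinfo
```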
## :scroll: Getting started
1. Start by cloning the repository:
```bash
git clone https://github.com/aahouzi/llama2-chatbot-cpu.git
cd llama2-chatbot-cpu
```
2. Create a Python 3.9 conda environment:
```bash
conda create -y -n llama2-chat python=3.9
```
3. Activate the environment:
```bash
conda activate llama2-chat
```
4. Install requirements for NUMA:
```bash
conda install -y gperftools -c conda-forge
conda install -y intel-openmp
sudo apt install numactl
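# Optional sanity check (illustrative; assumes a Linux host):
# list the NUMA nodes, then bind a trivial command to node 0
numactl --hardware
numactl --cpunodebind=0 --membind=0 echo "bound to NUMA node 0"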
```
5. Install the app requirements:
```bash
pip install -r requirements.txt
```

## :rocket: Start the app
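In every command below, `--port` is the port Streamlit serves the app on, `--physical_cores` is the number of physical CPU cores to use, and `--auth_token` is the HuggingFace token mentioned in the description (flag meanings inferred from the flag names). Assuming a Linux host, you can look up the physical core count with:

```bash
# Physical cores = Core(s) per socket x Socket(s); hyper-threads don't count
lscpu | grep -E "^(Socket\(s\)|Core\(s\) per socket):"
```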
- Default mode (no optimizations):
```bash
bash launcher.sh --script=app/app.py --port= --physical_cores= --auth_token=
```
- IPEX in graph mode with FP32:
```bash
bash launcher.sh --script=app/app.py --port= --physical_cores= --auth_token= --ipex --jit
```
- IPEX in graph mode with bfloat16:
```bash
bash launcher.sh --script=app/app.py --port= --physical_cores= --auth_token= --dtype=bfloat16 --ipex --jit
```
- Smooth quantization:
```bash
bash launcher.sh --script=app/app.py --port= --physical_cores= --auth_token= --sq
```
- 4-bit quantization:
```bash
bash launcher.sh --script=app/app.py --port= --physical_cores= --auth_token= --int4
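# For example, with illustrative values (8501 is Streamlit's default port;
# the core count and hf_xxx token are placeholders, not real values):
bash launcher.sh --script=app/app.py --port=8501 --physical_cores=32 --auth_token=hf_xxx --int4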
```

## :computer: Chatbot demo
![](static/llama2-chat-demo.gif)

## :mailbox_closed: Contact
For any information, feedback, or questions, please [contact me][anas-email].

[anas-email]: mailto:[email protected]