# LLaMA2 chatbot on CPU

## :monocle_face: Description
- This project is a Streamlit chatbot, built with LangChain, that deploys a **LLaMA2-7b-chat** model on **Intel® Server and Client CPUs**.
- The chatbot has a memory that **remembers every part of the conversation**, and it lets users optimize the model with **Intel® Extension for PyTorch (IPEX) in bfloat16 with graph mode**, **smooth quantization** (a quantization technique designed specifically for LLMs: [ArXiv link](https://arxiv.org/pdf/2211.10438.pdf)), or **4-bit quantization**. Users can expect **up to a 4.3x speed-up** compared to stock PyTorch in default mode.

- **IMPORTANT:** The CPU must support bfloat16 ops in order to use the bfloat16 optimization. On top of the software optimizations, I also introduced hardware optimizations such as non-uniform memory access (NUMA) binding. You need to **request access to the LLaMA2** models by following this [link](https://huggingface.co/meta-llama#:~:text=Welcome%20to%20the%20official%20Hugging,processed%20within%201%2D2%20days). Once Meta approves your request, generate an authentication token from your HuggingFace account and use it to load the model, as in the sketch below.
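
For reference, loading a gated LLaMA2 checkpoint with an access token looks roughly like the snippet below. This is a minimal sketch, not the app's exact loading code; the model ID and the token value are placeholders:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # gated repo: requires approved access from Meta
HF_TOKEN = "hf_..."  # placeholder: your HuggingFace access token

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_auth_token=HF_TOKEN)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float32,  # the launcher flags below select bfloat16/int8/int4 variants
    use_auth_token=HF_TOKEN,
)
```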

## :scroll: Getting started

1. Start by cloning the repository:
```bash
git clone https://github.com/aahouzi/llama2-chatbot-cpu.git
cd llama2-chatbot-cpu
```
2. Create a Python 3.9 conda environment:
```bash
conda create -y -n llama2-chat python=3.9
```
3. Activate the environment:
```bash
conda activate llama2-chat
```
4. Install requirements for NUMA:
```bash
conda install -y gperftools -c conda-forge
conda install -y intel-openmp
sudo apt install numactl
```
5. Install the app requirements:
```bash
pip install -r requirements.txt
```

## :rocket: Start the app

- Default mode (no optimizations). In every command below, fill in `--port`, `--physical_cores`, and your HuggingFace `--auth_token`:
```bash
bash launcher.sh --script=app/app.py --port= --physical_cores= --auth_token=
```

- IPEX in graph mode with FP32:
```bash
bash launcher.sh --script=app/app.py --port= --physical_cores= --auth_token= --ipex --jit
```

- IPEX in graph mode with bfloat16:
```bash
bash launcher.sh --script=app/app.py --port= --physical_cores= --auth_token= --dtype=bfloat16 --ipex --jit
```
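
For the curious, the `--ipex --jit` combination above corresponds roughly to the following pattern. This is a minimal sketch, not the repo's exact code; it assumes the model was loaded with `torchscript=True` so tracing sees tuple outputs:
```python
import torch
import intel_extension_for_pytorch as ipex

model = model.eval()
# Operator-level optimizations plus bfloat16 weight casting
# (requires a CPU with AVX512-BF16 or AMX support).
model = ipex.optimize(model, dtype=torch.bfloat16)

# Graph mode: trace once with dummy inputs, then freeze the TorchScript graph.
dummy_input_ids = torch.ones(1, 32, dtype=torch.long)
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    traced = torch.jit.trace(model, dummy_input_ids, strict=False)
    traced = torch.jit.freeze(traced)
```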

- Smooth quantization:
```bash
bash launcher.sh --script=app/app.py --port= --physical_cores= --auth_token= --sq
```
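
Smooth quantization rescales activation outliers into the weights before INT8 quantization (see the paper linked above). With Intel Neural Compressor it is typically driven by a config like the following; this is a sketch under the assumption that the `neural-compressor` post-training API is used, and `calib_dataloader` is a placeholder you would supply:
```python
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    backend="ipex",  # run the resulting INT8 model through IPEX
    recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5}},
)
# calib_dataloader: a small dataloader of representative prompts (placeholder).
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
```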

- 4-bit quantization:
```bash
bash launcher.sh --script=app/app.py --port= --physical_cores= --auth_token= --int4
```
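
4-bit quantization stores the weights in INT4 and dequantizes them on the fly, which cuts memory traffic at a small accuracy cost. A hedged sketch using Intel Neural Compressor's weight-only round-to-nearest (RTN) path; option names vary between versions, so treat this as illustrative rather than the repo's exact configuration:
```python
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # apply to all supported op types
            "weight": {"bits": 4, "group_size": 32, "scheme": "sym", "algorithm": "RTN"}
        }
    },
)
q_model = quantization.fit(model, conf)
```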

## :computer: Chatbot demo


![](static/llama2-chat-demo.gif)

## :mailbox_closed: Contact
For any information, feedback, or questions, please [contact me][anas-email].

[anas-email]: mailto:[email protected]