https://github.com/b4rtaz/distributed-llama
Tensor parallelism is all you need. Run LLMs on an AI cluster at home using any device. Distribute the workload, divide RAM usage, and increase inference speed.
distributed-computing distributed-llm llama2 llama3 llm llm-inference llms neural-network open-llm
Last synced: 2 days ago
- Host: GitHub
- URL: https://github.com/b4rtaz/distributed-llama
- Owner: b4rtaz
- License: mit
- Created: 2023-12-04T23:36:06.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-10-14T18:57:11.000Z (2 months ago)
- Last Synced: 2024-10-29T15:18:03.839Z (about 2 months ago)
- Topics: distributed-computing, distributed-llm, llama2, llama3, llm, llm-inference, llms, neural-network, open-llm
- Language: C++
- Homepage:
- Size: 2.94 MB
- Stars: 1,452
- Watchers: 30
- Forks: 100
- Open Issues: 35
- Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - b4rtaz/distributed-llama - The root node is responsible for loading the model and weights and forwarding them to the workers; it also synchronizes the state of the neural network. The root node is itself a worker and processes its own slice of the neural network. A worker node processes its own slice of the neural network and doesn't require any model-related configuration. You always need the root node, and you can add 2^n - 1 worker nodes to speed up inference. The neural network's RAM usage is split across all nodes; the root node needs slightly more RAM than the worker nodes. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
README
![Distributed Llama](.github/cover.png)
# Distributed Llama
[![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/b4rtaz/distributed-llama/.github%2Fworkflows%2Fmain.yml?style=flat-square)](https://github.com/b4rtaz/distributed-llama/actions) [![License: MIT](https://img.shields.io/github/license/mashape/apistatus.svg?style=flat-square)](/LICENSE) [![Support this project](https://img.shields.io/github/sponsors/b4rtaz?style=flat-square&label=support%20this%20project&color=green)](https://github.com/sponsors/b4rtaz) [![Discord](https://discordapp.com/api/guilds/1245814812353495070/widget.png?style=shield)](https://discord.com/widget?id=1245814812353495070&theme=dark)
Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage. This project proves that it's possible to split the workload of LLMs across multiple devices and achieve a significant speedup. Distributed Llama allows you to run huge LLMs in-house. The project uses TCP sockets to synchronize the state. You can easily configure your AI cluster by using a home router.
> [!TIP]
> Check out the new article: [🌳 How to Run Llama 3.1 405B on Home Devices? Build AI Cluster!](https://medium.com/@b4rtaz/how-to-run-llama-3-405b-on-home-devices-build-ai-cluster-ad0d5ad3473b)

### 🔥 Setup Root Node by Single Command
Python 3 and a C++ compiler are required. Each command below downloads the model and the tokenizer.
| Model | Purpose | Size | Command |
| --------------------------- | --------- | -------- | --------------------------------------------- |
| TinyLlama 1.1B 3T Q40 | Benchmark | 844 MB | `python launch.py tinyllama_1_1b_3t_q40` |
| Llama 3 8B Q40 | Benchmark | 6.32 GB | `python launch.py llama3_8b_q40` |
| Llama 3 8B Instruct Q40 | Chat, API | 6.32 GB | `python launch.py llama3_8b_instruct_q40` |
| Llama 3.1 8B Instruct Q40 | Chat, API | 6.32 GB | `python launch.py llama3_1_8b_instruct_q40` |
| Llama 3.1 405B Instruct Q40 | Chat, API | 238 GB | `python launch.py llama3_1_405b_instruct_q40` |
| Llama 3.2 1B Instruct Q40 | Chat, API | 1.7 GB | `python launch.py llama3_2_1b_instruct_q40` |
| Llama 3.2 3B Instruct Q40 | Chat, API | 3.4 GB | `python launch.py llama3_2_3b_instruct_q40` |
| Llama 3.3 70B Instruct Q40 | Chat, API | 40 GB | `python launch.py llama3_3_70b_instruct_q40` |
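
As a minimal single-machine sketch (no workers): build the binaries, download a model with `launch.py`, and run a short inference. The model and tokenizer file names below are taken from the examples later in this README and may differ from what `launch.py` actually writes, so adjust them to what the script reports.

```sh
# Build the binaries first (see the setup sections below for prerequisites)
make dllama

# Download the Llama 3 8B Q40 model and its tokenizer
python launch.py llama3_8b_q40

# Run a short single-node inference on the downloaded files
# (file names assumed from the examples later in this README)
./dllama inference \
  --model dllama_model_meta-llama-3-8b_q40.m \
  --tokenizer dllama_tokenizer_llama3.t \
  --buffer-float-type q80 \
  --prompt "Hello world" \
  --steps 16 \
  --nthreads 4
```
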

### 🛠️ Convert Model Manually

Supported architectures: Llama, Mixtral
* [How to Convert Llama 2, Llama 3, Llama 3.1](./docs/LLAMA.md)
* [How to Convert Hugging Face Model](./docs/HUGGINGFACE.md)

### 🚧 Known Limitations
* You can run Distributed Llama only on 1, 2, 4... 2^n nodes.
* The maximum number of nodes is equal to the number of KV heads in the model [#70](https://github.com/b4rtaz/distributed-llama/issues/70); for example, Llama 3 8B uses 8 KV heads, so it can be split across at most 8 nodes.
* CPU support only; GPU support is planned. Optimized for the following (weights format × buffer format) combinations:
* ARM CPUs
* ✅ F32 × F32
* ❌ F16 × F32
* ✅ Q40 × F32
* ✅ Q40 × Q80
* x86_64 AVX2 CPUs
* ✅ F32 × F32
* ❌ F16 × F32
* ✅ Q40 × F32
* ✅ Q40 × Q80

### 👷 Architecture
The project is split up into two parts:
* **Root node** - it is responsible for loading the model and weights and forwarding them to the workers. It also synchronizes the state of the neural network. The root node is itself a worker and processes its own slice of the neural network.
* **Worker node** - it processes its own slice of the neural network. It doesn't require any configuration related to the model.

You always need the root node, and you can add 2^n - 1 worker nodes to speed up the inference; for example, 1 root node plus 3 workers gives a 4-node cluster. The RAM usage of the neural network is split up across all nodes. The root node requires a bit more RAM than the worker nodes.
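
For example, mapping the two roles onto the commands documented in the next section, a minimal 2-node cluster is just a worker process on one device and the root process on another (the full walk-through with addresses and model files appears in the setup sections below):

```sh
# On the worker device: serve a slice of the neural network (no model files needed)
./dllama worker --port 9998 --nthreads 4

# On the root device: load the model, synchronize state, and drive the inference
./dllama inference --model dllama_model_meta-llama-3-8b_q40.m \
  --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 \
  --prompt "Hello world" --steps 16 --nthreads 4 \
  --workers 10.0.0.2:9998
```
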
### 🎹 Commands
* `dllama inference` - run the inference with a simple benchmark,
* `dllama chat` - run the CLI chat,
* `dllama worker` - run the worker node,
* `dllama-api` - run the API server.

**Inference, Chat, API**
| Argument | Description | Example |
| ---------------------------- | ---------------------------------------------------------------- | -------------------------------------- |
| `--model` | Path to the model file. | `dllama_model_meta-llama-3-8b_q40.m` |
| `--tokenizer` | Path to the tokenizer file. | `dllama_tokenizer_llama3.t` |
| `--buffer-float-type` | Float precision used for synchronization. | `q80` |
| `--workers` | Addresses of workers (ip:port), separated by spaces. | `10.0.0.1:9991 10.0.0.2:9991` |
| `--max-seq-len` | The maximum sequence length; it helps to reduce RAM usage. | `4096` |

**Inference, Chat, Worker, API**
| Argument | Description | Example |
| ---------------------------- | --------------------------------------------------------------------- | ----------------------------------- |
| `--nthreads` | Number of threads. Don't set a higher value than the number of CPU cores. | `4` |

**Worker, API**
| Argument | Description | Example |
| ---------------------------- | --------------------------------- | ----------------- |
| `--port` | Binding port. | `9999` |

**Inference**
| Argument | Description | Example |
| ---------------------------- | ------------------------------ | ------------------ |
| `--prompt ` | Initial prompt. | `"Hello World"` |
| `--steps` | Number of tokens to generate. | `256` |
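
Putting the arguments together: the sketch below starts the API server on the root node with two workers and then queries it. The file names simply reuse the ones from the inference examples in this README (in practice you would serve an instruct model), and the OpenAI-style `/v1/chat/completions` route and request body used in the `curl` call are assumptions; check the project docs for the actual endpoints.

```sh
# Start the API server on the root node with two worker nodes
# (flags as documented in the tables above; adjust file names and addresses)
./dllama-api \
  --model dllama_model_meta-llama-3-8b_q40.m \
  --tokenizer dllama_tokenizer_llama3.t \
  --buffer-float-type q80 \
  --port 9990 \
  --nthreads 4 \
  --max-seq-len 4096 \
  --workers 10.0.0.2:9998 10.0.0.3:9998

# Query it from another terminal (endpoint path and body format are assumptions)
curl http://localhost:9990/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'
```
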

## 📊 Measurements

### Average Token Generation Time
I - inference time of the root node, T - network transfer time of the root node.
**Raspberry Pi 5 8GB**
Weights = Q40, Buffer = Q80, nSamples = 16, switch = TP-Link LS1008G, tested on 0.3.1 version
| Model | 1 x RasPi 5 8 GB | 2 x RasPi 5 8 GB | 4 x RasPi 5 8 GB |
|-------------|---------------------------------------------------------------------|---------------------------------------------------------------------|---------------------------------------------------------------------|
| Llama 2 7B  | **441.09 ms**, 2.26 t/s<br>I: 434.84 ms, T: 5.25 ms | **341.46 ms**, 2.92 t/s<br>I: 257.78 ms, T: 83.27 ms | **219.08 ms**, 4.56 t/s 🔥<br>I: 163.42 ms, T: 55.25 ms |
| Llama 3 8B  | **564.31 ms**, 1.77 t/s<br>I: 556.67 ms, T: 6.17 ms | **444.27 ms**, 2.25 t/s<br>I: 362.73 ms, T: 80.11 ms | **331.47 ms**, 3.01 t/s 🔥<br>I: 267.62 ms, T: 62.34 ms |

**Raspberry Pi 4B 8 GB**
Distributed Llama running Llama 2 70B Q40 on 8 x Raspberry Pi 4B 8 GB devices.

Weights = Q40, Buffer = Q80, nSamples = 16, switch = TP-Link LS1008G, tested on 0.1.0 version
| Model | 1 x RasPi 4B 8 GB | 2 x RasPi 4B 8 GB | 4 x RasPi 4B 8 GB | 8 x RasPi 4B 8 GB |
|-------------|---------------------------------------------------------------------|-----------------------------------------------------------------------|--------------------------------------------------------------------------------------|----------------------------------------------------------------------|
| Llama 2 7B  | **1312.50 ms**<br>I: 1307.94 ms, T: 1.81 ms | **793.69 ms**<br>I: 739.00 ms, T: 52.50 ms | **494.00 ms** 🔥<br>I: 458.81 ms, T: 34.06 ms | **588.19 ms**<br>I: 296.69 ms, T: 289.75 ms |
| Llama 2 13B | Not enough RAM | **1497.19 ms**<br>I: 1465.06 ms, T: 30.88 ms | **848.19 ms** 🔥<br>I: 746.88 ms, T: 99.50 ms | **1114.88 ms**<br>I: 460.8 ms, T: 652.88 ms |
| Llama 2 70B | Not enough RAM | Not enough RAM | Not enough RAM | **4842.81 ms** 🔥<br>I: 2121.94 ms, T: 2719.62 ms |

**x86_64 CPU Cloud Server**
Weights = Q40, Buffer = Q80, nSamples = 16, VMs = [c3d-highcpu-30](https://github.com/b4rtaz/distributed-llama/discussions/9), tested on 0.1.0 version
| Model | 1 x VM | 2 x VM | 4 x VM |
|-------------|---------------------------------------------------------------------|-----------------------------------------------------------------------|--------------------------------------------------------------------------------------|
| Llama 2 7B  | **101.81 ms**<br>I: 101.06 ms, T: 0.19 ms | **69.69 ms**<br>I: 61.50 ms, T: 7.62 ms | **53.69 ms** 🔥<br>I: 40.25 ms, T: 12.81 ms |
| Llama 2 13B | **184.19 ms**<br>I: 182.88 ms, T: 0.69 ms | **115.38 ms**<br>I: 107.12 ms, T: 7.81 ms | **86.81 ms** 🔥<br>I: 66.25 ms, T: 19.94 ms |
| Llama 2 70B | **909.69 ms**<br>I: 907.25 ms, T: 1.75 ms | **501.38 ms**<br>I: 475.50 ms, T: 25.00 ms | **293.06 ms** 🔥<br>I: 264.00 ms, T: 28.50 ms |

### Network Transfer for Generating a Token
**F32 Buffer**
| Model | 2 devices | 4 devices | 8 devices |
|-------------|----------------|---------------|---------------|
| Llama 3 8B  | **2048 kB** | **6144 kB** | **14336 kB** |

**Q80 Buffer**
| Model | 2 devices | 4 devices | 8 devices |
|-------------|--------------|---------------|----------------|
| Llama 3 8B  | **544 kB** | **1632 kB** | **3808 kB** |
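
The gap between the two buffer types follows from the block layouts, assuming Q80 here is the usual 8-bit block quantization: each group of 32 values is sent as 32 int8 values plus a 16-bit scale (34 bytes) instead of 128 bytes of F32, so for example 2048 kB × 34/128 = 544 kB, matching the rows above.
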

## 📟 Setup Raspberry Pi Devices

1. Install `Raspberry Pi OS Lite (64 bit)` on your Raspberry Pi devices. This OS doesn't have a desktop environment.
2. Connect all devices to your switch or router.
3. Connect to all devices via SSH.
```
ssh user@<raspberry-pi-1-address>
ssh user@<raspberry-pi-2-address>
```
4. Install Git:
```sh
sudo apt install git
```
5. Clone this repository and compile Distributed Llama on all devices:
```sh
git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama
make dllama-api
```
6. Transfer weights and the tokenizer file to the root device.
7. Optional: assign static IP addresses.
```sh
sudo ip addr add 10.0.0.1/24 dev eth0 # 1st device
sudo ip addr add 10.0.0.2/24 dev eth0 # 2nd device
```
8. Run worker nodes on worker devices:
```sh
sudo nice -n -20 ./dllama worker --port 9998 --nthreads 4
```
9. Run root node on the root device:
```sh
sudo nice -n -20 ./dllama inference --model dllama_model_meta-llama-3-8b_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 10.0.0.2:9998
```

To add more worker nodes, just add more addresses to the `--workers` argument.
```
./dllama inference ... --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998
```

## 💻 Setup computers with macOS, Linux, or Windows
You need x86_64 CPUs with AVX2 support or ARM CPUs; different devices in the cluster may have different CPUs.

#### macOS or Linux

The instructions below are for Debian-based distributions, but you can easily adapt them to your distribution or to macOS.
1. Install Git and GCC:
```sh
sudo apt install git build-essential
```
2. Clone this repository and compile Distributed Llama on all computers:
```sh
git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama
make dllama-api
```

Continue to point 3.
#### Windows
1. Install Git and Mingw (via [Chocolatey](https://chocolatey.org/install)):
```powershell
choco install mingw
```
2. Clone this repository and compile Distributed Llama on all computers:
```sh
git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama
make dllama-api
```

Continue to point 3.
#### Run Cluster
3. Transfer weights and the tokenizer file to the root computer.
4. Run worker nodes on worker computers:
```sh
./dllama worker --port 9998 --nthreads 4
```
5. Run root node on the root computer:
```sh
./dllama inference --model dllama_model_meta-llama-3-8b_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 192.168.0.1:9998
```

To add more worker nodes, just add more addresses to the `--workers` argument.
```
./dllama inference ... --workers 192.168.0.1:9998 192.168.0.2:9998 192.168.0.3:9998
```

## ✋ Contribution
Feel free to contribute to this project. For small changes, simply create a new merge request. For larger changes, please create an issue to discuss your plans. Please follow these guidelines when contributing:
* Make only minimal changes and avoid modifying files that are not necessary.
* Ensure the code is compatible across all supported systems and CPUs.
* This repository is maintained in English.

## 💡 License
This project is released under the MIT license.
## 📖 Citation
```
@misc{dllama,
author = {Bartłomiej Tadych},
title = {Distributed Llama},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/b4rtaz/distributed-llama}},
commit = {7eb77ca93ec0d502e28d36b6fb20039b449cbea4}
}
```