https://github.com/yottalabsai/bloombee
Decentralized LLMs fine-tuning and inference with offloading
- Host: GitHub
- URL: https://github.com/yottalabsai/bloombee
- Owner: ai-decentralized
- License: apache-2.0
- Created: 2025-01-23T21:57:49.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-03-14T03:04:18.000Z (2 months ago)
- Last Synced: 2025-04-07T13:06:48.395Z (about 2 months ago)
- Topics: deep-learning, distributed-systems, llama, machine-learning, pipeline-parallelism, pytorch, tensor-parallelism
- Language: Python
- Homepage:
- Size: 36.6 MB
- Stars: 87
- Watchers: 10
- Forks: 13
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Run large language models in a heterogeneous decentralized environment with offloading.
The rapid rise of generative AI has boosted demand for large language model (LLM) inference and fine-tuning services. While proprietary models are still favored, advances in open-source LLMs have made them competitive. However, high costs and limited GPU resources hinder deployment. This work introduces BloomBee, a decentralized offline serving system that leverages idle GPU resources to provide cost-effective access to LLMs.
BloomBee relies on global GPU sharing, including consumer-grade GPUs. If your GPU can hold only a small portion of a large language model, such as Llama 3.1 (405B), you can connect to a network of servers that each load different parts of the model, and request inference or fine-tuning services over that network.
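To make this concrete, here is a minimal client-side sketch modeled on the Petals API that BloomBee builds on; the `bloombee` import path, the `AutoDistributedModelForCausalLM` class, and the `initial_peers` argument are assumptions carried over from Petals, not confirmed BloomBee interfaces:
```python
# Hypothetical client sketch, modeled on the Petals API that BloomBee derives from.
# The import path and class name are assumptions, not confirmed BloomBee interfaces.
import torch
from transformers import AutoTokenizer
from bloombee import AutoDistributedModelForCausalLM  # assumed import

MODEL = "huggyllama/llama-7b"
PEER = "/ip4/YOUR_IP_ADDRESS/tcp/31340/p2p/YOUR_PEER_ID"  # address of a running swarm

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# The transformer blocks are served by remote workers; only the embeddings
# and the LM head run on the local machine.
model = AutoDistributedModelForCausalLM.from_pretrained(
    MODEL, initial_peers=[PEER], torch_dtype=torch.float32
)

inputs = tokenizer("A decentralized swarm of GPUs can", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```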
🚀 Try now in Colab

## Installation
#### From PyPI
```bash
pip install bloombee
```
#### From Source
```bash
git clone https://github.com/ai-decentralized/BloomBee.git
cd BloomBee
pip install .
```
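Either way, a quick import is enough to confirm the package is installed (the `__version__` attribute is an assumption; not every package exposes one):
```python
# Minimal install check; __version__ is an assumed attribute and may not exist.
import bloombee

print(getattr(bloombee, "__version__", "installed (no __version__ attribute)"))
```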
## How to use BloomBee (Try now in Colab)
#### 1. Start the main server
```bash
python -m bloombee.cli.run_dht --host_maddrs /ip4/0.0.0.0/tcp/31340 --identity_path bootstrap1.id
```
Now you will get the BloomBee main server's location:
```
Mon 00 01:23:45.678 [INFO] Running a DHT instance. To connect other peers to this one, use --initial_peers /ip4/YOUR_IP_ADDRESS/tcp/31340/p2p/QmefxzDL1DaJ7TcrZjLuz7Xs9sUVKpufyg7f5276ZHFjbQ
```
You can provide this address as --initial_peers to workers or other backbone servers. If you want your swarm to be accessible outside of your local network, ensure that you have a **public IP address** or set up **port forwarding** correctly, so that your peer is reachable from the outside.
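To sanity-check that the DHT is reachable from another machine before attaching workers, one option is to join it as a lightweight client peer with Hivemind, the decentralized deep learning library that BloomBee builds on. This is a sketch using Hivemind's public API, not a BloomBee command:
```python
# Reachability check via Hivemind (the DHT library underlying BloomBee).
import hivemind

# Replace with the multiaddress printed by run_dht above.
INITIAL_PEER = "/ip4/YOUR_IP_ADDRESS/tcp/31340/p2p/YOUR_PEER_ID"

# client_mode=True joins as a lookup-only peer that stores no DHT data;
# construction raises an error if the initial peer cannot be reached.
dht = hivemind.DHT(initial_peers=[INITIAL_PEER], client_mode=True, start=True)
print("Successfully joined the DHT.")
dht.shutdown()
```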
#### 2. Connect the workers to the main BloomBee server
Store the BloomBee server location in an environment variable:
```bash
export BBSERVER=/ip4/10.52.2.249/tcp/31340/p2p/QmefxzDL1DaJ7TcrZjLuz7Xs9sUVKpufyg7f5276ZHFjbQ
```
Start one worker to hold 16 blocks (16 transformer layers):
```bash
python -m bloombee.cli.run_server huggyllama/llama-7b --initial_peers $BBSERVER --num_blocks 16 --identity_path bootstrap_1.id
```
Start a second worker to hold another 16 blocks (16 transformer layers). Each peer needs its own identity file:
```bash
python -m bloombee.cli.run_server huggyllama/llama-7b --initial_peers $BBSERVER --num_blocks 16 --identity_path bootstrap_2.id
```
#### 3. Run inference or finetune jobs
#### Inference
```bash
cd BloomBee/
python benchmarks/benchmark_inference.py --model huggyllama/llama-7b --initial_peers $BBSERVER --torch_dtype float32 --seq_len 128
```
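Beyond the one-shot benchmark, Petals (which BloomBee derives from) supports interactive decoding through an inference session that keeps attention caches on the workers between steps. If BloomBee preserves that API, token-by-token generation could look like this sketch; `inference_session` and the `session` keyword are assumptions, not confirmed BloomBee interfaces:
```python
# Hypothetical interactive decoding sketch; inference_session is a Petals API
# that is assumed, not confirmed, to carry over to BloomBee.
from transformers import AutoTokenizer
from bloombee import AutoDistributedModelForCausalLM  # assumed import

MODEL = "huggyllama/llama-7b"
PEER = "/ip4/YOUR_IP_ADDRESS/tcp/31340/p2p/YOUR_PEER_ID"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoDistributedModelForCausalLM.from_pretrained(MODEL, initial_peers=[PEER])

# The session pins attention caches on the workers, so each step sends only
# the newly generated token instead of re-running the whole prefix.
with model.inference_session(max_length=128) as session:
    tokens = tokenizer("The swarm replied:", return_tensors="pt")["input_ids"]
    for _ in range(16):
        tokens = model.generate(tokens, max_new_tokens=1, session=session)[:, -1:]
        print(tokenizer.decode(tokens[0]), end="", flush=True)
```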
#### Finetune
```bash
cd BloomBee/
python benchmarks/benchmark_training.py --model huggyllama/llama-7b --initial_peers $BBSERVER --torch_dtype float32 --n_steps 20 --batch_size 32 --seq_len 128
```
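As with inference, client-side fine-tuning in the Petals lineage trains only a small set of local parameters (for example, learned prompts) while forward and backward passes stream through the remote workers. A rough sketch follows; `tuning_mode="ptune"` and `pre_seq_len` are Petals arguments assumed, not confirmed, to exist in BloomBee:
```python
# Hypothetical fine-tuning sketch modeled on Petals-style prompt tuning.
# tuning_mode, pre_seq_len, and the import path are assumptions, not confirmed options.
import torch
from transformers import AutoTokenizer
from bloombee import AutoDistributedModelForCausalLM  # assumed import

MODEL = "huggyllama/llama-7b"
PEER = "/ip4/YOUR_IP_ADDRESS/tcp/31340/p2p/YOUR_PEER_ID"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoDistributedModelForCausalLM.from_pretrained(
    MODEL, initial_peers=[PEER], tuning_mode="ptune", pre_seq_len=16
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

batch = tokenizer("BloomBee serves LLMs over idle GPUs.", return_tensors="pt")
# Remote blocks handle forward/backward; only the small local prompt
# parameters receive gradient updates.
loss = model(input_ids=batch["input_ids"], labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"step loss: {loss.item():.4f}")
```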
## Acknowledgements
BloomBee is built upon a few popular libraries:
- [Hivemind](https://github.com/learning-at-home/hivemind) - A PyTorch library for decentralized deep learning across the Internet.
- [FlexLLMGen](https://github.com/FMInference/FlexLLMGen) - An offloading-based system running on weak GPUs.
- [Petals](https://github.com/bigscience-workshop/petals) - A library for decentralized LLMs fine-tuning and inference without offloading.