{"id":13754139,"url":"https://github.com/b4rtaz/distributed-llama","last_synced_at":"2025-04-13T01:55:16.195Z","repository":{"id":218234721,"uuid":"727470807","full_name":"b4rtaz/distributed-llama","owner":"b4rtaz","description":"Connect home devices into a powerful cluster to accelerate LLM inference. More devices means faster inference.","archived":false,"fork":false,"pushed_at":"2025-04-12T12:01:24.000Z","size":3315,"stargazers_count":2014,"open_issues_count":30,"forks_count":145,"subscribers_count":41,"default_branch":"main","last_synced_at":"2025-04-13T01:54:52.167Z","etag":null,"topics":["distributed-computing","distributed-llm","llama2","llama3","llm","llm-inference","llms","neural-network","open-llm"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/b4rtaz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-04T23:36:06.000Z","updated_at":"2025-04-10T23:02:14.000Z","dependencies_parsed_at":"2024-01-30T23:45:39.616Z","dependency_job_id":"a0d8b65e-bdaa-48d1-a91b-3a62e29c10d1","html_url":"https://github.com/b4rtaz/distributed-llama","commit_stats":{"total_commits":273,"total_committers":3,"mean_commits":91.0,"dds":0.04395604395604391,"last_synced_commit":"d10699fec2e4b9303065807555e3898e69f6cf18"},"previous_names":["b4rtaz/distributed-llama"],"tags_count":37,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/b4rtaz%2Fdistributed-llama","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/b4rtaz%2Fdistributed-llama/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/b4rtaz%2Fdistributed-llama/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/b4rtaz%2Fdistributed-llama/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/b4rtaz","download_url":"https://codeload.github.com/b4rtaz/distributed-llama/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248654051,"owners_count":21140235,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distributed-computing","distributed-llm","llama2","llama3","llm","llm-inference","llms","neural-network","open-llm"],"created_at":"2024-08-03T09:01:42.712Z","updated_at":"2025-04-13T01:55:16.168Z","avatar_url":"https://github.com/b4rtaz.png","language":"C++","funding_links":["https://github.com/sponsors/b4rtaz"],"categories":["A01_文本生成_文本对话","Local Inference","C++","Other","Inference engines","8. Inference Engines","3. 
Inference Engines \u0026 Serving"],"sub_categories":["大语言对话模型及数据","Desktop / Local"],"readme":"![Distributed Llama](.github/cover.png)\n\n# Distributed Llama\n\n[![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/b4rtaz/distributed-llama/.github%2Fworkflows%2Fmain.yml?style=flat-square)](https://github.com/b4rtaz/distributed-llama/actions) [![License: MIT](https://img.shields.io/github/license/mashape/apistatus.svg?style=flat-square)](/LICENSE) [![Support this project](https://img.shields.io/github/sponsors/b4rtaz?style=flat-square\u0026label=support%20this%20project\u0026color=green)](https://github.com/sponsors/b4rtaz) [![Discord](https://discordapp.com/api/guilds/1245814812353495070/widget.png?style=shield)](https://discord.com/widget?id=1245814812353495070\u0026theme=dark)\n\nConnect home devices into a powerful cluster to accelerate LLM inference. More devices mean faster performance, leveraging tensor parallelism and high-speed synchronization over Ethernet.\n\nSupports Linux, macOS, and Windows. Optimized for ARM and x86_64 AVX2 CPUs.\n\n**News**\n- 23 Mar 2025 - [🌋 Experimental Vulkan support](https://github.com/b4rtaz/distributed-llama/releases/tag/v0.13.0)\n- 12 Feb 2025 - 🚧 Merged the [fundamental codebase refactor](https://github.com/b4rtaz/distributed-llama/releases/tag/v0.12.0)\n- 9 Jan 2025 - [🍎 Llama 3.3 70B on 4 x Mac Mini M4 Pro 24GB RAM](https://github.com/b4rtaz/distributed-llama/discussions/147)\n- 28 Jul 2024 - [🌳 How to Run Llama 3.1 405B on Home Devices? Build AI Cluster!](https://medium.com/@b4rtaz/how-to-run-llama-3-405b-on-home-devices-build-ai-cluster-ad0d5ad3473b)\n\n\n### 🔥 Setup Root Node by Single Command\n\nPython 3 and C++ compiler required. 
The command will download the model and the tokenizer.\n\n| Model                             | Size     | Command                                              |\n| --------------------------------- | -------- | ---------------------------------------------------- |\n| Llama 3.1 8B Instruct Q40         | 6.32 GB  | `python launch.py llama3_1_8b_instruct_q40`          |\n| Llama 3.1 405B Instruct Q40       | 238 GB   | `python launch.py llama3_1_405b_instruct_q40`        |\n| Llama 3.2 1B Instruct Q40         | 1.7 GB   | `python launch.py llama3_2_1b_instruct_q40`          |\n| Llama 3.2 3B Instruct Q40         | 3.4 GB   | `python launch.py llama3_2_3b_instruct_q40`          |\n| Llama 3.3 70B Instruct Q40        | 40 GB    | `python launch.py llama3_3_70b_instruct_q40`         |\n| DeepSeek R1 Distill Llama 8B Q40  | 6.32 GB  | `python launch.py deepseek_r1_distill_llama_8b_q40`  |\n\n### 🛠️ Convert Model Manually\n\nSupported architectures: Llama.\n\n* [How to Convert Llama 3.1](./docs/LLAMA.md)\n* [How to Convert Hugging Face Model](./docs/HUGGINGFACE.md)\n\n### 🚧 Known Limitations\n\n* You can run Distributed Llama only on 1, 2, 4... 2^n nodes.\n* The maximum number of nodes is equal to the number of KV heads in the model [#70](https://github.com/b4rtaz/distributed-llama/issues/70).\n* Only the following quantizations are supported [#183](https://github.com/b4rtaz/distributed-llama/issues/183):\n  * `q40` model with `q80` `buffer-float-type`\n  * `f32` model with `f32` `buffer-float-type`\n\n### 👷 Architecture\n\nThe project is split up into two parts:\n* **Root node** - it's responsible for loading the model and weights and forwarding them to workers. It also synchronizes the state of the neural network. The root node is also a worker: it processes its own slice of the neural network.\n* **Worker node** - it processes its own slice of the neural network. 
It doesn't require any configuration related to the model.\n\nYou always need the root node and you can add 2^n - 1 worker nodes to speed up the inference. The RAM usage of the neural network is split up across all nodes. The root node requires a bit more RAM than worker nodes.\n\n### 🎹 Commands\n\n* `dllama inference` - run the inference with a simple benchmark,\n* `dllama chat` - run the CLI chat,\n* `dllama worker` - run the worker node,\n* `dllama-api` - run the API server.\n\n\u003cdetails\u003e\n\n\u003csummary\u003e🎹 Supported Arguments\u003c/summary\u003e\n\n\u003cbr /\u003eInference, Chat, API\n\n| Argument                     | Description                                                      | Example                                |\n| ---------------------------- | ---------------------------------------------------------------- | -------------------------------------- |\n| `--model \u003cpath\u003e`             | Path to model.                                                   | `dllama_model_meta-llama-3-8b_q40.m`   |\n| `--tokenizer \u003cpath\u003e`         | Path to tokenizer.                                               | `dllama_tokenizer_llama3.t`            |\n| `--buffer-float-type \u003ctype\u003e` | Float precision of synchronization.                              | `q80`                                  |\n| `--workers \u003cworkers\u003e`        | Addresses of workers (ip:port), separated by spaces.             | `10.0.0.1:9999 10.0.0.2:9999`          |\n| `--max-seq-len \u003cn\u003e`          | The maximum sequence length; it helps to reduce the RAM usage.   
| `4096`                                 |\n\nInference, Chat, Worker, API\n\n| Argument                     | Description                                                           | Example                             |\n| ---------------------------- | --------------------------------------------------------------------- | ----------------------------------- |\n| `--nthreads \u003cn\u003e`             | Number of threads. Don't set a value higher than the number of CPU cores. | `4`                                 |\n\nWorker, API\n\n| Argument                     | Description                       | Example           |\n| ---------------------------- | --------------------------------- | ----------------- |\n| `--port \u003cport\u003e`              | Binding port.                     | `9999`            |\n\nInference\n\n| Argument                     | Description                    | Example            |\n| ---------------------------- | ------------------------------ | ------------------ |\n| `--prompt \u003cprompt\u003e`          | Initial prompt.                | `\"Hello World\"`    |\n| `--steps \u003csteps\u003e`            | Number of tokens to generate.  | `256`              |\n\n\u003c/details\u003e\n\n## 📊 Measurements\n\nPlease check the [discussions](https://github.com/b4rtaz/distributed-llama/discussions) section, where many measurements for different configurations have been published.\n\n## 🚀 Setup\n\nSelect and expand one of the sections below:\n\n\u003cdetails\u003e\n\n\u003csummary\u003e💻 MacOS, Linux, or Windows\u003c/summary\u003e\n\n\u003cbr /\u003eYou need x86_64 AVX2 CPUs or ARM CPUs. Different devices may have different CPUs.\n\n#### MacOS or Linux\n\nThe instructions below are for Debian-based distributions, but you can easily adapt them to other distributions or macOS.\n\n1. Install Git and GCC:\n```sh\nsudo apt install git build-essential\n```\n2. 
Clone this repository and compile Distributed Llama on all computers:\n```sh\ngit clone https://github.com/b4rtaz/distributed-llama.git\ncd distributed-llama\nmake dllama\nmake dllama-api\n```\n\nContinue to step 3.\n\n#### Windows\n\n1. Install Git and MinGW (via [Chocolatey](https://chocolatey.org/install)):\n```powershell\nchoco install mingw\n```\n2. Clone this repository and compile Distributed Llama on all computers:\n```sh\ngit clone https://github.com/b4rtaz/distributed-llama.git\ncd distributed-llama\nmake dllama\nmake dllama-api\n```\n\nContinue to step 3.\n\n#### Run Cluster\n\n3. Transfer weights and the tokenizer file to the root computer.\n4. Run worker nodes on worker computers:\n```sh\n./dllama worker --port 9999 --nthreads 4\n```\n5. Run root node on the root computer:\n```sh\n./dllama inference --model dllama_model_meta-llama-3-8b_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --prompt \"Hello world\" --steps 16 --nthreads 4 --workers 192.168.0.1:9999\n```\n\nTo add more worker nodes, just add more addresses to the `--workers` argument.\n\n```\n./dllama inference ... --workers 192.168.0.1:9999 192.168.0.2:9999 192.168.0.3:9999\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\n\u003csummary\u003e📟 Raspberry Pi\u003c/summary\u003e\n\n\u003cbr /\u003e\n\n1. Install `Raspberry Pi OS Lite (64 bit)` on your Raspberry Pi devices. This OS doesn't include a desktop environment.\n2. Connect all devices to your switch or router.\n3. Connect to all devices via SSH.\n```\nssh user@raspberrypi1.local\nssh user@raspberrypi2.local\n```\n4. Install Git:\n```sh\nsudo apt install git\n```\n5. Clone this repository and compile Distributed Llama on all devices:\n```sh\ngit clone https://github.com/b4rtaz/distributed-llama.git\ncd distributed-llama\nmake dllama\nmake dllama-api\n```\n6. Transfer weights and the tokenizer file to the root device.\n7. 
Optional: assign static IP addresses.\n```sh\nsudo ip addr add 10.0.0.1/24 dev eth0 # 1st device\nsudo ip addr add 10.0.0.2/24 dev eth0 # 2nd device\n```\n8. Run worker nodes on worker devices:\n```sh\nsudo nice -n -20 ./dllama worker --port 9999 --nthreads 4\n```\n9. Run root node on the root device:\n```sh\nsudo nice -n -20 ./dllama inference --model dllama_model_meta-llama-3-8b_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --prompt \"Hello world\" --steps 16 --nthreads 4 --workers 10.0.0.2:9999\n```\n\nTo add more worker nodes, just add more addresses to the `--workers` argument.\n\n```\n./dllama inference ... --workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999\n```\n\n\u003c/details\u003e\n\n## ✋ Contribution\n\nFeel free to contribute to this project. For small changes, simply create a new pull request. For larger changes, please create an issue to discuss your plans. Please follow these guidelines when contributing:\n\n* Make only minimal changes and avoid modifying files that are not necessary.\n* Ensure the code is compatible across all supported systems and CPUs.\n* This repository is maintained in English.\n\n## 💡 License\n\nThis project is released under the MIT license.\n\n## 📖 Citation\n\n```\n@misc{dllama,\n  author = {Bartłomiej Tadych},\n  title = {Distributed Llama},\n  year = {2024},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/b4rtaz/distributed-llama}},\n  commit = {7eb77ca93ec0d502e28d36b6fb20039b449cbea4}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fb4rtaz%2Fdistributed-llama","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fb4rtaz%2Fdistributed-llama","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fb4rtaz%2Fdistributed-llama/lists"}