Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dvmazur/mixtral-offloading
Run Mixtral-8x7B models in Colab or consumer desktops
- Host: GitHub
- URL: https://github.com/dvmazur/mixtral-offloading
- Owner: dvmazur
- License: mit
- Created: 2023-12-15T03:32:35.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2024-04-08T08:40:22.000Z (9 months ago)
- Last Synced: 2024-12-13T18:07:44.074Z (8 days ago)
- Topics: colab-notebook, deep-learning, google-colab, language-model, llm, mixture-of-experts, offloading, pytorch, quantization
- Language: Python
- Homepage:
- Size: 261 KB
- Stars: 2,290
- Watchers: 29
- Forks: 228
- Open Issues: 26
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-adaptive-computation - official code
- AiTreasureBox - dvmazur/mixtral-offloading - Run Mixtral-8x7B models in Colab or consumer desktops (Repos)
- StarryDivineSky - dvmazur/mixtral-offloading - Efficient inference of the Mixtral-8x7B model through a combination of techniques: mixed quantization with HQQ, applying separate quantization schemes to attention layers and experts so the model fits into combined GPU and CPU memory; and an MoE offloading strategy in which each expert per layer is offloaded separately and brought back to the GPU only when needed, with active experts kept in an LRU cache to reduce GPU-RAM communication when computing activations for adjacent tokens. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
README
# Mixtral offloading
This project implements efficient inference of [Mixtral-8x7B models](https://mistral.ai/news/mixtral-of-experts/).
## How does it work?
In summary, we achieve efficient inference of Mixtral-8x7B models through a combination of techniques:
* **Mixed quantization with HQQ**. We apply separate quantization schemes for attention layers and experts to fit the model into the combined GPU and CPU memory.
* **MoE offloading strategy**. Each expert per layer is offloaded separately and only brought back to GPU when needed. We store active experts in an LRU cache to reduce GPU-RAM communication when computing activations for adjacent tokens (a minimal sketch of this caching scheme is shown below).

For more detailed information about our methods and results, please refer to our [tech report](https://arxiv.org/abs/2312.17238).
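To make the offloading idea concrete, here is a minimal, hypothetical PyTorch sketch of an LRU expert cache. The class name `ExpertLRUCache`, the toy expert modules, and the capacity are illustrative and not taken from this repository; the real implementation additionally operates on HQQ-quantized expert weights and handles prefetching.

```python
import collections

import torch
import torch.nn as nn


class ExpertLRUCache:
    """Keep at most `capacity` expert modules on the GPU; the rest stay on CPU.

    Hypothetical illustration of the LRU offloading idea, not the repo's actual code.
    """

    def __init__(self, experts: dict, capacity: int, device: str = "cuda"):
        self.experts = experts                    # {(layer_idx, expert_idx): nn.Module}, all on CPU
        self.capacity = capacity
        self.device = device
        self.on_gpu = collections.OrderedDict()   # insertion order doubles as LRU order

    def get(self, key) -> nn.Module:
        if key in self.on_gpu:
            self.on_gpu.move_to_end(key)          # cache hit: mark as most recently used
            return self.experts[key]
        if len(self.on_gpu) >= self.capacity:
            evicted, _ = self.on_gpu.popitem(last=False)  # evict least recently used expert
            self.experts[evicted].to("cpu")
        self.experts[key].to(self.device)         # bring the requested expert back to GPU
        self.on_gpu[key] = True
        return self.experts[key]


if __name__ == "__main__":
    # Toy setup: 8 small expert MLPs in one layer, only 2 resident on the GPU at a time.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    experts = {(0, e): nn.Sequential(nn.Linear(16, 64), nn.SiLU(), nn.Linear(64, 16))
               for e in range(8)}
    cache = ExpertLRUCache(experts, capacity=2, device=device)

    x = torch.randn(1, 16, device=device)
    for e in [3, 3, 5, 1, 3]:                     # adjacent tokens tend to reuse experts
        y = cache.get((0, e))(x)                  # cache hits skip the CPU->GPU copy
    print(y.shape)
```

Because adjacent tokens often route to overlapping sets of experts, keeping recently used experts resident on the GPU avoids repeated host-to-device transfers.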
## Running
To try this demo, please use the demo notebook: [./notebooks/demo.ipynb](./notebooks/demo.ipynb) or [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dvmazur/mixtral-offloading/blob/master/notebooks/demo.ipynb)
For now, there is no command-line script for running the model locally, but you can create one using the demo notebook as a reference. Contributions are welcome!
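If you want to experiment with a local script in the meantime, one possible shape for such a wrapper is sketched below. This is only an assumption-laden outline: `build_model_from_notebook_setup` is a hypothetical placeholder for the model-construction cells of `notebooks/demo.ipynb`, and the Hugging Face model id is just an example.

```python
# Hypothetical CLI wrapper; not part of this repository.
import argparse

import torch
from transformers import AutoTokenizer


def build_model_from_notebook_setup(model_name: str) -> torch.nn.Module:
    # Placeholder: copy the model-construction cells from notebooks/demo.ipynb here.
    raise NotImplementedError


def main() -> None:
    parser = argparse.ArgumentParser(description="Run an offloaded Mixtral-8x7B model locally")
    parser.add_argument("--model", default="mistralai/Mixtral-8x7B-Instruct-v0.1")  # example id
    parser.add_argument("--prompt", required=True)
    parser.add_argument("--max-new-tokens", type=int, default=128)
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.model)
    model = build_model_from_notebook_setup(args.model)

    inputs = tokenizer(args.prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=args.max_new_tokens)
    print(tokenizer.decode(output[0], skip_special_tokens=True))


if __name__ == "__main__":
    main()
```

A script like this (saved, for example, as `run_mixtral.py`) would then be invoked as `python run_mixtral.py --prompt "Hello"` once the placeholder is filled in.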
## Work in progress
Some techniques described in our technical report are not yet available in this repo. However, we are actively working on adding support for them in the near future.
Some of the upcoming features are:
* Support for other quantization methods
* Speculative expert prefetching