Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dvmazur/mixtral-offloading
Run Mixtral-8x7B models in Colab or consumer desktops
- Host: GitHub
- URL: https://github.com/dvmazur/mixtral-offloading
- Owner: dvmazur
- License: mit
- Created: 2023-12-15T03:32:35.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2024-04-08T08:40:22.000Z (9 months ago)
- Last Synced: 2024-12-13T18:07:44.074Z (8 days ago)
- Topics: colab-notebook, deep-learning, google-colab, language-model, llm, mixture-of-experts, offloading, pytorch, quantization
- Language: Python
- Homepage:
- Size: 261 KB
- Stars: 2,290
- Watchers: 29
- Forks: 228
- Open Issues: 26
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-adaptive-computation - official code
- AiTreasureBox - dvmazur/mixtral-offloading - Run Mixtral-8x7B models in Colab or consumer desktops (Repos)
- StarryDivineSky - dvmazur/mixtral-offloading - Efficient inference of the Mixtral-8x7B model through a combination of techniques: mixed quantization with HQQ, applying separate quantization schemes to attention layers and experts so the model fits into combined GPU and CPU memory; and an MoE offloading strategy in which each expert per layer is offloaded separately and brought back to the GPU only when needed, with active experts kept in an LRU cache to reduce GPU-RAM communication when computing activations for adjacent tokens. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
README
# Mixtral offloading
This project implements efficient inference of [Mixtral-8x7B models](https://mistral.ai/news/mixtral-of-experts/).
## How does it work?
In summary, we achieve efficient inference of Mixtral-8x7B models through a combination of techniques:
* **Mixed quantization with HQQ**. We apply separate quantization schemes for attention layers and experts to fit the model into the combined GPU and CPU memory.
* **MoE offloading strategy**. Each expert per layer is offloaded separately and only brought back to GPU when needed. We store active experts in an LRU cache to reduce GPU-RAM communication when computing activations for adjacent tokens (a minimal sketch of this caching scheme is shown below).

For more detailed information about our methods and results, please refer to our [tech report](https://arxiv.org/abs/2312.17238).
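To make the offloading idea concrete, here is a minimal, hypothetical PyTorch sketch of an LRU expert cache. The class name `ExpertLRUCache`, the toy expert modules, and the capacity are illustrative and not taken from this repository; the real implementation additionally operates on HQQ-quantized expert weights and handles prefetching.

```python
import collections

import torch
import torch.nn as nn


class ExpertLRUCache:
    """Keep at most `capacity` expert modules on the GPU; the rest stay on CPU.

    Hypothetical illustration of the LRU offloading idea, not the repo's actual code.
    """

    def __init__(self, experts: dict, capacity: int, device: str = "cuda"):
        self.experts = experts                    # {(layer_idx, expert_idx): nn.Module}, all on CPU
        self.capacity = capacity
        self.device = device
        self.on_gpu = collections.OrderedDict()   # insertion order doubles as LRU order

    def get(self, key) -> nn.Module:
        if key in self.on_gpu:
            self.on_gpu.move_to_end(key)          # cache hit: mark as most recently used
            return self.experts[key]
        if len(self.on_gpu) >= self.capacity:
            evicted, _ = self.on_gpu.popitem(last=False)  # evict least recently used expert
            self.experts[evicted].to("cpu")
        self.experts[key].to(self.device)         # bring the requested expert back to GPU
        self.on_gpu[key] = True
        return self.experts[key]


if __name__ == "__main__":
    # Toy setup: 8 small expert MLPs in one layer, only 2 resident on the GPU at a time.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    experts = {(0, e): nn.Sequential(nn.Linear(16, 64), nn.SiLU(), nn.Linear(64, 16))
               for e in range(8)}
    cache = ExpertLRUCache(experts, capacity=2, device=device)

    x = torch.randn(1, 16, device=device)
    for e in [3, 3, 5, 1, 3]:                     # adjacent tokens tend to reuse experts
        y = cache.get((0, e))(x)                  # cache hits skip the CPU->GPU copy
    print(y.shape)
```

Because adjacent tokens often route to overlapping sets of experts, keeping recently used experts resident on the GPU avoids repeated host-to-device transfers.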
## Running
To try this demo, please use the demo notebook: [./notebooks/demo.ipynb](./notebooks/demo.ipynb) or [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dvmazur/mixtral-offloading/blob/master/notebooks/demo.ipynb)
For now, there is no command-line script for running the model locally, but you can create one using the demo notebook as a reference. Contributions are welcome!
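If you want to experiment with a local script in the meantime, one possible shape for such a wrapper is sketched below. This is only an assumption-laden outline: `build_model_from_notebook_setup` is a hypothetical placeholder for the model-construction cells of `notebooks/demo.ipynb`, and the Hugging Face model id is just an example.

```python
# Hypothetical CLI wrapper; not part of this repository.
import argparse

import torch
from transformers import AutoTokenizer


def build_model_from_notebook_setup(model_name: str) -> torch.nn.Module:
    # Placeholder: copy the model-construction cells from notebooks/demo.ipynb here.
    raise NotImplementedError


def main() -> None:
    parser = argparse.ArgumentParser(description="Run an offloaded Mixtral-8x7B model locally")
    parser.add_argument("--model", default="mistralai/Mixtral-8x7B-Instruct-v0.1")  # example id
    parser.add_argument("--prompt", required=True)
    parser.add_argument("--max-new-tokens", type=int, default=128)
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.model)
    model = build_model_from_notebook_setup(args.model)

    inputs = tokenizer(args.prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=args.max_new_tokens)
    print(tokenizer.decode(output[0], skip_special_tokens=True))


if __name__ == "__main__":
    main()
```

A script like this (saved, for example, as `run_mixtral.py`) would then be invoked as `python run_mixtral.py --prompt "Hello"` once the placeholder is filled in.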
## Work in progress
Some techniques described in our technical report are not yet available in this repo. However, we are actively working on adding support for them in the near future.
Some of the upcoming features are:
* Support for other quantization methods
* Speculative expert prefetching