https://github.com/VITA-MLLM/Long-VITA
✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
long-context mllm vision-language-model
Last synced: about 1 year ago
- Host: GitHub
- URL: https://github.com/VITA-MLLM/Long-VITA
- Owner: VITA-MLLM
- License: other
- Created: 2024-12-14T03:48:19.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-20T05:23:49.000Z (about 1 year ago)
- Last Synced: 2025-03-20T06:26:29.509Z (about 1 year ago)
- Topics: long-context, mllm, vision-language-model
- Language: Python
- Homepage:
- Size: 3.85 MB
- Stars: 255
- Watchers: 15
- Forks: 29
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
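- StarryDivineSky - VITA-MLLM/Long-VITA - VITA is a project that aims to scale large multi-modal models to 1 million tokens while maintaining leading short-context accuracy. By introducing a visual token aggregation (VITA) method, it significantly reduces the computational cost of long-context multi-modal models. The core idea of VITA is to progressively aggregate visual tokens into fewer "visual landmarks", reducing the amount of processing in subsequent Transformer layers. The project claims state-of-the-art performance on long-context multi-modal benchmarks while remaining competitive on short-context tasks. Long-VITA trains efficiently and can be fine-tuned on a single GPU. The project provides code, model weights, and a demo so that users can try it out and reproduce the results. It supports multiple vision encoders and LLMs, which makes it flexible. Long-VITA offers a new approach to building more efficient and more capable long-context multi-modal models. The project focuses in particular on long-context reasoning and tries to balance long-text and image processing. (Multi-modal large models / Resource transfer and download)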
README
# Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
## :fire: News
* **`2025.02.27`** 🌟 We now have an [Online Demo](https://huggingface.co/spaces/shenyunhang/Long-VITA).
* **`2025.02.27`** 🌟 [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) from OpenCompass now supports Long-VITA.
* **`2025.02.17`** 🌟 We support training on **Nvidia GPU with DeepSpeed** and inference on **Nvidia GPU with Transformers**.
* **`2025.02.09`** 🌟 We support training and inference on **Nvidia GPU with Megatron**.
* **`2025.02.05`** 🌟 We release training code, **training log**, deployment code, and model weights, which support **Ascend NPU with MindSpeed**.
* **`2025.02.05`** 🌟 We are proud to launch Long-VITA, a strong long-context visual language model supporting over one million tokens.
## Contents
- [Highlights](#-highlights)
- [Experimental Results](#-experimental-results)
- [Models](#-models)
- [Training, Inference and Evaluation](#-training-inference-and-evaluation)
## ✨ Highlights
- **Long Context**. Long-VITA can process more than **4K** frames or over **1M** visual tokens. It achieves state-of-the-art performance on Video-MME among models under 20B parameters.
- **Open Source**. Long-VITA is trained on **open-source data** only, consisting of a mix of 17M samples that are publicly available.
- **Strong Performance**. Long-VITA achieves competitive results on image and video understanding benchmarks among cutting-edge models under 20B parameters.
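The frame and token figures in the first highlight can be related with simple arithmetic. The sketch below takes the training context and training frame counts listed in the Models table and computes the implied context budget per frame. This ratio is a derived quantity for illustration, not a per-frame visual token count stated by the authors, and the names `MODELS` and `tokens_per_frame` are hypothetical.

```python
# Training context (tokens) and training frames per model, copied from the Models table.
MODELS = {
    "Long-VITA-16K": (16_384, 64),
    "Long-VITA-128K": (131_072, 512),
    "Long-VITA-1M": (1_048_576, 4_096),
}


def tokens_per_frame(context_tokens: int, frames: int) -> float:
    """Average context budget per frame (a derived ratio, not an official figure)."""
    return context_tokens / frames


# Compute the ratio for every listed model.
budgets = {
    name: tokens_per_frame(context, frames)
    for name, (context, frames) in MODELS.items()
}

for name, budget in budgets.items():
    print(f"{name}: {budget:.0f} tokens per frame")
```

Under these numbers the ratio is the same for all three models, which suggests the frame count was scaled in proportion to the context length.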
## 📈 Experimental Results
- **Comparison of image understanding**.


- **Comparison of video understanding**.


- **Effectiveness of Logits-Masked LM Head**.

## 🐍 Models
Model | LLM Size | Training Context | Training Frames | MindSpeed Weights | Megatron Weights | Huggingface Weights
---------------:|---------:|-----------------:|----------------:|------------------------------------------------:|---------------------------------------------------:|---------------------------------------------------:
Long-VITA-16K | 14B | 16,384 | 64 | https://huggingface.co/VITA-MLLM/Long-VITA-16K | https://huggingface.co/VITA-MLLM/Long-VITA-16K_MG | https://huggingface.co/VITA-MLLM/Long-VITA-16K_HF
Long-VITA-128K | 14B | 131,072 | 512 | https://huggingface.co/VITA-MLLM/Long-VITA-128K | https://huggingface.co/VITA-MLLM/Long-VITA-128K_MG | https://huggingface.co/VITA-MLLM/Long-VITA-128K_HF
Long-VITA-1M | 14B | 1,048,576 | 4,096 | https://huggingface.co/VITA-MLLM/Long-VITA-1M | https://huggingface.co/VITA-MLLM/Long-VITA-1M_MG | https://huggingface.co/VITA-MLLM/Long-VITA-1M_HF
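The weight links in the table follow a visible naming pattern: the MindSpeed repository uses the bare model name, while the Megatron and Huggingface repositories append `_MG` and `_HF`. A minimal sketch, assuming the `huggingface_hub` package is installed, that builds a repository ID from that pattern and fetches it with `snapshot_download`. The helpers `weights_repo_id` and `download_weights` are hypothetical and not part of the Long-VITA codebase.

```python
# Suffixes observed in the Models table; the helper names are illustrative only.
SUFFIXES = {"mindspeed": "", "megatron": "_MG", "huggingface": "_HF"}
VARIANTS = ("16K", "128K", "1M")


def weights_repo_id(variant: str, weight_format: str) -> str:
    """Build the Hugging Face Hub repository ID for a model variant and weight format."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    if weight_format not in SUFFIXES:
        raise ValueError(f"unknown format {weight_format!r}; expected one of {tuple(SUFFIXES)}")
    return f"VITA-MLLM/Long-VITA-{variant}{SUFFIXES[weight_format]}"


def download_weights(variant: str, weight_format: str, local_dir=None) -> str:
    """Download the full repository snapshot and return the local path."""
    # Imported lazily so the ID helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=weights_repo_id(variant, weight_format),
        local_dir=local_dir,
    )


if __name__ == "__main__":
    # Printing the ID is cheap; the download itself needs network access and disk space.
    print(weights_repo_id("16K", "huggingface"))
```

Calling `download_weights("16K", "huggingface")` would fetch the smallest-context checkpoint in Huggingface format. Which weight format each training or inference backend expects is described in the backend guides linked from this README.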
## ⭐ Training, Inference and Evaluation
Long-VITA was originally implemented on Ascend NPU and has since been adapted to Nvidia GPU.
- [Data Preparation for Training](https://github.com/VITA-MLLM/Long-VITA/blob/main/DATA.md)
- [Ascend NPU with MindSpeed](https://github.com/VITA-MLLM/Long-VITA/blob/main/NPU_MindSpeed.md)
- [Nvidia GPU with Megatron](https://github.com/VITA-MLLM/Long-VITA/blob/main/GPU_Megatron.md)
- [Nvidia GPU with DeepSpeed](https://github.com/VITA-MLLM/Long-VITA/blob/main/GPU_DeepSpeed.md)