https://github.com/VITA-MLLM/Long-VITA

✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

Host: GitHub
URL: https://github.com/VITA-MLLM/Long-VITA
Owner: VITA-MLLM
License: other
Created: 2024-12-14T03:48:19.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-03-20T05:23:49.000Z (about 1 year ago)
Last Synced: 2025-03-20T06:26:29.509Z (about 1 year ago)
Topics: long-context, mllm, vision-language-model
Language: Python
Homepage:
Size: 3.85 MB
Stars: 255
Watchers: 15
Forks: 29
Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

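StarryDivineSky - VITA-MLLM/Long-VITA - VITA is a project that aims to scale large multi-modal models to 1 million tokens while maintaining leading short-context accuracy. By introducing a visual token aggregation (VITA) method, it significantly reduces the computational cost of long-context multi-modal models. The core idea of VITA is to progressively aggregate visual tokens into fewer "visual landmarks", reducing the processing load of subsequent Transformer layers. The project claims state-of-the-art performance on long-context multi-modal benchmarks while remaining competitive on short-context tasks. Long-VITA is efficient to train and can be fine-tuned on a single GPU. The project provides code, model weights, and a demo so that users can try it and reproduce the results. It supports multiple vision encoders and LLMs, which gives it good flexibility. Long-VITA offers a new approach to building more efficient and more powerful long-context multi-modal models. The project pays particular attention to long-context reasoning and strives to balance long-text and image processing. (Multi-modal large models / Resource transfer and download)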

README

# Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

## :fire: News

* **`2025.02.27`** 🌟 We now have an [Online Demo](https://huggingface.co/spaces/shenyunhang/Long-VITA).
* **`2025.02.27`** 🌟 [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) from OpenCompass now supports Long-VITA.
* **`2025.02.17`** 🌟 We support training on **Nvidia GPU with DeepSpeed** and inference on **Nvidia GPU with Transformers**.
* **`2025.02.09`** 🌟 We support training and inference on **Nvidia GPU with Megatron**.
* **`2025.02.05`** 🌟 We release training code, **training log**, deployment code, and model weights, which support **Ascend NPU with MindSpeed**.
* **`2025.02.05`** 🌟 We are proud to launch Long-VITA, a strong long-context visual language model supporting over one million tokens.

## Contents

- [Highlights](#-highlights)
- [Experimental Results](#-experimental-results)
- [Models](#-models)
- [Training, Inference and Evaluation](#-training-inference-and-evaluation)

## ✨ Highlights

- **Long Context**. Long-VITA can process more than **4K** frames or over **1M** visual tokens. It achieves state-of-the-art performance on Video-MME among models under 20B parameters.
- **Open Source**. Long-VITA is trained only on **open-source data**: a mix of 17M publicly available samples.
- **Strong Performance**. Long-VITA achieves competitive results on image and video understanding benchmarks among cutting-edge models under 20B parameters.

## 📈 Experimental Results
- **Comparison of image understanding**.

![image](https://github.com/user-attachments/assets/235bdb0e-37e6-4a5f-b20b-21b0bb83278a)
![image](https://github.com/user-attachments/assets/72250c5b-7d33-4dba-98ab-0539bae08703)

- **Comparison of video understanding**.

![image](https://github.com/user-attachments/assets/7f09662b-bd53-4504-927a-0e45214a049d)

![image](https://github.com/user-attachments/assets/87bd2f4d-baf5-4a63-8002-151e30f52147)

- **Effectiveness of Logits-Masked LM Head**.

![image](https://github.com/user-attachments/assets/94389a9f-3134-4fd6-9531-62f626d38e39)

## 🐍 Models

Model | LLM Size | Training Context | Training Frames | MindSpeed Weights | Megatron Weights | Huggingface Weights
---------------:|---------:|-----------------:|----------------:|------------------------------------------------:|---------------------------------------------------:|---------------------------------------------------:
Long-VITA-16K | 14B | 16,384 | 64 | https://huggingface.co/VITA-MLLM/Long-VITA-16K | https://huggingface.co/VITA-MLLM/Long-VITA-16K_MG | https://huggingface.co/VITA-MLLM/Long-VITA-16K_HF
Long-VITA-128K | 14B | 131,072 | 512 | https://huggingface.co/VITA-MLLM/Long-VITA-128K | https://huggingface.co/VITA-MLLM/Long-VITA-128K_MG | https://huggingface.co/VITA-MLLM/Long-VITA-128K_HF
Long-VITA-1M | 14B | 1,048,576 | 4,096 | https://huggingface.co/VITA-MLLM/Long-VITA-1M | https://huggingface.co/VITA-MLLM/Long-VITA-1M_MG | https://huggingface.co/VITA-MLLM/Long-VITA-1M_HF
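
The three rows above keep a constant ratio between training context and training frames. The short check below uses only the numbers in the table; the variable and function names are illustrative, and reading the ratio as a per-frame token budget is an assumption rather than something this README states.

```python
# Training context (tokens) and training frames, copied from the Models table.
MODELS = {
    "Long-VITA-16K": (16_384, 64),
    "Long-VITA-128K": (131_072, 512),
    "Long-VITA-1M": (1_048_576, 4_096),
}


def context_per_frame(context_tokens: int, frames: int) -> int:
    """Return how many context tokens are available per training frame."""
    return context_tokens // frames


# Compute the ratio for every model listed in the table.
ratios = {name: context_per_frame(*row) for name, row in MODELS.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio} tokens per frame")
```

Each model works out to 256 context tokens per training frame, so the three checkpoints differ in how many frames fit into the context, not in the budget per frame.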

## ⭐ Training, Inference and Evaluation

We originally implemented Long-VITA on Ascend NPU and have since adapted it to Nvidia GPU.

- [Data Preparation for Training](https://github.com/VITA-MLLM/Long-VITA/blob/main/DATA.md)

- [Ascend NPU with MindSpeed](https://github.com/VITA-MLLM/Long-VITA/blob/main/NPU_MindSpeed.md)

- [Nvidia GPU with Megatron](https://github.com/VITA-MLLM/Long-VITA/blob/main/GPU_Megatron.md)

- [Nvidia GPU with DeepSpeed](https://github.com/VITA-MLLM/Long-VITA/blob/main/GPU_DeepSpeed.md)
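
Each guide above starts from one of the checkpoints in the Models table. The sketch below maps a model name and backend to its Hugging Face repository, following the suffix pattern visible in the table (no suffix for MindSpeed, `_MG` for Megatron, `_HF` for Hugging Face). The helpers `weights_repo` and `download_weights` are hypothetical and not part of the Long-VITA codebase; the download step assumes the `huggingface_hub` package is installed and relies on its `snapshot_download` function.

```python
# Suffixes inferred from the repository names in the Models table.
BACKEND_SUFFIX = {"mindspeed": "", "megatron": "_MG", "huggingface": "_HF"}
MODEL_NAMES = ("Long-VITA-16K", "Long-VITA-128K", "Long-VITA-1M")


def weights_repo(model: str, backend: str) -> str:
    """Build the Hugging Face repo ID for a model/backend pair (hypothetical helper)."""
    if model not in MODEL_NAMES:
        raise ValueError(f"unknown model: {model!r}")
    if backend not in BACKEND_SUFFIX:
        raise ValueError(f"unknown backend: {backend!r}")
    return f"VITA-MLLM/{model}{BACKEND_SUFFIX[backend]}"


def download_weights(model: str, backend: str, local_dir: str) -> str:
    """Download a checkpoint and return its local path (hypothetical helper)."""
    # Imported lazily so the repo-ID helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=weights_repo(model, backend), local_dir=local_dir)


print(weights_repo("Long-VITA-16K", "huggingface"))
```

Calling `download_weights("Long-VITA-16K", "huggingface", "./weights")` would fetch the full checkpoint, so check the available disk space first; the linked guides remain the authoritative source for the expected directory layout.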
