# Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model

Ting Liu1\*, Liangtao Shi2\*, Richang Hong2, Yue Hu1,\
Quanjun Yin1✉️, Linfeng Zhang3✉️

1National University of Defense Technology, 2Hefei University of Technology,\
3Shanghai Jiao Tong University






## 👀 Overview

The vision tokens in multimodal large language models usually exhibit significant spatial and temporal redundancy and take up most of the input tokens, which harms inference efficiency. To address this problem, recent works drop unimportant tokens during inference, but the importance of each token is decided only by the information in either the vision encoding stage or the prefilling stage. In this paper, we propose Multi-Stage Token Dropping (MustDrop) to measure the importance of each token across **the whole lifecycle**, including the vision encoding stage, prefilling stage, and decoding stage.

Comparison of vision token dropping methods: (a) methods that only drop tokens during the vision encoding stage, e.g., PruMerge and ToMe; (b) methods that remove tokens only in the prefilling phase, e.g., FastV and SparseVLM; and (c) our MustDrop approach, which gradually removes invalid tokens during the vision encoding, prefilling, and decoding stages.



## 👨 Preparation

1. Clone this repository.
```bash
git clone https://github.com/liuting20/MustDrop.git
cd MustDrop
```

2. Install the necessary packages.
```Shell
conda create -n MustDrop python=3.10 -y
conda activate MustDrop
pip install -e .
```

3. Download the multimodal benchmarks.

Please follow the detailed instructions in [LLaVA-Evaluation](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md).
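For example, after following that guide, the TextVQA evaluation data should sit under `./playground/data/eval/textvqa`. A hedged, partial sketch of that step (the URLs are the official TextVQA release files referenced by the LLaVA guide; verify them and the exact layout against the guide itself):
```Shell
# Sketch only: fetch TextVQA validation data into the layout used by LLaVA's eval scripts
mkdir -p ./playground/data/eval/textvqa
wget -P ./playground/data/eval/textvqa https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
wget -P ./playground/data/eval/textvqa https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip ./playground/data/eval/textvqa/train_val_images.zip -d ./playground/data/eval/textvqa
```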

4. Download LLaVA and place the checkpoint under ./liuhaotian/llava-v1.5-7b.

[LLaVA-1.5](https://huggingface.co/liuhaotian/llava-v1.5-7b)

[LLaVA-NeXT](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b)
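One convenient way to fetch the checkpoint into the expected folder is `huggingface-cli download` (assuming `huggingface_hub` is available in the environment; a `git lfs` clone of the model repository works equally well):
```Shell
# Example: download LLaVA-1.5-7B into the directory referenced above
huggingface-cli download liuhaotian/llava-v1.5-7b --local-dir ./liuhaotian/llava-v1.5-7b
```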

## 🎯 Usage
The `--sparse` flag in the evaluation scripts controls whether token sparsification is applied, while `--global_thr` and `--individual_thr` control the degree of token sparsity.
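The example commands below all invoke the same script; the different token budgets come from the threshold settings. As a rough, hedged sketch of where these flags fit, assuming the script follows LLaVA's standard TextVQA evaluation layout (check `scripts/v1_5/eval/textvqa.sh` for the actual contents):
```Shell
# Illustrative sketch only -- the real script in this repository may differ
python -m llava.eval.model_vqa_loader \
    --model-path ./liuhaotian/llava-v1.5-7b \
    --question-file ./playground/data/eval/textvqa/llava_textvqa_val_v051_ocr.jsonl \
    --image-folder ./playground/data/eval/textvqa/train_images \
    --answers-file ./playground/data/eval/textvqa/answers/llava-v1.5-7b.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1 \
    --sparse \
    --global_thr 0.001 \
    --individual_thr 0.001
```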

1. Example of evaluating on TextVQA (192 retained tokens, global_thr = 0.001, individual_thr = 0.001):
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh
```

2. Example of evaluating on TextVQA (128 retained tokens, global_thr = 0.0012, individual_thr = 0.001):
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh
```

3. Example of evaluating on TextVQA (64 retained tokens, global_thr = 0.011, individual_thr = 0.01):
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh
```

## License

This project is released under the [Apache 2.0 license](LICENSE).

## Citation

If you use MustDrop in your research, please cite our work using the following BibTeX entry:
```bibtex
@article{liu2024multi,
  title={Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model},
  author={Liu, Ting and Shi, Liangtao and Hong, Richang and Hu, Yue and Yin, Quanjun and Zhang, Linfeng},
  journal={arXiv preprint arXiv:2411.10803},
  year={2024}
}
```

## Acknowledgment

We extend our gratitude to the open-source efforts of [LLaVA](https://github.com/haotian-liu/LLaVA), [SparseVLMs](https://github.com/Gumpest/SparseVLMs), and [Video-LLaVA](https://github.com/PKU-YuanGroup/Video-LLaVA).