# 💫 Awesome-Token-Merge-for-MLLMs

[![Awesome](https://awesome.re/badge.svg)](https://awesome.re) ![GitHub stars](https://img.shields.io/github/stars/JinXins/Awesome-Token-Merge-for-MLLMs?color=green) ![GitHub forks](https://img.shields.io/github/forks/JinXins/Awesome-Token-Merge-for-MLLMs?color=yellow&label=Fork)

## Welcome to Awesome-Token-Merge-for-MLLMs.
**If you know of related papers that are not yet included in this list, please let me know via `Issues`!**

If this repository has been helpful to you, please consider giving it a ⭐️. Your support helps us reach more researchers and grow this resource. Thank you!



## 📜 Introduction

**We summarize awesome token merge / reduce / resample / drop methods applied to the vision tokens of multi-modal large language models (MLLMs).**

The list of token merge, reduce, drop, and resample methods is organized in chronological order and is continuously **updated**.
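For readers new to the topic, the sketch below illustrates what a typical training-free visual token reduction step looks like: score visual tokens by their [CLS] attention, keep the top-k, and merge the remaining tokens into their most similar kept tokens. This is only an illustrative composite in the spirit of the methods listed below (e.g., PruMerge- or ToMe-style merging), not a faithful implementation of any specific paper; the function name, `keep_ratio`, and the cosine-similarity assignment are assumptions made for the example.

```python
# Minimal, illustrative sketch (not any specific paper's method): prune visual
# tokens by [CLS] attention, then merge dropped tokens into similar kept ones.
import torch
import torch.nn.functional as F


def prune_and_merge(tokens: torch.Tensor, cls_attn: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """tokens: [N, D] visual token features; cls_attn: [N] attention weights from the [CLS] token."""
    n, _ = tokens.shape
    k = max(1, int(n * keep_ratio))

    keep_idx = cls_attn.topk(k).indices                 # the k most-attended tokens survive
    drop_mask = torch.ones(n, dtype=torch.bool, device=tokens.device)
    drop_mask[keep_idx] = False

    kept, dropped = tokens[keep_idx], tokens[drop_mask]  # [k, D], [N-k, D]
    if dropped.shape[0] == 0:
        return kept

    # Assign each dropped token to its most similar kept token (cosine similarity),
    # then fold it into that token by averaging, so its information is not discarded outright.
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T  # [N-k, k]
    assign = sim.argmax(dim=-1)                                       # [N-k]

    merged = kept.clone()
    counts = torch.ones(k, device=tokens.device)
    merged.index_add_(0, assign, dropped)
    counts.index_add_(0, assign, torch.ones(assign.shape[0], device=tokens.device))
    return merged / counts.unsqueeze(-1)


# Example: a LLaVA-style 24x24 grid (576 visual tokens) reduced to 144 tokens.
feats = torch.randn(576, 1024)
attn = torch.rand(576)
print(prune_and_merge(feats, attn).shape)  # torch.Size([144, 1024])
```

The papers below differ mainly in how the importance scores are computed (e.g., [CLS] attention, text guidance, layer-wise redundancy), where the reduction happens (vision encoder, projector, or LLM layers), and whether dropped tokens are merged, recovered, or discarded.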

## 📖 Related Papers

### Baseline ###

* **Visual Instruction Tuning** arXiv
*Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee*

NIPS'2023 *(oral)* [[Paper](https://arxiv.org/abs/2304.08485)]
[[Code](https://github.com/haotian-liu/LLaVA)]

LLaVA Framework


* **Honeybee: Locality-enhanced Projector for Multimodal LLM** arXiv
*Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh*

CVPR'2024 [[Paper](https://arxiv.org/abs/2312.06742)]
[[Code](https://github.com/khanrc/honeybee)]

Honeybee Framework


* **BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models** arXiv
*Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi*

arXiv'2023 [[Paper](https://arxiv.org/abs/2301.12597)]
[[Code](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)]

BLIP-2 Framework


### 2024.3 ###

* **MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer** arXiv
*Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen*

CVPR'2024 [[Paper](https://arxiv.org/abs/2403.02991)]
[[Code](https://github.com/double125/MADTP)]

MADTP Framework


* **Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models** arXiv
*Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji*

arXiv'2024 [[Paper](https://arxiv.org/abs/2403.03003)]

LLaVA-HR Framework


* **TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document** arXiv
*Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai*

arXiv'2024 [[Paper](https://arxiv.org/abs/2403.04473)]
[[Code](https://github.com/Yuliang-Liu/Monkey)]

TextMonkey Framework


* **An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models** arXiv
*Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang*

ECCV'2024 *(oral)* [[Paper](https://arxiv.org/abs/2403.06764)]
[[Code](https://github.com/pkunlp-icler/FastV?tab=readme-ov-file)]

FastV Framework


* **Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring** arXiv
*Yufei Zhan, Yousong Zhu, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang*

arXiv'2024 [[Paper](https://arxiv.org/abs/2403.09333)]
[[Code](https://github.com/jefferyZhan/Griffon)]

Griffon V2 Framework


* **LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models** arXiv
*Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan*

arXiv'2024 [[Paper](https://arxiv.org/abs/2403.15388)]
[[Code](https://github.com/42Shawn/LLaVA-PruMerge)]

LLaVA-PruMerge Framework


### 2024.5 ###

* **DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models** arXiv
*Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, Lu Hou*

arXiv'2024 [[Paper](https://arxiv.org/abs/2405.20985)]
[[Code](https://github.com/yaolinli/DeCo)]

DeCo Framework


### 2024.6 ###

* **Efficient Large Multi-modal Models via Visual Context Compression** arXiv
*Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille*

NIPS'2024 [[Paper](https://arxiv.org/abs/2406.20092)]
[[Code](https://github.com/Beckschen/LLaVolta)]

LLaVolta Framework


* **VoCo-LLaMA: Towards Vision Compression with Large Language Models** arXiv
*Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang*

arXiv'2024 [[Paper](https://arxiv.org/abs/2406.12275)]
[[Code](https://github.com/Yxxxb/VoCo-LLaMA)]

VoCo-LLaMA Framework


### 2024.7 ###

* **TokenPacker: Efficient Visual Projector for Multimodal LLM** arXiv
*Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang*

arXiv'2024 [[Paper](https://arxiv.org/abs/2407.02392)]
[[Code](https://github.com/CircleRadon/TokenPacker)]

TokenPacker Framework


* **Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding** arXiv
*Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, Liqiang Nie*

arXiv'2024 [[Paper](https://arxiv.org/abs/2407.14439)]
[[Code](https://github.com/JiuTian-VL/TokenCorrCompressor)]

Token-level Framework


### 2024.8 ###

* **HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments** arXiv
*Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji*

arXiv'2024 [[Paper](https://arxiv.org/abs/2408.10945)]

HiRED Framework


* **MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model** arXiv
*Chaoya Jiang, Jia Hongrui, Haiyang Xu, Wei Ye, Mengfan Dong, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang*

arXiv'2024 [[Paper](https://arxiv.org/abs/2408.12321)]

MaVEn Framework


### 2024.9 ###

* **Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information** arXiv
*Yi Chen, Jian Xu, Xu-Yao Zhang, Wen-Zhuo Liu, Yang-Yang Liu, Cheng-Lin Liu*

arXiv'2024 [[Paper](https://arxiv.org/abs/2409.01179)]

Recoverable Compression Framework


* **TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings** arXiv
*Dawei Yan, Pengcheng Li, Yang Li, Hao Chen, Qingguo Chen, Weihua Luo, Wei Dong, Qingsen Yan, Haokui Zhang, Chunhua Shen*

arXiv'2024 [[Paper](https://arxiv.org/abs/2409.09564)]

TG-LLaVA Framework


* **Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs** arXiv
*Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, Benyou Wang*

arXiv'2024 [[Paper](https://arxiv.org/abs/2409.10994)]
[[Code](https://github.com/FreedomIntelligence/TRIM/)]

TRIM Framework


### 2024.10 ###

* **AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity** arXiv
*Zhibin Lan, Liqiang Niu, Fandong Meng, Wenbo Li, Jie Zhou, Jinsong Su*

arXiv'2024 [[Paper](https://arxiv.org/abs/2410.02745)]
[[Code](https://github.com/DeepLearnXMU/AVG-LLaVA)]

AVG-LLaVA Framework


* **Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See** arXiv
*Zhibin Lan, Liqiang Niu, Fandong Meng, Wenbo Li, Jie Zhou, Jinsong Su*

arXiv'2024 [[Paper](https://arxiv.org/abs/2410.06169)]

YOPO Framework


* **Retrieval Replace Reduction: An effective visual token reduction method via semantic match** arXiv
*Yingen Liu, Fan Wu, Ruihui Li, Zhuo Tang, Kenli Li*

arXiv'2024 [[Paper](https://arxiv.org/abs/2410.07278)]

TRSM Framework


* **Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers** arXiv
*Yuxin Wen, Qingqing Cao, Qichen Fu, Sachin Mehta, Mahyar Najibi*

arXiv'2024 [[Paper](https://arxiv.org/abs/2410.14072)]

Victor Framework


* **PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction** arXiv
*Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, Dahua Lin*

arXiv'2024 [[Paper](https://arxiv.org/abs/2410.17247)]
[[Code](https://github.com/Cooperx521/PyramidDrop)]

PyramidDrop Framework


### 2024.11 ###

* **Inference Optimal VLMs Need Only One Visual Token but Larger Models** arXiv
*Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter*

arXiv'2024 [[Paper](https://arxiv.org/abs/2411.03312)]
[[Code](https://github.com/locuslab/llava-token-compression)]

QuCC Framework


* **Don't Look Twice: Faster Video Transformers with Run-Length Tokenization** arXiv
*Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Niinuma, Kris M. Kitani, László Jeni*

NIPS'2024 *(Spotlight)* [[Paper](https://arxiv.org/abs/2411.05222)]
[[Code](https://github.com/rccchoudhury/rlt)]

RLT Framework


* **Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model** arXiv
*Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, Linfeng Zhang*

arXiv'2024 [[Paper](https://arxiv.org/abs/2411.10803)]
[[Code](https://github.com/liuting20/MustDrop)]

MustDrop Framework


* **FoPru: Focal Pruning for Efficient Large Vision-Language Models** arXiv
*Lei Jiang, Weizhe Huang, Tongxuan Liu, Yuting Zeng, Jing Li, Lechao Cheng, Xiaohua Xu*

arXiv'2024 [[Paper](https://arxiv.org/abs/2411.14164)]

FoPru Framework


* **FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression** arXiv
*Yuke Zhu, Chi Xie, Shuang Liang, Bo Zheng, Sheng Guo*

arXiv'2024 [[Paper](https://arxiv.org/abs/2411.14228)]

FocusLLaVA Framework


* **LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval** arXiv
*Weiheng Lu, Jian Li, An Yu, Ming-Ching Chang, Shengpeng Ji, Min Xia*

arXiv'2024 [[Paper](https://arxiv.org/abs/2411.14505)]

LLaVA-MR Framework


* **DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models** arXiv
*Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang*

arXiv'2024 [[Paper](https://arxiv.org/abs/2411.15024)]
[[Code](https://github.com/KD-TAO/DyCoke)]

DyCoke Framework


* **freePruner: A Training-free Approach for Large Multimodal Model Acceleration** arXiv
*Bingxin Xu, Yuzhang Shang, Yunhao Ge, Qian Lou, Yan Yan*

arXiv'2024 [[Paper](https://arxiv.org/abs/2411.15446)]

freePruner Framework


* **Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration** arXiv
*Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang*

arXiv'2024 [[Paper](https://arxiv.org/abs/2411.17686)]

FiCoCo Framework


### 2024.12 ###

* **ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models** arXiv
*Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, Yansong Tang*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.00447)]
[[Code](https://yxxxb.github.io/ATP-LLaVA-page/)]

ATP-LLaVA Framework


* **Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction** arXiv
*Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, Licheng Yu*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.00556)]

Framework


* **Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification** arXiv
*Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.00876)]
[[Code](https://github.com/Osilly/dynamic_llava)]

Dynamic-LLaVA Framework


* **Negative Token Merging: Image-based Adversarial Feature Guidance** arXiv
*Jaskirat Singh, Lindsey Li, Weijia Shi, Ranjay Krishna, Yejin Choi, Pang Wei Koh, Michael F. Cohen, Stephen Gould, Liang Zheng, Luke Zettlemoyer*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.01339)]
[[Code](https://github.com/1jsingh/negtome)]

NegToMe Framework


* **[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster** arXiv
*Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.01818)]
[[Code](https://github.com/Theia-4869/FasterVLM)]

FasterVLM Framework


* **AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning** arXiv
*Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.03248)]
[[Code](https://github.com/LaVi-Lab/AIM)]

AIM Framework


* **VisionZip: Longer is Better but Not Necessary in Vision Language Models** arXiv
*Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.04467)]
[[Code](https://github.com/dvlab-research/VisionZip)]

VisionZip Framework


* **[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs** arXiv
*Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.05819)]
[[Code](https://github.com/THU-MIG/VTC-CLS)]

VTC-CLS Framework


* **iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models** arXiv
*Lianyu Hu, Fanhua Shang, Liang Wan, Wei Feng*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.06263)]
[[Code](https://github.com/hulianyuyy/iLLaVA)]

iLLaVA Framework


* **Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models** arXiv
*Wei Suo, Ji Ma, Mengyang Sun, Lin Yuanbo Wu, Peng Wang, Yanning Zhang*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.06458)]

PAR Framework


* **DocVLM: Make Your VLM an Efficient Reader** arXiv
*Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, Ron Litman*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.08746)]

DocVLM Framework


* **LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information** arXiv
*Ke Wang, Hong Xuan*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.08771)]

LLaVA-Zip Framework


* **Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM** arXiv
*Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.09530)]

Dynamic-VLM Framework


* **PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models** arXiv
*Chenyu Yang, Xuan Dong, Xizhou Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, Jifeng Dai*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.09613)]
[[Code](https://github.com/OpenGVLab/PVC)]

PVC Framework


* **Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration** arXiv
*Mark Endo, Xiaohan Wang, Serena Yeung-Levy*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.13180)]
[[Code](https://web.stanford.edu/~markendo/projects/feather.html)]

FEATHER Framework

* **FastVLM: Efficient Vision Encoding for Vision Language Models** arXiv
*Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.13303)]

FastVLM Framework


* **PruneVid: Visual Token Pruning for Efficient Video Large Language Models** arXiv
*Xiaohu Huang, Hao Zhou, Kai Han*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.16117)]
[[Code](https://github.com/Visual-AI/PruneVid)]

PruneVid Framework


* **ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding** arXiv
*Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, Liqiang Nie*

arXiv'2024 [[Paper](https://arxiv.org/abs/2412.20504)]
[[Code](https://github.com/SCZwangxiao/video-ReTaKe)]

ReTaKe Framework


### 2025.1 ###

* **FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models** arXiv
*Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang*

arXiv'2025 [[Paper](https://arxiv.org/abs/2501.01986)]
[[Code](https://github.com/thu-nics/FrameFusion)]

FrameFusion Framework


* **What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph** arXiv
*Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, Yiyi Zhou*

arXiv'2025 [[Paper](https://arxiv.org/abs/2501.02268)]
[[Code](https://github.com/jytmelon/G-Prune)]

G-Prune Framework


* **LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token** arXiv
*Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng*

arXiv'2025 [[Paper](https://arxiv.org/abs/2501.03895)]
[[Code](https://github.com/ictnlp/LLaVA-Mini)]

LLaVA-Mini Framework


* **Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration** arXiv
*Xuyang Liu, Ziming Wang, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Bo Zheng, Linfeng Zhang, Siteng Huang, Honggang Chen*

arXiv'2025 [[Paper](https://arxiv.org/abs/2501.05179)]
[[Code](https://github.com/xuyang-liu16/GlobalCom2)]

GlobalCom2 Framework


(back to top)