awesome-token-merge-for-mllms
A paper list about Token Merge, Reduce, Resample, Drop for MLLMs.
https://github.com/jinxins/awesome-token-merge-for-mllms
- Host: GitHub
- URL: https://github.com/jinxins/awesome-token-merge-for-mllms
- Owner: JinXins
- License: mit
- Created: 2024-11-23T14:35:53.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-01-13T18:21:03.000Z (4 months ago)
- Last Synced: 2025-01-13T19:21:34.183Z (4 months ago)
- Topics: large-language-models, llama, llava, multimodal-large-language-models, nlp, token-drop, token-merging, vicuna, vision-transformer
- Homepage:
- Size: 103 KB
- Stars: 16
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
Awesome Lists containing this project
- ultimate-awesome - awesome-token-merge-for-mllms - A paper list about Token Merge, Reduce, Resample, Drop for MLLMs. (Programming Language Lists / Python Lists)
README
# 💫 Awesome-Token-Merge-for-MLLMs
[![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
## Welcome to Awesome-Token-Merge-for-MLLMs.
**If you know of related papers that are not yet included in this list, please let us know via `Issues`!**
If this repository has been helpful to you, please consider giving it a ⭐️ to show your support. Your support helps us reach more researchers and contributes to the growth of this resource. Thank you!
## 📜 Introduction
**We summarize token merge / reduce / resample / drop methods applied to the vision tokens of multi-modal large language models.**
The papers are listed in chronological order, and the list is **continuously updated**.
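To make the terminology concrete, the snippet below is a minimal, ToMe-style sketch of token merging: visual tokens are paired by cosine similarity and the most similar pairs are averaged, so fewer tokens reach the language model. It is only an illustration under our own assumptions (the function name, the fixed merge ratio, and the simple pairwise averaging are ours), not the method of any particular paper in this list.

```python
import torch

def merge_visual_tokens(tokens: torch.Tensor, merge_ratio: float = 0.5) -> torch.Tensor:
    """Toy token merging: average the most similar (even, odd) token pairs.

    tokens:      (num_tokens, dim) visual features from the vision encoder.
    merge_ratio: fraction of even-position tokens to merge away (assumed knob).
    """
    a, b = tokens[0::2], tokens[1::2]                  # split into two alternating sets
    a_n = torch.nn.functional.normalize(a, dim=-1)
    b_n = torch.nn.functional.normalize(b, dim=-1)
    sim = a_n @ b_n.T                                  # cosine similarity, shape (|A|, |B|)

    best_sim, best_match = sim.max(dim=-1)             # closest partner in B for each A token
    r = int(a.shape[0] * merge_ratio)                  # how many A tokens to merge away
    merge_idx = best_sim.topk(r).indices               # A tokens with the closest partners

    keep_mask = torch.ones(a.shape[0], dtype=torch.bool)
    keep_mask[merge_idx] = False                       # these A tokens disappear into B

    merged_b = b.clone()
    merged_b[best_match[merge_idx]] = 0.5 * (b[best_match[merge_idx]] + a[merge_idx])

    return torch.cat([a[keep_mask], merged_b], dim=0)  # shorter sequence for the LLM

# e.g. 576 CLIP patch tokens of dim 1024 -> 432 tokens at merge_ratio = 0.5
reduced = merge_visual_tokens(torch.randn(576, 1024), merge_ratio=0.5)
print(reduced.shape)  # torch.Size([432, 1024])
```

Token drop / prune methods instead discard low-importance tokens outright (e.g., ranked by attention score), while resample methods compress them through a small learned module such as a Q-Former or pooling projector.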
## 📖 Related Papers
### Baseline ###
* **Visual Instruction Tuning**
*Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee*
NeurIPS'2023 *(oral)* [[Paper](https://arxiv.org/abs/2304.08485)]
[[Code](https://github.com/haotian-liu/LLaVA)]
LLaVA Framework
* **Honeybee: Locality-enhanced Projector for Multimodal LLM**
*Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh*
CVPR'2024 [[Paper](https://arxiv.org/abs/2312.06742)]
[[Code](https://github.com/khanrc/honeybee)]
Honeybee Framework
* **BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models**
*Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi*
arXiv'2023 [[Paper](https://arxiv.org/abs/2301.12597)]
[[Code](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)]
BLIP-2 Framework
### 2024.3 ###
* **MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer**
*Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen*
CVPR'2024 [[Paper](https://arxiv.org/abs/2403.02991)]
[[Code](https://github.com/double125/MADTP)]
MADTP Framework
* **Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models**
*Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji*
arXiv'2024 [[Paper](https://arxiv.org/abs/2403.03003)]
LLaVA-HR Framework
* **TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document**
*Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai*
arXiv'2024 [[Paper](https://arxiv.org/abs/2403.04473)]
[[Code](https://github.com/Yuliang-Liu/Monkey)]
TextMonkey Framework
* **An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models**
*Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang*
ECCV'2024 *(oral)* [[Paper](https://arxiv.org/abs/2403.06764)]
[[Code](https://github.com/pkunlp-icler/FastV)]
FastV Framework
* **Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring**
*Yufei Zhan, Yousong Zhu, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang*
arXiv'2024 [[Paper](https://arxiv.org/abs/2403.09333)]
[[Code](https://github.com/jefferyZhan/Griffon)]
Griffon V2 Framework
* **LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models**
*Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan*
arXiv'2024 [[Paper](https://arxiv.org/abs/2403.15388)]
[[Code](https://github.com/42Shawn/LLaVA-PruMerge)]
LLaVA-PruMerge Framework
### 2024.5 ###
* **DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models**
*Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, Lu Hou*
arXiv'2024 [[Paper](https://arxiv.org/abs/2405.20985)]
[[Code](https://github.com/yaolinli/DeCo)]
DeCo Framework
### 2024.6 ###
* **Efficient Large Multi-modal Models via Visual Context Compression**
*Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille*
NeurIPS'2024 [[Paper](https://arxiv.org/abs/2406.20092)]
[[Code](https://github.com/Beckschen/LLaVolta)]
LLaVolta Framework
* **VoCo-LLaMA: Towards Vision Compression with Large Language Models**
*Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang*
arXiv'2024 [[Paper](https://arxiv.org/abs/2406.12275)]
[[Code](https://github.com/Yxxxb/VoCo-LLaMA)]
VoCo-LLaMA Framework
### 2024.7 ###
* **TokenPacker: Efficient Visual Projector for Multimodal LLM**
*Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang*
arXiv'2024 [[Paper](https://arxiv.org/abs/2407.02392)]
[[Code](https://github.com/CircleRadon/TokenPacker)]
TokenPacker Framework
* **Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding**
*Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, Liqiang Nie*
arXiv'2024 [[Paper](https://arxiv.org/abs/2407.14439)]
[[Code](https://github.com/JiuTian-VL/TokenCorrCompressor)]
Token-level Framework
### 2024.8 ###
* **HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments**
*Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji*
arXiv'2024 [[Paper](https://arxiv.org/abs/2408.10945)]
HiRED Framework
* **MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model**
*Chaoya Jiang, Jia Hongrui, Haiyang Xu, Wei Ye, Mengfan Dong, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang*
arXiv'2024 [[Paper](https://arxiv.org/abs/2408.12321)]
MaVEn Framework
### 2024.9 ###
* **Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information**
*Yi Chen, Jian Xu, Xu-Yao Zhang, Wen-Zhuo Liu, Yang-Yang Liu, Cheng-Lin Liu*
arXiv'2024 [[Paper](https://arxiv.org/abs/2409.01179)]
Recoverable Compression Framework
* **TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings**
*Dawei Yan, Pengcheng Li, Yang Li, Hao Chen, Qingguo Chen, Weihua Luo, Wei Dong, Qingsen Yan, Haokui Zhang, Chunhua Shen*
arXiv'2024 [[Paper](https://arxiv.org/abs/2409.09564)]
TG-LLaVA Framework
* **Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs**
*Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, Benyou Wang*
arXiv'2024 [[Paper](https://arxiv.org/abs/2409.10994)]
[[Code](https://github.com/FreedomIntelligence/TRIM/)]
TRIM Framework
### 2024.10 ###
* **AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity**
*Zhibin Lan, Liqiang Niu, Fandong Meng, Wenbo Li, Jie Zhou, Jinsong Su*
arXiv'2024 [[Paper](https://arxiv.org/abs/2410.02745)]
[[Code](https://github.com/DeepLearnXMU/AVG-LLaVA)]
AVG-LLaVA Framework
* **Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See**
*Zhibin Lan, Liqiang Niu, Fandong Meng, Wenbo Li, Jie Zhou, Jinsong Su*
arXiv'2024 [[Paper](https://arxiv.org/abs/2410.06169)]
YOPO Framework
* **Retrieval Replace Reduction: An effective visual token reduction method via semantic match**
*Yingen Liu, Fan Wu, Ruihui Li, Zhuo Tang, Kenli Li*
arXiv'2024 [[Paper](https://arxiv.org/abs/2410.07278)]
TRSM Framework
* **Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers**
*Yuxin Wen, Qingqing Cao, Qichen Fu, Sachin Mehta, Mahyar Najibi*
arXiv'2024 [[Paper](https://arxiv.org/abs/2410.14072)]
Victor Framework
* **PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction**
*Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, Dahua Lin*
arXiv'2024 [[Paper](https://arxiv.org/abs/2410.17247)]
[[Code](https://github.com/Cooperx521/PyramidDrop)]
PyramidDrop Framework
### 2024.11 ###
* **Inference Optimal VLMs Need Only One Visual Token but Larger Models**
*Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter*
arXiv'2024 [[Paper](https://arxiv.org/abs/2411.03312)]
[[Code](https://github.com/locuslab/llava-token-compression)]
QuCC Framework
* **Don't Look Twice: Faster Video Transformers with Run-Length Tokenization**
*Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Niinuma, Kris M. Kitani, László Jeni*
NeurIPS'2024 *(Spotlight)* [[Paper](https://arxiv.org/abs/2411.05222)]
[[Code](https://github.com/rccchoudhury/rlt)]
RLT Framework
* **Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model**
*Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, Linfeng Zhang*
arXiv'2024 [[Paper](https://arxiv.org/abs/2411.10803)]
[[Code](https://github.com/liuting20/MustDrop)]
MustDrop Framework
* **FoPru: Focal Pruning for Efficient Large Vision-Language Models**
*Lei Jiang, Weizhe Huang, Tongxuan Liu, Yuting Zeng, Jing Li, Lechao Cheng, Xiaohua Xu*
arXiv'2024 [[Paper](https://arxiv.org/abs/2411.14164)]
FoPru Framework
* **FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression**
*Yuke Zhu, Chi Xie, Shuang Liang, Bo Zheng, Sheng Guo*
arXiv'2024 [[Paper](https://arxiv.org/abs/2411.14228)]
FocusLLaVA Framework
* **LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval**
*Weiheng Lu, Jian Li, An Yu, Ming-Ching Chang, Shengpeng Ji, Min Xia*
arXiv'2024 [[Paper](https://arxiv.org/abs/2411.14505)]
LLaVA-MR Framework
* **DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models**
*Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang*
arXiv'2024 [[Paper](https://arxiv.org/abs/2411.15024)]
[[Code](https://github.com/KD-TAO/DyCoke)]
DyCoke Framework
* **freePruner: A Training-free Approach for Large Multimodal Model Acceleration**
*Bingxin Xu, Yuzhang Shang, Yunhao Ge, Qian Lou, Yan Yan*
arXiv'2024 [[Paper](https://arxiv.org/abs/2411.15446)]
freePruner Framework
* **Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration**
*Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang*
arXiv'2024 [[Paper](https://arxiv.org/abs/2411.17686)]
FiCoCo Framework
### 2024.12 ###
* **ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models**
*Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, Yansong Tang*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.00447)]
[[Code](https://yxxxb.github.io/ATP-LLaVA-page/)]
ATP-LLaVA Framework
* **Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction**
*Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, Licheng Yu*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.00556)]
Framework
* **Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification**
*Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.00876)]
[[Code](https://github.com/Osilly/dynamic_llava)]
Dynamic-LLaVA Framework
* **Negative Token Merging: Image-based Adversarial Feature Guidance**
*Jaskirat Singh, Lindsey Li, Weijia Shi, Ranjay Krishna, Yejin Choi, Pang Wei Koh, Michael F. Cohen, Stephen Gould, Liang Zheng, Luke Zettlemoyer*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.01339)]
[[Code](https://github.com/1jsingh/negtome)]
NegToMe Framework
* **[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster**
*Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.01818)]
[[Code](https://github.com/Theia-4869/FasterVLM)]
FasterVLM Framework
* **AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning**
*Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.03248)]
[[Code](https://github.com/LaVi-Lab/AIM)]
AIM Framework
* **VisionZip: Longer is Better but Not Necessary in Vision Language Models**
*Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.04467)]
[[Code](https://github.com/dvlab-research/VisionZip)]
VisionZip Framework
* **[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs**
*Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.05819)]
[[Code](https://github.com/THU-MIG/VTC-CLS)]
VTC-CLS Framework
* **iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models**
*Lianyu Hu, Fanhua Shang, Liang Wan, Wei Feng*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.06263)]
[[Code](https://github.com/hulianyuyy/iLLaVA)]
iLLaVA Framework
* **Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models**
*Wei Suo, Ji Ma, Mengyang Sun, Lin Yuanbo Wu, Peng Wang, Yanning Zhang*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.06458)]
PAR Framework
* **DocVLM: Make Your VLM an Efficient Reader**
*Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, Ron Litman*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.08746)]
DocVLM Framework
* **LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information**
*Ke Wang, Hong Xuan*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.08771)]
LLaVA-Zip Framework
* **Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM**
*Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.09530)]
Dynamic-VLM Framework
* **PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models**
*Chenyu Yang, Xuan Dong, Xizhou Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, Jifeng Dai*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.09613)]
[[Code](https://github.com/OpenGVLab/PVC)]
PVC Framework
* **Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration**
*Mark Endo, Xiaohan Wang, Serena Yeung-Levy*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.13180)]
[[Code](https://web.stanford.edu/~markendo/projects/feather.html)]
FEATHER Framework
* **FastVLM: Efficient Vision Encoding for Vision Language Models**
*Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.13303)]
FastVLM Framework
* **PruneVid: Visual Token Pruning for Efficient Video Large Language Models**
*Xiaohu Huang, Hao Zhou, Kai Han*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.16117)]
[[Code](https://github.com/Visual-AI/PruneVid)]
PruneVid Framework
* **ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding**
*Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, Liqiang Nie*
arXiv'2024 [[Paper](https://arxiv.org/abs/2412.20504)]
[[Code](https://github.com/SCZwangxiao/video-ReTaKe)]
ReTaKe Framework
### 2025.1 ###
* **FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models**
*Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang*
arXiv'2025 [[Paper](https://arxiv.org/abs/2501.01986)]
[[Code](https://github.com/thu-nics/FrameFusion)]
FrameFusion Framework
* **What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph**
*Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, Yiyi Zhou*
arXiv'2025 [[Paper](https://arxiv.org/abs/2501.02268)]
[[Code](https://github.com/jytmelon/G-Prune)]
G-Prune Framework
* **LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token**
*Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng*
arXiv'2025 [[Paper](https://arxiv.org/abs/2501.03895)]
[[Code](https://github.com/ictnlp/LLaVA-Mini)]
LLaVA-Mini Framework
* **Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration**
*Xuyang Liu, Ziming Wang, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Bo Zheng, Linfeng Zhang, Siteng Huang, Honggang Chen*
arXiv'2025 [[Paper](https://arxiv.org/abs/2501.05179)]
[[Code](https://github.com/xuyang-liu16/GlobalCom2)]
GlobalCom2 Framework