awesome-multimodal-token-compression

A curated list of token compression methods for multimodal large language models, covering image, video, and audio inputs.

Survey: [When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios](https://arxiv.org/pdf/2507.20198)
Repository: https://github.com/cokeshao/awesome-multimodal-token-compression
-
Contents
- Recent Papers (Last 6 Months)
- Published in Recent Conference/Journal
- Contributing
- Audio LLM
- Image LLM
- Video LLM
- Acknowledgments
-
Recent Papers (Last 6 Months)
- [DyMU](https://arxiv.org/abs/2504.17040)<br>[Paper](https://arxiv.org/abs/2504.17040) | [GitHub](https://github.com/MikeWangWZHL/dymu)
- [Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction](https://arxiv.org/abs/2502.17239)<br>Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, Jianhua Xu, Haoze Sun, Zenan Zhou, Weipeng Chen<br>[Paper](https://arxiv.org/abs/2502.17239) | [GitHub](https://github.com/baichuan-inc/Baichuan-Audio) | [Model](https://huggingface.co/baichuan-inc/Baichuan-Audio-Instruct)
- [arXiv:2507.02279](https://arxiv.org/abs/2507.02279)<br>Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng<br>[Paper](https://arxiv.org/abs/2507.02279)
- [arXiv:2506.20066](https://arxiv.org/abs/2506.20066)<br>[Paper](https://arxiv.org/abs/2506.20066)
- [DivPrune](https://arxiv.org/abs/2503.02175)<br>Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang<br>[Paper](https://arxiv.org/abs/2503.02175) | [GitHub](https://github.com/vbdi/divprune)
- ]() | []()<br> []() | [Paper](https://arxiv.org/abs/2506.20066)<br> |
- [MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference](https://arxiv.org/abs/2502.17599)<br>Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang<br>[Paper](https://arxiv.org/abs/2502.17599) | [GitHub](https://github.com/AIoT-MLSys-Lab/MEDA)
- [Token Pruning in Audio-Transformers: Optimizing Performance and Decoding Patch Importance](https://arxiv.org/abs/2504.01690)<br>Taehan Lee, Hyukjun Lee<br>[Paper](https://arxiv.org/abs/2504.01690) | [GitHub](https://github.com/andylee-24/token-pruning-audio-transformer) | [Model](https://drive.google.com/drive/folders/1cBDXh98m2qDlYLLX3q6xB-gtU1uUtxhK)
- [MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens](https://arxiv.org/abs/2503.11315)<br>Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro<br>[Paper](https://arxiv.org/abs/2503.11315) | [GitHub](https://github.com/JeongHun0716/MMS-LLaMA)
- [arXiv:2508.11886](https://arxiv.org/abs/2508.11886)<br>Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Shao Tang, Sayan Ghosh, Xuanzhao Dong, Rajat Koner, Yalin Wang<br>[Paper](https://arxiv.org/abs/2508.11886)
- [FlexSelect](https://arxiv.org/abs/2506.00993)<br>[Paper](https://arxiv.org/abs/2506.00993) | [GitHub](https://github.com/yunzhuzhang0918/flexselect)
- [Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality](https://arxiv.org/abs/2505.18227)<br>Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik<br>[Paper](https://arxiv.org/abs/2505.18227) | [GitHub](https://github.com/ZLKong/Awesome-Collection-Token-Reduction)
- [PACT](https://arxiv.org/abs/2504.08966)<br>Mohamed Dhouib, Davide Buscaldi, Sonia Vanier, Aymen Shabou<br>[Paper](https://arxiv.org/abs/2504.08966) | [GitHub](https://github.com/orailix/PACT)
- [When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios](https://arxiv.org/abs/2507.20198)<br>Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang<br>[Paper](https://arxiv.org/abs/2507.20198) | [GitHub](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression)
- [arXiv:2503.06362](https://arxiv.org/abs/2503.06362)<br>Umberto Cappellazzo, Minsu Kim, Stavros Petridis<br>[Paper](https://arxiv.org/abs/2503.06362)
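
The entries above span token pruning, token merging, and KV-cache techniques. As a rough illustration of the pruning family, the sketch below keeps only the visual tokens that receive the most text-to-vision attention. It is a generic toy example, not the algorithm of any paper listed here; the tensor shapes, scoring rule, and keep ratio are illustrative assumptions.

```python
# Minimal sketch of attention-score-based visual token pruning (generic illustration).
import torch

def prune_visual_tokens(vision_tokens: torch.Tensor,
                        text_tokens: torch.Tensor,
                        keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep the visual tokens that receive the most attention from text tokens.

    vision_tokens: (B, Nv, D) visual token embeddings
    text_tokens:   (B, Nt, D) text (query) token embeddings
    """
    # Text-to-vision attention, averaged over the text queries.
    scale = vision_tokens.shape[-1] ** -0.5
    attn = torch.softmax(text_tokens @ vision_tokens.transpose(1, 2) * scale, dim=-1)  # (B, Nt, Nv)
    importance = attn.mean(dim=1)                                                      # (B, Nv)

    # Keep the top-k most attended visual tokens per sample, preserving original order.
    k = max(1, int(vision_tokens.shape[1] * keep_ratio))
    keep_idx = importance.topk(k, dim=-1).indices.sort(dim=-1).values
    batch_idx = torch.arange(vision_tokens.shape[0]).unsqueeze(-1)
    return vision_tokens[batch_idx, keep_idx]                                          # (B, k, D)

if __name__ == "__main__":
    v = torch.randn(2, 576, 1024)   # e.g. a 24x24 ViT patch grid
    t = torch.randn(2, 32, 1024)
    print(prune_visual_tokens(v, t).shape)  # torch.Size([2, 144, 1024])
```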
-
Published in Recent Conference/Journal
- [LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models](https://arxiv.org/abs/2403.15388)<br>Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan<br>[Paper](https://arxiv.org/abs/2403.15388) | [GitHub](https://github.com/42Shawn/LLaVA-PruMerge)
- [MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference](https://arxiv.org/abs/2502.17599)<br>Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang<br>[Paper](https://arxiv.org/abs/2502.17599) | [GitHub](https://github.com/AIoT-MLSys-Lab/MEDA)
- [PruneVid: Visual Token Pruning for Efficient Video Large Language Models](https://arxiv.org/abs/2412.16117)<br>Xiaohu Huang, Hao Zhou, Kai Han<br>[Paper](https://arxiv.org/abs/2412.16117) | [GitHub](https://github.com/Visual-AI/PruneVid)
- [MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens](https://arxiv.org/abs/2503.11315)<br>Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro<br>[Paper](https://arxiv.org/abs/2503.11315) | [GitHub](https://github.com/JeongHun0716/MMS-LLaMA)
- [Prompt Compression for Large Language Models: A Survey](https://arxiv.org/abs/2410.12388)<br>Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier<br>[Paper](https://arxiv.org/abs/2410.12388) | [GitHub](https://github.com/ZongqianLi/Prompt-Compression-Survey)
- [SparseVLM](https://arxiv.org/abs/2410.04417)<br>Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang<br>[Paper](https://arxiv.org/abs/2410.04417) | [GitHub](https://github.com/Gumpest/SparseVLMs)
- [LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding](https://arxiv.org/abs/2410.17434)<br>Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra<br>[Paper](https://arxiv.org/abs/2410.17434) | [GitHub](https://github.com/Vision-CAIR/LongVU) | [Model](https://huggingface.co/collections/Vision-CAIR/longvu-67181d2debabfc1eb050c21d)
- [TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos](https://arxiv.org/abs/2504.17343)<br>Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, Xu Sun<br>[Paper](https://arxiv.org/abs/2504.17343) | [GitHub](https://github.com/yaolinli/TimeChat-Online) | [Model](https://huggingface.co/wyccccc/TimeChatOnline-7B) | [Dataset](https://huggingface.co/datasets/yaolily/TimeChat-Online-139K)
- [arXiv:2410.08584](https://arxiv.org/abs/2410.08584)<br>Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang<br>[Paper](https://arxiv.org/abs/2410.08584)
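
Several of the works above merge redundant tokens rather than dropping them outright. The sketch below shows the generic idea: assign every token to its most similar retained "anchor" token and average each resulting group. The anchor-selection rule and the sizes are illustrative assumptions, not a reproduction of any listed method.

```python
# Minimal sketch of similarity-based token merging (generic illustration).
import torch
import torch.nn.functional as F

def merge_tokens(tokens: torch.Tensor, num_kept: int) -> torch.Tensor:
    """tokens: (N, D) -> (num_kept, D), averaging the members of each merged group."""
    n, d = tokens.shape
    kept_idx = torch.linspace(0, n - 1, num_kept).round().long()      # evenly spaced anchors (assumption)
    kept = tokens[kept_idx]                                            # (K, D)

    # Assign every token to its most cosine-similar anchor.
    sim = F.normalize(tokens, dim=-1) @ F.normalize(kept, dim=-1).T    # (N, K)
    assign = sim.argmax(dim=-1)                                        # (N,)

    # Scatter-mean: sum the members of each group, then divide by the group size.
    merged = torch.zeros(num_kept, d).index_add_(0, assign, tokens)
    counts = torch.zeros(num_kept).index_add_(0, assign, torch.ones(n))
    return merged / counts.clamp(min=1).unsqueeze(-1)

if __name__ == "__main__":
    x = torch.randn(576, 1024)
    print(merge_tokens(x, 64).shape)  # torch.Size([64, 1024])
```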
-
Contributing
- Contributions are welcome via pull requests to the GitHub repository above.
-
Audio LLM
-
2024
-  [VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs](https://arxiv.org/abs/2406.07476)
-  [Qwen2-Audio Technical Report](https://arxiv.org/abs/2407.10759)
-
-
-
Image LLM
-
2025
-  [Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models](https://arxiv.org/abs/2501.05179)
-
2024
-  [Matryoshka Multimodal Models](https://arxiv.org/abs/2405.17430)
-  [An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models](https://arxiv.org/abs/2403.06764)
-  [NVLM: Open Frontier-Class Multimodal LLMs](https://arxiv.org/abs/2409.11402)
-
2023
-  [MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices](https://arxiv.org/abs/2312.16886)
-  [Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond](https://arxiv.org/abs/2308.12966)
-  [mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality](https://arxiv.org/abs/2304.14178)
-  [MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models](https://arxiv.org/abs/2304.10592)
-  [Honeybee: Locality-enhanced Projector for Multimodal LLM](https://arxiv.org/abs/2312.06742)
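
Many of the image LLMs above compress patch tokens inside their vision-language projector, for example with a small set of learned queries that cross-attend to all patches (Qwen-VL's resampler and mPLUG-Owl's visual abstractor follow this general pattern). The sketch below is a minimal, hedged illustration of such a learned-query resampler; the dimensions, query count, and layer layout are assumptions rather than any listed model's actual configuration.

```python
# Minimal sketch of a learned-query cross-attention resampler (generic illustration).
import torch
import torch.nn as nn

class Resampler(nn.Module):
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)  # learned query tokens
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln_q = nn.LayerNorm(dim)
        self.ln_kv = nn.LayerNorm(dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        """patch_tokens: (B, N, D) -> (B, num_queries, D)."""
        b = patch_tokens.shape[0]
        q = self.ln_q(self.queries).unsqueeze(0).expand(b, -1, -1)
        kv = self.ln_kv(patch_tokens)
        out, _ = self.attn(q, kv, kv)          # each query attends to all patch tokens
        return out

if __name__ == "__main__":
    resampler = Resampler()
    patches = torch.randn(2, 1025, 1024)       # e.g. ViT patch tokens incl. CLS
    print(resampler(patches).shape)            # torch.Size([2, 64, 1024])
```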
-
-
Video LLM
-
2024
-  [Video Instruction Tuning with Synthetic Data](https://arxiv.org/abs/2410.02713)
-  [PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning](https://arxiv.org/abs/2404.16994)
-  [LLaVA-OneVision: Easy Visual Task Transfer](https://arxiv.org/abs/2408.03326)
-
2023
-  [Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding](https://arxiv.org/abs/2311.08046)
-  [LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models](https://arxiv.org/abs/2311.17043)
-  [Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models](https://arxiv.org/abs/2306.05424v2)
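
Video LLMs like those above typically tame token counts by downsampling each frame's patch grid before concatenating the frames along time (PLLaVA, for instance, applies pooling over frame features). The sketch below is a minimal, generic illustration of per-frame spatial pooling; the grid size and pooling factor are assumptions, not any model's exact setting.

```python
# Minimal sketch of per-frame spatial pooling for video tokens (generic illustration).
import torch
import torch.nn.functional as F

def pool_video_tokens(frame_tokens: torch.Tensor, out_hw: int = 6) -> torch.Tensor:
    """frame_tokens: (T, H*W, D) patch tokens per frame -> (T * out_hw * out_hw, D)."""
    t, n, d = frame_tokens.shape
    hw = int(n ** 0.5)                                         # assume a square patch grid
    x = frame_tokens.view(t, hw, hw, d).permute(0, 3, 1, 2)    # (T, D, H, W)
    x = F.adaptive_avg_pool2d(x, out_hw)                       # (T, D, out_hw, out_hw)
    return x.flatten(2).permute(0, 2, 1).reshape(-1, d)        # (T * out_hw * out_hw, D)

if __name__ == "__main__":
    frames = torch.randn(32, 576, 1024)                        # 32 frames of 24x24 patches
    print(pool_video_tokens(frames).shape)                     # torch.Size([1152, 1024])
```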
-
-
Acknowledgments
-
- This list draws on related collections: Awesome-Efficient-Reasoning-Models, [Awesome-Efficient-LLM](https://github.com/horseee/Awesome-Efficient-LLM/), and [Awesome-Context-Engineering](https://github.com/Meirtz/Awesome-Context-Engineering).
-