
awesome-multimodal-token-compression

Survey: https://arxiv.org/pdf/2507.20198
GitHub: https://github.com/cokeshao/awesome-multimodal-token-compression


  • πŸ“š Contents

    • Recent Papers (Last 6 Months)

      • [DyMU](https://arxiv.org/abs/2504.17040) | Area: Video-LLM | Type: Similarity-Based | Cost: Training-Free | [GitHub](https://github.com/MikeWangWZHL/dymu)
      • [Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction](https://arxiv.org/abs/2502.17239) | Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, Jianhua Xu, Haoze Sun, Zenan Zhou, Weipeng Chen | Area: Audio-LLM | Type: Transformation-Based | Cost: Training-Based | [GitHub](https://github.com/baichuan-inc/Baichuan-Audio) [Model](https://huggingface.co/baichuan-inc/Baichuan-Audio-Instruct)
      • […wise Compression of Visual Tokens for Multimodal Large Language Models](https://arxiv.org/abs/2507.02279) | Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng | Area: Image-LLM, Video-LLM | Type: Transformation-Based | Cost: Training-Based
      • [arXiv:2506.20066](https://arxiv.org/abs/2506.20066) | Wei Huang, Wenhao Chai, Kuang-Ming Chen, Cheng-Yen Yang, Jenq-Neng Hwang | Area: Image-LLM | Type: Similarity-Based | Cost: Training-Free
      • [DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models](https://arxiv.org/abs/2503.02175) | Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang | Area: Image-LLM, Video-LLM | Type: Similarity-Based | Cost: Training-Free | [GitHub](https://github.com/vbdi/divprune)
      • [MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference](https://arxiv.org/abs/2502.17599) | Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang | Area: Image-LLM, Video-LLM | Type: Attention-Based, Similarity-Based | Cost: Training-Free | [GitHub](https://github.com/AIoT-MLSys-Lab/MEDA)
      • [Token Pruning in Audio-Transformers: Optimizing Performance and Decoding Patch Importance](https://arxiv.org/abs/2504.01690) | Taehan Lee, Hyukjun Lee | Area: Audio-Transformer | Type: Attention-Based | Cost: Training-Based | [GitHub](https://github.com/andylee-24/token-pruning-audio-transformer) [Model](https://drive.google.com/drive/folders/1cBDXh98m2qDlYLLX3q6xB-gtU1uUtxhK)
      • [MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens](https://arxiv.org/abs/2503.11315) | Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro | Area: Audio-LLM | Type: Query-Based | Cost: Training-Based | [GitHub](https://github.com/JeongHun0716/MMS-LLaMA)
      • […IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models](https://arxiv.org/abs/2508.11886) | Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Shao Tang, Sayan Ghosh, Xuanzhao Dong, Rajat Koner, Yalin Wang | Area: Image-LLM, Video-LLM | Type: Similarity-Based | Cost: Training-Free
      • [FlexSelect](https://arxiv.org/abs/2506.00993) | Type: Attention-Based, Query-Based | Cost: Training-Free | [GitHub](https://github.com/yunzhuzhang0918/flexselect)
      • [Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality](https://arxiv.org/abs/2505.18227) | Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik | Area: Image-LLM, Video-LLM, Audio-LLM, Position-Paper | [GitHub](https://github.com/ZLKong/Awesome-Collection-Token-Reduction)
      • [PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models](https://arxiv.org/abs/2504.08966) | Mohamed Dhouib, Davide Buscaldi, Sonia Vanier, Aymen Shabou | Area: Image-LLM, Video-LLM | Type: Attention-Based, Similarity-Based | Cost: Training-Free | [GitHub](https://github.com/orailix/PACT)
      • [When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios](https://arxiv.org/abs/2507.20198) | Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang | Area: Image-LLM, Video-LLM, Audio-LLM, Survey | [GitHub](https://github.com/cokeshao/Awesome-Multimodal-Token-Compression)
      • […Visual Speech Recognition via Matryoshka-Based Multimodal LLMs](https://arxiv.org/abs/2503.06362) | Umberto Cappellazzo, Minsu Kim, Stavros Petridis | Area: Audio-LLM | Type: Transformation-Based | Cost: Training-Based
    • Published in Recent Conference/Journal

      • [LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models](https://arxiv.org/abs/2403.15388) | Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan | Area: Image-LLM, Video-LLM | Type: Attention-Based, Transformation-Based | Cost: Training-Free | [GitHub](https://github.com/42Shawn/LLaVA-PruMerge)
      • [MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference](https://arxiv.org/abs/2502.17599) | Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang | Area: Image-LLM, Video-LLM | Type: Attention-Based, Similarity-Based | Cost: Training-Free | [GitHub](https://github.com/AIoT-MLSys-Lab/MEDA)
      • [PruneVid: Visual Token Pruning for Efficient Video Large Language Models](https://arxiv.org/abs/2412.16117) | Xiaohu Huang, Hao Zhou, Kai Han | Area: Video-LLM | Type: Attention-Based, Similarity-Based | Cost: Training-Free | [GitHub](https://github.com/Visual-AI/PruneVid)
      • [MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens](https://arxiv.org/abs/2503.11315) | Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro | Area: Audio-LLM | Type: Query-Based | Cost: Training-Based | [GitHub](https://github.com/JeongHun0716/MMS-LLaMA)
      • [Prompt Compression for Large Language Models: A Survey](https://arxiv.org/abs/2410.12388) | Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier | Area: LLM, Survey | [GitHub](https://github.com/ZongqianLi/Prompt-Compression-Survey)
      • [SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference](https://arxiv.org/abs/2410.04417) | Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang | Area: Image-LLM, Video-LLM | Type: Attention-Based, Query-Based | Cost: Training-Free | [GitHub](https://github.com/Gumpest/SparseVLMs)
      • [LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding](https://arxiv.org/abs/2410.17434) | Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra | Area: Video-LLM | Type: Query-Based, Similarity-Based | Cost: Training-Based | [GitHub](https://github.com/Vision-CAIR/LongVU) [Model](https://huggingface.co/collections/Vision-CAIR/longvu-67181d2debabfc1eb050c21d)
      • [TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos](https://arxiv.org/abs/2504.17343) | Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, Xu Sun | Area: Video-LLM | Type: Similarity-Based | Cost: Training-Based | [GitHub](https://github.com/yaolinli/TimeChat-Online) [Model](https://huggingface.co/wyccccc/TimeChatOnline-7B) [Dataset](https://huggingface.co/datasets/yaolily/TimeChat-Online-139K)
      • […Language Models with Dynamic Token Sparsification](https://arxiv.org/abs/2410.08584) | Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang | Area: Image-LLM, Video-LLM | Type: Attention-Based | Cost: Training-Free
    • Badge Colors
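      • Badge colors in the source list follow a consistent shields.io scheme: purple badges mark the modality Area (for example Image-LLM, Video-LLM, Audio-LLM, Survey, Position-Paper), green badges the compression Type (Attention-, Similarity-, Query-, or Transformation-Based), and yellow badges the training Cost (Training-Free or Training-Based). A minimal sketch of one entry in that badge markup is shown below; the paper title and arXiv ID are placeholders, not a real entry.

        ```markdown
        [Placeholder Paper Title](https://arxiv.org/abs/0000.00000)<br>
        [![Area](https://img.shields.io/badge/Video--LLM-purple)]()
        [![Type](https://img.shields.io/badge/Similarity--Based-green)]()
        [![Cost](https://img.shields.io/badge/Training--Free-yellow)]()
        ```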

  • 🀝 Contributing

  • Audio LLM

    • 2025

    • 2024

      • […scale Generalizable Audio Language Model](https://arxiv.org/abs/2405.08295)
      • [LLaMA-Omni: Seamless Speech Interaction with Large Language Models](https://arxiv.org/abs/2409.06666)
      • [VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs](https://arxiv.org/abs/2406.07476) | [GitHub](https://github.com/DAMO-NLP-SG/VideoLLaMA2)
      • […aware Token Pruning for Speech Information Retrieval](https://arxiv.org/abs/2412.12009)
      • [video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models](https://arxiv.org/abs/2406.15704)
      • [Qwen2-Audio Technical Report](https://arxiv.org/abs/2407.10759) | [GitHub](https://github.com/QwenLM/Qwen2-Audio)
      • […based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model](https://arxiv.org/abs/2406.03706)
    • 2023

      • [Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding](https://arxiv.org/abs/2306.02858)
  • Image LLM

    • 2025

      • [Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models](https://arxiv.org/abs/2501.05179) | [GitHub](https://github.com/xuyang-liu16/GlobalCom2)
    • 2024

      • […modal LLMs](https://arxiv.org/abs/2409.10994)
      • [Matryoshka Multimodal Models](https://arxiv.org/abs/2405.17430) | [GitHub](https://github.com/mu-cai/matryoshka-mm)
      • [An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models](https://arxiv.org/abs/2403.06764) | [GitHub](https://github.com/pkunlp-icler/FastV)
      • […wise Compression for Efficient MLLMs](https://arxiv.org/abs/2402.11187)
      • […Free Token Reduction for MLLM Acceleration](https://arxiv.org/abs/2411.17686v3)
      • […Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model](https://arxiv.org/abs/2411.10803)
      • [NVLM: Open Frontier-Class Multimodal LLMs](https://arxiv.org/abs/2409.11402) | [GitHub](https://github.com/NVIDIA/Megatron-LM)
      • […free Visual Token Pruning for Multi-modal Large Language Models](https://arxiv.org/abs/2409.10197)
      • […Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models](https://arxiv.org/abs/2408.10945)
      • [How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites](https://arxiv.org/abs/2404.16821)
    • 2023

      • [MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices](https://arxiv.org/abs/2312.16886) | [GitHub](https://github.com/Meituan-AutoML/MobileVLM)
      • [Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond](https://arxiv.org/abs/2308.12966) | [GitHub](https://github.com/QwenLM/Qwen-VL)
      • [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500)
      • [Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding](https://arxiv.org/abs/2306.02858)
      • [mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality](https://arxiv.org/abs/2304.14178) | [GitHub](https://github.com/X-PLUG/mPLUG-Owl)
      • [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597)
      • [MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models](https://arxiv.org/abs/2304.10592) | [GitHub](https://github.com/Vision-CAIR/MiniGPT-4)
      • [Honeybee: Locality-enhanced Projector for Multimodal LLM](https://arxiv.org/abs/2312.06742) | [GitHub](https://github.com/khanrc/honeybee?tab=readme-ov-file)
    • 2022

      • [Flamingo: a Visual Language Model for Few-Shot Learning](https://arxiv.org/abs/2204.14198)
  • Video LLM

    • 2025

    • 2024

      • [Video Instruction Tuning with Synthetic Data](http://arxiv.org/abs/2410.02713) | [GitHub](https://github.com/LLaVA-VL/LLaVA-NeXT)
      • [PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning](https://arxiv.org/abs/2404.16994) | [GitHub](https://github.com/magic-research/PLLaVA)
      • [LLaVA-OneVision: Easy Visual Task Transfer](https://arxiv.org/abs/2408.03326) | [GitHub](https://github.com/LLaVA-VL/LLaVA-NeXT)
    • 2023

      • [Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding](https://arxiv.org/abs/2311.08046) | [GitHub](https://github.com/PKU-YuanGroup/Chat-UniVi)
      • [LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models](https://arxiv.org/abs/2311.17043) | [GitHub](https://github.com/dvlab-research/LLaMA-VID)
      • [Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models](https://arxiv.org/abs/2306.05424v2) | [GitHub](https://github.com/mbzuai-oryx/Video-ChatGPT)
  • πŸ™ Acknowledgments


    • Related collections: Awesome-Efficient-Reasoning-Models, [Awesome-Efficient-LLM](https://github.com/horseee/Awesome-Efficient-LLM/), [Awesome-Context-Engineering](https://github.com/Meirtz/Awesome-Context-Engineering)