<p align="center" width="100%">
<img src="https://raw.githubusercontent.com/dvlab-research/VisionZip/main/imgs/visionzip.png" alt="VisionZip" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

# VisionZip: Longer is Better but Not Necessary in Vision Language Models

[![Paper](https://img.shields.io/badge/Paper-Arxiv%20Link-light)](https://arxiv.org/abs/2412.04467)
[![HF](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Discussion-orange)](https://huggingface.co/papers/2412.04467)
[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-yellow.svg)](https://github.com/dvlab-research/VisionZip/blob/main/LICENSE)
[![Demo](https://img.shields.io/badge/Demo-Chat-red.svg)](http://202.104.135.156:7860/)
[![Demo](https://img.shields.io/badge/Demo-Visualize%20-green)](http://202.104.135.156:11030/)
<a href='https://huggingface.co/spaces/Senqiao/VisionZip'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo-blue'></a>

## Table of Contents
1. [News](#news)
2. [Highlights](#highlights)
3. [Video](#video)
4. [Demo](#demo)
5. [Installation](#installation)
6. [Quick Start](#quick-start)
7. [Evaluation](#evaluation)
8. [Examples](#examples)
9. [Citation](#citation)
10. [Acknowledgement](#acknowledgement)
11. [License](#license)

## News
- [x] [2025.05.26] VisionZip for Qwen2.5VL is now released! See details [here](https://github.com/dvlab-research/VisionZip/tree/main/Qwen2_5_VL).
- [x] [2025.02.27] VisionZip has been accepted by **CVPR 2025**. :rocket:
- [x] [2024.12.28] With support from Hugging Face, we add our demo on the [Hugging Face Space](https://huggingface.co/spaces/Senqiao/VisionZip), allowing for easy comparison of output results across different model sizes.
- [x] [2024.12.16] Due to positive feedback on our demo, we have released the VisionZip [Demo-Chat](http://202.104.135.156:7860/) code in a new branch, `demo-chat`.
- [x] [2024.12.05] We add a [Usage-Video](https://youtu.be/9GNIJy4U6-k?si=jcWIJ2O0IjB4aamm), providing a step-by-step guide on how to use the demo.
- [x] [2024.12.05] We add a new [Demo-Chat](http://202.104.135.156:7860/), where users can manually select visual tokens to send to the LLM and observe how different visual tokens affect the final response. We believe this will further enhance the analysis of VLM interpretability.
- [x] [2024.11.30] We release the [Paper](https://arxiv.org/abs/2412.04467) and this GitHub repo, including code for LLaVA.

**VisionZip: Longer is Better but Not Necessary in Vision Language Models [[Paper](https://arxiv.org/abs/2412.04467)]** <br />
[Senqiao Yang](https://scholar.google.com/citations?user=NcJc-RwAAAAJ),
[Yukang Chen](https://scholar.google.com/citations?user=6p0ygKUAAAAJ),
[Zhuotao Tian](https://scholar.google.com/citations?user=mEjhz-IAAAAJ),
[Chengyao Wang](https://scholar.google.com.hk/citations?user=1pZcoqgAAAAJ),
[Jingyao Li](https://scholar.google.com/citations?user=mqrKmvcAAAAJ),
[Bei Yu](https://scholar.google.com/citations?user=tGneTm4AAAAJ),
[Jiaya Jia](https://scholar.google.com/citations?user=XPAkzTEAAAAJ)<br />

## Highlights
<p align="center" width="80%">
<img src="https://raw.githubusercontent.com/dvlab-research/VisionZip/main/imgs/Teaser.png" alt="VisionZip teaser" style="width: 80%; min-width: 300px; display: block; margin: auto;">
</p>

1. VisionZip achieves state-of-the-art performance among efficient VLM methods. While retaining only **10%** of the visual tokens, it preserves nearly **95%** of the original performance in training-free mode.
2. VisionZip can be applied at the inference stage (without incurring any additional training cost), at the efficient-tuning stage (for better results), and at the training stage (**almost no performance degradation, while saving 2× memory and 2× training time**).
3. VisionZip significantly reduces both the prefilling time and the total inference time (with KV cache enabled).
4. Why does this simple, text-agnostic method outperform text-relevant methods? We conduct an in-depth analysis in the [paper](https://arxiv.org/abs/2412.04467) and provide a [demo](http://202.104.135.156:7860/) to visualize these findings.
5. Since VisionZip is a text-agnostic method that reduces visual tokens before they are fed into the LLM, it can be combined with **any** existing LLM acceleration algorithm and applies to any task a vanilla VLM can perform, such as multi-turn conversations.

## Video
<p align="center" width="80%">
  <a href="https://youtu.be/sytaAzmxxpo?si=IieArmQ7YNf2dVyM" target="_blank">
    <img src="https://raw.githubusercontent.com/dvlab-research/VisionZip/main/imgs/VisionZip-youtube-video.png" alt="VisionZip video" style="width: 80%; min-width: 300px; display: block; margin: auto;">
  </a>
</p>

## Demo
### Speed Improvement
The input [video](https://www.youtube.com/watch?v=I7c1etV7D7g) is about the Titanic, and the question is, "What’s the video talking about?"

<p align="center" width="80%">
  <a href="https://www.youtube.com/watch?v=I7c1etV7D7g" target="_blank">
    <img src="https://raw.githubusercontent.com/dvlab-research/VisionZip/main/imgs/titanic.png" alt="Titanic demo" style="width: 80%; min-width: 300px; display: block; margin: auto;">
  </a>
</p>

Note that the left side shows the vanilla model, which encodes only 16 frames, while the right side shows VisionZip, which, despite encoding **32 frames**, is still **twice** as fast as the vanilla model.

<p align="center" width="100%">
<img src="https://raw.githubusercontent.com/dvlab-research/VisionZip/main/imgs/speed.gif" alt="Speed comparison" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

### Visualize Redundancy and Misalignment
<p align="center" width="100%">
<img src="https://raw.githubusercontent.com/dvlab-research/VisionZip/main/imgs/gradio.png" alt="Gradio demo" style="width: 80%; min-width: 300px; display: block; margin: auto;">
</p>

Explore visual redundancy and feature misalignment in the [Demo](http://202.104.135.156:7860/) above. To run it locally, use the following command:
```bash
python gradio_demo.py
```

### Observe How Different Visual Tokens Impact the Final Response
This [Demo-Chat](http://202.104.135.156:7860/) lets users manually select which visual tokens to send to the LLM and observe how different visual tokens affect the final response.

## Installation
Our code is easy to use.

1. Install the [LLaVA](https://github.com/haotian-liu/LLaVA) environment.

2. For normal usage, install the package from PyPI:
```bash
pip install visionzip
```

For development, clone the repository and install it in editable mode:
```bash
git clone https://github.com/dvlab-research/VisionZip
cd VisionZip
pip install -e .
```

## Quick Start
```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model
from visionzip import visionzip

model_path = "liuhaotian/llava-v1.5-7b"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)
## VisionZip retains 54 dominant tokens and 10 contextual tokens
model = visionzip(model, dominant=54, contextual=10)
```

## Evaluation
The evaluation code follows the structure of [LLaVA](https://github.com/haotian-liu/LLaVA) or [Lmms-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval). After loading the model, simply add two lines as shown below:

```python
## Load LLaVA Model (code from llava.eval.model_vqa_loader)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
## add VisionZip
from visionzip import visionzip
model = visionzip(model, dominant=54, contextual=10)
```

## Examples
### Multi-turn Conversations
VisionZip, which extracts text-agnostic tokens, is better suited for multi-turn dialogue.

<p align="center"> <img src="https://raw.githubusercontent.com/dvlab-research/VisionZip/main/imgs/conversation.png" width="80%"> </p>

### Longer Videos with More Frames
VisionZip reduces the number of visual tokens per frame, allowing more frames to be processed. This improves the model's ability to understand longer videos.
<p align="center"> <img src="https://raw.githubusercontent.com/dvlab-research/VisionZip/main/imgs/longer-video.png" width="80%"> </p>

## Citation
If you find this project useful in your research, please consider citing:

```bibtex
@article{yang2024visionzip,
  title={VisionZip: Longer is Better but Not Necessary in Vision Language Models},
  author={Yang, Senqiao and Chen, Yukang and Tian, Zhuotao and Wang, Chengyao and Li, Jingyao and Yu, Bei and Jia, Jiaya},
  journal={arXiv preprint arXiv:2412.04467},
  year={2024}
}
```

## Acknowledgement
- This work is built upon [LLaVA](https://llava-vl.github.io/), [mini-Gemini](https://github.com/dvlab-research/MGM), [Lmms-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), and [Video-LLaVA](https://github.com/PKU-YuanGroup/Video-LLaVA). We thank them for their excellent open-source contributions.

- We also thank [StreamingLLM](https://github.com/mit-han-lab/streaming-llm), [FastV](https://github.com/pkunlp-icler/FastV), [SparseVLM](https://github.com/Gumpest/SparseVLMs), and others for their contributions, which have provided valuable insights.

## License
- VisionZip is licensed under the Apache License 2.0.
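To make the `dominant=54, contextual=10` split from Quick Start concrete, here is a minimal, dependency-free sketch of the idea. It is **not** the official implementation: selecting dominant tokens by [CLS]-attention score follows the paper's description, while the chunked averaging below is a deliberately simplified stand-in for the actual similarity-based merging that produces contextual tokens, and all names and shapes here are illustrative assumptions.

```python
def visionzip_sketch(tokens, attn_cls, dominant=54, contextual=10):
    """Reduce N visual tokens to `dominant + contextual` tokens.

    tokens:   list of N feature vectors (each a list of floats)
    attn_cls: list of N attention weights from the [CLS] token

    Hypothetical sketch only: top-k selection by CLS attention, then
    chunked mean-pooling of the leftovers as a simplified stand-in for
    similarity-based merging into contextual tokens.
    """
    n = len(tokens)
    # Dominant tokens: the `dominant` indices with the highest CLS attention.
    order = sorted(range(n), key=lambda i: attn_cls[i], reverse=True)
    dom_idx = set(order[:dominant])
    dominant_tokens = [tokens[i] for i in sorted(dom_idx)]

    # Remaining tokens are merged into `contextual` averaged groups.
    rest = [tokens[i] for i in range(n) if i not in dom_idx]
    step = max(1, len(rest) // contextual)
    contextual_tokens = []
    for g in range(contextual):
        chunk = rest[g * step:(g + 1) * step] or rest[-step:]
        dim = len(chunk[0])
        contextual_tokens.append(
            [sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)]
        )
    # Only these 64 tokens (for the defaults) would be fed to the LLM.
    return dominant_tokens + contextual_tokens
```

With LLaVA-1.5's 576 patch tokens, the defaults keep 64 tokens (roughly the 10% retention quoted in Highlights); the real `visionzip(model, ...)` call patches the vision tower so this reduction happens inside the model.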