{"id":24251449,"url":"https://github.com/OpenGVLab/PVC","last_synced_at":"2025-09-23T16:31:24.484Z","repository":{"id":267890758,"uuid":"902246422","full_name":"OpenGVLab/PVC","owner":"OpenGVLab","description":"PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models","archived":false,"fork":false,"pushed_at":"2024-12-13T03:27:34.000Z","size":2826,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-12-13T04:20:18.263Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2412.09613","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenGVLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-12T07:39:24.000Z","updated_at":"2024-12-13T03:27:38.000Z","dependencies_parsed_at":"2024-12-13T04:30:53.158Z","dependency_job_id":null,"html_url":"https://github.com/OpenGVLab/PVC","commit_stats":null,"previous_names":["opengvlab/pvc"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FPVC","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FPVC/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FPVC/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FPVC/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenGVLab","download_url":"https://codeload.github.com/OpenGVLab/PVC/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":233985944,"owners_count":18761563,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-15T02:50:55.177Z","updated_at":"2025-09-23T16:31:24.479Z","avatar_url":"https://github.com/OpenGVLab.png","language":"Python","funding_links":[],"categories":["📖 Related Papers"],"sub_categories":["2024.12 ###"],"readme":"# Progressive Visual Token Compression (PVC)\n\n[![Static Badge](https://img.shields.io/badge/CVPR-2025-red)](https://cvpr.thecvf.com/virtual/2025/poster/34313)\n[![Static Badge](https://img.shields.io/badge/arXiv-2412.09613-green)](https://arxiv.org/abs/2412.09613)\n[![Static Badge](https://img.shields.io/badge/🤗\u0026nbsp;HuggingFace-checkpoint-blue)](https://huggingface.co/OpenGVLab/PVC-InternVL2-8B)\n\n**[CVPR 2025]** [**PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models**](https://arxiv.org/abs/2412.09613)\n\nWe introduce the **Progressive Visual Token Compression (PVC)** in large 
## 📈 Results

Our implementation is based on the [InternVL2](https://github.com/OpenGVLab/InternVL) model and is referred to as **PVC<sub>InternVL2</sub>**.

### Video Understanding Benchmarks

| Model | LLaVA-OneVision-7B | Qwen2-VL-7B | InternVL2-8B | PVC<sub>InternVL2</sub>-8B <br> 🤗 [link](https://huggingface.co/OpenGVLab/PVC-InternVL2-8B) |
| :--------------: | :--: | :--: | :--: | :--: |
| \# token/frame   | 196  | -    | 256  | 64   |
|                  |      |      |      |      |
| MVBench          | 56.7 | 67.0 | 66.4 | 73.8 |
| VideoMME w/o-sub | 58.2 | 63.3 | 54.0 | 64.1 |
| VideoMME w-sub   | 61.5 | 69.0 | 56.9 | 69.7 |
| MLVU             | 64.7 | -    | 52.0 | 72.4 |
| LongVideoBench   | 56.5 | -    | -    | 59.2 |
| NextQA           | 79.4 | -    | -    | 82.0 |
| Egoschema        | 60.1 | 66.7 | 55.0 | 59.6 |
| PercepTest       | 57.1 | 62.3 | 52.0 | 68.4 |
| AcNet-QA         | 56.6 | -    | -    | 57.1 |

### Image Understanding Benchmarks

| Model | LLaVA-OneVision-7B | Qwen2-VL-7B | InternVL2-8B | PVC<sub>InternVL2</sub>-8B <br> 🤗 [link](https://huggingface.co/OpenGVLab/PVC-InternVL2-8B) |
| :--------------------: | :--: | :--: | :--: | :--: |
| \# token/image tile    | 729  | -    | 256  | 64   |
|                        |      |      |      |      |
| AI2D<sub>test</sub>    | 81.4 | 83.0 | 83.8 | 83.8 |
| ChartQA<sub>test</sub> | 80.0 | 83.0 | 83.3 | 84.1 |
| DocVQA<sub>test</sub>  | 87.5 | 94.5 | 91.6 | 92.5 |
| InfoVQA<sub>test</sub> | 68.8 | 76.5 | 74.8 | 75.0 |
| SQA<sub>test</sub>     | 96.0 | -    | 97.1 | 97.7 |
| TextVQA<sub>val</sub>  | -    | 84.3 | 77.4 | 80.0 |
| MMB<sub>en-test</sub>  | -    | 83.0 | 81.7 | 83.9 |
| MME<sub>sum</sub>      | 1998 | 2327 | 2210 | 2282 |
| MMMU<sub>val</sub>     | 48.8 | 54.1 | 49.3 | 50.9 |
| SEED<sub>I</sub>       | 75.4 | -    | 76.2 | 77.2 |
| OCRBench               | -    | 866  | 794  | 807  |
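As a rough sense of what the per-frame budget implies for LLM context length: with the 64-frame sampling used in the usage example below, a video costs 64 × 64 = 4,096 visual tokens with PVC<sub>InternVL2</sub>-8B, versus 64 × 256 = 16,384 at InternVL2-8B's 256 tokens per frame; the same 4× reduction (64 vs. 256) applies to each image tile.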
## 🛠️ Usage

Set up the environment with `pip install -r requirements.txt`. Please use `transformers>=4.37.2` to ensure the model works correctly.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from utils.preprocess import load_image, load_video

path = 'OpenGVLab/PVC-InternVL2-8B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
generation_config = dict(max_new_tokens=1024, do_sample=True)

# single-image conversation (data_flag=1 marks a single-image input)
pixel_values = load_image('./assets/example_image1.jpg', max_num=12).to(torch.bfloat16).cuda()
data_flag = torch.tensor([1], dtype=torch.long).cuda()

question = '<image>\nWhat is in the image?'
response = model.chat(tokenizer, pixel_values, question, generation_config, data_flag=data_flag)
print(f'User: {question}\nAssistant: {response}')

# multi-image conversation (data_flag=2; num_patches_list gives the number of tiles per image)
pixel_values1 = load_image('./assets/example_image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./assets/example_image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
data_flag = torch.tensor([2], dtype=torch.long).cuda()
num_patches_list = [pixel_values1.shape[0], pixel_values2.shape[0]]

question = 'Image-1: <image>\nImage-2: <image>\nWhat are the similarities and differences between these two images.'
response = model.chat(tokenizer, pixel_values, question, generation_config, data_flag=data_flag, num_patches_list=num_patches_list)
print(f'User: {question}\nAssistant: {response}')

# video conversation (data_flag=3; one <image> placeholder per sampled frame)
pixel_values, num_patches_list = load_video('./assets/example_video.mp4', num_segments=64, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
# Frame1: <image>\nFrame2: <image>\n...\nFrameN: <image>\n{question}
data_flag = torch.tensor([3], dtype=torch.long).cuda()

question = video_prefix + 'Describe this video in detail.'
response = model.chat(tokenizer, pixel_values, question, generation_config, data_flag=data_flag, num_patches_list=num_patches_list)
print(f'User: {question}\nAssistant: {response}')
```

## 📊 Evaluation

### Image Benchmarks & MVBench

**Prepare data:** please follow [this guide](https://internvl.readthedocs.io/en/latest/get_started/eval_data_preparation.html) to prepare the evaluation data.

**Run evaluation:** use the following command to start the evaluation:

```bash
bash evaluate_launch.sh <checkpoint> <task>
```

Currently supported tasks: `vqa-ai2d-test`, `vqa-chartqa-test`, `vqa-docvqa-val`, `vqa-docvqa-test`, `vqa-infovqa-val`, `vqa-infovqa-test`, `scienceqa`, `mme`, `mmbench-dev-en`, `mmbench-test-en`, `mmmu-val`, `seed`, `mvbench`.
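For example, assuming the PVC<sub>InternVL2</sub>-8B checkpoint has already been downloaded locally (the path below is illustrative, not prescribed by the repo), an MVBench run would look like:

```bash
# hypothetical local path; substitute wherever you stored the checkpoint
bash evaluate_launch.sh pretrained/PVC-InternVL2-8B mvbench
```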
For image benchmarks and MVBench, we use the evaluation codebase of InternVL2; refer to [the InternVL2 evaluation docs](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html#) for more details.

## 🖊️ Citation

If you find this work helpful in your research, please consider citing:

```bibtex
@inproceedings{yang2025pvc,
  title={PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models},
  author={Yang, Chenyu and Dong, Xuan and Zhu, Xizhou and Su, Weijie and Wang, Jiahao and Tian, Hao and Chen, Zhe and Wang, Wenhai and Lu, Lewei and Dai, Jifeng},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={24939--24949},
  year={2025}
}
```

## 📃 License

This project is released under the [MIT license](LICENSE). Parts of this project contain code and models from other sources, which are subject to their respective licenses.