{"id":22836525,"url":"https://github.com/mbzuai-oryx/VideoGPT-plus","last_synced_at":"2025-08-10T21:32:23.219Z","repository":{"id":244284213,"uuid":"814778688","full_name":"mbzuai-oryx/VideoGPT-plus","owner":"mbzuai-oryx","description":"Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding","archived":false,"fork":false,"pushed_at":"2024-08-11T16:24:43.000Z","size":17271,"stargazers_count":226,"open_issues_count":16,"forks_count":15,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-12-03T17:04:33.936Z","etag":null,"topics":["chatbot","clip","dual-encoder","gpt4","gpt4o","image-encoder","llama3","llava","multimodal","phi-3-mini","vicuna","video-chatbot","video-conversation","video-encoder","vision-language","vision-language-pretraining"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mbzuai-oryx.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-13T17:27:16.000Z","updated_at":"2024-12-02T17:48:51.000Z","dependencies_parsed_at":"2024-07-19T04:38:44.838Z","dependency_job_id":"d77c15ed-83d8-44be-8ba6-001bc7fc1a1a","html_url":"https://github.com/mbzuai-oryx/VideoGPT-plus","commit_stats":{"total_commits":6,"total_committers":2,"mean_commits":3.0,"dds":"0.16666666666666663","last_synced_commit":"0422b91fda488dab7831f5c0043ad5be051be5fd"},"previous_names":["mbzuai-oryx/videogpt-plus"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/mbzuai-oryx%2FVideoGPT-plus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbzuai-oryx%2FVideoGPT-plus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbzuai-oryx%2FVideoGPT-plus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbzuai-oryx%2FVideoGPT-plus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mbzuai-oryx","download_url":"https://codeload.github.com/mbzuai-oryx/VideoGPT-plus/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":229464314,"owners_count":18077035,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatbot","clip","dual-encoder","gpt4","gpt4o","image-encoder","llama3","llava","multimodal","phi-3-mini","vicuna","video-chatbot","video-conversation","video-encoder","vision-language","vision-language-pretraining"],"created_at":"2024-12-12T23:02:12.107Z","updated_at":"2025-08-10T21:32:23.181Z","avatar_url":"https://github.com/mbzuai-oryx.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# VideoGPT+ :movie_camera: :speech_balloon:\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/videogpt_plus_face.jpeg\" alt=\"videogpt_plus_face\" width=\"200\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://i.imgur.com/waxVImv.png\" alt=\"Oryx Video-ChatGPT\"\u003e\n\u003c/p\u003e\n\n### VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding\n\n#### 
[Muhammad Maaz](https://www.mmaaz60.com) , [Hanoona Rasheed](https://www.hanoonarasheed.com/) , [Salman Khan](https://salman-h-khan.github.io/) and [Fahad Khan](https://sites.google.com/view/fahadkhans/home)\n\n#### **Mohamed bin Zayed University of Artificial Intelligence**\n\n---\n\n[![paper](https://img.shields.io/badge/arXiv-Paper-blue.svg)](https://arxiv.org/abs/2406.09418)\n[![video](https://img.shields.io/badge/Project-HuggingFace-F9D371)](https://huggingface.co/collections/MBZUAI/videogpt-665c8643221dda4987a67d8d)\n[![Dataset](https://img.shields.io/badge/VCGBench-Diverse-green)](https://huggingface.co/datasets/MBZUAI/VCGBench-Diverse)\n[![Demo](https://img.shields.io/badge/Annotation-Pipeline-red)](https://huggingface.co/datasets/MBZUAI/video_annotation_pipeline)\n\n---\n**Diverse Video-based Generative Performance Benchmarking (VCGBench-Diverse)**\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videogpt-integrating-image-and-video-encoders/vcgbench-diverse-on-videoinstruct)](https://paperswithcode.com/sota/vcgbench-diverse-on-videoinstruct?p=videogpt-integrating-image-and-video-encoders)\n\n**Video Question Answering on MVBench**\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videogpt-integrating-image-and-video-encoders/video-question-answering-on-mvbench)](https://paperswithcode.com/sota/video-question-answering-on-mvbench?p=videogpt-integrating-image-and-video-encoders)\n\n\n**Video-based Generative Performance Benchmarking**\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/videogpt-integrating-image-and-video-encoders/video-based-generative-performance)](https://paperswithcode.com/sota/video-based-generative-performance?p=videogpt-integrating-image-and-video-encoders)\n\n---\n\n## :loudspeaker: Latest Updates\n- **Mar-28-25**: *Mobile-VideoGPT* is released. It achieves excellent results on multiple benchmarks with 2x higher throughput. 
Check it out [Mobile-VideoGPT](https://github.com/Amshaker/Mobile-VideoGPT) :fire::fire:\n\n- **Jun-13-24**: VideoGPT+ paper, code, model, dataset and benchmark are released. :fire::fire:\n---\n\n## VideoGPT+ Overview :bulb:\n\nVideoGPT+ integrates image and video encoders to leverage detailed spatial understanding and global temporal context, respectively. It processes videos in segments using adaptive pooling on features from both encoders, enhancing performance across various video benchmarks.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/block_diagram.png\" alt=\"VideoGPT+ Architectural Overview\"\u003e\n\u003c/p\u003e\n\n---\n\n## Contributions :trophy:\n\n- **VideoGPT+ Model**: We present VideoGPT+, the first video-conversation model that benefits from a dual-encoding scheme based on both image and video features. These complementary sets of features offer rich spatiotemporal details for improved video understanding.\n- **VCG+ 112K Dataset**: Addressing the limitations of the existing VideoInstruct100K dataset, we develop VCG+ 112K with a novel semi-automatic annotation pipeline, offering dense video captions along with spatial understanding and reasoning-based QA pairs, further improving the model performance.\n- **VCGBench-Diverse Benchmark**: Recognizing the lack of diverse benchmarks for video-conversation tasks, we propose VCGBench-Diverse, which provides 4,354 human-annotated QA pairs across 18 video categories to extensively evaluate the performance of a video-conversation model.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/intro_radar_plot.png\" alt=\"Contributions\" width=\"650\"\u003e\n\u003c/p\u003e\n\n---\n\n## Video Annotation Pipeline (VCG+ 112K) :open_file_folder:\nVideo-ChatGPT introduces the VideoInstruct100K dataset, which employs a semi-automatic annotation pipeline to generate 75K instruction-tuning QA pairs. 
To address the limitations of this annotation process, we present the VCG+ 112K dataset, developed through an improved annotation pipeline. Our approach improves the accuracy and quality of instruction tuning pairs by improving keyframe extraction, leveraging SoTA large multimodal models (LMMs) for detailed descriptions, and refining the instruction generation strategy.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/vcg120k_block_diagram.png\" alt=\"Contributions\"\u003e\n\u003c/p\u003e\n\n---\n## VCGBench-Diverse :mag:\nRecognizing the limited diversity in existing video conversation benchmarks, we introduce VCGBench-Diverse to comprehensively evaluate the generalization ability of video LMMs. While VCG-Bench provides an extensive evaluation protocol, it is limited to videos from the ActivityNet200 dataset. Our benchmark comprises a total of 877 videos, 18 broad video categories and 4,354 QA pairs, ensuring a robust evaluation framework.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/vcgbench_block_diag.png\" alt=\"Contributions\"\u003e\n\u003c/p\u003e\n\n---\n\n## Installation :wrench:\n\nWe recommend setting up a conda environment for the project:\n```shell\nconda create --name=videogpt_plus python=3.11\nconda activate videogpt_plus\n\ngit clone https://github.com/mbzuai-oryx/VideoGPT-plus\ncd VideoGPT-plus\n\npip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu118\npip install transformers==4.41.0\n\npip install -r requirements.txt\n\nexport PYTHONPATH=\"./:$PYTHONPATH\"\n```\nAdditionally, install [FlashAttention](https://github.com/HazyResearch/flash-attention) for training,\n```shell\npip install ninja\n\ngit clone https://github.com/HazyResearch/flash-attention.git\ncd flash-attention\npython setup.py install\n```\n---\n\n## Quantitative Evaluation 📊\nWe provide instructions to reproduce VideoGPT+ results on VCGBench, VCGBench-Diverse and MVBench. 
Please follow the instructions at [eval/README.md](eval/README.md).\n\n### VCGBench Evaluation: Video-based Generative Performance Benchmarking :chart_with_upwards_trend:\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/VCGBench_quantitative.png\" alt=\"VCGBench_quantitative\" width=\"1000\"\u003e\n\u003c/p\u003e\n\n---\n### VCGBench-Diverse Evaluation :bar_chart:\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/VCGDiverse_quantitative.png\" alt=\"VCGDiverse_quantitative\"\u003e\n\u003c/p\u003e\n\n---\n### Zero-Shot Question-Answer Evaluation :question:\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/zero_shot_quantitative.png\" alt=\"zero_shot_quantitative\"\u003e\n\u003c/p\u003e\n\n---\n\n### MVBench Evaluation :movie_camera:\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/MVBench_quantitative.png\" alt=\"MVBench_quantitative\"\u003e\n\u003c/p\u003e\n\n---\n\n## Training :train:\nWe provide scripts for pretraining and finetuning of VideoGPT+. 
Please follow the instructions at [scripts/README.md](scripts/README.md).\n\n---\n\n## Qualitative Analysis :mag:\nA comprehensive evaluation of VideoGPT+ performance across multiple tasks and domains.\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/demo_vcg+_main.png\" alt=\"demo_vcg+_main\" width=\"700\"\u003e\n\u003c/p\u003e\n\n---\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/demo_vcg+_full_part1.jpg\" alt=\"demo_vcg+_full_part1\" width=\"700\"\u003e\n\u003c/p\u003e\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/demo_vcg+_full_part2.jpg\" alt=\"demo_vcg+_full_part2\" width=\"700\"\u003e\n\u003c/p\u003e\n\n---\n\n## Acknowledgements :pray:\n\n+ [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT): A pioneering attempt in Video-based conversation models.\n+ [LLaVA](https://github.com/haotian-liu/LLaVA): Our codebase is built upon LLaVA and Video-ChatGPT.\n+ [Chat-UniVi](https://github.com/PKU-YuanGroup/Chat-UniVi): A recent work in image and video-based conversation models. 
We borrowed some implementation details from their public codebase.\n\n## Citations 📜:\n\nIf you're using VideoGPT+ in your research or applications, please cite using this BibTeX:\n```bibtex\n@article{Maaz2024VideoGPT+,\n    title={VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding},\n    author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz},\n    journal={arxiv},\n    year={2024},\n    url={https://arxiv.org/abs/2406.09418}\n}\n\n@inproceedings{Maaz2023VideoChatGPT,\n    title={Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models},\n    author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz},\n    booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)},\n    year={2024}\n}\n```\n\n## License :scroll:\n\u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-sa/4.0/\"\u003e\u003cimg alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png\" /\u003e\u003c/a\u003e\u003cbr /\u003eThis work is licensed under a \u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-sa/4.0/\"\u003eCreative Commons Attribution-NonCommercial-ShareAlike 4.0 International License\u003c/a\u003e.\n\n\nLooking forward to your feedback, contributions, and stars! :star2:\nPlease raise any issues or questions [here](https://github.com/mbzuai-oryx/VideoGPT-plus/issues). 
\n\n\n---\n[\u003cimg src=\"docs/images/IVAL_logo.png\" width=\"200\" height=\"100\"\u003e](https://www.ival-mbzuai.com)\n[\u003cimg src=\"docs/images/Oryx_logo.png\" width=\"100\" height=\"100\"\u003e](https://github.com/mbzuai-oryx)\n[\u003cimg src=\"docs/images/MBZUAI_logo.png\" width=\"360\" height=\"85\"\u003e](https://mbzuai.ac.ae)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmbzuai-oryx%2FVideoGPT-plus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmbzuai-oryx%2FVideoGPT-plus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmbzuai-oryx%2FVideoGPT-plus/lists"}