{"id":13460311,"url":"https://github.com/OpenGVLab/Ask-Anything","last_synced_at":"2025-03-24T19:32:09.706Z","repository":{"id":153681879,"uuid":"629922458","full_name":"OpenGVLab/Ask-Anything","owner":"OpenGVLab","description":"[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.","archived":false,"fork":false,"pushed_at":"2024-05-22T12:54:56.000Z","size":20535,"stargazers_count":2735,"open_issues_count":74,"forks_count":221,"subscribers_count":37,"default_branch":"main","last_synced_at":"2024-05-22T13:38:22.345Z","etag":null,"topics":["big-model","captioning-videos","chat","chatgpt","foundation-models","gradio","langchain","large-language-models","large-model","stablelm","video","video-question-answering","video-understanding"],"latest_commit_sha":null,"homepage":"https://vchat.opengvlab.com/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenGVLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-19T09:49:10.000Z","updated_at":"2024-05-22T13:38:24.018Z","dependencies_parsed_at":"2024-05-20T03:38:00.240Z","dependency_job_id":null,"html_url":"https://github.com/OpenGVLab/Ask-Anything","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FAsk-Anything","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FAsk-Anything/tags","releases_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/OpenGVLab%2FAsk-Anything/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FAsk-Anything/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenGVLab","download_url":"https://codeload.github.com/OpenGVLab/Ask-Anything/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":222004186,"owners_count":16914874,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-model","captioning-videos","chat","chatgpt","foundation-models","gradio","langchain","large-language-models","large-model","stablelm","video","video-question-answering","video-understanding"],"created_at":"2024-07-31T10:00:39.100Z","updated_at":"2025-03-24T19:32:09.695Z","avatar_url":"https://github.com/OpenGVLab.png","language":"Python","funding_links":[],"categories":["HarmonyOS","Python","Web apps","A01_文本生成_文本对话","精选开源项目合集","Applications","Join the Awesome Video Large Language Models Community 🎓🤝","ChatGPT-based applications for regular users and specialized problems","Video \u0026 Animation","Video and Long-Context Multimodality"],"sub_categories":["Windows Manager","Hosted and self-hosted","大语言对话模型及数据","GPT工具","提示语（魔法）","Action Recognition","Other sdk/libraries","Models and systems"],"readme":"\n\n# 🦜 VideoChat Family: Ask-Anything \n\n\n[![Open in OpenXLab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/yinanhe/VideoChat2) | \n\u003ca 
src=\"https://img.shields.io/discord/1099920215724277770?label=Discord\u0026logo=discord\" href=\"https://discord.gg/A2Ex6Pph6A\"\u003e\n    \u003cimg src=\"https://img.shields.io/discord/1099920215724277770?label=Discord\u0026logo=discord\"\u003e\n\u003c/a\u003e | \n\u003ca src=\"https://img.shields.io/badge/cs.CV-2305.06355-b31b1b?logo=arxiv\u0026logoColor=red\" href=\"https://arxiv.org/abs/2305.06355\"\u003e \u003cimg src=\"https://img.shields.io/badge/cs.CV-2305.06355-b31b1b?logo=arxiv\u0026logoColor=red\"\u003e\n\u003c/a\u003e| \u003ca src=\"https://img.shields.io/badge/cs.CV-2311.17005-b31b1b?logo=arxiv\u0026logoColor=red\" href=\"https://arxiv.org/abs/2311.17005\"\u003e \u003cimg src=\"https://img.shields.io/badge/cs.CV-2311.17005-b31b1b?logo=arxiv\u0026logoColor=red\"\u003e\n\u003c/a\u003e| \n\u003ca src=\"https://img.shields.io/twitter/follow/opengvlab?style=social\" href=\"https://twitter.com/opengvlab\"\u003e\n    \u003cimg src=\"https://img.shields.io/twitter/follow/opengvlab?style=social\"\u003e \u003c/a\u003e\n\u003c/a\u003e\n\u003cbr\u003e\n\u003ca href=\"https://huggingface.co/spaces/OpenGVLab/VideoChatGPT\"\u003e\u003cimg src=\"https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm-dark.svg\" alt=\"Open in Spaces\"\u003e [VideoChat-7B-8Bit] End2End ChatBOT for video and image. \u003c/a\u003e \u003ca href=\"https://huggingface.co/spaces/OpenGVLab/InternVideo2-Chat-8B-HD\"\u003e\u003cimg src=\"https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm-dark.svg\" alt=\"Open in Spaces\"\u003e [InternVideo2-Chat-8B-HD]\u003c/a\u003e\n\n\n[中文 README 及 中文交流群](README_cn.md) | [Paper](https://arxiv.org/abs/2305.06355)\n\n\u003c!-- 🚀: We update `video_chat` by **instruction tuning for video \u0026 image chatting** now! Find its details [here](https://arxiv.org/pdf/2305.06355.pdf). We release **instruction data** at [InternVideo](https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data). 
The old version of `video_chat` moved to `video_chat_with_chatGPT`.  --\u003e\n\n⭐️: We are also working on an updated version, stay tuned! \n    \n\n\n\n# :fire: Updates\n- **2025/01/18**: We release [videochat-flash](https://github.com/OpenGVLab/VideoChat-Flash) and [videochat-tpo](https://github.com/OpenGVLab/TPO) to extend MLLMs' capabilities on both long and accurate video understanding. [videochat-flash](https://github.com/OpenGVLab/VideoChat-Flash) sets new records in multiple video benchmarks (for both short and long videos), improving code usability by leveraging [LLaVA](https://github.com/LLaVA-VL/LLaVA-NeXT) and others. [videochat-tpo](https://github.com/OpenGVLab/TPO) exploits classical vision task annotations (e.g. tracking) to optimize MLLMs in a DPO manner, enhancing MLLMs' performance and enabling capabilities in tracking, segmentation, and more.\n- **2024/06/25**: We release the [branch of videochat2 using `vllm`](https://github.com/OpenGVLab/Ask-Anything/tree/vllm), which speeds up inference for videochat2.\n- **2024/06/19**: 🎉🎉 Our VideoChat2 achieves the best performance among open-source VideoLLMs on [MLVU](https://github.com/JUNJIE99/MLVU), a multi-task long video understanding benchmark.\n- **2024/06/13**: Fix some bugs and provide testing scripts.\n    - :warning: We replace some repeated (~30) QAs in MVBench, which may only affect the results by 0.5%.\n    - :loudspeaker: We provide scripts for testing [EgoSchema](https://github.com/egoschema/EgoSchema/tree/main) and [Video-MME](https://github.com/BradyFU/Video-MME/tree/main); please check [demo_mistral.ipynb](./video_chat2/demo/demo_mistral.ipynb) and [demo_mistral_hd.ipynb](./video_chat2/demo/demo_mistral_hd.ipynb).\n- **2024/06/07**: :fire::fire::fire: We release **VideoChat2_HD**, which is fine-tuned with high-resolution data and is capable of handling more diverse tasks. It showcases better performance on different benchmarks, especially for detailed captioning. 
Furthermore, it achieves **54.8% on [Video-MME](https://github.com/BradyFU/Video-MME/tree/main)**, the best score among 7B MLLMs. Have a try! 🏃🏻‍♀️🏃🏻\n- **2024/06/06**: We release **VideoChat2_phi3**, a faster model with robust performance. \n- **2024/05/22**: We release **VideoChat2_mistral**, which shows better capacity on diverse tasks (**60.4% on MVBench, 78.6% on NExT-QA, 63.8% on STAR, 46.4% on TVQA, 54.4% on EgoSchema-full and 80.5% on IntentQA**). The paper has been updated with more details. \n- 2024/04/05 MVBench is selected as Poster (**Highlight**)!\n- 2024/2/27 [MVBench](./video_chat2) is accepted by CVPR2024.\n- 2023/11/29 VideoChat2 and MVBench are released.\n  - [VideoChat2](./video_chat2/) is a robust baseline built on [UMT](https://github.com/OpenGVLab/unmasked_teacher) and [Vicuna-v0](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md).\n  - **2M** diverse [instruction data](./video_chat2/DATA.md) are released for effective tuning.\n  - [MVBench](./video_chat2/MVBENCH.md) is a comprehensive benchmark for video understanding.\n\n- 2023/05/11 End-to-end VideoChat and its technical report.\n  - [VideoChat1](./video_chat/): Instruction tuning for video chatting (also supports image chatting).\n  - [Paper](https://arxiv.org/pdf/2305.06355.pdf): We present how we craft VideoChat with two versions (via text and embed) along with some discussions on its background, applications, and more.\n\n- 2023/04/25 Watch videos longer than one minute with chatGPT\n  - [VideoChat LongVideo](https://github.com/OpenGVLab/Ask-Anything/tree/long_video_support/): Incorporating langchain and whisper into VideoChat.\n\n- 2023/04/21 Chat with MOSS\n  - [VideoChat with MOSS](./video_chat_text/video_chat_with_MOSS/): Explicit communication with MOSS. \n\n- 2023/04/20: Chat with StableLM\n  - [VideoChat with StableLM](./video_chat_text/video_chat_with_StableLM/): Explicit communication with StableLM. 
\n\n- 2023/04/19: Code release \u0026 Online Demo\n  - [VideoChat with ChatGPT](./video_chat_with_ChatGPT): Explicit communication with ChatGPT. Sensitive with time. \n  - [MiniGPT-4 for video](./video_chat_text/video_miniGPT4/): Implicit communication with Vicuna. Not sensitive with time. (Simple extension of [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), which will be improved in the future.)\n\n\n\u003c!-- # :speech_balloon: Example\nhttps://user-images.githubusercontent.com/24236723/233631602-6a69d83c-83ef-41ed-a494-8e0d0ca7c1c8.mp4 --\u003e\n\n# 🔨 Getting Started\n\n### Build video chat with:\n* [End2End](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat#running-usage)\n* [ChatGPT](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat_text/video_chat_with_ChatGPT#running-usage)\n* [StableLM](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat_text/video_chat_with_StableLM#running-usage)\n* [MOSS](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat_text/video_chat_with_MOSS#running-usage)\n* [MiniGPT-4](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat_text/video_miniGPT4#running-usage)\n\n\n# :clapper: [\\[End2End ChatBot\\]](https://vchat.opengvlab.com)\n\n\nhttps://github.com/OpenGVLab/Ask-Anything/assets/24236723/a8667e87-49dd-4fc8-a620-3e408c058e26\n    \n\u003cvideo controls\u003e\n  \u003csource src=\"[https://user-images.githubusercontent.com/24236723/233630363-b20304ab-763b-40e5-b526-e2a6b9e9cae2.mp4](https://github.com/OpenGVLab/Ask-Anything/assets/24236723/a8667e87-49dd-4fc8-a620-3e408c058e26)\" type=\"video/mp4\"\u003e\nYour browser does not support the video tag.\n\u003c/video\u003e\n\n\n# :movie_camera: [\\[Communication with ChatGPT\\]](https://vchat.opengvlab.com)\n\nhttps://user-images.githubusercontent.com/24236723/233630363-b20304ab-763b-40e5-b526-e2a6b9e9cae2.mp4\n\n\u003cvideo controls\u003e\n  \u003csource 
src=\"https://user-images.githubusercontent.com/24236723/233630363-b20304ab-763b-40e5-b526-e2a6b9e9cae2.mp4\" type=\"video/mp4\"\u003e\nYour browser does not support the video tag.\n\u003c/video\u003e\n\n\n# :page_facing_up: Citation\n\nIf you find this project useful in your research, please consider citing:\n```BibTeX\n@article{2023videochat,\n  title={VideoChat: Chat-Centric Video Understanding},\n  author={KunChang Li and Yinan He and Yi Wang and Yizhuo Li and Wenhai Wang and Ping Luo and Yali Wang and Limin Wang and Yu Qiao},\n  journal={arXiv preprint arXiv:2305.06355},\n  year={2023}\n}\n\n@inproceedings{li2024mvbench,\n  title={Mvbench: A comprehensive multi-modal video understanding benchmark},\n  author={Li, Kunchang and Wang, Yali and He, Yinan and Li, Yizhuo and Wang, Yi and Liu, Yi and Wang, Zun and Xu, Jilan and Chen, Guo and Luo, Ping and others},\n  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},\n  pages={22195--22206},\n  year={2024}\n}\n```\n\n# 🌤️ Discussion Group\n\nIf you have any questions while trying out, running, or deploying the project, or any ideas or suggestions for it, feel free to join our WeChat group discussion!\n\n\n![image](https://github.com/OpenGVLab/Ask-Anything/assets/43169235/9ac44555-7228-415c-be54-6be18df7d79a)\n\nWe are hiring researchers, engineers and interns in **General Vision Group, Shanghai AI Lab**. If you are interested in working with us, please contact [Yi Wang](https://shepnerd.github.io/) (`wangyi@pjlab.org.cn`).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenGVLab%2FAsk-Anything","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FOpenGVLab%2FAsk-Anything","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenGVLab%2FAsk-Anything/lists"}