{"id":22068894,"url":"https://github.com/mbzuai-oryx/Video-LLaVA","last_synced_at":"2025-07-24T07:31:32.054Z","repository":{"id":208295033,"uuid":"721210096","full_name":"mbzuai-oryx/Video-LLaVA","owner":"mbzuai-oryx","description":"PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models","archived":false,"fork":false,"pushed_at":"2024-01-02T17:51:01.000Z","size":19746,"stargazers_count":245,"open_issues_count":14,"forks_count":11,"subscribers_count":14,"default_branch":"main","last_synced_at":"2024-11-21T18:50:23.676Z","etag":null,"topics":["grounding","llm","lmm","transcription","video","video-conversation","video-grounding"],"latest_commit_sha":null,"homepage":"https://mbzuai-oryx.github.io/Video-LLaVA","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mbzuai-oryx.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-11-20T15:25:17.000Z","updated_at":"2024-11-15T02:43:24.000Z","dependencies_parsed_at":"2024-01-02T18:57:21.262Z","dependency_job_id":null,"html_url":"https://github.com/mbzuai-oryx/Video-LLaVA","commit_stats":null,"previous_names":["mbzuai-oryx/video-llava"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbzuai-oryx%2FVideo-LLaVA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbzuai-oryx%2FVideo-LLaVA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbzuai-oryx%2FVideo-LLaVA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbzuai-oryx%2FVideo-LLaVA/manifests","o
wner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mbzuai-oryx","download_url":"https://codeload.github.com/mbzuai-oryx/Video-LLaVA/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227421362,"owners_count":17775011,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["grounding","llm","lmm","transcription","video","video-conversation","video-grounding"],"created_at":"2024-11-30T20:04:28.090Z","updated_at":"2024-11-30T20:07:17.103Z","avatar_url":"https://github.com/mbzuai-oryx.png","language":"Python","funding_links":[],"categories":["Paper List"],"sub_categories":["Follow-up Papers"],"readme":"# \u003cimg src=\"docs/images/logos/logo.png\" height=\"40\"\u003e  PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models\n![](https://i.imgur.com/waxVImv.png)\n\n[Shehan Munasinghe](https://shehanmunasinghe.github.io/)* , [Rusiru Thushara](https://thusharakart.github.io/)* , [Muhammad Maaz](https://www.muhammadmaaz.com/) , [Hanoona Rasheed](https://www.hanoonarasheed.com/), [Salman Khan](https://salman-h-khan.github.io/), [Mubarak Shah](https://www.crcv.ucf.edu/person/mubarak-shah/),  [Fahad Shahbaz Khan](https://scholar.google.es/citations?user=zvaeYnUAAAAJ\u0026hl=en). 
\n\n*Equal Contribution\n\n**Mohamed bin Zayed University of Artificial Intelligence, UAE**\n\n[![Website](https://img.shields.io/badge/Project-Website-87CEEB)](https://mbzuai-oryx.github.io/Video-LLaVA/)\n[![paper](https://img.shields.io/badge/arXiv-Paper-\u003cCOLOR\u003e.svg)](https://arxiv.org/abs/2311.13435)\n\n---\n\n## 📢 Latest Updates\n- 📦 27-Dec-2023: Code, models released! 🚀\n---\n\n## \u003cimg src=\"docs/images/logos/logo.png\" height=\"25\"\u003e  Overview\n\nPG-Video-LLaVA is the first video-based Large Multimodal Model (LMM) with pixel-level grounding capabilities. 🔥🔥🔥\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/figures/teaser.png\" width=\"70%\" alt=\"PG-Video-LLaVA Architectural Overview\"\u003e\n\u003c/p\u003e\n\n---\n## 🏆 Contributions\n\nThe key contributions of this work are:\n\n- We propose PG-Video-LLaVA, **the first video-based LMM with pixel-level grounding capabilities**, featuring a modular design for enhanced flexibility. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially ground objects in videos following user instructions.\n\n- We introduce a **new benchmark specifically designed to measure prompt-based object grounding performance**.\n\n- By incorporating audio context, PG-Video-LLaVA significantly **enhances its understanding of video content**, making it more comprehensive and aptly suited for scenarios where the audio signal is crucial for video understanding (e.g., dialogues and conversations, news videos, etc.).\n\n- We introduce **improved quantitative benchmarks** for video-based conversational models. Our benchmarks utilize open-source Vicuna LLM to ensure better reproducibility and transparency. 
We also propose benchmarks to evaluate the grounding capabilities of video-based conversational models.\n\n---\n\n## \u003cimg src=\"docs/images/logos/logo.png\" height=\"25\"\u003e PG-Video-LLaVA : Architecture\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/figures/1-architecture.png\" alt=\"PG-Video-LLaVA Architectural Overview\"\u003e\n\u003c/p\u003e\n\n---\n\n## Installation and CLI Demo\n\nFor installation and setting up the CLI demo, please refer to the instructions [here](/docs/1-CLI_DEMO.md).\n\n---\n\n## Training\n\nFor training, please refer to the instructions [here](/docs/2-Training.md).\n\n---\n\n## Qualitative Analysis 🔍\n\n### Video Grounding 🎯\n\nOur framework uses an off-the-shelf tracker and a novel grounding module, enabling it to localize objects in videos following user instructions.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/figures/grounding-qual.png\" alt=\"Video-Grounding Qualitative Results\"\u003e\n\u003c/p\u003e\n\n---\n\n### Including Audio Modality 🎧\n\nBy incorporating audio context, PG-Video-LLaVA significantly enhances its understanding of video content, making it more comprehensive and aptly suited for scenarios where the audio signal is crucial for video understanding (e.g., dialogues and conversations, news videos, etc.).\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/figures/audio-qual.png\" alt=\"Qualitative Results: Audio modality\"\u003e\n\u003c/p\u003e\n\n---\n\n### Video-ChatGPT vs PG-Video-LLaVA\u003cimg src=\"docs/images/logos/logo.png\" height=\"20\"\u003e\n\nPG-Video-LLaVA is based on a stronger image-LMM baseline which gives it better conversational ability compared to its predecessor. 
\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/figures/comparison-prev_versions.png\" alt=\"Video-ChatGPT vs PG-Video-LLaVA\"\u003e\n\u003c/p\u003e\n\n\n---\n\n##  Quantitative Evaluation 📊\n\nWe evaluate PG-Video-LLaVA using video-based generative and question-answering benchmarks. We also introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos.\n\n### Video Grounding 🎯\n\nTo quantitatively assess PG-Video-LLaVA’s spatial grounding capability, we use two benchmarks derived from the test sets of the VidSTG and HC-STVG datasets.\n\nFor detailed instructions on performing quantitative evaluation on video grounding, please refer to [this](/grounding_evaluation/README.md).\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/figures/quant_grounding.png\" width=\"60%\" alt=\"Video-Grounding Quantitative Results\"\u003e\n\u003c/p\u003e\n\n---\n\n### Video-based Generative Performance Benchmarking 🤖\n\nWe apply the benchmarking framework from Video-ChatGPT, which measures performance on several axes critical for video-based conversational agents, including correctness of information, detail orientation, contextual understanding, temporal understanding, and consistency. 
To facilitate reliable and reproducible evaluation, we have updated our assessment pipeline by replacing GPT-3.5 with Vicuna-13b-v1.5.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/figures/quant_our_benchmark.png\" alt=\"Video-based Generative Performance Benchmarking\"\u003e\n\u003c/p\u003e\n\n---\n\n### Zero-Shot Question Answering 💬\n\nZero-shot question-answering (QA) capabilities were evaluated quantitatively using several established open-ended QA datasets: MSRVTT-QA, MSVD-QA, TGIF-QA, and ActivityNet-QA.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/figures/quant_zero_shot.png\" alt=\"Zero-shot QA Quantitative Results\"\u003e\n\u003c/p\u003e\n\nFor detailed instructions on the video-based generative performance and zero-shot question-answering benchmarks, please refer to [this](/quantitative_evaluation/README.md).\n\n---\n\n## Acknowledgements 🙏\n\n+ [LLaMA](https://github.com/facebookresearch/llama): a great attempt towards open and efficient LLMs!\n+ [Vicuna](https://github.com/lm-sys/FastChat): has amazing language capabilities!\n+ [LLaVA](https://github.com/haotian-liu/LLaVA): our architecture is inspired by LLaVA.\n+ [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT): the predecessor to PG-Video-LLaVA.\n\n---\n\n## Citation 📜\n\nIf you're using PG-Video-LLaVA in your research or applications, please cite using this BibTeX:\n\n```bibtex\n  @article{munasinghe2023PGVideoLLaVA,\n        title={PG-Video-LLaVA: Pixel Grounding Large Video-Language Models}, \n        author={Shehan Munasinghe and Rusiru Thushara and Muhammad Maaz and Hanoona Abdul Rasheed and Salman Khan and Mubarak Shah and Fahad Khan},\n        journal={ArXiv 2311.13435},\n        year={2023}\n  }\n```\n\n---\n\n[\u003cimg src=\"docs/images/logos/IVAL_logo.png\" width=\"200\" height=\"100\"\u003e](https://www.ival-mbzuai.com)\n[\u003cimg src=\"docs/images/logos/Oryx_logo.png\" width=\"100\" 
height=\"100\"\u003e](https://github.com/mbzuai-oryx)\n[\u003cimg src=\"docs/images/logos/MBZUAI_logo.png\" width=\"360\" height=\"85\"\u003e](https://mbzuai.ac.ae)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmbzuai-oryx%2FVideo-LLaVA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmbzuai-oryx%2FVideo-LLaVA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmbzuai-oryx%2FVideo-LLaVA/lists"}