Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-llms-for-video-understanding
Latest Papers, Codes and Datasets on Vid-LLMs.
https://github.com/yunlong10/awesome-llms-for-video-understanding
Last synced: 1 day ago
JSON representation
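
The JSON representation of this page is served by the ecosyste.ms API. Below is a minimal sketch of fetching and inspecting it from Python; the endpoint path and the response field names (`url`, `last_synced_at`) are assumptions for illustration, not documented guarantees, so check the service's API documentation before relying on them.

```python
import requests

# Hypothetical endpoint: the awesome.ecosyste.ms route for a single list is
# assumed here and may differ from the service's actual URL scheme.
API_URL = (
    "https://awesome.ecosyste.ms/api/v1/lists/"
    "awesome-llms-for-video-understanding"
)

response = requests.get(API_URL, timeout=30)
response.raise_for_status()        # fail loudly on 4xx/5xx responses
data = response.json()             # parsed JSON representation of the list

# Field names below are assumptions for illustration only.
print(data.get("url"))             # e.g. the indexed GitHub repository URL
print(data.get("last_synced_at"))  # e.g. timestamp of the last sync
```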
Tasks, Datasets, and Benchmarks
- **FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation**
- **The Kinetics Human Action Video Dataset**
- **MovieNet: A Holistic Dataset for Movie Understanding**
- **Creating Summaries from User Videos**
- **TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering** | CVPR |
- **TGIF: A New Dataset and Benchmark on Animated GIF Description** | CVPR |
- **HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips**
- **MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions**
- **VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset** | NeurIPS |
- **YouTube-8M: A Large-Scale Video Classification Benchmark**
- **ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding** | CVPR |
- **VidChapters-7M: Video Chapters at Scale**
- **Collecting Highly Parallel Data for Paraphrase Evaluation**
- **MSR-VTT: A Large Video Description Dataset for Bridging Video and Language** | CVPR |
- **Dense-Captioning Events in Videos**
- **Towards Automatic Learning of Procedures from Web Instructional Videos**
- **TVSum: Summarizing web videos using titles**
- **From Recognition to Cognition: Visual Commonsense Reasoning**
- **Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences** | CVPR |
- **ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering** | AAAI |
- **VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset** | arXiv |
- **InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation**
- **MIMIC-IT: Multi-Modal In-Context Instruction Tuning**
- **Perception Test: A Diagnostic Benchmark for Multimodal Video Models** | NeurIPS 2023, ICCV 2023 Workshop |
- **MVBench: A Comprehensive Multi-modal Video Understanding Benchmark**
- **VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models** | 06/2024 | [code](https://github.com/patrick-tssn/VideoHallucer) |
- **Actor and Observer: Joint Modeling of First and Third-Person Videos** | CVPR |
- **GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval** | ECCV |
- **Multimodal Pretraining for Dense Video Captioning** | AACL-IJCNLP |
- **VideoXum: Cross-modal Visual and Textural Summarization of Videos**
- **Rescaling Egocentric Vision** | IJCV |
- **TALL: Temporal Activity Localization via Language Query**
- **Localizing Moments in Video with Natural Language**
- **DeepStory: Video Story QA by Deep Embedded Memory Networks** | IJCAI |
- **TempCompass: Do Video LLMs Really Understand Videos?**
- **Ego4D: Around the World in 3,000 Hours of Egocentric Video** | CVPR |
- **TVQA: Localized, Compositional Video Question Answering**
- **Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models** | arXiv |
- **Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models**
- **Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks** | 07/2023 | [code](https://github.com/x-plug/youku-mplug) |
- **MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding**
- **Encoding and Controlling Global Semantics for Long-form Video Question Answering**
- **TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding**
- **Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis** | 06/2024 | [code](https://github.com/BradyFU/Video-MME) |
- **LVBench: An Extreme Long Video Understanding Benchmark**
- **VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs** | arXiv |
- **Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding**
Contributing
Star History
- Star History Chart for yunlong10/Awesome-LLMs-for-Video-Understanding
Vid-LLMs: Models
LLM-based Video Agents
- **Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language**
- **VLog: Video as a Long Document**
- **MISAR: A Multimodal Instructional System with Augmented Reality**
- **Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions** | Video ChatCaptioner | 04/2023 | [code](https://github.com/Vision-CAIR/ChatCaptioner/tree/main/Video_ChatCaptioner) | arXiv |
- **ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System**
- **MM-VID: Advancing Video Understanding with GPT-4V(ision)** | MM-VID | 10/2023 | arXiv |
- **VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding**
- **Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos** | Grounding-Prompter | 12/2023 | arXiv |
- **NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation** | RSS |
Vid-LLM Instruction Tuning
- **VALLEY: Video Assistant with Large Language model Enhanced abilitY**
- **Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models** | Video-ChatGPT | 06/2023 | [code](https://github.com/mbzuai-oryx/Video-ChatGPT) | arXiv |
- **Video-LLaVA: Learning United Visual Representation by Alignment Before Projection** | Video-LLaVA | 11/2023 | [code](https://github.com/PKU-YuanGroup/Video-LLaVA) | arXiv |
- **Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding** | Chat-UniVi | 11/2023 | [code](https://github.com/pku-yuangroup/chat-univi) | arXiv |
- **AutoAD: Movie Description in Context**
- **LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models** | LLaMA-VID | 11/2023 | [code](https://github.com/dvlab-research/LLaMA-VID) | arXiv |
- **VTimeLLM: Empower LLM to Grasp Video Moments**
- **GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation** | arXiv |
- **VISTA-LLAMA: Reliable Video Narrator via Equal Distance to Visual Tokens** | VISTA-LLAMA | 12/2023 | arXiv |
- **AutoAD III: The Prequel -- Back to the Pixels** | CVPR |
- **Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding** | Video-LLaMA | 06/2023 | [code](https://github.com/DAMO-NLP-SG/Video-LLaMA) | arXiv |
- **Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration** | Macaw-LLM | 06/2023 | [code](https://github.com/lyuchenyang/macaw-llm) | arXiv |
- **Large Language Models are Temporal and Causal Reasoners for Video Question Answering** | LLaMA-VQA | 10/2023 | [code](https://github.com/mlvlab/Flipped-VQA) | EMNLP |
- **VideoLLM: Modeling Video Sequence with Large Language Models**
- **Audio-Visual LLM for Video Understanding** | 12/2023 | arXiv |
- **LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning** | LLMVA-GEBC | 06/2023 | [code](https://github.com/zjr2000/llmva-gebc) | CVPR |
- **MovieChat: From Dense Token to Sparse Memory for Long Video Understanding**
- **AutoAD II: The Sequel - Who, When, and What in Movie Audio Description** | ICCV |
- **Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models** | FAVOR | 10/2023 | [code](https://github.com/the-anonymous-bs/favor) | arXiv |
- **Otter: A Multi-Modal Model with In-Context Instruction Tuning**
Hybrid Methods
- **VideoChat: Chat-Centric Video Understanding** | VideoChat | 05/2023 | [code](https://github.com/OpenGVLab/Ask-Anything) [demo](https://huggingface.co/spaces/ynhe/AskAnything) | arXiv |
- **PG-Video-LLaVA: Pixel Grounding Large Video-Language Models** | PG-Video-LLaVA | 11/2023 | [code](https://github.com/mbzuai-oryx/video-llava) | arXiv |
- **Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding** | Video-GroundingDINO | 12/2023 | [code](https://github.com/TalalWasim/Video-GroundingDINO) | arXiv |
- **A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot**
Training-free Methods
- **SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models** | SlowFast-LLaVA | 07/2024 | arXiv |
Vid-LLM Pretraining
- **Learning Video Representations from Large Language Models**
- **Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning**
- **VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset** | VAST | 05/2023 | [code](https://github.com/txh-mercury/vast) | NeurIPS |
- **Merlin: Empowering Multimodal LLMs with Foresight Minds** | arXiv |
Video Analyzer × LLM
- **Seeing the Unseen: Visual Metaphor Captioning for Videos** | 06/2024 | arXiv |
- **Zero-shot long-form video understanding through screenplay** | 06/2024 | CVPR |
- **MoReVQA: exploring modular reasoning models for video question answering**
- **An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM** | IG-VLM | 03/2024 | [code](https://github.com/imagegridworth/IG-VLM) | arXiv |
- **Language repository for long video understanding**
- **Understanding long videos in one multimodal language model pass**
- **Video ReCap: recursive captioning of hour-long videos**
- **A Simple LLM Framework for Long-Range Video Question-Answering**
- **Learning object state changes in videos: an open-world perspective**
- **AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?** | ICLR |
- **LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos** | arXiv |
- **Zero-Shot Video Question Answering with Procedural Programs**
- **AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn**
- **ViperGPT: Visual Inference via Python Execution for Reasoning**
- **Hawk: Learning to Understand Open-World Video Anomalies**
- **DrVideo: Document Retrieval Based Long Video Understanding**
- **OmAgent: a multi-modal agent framework for complex video understanding with task divide-and-conquer**
- **Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA**
- **VideoTree: adaptive tree-based video representation for LLM reasoning on long videos**
- **Harnessing Large Language Models for Training-free Video Anomaly Detection**
- **TraveLER: a multi-LMM agent framework for video question-answering**
- **GPTSee: enhancing moment retrieval and highlight detection via description-based similarity features**
- **Reframe anything: LLM agent for open world video reframing**
- **SCHEMA: state CHangEs MAtter for procedure planning in instructional videos**
- **TV-TREES: multimodal entailment trees for neuro-symbolic video reasoning** | TV-TREES | 02/2024 | arXiv |
- **VideoAgent: long-form video understanding with large language model as agent**
- **VURF: a general-purpose reasoning and self-refinement framework for video understanding**
- **Why not use your textbook? Knowledge-enhanced procedure planning of instructional videos**
- **DoraemonGPT: toward understanding dynamic scenes with large language models** | arXiv |