# awesome-video-understanding [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
A curated list of resources (paper, code, data) on video understanding research. **(sorted by release date)**

🚀 This repo will be continuously updated.
⭐️ Please Star it if you find it helpful!
🤝 Feel free to submit a PR or open an issue with suggestions or improvements.

---
**Table of Contents**
- [Models](#models)
  - [Large Multimodal Models](#large-multimodal-models)
  - [Agents](#agents)
- [Benchmarks](#benchmarks)
  - [General QA](#general-qa)
  - [Caption](#caption)
  - [Temporal Grounding](#temporal-grounding)
  - [Action Recognition](#action-recognition)
  - [Hallucination](#hallucination)
- [Datasets](#datasets)
  - [Pre-Training](#pre-training)
  - [Instruction-Tuning](#instruction-tuning)
  - [RLHF](#rlhf)
- [Research Topics](#research-topics)
  - [Visual Encoding](#visual-encoding)
  - [Visual Token Reduction](#visual-token-reduction)
  - [Streaming](#streaming)
---

## Models
### Large Multimodal Models
| Name | Paper | Task | Note |
|:---|:---|:---|:---|
| **LongVU** <br> @Meta | [LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding](https://arxiv.org/abs/2410.17434) <br> 24.10.22 / ArXiv / [Project Page](https://vision-cair.github.io/LongVU/) | General QA | / |
| **Aria** <br> @Rhymes AI | [Aria: An Open Multimodal Native Mixture-of-Experts Model](https://arxiv.org/abs/2410.05993) <br> 24.10.08 / ArXiv / [Project Page](https://github.com/rhymes-ai/Aria) | General QA / Caption | / |
| **LLaVA-Video** <br> @ByteDance | [Video Instruction Tuning with Synthetic Data](https://arxiv.org/abs/2410.02713) <br> 24.10.03 / ArXiv / [Project Page](https://llava-vl.github.io/blog/2024-09-30-llava-video/) | General QA / Caption | / |
| **Oryx** <br> @THU | [Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution](https://arxiv.org/abs/2409.12961v2) <br> 24.09.19 / ArXiv / [Project Page](https://oryx-mllm.github.io/) | General QA / Caption | / |
| **Qwen2-VL** <br> @Qwen | [Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution](https://arxiv.org/abs/2409.12191) <br> 24.09.18 / ArXiv / [Project Page](https://github.com/QwenLM/Qwen2-VL) | General QA / Caption | / |
| **LLaVA-OneVision** <br> @ByteDance | [LLaVA-OneVision: Easy Visual Task Transfer](https://arxiv.org/abs/2408.03326) <br> 24.08.06 / ArXiv / [Project Page](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/) | General QA / Caption | / |
| **InternVL-2** <br> @OpenGVLab | [InternVL2: Better than the Best—Expanding Performance Boundaries of Open-Source Multimodal Models with the Progressive Scaling Strategy](https://internvl.github.io/blog/2024-07-02-InternVL-2.0/) <br> 24.07.04 / Blog / [Project Page](https://github.com/OpenGVLab/InternVL) | General QA / Caption | / |
| **VideoLLaMA 2** <br> @Alibaba | [VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs](https://arxiv.org/abs/2406.07476) <br> 24.06.11 / ArXiv / [Project Page](https://github.com/DAMO-NLP-SG/VideoLLaMA2?tab=readme-ov-file) | General QA / Caption | / |
| **LWM** <br> @Berkeley | [World Model on Million-Length Video And Language With Blockwise RingAttention](https://arxiv.org/abs/2402.08268) <br> 24.02.13 / ArXiv / [Project Page](https://largeworldmodel.github.io/lwm/) | General QA | / |
| **VILA** <br> @Nvidia | [VILA: On Pre-training for Visual Language Models](https://arxiv.org/abs/2312.07533) <br> 23.12.12 / CVPR'24 / [Project Page](https://github.com/NVlabs/VILA?tab=readme-ov-file) | General QA / Caption | / |
### Agents
| Name | Paper | Task | Note |
|:---|:---|:---|:---|
| **TraveLER** <br> @Berkeley | [TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering](https://arxiv.org/abs/2404.01476) <br> 24.04.01 / EMNLP'24 / [Project Page](https://github.com/traveler-framework/TraveLER) | QA | / |
| **VideoAgent** <br> @BIGAI | [VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding](https://arxiv.org/abs/2403.11481) <br> 24.03.18 / ECCV'24 / [Project Page](https://videoagent.github.io/) | QA / Temporal Grounding | / |
| **VideoAgent** <br> @Stanford | [VideoAgent: Long-form Video Understanding with Large Language Model as Agent](https://arxiv.org/abs/2403.10517) <br> 24.03.15 / ECCV'24 / [Project Page](https://wxh1996.github.io/VideoAgent-Website/) | QA | / |

## Benchmarks
### General QA
| Name | Paper | Metadata | Note |
|:---|:---|:---|:---|
| **HourVideo** <br> @Stanford | [HourVideo: 1-Hour Video-Language Understanding](https://arxiv.org/abs/2411.04998) <br> 24.11.07 / NIPS'24 D&B / [Project Page](https://hourvideo.stanford.edu/) | LLM+Human Annotated / 500 videos / 20~120m / 13K QAs | Long / Egocentric |
| **TOMATO** <br> @Yale | [TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models](https://arxiv.org/abs/2410.23266) <br> 24.10.31 / ArXiv / [Project Page](https://github.com/yale-nlp/TOMATO) | **Human Annotated** / 1.4K videos / 0~72s / 1.5K QAs | / |
| **TemporalBench** <br> @UWM | [TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models](https://arxiv.org/abs/2410.10818) <br> 24.10.14 / ArXiv / [Project Page](https://temporalbench.github.io/#leaderboard) | Human+LLM Annotated / 2K videos / 0~20m / 10K QAs | / |
| **LongVideoBench** <br> @NTU | [LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding](https://arxiv.org/abs/2407.15754) <br> 24.07.22 / NIPS'24 D&B / [Project Page](https://longvideobench.github.io/) | **Human Annotated** / 3.8K videos / 0~1h / 6.7K QAs | / |
| **LVBench** <br> @Zhipu | [LVBench: An Extreme Long Video Understanding Benchmark](https://arxiv.org/abs/2406.08035) <br> 24.06.12 / ArXiv / [Project Page](https://lvbench.github.io/) | **Human Annotated** / 500 videos / avg. 1h / 1.5K QAs | / |
| **VideoMME** <br> @VideoMME-Team | [Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis](https://arxiv.org/abs/2405.21075) <br> 24.05.31 / ArXiv / [Project Page](https://video-mme.github.io/) | **Human Annotated** / 900 videos / 0~60m / 2.7K QAs | / |
| **TempCompass** <br> @PKU | [TempCompass: Do Video LLMs Really Understand Videos?](https://arxiv.org/abs/2403.00476) <br> 24.03.01 / ACL'24 Findings / [Project Page](https://llyx97.github.io/tempcompass/) | ChatGPT+Human Annotated / 410 videos / 0~35s / 7.5K QAs | / |
### Caption
| Name | Paper | Metadata | Note |
|:---|:---|:---|:---|
| **VDC** <br> @UW | [AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark](https://arxiv.org/abs/2410.03051) <br> 24.10.04 / ArXiv / [Project Page](https://rese1f.github.io/aurora-web/) | GPT-4o Annotated / 1027 videos / 0~60s / 1027 captions | Evaluate Captioning using QAs |
### Temporal Grounding
| Name | Paper | Metadata | Note |
|:---|:---|:---|:---|
| **QVHighlights** <br> @UNC | [QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries](https://arxiv.org/abs/2107.09609) <br> 21.07.20 / NIPS'21 / [Project Page](https://github.com/jayleicn/moment_detr) | **Human Annotated** / 10K videos / avg. 150s / 10K queries | / |
| **Charades-STA** <br> @USC | [TALL: Temporal Activity Localization via Language Query](https://arxiv.org/abs/1705.02101) <br> 17.05.05 / ICCV'17 / [Project Page](https://github.com/jiyanggao/TALL) | Rule+Human Annotated / 4233 clip-sentence pairs | / |
| **ActivityNet Captions** <br> @Stanford | [Dense-Captioning Events in Videos](https://arxiv.org/abs/1705.00754) <br> 17.05.05 / ICCV'17 / [Project Page](https://cs.stanford.edu/people/ranjaykrishna/densevid/) | **Human Annotated** / 20K videos / 0~270s | / |
| **YouCook2** <br> @UMich | [Towards Automatic Learning of Procedures from Web Instructional Videos](https://arxiv.org/abs/1703.09788) <br> 17.03.28 / AAAI'18 / [Project Page](http://youcook2.eecs.umich.edu/) | **Human Annotated** / 2K videos / 0~800s / avg. 7.7 segments per video | / |
### Action Recognition
| Name | Paper | Metadata | Note |
|:---|:---|:---|:---|
| **FineGym** <br> @CUHK | [FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding](https://arxiv.org/abs/2004.06704) <br> 20.04.14 / CVPR'20 / [Project Page](https://sdolivia.github.io/FineGym/) | **Human Annotated** | / |
### Hallucination
| Name | Paper | Metadata | Note |
|:---|:---|:---|:---|
| **VideoHallucer** <br> @BIGAI | [VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models](https://arxiv.org/abs/2406.16338) <br> 24.06.24 / ArXiv / [Project Page](https://videohallucer.github.io/) | Rule+Human Annotated / 948 videos / 7~187s / 1.8K QAs | / |

## Datasets
### Pre-Training
| Name | Paper | Data | Metadata |
|:---|:---|:---|:---|
| **ShareGPTVideo** <br> @CMU | [Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward](https://arxiv.org/abs/2404.01258) <br> 24.04.01 / ArXiv / [Project Page](https://github.com/RifleZhang/LLaVA-Hound-DPO?tab=readme-ov-file) | [Dataset](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction) | GPT-4V Annotated (10 frames) / 900K Videos / 900K Captions |
### Instruction-Tuning
| Name | Paper | Data | Metadata |
|:---|:---|:---|:---|
| **LLaVA-Video-178K** <br> @ByteDance | [Video Instruction Tuning with Synthetic Data](https://arxiv.org/abs/2410.02713) <br> 24.10.03 / ArXiv / [Project Page](https://llava-vl.github.io/blog/2024-09-30-llava-video/) | [Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) | GPT-4o Annotated **(1 FPS)** / 178K videos / 0~3m / 178K Captions / 1.1M QAs |
| **ShareGPTVideo** <br> @CMU | [Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward](https://arxiv.org/abs/2404.01258) <br> 24.04.01 / ArXiv / [Project Page](https://github.com/RifleZhang/LLaVA-Hound-DPO?tab=readme-ov-file) | [Dataset](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction) | GPT-4V Annotated (10 frames) / 900K Videos / 900K Captions / 900K QAs |
### RLHF
| Name | Paper | Data | Metadata |
|:---|:---|:---|:---|
| **ShareGPTVideo** <br> @CMU | [Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward](https://arxiv.org/abs/2404.01258) <br> 24.04.01 / ArXiv / [Project Page](https://github.com/RifleZhang/LLaVA-Hound-DPO?tab=readme-ov-file) | [Dataset](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/video_instruction/train/dpo) | ChatGPT Annotated / 17K videos / 17K preference data |

## Research Topics
### Visual Encoding
| Name | Paper | Note |
|:---|:---|:---|
| **ElasticTok** <br> @Berkeley | [ElasticTok: Adaptive Tokenization for Image and Video](https://arxiv.org/abs/2410.08368) <br> 24.10.10 / ArXiv / [Project Page](https://largeworldmodel.github.io/elastictok/) | Visual Tokenizer |
| **VideoPrism** <br> @Google | [VideoPrism: A Foundational Visual Encoder for Video Understanding](https://arxiv.org/abs/2402.13217) <br> 24.02.20 / ICML'24 / [Project Page](https://research.google/blog/videoprism-a-foundational-visual-encoder-for-video-understanding/) | Video Encoder |
| **MMVP** <br> @NYU | [Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs](https://arxiv.org/abs/2401.06209) <br> 24.01.11 / ArXiv / [Project Page](https://tsb0601.github.io/mmvp_blog/) | Hybrid Encoder |
### Visual Token Reduction
| Name | Paper | Note |
|:---|:---|:---|
| **RLT** <br> @CMU | [Don't Look Twice: Faster Video Transformers with Run-Length Tokenization](https://arxiv.org/abs/2411.05222) <br> 24.11.07 / NIPS'24 / [Project Page](https://rccchoudhury.github.io/rlt/) | Run-Length Tokenization |
| **InTI** <br> @NJU | [Dynamic and Compressive Adaptation of Transformers From Images to Videos](https://arxiv.org/abs/2408.06840) <br> 24.08.13 / ECCV'24 / [Project Page]() | Dynamic Inter-frame token interpolation |
| **Cambrian-1** <br> @NYU | [Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs](https://arxiv.org/abs/2406.16860) <br> 24.06.24 / ArXiv / [Project Page](https://cambrian-mllm.github.io/) | Spatial Vision Aggregator |
| **FastV** <br> @PKU | [An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models](https://arxiv.org/abs/2403.06764) <br> 24.03.11 / ECCV'24 / [Project Page](https://github.com/pkunlp-icler/FastV) | Prune tokens after layer 2 (see the sketch below this table) |
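
To make the token-reduction notes above concrete, below is a minimal, hypothetical sketch of attention-based visual token pruning in the spirit of FastV's "prune tokens after layer 2" idea: score the visual tokens at an early decoder layer by how much attention they receive, keep the top fraction, and drop the rest. The function name, tensor layout, and `keep_ratio` default are illustrative assumptions, not any paper's reference implementation.

```python
import torch

def prune_visual_tokens(hidden_states: torch.Tensor,
                        attn_weights: torch.Tensor,
                        visual_start: int,
                        visual_end: int,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Drop the least-attended visual tokens from an early layer's hidden states.

    hidden_states: [batch, seq_len, dim] output of the chosen layer.
    attn_weights:  [batch, num_heads, seq_len, seq_len] attention of that layer.
    visual_start / visual_end: slice of the sequence holding the visual tokens.
    keep_ratio: fraction of visual tokens to keep (hypothetical default).
    """
    # Average attention each token *receives*, over heads and query positions.
    received = attn_weights.mean(dim=1).mean(dim=1)        # [batch, seq_len]
    vis_scores = received[:, visual_start:visual_end]      # [batch, num_visual]

    num_visual = visual_end - visual_start
    k = max(1, int(num_visual * keep_ratio))
    # Keep the top-k visual tokens, preserving their original order.
    keep_idx = vis_scores.topk(k, dim=-1).indices.sort(dim=-1).values  # [batch, k]

    visual = hidden_states[:, visual_start:visual_end]     # [batch, num_visual, dim]
    kept = torch.gather(
        visual, 1, keep_idx.unsqueeze(-1).expand(-1, -1, visual.size(-1))
    )
    # Re-assemble: prefix tokens + kept visual tokens + suffix tokens.
    return torch.cat(
        [hidden_states[:, :visual_start], kept, hidden_states[:, visual_end:]], dim=1
    )
```

The entries in this table differ mainly in where the reduction happens (at tokenization, in the vision-language connector, or inside the decoder) and in how tokens are scored or merged; the sketch only illustrates the simplest score-and-drop variant.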
### Streaming
| Name | Paper | Note |
|:---|:---|:---|
| **Streaming_VDC** <br> @Google | [Streaming Dense Video Captioning](https://arxiv.org/abs/2404.01297) <br> 24.04.01 / CVPR'24 / [Project Page](https://github.com/google-research/scenic/tree/main/scenic/projects/streaming_dvc) | Framework |