# awesome-video-understanding [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
A curated list of resources (paper, code, data) on video understanding research. **(sorted by release date)**

🚀 This repo will be continuously updated.
⭐️ Please Star it if you find it helpful!
🤝 Feel free to submit a PR or open an issue with suggestions or improvements.

---
**Table of Contents**
- [Models](#models)
  - [Large Multimodal Models](#large-multimodal-models)
  - [Agents](#agents)
- [Benchmarks](#benchmarks)
  - [General QA](#general-qa)
  - [Caption](#caption)
  - [Temporal Grounding](#temporal-grounding)
  - [Action Recognition](#action-recognition)
  - [Hallucination](#hallucination)
- [Datasets](#datasets)
  - [Pre-Training](#pre-training)
  - [Instruction-Tuning](#instruction-tuning)
  - [RLHF](#rlhf)
- [Research Topics](#research-topics)
  - [Visual Encoding](#visual-encoding)
  - [Visual Token Reduction](#visual-token-reduction)
  - [Streaming](#streaming)
---

## Models
### Large Multimodal Models
| Name | Paper | Task | Note |
|:---|:---|:---|:---|
| **LongVU** <br> @Meta | [LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding](https://arxiv.org/abs/2410.17434) <br> 24.10.22 / ArXiv / [Project Page](https://vision-cair.github.io/LongVU/) | General QA | / |
| **Aria** <br> @Rhymes AI | [Aria: An Open Multimodal Native Mixture-of-Experts Model](https://arxiv.org/abs/2410.05993) <br> 24.10.08 / ArXiv / [Project Page](https://github.com/rhymes-ai/Aria) | General QA / Caption | / |
| **LLaVA-Video** <br> @ByteDance | [Video Instruction Tuning with Synthetic Data](https://arxiv.org/abs/2410.02713) <br> 24.10.03 / ArXiv / [Project Page](https://llava-vl.github.io/blog/2024-09-30-llava-video/) | General QA / Caption | / |
| **Oryx** <br> @THU | [Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution](https://arxiv.org/abs/2409.12961v2) <br> 24.09.19 / ArXiv / [Project Page](https://oryx-mllm.github.io/) | General QA / Caption | / |
| **Qwen2-VL** <br> @Qwen | [Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution](https://arxiv.org/abs/2409.12191) <br> 24.09.18 / ArXiv / [Project Page](https://github.com/QwenLM/Qwen2-VL) | General QA / Caption | / |
| **LLaVA-OneVision** <br> @ByteDance | [LLaVA-OneVision: Easy Visual Task Transfer](https://arxiv.org/abs/2408.03326) <br> 24.08.06 / ArXiv / [Project Page](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/) | General QA / Caption | / |
| **InternVL-2** <br> @OpenGVLab | [InternVL2: Better than the Best—Expanding Performance Boundaries of Open-Source Multimodal Models with the Progressive Scaling Strategy](https://internvl.github.io/blog/2024-07-02-InternVL-2.0/) <br> 24.07.04 / Blog / [Project Page](https://github.com/OpenGVLab/InternVL) | General QA / Caption | / |
| **VideoLLaMA 2** <br> @Alibaba | [VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs](https://arxiv.org/abs/2406.07476) <br> 24.06.11 / ArXiv / [Project Page](https://github.com/DAMO-NLP-SG/VideoLLaMA2?tab=readme-ov-file) | General QA / Caption | / |
| **LWM** <br> @Berkeley | [World Model on Million-Length Video And Language With Blockwise RingAttention](https://arxiv.org/abs/2402.08268) <br> 24.02.13 / ArXiv / [Project Page](https://largeworldmodel.github.io/lwm/) | General QA | / |
| **VILA** <br> @Nvidia | [VILA: On Pre-training for Visual Language Models](https://arxiv.org/abs/2312.07533) <br> 23.12.12 / CVPR'24 / [Project Page](https://github.com/NVlabs/VILA?tab=readme-ov-file) | General QA / Caption | / |
### Agents
| Name | Paper | Task | Note |
|:---|:---|:---|:---|
| **TraveLER** <br> @Berkeley | [TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering](https://arxiv.org/abs/2404.01476) <br> 24.04.01 / EMNLP'24 / [Project Page](https://github.com/traveler-framework/TraveLER) | QA | / |
| **VideoAgent** <br> @BIGAI | [VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding](https://arxiv.org/abs/2403.11481) <br> 24.03.18 / ECCV'24 / [Project Page](https://videoagent.github.io/) | QA / Temporal Grounding | / |
| **VideoAgent** <br> @Stanford | [VideoAgent: Long-form Video Understanding with Large Language Model as Agent](https://arxiv.org/abs/2403.10517) <br> 24.03.15 / ECCV'24 / [Project Page](https://wxh1996.github.io/VideoAgent-Website/) | QA | / |

## Benchmarks
### General QA
| Name | Paper | Metadata | Note |
|:---|:---|:---|:---|
| **HourVideo** <br> @Stanford | [HourVideo: 1-Hour Video-Language Understanding](https://arxiv.org/abs/2411.04998) <br> 24.11.07 / NIPS'24 D&B / [Project Page](https://hourvideo.stanford.edu/) | LLM+Human Annotated / 500 videos / 20~120m / 13K QAs | Long / Egocentric |
| **TOMATO** <br> @Yale | [TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models](https://arxiv.org/abs/2410.23266) <br> 24.10.31 / ArXiv / [Project Page](https://github.com/yale-nlp/TOMATO) | **Human Annotated** / 1.4K videos / 0~72s / 1.5K QAs | / |
| **TemporalBench** <br> @UWM | [TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models](https://arxiv.org/abs/2410.10818) <br> 24.10.14 / ArXiv / [Project Page](https://temporalbench.github.io/#leaderboard) | Human+LLM Annotated / 2K videos / 0~20m / 10K QAs | / |
| **LongVideoBench** <br> @NTU | [LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding](https://arxiv.org/abs/2407.15754) <br> 24.07.22 / NIPS'24 D&B / [Project Page](https://longvideobench.github.io/) | **Human Annotated** / 3.8K videos / 0~1h / 6.7K QAs | / |
| **LVBench** <br> @Zhipu | [LVBench: An Extreme Long Video Understanding Benchmark](https://arxiv.org/abs/2406.08035) <br> 24.06.12 / ArXiv / [Project Page](https://lvbench.github.io/) | **Human Annotated** / 500 videos / avg. 1h / 1.5K QAs | / |
| **VideoMME** <br> @VideoMME-Team | [Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis](https://arxiv.org/abs/2405.21075) <br> 24.05.31 / ArXiv / [Project Page](https://video-mme.github.io/) | **Human Annotated** / 900 videos / 0~60m / 2.7K QAs | / |
| **TempCompass** <br> @PKU | [TempCompass: Do Video LLMs Really Understand Videos?](https://arxiv.org/abs/2403.00476) <br> 24.03.01 / ACL'24 Findings / [Project Page](https://llyx97.github.io/tempcompass/) | ChatGPT+Human Annotated / 410 videos / 0~35s / 7.5K QAs | / |
### Caption
| Name | Paper | Metadata | Note |
|:---|:---|:---|:---|
| **VDC** <br> @UW | [AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark](https://arxiv.org/abs/2410.03051) <br> 24.10.04 / ArXiv / [Project Page](https://rese1f.github.io/aurora-web/) | GPT-4o Annotated / 1027 videos / 0~60s / 1027 captions | Evaluate Captioning using QAs |
### Temporal Grounding
| Name | Paper | Metadata | Note |
|:---|:---|:---|:---|
| **QVHighlights** <br> @UNC | [QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries](https://arxiv.org/abs/2107.09609) <br> 21.07.20 / NIPS'21 / [Project Page](https://github.com/jayleicn/moment_detr) | **Human Annotated** / 10K videos / avg. 150s / 10K queries | / |
| **Charades-STA** <br> @USC | [TALL: Temporal Activity Localization via Language Query](https://arxiv.org/abs/1705.02101) <br> 17.05.05 / ICCV'17 / [Project Page](https://github.com/jiyanggao/TALL) | Rule+Human Annotated / 4233 clip-sentence pairs | / |
| **ActivityNet Captions** <br> @Stanford | [Dense-Captioning Events in Videos](https://arxiv.org/abs/1705.00754) <br> 17.05.05 / ICCV'17 / [Project Page](https://cs.stanford.edu/people/ranjaykrishna/densevid/) | **Human Annotated** / 20K videos / 0~270s | / |
| **YouCook2** <br> @UMich | [Towards Automatic Learning of Procedures from Web Instructional Videos](https://arxiv.org/abs/1703.09788) <br> 17.03.28 / AAAI'18 / [Project Page](http://youcook2.eecs.umich.edu/) | **Human Annotated** / 2K videos / 0~800s / avg. 7.7 segments per video | / |
### Action Recognition
| Name | Paper | Metadata | Note |
|:---|:---|:---|:---|
| **FineGym** <br> @CUHK | [FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding](https://arxiv.org/abs/2004.06704) <br> 20.04.14 / CVPR'20 / [Project Page](https://sdolivia.github.io/FineGym/) | **Human Annotated** | / |
### Hallucination
| Name | Paper | Metadata | Note |
|:---|:---|:---|:---|
| **VideoHallucer** <br> @BIGAI | [VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models](https://arxiv.org/abs/2406.16338) <br> 24.06.24 / ArXiv / [Project Page](https://videohallucer.github.io/) | Rule+Human Annotated / 948 videos / 7~187s / 1.8K QAs | / |

## Datasets
### Pre-Training
| Name | Paper | Data | Metadata |
|:---|:---|:---|:---|
| **ShareGPTVideo** <br> @CMU | [Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward](https://arxiv.org/abs/2404.01258) <br> 24.04.01 / ArXiv / [Project Page](https://github.com/RifleZhang/LLaVA-Hound-DPO?tab=readme-ov-file) | [Dataset](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction) | GPT-4V Annotated (10 frames) / 900K Videos / 900K Captions |
### Instruction-Tuning
| Name | Paper | Data | Metadata |
|:---|:---|:---|:---|
| **LLaVA-Video-178K** <br> @ByteDance | [Video Instruction Tuning with Synthetic Data](https://arxiv.org/abs/2410.02713) <br> 24.10.03 / ArXiv / [Project Page](https://llava-vl.github.io/blog/2024-09-30-llava-video/) | [Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) | GPT-4o Annotated **(1 FPS)** / 178K videos / 0~3m / 178K Captions / 1.1M QAs |
| **ShareGPTVideo** <br> @CMU | [Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward](https://arxiv.org/abs/2404.01258) <br> 24.04.01 / ArXiv / [Project Page](https://github.com/RifleZhang/LLaVA-Hound-DPO?tab=readme-ov-file) | [Dataset](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction) | GPT-4V Annotated (10 frames) / 900K Videos / 900K Captions / 900K QAs |
### RLHF
| Name | Paper | Data | Metadata |
|:---|:---|:---|:---|
| **ShareGPTVideo** <br> @CMU | [Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward](https://arxiv.org/abs/2404.01258) <br> 24.04.01 / ArXiv / [Project Page](https://github.com/RifleZhang/LLaVA-Hound-DPO?tab=readme-ov-file) | [Dataset](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/video_instruction/train/dpo) | ChatGPT Annotated / 17K videos / 17K preference data |

## Research Topics
### Visual Encoding
| Name | Paper | Note |
|:---|:---|:---|
| **ElasticTok** <br> @Berkeley | [ElasticTok: Adaptive Tokenization for Image and Video](https://arxiv.org/abs/2410.08368) <br> 24.10.10 / ArXiv / [Project Page](https://largeworldmodel.github.io/elastictok/) | Visual Tokenizer |
| **VideoPrism** <br> @Google | [VideoPrism: A Foundational Visual Encoder for Video Understanding](https://arxiv.org/abs/2402.13217) <br> 24.02.20 / ICML'24 / [Project Page](https://research.google/blog/videoprism-a-foundational-visual-encoder-for-video-understanding/) | Video Encoder |
| **MMVP** <br> @NYU | [Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs](https://arxiv.org/abs/2401.06209) <br> 24.01.11 / ArXiv / [Project Page](https://tsb0601.github.io/mmvp_blog/) | Hybrid Encoder |
### Visual Token Reduction
| Name | Paper | Note |
|:---|:---|:---|
| **RLT** <br> @CMU | [Don't Look Twice: Faster Video Transformers with Run-Length Tokenization](https://arxiv.org/abs/2411.05222) <br> 24.11.07 / NIPS'24 / [Project Page](https://rccchoudhury.github.io/rlt/) | Run-Length Tokenization |
| **InTI** <br> @NJU | [Dynamic and Compressive Adaptation of Transformers From Images to Videos](https://arxiv.org/abs/2408.06840) <br> 24.08.13 / ECCV'24 / [Project Page]() | Dynamic Inter-frame token interpolation |
| **Cambrian-1** <br> @NYU | [Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs](https://arxiv.org/abs/2406.16860) <br> 24.06.24 / ArXiv / [Project Page](https://cambrian-mllm.github.io/) | Spatial Vision Aggregator |
| **FastV** <br> @PKU | [An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models](https://arxiv.org/abs/2403.06764) <br> 24.03.11 / ECCV'24 / [Project Page](https://github.com/pkunlp-icler/FastV) | Prune tokens after layer 2 (see the sketch below this table) |
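
To make the token-reduction notes above concrete, below is a minimal, hypothetical sketch of attention-based visual token pruning in the spirit of FastV's "prune tokens after layer 2" idea: score the visual tokens at an early decoder layer by how much attention they receive, keep the top fraction, and drop the rest. The function name, tensor layout, and `keep_ratio` default are illustrative assumptions, not any paper's reference implementation.

```python
import torch

def prune_visual_tokens(hidden_states: torch.Tensor,
                        attn_weights: torch.Tensor,
                        visual_start: int,
                        visual_end: int,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Drop the least-attended visual tokens from an early layer's hidden states.

    hidden_states: [batch, seq_len, dim] output of the chosen layer.
    attn_weights:  [batch, num_heads, seq_len, seq_len] attention of that layer.
    visual_start / visual_end: slice of the sequence holding the visual tokens.
    keep_ratio: fraction of visual tokens to keep (hypothetical default).
    """
    # Average attention each token *receives*, over heads and query positions.
    received = attn_weights.mean(dim=1).mean(dim=1)        # [batch, seq_len]
    vis_scores = received[:, visual_start:visual_end]      # [batch, num_visual]

    num_visual = visual_end - visual_start
    k = max(1, int(num_visual * keep_ratio))
    # Keep the top-k visual tokens, preserving their original order.
    keep_idx = vis_scores.topk(k, dim=-1).indices.sort(dim=-1).values  # [batch, k]

    visual = hidden_states[:, visual_start:visual_end]     # [batch, num_visual, dim]
    kept = torch.gather(
        visual, 1, keep_idx.unsqueeze(-1).expand(-1, -1, visual.size(-1))
    )
    # Re-assemble: prefix tokens + kept visual tokens + suffix tokens.
    return torch.cat(
        [hidden_states[:, :visual_start], kept, hidden_states[:, visual_end:]], dim=1
    )
```

The entries in this table differ mainly in where the reduction happens (at tokenization, in the vision-language connector, or inside the decoder) and in how tokens are scored or merged; the sketch only illustrates the simplest score-and-drop variant.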
### Streaming
| Name | Paper | Note |
|:---|:---|:---|
| **Streaming_VDC** <br> @Google | [Streaming Dense Video Captioning](https://arxiv.org/abs/2404.01297) <br> 24.04.01 / CVPR'24 / [Project Page](https://github.com/google-research/scenic/tree/main/scenic/projects/streaming_dvc) | Framework |