awesome-video-understanding
A curated list of resources (paper, code, data) on video understanding research.
https://github.com/vvukimy/awesome-video-understanding
Models
Agents
- VideoAgent: Long-form Video Understanding with Large Language Model as Agent | QA
- TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | QA
- VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
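These agent systems broadly follow an iterative inspect-and-decide pattern: a language model looks at a small set of captioned frames, decides whether it already has enough evidence to answer, and otherwise requests more frames (or queries a memory). A minimal sketch of that loop is below; `caption_frames` and `llm_decide` are hypothetical callables standing in for a captioning model and an LLM client, not APIs from the listed repositories.

```python
def answer_video_question(frames, question, caption_frames, llm_decide, max_rounds=3):
    """Iterative frame-inspection loop (VideoAgent-style sketch).

    frames: list of decoded video frames.
    caption_frames(frames_subset, indices) -> {index: caption}   (hypothetical)
    llm_decide(question, captions, force=False) -> dict with keys
        'confident', 'answer', 'frames_to_sample'                (hypothetical)
    """
    captions = {}
    # Start from a coarse uniform sample of the video.
    to_look_at = list(range(0, len(frames), max(1, len(frames) // 5)))
    for _ in range(max_rounds):
        # Describe only the frames we have not captioned yet.
        new = [i for i in to_look_at if i not in captions]
        captions.update(caption_frames([frames[i] for i in new], new))
        decision = llm_decide(question=question, captions=captions)
        if decision["confident"]:
            return decision["answer"]
        # Not confident yet: the LLM names additional frames to inspect.
        to_look_at = decision["frames_to_sample"]
    # Round budget spent: force a best-effort answer from what we have.
    return llm_decide(question=question, captions=captions, force=True)["answer"]
```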
Large Multimodal Models
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | General QA
- Aria: An Open Multimodal Native Mixture-of-Experts Model | General QA / Caption
- InternVL2: Better than the Best—Expanding Performance Boundaries of Open-Source Multimodal Models with the Progressive Scaling Strategy
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | General QA / Caption
- VILA: On Pre-training for Visual Language Models | General QA / Caption
- Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | General QA / Caption
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | General QA / Caption
- LLaVA-OneVision: Easy Visual Task Transfer | General QA / Caption
- Video Instruction Tuning with Synthetic Data | General QA / Caption
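Most of the open models above publish Hugging Face checkpoints with a chat-style interface. As one concrete example, the sketch below runs Qwen2-VL on a local clip in the style of its model card; the model ID, the `qwen_vl_utils` helper, and the frame rate and video path are assumptions to check against the current documentation rather than a guaranteed recipe.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # helper package from the Qwen2-VL ecosystem

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# One user turn containing a local video (placeholder path) plus a text question.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 1.0},
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```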
Benchmarks
General QA
- HourVideo: 1-Hour Video-Language Understanding
- TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models | **Human Annotated** / 1.4K videos / 0~72s / 1.5K QAs
- LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
- LVBench: An Extreme Long Video Understanding Benchmark
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | **Human Annotated** / 900 videos / 0~60m / 2.7K QAs
- TempCompass: Do Video LLMs Really Understand Videos?
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
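Most of these general-QA benchmarks are scored as multiple-choice accuracy: the model sees the video, the question, and lettered options, and its free-form reply is matched against the gold option letter. Below is a benchmark-agnostic sketch of that scoring loop; the record fields and the `model_answer` callable are illustrative assumptions, not any benchmark's official evaluator.

```python
import re

def extract_choice(reply: str) -> str | None:
    """Pull the first standalone option letter (A-D) out of a free-form model reply."""
    m = re.search(r"\b([A-D])\b", reply.strip().upper())
    return m.group(1) if m else None

def mcq_accuracy(records, model_answer):
    """records: list of dicts with 'video', 'question', 'options', 'answer' (gold letter).
    model_answer: callable(video, prompt) -> str. Both interfaces are illustrative."""
    correct = 0
    for r in records:
        options = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(r["options"]))
        prompt = f"{r['question']}\n{options}\nAnswer with the option letter only."
        pred = extract_choice(model_answer(r["video"], prompt))
        correct += int(pred == r["answer"])
    return correct / max(1, len(records))
```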
Caption
- AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | GPT-4o Annotated / 1027 videos / 0~60s / 1027 captions | Evaluates captioning using QAs
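The note above ("Evaluates captioning using QAs") refers to scoring a generated caption by whether it can answer questions derived from the reference description, rather than by n-gram overlap. A generic sketch of that idea follows, with a hypothetical `llm` text-completion callable rather than the benchmark's official judge:

```python
def qa_caption_score(candidate_caption: str, qa_pairs, llm) -> float:
    """Score a generated caption by how many reference-derived QAs it can answer.

    qa_pairs: list of (question, reference_answer) tuples.
    llm(prompt) -> str is a hypothetical text-completion call, not the official judge.
    """
    judged_correct = 0
    for question, reference in qa_pairs:
        # Answer the question using ONLY the candidate caption as context.
        answer = llm(
            f"Caption: {candidate_caption}\n"
            f"Question: {question}\n"
            "Answer briefly using only the caption. If the caption lacks the "
            "information, reply 'unknown'."
        )
        # Let the same LLM judge semantic equivalence against the reference answer.
        verdict = llm(
            f"Question: {question}\nReference answer: {reference}\n"
            f"Candidate answer: {answer}\n"
            "Reply 'yes' if the candidate matches the reference, else 'no'."
        )
        judged_correct += int(verdict.strip().lower().startswith("yes"))
    return judged_correct / max(1, len(qa_pairs))
```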
Temporal Grounding
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
- TALL: Temporal Activity Localization via Language Query | sentence pairs
- Dense-Captioning Events in Videos
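Moment-retrieval results on benchmarks like these are usually reported as Recall@K at temporal IoU thresholds: a predicted (start, end) window counts as a hit if it overlaps the annotated moment by at least the threshold. A small sketch of that metric under the usual definition (not any single benchmark's official evaluation script):

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, threshold=0.5):
    """predictions / ground_truths: parallel lists of (start, end) segments,
    one top-ranked prediction per query (hence R@1)."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / max(1, len(ground_truths))

# Example: a prediction of (2.0, 7.5) s against a gold moment of (3.0, 8.0) s
# gives IoU = 4.5 / 6.0 = 0.75, so it counts as a hit at the 0.5 threshold.
```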
Action Recognition
Hallucination
Research Topics
Visual Token Reduction
- Dynamic and Compressive Adaptation of Transformers From Images to Videos | Frame token interpolation
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | Spatial Vision Aggregator
- Don't Look Twice: Faster Video Transformers with Run-Length Tokenization | Run-Length Tokenization
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models | Prune tokens after layer 2
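These methods all shrink the visual token stream before (or inside) the LLM. "Don't Look Twice", for example, drops patch tokens that are essentially unchanged from the previous frame and records a run length instead. The sketch below shows that core idea on a (frames, patches, dim) tensor; the change threshold and the distance measure are illustrative choices, not the paper's implementation.

```python
import torch

def run_length_reduce(patch_tokens: torch.Tensor, threshold: float = 0.1):
    """Drop patch tokens that barely change from the previous frame at the same position.

    patch_tokens: (T, P, D) per-frame patch embeddings.
    Returns (kept, run_lengths): kept tokens (N, D) in temporal order and, for each
    kept token, how many consecutive frames it represents.
    """
    T, P, _ = patch_tokens.shape
    kept, run_lengths = [], []
    run_of_patch = [None] * P          # index into run_lengths for each patch position
    for t in range(T):
        if t == 0:
            changed = torch.ones(P, dtype=torch.bool)
        else:
            # A patch is "static" if it moved less than `threshold` since frame t-1.
            delta = (patch_tokens[t] - patch_tokens[t - 1]).norm(dim=-1)
            changed = delta >= threshold
        for p in range(P):
            if changed[p]:
                kept.append(patch_tokens[t, p])          # start a new run with this token
                run_lengths.append(1)
                run_of_patch[p] = len(run_lengths) - 1
            else:
                run_lengths[run_of_patch[p]] += 1        # extend the existing run
    return torch.stack(kept), torch.tensor(run_lengths)
```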
Visual Encoding
- ElasticTok: Adaptive Tokenization for Image and Video
- VideoPrism: A Foundational Visual Encoder for Video Understanding | Video Encoder
- Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Streaming
- Streaming Dense Video Captioning | Framework
Datasets
Instruction-Tuning
- Video Instruction Tuning with Synthetic Data | [Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) | GPT-4o Annotated **(1 FPS)** / 178K videos / 0~3m / 178K Captions / 1.1M QAs
- Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward | [Dataset](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction) | GPT-4V Annotated (10 frames) / 900K Videos / 900K Captions / 900K QAs
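Both datasets above are hosted on the Hugging Face Hub and can be pulled with the `datasets` library. A minimal loading sketch for LLaVA-Video-178K is below; the subset and split names are placeholders (the dataset is organized into several named subsets), so check the dataset card for the exact configuration strings and schema.

```python
from datasets import load_dataset

# Assumption: LLaVA-Video-178K requires picking one of its named subsets; the string
# below is a placeholder -- the real subset and split names are on the dataset card.
ds = load_dataset("lmms-lab/LLaVA-Video-178K", name="SUBSET_NAME", split="train")

# Inspect the schema rather than assuming field names.
print(ds.features)
print(ds[0])
```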
RLHF
- Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward | [Dataset](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/video_instruction/train/dpo) | ChatGPT Annotated / 17K videos / 17K preference data
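This preference data is consumed by Direct Preference Optimization, which needs only (prompt, chosen, rejected) triples plus a frozen reference model. Sketched below is the standard DPO loss as published, not the repository's exact training code:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard DPO objective over summed log-probs of the chosen/rejected responses.

    Each argument is a (batch,) tensor of log p(response | video, prompt) under the
    trainable policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp        # policy shift on chosen
    rejected_margin = policy_rejected_logp - ref_rejected_logp  # policy shift on rejected
    # Push the chosen response's margin above the rejected one's, scaled by beta.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```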