awesome-video-understanding
A curated list of resources (paper, code, data) on video understanding research.
https://github.com/vvukimy/awesome-video-understanding
Models
Agents
- VideoAgent: Long-form Video Understanding with Large Language Model as Agent | QA
- TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | QA
- VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
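These agent systems broadly follow an iterative inspect-and-decide pattern: a language model looks at a small set of captioned frames, decides whether it already has enough evidence to answer, and otherwise requests more frames (or queries a memory). A minimal sketch of that loop is below; `caption_frames` and `llm_decide` are hypothetical callables standing in for a captioning model and an LLM client, not APIs from the listed repositories.

```python
def answer_video_question(frames, question, caption_frames, llm_decide, max_rounds=3):
    """Iterative frame-inspection loop (VideoAgent-style sketch).

    frames: list of decoded video frames.
    caption_frames(frames_subset, indices) -> {index: caption}   (hypothetical)
    llm_decide(question, captions, force=False) -> dict with keys
        'confident', 'answer', 'frames_to_sample'                (hypothetical)
    """
    captions = {}
    # Start from a coarse uniform sample of the video.
    to_look_at = list(range(0, len(frames), max(1, len(frames) // 5)))
    for _ in range(max_rounds):
        # Describe only the frames we have not captioned yet.
        new = [i for i in to_look_at if i not in captions]
        captions.update(caption_frames([frames[i] for i in new], new))
        decision = llm_decide(question=question, captions=captions)
        if decision["confident"]:
            return decision["answer"]
        # Not confident yet: the LLM names additional frames to inspect.
        to_look_at = decision["frames_to_sample"]
    # Round budget spent: force a best-effort answer from what we have.
    return llm_decide(question=question, captions=captions, force=True)["answer"]
```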
Large Multimodal Models
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | General QA
- Aria: An Open Multimodal Native Mixture-of-Experts Model | General QA / Caption
- InternVL2: Better than the Best—Expanding Performance Boundaries of Open-Source Multimodal Models with the Progressive Scaling Strategy
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | General QA / Caption
- VILA: On Pre-training for Visual Language Models | General QA / Caption
- Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | General QA / Caption
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | General QA / Caption
- LLaVA-OneVision: Easy Visual Task Transfer | General QA / Caption
- Video Instruction Tuning with Synthetic Data | General QA / Caption
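Most of the open models above publish Hugging Face checkpoints with a chat-style interface. As one concrete example, the sketch below runs Qwen2-VL on a local clip in the style of its model card; the model ID, the `qwen_vl_utils` helper, and the frame rate and video path are assumptions to check against the current documentation rather than a guaranteed recipe.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # helper package from the Qwen2-VL ecosystem

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# One user turn containing a local video (placeholder path) plus a text question.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 1.0},
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```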
Benchmarks
General QA
- HourVideo: 1-Hour Video-Language Understanding
- TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models | **Human Annotated** / 1.4K videos / 0~72s / 1.5K QAs
- LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
- LVBench: An Extreme Long Video Understanding Benchmark
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | **Human Annotated** / 900 videos / 0~60m / 2.7K QAs
- TempCompass: Do Video LLMs Really Understand Videos?
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
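Most of these general-QA benchmarks are scored as multiple-choice accuracy: the model sees the video, the question, and lettered options, and its free-form reply is matched against the gold option letter. Below is a benchmark-agnostic sketch of that scoring loop; the record fields and the `model_answer` callable are illustrative assumptions, not any benchmark's official evaluator.

```python
import re

def extract_choice(reply: str) -> str | None:
    """Pull the first standalone option letter (A-D) out of a free-form model reply."""
    m = re.search(r"\b([A-D])\b", reply.strip().upper())
    return m.group(1) if m else None

def mcq_accuracy(records, model_answer):
    """records: list of dicts with 'video', 'question', 'options', 'answer' (gold letter).
    model_answer: callable(video, prompt) -> str. Both interfaces are illustrative."""
    correct = 0
    for r in records:
        options = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(r["options"]))
        prompt = f"{r['question']}\n{options}\nAnswer with the option letter only."
        pred = extract_choice(model_answer(r["video"], prompt))
        correct += int(pred == r["answer"])
    return correct / max(1, len(records))
```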
Caption
- AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | GPT-4o Annotated / 1027 videos / 0~60s / 1027 captions | Evaluates captioning using QAs
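The note above ("Evaluates captioning using QAs") refers to scoring a generated caption by whether it can answer questions derived from the reference description, rather than by n-gram overlap. A generic sketch of that idea follows, with a hypothetical `llm` text-completion callable rather than the benchmark's official judge:

```python
def qa_caption_score(candidate_caption: str, qa_pairs, llm) -> float:
    """Score a generated caption by how many reference-derived QAs it can answer.

    qa_pairs: list of (question, reference_answer) tuples.
    llm(prompt) -> str is a hypothetical text-completion call, not the official judge.
    """
    judged_correct = 0
    for question, reference in qa_pairs:
        # Answer the question using ONLY the candidate caption as context.
        answer = llm(
            f"Caption: {candidate_caption}\n"
            f"Question: {question}\n"
            "Answer briefly using only the caption. If the caption lacks the "
            "information, reply 'unknown'."
        )
        # Let the same LLM judge semantic equivalence against the reference answer.
        verdict = llm(
            f"Question: {question}\nReference answer: {reference}\n"
            f"Candidate answer: {answer}\n"
            "Reply 'yes' if the candidate matches the reference, else 'no'."
        )
        judged_correct += int(verdict.strip().lower().startswith("yes"))
    return judged_correct / max(1, len(qa_pairs))
```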
Temporal Grounding
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
- TALL: Temporal Activity Localization via Language Query | sentence pairs
- Dense-Captioning Events in Videos
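Moment-retrieval results on benchmarks like these are usually reported as Recall@K at temporal IoU thresholds: a predicted (start, end) window counts as a hit if it overlaps the annotated moment by at least the threshold. A small sketch of that metric under the usual definition (not any single benchmark's official evaluation script):

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, threshold=0.5):
    """predictions / ground_truths: parallel lists of (start, end) segments,
    one top-ranked prediction per query (hence R@1)."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / max(1, len(ground_truths))

# Example: a prediction of (2.0, 7.5) s against a gold moment of (3.0, 8.0) s
# gives IoU = 4.5 / 6.0 = 0.75, so it counts as a hit at the 0.5 threshold.
```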
Action Recognition
Hallucination
Research Topics
Visual Token Reduction
- Dynamic and Compressive Adaptation of Transformers From Images to Videos | Frame token interpolation
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | Spatial Vision Aggregator
- Don't Look Twice: Faster Video Transformers with Run-Length Tokenization | Run-Length Tokenization
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models | Prune tokens after layer 2
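These methods all shrink the visual token stream before (or inside) the LLM. "Don't Look Twice", for example, drops patch tokens that are essentially unchanged from the previous frame and records a run length instead. The sketch below shows that core idea on a (frames, patches, dim) tensor; the change threshold and the distance measure are illustrative choices, not the paper's implementation.

```python
import torch

def run_length_reduce(patch_tokens: torch.Tensor, threshold: float = 0.1):
    """Drop patch tokens that barely change from the previous frame at the same position.

    patch_tokens: (T, P, D) per-frame patch embeddings.
    Returns (kept, run_lengths): kept tokens (N, D) in temporal order and, for each
    kept token, how many consecutive frames it represents.
    """
    T, P, _ = patch_tokens.shape
    kept, run_lengths = [], []
    run_of_patch = [None] * P          # index into run_lengths for each patch position
    for t in range(T):
        if t == 0:
            changed = torch.ones(P, dtype=torch.bool)
        else:
            # A patch is "static" if it moved less than `threshold` since frame t-1.
            delta = (patch_tokens[t] - patch_tokens[t - 1]).norm(dim=-1)
            changed = delta >= threshold
        for p in range(P):
            if changed[p]:
                kept.append(patch_tokens[t, p])          # start a new run with this token
                run_lengths.append(1)
                run_of_patch[p] = len(run_lengths) - 1
            else:
                run_lengths[run_of_patch[p]] += 1        # extend the existing run
    return torch.stack(kept), torch.tensor(run_lengths)
```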
Visual Encoding
- ElasticTok: Adaptive Tokenization for Image and Video
- VideoPrism: A Foundational Visual Encoder for Video Understanding | Video Encoder
- Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Streaming
- Streaming Dense Video Captioning | Framework
Datasets
Instruction-Tuning
- Video Instruction Tuning with Synthetic Data | [Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) | GPT-4o Annotated **(1 FPS)** / 178K videos / 0~3m / 178K Captions / 1.1M QAs
- Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward | [Dataset](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction) | GPT-4V Annotated (10 frames) / 900K Videos / 900K Captions / 900K QAs
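Both datasets above are hosted on the Hugging Face Hub and can be pulled with the `datasets` library. A minimal loading sketch for LLaVA-Video-178K is below; the subset and split names are placeholders (the dataset is organized into several named subsets), so check the dataset card for the exact configuration strings and schema.

```python
from datasets import load_dataset

# Assumption: LLaVA-Video-178K requires picking one of its named subsets; the string
# below is a placeholder -- the real subset and split names are on the dataset card.
ds = load_dataset("lmms-lab/LLaVA-Video-178K", name="SUBSET_NAME", split="train")

# Inspect the schema rather than assuming field names.
print(ds.features)
print(ds[0])
```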
RLHF
- Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward | [Dataset](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/video_instruction/train/dpo) | ChatGPT Annotated / 17K videos / 17K preference data
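This preference data is consumed by Direct Preference Optimization, which needs only (prompt, chosen, rejected) triples plus a frozen reference model. Sketched below is the standard DPO loss as published, not the repository's exact training code:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard DPO objective over summed log-probs of the chosen/rejected responses.

    Each argument is a (batch,) tensor of log p(response | video, prompt) under the
    trainable policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp        # policy shift on chosen
    rejected_margin = policy_rejected_logp - ref_rejected_logp  # policy shift on rejected
    # Push the chosen response's margin above the rejected one's, scaled by beta.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```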