Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-llms-for-video-understanding
🔥🔥🔥 Latest Papers, Codes and Datasets on Vid-LLMs.
https://github.com/yunlong10/awesome-llms-for-video-understanding
Last synced: 3 days ago
JSON representation
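The indexed contents of this page are also exposed as machine-readable JSON. The snippet below is a minimal sketch of looking the list up through the Ecosyste.ms Awesome API with Python and the third-party `requests` package; the endpoint path, query parameters, and field names are assumptions, so follow the "JSON representation" link above for the authoritative URL.

```python
# Minimal sketch: look up this awesome list via the Ecosyste.ms Awesome API.
# The endpoint path and field names are assumptions -- the page's
# "JSON representation" link points at the exact URL to use.
import requests

BASE = "https://awesome.ecosyste.ms/api/v1"  # assumed API base

resp = requests.get(f"{BASE}/lists", params={"per_page": 100}, timeout=30)
resp.raise_for_status()

for lst in resp.json():
    # Match on the repository URL shown at the top of this page.
    if "awesome-llms-for-video-understanding" in str(lst.get("url", "")).lower():
        print(lst.get("name"), "-", lst.get("url"))
```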
Tasks, Datasets, and Benchmarks
🗂️ Taxonomy 2
- **FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation**
- **The Kinetics Human Action Video Dataset**
- **MovieNet: A Holistic Dataset for Movie Understanding**
- **Creating Summaries from User Videos**
- **TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering** | CVPR |
- **TGIF: A New Dataset and Benchmark on Animated GIF Description** | CVPR |
- **HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips**
- **MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions**
- **YouTube-8M: A Large-Scale Video Classification Benchmark**
- **ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding** | CVPR |
- **VidChapters-7M: Video Chapters at Scale**
- **Collecting Highly Parallel Data for Paraphrase Evaluation**
- **MSR-VTT: A Large Video Description Dataset for Bridging Video and Language** | CVPR |
- **Dense-Captioning Events in Videos**
- **Towards Automatic Learning of Procedures from Web Instructional Videos**
- **TVSum: Summarizing web videos using titles**
- **From Recognition to Cognition: Visual Commonsense Reasoning**
- **Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences** | CVPR |
- **ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering** | AAAI |
- **VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset** | arXiv |
- **InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation**
- **MIMIC-IT: Multi-Modal In-Context Instruction Tuning**
- **Perception Test: A Diagnostic Benchmark for Multimodal Video Models** | NeurIPS 2023, ICCV 2023 Workshop |
- **MVBench: A Comprehensive Multi-modal Video Understanding Benchmark**
- **VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models** | 06/2024 | [code](https://github.com/patrick-tssn/VideoHallucer) | - |
- **Actor and Observer: Joint Modeling of First and Third-Person Videos** | CVPR |
- **GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval** | ECCV |
- **Multimodal Pretraining for Dense Video Captioning** | AACL-IJCNLP |
- **VideoXum: Cross-modal Visual and Textural Summarization of Videos**
- **Rescaling Egocentric Vision** | IJCV |
- **TALL: Temporal Activity Localization via Language Query**
- **Localizing Moments in Video with Natural Language**
- **DeepStory: Video Story QA by Deep Embedded Memory Networks** | IJCAI |
- **TempCompass: Do Video LLMs Really Understand Videos?**
- **Ego4D: Around the World in 3,000 Hours of Egocentric Video** | CVPR |
- **TVQA: Localized, Compositional Video Question Answering**
- **Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models** | arXiv |
- **Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models**
- **Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks** | 07/2023 | [code](https://github.com/x-plug/youku-mplug) | - |
- **MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding**
- **Encoding and Controlling Global Semantics for Long-form Video Question Answering**
- **Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis** | 06/2024 | [code](https://github.com/BradyFU/Video-MME) | - |
- **LVBench: An Extreme Long Video Understanding Benchmark**
- **TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding**
- **VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset** | NeurIPS |
- **VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs** | arXiv |
- **Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding**
- **Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks** | arXiv |
Contributing
Star History
- Star History Chart for yunlong10/Awesome-LLMs-for-Video-Understanding (star-history.com)
♥️ Contributors
Vid-LLMs: Models
🗂️ Taxonomy 2
- **Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language**
- **AutoAD: Movie Description in Context**
- **LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models** | LLaMA-VID | 11/2023 | [code](https://github.com/dvlab-research/LLaMA-VID) | arXiv |
- **VTimeLLM: Empower LLM to Grasp Video Moments**
- **GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation** | arXiv |
- **VLog: Video as a Long Document**
- **MISAR: A Multimodal Instructional System with Augmented Reality**
- **SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models** | SlowFast-LLaVA | 07/2024 | - | arXiv |
- **Video-LLaVA: Learning United Visual Representation by Alignment Before Projection** | Video-LLaVA | 11/2023 | [code](https://github.com/PKU-YuanGroup/Video-LLaVA) | arXiv |
- **AutoAD III: The Prequel -- Back to the Pixels** | CVPR |
- **VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding**
- **Learning Video Representations from Large Language Models**
- **Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning**
- **VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset** | VAST | 05/2023 | [code](https://github.com/txh-mercury/vast) | NeurIPS |
- **Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding** | Video-LLaMA | 06/2023 | [code](https://github.com/DAMO-NLP-SG/Video-LLaMA) | arXiv |
- **Large Language Models are Temporal and Causal Reasoners for Video Question Answering** | LLaMA-VQA | 10/2023 | [code](https://github.com/mlvlab/Flipped-VQA) | EMNLP |
- **VideoLLM: Modeling Video Sequence with Large Language Models**
- **Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding** | Video-GroundingDINO | 12/2023 | [code](https://github.com/TalalWasim/Video-GroundingDINO) | arXiv |
- **Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions** | Video ChatCaptioner | 04/2023 | [code](https://github.com/Vision-CAIR/ChatCaptioner/tree/main/Video_ChatCaptioner) | arXiv |
- **Audio-Visual LLM for Video Understanding** | 12/2023 | - | arXiv |
- **A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot**
- **Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding** | Chat-UniVi | 11/2023 | [code](https://github.com/pku-yuangroup/chat-univi) | arXiv |
- **MM-VID: Advancing Video Understanding with GPT-4V(ision)** | MM-VID | 10/2023 | - | arXiv |
- **Merlin: Empowering Multimodal LLMs with Foresight Minds** | arXiv |
- **MovieChat: From Dense Token to Sparse Memory for Long Video Understanding**
- **AutoAD II: The Sequel - Who, When, and What in Movie Audio Description** | ICCV |
- **Otter: A Multi-Modal Model with In-Context Instruction Tuning**
- **ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System**
- **NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation** | RSS |
- **VISTA-LLAMA: Reliable Video Narrator via Equal Distance to Visual Tokens** | VISTA-LLAMA | 12/2023 | - | arXiv |
- **LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning** | LLMVA-GEBC | 06/2023 | [code](https://github.com/zjr2000/llmva-gebc) | CVPR |
- **Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration** | Macaw-LLM | 06/2023 | [code](https://github.com/lyuchenyang/macaw-llm) | arXiv |
- **VALLEY: Video Assistant with Large Language model Enhanced abilitY**
- **Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models** | Video-ChatGPT | 06/2023 | [code](https://github.com/mbzuai-oryx/Video-ChatGPT) | arXiv |
- **PG-Video-LLaVA: Pixel Grounding Large Video-Language Models** | PG-Video-LLaVA | 11/2023 | [code](https://github.com/mbzuai-oryx/video-llava) | arXiv |
- **Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos** | Grounding-Prompter | 12/2023 | - | arXiv |
- **VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs**
- **Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models** | FAVOR | 10/2023 | [code](https://github.com/the-anonymous-bs/favor) | arXiv |
- **VideoChat: Chat-Centric Video Understanding** | VideoChat | 05/2023 | [code](https://github.com/OpenGVLab/Ask-Anything) [demo](https://huggingface.co/spaces/ynhe/AskAnything) | arXiv |
- **TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models** | TS-LLaVA | 11/2024 | [code](https://github.com/tingyu215/TS-LLaVA) | arXiv |
Vid-LLM Instruction Tuning
- **Video-LLaVA: Learning United Visual Representation by Alignment Before Projection** | Video-LLaVA | 11/2023 | [code](https://github.com/PKU-YuanGroup/Video-LLaVA) | arXiv |
- **Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding** | Chat-UniVi | 11/2023 | [code](https://github.com/pku-yuangroup/chat-univi) | arXiv |
🦾 Hybrid Methods
- **Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding** | Video-GroundingDINO | 12/2023 | [code](https://github.com/TalalWasim/Video-GroundingDINO) | arXiv |
🗂️ Taxonomy 1
- **Seeing the Unseen: Visual Metaphor Captioning for Videos** | 06/2024 | - | arXiv |
- **Zero-shot long-form video understanding through screenplay** | 06/2024 | - | CVPR |
- **MoReVQA exploring modular reasoning models for video question answering**
- **An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM** | IG-VLM | 03/2024 | [code](https://github.com/imagegridworth/IG-VLM) | arXiv |
- **Language repository for long video understanding**
- **Understanding long videos in one multimodal language model pass**
- **Video ReCap recursive captioning of hour-long videos**
- **A Simple LLM Framework for Long-Range Video Question-Answering**
- **Learning object state changes in videos an open-world perspective**
- **AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?** | ICLR |
- **ViperGPT: Visual Inference via Python Execution for Reasoning**
- **Hawk: Learning to Understand Open-World Video Anomalies**
- **DrVideo: Document Retrieval Based Long Video Understanding**
- **OmAgent a multi-modal agent framework for complex video understanding with task divide-and-conquer**
- **Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA**
- **VideoTree adaptive tree-based video representation for LLM reasoning on long videos**
- **Harnessing Large Language Models for Training-free Video Anomaly Detection**
- **TraveLER a multi-LMM agent framework for video question-answering**
- **GPTSee enhancing moment retrieval and highlight detection via description-based similarity features**
- **Reframe anything LLM agent for open world video reframing**
- **SCHEMA state CHangEs MAtter for procedure planning in instructional videos**
- **TV-TREES multimodal entailment trees for neuro-symbolic video reasoning** | TV-TREES | 02/2024 | - | arXiv |
- **VideoAgent long-form video understanding with large language model as agent**
- **VURF a general-purpose reasoning and self-refinement framework for video understanding**
- **Why not use your textbook knowledge-enhanced procedure planning of instructional videos**
- **DoraemonGPT toward understanding dynamic scenes with large language models** | arXiv |
- **Long context transfer from language to vision** | arXiv |
- **ShareGPT4Video improving video understanding and generation with better captions**
- **AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark** | arXiv |
- **Artemis towards referential understanding in complex videos**
- **EmoLLM multimodal emotional understanding meets large language models**
- **Fewer tokens and fewer videos extending video understanding abilities in large vision-language models** | 06/2024 | - | arXiv |
- **Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams** | Flash-VStream | 06/2024 | [code](https://invinciblewyq.github.io/vstream-page/) | arXiv |
- **LLAVIDAL benchmarking large language vision models for daily activities of living** | arXiv |
- **Towards event-oriented long video understanding** | arXiv |
- **Video-SALMONN speech-enhanced audio-visual large language models** | Video-SALMONN | 06/2024 | [code](https://github.com/bytedance/SALMONN/) | ICML |
- **VideoGPT+ integrating image and video encoders for enhanced video understanding** | arXiv |
- **MotionLLM: Understanding Human Behaviors from Human Motions and Videos**
- **TOPA extend large language models for video understanding via text-only pre-alignment** | NeurIPS |
- **MovieChat+: Question-aware Sparse Memory for Long Video Question Answering**
- **AutoAD III: The Prequel -- Back to the Pixels**
- **Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward** | LLaVA-Hound-DPO | 04/2024 | [code](https://github.com/RifleZhang/LLaVA-Hound-DPO) | arXiv |
- **From image to video, what do we need in multimodal LLMs** | 04/2024 | - | arXiv |
- **Koala key frame-conditioned long video-LLM** | CVPR |
- **LongVLM efficient long video understanding via large language models**
- **Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization** | arXiv |
- **Streaming long video understanding with large language models** | arXiv |
- **Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline** | arXiv |
- **MA-LMM memory-augmented large multimodal model for long-term video understanding** | MA-LMM | 04/2024 | [code](https://boheumd.github.io/MA-LMM/) | CVPR |
- **MiniGPT4-video advancing multimodal LLMs for video understanding with interleaved visual-textual tokens** | MiniGPT4-Video | 04/2024 | [code](https://vision-cair.github.io/MiniGPT4-video/) | arXiv |
- **Pegasus-v1 technical report** | Pegasus-v1 | 04/2024 | - | arXiv |
- **LSTP language-guided spatial-temporal prompt learning for long-form video-text understanding** | EMNLP |
- **PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning**
- **Tarsier recipes for training and evaluating large video description models**
- **X-VARS introducing explainability in football refereeing with multi-modal large language model** | X-VARS | 04/2024 | - | arXiv |
- **LVCHAT facilitating long video comprehension** | arXiv |
- **OSCaR: Object State Captioning and State Change Representation**
- **CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios** | arXiv |
- **InternVideo2 scaling video foundation models for multimodal video understanding**
- **MovieLLM enhancing long video understanding with AI-generated movies**
- **LLMs meet long video advancing long video comprehension with an interactive visual adapter in LLMs**
- **Slot-VLM SlowFast slots for video-language modeling** | Slot-VLM | 02/2024 | - | arXiv |
- **COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training**
- **Weakly supervised gaussian contrastive grounding with large multimodal models for video question answering**
- **Generative Multimodal Models are In-Context Learners**
- **MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples**
- **VaQuitA : Enhancing Alignment in LLM-Assisted Video Understanding**
- **VILA: On Pre-training for Visual Language Models**
- **Chat-UniVi unified visual representation empowers large language models with image and video understanding** | Chat-UniVi | 11/2023 | [code](https://github.com/PKU-YuanGroup/Chat-UniVi) | CVPR |
- **LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models** | LLaMA-VID | 11/2023 | [code](https://github.com/dvlab-research/LLaMA-VID) | arXiv |
- **Video-LLaVA learning united visual representation by alignment before projection** | Video-LLaVA | 11/2023 | [code](https://github.com/PKU-YuanGroup/Video-LLaVA) | arXiv |
- **Otter: A Multi-Modal Model with In-Context Instruction Tuning**
- **AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue**
- **Elysium exploring object-level perception in videos via MLLM** | arXiv |
- **HawkEye training video-text LLMs for grounding text in videos** | arXiv |
- **LITA language instructed temporal-localization assistant**
- **OmniViD: A Generative Framework for Universal Video Understanding**
- **GroundingGPT: Language Enhanced Multi-modal Grounding Model** | arXiv |
- **Self-Chained Image-Language Model for Video Localization and Question Answering**
- **VTimeLLM: Empower LLM to Grasp Video Moments**
- **VTG-LLM integrating timestamp knowledge into video LLMs for enhanced video temporal grounding** | VTG-LLM | 05/2024 | [code](https://github.com/gyxxyg/VTG-LLM) | arXiv |
- **VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing** | NeurIPS |
- **VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT** | VTG-GPT | 03/2024 | [code](https://github.com/YoucanBaby/VTG-GPT) | arXiv |
- **LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos** | arXiv |
- **Zero-Shot Video Question Answering with Procedural Programs**
- **AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn**
- **ST-LLM: Large Language Models Are Effective Temporal Learners** | ST-LLM | 04/2024 | [code](https://github.com/TencentARC/ST-LLM) | arXiv |
- **ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst** | arXiv |
- **Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM** | Holmes-VAD | 06/2024 | [code](https://holmesvad.github.io/) | arXiv |
- **VideoLLM-online online video large language model for streaming video** | VideoLLM-online | 06/2024 | [code](https://showlab.github.io/videollm-online) | CVPR |
- **HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision** | arXiv |
- **V2Xum-LLM cross-modal video summarization with temporal prompt instruction tuning** | 04/2024 | [code](https://hanghuacs.github.io/v2xum/) | arXiv |
- **Momentor advancing video large language model with fine-grained temporal reasoning**
- **Detours for navigating instructional videos**
- **OneLLM: One Framework to Align All Modalities with Language**
- **GPT4Video a unified multimodal large language model for instruction-followed understanding and safety-aware generation**
- **Shot2Story20K a new benchmark for comprehensive understanding of multi-shot videos** | 12/2023 | [code](https://mingfei.info/shot2story/) | arXiv |
- **Vript: A Video Is Worth Thousands of Words**
- **Merlin: Empowering Multimodal LLMs with Foresight Minds**
- **Contextual AD Narration with Interleaved Multimodal Sequence** | Uni-AD | 03/2024 | [code](https://github.com/MCG-NJU/Uni-AD) | arXiv |
- **MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning** | MM-Narrator | 11/2023 | [project page](https://mm-narrator.github.io/) | arXiv |
- **Vamos: Versatile Action Models for Video Understanding** | ECCV |
- **AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description** | AutoAD II | 10/2023 | [project page](https://www.robots.ox.ac.uk/vgg/research/autoad/) | ICCV |