Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-llms-for-video-understanding
🔥🔥🔥 Latest Papers, Codes and Datasets on Vid-LLMs.
https://github.com/yunlong10/awesome-llms-for-video-understanding
Last synced: 3 days ago
JSON representation
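The indexed contents of this page are also exposed as machine-readable JSON. The snippet below is a minimal sketch of looking the list up through the Ecosyste.ms Awesome API with Python and the third-party `requests` package; the endpoint path, query parameters, and field names are assumptions, so follow the "JSON representation" link above for the authoritative URL.

```python
# Minimal sketch: look up this awesome list via the Ecosyste.ms Awesome API.
# The endpoint path and field names are assumptions -- the page's
# "JSON representation" link points at the exact URL to use.
import requests

BASE = "https://awesome.ecosyste.ms/api/v1"  # assumed API base

resp = requests.get(f"{BASE}/lists", params={"per_page": 100}, timeout=30)
resp.raise_for_status()

for lst in resp.json():
    # Match on the repository URL shown at the top of this page.
    if "awesome-llms-for-video-understanding" in str(lst.get("url", "")).lower():
        print(lst.get("name"), "-", lst.get("url"))
```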
Tasks, Datasets, and Benchmarks
🗂️ Taxonomy 2
- **FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation**
- **The Kinetics Human Action Video Dataset**
- **MovieNet: A Holistic Dataset for Movie Understanding**
- **Creating Summaries from User Videos**
- **TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering** | CVPR |
- **TGIF: A New Dataset and Benchmark on Animated GIF Description** | CVPR |
- **HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips**
- **MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions**
- **YouTube-8M: A Large-Scale Video Classification Benchmark**
- **ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding** | CVPR |
- **VidChapters-7M: Video Chapters at Scale**
- **Collecting Highly Parallel Data for Paraphrase Evaluation**
- **MSR-VTT: A Large Video Description Dataset for Bridging Video and Language** | CVPR |
- **Dense-Captioning Events in Videos**
- **Towards Automatic Learning of Procedures from Web Instructional Videos**
- **TVSum: Summarizing web videos using titles**
- **From Recognition to Cognition: Visual Commonsense Reasoning**
- **Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences** | CVPR |
- **ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering** | AAAI |
- **VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset** | arXiv |
- **InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation**
- **MIMIC-IT: Multi-Modal In-Context Instruction Tuning**
- **Perception Test: A Diagnostic Benchmark for Multimodal Video Models** | NeurIPS 2023, ICCV 2023 Workshop |
- **MVBench: A Comprehensive Multi-modal Video Understanding Benchmark**
- **VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models** | 06/2024 | [code](https://github.com/patrick-tssn/VideoHallucer) | - |
- **Actor and Observer: Joint Modeling of First and Third-Person Videos** | CVPR |
- **GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval** | ECCV |
- **Multimodal Pretraining for Dense Video Captioning** | AACL-IJCNLP |
- **VideoXum: Cross-modal Visual and Textural Summarization of Videos**
- **Rescaling Egocentric Vision** | IJCV |
- **TALL: Temporal Activity Localization via Language Query**
- **Localizing Moments in Video with Natural Language**
- **DeepStory: Video Story QA by Deep Embedded Memory Networks** | IJCAI |
- **TempCompass: Do Video LLMs Really Understand Videos?**
- **Ego4D: Around the World in 3,000 Hours of Egocentric Video** | CVPR |
- **TVQA: Localized, Compositional Video Question Answering**
- **Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models** | arXiv |
- **Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models**
- **Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks** | 07/2023 | [code](https://github.com/x-plug/youku-mplug) | - |
- **MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding**
- **Encoding and Controlling Global Semantics for Long-form Video Question Answering**
- **Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis** | 06/2024 | [code](https://github.com/BradyFU/Video-MME) | - |
- **LVBench: An Extreme Long Video Understanding Benchmark**
- **TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding**
- **VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset** | NeurIPS |
- **VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs** | arXiv |
- **Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding**
- **Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks** | arXiv |
Contributing
Star History
- Star History Chart for yunlong10/Awesome-LLMs-for-Video-Understanding (star-history.com)
♥️ Contributors
Vid-LLMs: Models
🗂️ Taxonomy 2
- **Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language**
- **AutoAD: Movie Description in Context**
- **LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models** | LLaMA-VID | 11/2023 | [code](https://github.com/dvlab-research/LLaMA-VID) | arXiv |
- **VTimeLLM: Empower LLM to Grasp Video Moments**
- **GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation** | arXiv |
- **VLog: Video as a Long Document**
- **MISAR: A Multimodal Instructional System with Augmented Reality**
- **SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models** | SlowFast-LLaVA | 07/2024 | - | arXiv |
- **Video-LLaVA: Learning United Visual Representation by Alignment Before Projection** | Video-LLaVA | 11/2023 | [code](https://github.com/PKU-YuanGroup/Video-LLaVA) | arXiv |
- **AutoAD III: The Prequel -- Back to the Pixels** | CVPR |
- **VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding**
- **Learning Video Representations from Large Language Models**
- **Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning**
- **VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset** | VAST | 05/2023 | [code](https://github.com/txh-mercury/vast) | NeurIPS |
- **Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding** | Video-LLaMA | 06/2023 | [code](https://github.com/DAMO-NLP-SG/Video-LLaMA) | arXiv |
- **Large Language Models are Temporal and Causal Reasoners for Video Question Answering** | LLaMA-VQA | 10/2023 | [code](https://github.com/mlvlab/Flipped-VQA) | EMNLP |
- **VideoLLM: Modeling Video Sequence with Large Language Models**
- **Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding** | Video-GroundingDINO | 12/2023 | [code](https://github.com/TalalWasim/Video-GroundingDINO) | arXiv |
- **Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions** | Video ChatCaptioner | 04/2023 | [code](https://github.com/Vision-CAIR/ChatCaptioner/tree/main/Video_ChatCaptioner) | arXiv |
- **Audio-Visual LLM for Video Understanding** | 12/2023 | - | arXiv |
- **A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot**
- **Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding** | Chat-UniVi | 11/2023 | [code](https://github.com/pku-yuangroup/chat-univi) | arXiv |
- **MM-VID: Advancing Video Understanding with GPT-4V(ision)** | MM-VID | 10/2023 | - | arXiv |
- **Merlin: Empowering Multimodal LLMs with Foresight Minds** | arXiv |
- **MovieChat: From Dense Token to Sparse Memory for Long Video Understanding**
- **AutoAD II: The Sequel - Who, When, and What in Movie Audio Description** | ICCV |
- **Otter: A Multi-Modal Model with In-Context Instruction Tuning**
- **ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System**
- **NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation** | RSS |
- **VISTA-LLAMA: Reliable Video Narrator via Equal Distance to Visual Tokens** | VISTA-LLAMA | 12/2023 | - | arXiv |
- **LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning** | LLMVA-GEBC | 06/2023 | [code](https://github.com/zjr2000/llmva-gebc) | CVPR |
- **Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration** | Macaw-LLM | 06/2023 | [code](https://github.com/lyuchenyang/macaw-llm) | arXiv |
- **VALLEY: Video Assistant with Large Language model Enhanced abilitY**
- **Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models** | Video-ChatGPT | 06/2023 | [code](https://github.com/mbzuai-oryx/Video-ChatGPT) | arXiv |
- **PG-Video-LLaVA: Pixel Grounding Large Video-Language Models** | PG-Video-LLaVA | 11/2023 | [code](https://github.com/mbzuai-oryx/video-llava) | arXiv |
- **Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos** | Grounding-Prompter | 12/2023 | - | arXiv |
- **VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs**
- **Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models** | FAVOR | 10/2023 | [code](https://github.com/the-anonymous-bs/favor) | arXiv |
- **VideoChat: Chat-Centric Video Understanding** | VideoChat | 05/2023 | [code](https://github.com/OpenGVLab/Ask-Anything) [demo](https://huggingface.co/spaces/ynhe/AskAnything) | arXiv |
- **TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models** | TS-LLaVA | 11/2024 | [code](https://github.com/tingyu215/TS-LLaVA) | arXiv |
Vid-LLM Instruction Tuning
- **Video-LLaVA: Learning United Visual Representation by Alignment Before Projection** | Video-LLaVA | 11/2023 | [code](https://github.com/PKU-YuanGroup/Video-LLaVA) | arXiv |
- **Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding** | Chat-UniVi | 11/2023 | [code](https://github.com/pku-yuangroup/chat-univi) | arXiv |
🦾 Hybrid Methods
- **Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding** | Video-GroundingDINO | 12/2023 | [code](https://github.com/TalalWasim/Video-GroundingDINO) | arXiv |
🗂️ Taxonomy 1
- **Seeing the Unseen: Visual Metaphor Captioning for Videos** | 06/2024 | - | arXiv |
- **Zero-shot long-form video understanding through screenplay** | 06/2024 | - | CVPR |
- **MoReVQA exploring modular reasoning models for video question answering**
- **An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM** | IG-VLM | 03/2024 | [code](https://github.com/imagegridworth/IG-VLM) | arXiv |
- **Language repository for long video understanding**
- **Understanding long videos in one multimodal language model pass**
- **Video ReCap recursive captioning of hour-long videos**
- **A Simple LLM Framework for Long-Range Video Question-Answering**
- **Learning object state changes in videos an open-world perspective**
- **AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?** | ICLR |
- **ViperGPT: Visual Inference via Python Execution for Reasoning**
- **Hawk: Learning to Understand Open-World Video Anomalies**
- **DrVideo: Document Retrieval Based Long Video Understanding**
- **OmAgent a multi-modal agent framework for complex video understanding with task divide-and-conquer**
- **Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA**
- **VideoTree adaptive tree-based video representation for LLM reasoning on long videos**
- **Harnessing Large Language Models for Training-free Video Anomaly Detection**
- **TraveLER a multi-LMM agent framework for video question-answering**
- **GPTSee enhancing moment retrieval and highlight detection via description-based similarity features**
- **Reframe anything LLM agent for open world video reframing**
- **SCHEMA state CHangEs MAtter for procedure planning in instructional videos**
- **TV-TREES multimodal entailment trees for neuro-symbolic video reasoning** | TV-TREES | 02/2024 | - | arXiv |
- **VideoAgent long-form video understanding with large language model as agent**
- **VURF a general-purpose reasoning and self-refinement framework for video understanding**
- **Why not use your textbook knowledge-enhanced procedure planning of instructional videos**
- **DoraemonGPT toward understanding dynamic scenes with large language models** | arXiv |
- **Long context transfer from language to vision** | arXiv |
- **ShareGPT4Video improving video understanding and generation with better captions**
- **AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark** | arXiv |
- **Artemis towards referential understanding in complex videos**
- **EmoLLM multimodal emotional understanding meets large language models**
- **Fewer tokens and fewer videos extending video understanding abilities in large vision-language models** | 06/2024 | - | arXiv |
- **Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams** | Flash-VStream | 06/2024 | [code](https://invinciblewyq.github.io/vstream-page/) | arXiv |
- **LLAVIDAL benchmarking large language vision models for daily activities of living** | arXiv |
- **Towards event-oriented long video understanding** | arXiv |
- **Video-SALMONN speech-enhanced audio-visual large language models** | Video-SALMONN | 06/2024 | [code](https://github.com/bytedance/SALMONN/) | ICML |
- **VideoGPT+ integrating image and video encoders for enhanced video understanding** | arXiv |
- **MotionLLM: Understanding Human Behaviors from Human Motions and Videos**
- **TOPA extend large language models for video understanding via text-only pre-alignment** | NeurIPS |
- **MovieChat+: Question-aware Sparse Memory for Long Video Question Answering**
- **AutoAD III: The Prequel -- Back to the Pixels**
- **Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward** | LLaVA-Hound-DPO | 04/2024 | [code](https://github.com/RifleZhang/LLaVA-Hound-DPO) | arXiv |
- **From image to video, what do we need in multimodal LLMs** | 04/2024 | - | arXiv |
- **Koala key frame-conditioned long video-LLM** | CVPR |
- **LongVLM efficient long video understanding via large language models**
- **Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization** | arXiv |
- **Streaming long video understanding with large language models** | arXiv |
- **Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline** | arXiv |
- **MA-LMM memory-augmented large multimodal model for long-term video understanding** | MA-LMM | 04/2024 | [code](https://boheumd.github.io/MA-LMM/) | CVPR |
- **MiniGPT4-video advancing multimodal LLMs for video understanding with interleaved visual-textual tokens** | MiniGPT4-Video | 04/2024 | [code](https://vision-cair.github.io/MiniGPT4-video/) | arXiv |
- **Pegasus-v1 technical report** | Pegasus-v1 | 04/2024 | - | arXiv |
- **LSTP language-guided spatial-temporal prompt learning for long-form video-text understanding** | EMNLP |
- **PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning**
- **Tarsier recipes for training and evaluating large video description models**
- **X-VARS introducing explainability in football refereeing with multi-modal large language model** | X-VARS | 04/2024 | - | arXiv |
- **LVCHAT facilitating long video comprehension** | arXiv |
- **OSCaR: Object State Captioning and State Change Representation**
- **CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios** | arXiv |
- **InternVideo2 scaling video foundation models for multimodal video understanding**
- **MovieLLM enhancing long video understanding with AI-generated movies**
- **LLMs meet long video advancing long video comprehension with an interactive visual adapter in LLMs**
- **Slot-VLM SlowFast slots for video-language modeling** | Slot-VLM | 02/2024 | - | arXiv |
- **COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training**
- **Weakly supervised gaussian contrastive grounding with large multimodal models for video question answering**
- **Generative Multimodal Models are In-Context Learners**
- **MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples**
- **VaQuitA : Enhancing Alignment in LLM-Assisted Video Understanding**
- **VILA: On Pre-training for Visual Language Models**
- **Chat-UniVi unified visual representation empowers large language models with image and video understanding** | Chat-UniVi | 11/2023 | [code](https://github.com/PKU-YuanGroup/Chat-UniVi) | CVPR |
- **LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models** | LLaMA-VID | 11/2023 | [code](https://github.com/dvlab-research/LLaMA-VID) | arXiv |
- **Video-LLaVA learning united visual representation by alignment before projection** | Video-LLaVA | 11/2023 | [code](https://github.com/PKU-YuanGroup/Video-LLaVA) | arXiv |
- **Otter: A Multi-Modal Model with In-Context Instruction Tuning**
- **AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue**
- **Elysium exploring object-level perception in videos via MLLM** | arXiv |
- **HawkEye training video-text LLMs for grounding text in videos** | arXiv |
- **LITA language instructed temporal-localization assistant**
- **OmniViD: A Generative Framework for Universal Video Understanding**
- **GroundingGPT: Language Enhanced Multi-modal Grounding Model** | arXiv |
- **Self-Chained Image-Language Model for Video Localization and Question Answering**
- **VTimeLLM: Empower LLM to Grasp Video Moments**
- **VTG-LLM integrating timestamp knowledge into video LLMs for enhanced video temporal grounding** | VTG-LLM | 05/2024 | [code](https://github.com/gyxxyg/VTG-LLM) | arXiv |
- **VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing** | NeurIPS |
- **VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT** | VTG-GPT | 03/2024 | [code](https://github.com/YoucanBaby/VTG-GPT) | arXiv |
- **LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos** | arXiv |
- **Zero-Shot Video Question Answering with Procedural Programs**
- **AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn**
- **ST-LLM: Large Language Models Are Effective Temporal Learners** | ST-LLM | 04/2024 | [code](https://github.com/TencentARC/ST-LLM) | arXiv |
- **ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst** | arXiv |
- **Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM** | Holmes-VAD | 06/2024 | [code](https://holmesvad.github.io/) | arXiv |
- **VideoLLM-online online video large language model for streaming video** | VideoLLM-online | 06/2024 | [code](https://showlab.github.io/videollm-online) | CVPR |
- **HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision** | arXiv |
- **V2Xum-LLM cross-modal video summarization with temporal prompt instruction tuning** | 04/2024 | [code](https://hanghuacs.github.io/v2xum/) | arXiv |
- **Momentor advancing video large language model with fine-grained temporal reasoning**
- **Detours for navigating instructional videos**
- **OneLLM: One Framework to Align All Modalities with Language**
- **GPT4Video a unified multimodal large language model for instruction-followed understanding and safety-aware generation**
- **Shot2Story20K a new benchmark for comprehensive understanding of multi-shot videos** | 12/2023 | [code](https://mingfei.info/shot2story/) | arXiv |
- **Vript: A Video Is Worth Thousands of Words**
- **Merlin: Empowering Multimodal LLMs with Foresight Minds**
- **Contextual AD Narration with Interleaved Multimodal Sequence** | Uni-AD | 03/2024 | [code](https://github.com/MCG-NJU/Uni-AD) | arXiv |
- **MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning** | MM-Narrator | 11/2023 | [project page](https://mm-narrator.github.io/) | arXiv |
- **Vamos: Versatile Action Models for Video Understanding** | ECCV |
- **AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description** | AutoAD II | 10/2023 | [project page](https://www.robots.ox.ac.uk/vgg/research/autoad/) | ICCV |