Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/yyyujintang/Awesome-VideoLLM-Papers
This repository compiles a list of papers related to Video LLM.
List: Awesome-VideoLLM-Papers
JSON representation (see the API fetch sketch after the metadata block below)
- Host: GitHub
- URL: https://github.com/yyyujintang/Awesome-VideoLLM-Papers
- Owner: yyyujintang
- Created: 2024-06-03T05:52:14.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-06-27T06:49:40.000Z (4 months ago)
- Last Synced: 2024-10-19T11:09:59.147Z (15 days ago)
- Homepage:
- Size: 4.88 KB
- Stars: 16
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
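Because ecosyste.ms is an open API service, the metadata above (the page's "JSON representation") can also be fetched programmatically. A minimal sketch in Python, assuming a lookup-style endpoint on awesome.ecosyste.ms and guessing at the JSON key names; both the URL and the keys are assumptions, not documented API surface:

```python
import requests

# Assumed endpoint: the page exposes a "JSON representation" of the project,
# and ecosyste.ms services commonly offer lookup-by-URL routes, but this exact
# path is an assumption, not a documented one.
API = "https://awesome.ecosyste.ms/api/v1/projects/lookup"
REPO_URL = "https://github.com/yyyujintang/Awesome-VideoLLM-Papers"

resp = requests.get(API, params={"url": REPO_URL}, timeout=30)
resp.raise_for_status()
project = resp.json()

# Key names are guesses mirroring the metadata block above (Owner, Stars,
# Forks, Last Synced); adjust to whatever the real payload contains.
for key in ("owner", "stars", "forks", "last_synced_at"):
    print(f"{key}: {project.get(key)}")
```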
Awesome Lists containing this project
- Awesome-Video-Robotic-Papers - Awesome-VideoLLM-Papers
- ultimate-awesome - Awesome-VideoLLM-Papers - This repository compiles a list of papers related to Video LLM. (Other Lists / PowerShell Lists)
README
# Awesome-VideoLLM-Papers
This repository compiles a list of papers related to Video LLM, and it is continually being improved. If you come across any relevant papers that should be included, please don't hesitate to open an issue. New entries follow the one-line format sketched below.
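Each entry uses the same one-line pattern: an arXiv date tag, the paper title, a [Paper] link, and, when a repository exists, a [Code] link plus a shields.io star badge. A minimal template, where the date, title, XXXX.XXXXX arXiv ID, and owner/repo are placeholders rather than real references:

```markdown
- (ArxivYY.MM.DD) Paper Title [Paper](https://arxiv.org/abs/XXXX.XXXXX) [Code](https://github.com/owner/repo) ![Stars](https://img.shields.io/github/stars/owner/repo)
```

Drop the [Code] link and badge for papers without a public repository.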
## Survey
- (Arxiv23.06.23) A Survey on Multimodal Large Language Models [Paper](https://arxiv.org/abs/2306.13549) [Code](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models) ![Stars](https://img.shields.io/github/stars/BradyFU/Awesome-Multimodal-Large-Language-Models)
- (Arxiv23.11.10) How to Bridge the Gap between Modalities: A Comprehensive Survey on Multimodal Large Language Model [Paper](https://arxiv.org/abs/2311.07594)
- (Arxiv23.11.21) A Survey on Multimodal Large Language Models for Autonomous Driving [Paper](https://arxiv.org/abs/2311.12320)
- (Arxiv24.01.10) Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning [Paper](https://arxiv.org/abs/2401.06805)
- (Arxiv24.02.01) Safety of Multimodal Large Language Models on Images and Text [Paper](https://arxiv.org/abs/2402.00357) [Code](https://github.com/isXinLiu/MLLM-Safety-Collection) ![Stars](https://img.shields.io/github/stars/isXinLiu/MLLM-Safety-Collection)
- (Arxiv24.02.19) The (R)Evolution of Multimodal Large Language Models: A Survey [Paper](https://arxiv.org/abs/2402.12451)
- (Arxiv24.05.17) Efficient Multimodal Large Language Models: A Survey [Paper](https://arxiv.org/abs/2405.10739) [Code](https://github.com/lijiannuist/Efficient-Multimodal-LLMs-Survey) ![Stars](https://img.shields.io/github/stars/lijiannuist/Efficient-Multimodal-LLMs-Survey)
- (Arxiv24.05.29) LLMs Meet Multimodal Generation and Editing: A Survey [Paper](https://arxiv.org/abs/2405.19334) [Code](https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation) ![Stars](https://img.shields.io/github/stars/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation)
## VLM
## MLLM
- (Arxiv23.09.01) HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving [Paper](https://arxiv.org/abs/2309.05186)
- (Arxiv24.01.18) Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models [Paper](https://arxiv.org/abs/2309.05186)
- (Arxiv24.01.26) From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities [Paper](https://arxiv.org/abs/2401.15071)
- (Arxiv24.03.25) Elysium: Exploring Object-level Perception in Videos via MLLM [Paper](https://arxiv.org/abs/2403.16558)
- (Arxiv24.04.28) WorldGPT: Empowering LLM as Multimodal World Model [Paper](https://arxiv.org/abs/2404.18202)
- (Arxiv24.05.01) EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model [Paper](https://arxiv.org/abs/2405.00574)
- (Arxiv24.05.31) ToxVidLLM: A Multimodal LLM-based Framework for Toxicity Detection in Code-Mixed Videos [Paper](https://arxiv.org/abs/2405.20628)
- (Arxiv24.06.03) SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model [Paper](https://arxiv.org/abs/2406.01584)
- (**Arxiv24.06.05, ICML24**) Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models [Paper](https://arxiv.org/abs/2406.02915) [Code](https://github.com/tmlr-group/WCA) ![Stars](https://img.shields.io/github/stars/tmlr-group/WCA)
- (Arxiv24.06.24) Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [Paper](https://arxiv.org/abs/2406.16860)
## VideoLLM
- (Arxiv23.10.02) DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [Paper](https://arxiv.org/abs/2310.01412)
- (Arxiv23.11.25) GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation [Paper](https://arxiv.org/abs/2311.16511)
- (Arxiv23.11.28) MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [Paper](https://arxiv.org/abs/2311.17005)
- (Arxiv23.12.05) EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model [Paper](https://arxiv.org/abs/2312.02483)
- (Arxiv24.01.02) Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models [Paper](https://arxiv.org/abs/2401.00988)
- (Arxiv24.04.18) From Image to Video, what do we need in multimodal LLMs? [Paper](https://arxiv.org/abs/2404.11865)
- (Arxiv24.04.25) Semantically consistent Video-to-Audio Generation using Multimodal Language Large Model [Paper](https://arxiv.org/abs/2404.16305)
- (Arxiv24.05.13) FreeVA: Offline MLLM as Training-Free Video Assistant [Paper](https://arxiv.org/abs/2405.07798)
- (Arxiv24.05.29) VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos [Paper](https://arxiv.org/abs/2405.19209)
- (Arxiv24.05.30) MotionLLM: Understanding Human Behaviors from Human Motions and Videos [Paper](https://arxiv.org/abs/2405.20340)
- (Arxiv24.05.31) Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis [Paper](https://arxiv.org/abs/2405.21075)
- (Arxiv24.06.06) MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding [Paper](https://arxiv.org/abs/2406.04264)
- (Arxiv24.06.10) Vript: A Video Is Worth Thousands of Words [Paper](https://arxiv.org/abs/2406.06040) [Code](https://github.com/mutonix/Vript) ![Stars](https://img.shields.io/github/stars/mutonix/Vript)
- (Arxiv24.06.11) VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs [Paper](https://arxiv.org/abs/2406.07476)
## Other Useful Sources
- [Awesome-LLMs-for-Video-Understanding](https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding)
- [VLM-Eval: A General Evaluation on Video Large Language Models](https://github.com/zyayoung/Awesome-Video-LLMs)
- [LLMs Meet Multimodal Generation and Editing: A Survey](https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation)