https://github.com/damo-nlp-sg/videorefer
[CVPR 2025] The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM"
- Host: GitHub
- URL: https://github.com/damo-nlp-sg/videorefer
- Owner: DAMO-NLP-SG
- Created: 2024-12-23T05:24:06.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-04-28T15:40:04.000Z (5 months ago)
- Last Synced: 2025-05-08T19:05:12.803Z (5 months ago)
- Topics: mllm, pixel-understanding, sam2, video-understanding
- Language: Python
- Homepage:
- Size: 130 MB
- Stars: 194
- Watchers: 10
- Forks: 11
- Open Issues: 5
Metadata Files:
- Readme: README.md
README
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
[📑 Paper (arXiv)](http://arxiv.org/abs/2501.00599) · [🤗 VideoRefer-700K Dataset](https://huggingface.co/datasets/DAMO-NLP-SG/VideoRefer-700K) · [🤗 VideoRefer-7B Model](https://huggingface.co/DAMO-NLP-SG/VideoRefer-7B) · [🤗 VideoRefer-Bench](https://huggingface.co/datasets/DAMO-NLP-SG/VideoRefer-Bench) · [🎥 Demo Video (YouTube)](https://www.youtube.com/watch?v=gLNOj1OPFJE) · [🌐 Project Page](https://damo-nlp-sg.github.io/VideoRefer/)
VideoRefer can understand any object you're interested in within a video.

## 📰 News
* **[2025.4.22]** 🔥Our VideoRefer-Bench has been adopted in [Describe Anything Model](https://arxiv.org/pdf/2504.16072) (NVIDIA & UC Berkeley).
* **[2025.2.27]** 🔥VideoRefer Suite has been accepted to CVPR2025!
* **[2025.2.18]** 🔥We release the [VideoRefer-700K dataset](https://huggingface.co/datasets/DAMO-NLP-SG/VideoRefer-700K) on HuggingFace.
* **[2025.1.1]** 🔥We release the code of VideoRefer and the VideoRefer-Bench.

## 🎥 Video
https://github.com/user-attachments/assets/d943c101-72f3-48aa-9822-9cfa46fa114b
- HD video can be viewed on [YouTube](https://www.youtube.com/watch?v=gLNOj1OPFJE).
## 🔍 About VideoRefer Suite
`VideoRefer Suite` is designed to enhance the fine-grained spatial-temporal understanding capabilities of Video Large Language Models (Video LLMs). It consists of three primary components:
* **Model (VideoRefer)**
`VideoRefer` is an effective Video LLM, which enables fine-grained perceiving, reasoning, and retrieval for user-defined regions at any specified timestamps—supporting both single-frame and multi-frame region inputs.
* **Dataset (VideoRefer-700K)**
`VideoRefer-700K` is a large-scale, high-quality object-level video instruction dataset, curated with a sophisticated multi-agent data engine to fill the gap in high-quality object-level video instruction data.
* **Benchmark (VideoRefer-Bench)**
`VideoRefer-Bench` is a comprehensive benchmark to evaluate the object-level video understanding capabilities of a model, which consists of two sub-benchmarks: **VideoRefer-Bench-D** and **VideoRefer-Bench-Q**.
## 🛠️ Requirements and Installation
Basic Dependencies:
* Python >= 3.8
* PyTorch >= 2.2.0
* CUDA Version >= 11.8
* transformers == 4.40.0 (for reproducing paper results)
* tokenizers == 0.19.1

Install required packages:
```bash
git clone https://github.com/DAMO-NLP-SG/VideoRefer
cd VideoRefer
pip install -r requirements.txt
pip install flash-attn==2.5.8 --no-build-isolation
```

## 🌟 Getting started
Please refer to the examples in [infer.ipynb](./demo/infer.ipynb) for detailed instructions on how to use our model for single video inference, which supports both single-frame and multi-frame modes.
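As a rough illustration of the two modes (not the notebook's exact interface), a per-object region input can be thought of as a mapping from frame index to a binary mask: single-frame mode annotates the object on one frame, while multi-frame mode annotates it on several. The numpy layout below is an assumption that mirrors the annotation format described in the VideoRefer-700K section; the actual argument names in `infer.ipynb` may differ.

```python
# Illustrative sketch only -- the exact structures expected by infer.ipynb may differ.
# A region is given per object as {frame_index: H x W binary mask}.
import numpy as np

H, W = 720, 1280  # placeholder video resolution

# Single-frame mode: the target object is specified on one frame only.
single_frame_region = {
    12: np.zeros((H, W), dtype=np.uint8),  # fill with the object's mask on frame 12
}

# Multi-frame mode: the same object is specified on several sampled frames.
multi_frame_region = {
    0:  np.zeros((H, W), dtype=np.uint8),
    12: np.zeros((H, W), dtype=np.uint8),
    24: np.zeros((H, W), dtype=np.uint8),
}
```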
For a better experience, the demo integrates [SAM2](https://github.com/facebookresearch/sam2). To get started, please install SAM2 first:
```shell
git clone https://github.com/facebookresearch/sam2.git && cd sam2
SAM2_BUILD_CUDA=0 pip install -e ".[notebooks]"
```
Then, download [sam2.1_hiera_large.pt](https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt) to `checkpoints`.
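If you prefer a script over the notebook, the snippet below is a minimal sketch of the standard SAM2 image-predictor workflow for turning a single click on an extracted frame into a binary object mask; the frame path and click coordinates are placeholders, and `infer.ipynb` may wire SAM2 up differently.

```python
# Minimal sketch: prompt SAM2 with one foreground click on an extracted video frame
# to obtain a binary mask for the object of interest. Paths and coordinates are placeholders.
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(
    build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt")
)

frame = np.array(Image.open("frames/0000.jpg").convert("RGB"))  # a frame you extracted
with torch.inference_mode():
    predictor.set_image(frame)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[480, 270]]),  # (x, y) click on the target object
        point_labels=np.array([1]),           # 1 = foreground point
        multimask_output=False,
    )
object_mask = masks[0].astype(np.uint8)  # H x W binary mask for the referred object
```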
## 🗝️ Training & Evaluation

### Training
The training data and data structure can be found in [Dataset preparation](training.md).

The training pipeline of our model is structured into four distinct stages.
- **Stage1: Image-Text Alignment Pre-training**
  - We use the same data as in [VideoLLaMA2.1](https://github.com/DAMO-NLP-SG/VideoLLaMA2).
  - The pretrained projector weights can be found in [VideoLLaMA2.1-7B-16F-Base](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2.1-7B-16F-Base).
- **Stage2: Region-Text Alignment Pre-training**
  - Prepare datasets used for stage2.
  - Run `bash scripts/train/stage2.sh`.
- **Stage2.5: High-Quality Knowledge Learning**
  - Prepare datasets used for stage2.5.
  - Run `bash scripts/train/stage2.5.sh`.
- **Stage3: Visual Instruction Tuning**
  - Prepare datasets used for stage3.
  - Run `bash scripts/train/stage3.sh`.
### Evaluation
For model evaluation, please refer to [eval](eval/eval.md).

## 🌏 Model Zoo
| Model Name | Visual Encoder | Language Decoder | # Training Frames |
|:----------------|:----------------|:------------------|:----------------:|
| [VideoRefer-7B](https://huggingface.co/DAMO-NLP-SG/VideoRefer-7B) | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |
| [VideoRefer-7B-stage2](https://huggingface.co/DAMO-NLP-SG/VideoRefer-7B-stage2) | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |
| [VideoRefer-7B-stage2.5](https://huggingface.co/DAMO-NLP-SG/VideoRefer-7B-stage2.5) | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |
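To pull any of these checkpoints locally, a standard Hugging Face Hub download works; the target directory below is just an example.

```python
# Download the VideoRefer-7B weights from the Hugging Face Hub.
# local_dir is an arbitrary example path.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="DAMO-NLP-SG/VideoRefer-7B",
    local_dir="checkpoints/VideoRefer-7B",
)
```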
## 🖨️ VideoRefer-700K

The dataset can be accessed on [Hugging Face](https://huggingface.co/datasets/DAMO-NLP-SG/VideoRefer-700K). By leveraging our multi-agent data engine, we meticulously create three primary types of object-level video instruction data:
- Object-level Detailed Caption
- Object-level Short Caption
- Object-level QA

Video sources:
- Detailed & Short Caption
  - [Panda-70M](https://snap-research.github.io/Panda-70M/)
- QA
  - [MeViS](https://codalab.lisn.upsaclay.fr/competitions/15094)
  - [A2D](https://web.eecs.umich.edu/~jjcorso/r/a2d/index.html#downloads)
  - [YouTube-VOS](https://competitions.codalab.org/competitions/29139#participate-get_data)

Data format:
```json
[
    {
        "video": "videos/xxx.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nWhat is the relationship of <region> and <region>?"
            },
            {
                "from": "gpt",
                "value": "...."
            },
            ...
        ],
        "annotation": [
            // object1
            {
                "frame_idx": {
                    "segmentation": {
                        // rle format or polygon
                    }
                },
                "frame_idx": {
                    "segmentation": {
                        // rle format or polygon
                    }
                }
            },
            // object2
            {
                "frame_idx": {
                    "segmentation": {
                        // rle format or polygon
                    }
                }
            },
            ...
        ]
    }
]
```
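As a small sketch of how this format can be consumed (assuming the segmentations follow the usual COCO conventions, which the files themselves should confirm), one object's annotation can be decoded into per-frame binary masks with `pycocotools`:

```python
# Sketch: decode one object's per-frame segmentations into binary masks.
# Assumes COCO-style RLE dicts ({"size": [H, W], "counts": ...}) or polygon lists;
# the JSON filename and fallback resolution below are placeholders.
import json
import numpy as np
from pycocotools import mask as mask_utils

with open("videorefer_700k_subset.json") as f:   # placeholder filename
    samples = json.load(f)

first_object = samples[0]["annotation"][0]       # {frame_idx: {"segmentation": ...}}
masks = {}
for frame_idx, ann in first_object.items():
    seg = ann["segmentation"]
    if isinstance(seg, dict):                    # RLE-encoded mask
        masks[int(frame_idx)] = mask_utils.decode(seg)
    else:                                        # polygon(s): [[x1, y1, x2, y2, ...], ...]
        h, w = 720, 1280                         # placeholder frame size
        rles = mask_utils.frPyObjects(seg, h, w)
        masks[int(frame_idx)] = mask_utils.decode(mask_utils.merge(rles))

print({idx: m.shape for idx, m in masks.items()})
```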
## 🕹️ VideoRefer-Bench

`VideoRefer-Bench` assesses models in two key areas: Description Generation, corresponding to `VideoRefer-Bench-D`, and Multiple-choice Question Answering, corresponding to `VideoRefer-Bench-Q`.
https://github.com/user-attachments/assets/33757d27-56bd-4523-92da-8f5a58fe5c85
- The annotations of the benchmark can be found in [🤗benchmark](https://huggingface.co/datasets/DAMO-NLP-SG/VideoRefer-Bench).
- The usage of VideoRefer-Bench is detailed in [doc](./benchmark/README.md).
- To evaluate general MLLMs on VideoRefer-Bench, please refer to [eval](./benchmark/evaluation_general_mllms.md).
## 📑 Citation
If you find VideoRefer Suite useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{yuan2025videorefersuite,
  title   = {VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM},
  author  = {Yuqian Yuan and Hang Zhang and Wentong Li and Zesen Cheng and Boqiang Zhang and Long Li and Xin Li and Deli Zhao and Wenqiao Zhang and Yueting Zhuang and Jianke Zhu and Lidong Bing},
  journal = {arXiv},
  year    = {2025},
  url     = {http://arxiv.org/abs/2501.00599}
}
```

💡 Some other multimodal-LLM projects from our team may interest you ✨.
> [**Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding**](https://github.com/DAMO-NLP-SG/Video-LLaMA)
> Hang Zhang, Xin Li, Lidong Bing
[GitHub](https://github.com/DAMO-NLP-SG/Video-LLaMA) · [arXiv](https://arxiv.org/abs/2306.02858)

> [**VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs**](https://github.com/DAMO-NLP-SG/VideoLLaMA2)
> Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing
[GitHub](https://github.com/DAMO-NLP-SG/VideoLLaMA2) · [arXiv](https://arxiv.org/abs/2406.07476)

> [**Osprey: Pixel Understanding with Visual Instruction Tuning**](https://github.com/CircleRadon/Osprey)
> Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, Jianke Zhu
[GitHub](https://github.com/CircleRadon/Osprey) · [arXiv](https://arxiv.org/abs/2312.10032)

## 👍 Acknowledgement
The codebase of VideoRefer is adapted from [**VideoLLaMA 2**](https://github.com/DAMO-NLP-SG/VideoLLaMA2).
The visual encoder and language decoder we used in VideoRefer are [**SigLIP**](https://huggingface.co/google/siglip-so400m-patch14-384) and [**Qwen2**](https://huggingface.co/collections/Qwen/qwen2-6659360b33528ced941e557f), respectively.