# LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

Official Dataloader and Evaluation Scripts for LongVideoBench.
## Introduction
*(Left) A referring reasoning question; (right) results with different numbers of input frames.*
## Initial Leaderboard

View more on the Hugging Face leaderboard.
## [Custom Use] Load the LongVideoBench Dataset
1. Download the dataset via the Hugging Face CLI:
```shell
huggingface-cli download longvideobench/LongVideoBench --repo-type dataset --local-dir LongVideoBench --local-dir-use-symlinks False
```
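The same download can also be scripted from Python with `huggingface_hub` (a minimal sketch, assuming the `huggingface_hub` package is installed; the CLI above is the documented route):
```python
from huggingface_hub import snapshot_download

# Fetch the full dataset repository into ./LongVideoBench
snapshot_download(
    repo_id="longvideobench/LongVideoBench",
    repo_type="dataset",
    local_dir="LongVideoBench",
)
```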
2. Extract from the `.tar` files:
```shell
cat videos.tar.part.* > videos.tar
tar -xvf videos.tar
tar -xvf subtitles.tar
```
3. Use the LongVideoBench dataloader to load the data from raw MP4 files and subtitles:
- (a) Install the dataloader:
```shell
git clone https://github.com/LongVideoBench/LongVideoBench.git
cd LongVideoBench
pip install -e .
```
- (b) Load the dataset in Python scripts:
```python
from longvideobench import LongVideoBenchDataset

# validation
dataset = LongVideoBenchDataset(YOUR_DATA_PATH, "lvb_val.json", max_num_frames=64)

# test
dataset = LongVideoBenchDataset(YOUR_DATA_PATH, "lvb_test_wo_gt.json", max_num_frames=64)

print(dataset[0]["inputs"])  # A list consisting of PIL.Image and strings.
```
The "inputs" are interleaved video frames and text subtitles, followed by questions and option prompts. You can then convert them to the format that your LMMs can accept.
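As an illustration, here is a minimal sketch of such a conversion into a generic chat-style message list (the `{"type": ...}` schema and the `to_chat_messages` helper are hypothetical; adapt them to whatever your LMM's API expects):
```python
from PIL import Image

def to_chat_messages(inputs):
    """Pack interleaved PIL.Image / str items into one user message."""
    content = []
    for item in inputs:
        if isinstance(item, Image.Image):
            # Video frame: wrap as an image entry (illustrative schema).
            content.append({"type": "image", "image": item})
        else:
            # Subtitle text, question, or option prompt.
            content.append({"type": "text", "text": item})
    return [{"role": "user", "content": content}]

messages = to_chat_messages(dataset[0]["inputs"])
```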
## [Automatic] Evaluating with LMMs-Eval
LongVideoBench has been integrated into the [LMMs-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) library for automatic evaluation. With datasets and models hosted on Hugging Face, you can start automatic evaluation once the LMMs-Eval library is properly installed.
### Install
Please install LMMs-Eval as follows:
```shell
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e .
```
This will install the GitHub main version, which supports the tasks `longvideobench_val_i` (LongVideoBench for Image LMMs) and `longvideobench_val_v` (LongVideoBench for Video-specific LMMs).
### Example Use (Image LMMs)
We feed 16 frames by default for Image LMMs. To modify this, go to `lmms_eval/tasks/longvideobench/utils.py` and change the `max_num_frames` parameter to another value (e.g. 4, 8, or 32, or even 64, 128, or 256 for proprietary models).
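For intuition, uniform sampling of `max_num_frames` frames from a video looks roughly like the sketch below (illustrative only; the actual sampling logic lives in LMMs-Eval's `utils.py` and may differ):
```python
def uniform_frame_indices(total_frames: int, max_num_frames: int) -> list:
    """Pick up to max_num_frames indices spread evenly across a video."""
    if total_frames <= max_num_frames:
        return list(range(total_frames))
    step = total_frames / max_num_frames
    # Take the midpoint of each of the max_num_frames equal-length segments.
    return [int(step * i + step / 2) for i in range(max_num_frames)]

# e.g. a 900-frame video sampled at 16 frames:
print(uniform_frame_indices(900, 16))
```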
- Idefics2
```shell
python3 -m accelerate.commands.launch --num_processes=8 -m lmms_eval --model idefics2 --tasks longvideobench_val_i --batch_size 1 --log_samples --log_samples_suffix idefics2_lvb_i --output_path ./logs/
```
- Phi3V
```shell
python3 -m accelerate.commands.launch --num_processes=8 -m lmms_eval --model phi3v --tasks longvideobench_val_i --batch_size 1 --log_samples --log_samples_suffix phi3v_lvb_i --output_path ./logs/
```
### Example Use (Video-specific LMMs)
- LLaVA-NeXT-Video-34B-DPO (32 frames)
```shell
python3 -m accelerate.commands.launch --num_processes=8 -m lmms_eval --model llavavid --model_args pretrained="lmms-lab/LLaVA-NeXT-Video-34B-DPO",max_frames_num=32,conv_template=chatml_direct,video_decode_backend="decord" --tasks longvideobench_val_v --batch_size 1 --log_samples --log_samples_suffix llavavid_34b_dpo_lvb_v --output_path ./logs/
```
- LLaVA-NeXT-Video-7B-DPO (32 frames)
```shell
python3 -m accelerate.commands.launch --num_processes=8 -m lmms_eval --model llavavid --model_args pretrained="lmms-lab/LLaVA-NeXT-Video-7B-DPO",max_frames_num=32,video_decode_backend="decord" --tasks longvideobench_val_v --batch_size 1 --log_samples --log_samples_suffix llavavid_7b_dpo_lvb_v --output_path ./logs/
```
- Video-LLaVA (8 frames)
```shell
python3 -m accelerate.commands.launch --num_processes=8 -m lmms_eval --model video_llava --tasks longvideobench_val_v --batch_size 1 --log_samples --log_samples_suffix video_llava_lvb_v --output_path ./logs/
```
## Contact
Please contact `[email protected]` for any queries.
## License
This dataset is released under the CC-BY-NC-SA 4.0 license. Please use this dataset for non-commercial purposes ONLY.
## Citation
```bibtex
@misc{wu2024longvideobench,
  title={LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding},
  author={Haoning Wu and Dongxu Li and Bei Chen and Junnan Li},
  year={2024},
  eprint={2407.15754},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.15754},
}
```