https://github.com/tiger-ai-lab/videoeval-pro

More reliable Video Understanding Evaluation
evaluation multimodal understanding video

Host: GitHub
URL: https://github.com/tiger-ai-lab/videoeval-pro
Owner: TIGER-AI-Lab
Created: 2025-05-15T20:09:40.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-06-10T14:07:10.000Z (about 1 year ago)
Last Synced: 2025-06-10T15:31:00.416Z (about 1 year ago)
Topics: evaluation, multimodal, understanding, video
Language: Python
Homepage: https://tiger-ai-lab.github.io/VideoEval-Pro
Size: 6.6 MB
Stars: 6
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# VideoEval-Pro
This repository contains the evaluation code for VideoEval-Pro.
The data is available on HuggingFace: [VideoEval-Pro](https://huggingface.co/datasets/TIGER-Lab/VideoEval-Pro)

## Dataset Introduction
VideoEval-Pro is a robust and realistic long video understanding benchmark of open-ended, short-answer QA problems. It is built by converting multiple-choice questions (MCQs) from four existing long video understanding benchmarks (Video-MME, MLVU, LVBench, and LongVideoBench) into free-form questions.

Each example in the dataset contains:
- `video`: Name (path) of the video file
- `question`: The question about the video content
- `options`: Original options from the source benchmark
- `answer`: The correct MCQ answer
- `answer_text`: The correct free-form answer
- `meta`: Additional metadata from the source benchmark
- `source`: Source benchmark
- `qa_subtype`: Question task subtype
- `qa_type`: Question task type
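
As a rough illustration of the schema above, the sketch below checks that a record carries all nine fields. The helper `missing_fields` is hypothetical and the sample values are invented placeholders, not real dataset entries; the actual value formats (for example, how `options` is encoded) should be checked against the HuggingFace dataset.

```python
from typing import List

# Field names are taken from the list above; everything else is illustrative.
EXPECTED_FIELDS = (
    "video", "question", "options", "answer", "answer_text",
    "meta", "source", "qa_subtype", "qa_type",
)


def missing_fields(example: dict) -> List[str]:
    """Return the expected field names that are absent from `example`."""
    return [name for name in EXPECTED_FIELDS if name not in example]


# Placeholder record: the values are made up for demonstration only.
sample = {
    "video": "example_video.mp4",
    "question": "What does the person pick up first?",
    "options": ["A. A cup", "B. A book"],
    "answer": "A",
    "answer_text": "A cup",
    "meta": {},
    "source": "Video-MME",
    "qa_subtype": "placeholder_subtype",
    "qa_type": "placeholder_type",
}

print(missing_fields(sample))
```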

## Evaluation Steps

1. **Download and Prepare Videos**
```bash
# Download the dataset from HuggingFace
git lfs install
git clone https://huggingface.co/datasets/TIGER-Lab/VideoEval-Pro

# Navigate to videos directory
cd VideoEval-Pro/videos

# Merge all split tar.gz files into a single archive
cat videos_part_*.tar.gz > videos_merged.tar.gz

# Extract the merged archive
tar -xzf videos_merged.tar.gz

# [Optional] Clean up the split files and merged archive
rm videos_part_*.tar.gz videos_merged.tar.gz

# After extraction, you will get a directory containing all videos
# The path to this directory will be used as --video_root in evaluation
# For example: 'VideoEval-Pro/videos'
```
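
If `cat` and `tar` are not available (for example on Windows), the merge and extract steps above can be approximated in Python. This is a minimal sketch using only the standard library; the function name `merge_and_extract` is hypothetical, and the default file names simply mirror the shell commands above.

```python
import glob
import os
import shutil
import tarfile


def merge_and_extract(videos_dir: str,
                      pattern: str = "videos_part_*.tar.gz",
                      merged_name: str = "videos_merged.tar.gz") -> str:
    """Concatenate split archive parts in sorted order, then extract them."""
    parts = sorted(glob.glob(os.path.join(videos_dir, pattern)))
    if not parts:
        raise FileNotFoundError(f"no files matching {pattern!r} in {videos_dir}")

    merged_path = os.path.join(videos_dir, merged_name)
    # Byte-wise concatenation, equivalent to `cat videos_part_* > merged`.
    with open(merged_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, merged)

    # Only extract archives you trust: tar members can carry arbitrary paths.
    with tarfile.open(merged_path, "r:gz") as archive:
        archive.extractall(videos_dir)
    return merged_path
```

Calling `merge_and_extract("VideoEval-Pro/videos")` would then leave the extracted videos next to the archive parts, as in the shell version.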

2. **[Optional] Pre-extract Frames**
To improve efficiency, you can pre-extract frames from videos. The extracted frames should be organized as follows:
```
frames_root/
├── video_name_1/ # Video name
│ ├── 000001.jpg # Frame images
│ ├── 000002.jpg
│ └── ...
├── video_name_2/
│ ├── 000001.jpg
│ ├── 000002.jpg
│ └── ...
└── ...
```

After frame extraction, the path to the frames will be used as `--frames_root`. Set `--using_frames True` when running the evaluation script.
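
To make the layout concrete, here is a small sketch that reads one video's frame directory and picks evenly spaced frames from it. It assumes the `frames_root/<video_name>/<index>.jpg` structure shown above; the helper `sample_frame_paths` is hypothetical, and uniform sampling is an assumption for illustration, so the evaluation scripts may select frames differently.

```python
import os
from typing import List


def sample_frame_paths(frames_root: str, video_name: str, num_frames: int) -> List[str]:
    """Pick up to `num_frames` evenly spaced frame files for one video."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    video_dir = os.path.join(frames_root, video_name)
    # Zero-padded names (000001.jpg, ...) sort correctly as plain strings.
    frames = sorted(f for f in os.listdir(video_dir) if f.lower().endswith(".jpg"))
    if not frames:
        raise FileNotFoundError(f"no .jpg frames found in {video_dir}")
    if num_frames >= len(frames):
        return [os.path.join(video_dir, f) for f in frames]
    # Evenly spaced indices from the first frame to the last one.
    step = (len(frames) - 1) / (num_frames - 1) if num_frames > 1 else 0
    indices = [round(i * step) for i in range(num_frames)]
    return [os.path.join(video_dir, frames[i]) for i in indices]
```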

3. **Setup Evaluation Environment**
```bash
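# Clone the evaluation code from GitHub
# (run this from a different parent directory than the dataset clone, since both repositories are named VideoEval-Pro)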
git clone https://github.com/TIGER-AI-Lab/VideoEval-Pro
cd VideoEval-Pro

# Create a conda environment from one of the YAML env files
# (there are different env files for different models; replace <env_file> with the one for your model)
conda env create -n videoevalpro -f <env_file>.yaml
conda activate videoevalpro
```

4. **Run Evaluation**
```bash
cd VideoEval-Pro

# Set PYTHONPATH
export PYTHONPATH=.

# Run evaluation script with the following parameters:
# --video_root: Path to video files folder
# --frames_root: Path to video frames folder (used when --using_frames is True)
# --output_path: Path to save output results
# --using_frames: Whether to use pre-extracted frames
# --model_path: Path to model
# --device: Device to run inference on
# --num_frames: Number of frames to sample from video
# --max_retries: Maximum number of retries for failed inference
# --num_threads: Number of threads for parallel processing

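# Template (replace *_chat.py with the script for your model and supply a value for each flag):
# python tools/*_chat.py \
#   --video_root <video_root> \
#   --frames_root <frames_root> \
#   --output_path <output_path> \
#   --using_frames <True|False> \
#   --model_path <model_path> \
#   --device <device> \
#   --num_frames <num_frames> \
#   --max_retries <max_retries> \
#   --num_threads <num_threads>

# Example: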
python tools/qwen_chat.py \
--video_root ./videos \
--frames_root ./frames \
--output_path ./results/qwen_results.jsonl \
--using_frames False \
--model_path Qwen/Qwen2-VL-7B-Instruct \
--device cuda \
--num_frames 32 \
--max_retries 10 \
--num_threads 1
```
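
The `--max_retries` and `--num_threads` options suggest a retry loop wrapped in a thread pool. The sketch below shows one way such a pattern can look; it is not taken from the repository's scripts, and `infer`, `run_with_retries`, and `run_all` are hypothetical names introduced only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def run_with_retries(infer: Callable[[dict], str], example: dict,
                     max_retries: int) -> Optional[str]:
    """Call `infer` until it succeeds or `max_retries` attempts are used up."""
    for _ in range(max_retries):
        try:
            return infer(example)
        except Exception:
            continue  # a real script would likely log the error here
    return None  # every attempt failed


def run_all(infer: Callable[[dict], str], examples: List[dict],
            max_retries: int, num_threads: int) -> List[Optional[str]]:
    """Process examples in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(
            lambda ex: run_with_retries(infer, ex, max_retries), examples))
```

With a local GPU model, `--num_threads 1` (as in the example above) avoids several threads competing for the same device; more threads mainly help when inference is an API call.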

5. **Judge the results**
```bash
cd VideoEval-Pro

# Set PYTHONPATH
export PYTHONPATH=.

# Run the judge script gpt4o_judge.py with the following parameters:
# --input_path: Path to the saved inference results (the --output_path from step 4)
# --output_path: Path to judged results
# --model_name: Version of the judge model
# --num_threads: Number of threads for parallel processing

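# Template (supply a value for each flag):
# python tools/gpt4o_judge.py \
#   --input_path <input_path> \
#   --output_path <output_path> \
#   --model_name <model_name> \
#   --num_threads <num_threads>

# Example: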
python tools/gpt4o_judge.py \
--input_path ./results/qwen_results.jsonl \
--output_path ./results/qwen_results_judged.jsonl \
--model_name gpt-4o-2024-08-06 \
--num_threads 1
```
**Note: the released results are judged by *gpt-4o-2024-08-06***
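
Once the judged file exists, per-category accuracy can be computed with a few lines of Python. The sketch below assumes each JSONL line is an object with a truthy/falsy correctness field and a grouping field; the key names `is_correct` and `qa_type` used as defaults are assumptions, not the judge script's documented output, so inspect the judged file and pass the actual keys.

```python
import json
from collections import defaultdict
from typing import Dict


def accuracy_by_group(jsonl_path: str,
                      correct_key: str = "is_correct",  # assumed field name
                      group_key: str = "qa_type") -> Dict[str, float]:
    """Return the fraction of correct records for each value of `group_key`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            group = record.get(group_key, "unknown")
            totals[group] += 1
            hits[group] += bool(record.get(correct_key))
    return {group: hits[group] / totals[group] for group in totals}
```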
