https://github.com/thunlp-mt/streamingbench
StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding
- Host: GitHub
- URL: https://github.com/thunlp-mt/streamingbench
- Owner: THUNLP-MT
- License: mit
- Created: 2024-11-05T13:12:54.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-27T03:10:28.000Z (about 1 year ago)
- Last Synced: 2025-03-31T08:12:02.931Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 31.4 MB
- Stars: 114
- Watchers: 5
- Forks: 3
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding
**StreamingBench** evaluates **Multimodal Large Language Models (MLLMs)** on real-time, streaming video understanding tasks.
------
[**NEW!** 2025.05.15]: [Seed1.5-VL](https://github.com/ByteDance-Seed/Seed1.5-VL) achieved all-model SOTA with a score of 82.80 on Proactive Output.
[**NEW!** 2025.03.17]: [ViSpeak](https://arxiv.org/abs/2503.12769) achieved open-source SOTA with a score of 61.60 on Omni-Source Understanding.
[**NEW!** 2025.01.14]: [MiniCPM-o 2.6](https://github.com/OpenBMB/MiniCPM-o) achieved streaming SOTA with a score of 66.01 on the Overall benchmark.
[**NEW!** 2025.01.06]: [Dispider](https://github.com/Mark12Ding/Dispider) achieved streaming SOTA with a score of 53.12 on the Overall benchmark.
[**NEW!** 2024.12.09]: [InternLM-XComposer2.5-OmniLive](https://github.com/InternLM/InternLM-XComposer) achieved 73.79 on Real-Time Visual Understanding.
------
## Overview
As MLLMs continue to advance, they remain largely focused on offline video comprehension, where all frames are pre-loaded before any query is made. This falls well short of the human ability to process and respond to video streams in real time, capturing the dynamic nature of multimedia content. To bridge this gap, **StreamingBench** introduces the first comprehensive benchmark for streaming video understanding in MLLMs.
### Key Evaluation Aspects
- **Real-time Visual Understanding**: Can the model process and respond to visual changes in real time?
- **Omni-source Understanding**: Does the model integrate visual and audio inputs synchronously in real-time video streams?
- **Contextual Understanding**: Can the model comprehend the broader context within video streams?
### Dataset Statistics
- **900** diverse videos
- **4,500** human-annotated QA pairs
- Five questions per video at different timestamps
#### Video Categories
#### Task Taxonomy
## Dataset Examples
https://github.com/user-attachments/assets/e6d1655d-ab3f-47a7-973a-8fd6c8962307
## Evaluation Pipeline
### Requirements
- Python 3.x
- ffmpeg-python
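
A quick way to confirm the requirements above before preprocessing is a small check script. This is a minimal sketch, not part of the repository: the helper name `check_environment` is hypothetical, and it assumes that the `ffmpeg-python` package (imported as `ffmpeg`) also needs the `ffmpeg` executable on your `PATH`, since the package is a wrapper around that binary.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Report whether the basic requirements appear to be available.

    Hypothetical helper for illustration; the repository does not ship it.
    """
    return {
        # The README only asks for "Python 3.x".
        "python3": sys.version_info.major == 3,
        # The ffmpeg-python package is imported under the name `ffmpeg`.
        "ffmpeg_python": importlib.util.find_spec("ffmpeg") is not None,
        # Assumption: the ffmpeg executable must also be installed and on PATH.
        "ffmpeg_binary": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```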
### Data Preparation
1. **Download Dataset**: Retrieve all necessary files from the [StreamingBench Dataset](https://huggingface.co/datasets/mjuicem/StreamingBench).
2. **Decompress Files**: Extract the downloaded files and organize them in the `./data` directory as follows:
```
StreamingBench/
├── data/
│   ├── real/       # Unzip Real Time Visual Understanding_*.zip into this folder
│   ├── omni/       # Unzip other .zip files into this folder
│   ├── sqa/        # Unzip Sequential Question Answering_*.zip into this folder
│   └── proactive/  # Unzip Proactive Output_*.zip into this folder
```
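
The extraction step can also be scripted. The sketch below routes each archive to a folder by its file name prefix, following the layout above. The function names and the `./downloads` path are illustrative assumptions, and the prefix matching assumes the archives keep the names shown in the tree comments.

```python
import zipfile
from pathlib import Path

# Archive name prefix -> folder under data/, taken from the layout above.
PREFIX_TO_FOLDER = {
    "Real Time Visual Understanding_": "real",
    "Sequential Question Answering_": "sqa",
    "Proactive Output_": "proactive",
}


def target_folder(archive_name):
    """Return the data/ subfolder for an archive; unmatched archives go to omni/."""
    for prefix, folder in PREFIX_TO_FOLDER.items():
        if archive_name.startswith(prefix):
            return folder
    return "omni"


def extract_all(download_dir, data_dir):
    """Unzip every .zip file in download_dir into the matching data/ subfolder."""
    for archive in sorted(Path(download_dir).glob("*.zip")):
        dest = Path(data_dir) / target_folder(archive.name)
        dest.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
```

For example, `extract_all("./downloads", "./data")` would populate `./data/real`, `./data/omni`, `./data/sqa`, and `./data/proactive`, assuming the downloaded archives sit in a hypothetical `./downloads` directory.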
3. **Preprocess Data**: Run the following command to preprocess the data:
```bash
cd ./scripts
bash preprocess.sh
```
### Model Preparation
Prepare your own model for evaluation by following the instructions provided [here](./docs/model_guide.md). This guide will help you set up and configure your model to ensure it is ready for testing against the dataset.
### Evaluation
Now you can run the benchmark:
```sh
bash eval.sh
```
This will run the benchmark and save the results to the specified output file. Then you can calculate the metrics using the following command:
```sh
bash stats.sh
```
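
To illustrate the kind of aggregation a metrics step performs, here is a minimal sketch of per-task accuracy. It is not the logic of `stats.sh`. The record schema (`task`, `answer`, `prediction`) is a hypothetical one chosen for the example, and the actual output file format may differ.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return {task: percentage of records whose prediction matches the answer}.

    Assumes each record is a dict with hypothetical keys
    "task", "answer", and "prediction" (option letters such as "A").
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["task"]] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[rec["task"]] += 1
    return {task: 100.0 * correct[task] / total[task] for task in total}


if __name__ == "__main__":
    sample = [
        {"task": "real", "answer": "A", "prediction": "a"},
        {"task": "real", "answer": "B", "prediction": "C"},
        {"task": "omni", "answer": "D", "prediction": "D "},
    ]
    print(accuracy_by_task(sample))
```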
## Experimental Results
### Performance of Various MLLMs on StreamingBench
- 60 seconds of context preceding the query time (Main)
- All Context (+ Long Context)
- Comparison of Main Experiment vs. 60 Seconds of Video Context
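
The two context settings listed above differ only in how much video before the query time is shown to the model. The sketch below computes that window. The function name and the convention of passing `None` for the all-context setting are assumptions made for illustration, not the repository's API.

```python
def context_window(query_time, context_seconds=60.0):
    """Return (start, duration) in seconds for the clip that ends at query_time.

    context_seconds=60.0 mirrors the main setting (60 seconds before the query).
    context_seconds=None is a hypothetical way to request all preceding context.
    """
    if query_time < 0:
        raise ValueError("query_time must be non-negative")
    if context_seconds is None:
        # All-context setting: everything from the start of the video.
        return 0.0, float(query_time)
    # Clamp at 0 so early queries do not ask for video before the first frame.
    start = max(0.0, query_time - context_seconds)
    return start, query_time - start
```

A query at 95 s yields the window starting at 35 s with a 60 s duration, while a query at 20 s yields the full first 20 s. With `ffmpeg-python`, such a pair could be passed along the lines of `ffmpeg.input(path, ss=start, t=duration)`, though the repository's preprocessing may clip videos differently.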
### Performance of Different MLLMs on the Proactive Output Task
*"≤ xs" means that the answer is considered correct if the actual output time is within x seconds of the ground truth.*
## Citation
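
The tolerance rule can be written as a short check. This is a sketch under stated assumptions: the function names are hypothetical, and treating a missing output (`None`) as incorrect is an assumption rather than something the README specifies.

```python
def within_tolerance(output_time, ground_truth_time, x):
    """True if the output time is within x seconds of the ground truth."""
    return abs(output_time - ground_truth_time) <= x


def proactive_accuracy(pairs, x):
    """Percentage of (output_time, ground_truth_time) pairs within x seconds.

    Assumption: output_time is None when the model never produced an output,
    and such cases count as incorrect.
    """
    if not pairs:
        return 0.0
    hits = sum(
        1
        for out, gt in pairs
        if out is not None and within_tolerance(out, gt, x)
    )
    return 100.0 * hits / len(pairs)
```

Evaluating the same predictions at several values of x shows how the score grows as the tolerance is relaxed.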
```bibtex
@article{lin2024streaming,
title={StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding},
author={Junming Lin and Zheng Fang and Chi Chen and Zihao Wan and Fuwen Luo and Peng Li and Yang Liu and Maosong Sun},
journal={arXiv preprint arXiv:2411.03628},
year={2024}
}
```