https://github.com/thunlp-mt/streamingbench


# StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding




🏠 Project Page | 📄 arXiv Paper | 📦 Dataset | 🏅 Leaderboard

**StreamingBench** evaluates **Multimodal Large Language Models (MLLMs)** on real-time, streaming video understanding tasks. 🌟

------

[**NEW!** 2025.05.15] 🔥: [Seed1.5-VL](https://github.com/ByteDance-Seed/Seed1.5-VL) achieved SOTA among all models with a score of 82.80 on the Proactive Output task.

[**NEW!** 2025.03.17] ⭐: [ViSpeeker](https://arxiv.org/abs/2503.12769) achieved Open-Source SOTA with a score of 61.60 on Omni-Source Understanding.

[**NEW!** 2025.01.14] 🚀: [MiniCPM-o 2.6](https://github.com/OpenBMB/MiniCPM-o) achieved Streaming SOTA with a score of 66.01 on the Overall benchmark.

[**NEW!** 2025.01.06] 🏆: [Dispider](https://github.com/Mark12Ding/Dispider) achieved Streaming SOTA with a score of 53.12 on the Overall benchmark.

[**NEW!** 2024.12.09] 🎉: [InternLM-XComposer2.5-OmniLive](https://github.com/InternLM/InternLM-XComposer) achieved 73.79 on Real-Time Visual Understanding.

------

## 🎞️ Overview

As MLLMs continue to advance, they remain largely focused on offline video comprehension, where all frames are pre-loaded before any query is made. This is far from the human ability to process and respond to video streams in real time as the content changes. To bridge this gap, **StreamingBench** introduces the first comprehensive benchmark for streaming video understanding in MLLMs.

### Key Evaluation Aspects
- 🎯 **Real-time Visual Understanding**: Can the model process and respond to visual changes in real time?
- 🔊 **Omni-source Understanding**: Does the model integrate visual and audio inputs synchronously in real-time video streams?
- 🎬 **Contextual Understanding**: Can the model comprehend the broader context within video streams?

### Dataset Statistics
- 📊 **900** diverse videos
- 📝 **4,500** human-annotated QA pairs
- ⏱️ Five questions per video at different timestamps

#### 🎬 Video Categories


*Figure: Video Categories*

#### 🔍 Task Taxonomy


*Figure: Task Taxonomy*

## 📐 Dataset Examples
https://github.com/user-attachments/assets/e6d1655d-ab3f-47a7-973a-8fd6c8962307

## 🔮 Evaluation Pipeline

### Requirements

- Python 3.x
- ffmpeg-python

### Data Preparation

1. **Download Dataset**: Retrieve all necessary files from the [StreamingBench Dataset](https://huggingface.co/datasets/mjuicem/StreamingBench).

2. **Decompress Files**: Extract the downloaded files and organize them in the `./data` directory as follows:

```
StreamingBench/
├── data/
│   ├── real/       # Unzip Real Time Visual Understanding_*.zip into this folder
│   ├── omni/       # Unzip other .zip files into this folder
│   ├── sqa/        # Unzip Sequential Question Answering_*.zip into this folder
│   └── proactive/  # Unzip Proactive Output_*.zip into this folder
```

3. **Preprocess Data**: Run the following command to preprocess the data:

```bash
cd ./scripts
bash preprocess.sh
```
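The archive-to-folder mapping from step 2 can be sketched in Python as follows. This is a minimal illustration, not part of the repository: the function names `target_dir` and `extract_all` and the `./downloads` location are hypothetical, and the sketch assumes the downloaded archives keep the filename prefixes shown in the directory layout.

```python
import zipfile
from pathlib import Path

# Filename prefixes taken from the directory layout in step 2.
# Any archive that matches none of them is treated as an "other" file (omni/).
PREFIX_TO_DIR = {
    "Real Time Visual Understanding_": "real",
    "Sequential Question Answering_": "sqa",
    "Proactive Output_": "proactive",
}


def target_dir(archive_name: str) -> str:
    """Return the data/ subfolder for a downloaded archive, based on its filename."""
    for prefix, folder in PREFIX_TO_DIR.items():
        if archive_name.startswith(prefix):
            return folder
    return "omni"


def extract_all(download_dir: str, data_root: str = "./data") -> None:
    """Unzip every .zip file in download_dir into its matching data/ subfolder."""
    for archive in sorted(Path(download_dir).glob("*.zip")):
        dest = Path(data_root) / target_dir(archive.name)
        dest.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)


# Example call (hypothetical download location):
# extract_all("./downloads")
```

Run it from the `StreamingBench/` root so that `./data` resolves to the folder the preprocessing script expects.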

### Model Preparation

Prepare your own model for evaluation by following the instructions provided [here](./docs/model_guide.md). This guide will help you set up and configure your model to ensure it is ready for testing against the dataset.

### Evaluation

Now you can run the benchmark:

```sh
bash eval.sh
```

This runs the benchmark and saves the results to the specified output file. You can then compute the metrics with:

```sh
bash stats.sh
```

## 🔬 Experimental Results

### Performance of Various MLLMs on StreamingBench
- 60 seconds of context preceding the query time (Main)


*Figure: results table.*

- All Context (+ Long Context)


*Figure: results table.*

- Comparison of Main Experiment vs. 60 Seconds of Video Context


*Figure: results table.*
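The two context settings above (60 seconds preceding the query time, and all context) differ only in where the visible video span starts. A minimal sketch of that window computation follows; the function name `context_window` and its parameters are hypothetical, and how the repository's scripts actually clip videos is not shown here.

```python
from typing import Optional, Tuple


def context_window(
    query_time: float, context_seconds: Optional[float] = 60.0
) -> Tuple[float, float]:
    """Return (start, end), in seconds, of the video span visible to the model.

    context_seconds=60.0 mirrors the 60-second setting; None means all
    context from the start of the video up to the query time.
    """
    if query_time < 0:
        raise ValueError("query_time must be non-negative")
    if context_seconds is None:
        return 0.0, query_time
    # Clamp at 0 so early queries do not produce a negative start time.
    return max(0.0, query_time - context_seconds), query_time
```

For a question asked at 150 s, the 60-second setting would expose the span from 90 s to 150 s, while the all-context setting would expose everything from 0 s to 150 s.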

### Performance of Different MLLMs on the Proactive Output Task
*"≤ xs" means that the answer is considered correct if the actual output time is within x seconds of the ground truth.*


*Figure: results table.*
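The tolerance rule above can be sketched as a small scoring function. This is an illustration under stated assumptions, not the repository's `stats.sh` logic: the names `within_tolerance` and `accuracy_at` are hypothetical, "within x seconds" is read as an absolute difference in either direction, and a model that never produces an output (`None`) is counted as incorrect.

```python
from typing import Iterable, Optional, Tuple


def within_tolerance(
    output_time: Optional[float], truth_time: float, tolerance: float
) -> bool:
    """True if the model's output time is within `tolerance` seconds of the ground truth."""
    if output_time is None:  # assumption: no output counts as a miss
        return False
    return abs(output_time - truth_time) <= tolerance


def accuracy_at(
    pairs: Iterable[Tuple[Optional[float], float]], tolerance: float
) -> float:
    """Fraction of (output_time, truth_time) pairs that fall within the tolerance."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(within_tolerance(out, truth, tolerance) for out, truth in pairs)
    return hits / len(pairs)


# Made-up timings for illustration only: (model output time, ground-truth time).
example = [(10.5, 10.0), (14.0, 10.0), (None, 20.0), (19.0, 20.0)]
print(accuracy_at(example, 1.0))  # 0.5
print(accuracy_at(example, 4.0))  # 0.75
```

Widening the tolerance can only keep or raise the score, which is why results are reported at several thresholds.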

## 📝 Citation
```bibtex
@article{lin2024streaming,
title={StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding},
author={Junming Lin and Zheng Fang and Chi Chen and Zihao Wan and Fuwen Luo and Peng Li and Yang Liu and Maosong Sun},
journal={arXiv preprint arXiv:2411.03628},
year={2024}
}
```