ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
- Host: GitHub
- URL: https://github.com/aim-uofa/active-o3
- Owner: aim-uofa
- Created: 2025-05-25T07:10:15.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-30T02:48:15.000Z (about 1 year ago)
- Last Synced: 2025-06-05T03:14:57.144Z (about 1 year ago)
- Topics: active-perception, active-vision, grpo, mllms, o3, rl, thinking-with-image
- Homepage: https://aim-uofa.github.io/ACTIVE-o3/
- Size: 4.86 MB
- Stars: 58
- Watchers: 2
- Forks: 1
- Open Issues: 2
# ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
<sup>1</sup>[Zhejiang University](https://www.zju.edu.cn/english/),
<sup>2</sup>[Ant Group](https://www.antgroup.com/en)
[**Paper**](https://arxiv.org/abs/2505.21457) | [**Project Page**](https://aim-uofa.github.io/ACTIVE-o3) | [**Model Weights**](https://www.modelscope.cn/models/zzzmmz/ACTIVE-o3)
## Overview
## Description
We propose ACTIVE-O3, a purely reinforcement-learning-based training framework built on top of GRPO and designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 on both general open-world tasks, such as small-object and dense-object grounding, and domain-specific scenarios, including small-object detection in remote sensing and autonomous driving as well as fine-grained interactive segmentation. Experimental results demonstrate that ACTIVE-O3 significantly enhances active perception capabilities compared to Qwen2.5-VL-CoT. For example, Figure 1 shows zero-shot reasoning on the V* benchmark, where ACTIVE-O3 successfully identifies the number on the traffic light by zooming in on the relevant region, while Qwen2.5-VL fails to do so. Moreover, across all downstream tasks, ACTIVE-O3 consistently improves performance under fixed computational budgets. We hope that this work provides a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.
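As background on GRPO, which the framework builds on: the method scores each sampled response relative to the other responses drawn for the same prompt, rather than using a learned value function. The sketch below illustrates only that group-relative normalization. The function name, the example rewards, and the use of the population standard deviation are assumptions made for illustration; the reward design used by ACTIVE-O3 is not described in this README.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    # eps keeps the division defined when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same prompt
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above the group mean receive a positive advantage and those below it a negative one, so the policy update favors the better responses within each group.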
## Plan
- [x] Release the weights.
- [x] Release the inference demo.
- [ ] Release the dataset.
- [ ] Release the training scripts.
- [ ] Release the evaluation scripts.
## Getting Started
### Set up Environment
```bash
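# Create and activate a dedicated conda environment
conda create -n activeo3 python=3.10
conda activate activeo3
# Install pinned packages (torch first: flash-attn builds against the installed torch)
pip install torch==2.5.1 torchvision==0.20.1
pip install flash-attn --no-build-isolation
pip install transformers==4.51.3
# Quote the extras so shells such as zsh do not glob the brackets
pip install "qwen-omni-utils[decord]"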
```
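After installation, it can help to confirm that the pinned versions were actually picked up. The following is a minimal sketch that uses only the standard library; `check_pins` and `EXPECTED` are hypothetical names introduced here and are not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands above
EXPECTED = {"torch": "2.5.1", "torchvision": "0.20.1", "transformers": "4.51.3"}


def check_pins(expected, get_version=version):
    """Return {package: (expected, found)} for each missing or mismatched package."""
    problems = {}
    for name, want in expected.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        # Local build tags such as "+cu124" are ignored when comparing
        if found is None or found.split("+")[0] != want:
            problems[name] = (want, found)
    return problems


if __name__ == "__main__":
    for name, (want, found) in check_pins(EXPECTED).items():
        print(f"{name}: expected {want}, found {found}")
```

If the script prints nothing, the three pinned packages match the versions listed above.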
### Demo
```bash
# run demo
python demo/activeo3_demo_vstar.py
```
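The demo targets the V* benchmark, where the model zooms in on a relevant region before answering. The internals of the demo script are not reproduced here. As a rough illustration of what a zoom-in step involves, the hypothetical helper below expands a proposed pixel box by a margin and clamps it to the image bounds; the function name, the margin value, and the box format `(x0, y0, x1, y1)` are all assumptions.

```python
def zoom_box(box, image_size, margin=0.1):
    """Expand a pixel box (x0, y0, x1, y1) by a relative margin and clamp it to the image."""
    x0, y0, x1, y1 = box
    width, height = image_size
    pad_x = (x1 - x0) * margin
    pad_y = (y1 - y0) * margin
    return (
        max(0, int(x0 - pad_x)),
        max(0, int(y0 - pad_y)),
        min(width, int(x1 + pad_x)),
        min(height, int(y1 + pad_y)),
    )


# Hypothetical region proposed inside a 320x240 image
crop = zoom_box((100, 50, 300, 150), image_size=(320, 240))
print(crop)
```

The resulting tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so the cropped region could then be resized and passed back to the model as a second view.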
## License
For academic usage, this project is licensed under [the 2-clause BSD License](LICENSE). For commercial inquiries, please contact [Chunhua Shen](mailto:chhshen@gmail.com).
## Citation
If you find this work helpful for your research, please cite:
```BibTeX
@article{zhu2025active,
title={Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO},
author={Zhu, Muzhi and Zhong, Hao and Zhao, Canyu and Du, Zongze and Huang, Zheng and Liu, Mingyu and Chen, Hao and Zou, Cheng and Chen, Jingdong and Yang, Ming and others},
journal={arXiv preprint arXiv:2505.21457},
year={2025}
}
```