https://github.com/aim-uofa/active-o3

ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

active-perception active-vision grpo mllms o3 rl thinking-with-image

Host: GitHub
URL: https://github.com/aim-uofa/active-o3
Owner: aim-uofa
Created: 2025-05-25T07:10:15.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-05-30T02:48:15.000Z (about 1 year ago)
Last Synced: 2025-06-05T03:14:57.144Z (about 1 year ago)
Topics: active-perception, active-vision, grpo, mllms, o3, rl, thinking-with-image
Homepage: https://aim-uofa.github.io/ACTIVE-o3/
Size: 4.86 MB
Stars: 58
Watchers: 2
Forks: 1
Open Issues: 2
Metadata Files:
- Readme: README.md

# ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

¹[Zhejiang University](https://www.zju.edu.cn/english/),
²[Ant Group](https://www.antgroup.com/en)

[📄 **Paper**](https://arxiv.org/abs/2505.21457) | [🌐 **Project Page**](https://aim-uofa.github.io/ACTIVE-o3) | [💾 **Model Weights**](https://www.modelscope.cn/models/zzzmmz/ACTIVE-o3)

## 🚀 Overview

## 📖 Description

We propose ACTIVE-O3, a purely reinforcement-learning-based training framework built on top of GRPO and designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 on both general open-world tasks, such as small-object and dense-object grounding, and domain-specific scenarios, including small-object detection in remote sensing and autonomous driving as well as fine-grained interactive segmentation. Experimental results demonstrate that ACTIVE-O3 significantly enhances active perception capabilities compared to Qwen2.5-VL-CoT. For example, Figure 1 shows zero-shot reasoning on the V* benchmark, where ACTIVE-O3 successfully identifies the number on the traffic light by zooming in on the relevant region, while Qwen2.5-VL fails to do so. Moreover, across all downstream tasks, ACTIVE-O3 consistently improves performance under fixed computational budgets. We hope that this work provides a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.
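
For readers unfamiliar with GRPO, the sketch below illustrates its core idea: several responses are sampled for the same prompt, and each response's reward is normalized against the group's mean and standard deviation to obtain an advantage. This is a minimal illustration, not the training code of this repository (the training scripts are not yet released). The function name, the reward values, and the choice of population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples drawn for the same prompt.

    Assumption: population standard deviation is used; implementations differ
    on this detail. `eps` avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

# Responses above the group mean get a positive advantage, those below a negative one.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Because the baseline comes from the group itself, no separate value network is needed, which is one reason GRPO is attractive for training large models.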

## 🚩 Plan

- [x] Release the weights.
- [x] Release the inference demo.
- [ ] Release the dataset.
- [ ] Release the training scripts.
- [ ] Release the evaluation scripts.

## 🛠️ Getting Started

### 📐 Set up Environment

```bash

# build environment
conda create -n activeo3 python=3.10
conda activate activeo3

# install packages
pip install torch==2.5.1 torchvision==0.20.1
pip install flash-attn --no-build-isolation
pip install transformers==4.51.3
# quote the extras so shells such as zsh do not try to expand the brackets
pip install "qwen-omni-utils[decord]"
```
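
After installation, it can help to confirm that the pinned versions were actually picked up. The following is a small optional sketch using only the Python standard library; the helper name `check_versions` is hypothetical and is not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands above.
EXPECTED = {"torch": "2.5.1", "torchvision": "0.20.1", "transformers": "4.51.3"}


def check_versions(expected):
    """Return {package: (expected, installed)}; installed is None if the package is missing."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed)
    return report


if __name__ == "__main__":
    for name, (wanted, installed) in check_versions(EXPECTED).items():
        # Local build tags such as "+cu121" are ignored when comparing.
        ok = installed is not None and installed.split("+")[0] == wanted
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: expected {wanted}, found {installed} -> {status}")
```

Run it inside the activated `activeo3` environment; a `MISMATCH` line usually means another package pulled in a different version.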

### 🔍 Demo

```bash
# run demo
python demo/activeo3_demo_vstar.py
```

## 🎫 License

For academic usage, this project is licensed under [the 2-clause BSD License](LICENSE). For commercial inquiries, please contact [Chunhua Shen](mailto:chhshen@gmail.com).

## 🖊️ Citation

If you find this work helpful for your research, please cite:

```BibTeX
@article{zhu2025active,
title={Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO},
author={Zhu, Muzhi and Zhong, Hao and Zhao, Canyu and Du, Zongze and Huang, Zheng and Liu, Mingyu and Chen, Hao and Zou, Cheng and Chen, Jingdong and Yang, Ming and others},
journal={arXiv preprint arXiv:2505.21457},
year={2025}
}
```
