https://github.com/aim-uofa/active-o3


# ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

¹[Zhejiang University](https://www.zju.edu.cn/english/),
²[Ant Group](https://www.antgroup.com/en)

[📄 **Paper**](https://arxiv.org/abs/2505.21457) | [**Project Page**](https://aim-uofa.github.io/ACTIVE-o3) | [💾 **Model Weights**](https://www.modelscope.cn/models/zzzmmz/ACTIVE-o3)

## 🚀 Overview


[Figure: framework overview; alt text "SegAgent Framework"]

## 📖 Description

We propose ACTIVE-O3, a purely reinforcement-learning-based training framework built on top of GRPO and designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 on general open-world tasks, such as small-object and dense-object grounding, and on domain-specific scenarios, including small-object detection in remote sensing and autonomous driving as well as fine-grained interactive segmentation. Experimental results show that ACTIVE-O3 significantly enhances active perception capabilities compared to Qwen2.5-VL-CoT. For example, Figure 1 shows zero-shot reasoning on the V* benchmark, where ACTIVE-O3 correctly identifies the number on the traffic light by zooming in on the relevant region, while Qwen2.5-VL fails to do so. Moreover, across all downstream tasks, ACTIVE-O3 consistently improves performance under fixed computational budgets. We hope this work provides a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.
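For readers new to GRPO, the core idea is that several responses are sampled for the same prompt and each one is scored relative to the rest of its group, so no separate value network is needed. The snippet below is a minimal sketch of that group-relative normalization only. It is not taken from this repository: the function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration, and the rewards ACTIVE-O3 actually uses are not shown here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples in its group.

    `rewards` holds one scalar reward per response sampled for a single prompt.
    `eps` (an illustrative choice) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical responses to one prompt: two rewarded, two not.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that score above the group mean get a positive advantage and the others a negative one. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.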

## 🚩 Plan

- [x] Release the weights.
- [x] Release the inference demo.
- [ ] Release the dataset.
- [ ] Release the training scripts.
- [ ] Release the evaluation scripts.

## ๐Ÿ› ๏ธ Getting Started

### ๐Ÿ“ Set up Environment

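```bash
# create and activate the conda environment
conda create -n activeo3 python=3.10
conda activate activeo3

# install pinned packages (flash-attn builds against the torch installed above)
pip install torch==2.5.1 torchvision==0.20.1
pip install flash-attn --no-build-isolation
pip install transformers==4.51.3
# quote the extras spec so shells such as zsh do not expand the brackets
pip install "qwen-omni-utils[decord]"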
```
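After installation, it can help to confirm that the pinned versions were actually picked up. The following is a small sketch using only the standard library. The helper name `find_mismatches` and the `EXPECTED` table are illustrative and not part of this repository; the pins themselves are copied from the install commands.

```python
from importlib import metadata

# Pins copied from the install commands in this section.
EXPECTED = {"torch": "2.5.1", "torchvision": "0.20.1", "transformers": "4.51.3"}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (wanted, found)} for packages that are missing or off-pin."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Ignore local build tags such as "+cu124" when comparing versions.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

Run inside the activated `activeo3` environment, the script prints nothing when all three pins match.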
### ๐Ÿ” demo

```bash
# run demo
python demo/activeo3_demo_vstar.py
```

## 🎫 License

For academic usage, this project is licensed under [the 2-clause BSD License](LICENSE). For commercial inquiries, please contact [Chunhua Shen](mailto:chhshen@gmail.com).

## ๐Ÿ–Š๏ธ Citation

If you find this work helpful for your research, please cite:

```BibTeX
@article{zhu2025active,
  title={Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO},
  author={Zhu, Muzhi and Zhong, Hao and Zhao, Canyu and Du, Zongze and Huang, Zheng and Liu, Mingyu and Chen, Hao and Zou, Cheng and Chen, Jingdong and Yang, Ming and others},
  journal={arXiv preprint arXiv:2505.21457},
  year={2025}
}
```