{"id":28403360,"url":"https://github.com/evolvinglmms-lab/mgpo","last_synced_at":"2026-03-06T12:03:22.920Z","repository":{"id":295747478,"uuid":"991078583","full_name":"EvolvingLMMs-Lab/MGPO","owner":"EvolvingLMMs-Lab","description":"High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning","archived":false,"fork":false,"pushed_at":"2025-07-23T03:59:26.000Z","size":43439,"stargazers_count":52,"open_issues_count":4,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-12-25T04:06:39.200Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://www.lmms-lab.com/posts/highres_visual_reasoning","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EvolvingLMMs-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-27T05:12:49.000Z","updated_at":"2025-12-01T11:10:16.000Z","dependencies_parsed_at":"2025-06-11T12:47:18.892Z","dependency_job_id":null,"html_url":"https://github.com/EvolvingLMMs-Lab/MGPO","commit_stats":null,"previous_names":["evolvinglmms-lab/mgpo"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/EvolvingLMMs-Lab/MGPO","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvolvingLMMs-Lab%2FMGPO","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvolvingLMMs-Lab%2FMGPO/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvolvingLMMs-Lab%2FMGPO/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
EvolvingLMMs-Lab%2FMGPO/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EvolvingLMMs-Lab","download_url":"https://codeload.github.com/EvolvingLMMs-Lab/MGPO/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvolvingLMMs-Lab%2FMGPO/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30175911,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-06T11:48:51.886Z","status":"ssl_error","status_checked_at":"2026-03-06T11:48:51.460Z","response_time":250,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-01T17:36:33.703Z","updated_at":"2026-03-06T12:03:22.912Z","avatar_url":"https://github.com/EvolvingLMMs-Lab.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning\n\n[![MGPO](https://img.shields.io/badge/Paper-MGPO-red)](https://arxiv.org/abs/2507.05920) [![MGPO](https://img.shields.io/badge/Blog-MGPO-blue)](https://www.lmms-lab.com/posts/highres_visual_reasoning) [![机器之心](https://img.shields.io/badge/公众号-机器之心-green)](https://mp.weixin.qq.com/s/K_MOiW2wgVGf5tkjSleyKQ) \n\n\n\u003c!-- Authors: [Xinyu Huang](https://xinyu1205.github.io/), [Yuhao 
Dong](https://scholar.google.com/citations?user=kMui170AAAAJ\u0026hl=zh-CN), Wei Li, Jinming Wu, Zihao Deng, [Bo Li](https://brianboli.com/), Zejun Ma --\u003e\n\n\n## 💡 Introduction\n\nInspired by the human visual system's top-down, task-driven search, we propose **Multi-turn Grounding-based Policy Optimization (MGPO)**. MGPO equips LMMs with interpretable, iterative visual grounding: the model predicts key regions, crops sub-images, and reasons over both the original and focused views.\n\n**Key advantages:**\n- **Interpretable, Top-down Visual Reasoning:** MGPO highlights which image regions are attended to at each step.\n- **Breaks Pixel Limits:** Even if the full image is blurry due to resizing, MGPO identifies and crops clear sub-images for further analysis.\n- **No Extra Grounding Annotations Needed:** MGPO is trained only with binary answer correctness, yet learns robust grounding.\n\n## 🚀 Training Code\n\nOur code is based on verl. The training code and script are available at:\n\nhttps://github.com/xinyu1205/verl/blob/mgpo/examples/grpo_trainer/run_qwen2_5_vl-7b_mgpo.sh\n\n\n\n## 🧰 Experiments\n\n### Visualizations\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"images/2.png\" width=\"800\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  (Examples of models trained with multi-turn grounding-based RL on high-resolution real-world tasks. The model first identifies key regions, which are then automatically cropped and returned as sub-images. 
Notably, despite using only a binary reward function derived from the correctness of the final answer, the model gradually develops robust grounding capability throughout the RL process.)\n\u003c/p\u003e\n\n\n### Main Results\n\n- **MGPO outperforms both SFT and GRPO** on high-resolution tasks.\n- **+5.4%** on MME-Realworld (ID), **+5.2%** on V* Bench (OOD) over GRPO baseline.\n- Surpasses OpenAI’s o1 and GPT-4o on V* Bench, despite using a smaller model and less data.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"images/7.png\" width=\"800\"\u003e\n\u003c/p\u003e\n\n\n\n## ✒️ Citation\nIf you find our work useful for your research, please consider citing it.\n\n```bibtex\n@misc{huang2025highresolutionvisualreasoningmultiturn,\n      title={High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning}, \n      author={Xinyu Huang and Yuhao Dong and Weiwei Tian and Bo Li and Rui Feng and Ziwei Liu},\n      year={2025},\n      eprint={2507.05920},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https://arxiv.org/abs/2507.05920}, \n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevolvinglmms-lab%2Fmgpo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fevolvinglmms-lab%2Fmgpo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevolvinglmms-lab%2Fmgpo/lists"}