{"id":28653870,"url":"https://github.com/tiger-ai-lab/vl-rethinker","last_synced_at":"2025-06-13T07:07:55.977Z","repository":{"id":286159928,"uuid":"960567532","full_name":"TIGER-AI-Lab/VL-Rethinker","owner":"TIGER-AI-Lab","description":"The official code of \"VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning\"","archived":false,"fork":false,"pushed_at":"2025-06-05T05:13:57.000Z","size":5209,"stargazers_count":105,"open_issues_count":2,"forks_count":3,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-06-05T07:24:33.063Z","etag":null,"topics":["llm","multimodal","reasoning","vlm"],"latest_commit_sha":null,"homepage":"https://tiger-ai-lab.github.io/VL-Rethinker/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TIGER-AI-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-04T16:55:29.000Z","updated_at":"2025-06-05T05:13:58.000Z","dependencies_parsed_at":"2025-06-05T06:31:32.955Z","dependency_job_id":"1881f1a1-75ef-4f91-8fd6-bb7ef40e15ed","html_url":"https://github.com/TIGER-AI-Lab/VL-Rethinker","commit_stats":null,"previous_names":["tiger-ai-lab/mm-rethinker","tiger-ai-lab/vl-rethinker"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/TIGER-AI-Lab/VL-Rethinker","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FVL-Rethinker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FVL-Rethinker/tags","releases_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FVL-Rethinker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FVL-Rethinker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TIGER-AI-Lab","download_url":"https://codeload.github.com/TIGER-AI-Lab/VL-Rethinker/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FVL-Rethinker/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259599331,"owners_count":22882357,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","multimodal","reasoning","vlm"],"created_at":"2025-06-13T07:07:55.400Z","updated_at":"2025-06-13T07:07:55.957Z","avatar_url":"https://github.com/TIGER-AI-Lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning\n\n\u003ca target=\"_blank\" href=\"https://arxiv.org/abs/2504.08837\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-Paper-red?style=flat\u0026logo=arxiv\"\u003e\u003c/a\u003e\n\u003ca target=\"_blank\" href=\"#\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-Code-green?style=flat\u0026logo=github\"\u003e\u003c/a\u003e\n\u003ca target=\"_blank\" href=\"https://tiger-ai-lab.github.io/VL-Rethinker/\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-🌐%20Website-blue?style=flat\"\u003e\u003c/a\u003e\n\u003ca 
target=\"_blank\" href=\"https://huggingface.co/TIGER-Lab/VL-Rethinker-7B\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-🤗%20Models-red?style=flat\"\u003e\u003c/a\u003e\n\u003ca target=\"_blank\" href=\"https://huggingface.co/datasets/TIGER-Lab/ViRL39K\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-🤗%20Dataset-blue?style=flat\"\u003e\u003c/a\u003e\n\u003ca target=\"_blank\" href=\"https://huggingface.co/spaces/TIGER-Lab/VL-Rethinker\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-🤗%20Demo-yellow?style=flat\"\u003e\u003c/a\u003e\n\u003cbr\u003e\n\n\n\u003cspan style=\"color:#183385; font-size: 14pt; font-family: Roboto, Helvetica, Arial, Heveltica Neue, sans-serif\"\u003e\n     \u003cb\u003eAuthors:\u003c/b\u003e\n     \u003ca class=\"name\" target=\"_blank\" href=\"https://HaozheH3.github.io\"\u003eHaozhe Wang\u003c/a\u003e, \n     \u003ca class=\"name\" target=\"_blank\" href=\"\"\u003eChao Qu\u003c/a\u003e, \n     \u003ca class=\"name\" target=\"_blank\" href=\"\"\u003eZuming Huang\u003c/a\u003e, \n     \u003ca class=\"name\" target=\"_blank\" href=\"https://weichu.github.io/\"\u003eWei Chu\u003c/a\u003e, \n     \u003ca class=\"name\" target=\"_blank\" href=\"https://cse.hkust.edu.hk/~flin/\"\u003eFangzhen Lin\u003c/a\u003e,\n     \u003ca class=\"name\" target=\"_blank\" href=\"https://wenhuchen.github.io/\"\u003eWenhu Chen\u003c/a\u003e\u0026nbsp; \n\n## 🔥News\n\n- [2025/4/22] We release the dataset [🤗 ViRL39K](https://huggingface.co/datasets/TIGER-Lab/ViRL39K). 
It covers a **comprehensive collection** of 39K queries across **eight categories**, and provides fine-grained **model-capability annotations** for data selection.\n\n\n## Overview\n![overview](./assets/overview-2.jpg)\n\n\u003cdetails\u003e\u003csummary\u003eAbstract\u003c/summary\u003e \nRecently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to that of fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. \nTo further encourage slow-thinking, we introduce Forced Rethinking, which appends a textual rethinking trigger to the end of initial rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances state-of-the-art scores on MathVista, MathVerse, and MathVision to 80.3%, 61.8%, and 43.9%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with GPT-o1. 
Our empirical results show the effectiveness of our approaches.\n\n\u003c/details\u003e\n\n## Release Progress\n- [x] models.\n- [x] data.\n- [ ] inference and evaluation code.\n- [x] training code.\n\n### Dataset\n**[ViRL39K](https://huggingface.co/datasets/TIGER-Lab/ViRL39K)** lays the foundation for our RL training. It has the following merits:\n- **high-quality** and **verifiable**: the QAs undergo rigorous filtering and quality control, removing problematic queries or ones that cannot be verified by rules.\n- covering **comprehensive** topics and categories: from grade school problems to broader STEM and Social topics; reasoning with charts, diagrams, tables, documents, spatial relationships, etc. \n- with fine-grained **model-capability annotations**: it tells you what queries to use when training models at different scales.\n\n\n### RL-ed Models\n- [VL-Rethinker-7B](https://huggingface.co/TIGER-Lab/VL-Rethinker-7B): trained from Qwen2.5-VL-7B-Instruct with the proposed SSR and Forced Rethinking.\n- [VL-Rethinker-72B](https://huggingface.co/TIGER-Lab/VL-Rethinker-72B): trained from Qwen2.5-VL-72B-Instruct with the proposed SSR and Forced Rethinking.\n\nWe are training a 32B model and further enhancing these models. Stay tuned!\n\n\n## Performance\nSee our [website](https://tiger-ai-lab.github.io/VL-Rethinker/) or [paper](https://arxiv.org/abs/2504.08837) for a detailed performance report.\n\n\n## Selective Sample Replay (SSR)\n\nTraining 72B models on publicly collected queries reveals \"vanishing advantages,\" a phenomenon where rapid saturation in large models drastically reduces the number of effective training samples. The concurrent work [DAPO](https://arxiv.org/abs/2503.14476) on LLMs made a similar observation.\n\nDAPO combats this by filtering ineffective queries for gradient stability. In contrast to this gradient perspective, our method, Selective Sample Replay (SSR), takes an active learning perspective. 
Drawing on the same idea as Prioritized Experience Replay, SSR re-arranges training samples based on their informativeness -- examples with high advantages, which lie near the model's capability limits (i.e., correct responses to queries the model is likely to fail), are particularly informative. This active selection focuses training on the samples most likely to contribute to model improvement, thereby improving training efficiency.\n\nThe implementation of SSR is also simple. In addition to the code in `active_sampling() @openrlhf/trainer/ppo_utils/replay_buffer.py`, here is pseudocode for the key idea of SSR.\n```python\neffective_qas = rule_out_zero(candidates)\np = normalize_adv(effective_qas, alpha=1)\nselection = np.random.choice(np.arange(len(effective_qas)), size=size, p=p)\n```\n\nNote: For different scenarios, e.g., on-policy or off-policy, the choices of `candidates` and `size` can differ. \n\n## Inference \nOur models are built on top of the Qwen2.5-VL family, so we include a simple use case here and refer readers to [the standard inference procedure of Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL).\n\n\n```python\nfrom transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor\nfrom qwen_vl_utils import process_vision_info\n\n# default: Load the model on the available device(s)\nmodel = Qwen2_5_VLForConditionalGeneration.from_pretrained(\n    \"TIGER-Lab/VL-Rethinker-7B\", torch_dtype=\"auto\", device_map=\"auto\"\n)\n\n# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.\n# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(\n#     \"Qwen/Qwen2.5-VL-7B-Instruct\",\n#     torch_dtype=torch.bfloat16,\n#     attn_implementation=\"flash_attention_2\",\n#     device_map=\"auto\",\n# )\n\n# default processor\n# processor = AutoProcessor.from_pretrained(\"TIGER-Lab/VL-Rethinker-7B\")\n\n\nmin_pixels = 256*28*28\nmax_pixels = 1280*28*28\nprocessor = 
AutoProcessor.from_pretrained(\"TIGER-Lab/VL-Rethinker-7B\", min_pixels=min_pixels, max_pixels=max_pixels)\n\nmessages = [\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\n                \"type\": \"image\",\n                \"image\": \"https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg\",\n            },\n            {\"type\": \"text\", \"text\": \"Describe this image.\"},\n        ],\n    }\n]\n\n# Preparation for inference\ntext = processor.apply_chat_template(\n    messages, tokenize=False, add_generation_prompt=True\n)\nimage_inputs, video_inputs = process_vision_info(messages)\ninputs = processor(\n    text=[text],\n    images=image_inputs,\n    videos=video_inputs,\n    padding=True,\n    return_tensors=\"pt\",\n)\ninputs = inputs.to(model.device)\n\n# Inference: Generation of the output\ngenerated_ids = model.generate(**inputs, max_new_tokens=128)\ngenerated_ids_trimmed = [\n    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)\n]\noutput_text = processor.batch_decode(\n    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False\n)\nprint(output_text)\n\n```\n\n**Important Notes**:\n\nBased on the training configurations of the VL-Rethinker family, it's recommended to:\n- *Prompt*: \n\n    append `\\n\\nPlease reason step by step, and put your final answer within \\\\boxed{}` after the user queries.\n- *Resolutions*: \n    ```\n    min_pixels = 256*28*28\n    max_pixels = 1280*28*28\n    ```\n\n\n## 🚀Quick Start\nThe proposed algorithm is implemented with the [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) framework.\n\n### Installations\nPlease see [the installation instructions](installation.md).\n\n### Evaluation\nOur models can be evaluated like Qwen2.5-VL using [lmms_eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).\n\nHere we provide an alternative evaluation approach. 
It offers the following benefits:\n- Fast: batch inference with vLLM handles 1K queries on 8 A800 GPUs within 30 minutes.\n- Convenient: evaluation without time-consuming API calls. Judgements made by our rule-based functions align with LLM judges.\n- Train-Test Aligned: the evaluation re-uses the correctness judgement of training to minimize the gap between training and test-time evaluation.\n\nThe evaluation is integrated with the OpenRLHF framework. \n```bash\nbash ./scripts/eval_7b.sh [benchmark] [modelname] [modelpath]\n```\n**Note: for MMMU-Val, we cannot reproduce the Qwen2.5-VL results with lmms_eval, vlmevalkit, or our native evaluation. We would greatly appreciate any insights into the correct way to reproduce them.**\n\n\n### Training\nRun the following.\n```bash\nbash ./scripts/train_vlm_multi.sh\n```\n\n\n## Acknowledgement\nThis project is adapted from [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) and [LMM-R1](https://github.com/TideDra/lmm-r1), released under the Apache License 2.0. Thanks for their open-source contributions!\n\n## Citation\nIf you find this work useful, please cite our paper:\n```bibtex\n@article{vl-rethinker,\n      title={VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning},\n      author = {Wang, Haozhe and Qu, Chao and Huang, Zuming and Chu, Wei and Lin, Fangzhen and Chen, Wenhu},\n      journal={arXiv preprint arXiv:2504.08837},\n      year={2025}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiger-ai-lab%2Fvl-rethinker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftiger-ai-lab%2Fvl-rethinker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiger-ai-lab%2Fvl-rethinker/lists"}