{"id":29219667,"url":"https://github.com/dvlab-research/visionreasoner","last_synced_at":"2025-07-03T02:06:36.348Z","repository":{"id":294021496,"uuid":"968933448","full_name":"dvlab-research/VisionReasoner","owner":"dvlab-research","description":"The official implement of \"VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning\"","archived":false,"fork":false,"pushed_at":"2025-05-30T08:30:08.000Z","size":12850,"stargazers_count":190,"open_issues_count":0,"forks_count":12,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-06-29T22:02:58.586Z","etag":null,"topics":["counting-objects","multimodal","multimodal-large-language-models","object-detection","reasoning-language-models","reinforcement-learning","segmentation","visual-perception"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dvlab-research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-19T02:44:15.000Z","updated_at":"2025-06-26T03:03:36.000Z","dependencies_parsed_at":"2025-05-18T14:47:36.883Z","dependency_job_id":null,"html_url":"https://github.com/dvlab-research/VisionReasoner","commit_stats":null,"previous_names":["dvlab-research/visionreasoner"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/dvlab-research/VisionReasoner","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dvlab-research%2FVisionReasoner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dvlab-re
search%2FVisionReasoner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dvlab-research%2FVisionReasoner/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dvlab-research%2FVisionReasoner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dvlab-research","download_url":"https://codeload.github.com/dvlab-research/VisionReasoner/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dvlab-research%2FVisionReasoner/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263245318,"owners_count":23436514,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["counting-objects","multimodal","multimodal-large-language-models","object-detection","reasoning-language-models","reinforcement-learning","segmentation","visual-perception"],"created_at":"2025-07-03T02:06:33.143Z","updated_at":"2025-07-03T02:06:36.332Z","avatar_url":"https://github.com/dvlab-research.png","language":"Python","readme":"# VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning\n\n\u003e Current VLMs are primarily used for visual captioning or visual QA tasks. In this project, we take a step further by demonstrating the potential of a single VLM to solve diverse vision tasks. We hope this work will advance the frontier of VLM research and expand the boundaries of what these models can achieve.  
\n\nPaper: [📖 VisionReasoner](https://arxiv.org/pdf/2505.12081) [📖 Seg-Zero](https://arxiv.org/pdf/2503.06520)         \nHuggingFace Daily: [🤗 VisionReasoner](https://huggingface.co/papers/2505.12081)  \nModel: [🤗 VisionReasoner-7B](https://huggingface.co/Ricky06662/VisionReasoner-7B) [🤗 TaskRouter-1.5B](https://huggingface.co/Ricky06662/TaskRouter-1.5B)  \nRelated Link: [Seg-Zero![[code]](https://img.shields.io/github/stars/dvlab-research/Seg-Zero)](https://github.com/dvlab-research/Seg-Zero)   \n\nOverview of VisionReasoner:\n\n\u003cdiv align=center\u003e\n\u003cimg width=\"98%\" src=\"assets/overview.png\"/\u003e\n\u003c/div\u003e\n\nVisionReasoner demonstrates the following features:\n1. **VisionReasoner** is a unified framework for visual perception tasks. Through carefully crafted rewards and training strategies, VisionReasoner has strong multi-task capability, addressing diverse visual perception tasks within a shared model.  \n2. We select several representative tasks to evaluate models' unified visual ability, including detection tasks (e.g., [COCO](https://cocodataset.org/#home), [RefCOCOg](https://github.com/lichengunc/refer)), segmentation tasks (e.g., [ReasonSeg](https://github.com/dvlab-research/LISA)), counting tasks (e.g., [CountBench](https://teaching-clip-to-count.github.io/)) and VQA tasks (e.g., [DocVQA](https://www.docvqa.org/)).   \n3. Experimental results show that VisionReasoner achieves superior performance across ten diverse visual perception tasks within a single unified framework, outperforming baseline models by a significant margin.   \n4. We have reformulated dozens of visual task types categorized in [Papers With Code](https://paperswithcode.com/datasets?mod=images\u0026page=1). Please refer to [task categorization](task_categorization.md) for details. These task types are categorized into four fundamental task types: detection, segmentation, counting and VQA. 
More task types, and even new fundamental task types such as 3D or medical image processing, can be added to this framework.  \n\n\n## News\n\n[May 17th, 2025] 🔥 [📖 Paper](https://arxiv.org/pdf/2505.12081) is coming!   \n[May 17th, 2025] 🔥 VisionReasoner is coming! VisionReasoner is based on our previous [Seg-Zero](https://github.com/dvlab-research/Seg-Zero).  \n\n\n## Contents\n- [Model](#model)\n- [Installation](#installation)\n- [Inference](#inference)\n- [Hybrid Mode](#hybrid-mode)\n- [Image Generation](#image-generation)\n- [Evaluation](#evaluation)\n- [Training](#training)\n- [Citation](#citation)\n- [Acknowledgement](#acknowledgement)\n\n\n\n## Model\n\u003cdiv align=center\u003e\n\u003cimg width=\"98%\" src=\"assets/pipeline.png\"/\u003e\n\u003c/div\u003e\n\nThe VisionReasoner model incorporates a reasoning module, which processes the image and locates targeted objects, and a segmentation module that produces segmentation masks when needed.   \nWe also train a task router that converts diverse vision tasks into the four fundamental task types.\n\n\n\u003c!-- ## Examples\n\n\u003cdiv align=center\u003e\n\u003cimg width=\"98%\" src=\"assets/examples.png\"/\u003e\n\u003c/div\u003e --\u003e\n\n\n## Installation\n\u003e [!NOTE]\n\u003e If you train VisionReasoner using the code in [Seg-Zero](https://github.com/dvlab-research/Seg-Zero), you can directly reuse that training environment.  
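Once the commands below have been run, a quick sanity check can confirm the environment is usable before attempting inference. This is a minimal stdlib-only sketch: the Python version floor (3.12, mirroring the conda command) and the package names (those installed by pip) are the only assumptions, so adjust both to your setup.

```python
# Sanity-check the environment before running inference.
# Stdlib-only sketch; the version floor (3.12) and the package list are
# assumptions taken from the installation commands in this README.
import importlib.util
import sys

REQUIRED = ["torch", "torchvision"]  # installed by the pip commands

def check_env(min_version=(3, 12)):
    """Return a list of human-readable problems; empty means the env looks OK."""
    problems = []
    if sys.version_info[:2] < min_version:
        problems.append(
            f"Python {min_version[0]}.{min_version[1]}+ expected, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    for pkg in REQUIRED:
        # find_spec only looks the package up; it does not import it.
        if importlib.util.find_spec(pkg) is None:
            problems.append(f"missing package: {pkg}")
    return problems

if __name__ == "__main__":
    issues = check_env()
    print("environment OK" if not issues else "\n".join(issues))
```

Swap `REQUIRED` for whatever modules your setup needs; since the check never imports the packages, it stays fast even for heavy libraries like torch.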
\n\n```bash\ngit clone https://github.com/dvlab-research/VisionReasoner.git\ncd VisionReasoner\nconda create -n visionreasoner_test python=3.12\nconda activate visionreasoner_test\npip3 install torch torchvision\npip install -r requirements.txt\n```\n\n\n## Inference\nDownload the models using the following script: \n```bash\nmkdir pretrained_models\ncd pretrained_models\ngit lfs install\ngit clone https://huggingface.co/Ricky06662/VisionReasoner-7B\ngit clone https://huggingface.co/Ricky06662/TaskRouter-1.5B\n```\n\u003e [!TIP]\n\u003e If you encounter issues with connecting to Hugging Face, consider using `export HF_ENDPOINT=https://hf-mirror.com`.   \n\n\nThen run inference using:\n```bash\npython vision_reasoner/inference.py\n```\n### The default task is a counting task.  \n\u003e \"How many airplanes are there in this image?\"\n\n\u003cdiv align=center\u003e\n\u003cimg width=\"30%\" src=\"assets/airplanes.png\"/\u003e\n\u003c/div\u003e\n\n\nYou will get the thinking process in the command line, like:\n\n\u003e \"The image shows a formation of airplanes flying in the sky. Each airplane is distinct and can be counted individually. The planes are arranged in a specific pattern, and there are visible trails of smoke behind them, which is typical for airshows or demonstrations.\"\n\nAnd you will get the final answer in the command line, like:\n\n\u003e \"Total number of interested objects is:  10\"\n\n\n### You can also try a detection / segmentation task by:  \n```bash\npython vision_reasoner/inference.py --image_path \"assets/donuts.png\" --query \"please segment the donuts\"\n```\n\nYou will get the thinking process in the command line, like:\n\n\u003e \"The task involves identifying and segmenting individual donuts in the image. Each donut is distinct in its color, glaze, and toppings, which helps in distinguishing them from one another. 
The goal is to identify each donut as a separate object and provide bounding boxes for them.\"\n\nAnd the result will be presented in result_visualization.png. \n\n\u003cdiv align=center\u003e\n\u003cimg width=\"98%\" src=\"assets/donuts_output.png\"/\u003e\n\u003c/div\u003e\n\n### Or try a task that needs reasoning: \n\n```bash\npython vision_reasoner/inference.py --image_path \"assets/stand_higher.png\" --query \"find what can make the woman stand higher?\"\n```\n\nYou will get the thinking process in the command line, like:\n\n\u003e \"The question asks for objects that can make the woman stand higher. The woman is already standing on a ladder, which is the object that elevates her. The ladder is the most closely matched object to what can make her stand higher.\"\n\nAnd the result will be presented in result_visualization.png. \n\n\u003cdiv align=center\u003e\n\u003cimg width=\"98%\" src=\"assets/stand_higher_output.png\"/\u003e\n\u003c/div\u003e\n\n\n### We also support naive visual QA / captioning tasks:\n```bash\npython vision_reasoner/inference.py --image_path \"assets/company_name.png\" --query \"What is name of the company?\"\n``` \n\n\u003cdiv align=center\u003e\n\u003cimg width=\"20%\" src=\"assets/company_name.png\"/\u003e\n\u003c/div\u003e\n\nIn VQA, there is no reasoning, and you will get the final answer in the command line, like:\n\n\u003e \"The answer is: The name of the company is ITC (Indian Tobacco Company Limited).\"\n\n### You can also provide your own image_path and text by:\n```bash\npython vision_reasoner/inference.py --image_path \"your_image_path\" --query \"your question text\"\n```\n\n## Hybrid Mode:\nWhen hybrid reasoning mode is enabled, VisionReasoner intelligently switches between direct detection (using YOLO-World) and reasoning-based approaches based on the complexity of the query. 
This allows for faster responses on simple queries while maintaining detailed reasoning for complex tasks.\n\n### Simple Query Example:\nFor straightforward queries that can be directly answered by object detection:\n\n```bash\npython vision_reasoner/inference.py --image_path \"assets/crowd.png\" --query \"person\" --hybrid_mode \n```\n\nOutput:\n\n\u003cdiv align=center\u003e\n\u003cimg width=\"100%\" src=\"assets/crowd_output_1.png\"/\u003e\n\u003c/div\u003e\n\nIn this case, the model directly uses YOLO-World for detection without going through the reasoning process, resulting in faster response times.\n\n### Complex Query Example:\nFor queries that require spatial reasoning or complex understanding:\n\n```bash\npython vision_reasoner/inference.py --image_path \"assets/crowd.png\" --query \"the person who is facing to the camera\" --hybrid_mode \n```\n\nOutput:\n\u003e Thinking process: The task involves identifying the person who is facing the camera and then finding the most closely matched object. In the image, there is a person in the center wearing a white shirt and a black vest, who appears to be facing the camera directly. The other individuals are walking away from the camera, so they are not the target. The person in the white shirt and black vest is the closest match to the description of facing the camera.\n\n\n\u003cdiv align=center\u003e\n\u003cimg width=\"100%\" src=\"assets/crowd_output_2.png\"/\u003e\n\u003c/div\u003e\n\nIn this case, the model switches to the reasoning-based approach because the query requires understanding spatial relationships and visual attributes.\n\n\n## Image Generation:\nOur framework can also incorporate generation tasks. We adopt [gpt-image-1](https://platform.openai.com/docs/guides/image-generation?image-generation-model=gpt-image-1) for generation in the current version.    \n\n\u003e [!NOTE]\n\u003e Bugs might arise from API version mismatches. 
Please debug and customize based on your API key and version.\n\n### Text-to-image generation \nFor text-to-image generation, you only need to input a prompt:\n```bash\npython vision_reasoner/inference.py --image_prompt \"Draw an image of a cute dog.\" --generation_model_name [your openAI api key] --generation_mode \n```\n\n### Image reference generation \nFor image reference generation, you should input a prompt and a reference image:\n```bash\npython vision_reasoner/inference.py  --refer_image_path \"assets/dog.png\" --image_prompt \"Generate a cute dog in a forest\" --generation_model_name [your openAI api key] --generation_mode \n```\n\n## Evaluation\n\nThe evaluation scripts allow you to test VisionReasoner on various datasets. We provide scripts for evaluating segmentation, detection, and counting tasks.\n\n### Using the Evaluation Scripts\n\nEach evaluation script accepts either a HuggingFace dataset path or a local dataset path:\n\n```bash\n# Using HuggingFace dataset paths (default in examples)\nbash evaluation/eval_segmentation.sh Ricky06662/refcoco_val\n\n# Using local dataset paths\nbash evaluation/eval_segmentation.sh /path/to/your/local/refcoco_val\n```\n\nAdditionally, you can customize model paths with the following parameters:\n\n```bash\n# Using local model paths (instead of downloading from HuggingFace)\nbash evaluation/eval_segmentation.sh [dataset_path] \\\n  --model_path /path/to/local/VisionReasoner-7B \\\n  --task_router_model_path /path/to/local/TaskRouter-1.5B\n```\n\n### Available Evaluation Scripts\n\n- `eval_segmentation.sh`: Evaluates segmentation performance on RefCOCO, RefCOCO+, RefCOCOg, and ReasonSeg datasets. 
When the dataset contains bounding box ground truth annotations, it will also output detection metrics.\n- `eval_coco.sh`: Evaluates detection performance on the COCO dataset\n- `eval_count.sh`: Evaluates counting performance on counting benchmarks\n\n### Example Commands\n\n```bash \n# Segmentation/Detection evaluation\nbash evaluation/eval_segmentation.sh Ricky06662/refcoco_val\nbash evaluation/eval_segmentation.sh Ricky06662/refcoco_testA\nbash evaluation/eval_segmentation.sh Ricky06662/refcocoplus_val\nbash evaluation/eval_segmentation.sh Ricky06662/refcocoplus_testA\nbash evaluation/eval_segmentation.sh Ricky06662/refcocog_val\nbash evaluation/eval_segmentation.sh Ricky06662/refcocog_test\nbash evaluation/eval_segmentation.sh Ricky06662/ReasonSeg_val\nbash evaluation/eval_segmentation.sh Ricky06662/ReasonSeg_test\n\n# COCO evaluation\nbash evaluation/eval_coco.sh Ricky06662/coco_val\n\n# Counting evaluation\nbash evaluation/eval_count.sh Ricky06662/counting_pixmo_validation\nbash evaluation/eval_count.sh Ricky06662/counting_pixmo_test\nbash evaluation/eval_count.sh Ricky06662/counting_countbench\n```\n\n## Training\n\nWe recommend using [Seg-Zero](https://github.com/dvlab-research/Seg-Zero) to train VisionReasoner.  
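The router's job of collapsing free-form queries into the four fundamental task types (detection, segmentation, counting, VQA) can be illustrated with a toy keyword dispatcher. This heuristic is purely a hypothetical sketch for intuition; the actual routing is learned by TaskRouter-1.5B, not rule-based.

```python
# Toy illustration of task routing: mapping a free-form query to one of the
# four fundamental task types. This keyword heuristic is a hypothetical
# stand-in for the trained TaskRouter-1.5B model, not its actual logic.

def route_task(query: str) -> str:
    q = query.lower()
    if "how many" in q or "count" in q:
        return "counting"
    if "segment" in q or "mask" in q:
        return "segmentation"
    if any(w in q for w in ("detect", "find", "locate", "box")):
        return "detection"
    return "vqa"  # fall back to visual question answering

if __name__ == "__main__":
    for query in (
        "How many airplanes are there in this image?",
        "please segment the donuts",
        "find what can make the woman stand higher?",
        "What is name of the company?",
    ):
        print(f"{query!r} -> {route_task(query)}")
```

The order of the branches matters: counting and segmentation cues are checked before generic localization words, and anything without a perception cue falls through to VQA, mirroring how the framework treats captioning and QA as the catch-all task type.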
\n\n\n## Citation\n\n```bibtex\n@article{liu2025segzero,\n  title        = {Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement},\n  author       = {Liu, Yuqi and Peng, Bohao and Zhong, Zhisheng and Yue, Zihao and Lu, Fanbin and Yu, Bei and Jia, Jiaya},\n  journal      = {arXiv preprint arXiv:2503.06520},\n  year         = {2025}\n}\n\n@article{liu2025visionreasoner,\n  title        = {VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning},\n  author       = {Liu, Yuqi and Qu, Tianyuan and Zhong, Zhisheng and Peng, Bohao and Liu, Shu and Yu, Bei and Jia, Jiaya},\n  journal      = {arXiv preprint arXiv:2505.12081},\n  year         = {2025}\n}\n```\n\n## Acknowledgement\nWe would like to thank the following repos for their great work: \n\n- This work is built upon [Seg-Zero](https://github.com/dvlab-research/Seg-Zero), [EasyR1](https://github.com/hiyouga/EasyR1) and [veRL](https://github.com/volcengine/verl).\n- This work utilizes models from [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct), [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct), [SAM2](https://huggingface.co/facebook/sam2-hiera-large) and [YOLO-World](https://github.com/AILab-CVC/YOLO-World). \n\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=dvlab-research/VisionReasoner\u0026type=Date)](https://star-history.com/#dvlab-research/VisionReasoner\u0026Date)","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdvlab-research%2Fvisionreasoner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdvlab-research%2Fvisionreasoner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdvlab-research%2Fvisionreasoner/lists"}