{"id":26619846,"url":"https://github.com/Liuziyu77/Visual-RFT","last_synced_at":"2025-03-24T09:04:15.791Z","repository":{"id":280560476,"uuid":"937900824","full_name":"Liuziyu77/Visual-RFT","owner":"Liuziyu77","description":"Official repository of ’Visual-RFT: Visual Reinforcement Fine-Tuning’","archived":false,"fork":false,"pushed_at":"2025-03-19T04:09:18.000Z","size":10661,"stargazers_count":1327,"open_issues_count":48,"forks_count":58,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-03-19T05:22:24.320Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Liuziyu77.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-24T05:06:58.000Z","updated_at":"2025-03-19T04:09:22.000Z","dependencies_parsed_at":null,"dependency_job_id":"4621b048-275c-436b-8c03-d74bbecad7d4","html_url":"https://github.com/Liuziyu77/Visual-RFT","commit_stats":null,"previous_names":["liuziyu77/visual-rft"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Liuziyu77%2FVisual-RFT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Liuziyu77%2FVisual-RFT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Liuziyu77%2FVisual-RFT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Liuziyu77%2FVisual-RFT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Liuziyu77","dow
nload_url":"https://codeload.github.com/Liuziyu77/Visual-RFT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245240914,"owners_count":20583101,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-24T09:01:25.307Z","updated_at":"2025-03-24T09:04:15.772Z","avatar_url":"https://github.com/Liuziyu77.png","language":"Python","funding_links":[],"categories":["Projects","Summary","多模态大模型","RelatedRepos"],"sub_categories":["Multimodal and Agents","资源传输下载","Advanced Reasoning for Multi-Modal"],"readme":"\u003cp align=\"center\"\u003e\n\u003c!--   \u003ch1 align=\"center\"\u003e\u003cimg src=\"assets/logo.png\" width=\"256\"\u003e\u003c/h1\u003e --\u003e\n  \u003ch1 align=\"center\"\u003eVisual-RFT: Visual Reinforcement Fine-Tuning\u003c/h1\u003e\n    \u003cp align=\"center\"\u003e\n    \u003ca href=\"https://github.com/Liuziyu77\"\u003e\u003cstrong\u003eZiyu Liu*\u003c/strong\u003e\u003c/a\u003e\n    ·\n    \u003ca href=\"https://github.com/SunzeY\"\u003e\u003cstrong\u003eZeyi Sun*\u003c/strong\u003e\u003c/a\u003e\n    ·\n    \u003ca href=\"https://yuhangzang.github.io/\"\u003e\u003cstrong\u003eYuhang Zang\u003c/strong\u003e\u003c/a\u003e\n    ·\n    \u003ca href=\"https://lightdxy.github.io/\"\u003e\u003cstrong\u003eXiaoyi Dong\u003c/strong\u003e\u003c/a\u003e\n    ·\n    \u003ca href=\"https://scholar.google.com/citations?user=sJkqsqkAAAAJ\"\u003e\u003cstrong\u003eYuhang Cao\u003c/strong\u003e\u003c/a\u003e\n    ·\n    \u003ca href=\"https://kennymckormick.github.io/\"\u003e\u003cstrong\u003eHaodong 
Duan\u003c/strong\u003e\u003c/a\u003e\n    ·\n     \u003ca href=\"http://dahua.site/\"\u003e\u003cstrong\u003eDahua Lin\u003c/strong\u003e\u003c/a\u003e\n    ·\n     \u003ca href=\"https://myownskyw7.github.io/\"\u003e\u003cstrong\u003eJiaqi Wang\u003c/strong\u003e\u003c/a\u003e\n  \u003c/p\u003e\n\u003c!--   \u003ch2 align=\"center\"\u003eAccepted By ICLR 2025!\u003c/h2\u003e --\u003e\n\u003c!-- 🏠\u003ca href=\"https://liuziyu77.github.io/MIA-DPO/\"\u003eHomepage\u003c/a\u003e\u003c/h3\u003e| --\u003e\n  📖\u003ca href=\"https://arxiv.org/abs/2503.01785\"\u003ePaper\u003c/a\u003e |\n  🤗\u003ca href=\"https://huggingface.co/collections/laolao77/virft-datasets-67bc271b6f2833eccc0651df\"\u003eDatasets\u003c/a\u003e | 🤗\u003ca href=\"https://huggingface.co/papers/2503.01785\"\u003eDaily Paper\u003c/a\u003e\u003c/h3\u003e\n\u003cdiv align=\"center\"\u003e\u003c/div\u003e\n\u003cp align=\"center\"\u003e\n  \u003cp\u003e\n🌈We introduce \u003cstrong\u003eVisual Reinforcement Fine-tuning (Visual-RFT)\u003c/strong\u003e, the first comprehensive adaptation of \u003cstrong\u003eDeepseek-R1's RL strategy\u003c/strong\u003e to the \u003cstrong\u003emultimodal field\u003c/strong\u003e. We use the Qwen2-VL-2/7B model as our base model and design a \u003cstrong\u003erule-based verifiable reward\u003c/strong\u003e, which is integrated into a \u003cstrong\u003eGRPO-based reinforcement fine-tuning framework\u003c/strong\u003e to enhance the performance of LVLMs across various visual perception tasks. 
\u003cstrong\u003eViRFT\u003c/strong\u003e extends R1's reasoning capabilities to multiple visual perception tasks, including various detection tasks like \u003cstrong\u003eOpen Vocabulary Detection, Few-shot Detection, Reasoning Grounding, and Fine-grained Image Classification\u003c/strong\u003e.\n  \u003c/p\u003e\n\u003c!--     \u003ca href=\"\"\u003e\n      \u003cimg src=\"assets/teaser.png\" alt=\"Logo\" width=\"100%\"\u003e \n    \u003c/a\u003e --\u003e\n\u003cbr\u003e\n\n\u003ca href=\"\"\u003e\n  \u003cimg src=\"assets/radar.png\" alt=\"Logo\" \u003e\n\u003c/a\u003e\n\n## 📢 News\n- 🚀 [03/12/2025] We release the code to build the \u003ca href=\"https://github.com/Liuziyu77/Visual-RFT/tree/main/dataset\"\u003edataset\u003c/a\u003e on your own data.\n- 🚀 [03/04/2025] We release our \u003ca href=\"https://arxiv.org/abs/2503.01785\"\u003ePaper\u003c/a\u003e.\n- 🚀 [03/04/2025] We upload our training datasets to \u003ca href=\"https://huggingface.co/collections/laolao77/virft-datasets-67bc271b6f2833eccc0651df\"\u003eHuggingface\u003c/a\u003e.\n- 🚀 [03/04/2025] We release **ViRFT** repository and our training code.\n\n## 💡 Highlights\n- 🔥 **Visual Reinforcement Fine-tuning (Visual-RFT)**: We introduce Visual Reinforcement Fine-tuning (**Visual-RFT**), which extends reinforcement learning with verified rewards on visual perception tasks that are effective with limited data for fine-tuning.\n- 🔥 **Verified Rewards**: We design different **verified rewards** for different visual tasks that enable efficient, high-quality reward computation at a negligible cost. 
This allows the seamless transfer of DeepSeek-R1-style reinforcement learning to the multi-modal domain.\n- 🔥 **Extensive Experiments**: We conduct **extensive experiments** on various visual perception tasks, including fine-grained image classification, open vocabulary object detection, few-shot object detection, and reasoning grounding.\n- 🔥 **Open Source**: We fully **open-source** the training code, training data, and evaluation scripts on GitHub to facilitate further research.\n\n\n\u003ca href=\"\"\u003e\n  \u003cimg src=\"assets/teaser.png\" alt=\"Logo\" \u003e\n\u003c/a\u003e\n\n\n## Framework\nThe **Visual-RFT** framework is shown below. The policy model generates a group of responses based on the input. Each response is passed through a verifiable reward function to compute the reward. Once the rewards for all outputs in the group have been computed, the quality of each response is evaluated relative to the group and used to update the policy model. To keep policy model training stable, **Visual-RFT** uses KL divergence to limit the difference between the policy model and the reference model. For ***more implementation details***, including data generation, the design of the ***verifiable reward***, and other aspects, please refer to our paper.\n\n\u003ca href=\"\"\u003e\n  \u003cimg src=\"assets/framework.png\" alt=\"Logo\" \u003e\n\u003c/a\u003e\n\n## 🛠️ Setup\n```\ngit clone https://github.com/Liuziyu77/Visual-RFT.git\nconda create -n Visual-RFT python=3.10\nconda activate Visual-RFT\nbash setup.sh\n```\n\n## Inference\nWe have uploaded the model trained on 200+ samples from the LISA dataset (\u003ca href=\"https://huggingface.co/Zery/Qwen2-VL-7B_visual_rft_lisa_IoU_reward\"\u003e🤗Huggingface\u003c/a\u003e). You can use it to evaluate the inference performance of **Reasoning Grounding**. 
For more details, refer to `demo`.\n\n## Training\n### Datasets\nTo train on our various visual perception tasks, first visit \u003ca href=\"https://huggingface.co/collections/laolao77/virft-datasets-67bc271b6f2833eccc0651df\"\u003eHuggingface Datasets\u003c/a\u003e to download the datasets. We have uploaded different datasets for different tasks.\n| Datasets             |Task  |Setting          | Description                                                                 |\n|------------------------------|------|----|-----------------------------------------------------------------------------|\n| laolao77/ViRFT_COCO   |Detection | -                 | It includes all categories from COCO, with a total of 6k entries.            |\n| laolao77/ViRFT_COCO_base65     | Detection |Open Vocabulary       | It includes 65 basic categories from COCO, with a total of 6k entries.      |\n| laolao77/ViRFT_COCO_8_cate_4_shot |  Detection| Few-shot | It includes 8 selected categories from COCO.                                 |\n| laolao77/ViRFT_LVIS_few_shot     |  Detection| Few-shot      | It includes 6 selected categories from COCO.                                 |\n| laolao77/ViRFT_CLS_flower_4_shot |  Classification| Few-shot     | It includes the 102 categories from the Flower102 dataset, with 4 images per category. |\n| laolao77/ViRFT_CLS_fgvc_aircraft_4_shot|  Classification| Few-shot | It includes the 100 categories from the FGVC-Aircraft dataset, with 4 images per category. |\n| laolao77/ViRFT_CLS_car196_4shot   |  Classification| Few-shot   | It includes the 196 categories from the Stanford Cars dataset, with 4 images per category. |\n| laolao77/ViRFT_CLS_pets37_4shot  |  Classification| Few-shot    | It includes the 37 categories from the Pets37 dataset, with 4 images per category. |\n| LISA dataset | Grounding | - | Reasoning Grounding|\n\u003e 🔔 If you want to build a dataset from your own data, you can refer to `dataset/build_dataset.ipynb`. 
Just provide a `json` file with `image`, `problem`, and `solution` fields.\n\n### GRPO\nAfter downloading the dataset, you can start training using the following example bash script. Our bash scripts are in ```/src/scripts```.\n\u003e 🔔 There's no need for prolonged training. For a dataset with only a few hundred samples, 200 steps should be sufficient.\n```\n# There's no need for prolonged training. For a dataset with only a few hundred samples, 200 steps should be sufficient.\nexport DEBUG_MODE=\"true\"\nexport LOG_PATH=\"./debug_log_2b_GRPO_coco_base65cate_6k.txt\"\n\nexport DATA_PATH=./share_data/ViRFT_COCO_base65   ### your local dataset downloaded from Hugging Face\nexport CKPT_PATH=./share_models/Qwen2-VL-2B-Instruct    ### Qwen2-VL-2B checkpoint path\nexport SAVE_PATH=./share_models/Qwen2-VL-2B-Instruct_GRPO_coco_base65cate_6k    ### save path\n\ntorchrun --nproc_per_node=\"8\" \\\n    --nnodes=\"1\" \\\n    --node_rank=\"0\" \\\n    --master_addr=\"127.0.0.1\" \\\n    --master_port=\"12345\" \\\n    src/open_r1/grpo.py \\\n    --output_dir ${SAVE_PATH}  \\\n    --model_name_or_path ${CKPT_PATH} \\\n    --dataset_name ${DATA_PATH} \\\n    --deepspeed local_scripts/zero3.json \\\n    --max_prompt_length 1024 \\\n    --per_device_train_batch_size 1 \\\n    --gradient_accumulation_steps 2 \\\n    --logging_steps 1 \\\n    --bf16 \\\n    --report_to wandb \\\n    --gradient_checkpointing false \\\n    --attn_implementation flash_attention_2 \\\n    --max_pixels 401408 \\\n    --num_train_epochs 1 \\\n    --run_name Qwen2-VL-2B_GRPO_coco_base65cate_6k \\\n    --save_steps 100 \\\n    --save_only_model true \\\n    --num_generations 8\n```\n\nIf you encounter an OOM (Out of Memory) issue during training, you can resolve it by configuring `zero3.json`. 
For the 7B model, if the issue persists after enabling `zero3.json`, you can try lowering the `num_generations` to 4.\n```\n--deepspeed ./local_scripts/zero3.json\n```\nMoreover, setting `--gradient_checkpointing` to `true` can save memory, allowing for a higher `--num_generations` limit, which leads to better training performance. However, it will slow down the training process.\n```\n--gradient_checkpointing True\n```\n### SFT\nWe use \u003ca href=\"https://github.com/hiyouga/LLaMA-Factory\"\u003eLLaMa-Factory\u003c/a\u003e for supervised fine-tuning (SFT) of the model. You can convert the downloaded dataset into the corresponding Qwen SFT format for training.\n\n## Evaluation\nWe conducted extensive experiments on various visual perception tasks, including **fine-grained image classification**, **open vocabulary object detection**, **few-shot object detection**, and **reasoning grounding**. **ViRFT** achieves remarkable performance improvements across these tasks with minimal data and computational cost, significantly surpassing supervised fine-tuning baselines.\n\n\u003e We provide a step-by-step tutorial for using the evaluation code. If you encounter any issues, feel free to open an issue.\n\n### COCO Evaluation\nYou can use the files in the ```coco_evaluation``` directory for model inference and obtain evaluation results. 
Our code supports multi-GPU evaluation and requires at least two GPUs.\n\nFor ***inference***: \n```\ncd ./coco_evaluation\npython Qwen2_VL_coco_infere.py\n```\nPlease note that some file paths and model paths in ```Qwen2_VL_coco_infere.py``` need to be modified.\n```\n### line 167-168, change to your model path and model base.\nmodel_path = \"./share_models/Qwen2-VL-2B-Instruct_RL/\"  # RL model\nmodel_base = \"./share_models/Qwen2-VL-2B-Instruct/\"  # original Qwen2-VL model\n### line 182, change to your coco val annotation path\nwith open('./data/coco/annotations/instances_val2017.json', 'r') as json_file:\n### line 224, modify according to your own image path.\nimage_path = './data/coco/val2017/'+image['file_name']    \n### line 231-241, select the categories you want to evaluate\nselected_cate = ['bus', 'train', 'fire hydrant', 'stop sign', 'cat', 'dog', 'bed', 'toilet']\n### line 350, results save path\nwith open(f'prediction_results.json', 'w') as json_file:\n```\nThe inference results will be saved in `JSON` format and later used for evaluation.\n\nFor ***evaluation***, just run ```./coco_evaluation/evaluation.ipynb``` step by step.\n\n### LVIS Evaluation\nYou can use the files in the ```lvis_evaluation``` directory for model inference and obtain evaluation results. 
Our code supports multi-GPU evaluation and requires at least two GPUs.\n\nFor ***inference***: \n```\ncd ./lvis_evaluation\npython Qwen2_VL_lvis_infere.py\n```\nPlease note that some file paths and model paths in ```Qwen2_VL_lvis_infere.py``` need to be modified.\n```\n### line 169-170, change to your model path and model base\nmodel_path = \"./share_models/Qwen2-VL-2B-Instruct_RL/\"  # RL model\nmodel_base = \"./share_models/Qwen2-VL-2B-Instruct/\"  # original Qwen2-VL model\n### line 184, change to your lvis val annotation path\nwith open('./data/lvis/annotations/lvis_v1_val.json', 'r') as json_file:\n### line 228, modify according to your own image path.\nimage_path = './data/lvis/' + \"/\".join(parts[-2:])   \n### line 234-242, select the categories you want to evaluate\nselected_cate = ['horse_buggy', 'die', 'kitchen_table', 'omelet', 'papaya', 'stepladder']\n### line 346, results save path\nwith open(f'prediction_results.json', 'w') as json_file:\n```\nThe inference results will be saved in `JSON` format and later used for evaluation.\n\nFor ***evaluation***, just run ```./lvis_evaluation/lvis_evaluation.ipynb``` step by step.\n\n### Classification Evaluation\nYou can use the files in the ```classification``` directory for model inference and obtain evaluation results. Our code supports multi-GPU evaluation and requires at least two GPUs.\n```\ncd ./classification\npython Qwen2_VL_classification_infere.py\n```\nPlease note that the model paths in ```Qwen2_VL_classification_infere.py``` need to be modified.\n```\n### line 61-63, change to your model path and model base\nmodel_path = \"./share_models/Qwen2-VL-2B-Instruct_RL/\"  # after RL\nmodel_base = \"./share_models/Qwen2-VL-2B-Instruct/\"  # original Qwen2-VL\n```\nInference and result computation are performed simultaneously. 
After the program finishes running, the number of correctly classified items is printed to the command line; accuracy is obtained by dividing this number by the size of the validation set (Flower102: 2463, Pets37: 3669, Stanford Cars: 8041, FGVC-Aircraft: 3333).\n\n\u003e 🔔 Sometimes, due to environment issues, the model may produce incorrect inferences when `use_cache = None`. You might consider explicitly setting `use_cache = True`.\n\u003e `generated_ids = model.generate(**inputs, max_new_tokens=1024, use_cache=True)`\n\n### Evaluation Results\n*We have conducted **extensive experiments**; please refer to our paper for further details*.\n\n\n### Case Study\nIn the following figure, we present some inference examples from **ViRFT**. We observe that the thinking process significantly enhances the reasoning and grounding ability with **ViRFT**. Through **ViRFT**, Qwen2-VL learns to think critically and carefully examine the image to produce accurate grounding results.\n\u003ca href=\"\"\u003e\n  \u003cimg src=\"assets/case_lisa.png\" alt=\"Logo\" \u003e\n\u003c/a\u003e\nWe also present some inference cases of the model when handling *fine-grained classification tasks*. 
These results demonstrate the strong generalization ability of **ViRFT** across various visual tasks.\n\u003ca href=\"\"\u003e\n  \u003cimg src=\"assets/case_cls.png\" alt=\"Logo\" \u003e\n\u003c/a\u003e\n\n\n\n## ✒️Citation\n```\n@article{liu2025visual,\n  title={Visual-RFT: Visual Reinforcement Fine-Tuning},\n  author={Liu, Ziyu and Sun, Zeyi and Zang, Yuhang and Dong, Xiaoyi and Cao, Yuhang and Duan, Haodong and Lin, Dahua and Wang, Jiaqi},\n  journal={arXiv preprint arXiv:2503.01785},\n  year={2025}\n}\n```\n\n## 📄 License\n![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg) ![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg) **Usage and License Notices**: The data and code are intended and licensed for research use only.\nLicense: Attribution-NonCommercial 4.0 International. Use should also abide by the policy of OpenAI: https://openai.com/policies/terms-of-use\n\n## Acknowledgement\nWe sincerely thank the projects \u003ca href=\"https://github.com/Deep-Agent/R1-V\"\u003eR1-V\u003c/a\u003e, \u003ca href=\"https://github.com/huggingface/open-r1\"\u003eOpen-R1\u003c/a\u003e, and \u003ca href=\"https://github.com/EvolvingLMMs-Lab/open-r1-multimodal\"\u003eOpen-r1-multimodal\u003c/a\u003e for providing their open-source resources.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLiuziyu77%2FVisual-RFT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FLiuziyu77%2FVisual-RFT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLiuziyu77%2FVisual-RFT/lists"}