{"id":31660381,"url":"https://github.com/internlm/caprl","last_synced_at":"2025-10-07T17:06:32.177Z","repository":{"id":317062819,"uuid":"1062451970","full_name":"InternLM/CapRL","owner":"InternLM","description":"Captioning Reinforcement Learning","archived":false,"fork":false,"pushed_at":"2025-09-28T15:14:10.000Z","size":23344,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-28T16:38:05.410Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/InternLM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-23T09:20:07.000Z","updated_at":"2025-09-28T15:14:13.000Z","dependencies_parsed_at":"2025-09-29T22:34:22.689Z","dependency_job_id":null,"html_url":"https://github.com/InternLM/CapRL","commit_stats":null,"previous_names":["internlm/caprl"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/InternLM/CapRL","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InternLM%2FCapRL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InternLM%2FCapRL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InternLM%2FCapRL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InternLM%2FCapRL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/InternLM","download_url":"https://codeload.github.com/InternLM/CapRL/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InternLM%2FCapRL/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278811851,"owners_count":26050183,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-07T02:00:06.786Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-10-07T17:06:28.695Z","updated_at":"2025-10-07T17:06:32.172Z","avatar_url":"https://github.com/InternLM.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n\u003c!--   \u003ch1 align=\"center\"\u003e\u003cimg src=\"assets/logo.png\" width=\"256\"\u003e\u003c/h1\u003e --\u003e\n  \u003ch1 align=\"center\"\u003eCapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning\u003c/h1\u003e\n    \u003cp align=\"center\"\u003e\n    \u003ca 
    <a href="https://github.com/Cooperx521"><strong>Long Xing*</strong></a>
    ·
    <a href="https://lightdxy.github.io/"><strong>Xiaoyi Dong*</strong></a>
    ·
    <a href="https://yuhangzang.github.io/"><strong>Yuhang Zang</strong></a>
    ·
    <a href="https://scholar.google.com/citations?user=sJkqsqkAAAAJ"><strong>Yuhang Cao</strong></a>
    ·
    <a href="https://scholar.google.com/citations?user=P4yNnSkAAAAJ&hl=zh-TW"><strong>Jianze Liang</strong></a>
    ·
    <a href="https://github.com/shikiw"><strong>Qidong Huang</strong></a>
    ·
    <a href="https://myownskyw7.github.io/"><strong>Jiaqi Wang</strong></a>
    ·
    <a href="https://scholar.google.com/citations?user=5bInRDEAAAAJ&hl=zh-CN"><strong>Feng Wu</strong></a>
    ·
    <a href="http://dahua.site/"><strong>Dahua Lin</strong></a>
  </p>
  <p align="center">
    📖 <a href="https://arxiv.org/abs/2509.22647">Paper</a> |
    🤗 <a href="https://huggingface.co/internlm/CapRL-3B">CapRL-3B Model</a> |
    🤗 <a href="https://huggingface.co/datasets/internlm/CapRL-2M">CapRL-2M Dataset</a> |
    🤗 <a href="https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189">CapRL Collection</a> |
    🤗 <a href="https://huggingface.co/papers/2509.22647">Daily Paper</a>
  </p>
</p>

🌈 We are excited to introduce **CapRL-3B**, a lightweight 3B image captioner that achieves perception capabilities comparable to Qwen2.5-VL-72B.
By employing the CapRL training framework, initializing from the Qwen2.5-VL-3B model, and using a carefully filtered 75K QA dataset as the training set, we obtained a highly capable captioner, CapRL-3B.

<img src="assets/teaser.png" alt="CapRL teaser">
<img src="assets/performance.png" alt="CapRL performance">

## 📢 News
- 🚀 [09/25/2025] We release the **CapRL** repository, model, evaluation code, and dataset.

## 💡 Highlights
- 🔥 **Remarkable visual understanding of charts, infographics, and documents**: CapRL-3B achieves perception accuracy and visual information coverage comparable to Qwen2.5-VL-72B.
- 🔥 **Well-organized output**: The outputs of CapRL-3B are relatively well-structured, making them clear and easy to understand.
- 🔥 **Detailed descriptions of natural images**: The outputs of CapRL-3B cover the valid visual information in the image while containing fewer hallucinations.

## 👨‍💻 Todo

- [ ] Release training code.
- [ ] Release the 75K QA dataset.
- [ ] Release CapRL-series models on stronger base models.

## 🛠️ Setup
```
git clone https://github.com/InternLM/CapRL.git
conda create -n CapRL python=3.10
conda activate CapRL
bash setup.sh
```

## ⭐️ Quick Start
If you want to use **CapRL-3B** for captioning, you can directly follow the exact same inference approach as the [Qwen2.5-VL series](https://github.com/QwenLM/Qwen3-VL/tree/d2240f11656bfe404b9ba56db4e51cd09f522ff1).

We recommend using **vLLM** to speed up inference.

### Start an OpenAI API Service

Run the command below to start an OpenAI-compatible API service:

```bash
vllm serve "/PATH/CapRL-3B" \
    --trust-remote-code \
    --tensor-parallel-size=1 \
    --pipeline-parallel-size=1 \
    --gpu_memory_utilization=0.95 \
    --served-model-name=caprl \
    --port 8000 \
    --host 0.0.0.0
```

Then you can use the chat API as below (see the [OpenAI API protocol document](https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images) for more details):

```python
import base64
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Encode a local image as a base64 data URL.
image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
base64_qwen = f"data:image;base64,{encoded_image_text}"

max_tokens = 4096  # e.g.; adjust to the desired output length

chat_response = client.chat.completions.create(
    model="caprl",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": base64_qwen
                    },
                },
                {"type": "text", "text": "What is the text in the illustration?"},
            ],
        },
    ],
    temperature=1.0,
    max_tokens=max_tokens,
    top_p=1.0,
    extra_body={
        "repetition_penalty": 1.0,
    },
)
print("Chat response:", chat_response)
```

## Pretraining

### Datasets

Our **CapRL-2M** dataset is available on
[🔗 Hugging Face](https://huggingface.co/datasets/internlm/CapRL-2M).

It includes images from [ShareGPT-1M](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V) and [DenseFusion-1M](https://huggingface.co/datasets/BAAI/DenseFusion-1M), with high-quality captions re-annotated using CapRL-3B, totaling 2M samples.

In our JSONL files, we provide the captions along with their corresponding image paths. The images can be downloaded from ShareGPT-1M and DenseFusion-1M.
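The snippet below is a minimal sketch of how such a JSONL file could be joined with the downloaded images. The `image` and `caption` field names and the paths are assumptions for illustration and may differ from the released files.

```python
import json
from itertools import islice
from pathlib import Path

# Illustrative sketch only: the field names ("image", "caption") and paths below
# are assumptions and may not match the released CapRL-2M JSONL schema exactly.
CAPRL_JSONL = "caprl_2m/captions.jsonl"   # a CapRL-2M annotation file (hypothetical path)
IMAGE_ROOT = Path("/data/caprl_images")   # where the ShareGPT-1M / DenseFusion-1M images were downloaded


def iter_caption_pairs(jsonl_path: str, image_root: Path):
    """Yield (image_path, caption) pairs, skipping records whose image file is missing."""
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            image_path = image_root / record["image"]
            if image_path.exists():
                yield image_path, record["caption"]


if __name__ == "__main__":
    # Print the first few pairs as a sanity check.
    for image_path, caption in islice(iter_caption_pairs(CAPRL_JSONL, IMAGE_ROOT), 3):
        print(image_path, "->", caption[:80])
```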
### Reproducing Pretraining Experiments

To reproduce the pretraining experiments presented in our paper:

1. **Initialize Qwen2.5-VL.**
   Follow the steps in the notebook [`initiallize_vlm_3b.ipynb`](https://github.com/Cooperx521/ScaleCap/blob/892ad0682defa37f54833c3c4284a9d9a5c3451e/grocery_file/initiallize_vlm_3b.ipynb) to set up the Qwen2.5-VL model for training.

2. **Training.**
   You can then use [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) directly to run the training process.


## Comparing Caption Quality via the Prism Framework

We evaluate caption quality by **decoupling the traditional VQA (Visual Question Answering) task**:

1. First, a model generates a **caption** for the image.
2. Then, a **language model** answers questions based solely on the generated caption.

This approach allows us to assess the **informational quality and completeness** of the generated captions: if the language model can accurately answer visual questions based only on the caption, the caption is likely high-quality.

The complete evaluation scripts can be found in the `Prism_Evaluation` folder, with the core implementation located in `Eval_CapRL.py`.

The model used for answering questions based on captions is [CapRL-Eval-3B](https://huggingface.co/internlm/CapRL-Eval-3B), a finetuned version of Qwen2.5-VL-3B. For tasks such as ChartQA (which are not multiple-choice), it provides more stable output formatting.

You can specify `--reward-model-path` as the path to **CapRL-Eval-3B** in `Eval_CapRL.py`.
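The sketch below illustrates this decoupled pipeline against two OpenAI-compatible endpoints: the CapRL-3B captioner served with vLLM as in the Quick Start, and a second text-only endpoint (e.g. CapRL-Eval-3B) that answers a question from the caption alone. It is a simplified illustration of the idea rather than the `Eval_CapRL.py` implementation; the ports, served model names, and prompts are assumptions.

```python
import base64
from openai import OpenAI

# Two OpenAI-compatible services are assumed: the CapRL-3B captioner from the
# Quick Start (port 8000) and a text-only answering model (port 8001).
# Ports, served model names, and prompts are illustrative assumptions.
captioner = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
answerer = OpenAI(api_key="EMPTY", base_url="http://localhost:8001/v1")


def caption_image(image_path: str) -> str:
    """Stage 1: the captioner describes the image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = captioner.chat.completions.create(
        model="caprl",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image;base64,{image_b64}"}},
                {"type": "text", "text": "Describe this image in detail."},
            ],
        }],
        max_tokens=1024,
    )
    return response.choices[0].message.content


def answer_from_caption(caption: str, question: str) -> str:
    """Stage 2: a language model answers the question using only the caption (no image)."""
    response = answerer.chat.completions.create(
        model="caprl-eval",
        messages=[{
            "role": "user",
            "content": f"Caption:\n{caption}\n\nQuestion: {question}\nAnswer briefly.",
        }],
        max_tokens=64,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    caption = caption_image("/path/to/chart.png")
    print(answer_from_caption(caption, "What is the title of the chart?"))
```

Comparing the answers against the original VQA ground truth then scores how much of the image's information the caption preserved.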
### Cases
<img src="assets/comparison.png" alt="Caption comparison">
<img src="assets/info_caprl.png" alt="Infographic captioning example">
<img src="assets/info_caprl2.png" alt="Infographic captioning example">
<img src="assets/natural_caprl.png" alt="Natural image captioning example">

## 📄 License
![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg) ![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg)

**Usage and License Notices**: The data and code are intended and licensed for research use only.
The data is released under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license and must also abide by OpenAI's terms of use: https://openai.com/policies/terms-of-use

## ❤️ Acknowledgments
- [Open-LLaVA-NeXT](https://github.com/xiaoachen98/Open-LLaVA-NeXT): Thanks for the impressive open-source dataset.
- [VLMEvalKit](https://github.com/open-compass/VLMEvalKit): the amazing open-source suite for evaluating various LMMs!