{"id":28447790,"url":"https://github.com/opengvlab/tpo","last_synced_at":"2026-02-06T06:02:57.388Z","repository":{"id":270292331,"uuid":"908480979","full_name":"OpenGVLab/TPO","owner":"OpenGVLab","description":"Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment","archived":false,"fork":false,"pushed_at":"2025-07-22T07:47:48.000Z","size":12244,"stargazers_count":61,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-11-16T16:25:02.654Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenGVLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-26T07:13:32.000Z","updated_at":"2025-11-10T16:45:21.000Z","dependencies_parsed_at":"2024-12-30T04:26:45.968Z","dependency_job_id":"c588887e-db16-4fd4-a9d0-dd9ccd05cdd4","html_url":"https://github.com/OpenGVLab/TPO","commit_stats":null,"previous_names":["opengvlab/tpo"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/OpenGVLab/TPO","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FTPO","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FTPO/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FTPO/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FTPO/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/owners/OpenGVLab","download_url":"https://codeload.github.com/OpenGVLab/TPO/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FTPO/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29153144,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-06T02:39:25.012Z","status":"ssl_error","status_checked_at":"2026-02-06T02:37:22.784Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-06T12:07:13.484Z","updated_at":"2026-02-06T06:02:57.383Z","avatar_url":"https://github.com/OpenGVLab.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 👫 TPO\n\n\u003ca src=\"https://img.shields.io/badge/cs.CV-2412.19326-b31b1b?logo=arxiv\u0026logoColor=red\" href=\"https://arxiv.org/abs/2412.19326\"\u003e \u003cimg src=\"https://img.shields.io/badge/cs.CV-2412.19326-b31b1b?logo=arxiv\u0026logoColor=red\"\u003e\n\u003c/a\u003e | \u003ca src=\"https://img.shields.io/twitter/follow/opengvlab?style=social\" href=\"https://twitter.com/opengvlab\"\u003e\n    \u003cimg src=\"https://img.shields.io/twitter/follow/opengvlab?style=social\"\u003e \u003c/a\u003e\n\u003c/a\u003e | [![Hugging Face 
Model](https://img.shields.io/badge/Model-VideoChat--TPO-yellow?logo=Huggingface)](https://huggingface.co/OpenGVLab/VideoChat-TPO)\n\n## 💡 Introduction\nTask Preference Optimization (TPO) is a new method designed to enhance the performance of multimodal large language models (MLLMs) in handling visual tasks. Current MLLMs face challenges in precisely understanding visuals despite their capabilities in various vision applications. TPO addresses this by integrating differentiable task preferences from fine-grained visual tasks, introducing learnable task tokens to bridge the gap between task-specific heads and the MLLM. This results in improved multimodal capabilities and task-specific performance, with significant improvements demonstrated across multiple benchmarks and tasks.\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"medium/image.png\" width=\"400\" alt=\"TPO uses differentiable task preferences from dense visual supervisions via task-specific heads to enhance MLLMs in\n    fine-grained understanding.\"/\u003e\n    \u003cp\u003eFigure 1: TPO uses differentiable task preferences from dense visual supervisions via task-specific heads to enhance MLLMs in fine-grained understanding.\u003c/p\u003e\n\u003c/p\u003e\n\n- Enhanced Multimodal Performance: Achieves an average **14.6%** improvement in multimodal performance compared to baseline models on various image and video tasks, and demonstrates scalability across different MLLM architectures such as [VideoChat](https://github.com/OpenGVLab/TPO?tab=readme-ov-file#-model-zoo) and LLaVA.\n- Robust Zero-Shot Capabilities: Performs comparably to state-of-the-art supervised models in zero-shot scenarios across various vision tasks.\n- Synergistic Training: Multi-task co-training within TPO leads to mutual benefits, enhancing individual task performance beyond single-task training.\n\n\u003cp /\u003e\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"medium/frame.png\" width=\"640\" alt=\"TPO uses differentiable 
task preferences from dense visual supervisions via task-specific heads to enhance MLLMs in\n    fine-grained understanding.\"/\u003e\n    \u003cp\u003eFigure 2: Overall Pipeline of TPO. The architecture of Task Preference Optimization (TPO) consists of four main components: (1) a vision encoder, (2) a connector, (3) a large language model, and (4) a series of visual task heads. Differently colored flame symbols indicate which components are unfrozen at various stages of the training process.\u003c/p\u003e\n\u003c/p\u003e\n\n## 🏃 Installation\n\n1. Clone the repository:\n```bash\ngit clone https://github.com/OpenGVLab/TPO.git\n```\n2. Navigate to the project directory:\n```bash\ncd TPO\n```\n3. Install the required dependencies:\n```bash\npip install -r requirements.txt\n```\n4. Try the demo:\n```bash\npython app.py\n```\n\n## 🤖 Model Zoo\n\n| MLLM | Link | MVBench |\n| --- | --- | --- |\n| VideoChat-TPO | [huggingface](https://huggingface.co/OpenGVLab/VideoChat-TPO) | 66.8 |\n| LLaVA-OV-TPO | TBD | 64.8 |\n\n## Citation\n\n```\n@article{yan2024tpo,\n  title={Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment},\n  author={Yan, Ziang and Li, Zhilin and He, Yinan and Wang, Chenting and Li, Kunchang and Li, Xinhao and Zeng, Xiangyu and Wang, Zilei and Wang, Yali and Qiao, Yu and Wang, Limin and Wang, Yi},\n  journal={arXiv preprint arXiv:2412.19326},\n  year={2024}\n}\n```\n\n## Acknowledgement\n\nTPO is built with reference to the following projects: [VideoChat](https://github.com/OpenGVLab/Ask-Anything), [LLaVA-OV](https://github.com/LLaVA-VL/LLaVA-NeXT), [UMT](https://github.com/LAION-AI/CLIP_benchmark), [InternVideo2](https://github.com/OpenGVLab/InternVideo), [CG-DETR](https://github.com/wjun0830/CGDETR), and [SAM2](https://github.com/facebookresearch/sam2). 
Thanks for their work!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopengvlab%2Ftpo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopengvlab%2Ftpo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopengvlab%2Ftpo/lists"}