{"id":29009874,"url":"https://github.com/tencentarc/video-holmes","last_synced_at":"2025-06-25T15:33:35.810Z","repository":{"id":295039722,"uuid":"988896451","full_name":"TencentARC/Video-Holmes","owner":"TencentARC","description":"Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?","archived":false,"fork":false,"pushed_at":"2025-06-03T08:44:49.000Z","size":11625,"stargazers_count":49,"open_issues_count":1,"forks_count":0,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-06-03T16:40:11.731Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://video-holmes.github.io/Page.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TencentARC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-23T08:38:24.000Z","updated_at":"2025-06-03T10:21:24.000Z","dependencies_parsed_at":"2025-05-23T10:49:02.289Z","dependency_job_id":"cf83422b-a030-4af1-9ffe-fd256b245e85","html_url":"https://github.com/TencentARC/Video-Holmes","commit_stats":null,"previous_names":["tencentarc/video-holmes"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/TencentARC/Video-Holmes","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FVideo-Holmes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FVideo-Holmes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FVideo-Holmes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/TencentARC%2FVideo-Holmes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TencentARC","download_url":"https://codeload.github.com/TencentARC/Video-Holmes/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FVideo-Holmes/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261901405,"owners_count":23227593,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-25T15:33:23.461Z","updated_at":"2025-06-25T15:33:35.794Z","avatar_url":"https://github.com/TencentARC.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\r\n  \u003cimg src=\"assets/name.png\" height=200\u003e\r\n\u003c/p\u003e\r\n\u003chr\u003e\r\n\u003cdiv align=\"center\"\u003e\r\n\r\n## Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?\r\n\r\n\r\n**[Junhao Cheng\u003csup\u003e1,2\u003c/sup\u003e](https://donahowe.github.io/), \r\n[Yuying Ge\u003csup\u003e1,\u0026#9993;\u003c/sup\u003e](https://geyuying.github.io/), \r\n[Teng Wang\u003csup\u003e1,\u0026#9993;\u003c/sup\u003e](http://ttengwang.com/), \r\n[Yixiao Ge\u003csup\u003e1\u003c/sup\u003e](https://geyixiao.com/), \r\n[Jing Liao\u003csup\u003e2\u003c/sup\u003e](https://scholar.google.com/citations?user=3s9f9VIAAAAJ\u0026hl=en), \r\n[Ying Shan\u003csup\u003e1\u003c/sup\u003e](https://scholar.google.com/citations?user=4oXBp9UAAAAJ\u0026hl=en)**\r\n\u003cbr\u003e\r\n\u003csup\u003e1\u003c/sup\u003eARC Lab, 
Tencent PCG, \r\n\u003csup\u003e2\u003c/sup\u003eCity University of Hong Kong\r\n\u003cbr\u003e\r\n\r\n\u003ca href=\"https://video-holmes.github.io/Page.github.io/\" target=\"_blank\"\u003e\r\n    \u003cimg alt=\"Website\" src=\"https://img.shields.io/badge/🌎_Website-Video--Holmes-blue.svg\" height=\"20\" /\u003e\r\n\u003c/a\u003e\r\n\r\n\u003ca href=\"http://arxiv.org/abs/2505.21374\" target=\"_blank\"\u003e\r\n    \u003cimg alt=\"arXiv\" src=\"https://img.shields.io/badge/arXiv-Video--Holmes-red?logo=arxiv\" height=\"20\" /\u003e\r\n\u003c/a\u003e\r\n\r\n\u003ca href=\"https://huggingface.co/datasets/TencentARC/Video-Holmes\" target=\"_blank\"\u003e\r\n    \u003cimg alt=\"HF Dataset: Video--Holmes\" src=\"https://img.shields.io/badge/%F0%9F%A4%97%20_Benchmark-Video--Holmes-ffc107?color=ffc107\u0026logoColor=white\" height=\"20\" /\u003e\r\n\u003c/a\u003e\r\n\u003c/div\u003e\r\n\r\n## 🔎 Introduction\r\n\r\nVideo-Holmes is \u003cb\u003ea benchmark designed to evaluate the complex video reasoning capabilities of MLLMs\u003c/b\u003e. \r\n\r\nVideo-Holmes consists of 1,837 questions derived from 270 manually annotated \u003cb\u003esuspense short films\u003c/b\u003e (ranging from 1 to 5 minutes), spanning \u003cb\u003eseven carefully designed tasks\u003c/b\u003e. 
Each task is constructed by first identifying key events and causal relationships within the films, and then designing questions that require models to \u003cb\u003eactively locate and connect multiple relevant visual clues scattered across different video segments\u003c/b\u003e.\r\n\r\n⭐ Key Aspects of Video-Holmes:\r\n\r\n\u003cul style=\"list-style-type: disc; padding-left: 20px;\"\u003e\r\n\u003cli\u003e\u003cb\u003eOne-Click Evaluation:\u003c/b\u003e Videos, questions, and evaluation code are packaged on GitHub and \u003ca href=\"https://huggingface.co/datasets/TencentARC/Video-Holmes\" target=\"_blank\"\u003eHugging Face\u003c/a\u003e.\u003c/li\u003e\r\n\u003cli\u003e\u003cb\u003eHigh Reasoning Demand:\u003c/b\u003e The benchmark exposes a significant performance gap between reasoning models and non-reasoning models.\u003c/li\u003e\r\n\u003cli\u003e\u003cb\u003eReasoning Process Analysis:\u003c/b\u003e Clearly visualizes the reasons behind correct and incorrect model responses.\u003c/li\u003e\r\n\u003c/ul\u003e\r\n\r\nWe hope that Video-Holmes can serve as a \u003ci\u003e\"Holmes-test\"\u003c/i\u003e for multimodal reasoning, motivating models to reason more like humans and emphasizing the ongoing challenges in this field. 
Please visit our [homepage](https://video-holmes.github.io/Page.github.io/) for more details!\r\n\r\n\u003cimg src=\"assets/Teaser.png\" alt=\"Teaser Image\" style=\"width: 100%; height: auto;\"\u003e\r\n\r\n\r\n## 📅 News\r\n\r\n* [2025-05-29] 🔥We released the training set of Video-Holmes, which consists of 233 videos and 1,551 questions.\r\n* [2025-05-28] 🔥We released Video-Holmes and the corresponding evaluation code.\r\n\r\n## 🚩 Plan\r\n- [x] Release suspense short film annotations\r\n- [x] Release benchmark construction code\r\n- [x] Release training data\r\n- [x] Support evaluation with [VLMEvalKit](https://github.com/open-compass/VLMEvalKit)\r\n\r\n## 🏆 Leaderboard\r\n🏅 Best-performing model: [Gemini-2.5-Pro](https://gemini.google.com/)\r\n\r\n🏅 Best thinking model based on Qwen2.5-VL-7B: [Video-R1](https://github.com/tulerfeng/Video-R1)\r\n\r\n➡️ [Full leaderboard](https://video-holmes.github.io/Page.github.io#leaderboard)\r\n\r\n\u003e Please contact us at Howe4884@outlook.com to add your model to the leaderboard.\r\n\r\n\r\n## 🚀 Quick Start\r\n\r\nTo download Video-Holmes, you can run the following commands:\r\n```shell\r\ngit clone https://github.com/TencentARC/Video-Holmes.git\r\ncd Video-Holmes\r\npip install huggingface_hub\r\npython download.py --hf_token YOUR_HUGGINGFACE_ACCESS_TOKEN\r\nunzip Benchmark/videos.zip -d Benchmark/\r\nunzip Benchmark/annotations.zip -d Benchmark/\r\n```\r\n\r\nWe provide all-in-one evaluation code for baseline models:\r\n```shell\r\n# --model_path is optional\r\npython evaluate.py --model_name YOUR_MODEL_NAME --model_path YOUR_MODEL_PATH\r\n```\r\n\r\nSupported Model List:\r\n\r\n| QwenVL | QwenVL-RL | InternVL | Gemini |\r\n|----------------|----------------|----------------|----------------|\r\n| Qwen2.5-VL-7B  | VideoChat-R1  | InternVL2.5-8B | gemini-2.0-flash |\r\n| Qwen2.5-VL-32B | Video-R1  | InternVL3-8B | gemini-2.0-pro-exp | \r\n\r\nYou can also evaluate your own model by specifying the `--model_path` argument, or by implementing 
the following functions: `prepare_your_model` (line 388) and `generate_your_model` (line 439).\r\n\r\n\u003cdetails\u003e\r\n\u003csummary\u003e\u003cb\u003e🧐 Reasoning Process Analysis\u003c/b\u003e\u003c/summary\u003e\r\n  \r\nYou first need to apply for a [DeepSeek API key](https://platform.deepseek.com/api_keys); you can then run the following command to analyze the reasoning process of your models:\r\n\r\n```shell\r\npython evaluate_reasoning.py --model_name YOUR_MODEL_NAME --api_key YOUR_API_KEY\r\n```\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\r\n\u003csummary\u003e\u003cb\u003e🪄 Generate Your Holmes-Test\u003c/b\u003e\u003c/summary\u003e\r\n  \r\nTo generate questions for your own annotated videos, you can run the following commands:\r\n\r\n```shell\r\ncd Pipeline\r\npython generate_questions.py --api_key YOUR_API_KEY\r\n```\r\n\r\n\u003e Note: You can download each video from YouTube using its `VIDEO_ID` via `https://www.youtube.com/watch?v=VIDEO_ID`\r\n\u003c/details\u003e\r\n\r\n## 🧠 Training (\u003cspan style=\"font-family:serif;\"\u003e𝓃𝑒𝓌🔥\u003c/span\u003e)\r\n\r\nWe release the [training set](https://huggingface.co/datasets/TencentARC/Video-Holmes/blob/main/train_Video-Holmes.json) of Video-Holmes, which consists of 233 videos and 1,551 questions. 
Experimental results (as shown in the table below) demonstrate that performing RL post-training on this training set can further enhance the model's complex reasoning ability.\r\n\r\n| Model | SR | IMC | TCI | TA | MHR | PAR | CTI | Avg |\r\n|---|----|----|----|----|----|----|----|----|\r\n| Qwen2.5-VL-7B | 38.4 | 34.8 | 17.6 | 30.0 | 27.1 | 18.6 | 25.2 | 27.8 |\r\n| [Qwen2.5-VL-7B-GRPO-CARE](https://github.com/TencentARC/SEED-Bench-R1) | 42.8 | 35.1 | 25.6 | 40.5 | 29.2 | 29.9 | 32.6 | 33.5 |\r\n| [Qwen2.5-VL-7B-GRPO-CARE*](https://github.com/TencentARC/SEED-Bench-R1) | **46.2** | **44.9** | **31.5** | **49.5** | **39.2** | **37.1** | **37.4** | **40.7** |\r\n\r\n\r\n\u003e \\* denotes the model trained on Video-Holmes.\r\n\r\n## 🛠️ Construction Pipeline\r\n\r\nWe select 270 high-quality suspense short films for human annotation. Next, we design 7 challenging tasks and employ DeepSeek to generate questions. Finally, we evaluate SOTA MLLMs and, optionally, use DeepSeek to analyze their responses.\r\n\u003cimg src=\"assets/pipeline.png\" alt=\"Teaser Image\" style=\"width: 100%; height: auto;\"\u003e\r\n\r\n## 🗝️ Question Types\r\n\r\nExisting benchmarks primarily involve clue-given questions, where models depend on explicitly provided clues to derive answers. In contrast, Video-Holmes adopts an active seeking paradigm, requiring models to actively locate and connect multiple relevant visual clues scattered across different video segments.\r\n\u003cimg src=\"assets/Teaser2.png\" alt=\"Teaser Image\" style=\"width: 100%; height: auto;\"\u003e\r\n\r\n## :closed_book: License\r\n- Video-Holmes is released under the Apache-2.0 license for academic purposes only.\r\n- All videos in Video-Holmes are obtained from the Internet and are not the property of our institutions. 
Our institutions are not responsible for the content or the meaning of these videos. The copyright remains with the original owners of the videos.\r\n- If any video in our dataset infringes upon your rights, please contact us for removal.\r\n  \r\n## 📜 Citation\r\n\r\nIf you find our work helpful, please consider giving us a star ⭐ and a citation 📝\r\n\r\n```BibTeX\r\n@article{cheng2025video,\r\n  title={Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?},\r\n  author={Cheng, Junhao and Ge, Yuying and Wang, Teng and Ge, Yixiao and Liao, Jing and Shan, Ying},\r\n  journal={arXiv preprint arXiv:2505.21374},\r\n  year={2025}\r\n}\r\n```\r\n\r\n## 🤗 Acknowledgements\r\n\r\nWe referred to [MovieDreamer](https://github.com/aim-uofa/MovieDreamer) and [VCR-Bench](https://github.com/zhishuifeiqian/VCR-Bench) to build our codebase and homepage. Thanks for their wonderful projects.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftencentarc%2Fvideo-holmes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftencentarc%2Fvideo-holmes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftencentarc%2Fvideo-holmes/lists"}