{"id":31833313,"url":"https://github.com/thuml/miniveo3-reasoner","last_synced_at":"2026-04-01T21:06:21.125Z","repository":{"id":318491708,"uuid":"1070632808","full_name":"thuml/MiniVeo3-Reasoner","owner":"thuml","description":"Thinking with Videos from Open-Source Priors. We reproduce chain-of-frames visual reasoning by fine-tuning open-source video models. Give it a star 🌟 if you find it useful.","archived":false,"fork":false,"pushed_at":"2025-10-12T08:11:00.000Z","size":17154,"stargazers_count":219,"open_issues_count":0,"forks_count":8,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-03-28T00:36:43.253Z","etag":null,"topics":["chain-of-frames","maze","veo3","video-diffusion-model","video-reasoning","visual-planning","visual-reasoning","wan","world-model"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thuml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-06T08:19:08.000Z","updated_at":"2026-03-08T08:58:26.000Z","dependencies_parsed_at":"2025-10-07T15:21:46.279Z","dependency_job_id":null,"html_url":"https://github.com/thuml/MiniVeo3-Reasoner","commit_stats":null,"previous_names":["thuml/miniveo3-reasoner"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/thuml/MiniVeo3-Reasoner","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thuml%2FMiniVeo3-Reasoner","tags_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/thuml%2FMiniVeo3-Reasoner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thuml%2FMiniVeo3-Reasoner/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thuml%2FMiniVeo3-Reasoner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thuml","download_url":"https://codeload.github.com/thuml/MiniVeo3-Reasoner/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thuml%2FMiniVeo3-Reasoner/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31291995,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-01T13:12:26.723Z","status":"ssl_error","status_checked_at":"2026-04-01T13:12:25.102Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chain-of-frames","maze","veo3","video-diffusion-model","video-reasoning","visual-planning","visual-reasoning","wan","world-model"],"created_at":"2025-10-11T23:52:01.114Z","updated_at":"2026-04-01T21:06:21.118Z","avatar_url":"https://github.com/thuml.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\u003cimg src=\"assets/miniveo3-reasoner-logo-pure.png\" width=\"200px\" alt=\"MiniVeo3-Reasoner icon\" /\u003e\u003c/p\u003e\n\u003ch1 align=\"center\"\u003e MiniVeo3-Reasoner: 
Thinking with Videos from Open-Source Priors \u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/thuml/MiniVeo3-Reasoner\"\u003e\u003cimg src=\"https://img.shields.io/badge/GitHub-000000?style=for-the-badge\u0026logo=github\u0026logoColor=white\" alt=\"GitHub\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://huggingface.co/thuml/MiniVeo3-Reasoner-Maze-5B\"\u003e\u003cimg src=\"https://img.shields.io/badge/🤗_HuggingFace-fcd022?style=for-the-badge\" alt=\"HuggingFace\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n## 🎯 Overview\n\nAdvanced video models have recently demonstrated remarkable [zero-shot capabilities of visual reasoning](https://video-zero-shot.github.io/), solving tasks like maze, symmetry, and analogy completion through a **chain-of-frames (CoF)** reasoning process.\n\nThis project shows that such CoF capability can be **acquired by fine-tuning open-source video models** like [Wan2.2](https://github.com/Wan-Video/Wan2.2).\n\nIn the maze domain, the fine-tuned models—dubbed **MiniVeo3-Reasoner**—exhibit **surprisingly strong visual reasoning performance**, achieving **near-perfect accuracy** on in-distribution tests and **robust out-of-distribution generalization**.\n\nUnder controlled comparisons, MiniVeo3-Reasoner **significantly outperforms baseline approaches** that reason in other modalities such as text or images.\n\nWe further envision that this visual reasoning capability can be **enhanced through reinforcement learning of video models**.\n\n![method overview](assets/overview.png)\n\n## 🔥 News\n\n- 🚩 **2025.10**: We are thrilled to release MiniVeo3-Reasoner, with mazes as a testbed for visual reasoning!\n\n## 🤗 Models\n\n| Models                    | Download Links                                                           | Description                                                                                                                                                                             |\n| 
------------------------- | ------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| MiniVeo3-Reasoner-Maze-5B | 🤗 [HuggingFace](https://huggingface.co/thuml/MiniVeo3-Reasoner-Maze-5B) | Fine-tuned LoRA for [Maze](https://github.com/understanding-search/maze-dataset) tasks (3x3 to 6x6 sizes) from the base model [Wan2.2-TI2V-5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B) |\n\n## ✨ Examples\n\n\u003ctable style=\"width: 100%; text-align: center; margin-top: 20px;\"\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e \u003cb\u003eProblem Setup\u003c/b\u003e\u003c/td\u003e\n        \u003ctd colspan=2\u003e \u003cb\u003eExamples\u003c/b\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eMaze 3x3\u003c/td\u003e\n      \u003ctd \u003e\n          \u003cvideo src=\"https://github.com/user-attachments/assets/2621f354-b180-4d9a-b508-bbb39a9eda74\" width=\"100%\" controls autoplay loop\u003e\u003c/video\u003e\n      \u003c/td\u003e\n        \u003ctd\u003e\n          \u003cvideo src=\"https://github.com/user-attachments/assets/c7984e4b-24dd-4f84-9132-22d2c60f38f9\" width=\"100%\" controls autoplay loop\u003e\u003c/video\u003e\n      \u003c/td\u003e\n  \u003c/tr\u003e\n   \u003ctr\u003e\n        \u003ctd\u003eMaze 4x4\u003c/td\u003e\n      \u003ctd \u003e\n          \u003cvideo src=\"https://github.com/user-attachments/assets/eb07653b-223d-47ac-aa6a-3d8eef371c46\" width=\"100%\" controls autoplay loop\u003e\u003c/video\u003e\n      \u003c/td\u003e\n        \u003ctd\u003e\n          \u003cvideo src=\"https://github.com/user-attachments/assets/b759a98f-50ea-425a-9aea-221585a5b96b\" width=\"100%\" controls autoplay loop\u003e\u003c/video\u003e\n      \u003c/td\u003e\n  \u003c/tr\u003e\n     \u003ctr\u003e\n        \u003ctd\u003eMaze 
5x5\u003c/td\u003e\n      \u003ctd \u003e\n          \u003cvideo src=\"https://github.com/user-attachments/assets/9ee1e2f0-11a5-4d94-8c42-b7dd49f245d2\" width=\"100%\" controls autoplay loop\u003e\u003c/video\u003e\n      \u003c/td\u003e\n        \u003ctd\u003e\n          \u003cvideo src=\"https://github.com/user-attachments/assets/fc54cda0-c4ea-4804-a4f5-276a4eba13a2\" width=\"100%\" controls autoplay loop\u003e\u003c/video\u003e\n      \u003c/td\u003e\n  \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eMaze 6x6\u003c/td\u003e\n      \u003ctd \u003e\n          \u003cvideo src=\"https://github.com/user-attachments/assets/3b1e8a42-bffc-43ef-a600-65e263104408\" width=\"100%\" controls autoplay loop\u003e\u003c/video\u003e\n      \u003c/td\u003e\n        \u003ctd\u003e\n          \u003cvideo src=\"https://github.com/user-attachments/assets/2c69b4d5-3818-4179-8714-7de9d2107122\" width=\"100%\" controls autoplay loop\u003e\u003c/video\u003e\n      \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n### OOD Generalization\n\nOOD Solution Lengths:\n\n\u003ctable style=\"width: 100%; text-align: center; margin-top: 20px;\"\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e \u003cb\u003eProblem Setup\u003c/b\u003e\u003c/td\u003e\n        \u003ctd colspan=2\u003e \u003cb\u003eExamples\u003c/b\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd\u003eMaze 6x6 \u003cbr/\u003e (solution len \u003e 12)\u003c/td\u003e\n      \u003ctd \u003e\n          \u003cvideo src=\"https://github.com/user-attachments/assets/5974d363-a928-404b-8c8a-b51c92778f1b\" width=\"100%\" controls autoplay loop\u003e\u003c/video\u003e\n      \u003c/td\u003e\n        \u003ctd\u003e\n          \u003cvideo src=\"https://github.com/user-attachments/assets/46fda423-80c6-4831-a53a-3c9f817ff594\" width=\"100%\" controls autoplay loop\u003e\u003c/video\u003e\n      \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\nOOD Maze Sizes:\n\n\u003ctable style=\"width: 100%; 
text-align: center; margin-top: 20px;\"\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e \u003cb\u003eProblem Setup\u003c/b\u003e\u003c/td\u003e\n        \u003ctd colspan=2\u003e \u003cb\u003eExamples\u003c/b\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eMaze 7x7\u003c/td\u003e\n      \u003ctd \u003e\n          \u003cvideo src=\"https://github.com/user-attachments/assets/d83174ba-7dbf-4397-a33b-de995450dcfa\" width=\"100%\" controls autoplay loop\u003e\u003c/video\u003e\n      \u003c/td\u003e\n        \u003ctd\u003e\n          \u003cvideo src=\"https://github.com/user-attachments/assets/7c24b7ee-65e9-4dfc-8dab-aeca7cd0f631\" width=\"100%\" controls autoplay loop\u003e\u003c/video\u003e\n      \u003c/td\u003e\n  \u003c/tr\u003e\n   \u003ctr\u003e\n        \u003ctd\u003eMaze 8x8\u003c/td\u003e\n      \u003ctd \u003e\n          \u003cvideo src=\"https://github.com/user-attachments/assets/04fdd1aa-cd01-4a87-8a6f-0f398d51cf5b\" width=\"100%\" controls autoplay loop\u003e\u003c/video\u003e\n      \u003c/td\u003e\n        \u003ctd\u003e\n          \u003cvideo src=\"https://github.com/user-attachments/assets/07e63fd8-d224-4c64-b2e6-c483c2069857\" width=\"100%\" controls autoplay loop\u003e\u003c/video\u003e\n      \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n## 📊 Performance\n\nFollowing [Visual Planning: Let's Think Only with Images](https://arxiv.org/abs/2505.11409), we report two metrics:\n\n- **Exact Match (EM)** measures whether the model **successfully generates the complete and correct trajectory** that matches the optimal (shortest valid) path.\n- **Progress Rate (PR)** measures the ratio of the **number of consecutively correct steps** (valid forward moves) from the start to the number of steps in the optimal path.\n\n| MiniVeo3-Reasoner-Maze-5B      | EM (%) | PR (%) |\n| ------------------------------ | ------ | ------ |\n| Maze 3x3                       | 100    | 100    |\n| Maze 4x4                       | 100    | 
100    |\n| Maze 5x5                       | 100    | 100    |\n| Maze 6x6                       | 98.4   | 98.7   |\n| Maze 6x6 (OOD solution length) | 53.6   | 59.7   |\n| Maze 7x7 (OOD size)            | 86.8   | 90.1   |\n| Maze 8x8 (OOD size)            | 60.4   | 67.8   |\n\n### Comparisons\n\nUnder the same amount of training data, we include performance metrics reported in [Visual Planning](https://arxiv.org/abs/2505.11409) for reference and comparison.\n\n| Model                            | Thinking Modality | Maze EM (%) | Maze PR (%) |\n| -------------------------------- | ----------------- | ----------- | ----------- |\n| Gemini 2.0 Flash - Direct        | Text              | 8.3         | 31.4        |\n| Gemini 2.0 Flash - CoT           | Text              | 6.9         | 29.8        |\n| Gemini 2.0 Pro (think)           | Text              | 21.5        | 35.5        |\n| Qwen 2.5-VL-Instruct-3B - Direct | Text              | 0.5         | 13.6        |\n| Qwen 2.5-VL-Instruct-3B - CoT    | Text              | 0.8         | 8.2         |\n| Qwen 2.5-VL-Instruct-3B - SFT    | Text              | 33.3        | 52.7        |\n| LVM-3B - VPFT                    | Image             | 59.0        | 64.0        |\n| LVM-3B - VPRL                    | Image             | 74.5        | 77.6        |\n| MiniVeo3-Reasoner-Maze-5B        | Video             | **99.6**    | **99.7**    |\n\n## 🚀 Get Started\n\n### Environment Setup\n\n```bash\nconda create -n miniveo3_reasoner python==3.12\nconda activate miniveo3_reasoner\npip install -r requirements.txt\n```\n\nWe use [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio/tree/ed256ef8be195d5deae2846a7e9f025670d99db3) for diffusion model training and inference. 
You also need to install it:\n\n```bash\ngit clone https://github.com/modelscope/DiffSynth-Studio.git\ncd DiffSynth-Studio\ngit checkout ed256ef8be195d5deae2846a7e9f025670d99db3\npip install -e .\n```\n\n### Data Preparation\n\nOur data generator produces a series of mazes with configurable size, path length, and count, outputting a `.mp4` video file and a `.png` image (the first frame of the video).\n\nWe use a customized version of [maze-dataset](https://github.com/understanding-search/maze-dataset). You can install it as follows:\n\n```bash\npip install -e data/maze/maze-dataset\n```\n\nAfter installation, use the script below to generate mazes with custom configurations:\n\n```bash\npython data/maze/maze_generator.py\n```\n\nTo reproduce the same data distribution used in our experiments, simply run:\n\n```bash\nbash scripts/generate_maze_dataset.sh\n```\n\nThe generated training and test sets are written to `dataset/maze_train` and `dataset/maze_test`, respectively.\n\n### Inference\n\nDownload our [LoRA weights](https://huggingface.co/thuml/MiniVeo3-Reasoner-Maze-5B):\n\n```bash\npip install \"huggingface_hub[cli]\"\nhuggingface-cli download thuml/MiniVeo3-Reasoner-Maze-5B --local-dir models/thuml/MiniVeo3-Reasoner-Maze-5B\n```\n\nTo run inference on a single file or directory, use:\n\n```bash\npython inference/maze/inference_maze.py [-r] filename_or_directory\n```\n\n\u003e 💡 The first run may take additional time to automatically download the base model files.\n\nTo perform inference on all test samples, simply run:\n\n```bash\nbash scripts/inference_maze_testset.sh\n```\n\n### Success Evaluation\n\nOur evaluator compares the predicted trajectory with the ground truth, computing the distance between the two paths.\n\nWe implement our own versions of Exact Match (EM) and Progress Rate (PR) metrics for video-based evaluation.\n\nIf your generated results are stored in `dataset/maze_test` and named properly, you can evaluate all test samples by running:\n\n```bash\nbash 
scripts/evaluate_maze.sh\n```\n\n### Training Models\n\nWe train Wan2.2-TI2V-5B with LoRA, following the instructions provided in DiffSynth-Studio. You can easily fine-tune your own models using the same framework.\n\nFor convenience, if you follow our setup, you can copy the training dataset `dataset/maze_train` directly into `DiffSynth-Studio/data/example_video_dataset`.\n\n## 🤝 Contributors\n\n[Jialong Wu](https://manchery.github.io/)\\*, [Tianhao Huang](https://github.com/MrH2T)\\*, [Changjing He](https://github.com/hcjqwq)\\*, [Mingsheng Long](https://ise.thss.tsinghua.edu.cn/~mlong/). (\\*: Equal Contribution)\n\nWe welcome contributions! Feel free to open [GitHub issues](https://github.com/thuml/MiniVeo3-Reasoner/issues) for bug reports or feature requests.\n\n## 💡 Acknowledgements\n\n- [Veo 3](https://video-zero-shot.github.io/): This project is inspired by the impressive zero-shot performance of Veo 3!\n- [Wan](https://github.com/Wan-Video/Wan2.2): Powerful open-source video diffusion models used as base models.\n- [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio/tree/main/examples/wanvideo): Video diffusion model training.\n- [maze-dataset](https://github.com/understanding-search/maze-dataset): Data generation for maze reasoning tasks.\n- [Visual Planning](https://github.com/yix8/VisualPlanning): Baseline benchmark for performance comparison.\n- [Nano Banana](https://aistudio.google.com/models/gemini-2-5-flash-image): Helped generate the project logo.\n\n## 📜 Citation\n\nThere is currently no technical report available.\n\nIf you find MiniVeo3-Reasoner useful, we would appreciate it if you could cite our work:\n\n```\n@misc{miniveo3reasoner,\n    title = {MiniVeo3-Reasoner: Thinking with Videos from Open-Source Priors},\n    author = {Jialong Wu and Tianhao Huang and Changjing He and Mingsheng Long},\n    year = {2025},\n    publisher = {GitHub},\n    journal = {GitHub repository},\n    howpublished = 
{\\url{https://github.com/thuml/MiniVeo3-Reasoner}},\n}\n```\n\n## 🌟 Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=thuml/MiniVeo3-Reasoner\u0026type=Date)](https://www.star-history.com/#thuml/MiniVeo3-Reasoner\u0026Date)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthuml%2Fminiveo3-reasoner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthuml%2Fminiveo3-reasoner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthuml%2Fminiveo3-reasoner/lists"}