{"id":29009902,"url":"https://github.com/tencentarc/divot","last_synced_at":"2025-06-25T15:33:40.158Z","repository":{"id":266371881,"uuid":"897739150","full_name":"TencentARC/Divot","owner":"TencentARC","description":"Diffusion Powers Video Tokenizer for Comprehension and Generation (CVPR 2025)","archived":false,"fork":false,"pushed_at":"2025-02-27T05:03:12.000Z","size":22849,"stargazers_count":64,"open_issues_count":2,"forks_count":2,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-02-27T06:19:06.319Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TencentARC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"License.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-03T06:38:42.000Z","updated_at":"2025-02-27T05:03:15.000Z","dependencies_parsed_at":"2024-12-04T00:33:07.683Z","dependency_job_id":null,"html_url":"https://github.com/TencentARC/Divot","commit_stats":null,"previous_names":["tencentarc/divot"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/TencentARC/Divot","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FDivot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FDivot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FDivot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FDivot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TencentARC","downl
oad_url":"https://codeload.github.com/TencentARC/Divot/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2FDivot/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261901407,"owners_count":23227593,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-25T15:33:38.778Z","updated_at":"2025-06-25T15:33:40.135Z","avatar_url":"https://github.com/TencentARC.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation\n[![arXiv](https://img.shields.io/badge/arXiv-2412.04432-b31b1b.svg)](https://arxiv.org/abs/2412.04432)\n[![Static Badge](https://img.shields.io/badge/Model-Huggingface-yellow)](https://huggingface.co/TencentARC/Divot)\n\n\n\u003eWe introduce [Divot](https://arxiv.org/abs/2412.04432), a **Di**ffusion-Powered **V**ide**o** **T**okenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. 
Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations.\nBuilding upon the Divot tokenizer, we present **Divot-LLM**, which handles video comprehension through video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model.\n\nAll models, training code and inference code are released! **Divot has been accepted to CVPR 2025.** :star_struck:\n\n\n## TODOs\n- [x] Release the pretrained tokenizer and de-tokenizer of Divot.\n- [x] Release the pretrained and instruction-tuned model of Divot-LLM.\n- [x] Release inference code of Divot.\n- [x] Release training and inference code of Divot-LLM.\n- [ ] Release training code of Divot.\n- [ ] Release de-tokenizer adaptation training code.\n\n## Introduction\n![image](assets/method.jpg?raw=true)\n\nWe utilize the diffusion procedure to learn **a video tokenizer** in a self-supervised manner for unified comprehension and\ngeneration, where the spatiotemporal representations serve as the\ncondition of a diffusion model to de-noise video clips. Additionally,\nthe proxy diffusion model functions as a **de-tokenizer** to decode\nrealistic video clips from the video representations.\n\nAfter training the Divot tokenizer, its video features are fed into the LLM to perform next-word prediction for video comprehension, while learnable queries are input into the LLM to model the distributions of Divot features using **a Gaussian Mixture Model (GMM)** for video generation. 
During inference,\nvideo features are sampled from the predicted GMM distribution to\ndecode videos using the de-tokenizer.\n\n## Showcase\n### Video Reconstruction of Divot\n\u003ctable class=\"center\" style=\"width: 100%; text-align: center;\"\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n                    \u003cspan style=\"text-align: center;\"\u003eInput\u003c/span\u003e\n      \u003cdiv style=\"display: flex; flex-direction: column; align-items: center;\"\u003e\n        \u003cimg src=\"assets/recon/beer_gt.gif\" width=\"170\"\u003e\n      \u003c/div\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n      \u003cspan style=\"text-align: center;\"\u003eInput\u003c/span\u003e\n      \u003cdiv style=\"display: flex; flex-direction: column; align-items: center;\"\u003e\n        \u003cimg src=\"assets/recon/camel_gt.gif\" width=\"170\"\u003e\n      \u003c/div\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n      \u003cspan style=\"text-align: center;\"\u003eInput\u003c/span\u003e\n      \u003cdiv style=\"display: flex; flex-direction: column; align-items: center;\"\u003e\n        \u003cimg src=\"assets/recon/car_gt.gif\" width=\"170\"\u003e\n      \u003c/div\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n     \u003cspan style=\"text-align: center;\"\u003eInput\u003c/span\u003e\n      \u003cdiv style=\"display: flex; flex-direction: column; align-items: center;\"\u003e\n        \u003cimg src=\"assets/recon/george_gt.gif\" width=\"170\"\u003e\n      \u003c/div\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n       \u003cspan\u003eReconstructed\u003c/span\u003e\n      \u003cdiv style=\"display: flex; flex-direction: column; align-items: center;\"\u003e\n        \u003cimg src=\"assets/recon/beer_recon.gif\" width=\"170\"\u003e\n      \u003c/div\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n       
\u003cspan\u003eReconstructed\u003c/span\u003e\n      \u003cdiv style=\"display: flex; flex-direction: column; align-items: center;\"\u003e\n        \u003cimg src=\"assets/recon/camel_recon.gif\" width=\"170\"\u003e\n      \u003c/div\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n       \u003cspan\u003eReconstructed\u003c/span\u003e\n      \u003cdiv style=\"display: flex; flex-direction: column; align-items: center;\"\u003e\n        \u003cimg src=\"assets/recon/car_recon.gif\" width=\"170\"\u003e\n      \u003c/div\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n       \u003cspan\u003eReconstructed\u003c/span\u003e\n      \u003cdiv style=\"display: flex; flex-direction: column; align-items: center;\"\u003e\n        \u003cimg src=\"assets/recon/george_recon.gif\" width=\"170\"\u003e\n      \u003c/div\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n### Video Comprehension of Divot-LLM\n![image](assets/comp.jpg?raw=true)\n\n\n### Video Generation of Divot-LLM\n\n\u003ctable class=\"center\" style=\"width: 100%; text-align: center;\"\u003e\n\n  \u003ctr\u003e\n    \u003ctd style=\"width: 25%;\"\u003eA person is applying eye makeup\u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003eA gorgeous girl is smiling\u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003eA time-lapse of clouds passing over a peaceful mountain lake with reflections of the peaks\u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003eBack view of a young woman dressed in a yellow\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n      \u003cimg src=\"assets/showcase/make_up.gif\" width=\"170\"\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n      \u003cimg src=\"assets/showcase/smile.gif\" width=\"170\"\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n      \u003cimg src=\"assets/showcase/lake.gif\" width=\"170\"\u003e\n    \u003c/td\u003e\n  
  \u003ctd style=\"width: 25%;\"\u003e\n      \u003cimg src=\"assets/showcase/walk.gif\" width=\"170\"\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"width: 25%;\"\u003eWater is slowly filling a glass\u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003ePeople cheer at fireworks display\u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003eA cute dog observing out the window\u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003eAn oil painting depicting a beach with wave\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n      \u003cimg src=\"assets/showcase/water.gif\" width=\"170\"\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n      \u003cimg src=\"assets/showcase/firework.gif\" width=\"170\"\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n      \u003cimg src=\"assets/showcase/dog.gif\" width=\"170\"\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n      \u003cimg src=\"assets/showcase/wave.gif\" width=\"170\"\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"width: 25%;\"\u003eA drone view of a stunning waterfall in a rainforest\u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003eA digital art piece of a cyberpunk cityscape at night\u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003eA guy wearing a jacket is driving a car\u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003eAn aerial shot of a vibrant hot air balloon festival\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n      \u003cimg src=\"assets/showcase/waterfall.gif\" width=\"170\"\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n      \u003cimg src=\"assets/showcase/city.gif\" width=\"170\"\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n      \u003cimg src=\"assets/showcase/car.gif\" width=\"170\"\u003e\n    
\u003c/td\u003e\n    \u003ctd style=\"width: 25%;\"\u003e\n      \u003cimg src=\"assets/showcase/balloon.gif\" width=\"170\"\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n### Video StoryTelling of Divot-LLM\n\n\u003ctable class=\"center\" style=\"width: 100%; text-align: center;\"\u003e\n\n\u003eInstruction: Generate a story about George's visit to the dentist.\n  \u003ctr\u003e\n    \u003ctd style=\"width: 33%;\"\u003eGeorge felt nervous as the kind dentist explained the check-up to make him comfortable.\u003c/td\u003e\n    \u003ctd style=\"width: 33%;\"\u003eAt the dentist's office, George opened wide so the dentist could examine his teeth.\u003c/td\u003e\n    \u003ctd style=\"width: 33%;\"\u003eGeorge then learns about dental hygiene from a friendly dentist showing him the tools and techniques.\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"width: 33%;\"\u003e\n      \u003cimg src=\"assets/story/dentist1.gif\" width=\"170\"\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"width: 33%;\"\u003e\n      \u003cimg src=\"assets/story/dentist2.gif\" width=\"170\"\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"width: 33%;\"\u003e\n      \u003cimg src=\"assets/story/dentist3.gif\" width=\"170\"\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n \u003ctable class=\"center\" style=\"width: 100%; text-align: center;\"\u003e\n\n  \u003eInstruction: Generate a story about George's fun-filled day in the kitchen.\n\n  \u003ctr\u003e\n    \u003ctd style=\"width: 33%;\"\u003eGeorge and his pal joyfully cook in the kitchen, creating a tasty snack with a big blue book.\u003c/td\u003e\n    \u003ctd style=\"width: 33%;\"\u003eA woman in the kitchen shows\nGeorge how to use a new kitchen gadget.\u003c/td\u003e\n    \u003ctd style=\"width: 33%;\"\u003eGeorge spreads the thick, peanutty\npaste, making a yummy snack.\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"width: 33%;\"\u003e\n      
\u003cimg src=\"assets/story/kitchen1.gif\" width=\"170\"\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"width: 33%;\"\u003e\n      \u003cimg src=\"assets/story/kitchen2.gif\" width=\"170\"\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"width: 33%;\"\u003e\n      \u003cimg src=\"assets/story/kitchen3.gif\" width=\"170\"\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n\n\u003c/table\u003e\n\n## Usage\n\n### Dependencies\n- Python \u003e= 3.8 ([Anaconda](https://www.anaconda.com/download/#linux) is recommended)\n- [PyTorch \u003e=2.1.0](https://pytorch.org/)\n- NVIDIA GPU + [CUDA](https://developer.nvidia.com/cuda-downloads)\n\n### Installation\nClone the repo and install the dependencies:\n\n  ```bash\n  git clone https://github.com/TencentARC/Divot.git\n  cd Divot\n  pip install -r requirements.txt\n  ```\n\n### Model Weights\nWe release the pretrained tokenizer and de-tokenizer, as well as the pre-trained and instruction-tuned Divot-LLM, in [Divot](https://huggingface.co/TencentARC/Divot/). Please download the checkpoints and save them under the folder `./pretrained`. For example, `./pretrained/Divot_tokenizer_detokenizer`.\n\n\nYou also need to download [Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) and [CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K), and save them under the folder `./pretrained`.\n\n### Inference\n#### Video Reconstruction with Divot\n```bash\npython3 src/tools/eval_Divot_video_recon.py\n```\n\n#### Video Comprehension with Divot-LLM\n```bash\npython3 src/tools/eval_Divot_video_comp.py\n```\n\n#### Video Generation with Divot-LLM\n```bash\npython3 src/tools/eval_Divot_video_gen.py\n```\n\n\n### Training\n#### Pre-training\n1. 
Download the checkpoints of pre-trained [Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) and [CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K), and save them under the folder `./pretrained`.\n2. Prepare the training data in webdataset format.\n3. Run the following script.\n```bash\nsh scripts/train_Divot_pretrain_comp_gen.sh\n```\n\n#### Instruction-tuning\n1. Download the checkpoints of the pre-trained Divot tokenizer and Divot-LLM in [Divot](https://huggingface.co/TencentARC/Divot/), and save them under the folder `./pretrained`.\n2. Prepare the instruction data in webdataset format (for generation) and jsonl format (for comprehension, where each line stores a dictionary specifying the video_path, question, and answer).\n3. Run the following script.\n```bash\n### For video comprehension\nsh scripts/train_Divot_sft_comp.sh\n\n### For video generation\nsh scripts/train_Divot_sft_gen.sh\n```\n\n#### Inference with your own model\n1. Obtain \"pytorch_model.bin\" with the following script.\n```bash\ncd train_output/sft_comp/checkpoint-xxxx\npython3 zero_to_fp32.py . pytorch_model.bin\n```\n2. Merge your trained LoRA weights with the original LLM using the following script.\n```bash\npython3 src/tools/merge_agent_lora_weight.py\n```\n3. 
Load your merged model in \"mistral7b_merged_xxx\" and the corresponding \"agent\" path. For example:\n```python\nllm_cfg_path = 'configs/clm_models/mistral7b_merged_sft_comp.yaml'\nagent_cfg_path = 'configs/clm_models/agent_7b_in64_out64_video_gmm_sft_comp.yaml'\n```\n\n\n## License\n`Divot` is licensed under the Apache License Version 2.0 for academic purposes only, except for the third-party components listed in [License](License.txt).\n\n## Citation\nIf you find the work helpful, please consider citing:\n```bibtex\n@article{ge2024divot,\n  title={Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation},\n  author={Ge, Yuying and Li, Yizhuo and Ge, Yixiao and Shan, Ying},\n  journal={arXiv preprint arXiv:2412.04432},\n  year={2024}\n}\n```\n\n## Acknowledge\nOur code for Divot tokenizer and de-tokenizer is built upon [DynamiCrafter](https://github.com/Doubiiu/DynamiCrafter). Thanks for their excellent work!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftencentarc%2Fdivot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftencentarc%2Fdivot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftencentarc%2Fdivot/lists"}