{"id":13563201,"url":"https://github.com/google-research/magvit","last_synced_at":"2025-03-27T20:15:35.710Z","repository":{"id":64707645,"uuid":"572747506","full_name":"google-research/magvit","owner":"google-research","description":"Official JAX implementation of MAGVIT: Masked Generative Video Transformer","archived":false,"fork":false,"pushed_at":"2024-01-17T18:06:44.000Z","size":148,"stargazers_count":980,"open_issues_count":21,"forks_count":44,"subscribers_count":63,"default_branch":"main","last_synced_at":"2025-03-20T18:15:53.824Z","etag":null,"topics":["generative-model","transformers","video-generation"],"latest_commit_sha":null,"homepage":"https://magvit.cs.cmu.edu","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google-research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-12-01T00:09:33.000Z","updated_at":"2025-03-20T09:44:52.000Z","dependencies_parsed_at":"2024-08-01T13:19:07.996Z","dependency_job_id":"904acc05-9da5-4426-8771-0c1cdeae5472","html_url":"https://github.com/google-research/magvit","commit_stats":null,"previous_names":["google-research/magvit"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fmagvit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fmagvit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fmagvit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/google-research%2Fmagvit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/google-research","download_url":"https://codeload.github.com/google-research/magvit/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245916732,"owners_count":20693398,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["generative-model","transformers","video-generation"],"created_at":"2024-08-01T13:01:16.361Z","updated_at":"2025-03-27T20:15:35.675Z","avatar_url":"https://github.com/google-research.png","language":"Python","funding_links":[],"categories":["Python","其他_机器视觉"],"sub_categories":["网络服务_其他"],"readme":"# MAGVIT: Masked Generative Video 
Transformer\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/language-model-beats-diffusion-tokenizer-is/video-generation-on-ucf-101)](https://paperswithcode.com/sota/video-generation-on-ucf-101?p=language-model-beats-diffusion-tokenizer-is)\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/language-model-beats-diffusion-tokenizer-is/video-prediction-on-kinetics-600-12-frames)](https://paperswithcode.com/sota/video-prediction-on-kinetics-600-12-frames?p=language-model-beats-diffusion-tokenizer-is)\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/magvit-masked-generative-video-transformer/video-prediction-on-bair-robot-pushing-1)](https://paperswithcode.com/sota/video-prediction-on-bair-robot-pushing-1?p=magvit-masked-generative-video-transformer)\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/magvit-masked-generative-video-transformer/video-generation-on-bair-robot-pushing)](https://paperswithcode.com/sota/video-generation-on-bair-robot-pushing?p=magvit-masked-generative-video-transformer)\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/magvit-masked-generative-video-transformer/video-prediction-on-something-something-v2)](https://paperswithcode.com/sota/video-prediction-on-something-something-v2?p=magvit-masked-generative-video-transformer)\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/magvit-masked-generative-video-transformer/text-to-video-generation-on-something)](https://paperswithcode.com/sota/text-to-video-generation-on-something?p=magvit-masked-generative-video-transformer)\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/language-model-beats-diffusion-tokenizer-is/image-generation-on-imagenet-512x512)](https://paperswithcode.com/sota/image-generation-on-imagenet-512x512?p=language-model-beats-diffusion-tokenizer
-is)\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/language-model-beats-diffusion-tokenizer-is/image-generation-on-imagenet-256x256)](https://paperswithcode.com/sota/image-generation-on-imagenet-256x256?p=language-model-beats-diffusion-tokenizer-is)\n\n\n[[Paper](https://arxiv.org/abs/2212.05199)] | [[Project Page](https://magvit.cs.cmu.edu)] | [[Colab]()]\n\nOfficial code and models for the CVPR 2023 paper:\n\n**[MAGVIT: Masked Generative Video Transformer](https://arxiv.org/abs/2212.05199)** \\\nLijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang\\\nCVPR 2023\n\n## Summary\n\nWe introduce MAGVIT to tackle various video synthesis tasks with a single model, where we demonstrate its quality, efficiency, and flexibility.\n\nIf you find this code useful in your research, please cite\n\n```\n@inproceedings{yu2023magvit,\n  title={{MAGVIT}: Masked generative video transformer},\n  author={Yu, Lijun and Cheng, Yong and Sohn, Kihyuk and Lezama, Jos{\\'e} and Zhang, Han and Chang, Huiwen and Hauptmann, Alexander G and Yang, Ming-Hsuan and Hao, Yuan and Essa, Irfan and Jiang, Lu},\n  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},\n  year={2023}\n}\n```\n\n## Disclaimers\n\n*Please note that this is not an officially supported Google product.*\n\n*Checkpoints are based on training with publicly available datasets. Some datasets contain limitations, including non-commercial use limitations. 
Please review terms and conditions made available by third parties before using models and datasets provided.*\n\n## Installation\n\nThere is a conda environment file for running with GPUs.\nCUDA 11 and cuDNN 8.6 are required for JAX.\n[This VM Image](https://console.cloud.google.com/marketplace/product/nvidia-ngc-public/nvidia-gpu-optimized-vmi) has been tested.\n\n```sh\nconda env create -f environment.yaml\nconda activate magvit\n```\n\nAlternatively, you can install the dependencies via\n\n```sh\npip install -r requirements.txt\n```\n\n## Pretrained models\n\nFor the release of pretrained model weights, please see this [note](https://github.com/google-research/magvit/issues/16).\n\n### MAGVIT 3D-VQ models\n\n**Model**|**Size**|**Input**|**Output**|**Codebook size**|**Dataset**\n:-----:|:-----:|:-----:|:-----:|:-----:|:-----:\n3D-VQ|B|16 frames x 64x64|4x16x16|1024| BAIR Robot Pushing\n3D-VQ|L|16 frames x 64x64|4x16x16|1024| BAIR Robot Pushing\n3D-VQ|B|16 frames x 128x128|4x16x16|1024| UCF-101\n3D-VQ|L|16 frames x 128x128|4x16x16|1024| UCF-101\n3D-VQ|B|16 frames x 128x128|4x16x16|1024| Kinetics-600\n3D-VQ|L|16 frames x 128x128|4x16x16|1024| Kinetics-600\n3D-VQ|B|16 frames x 128x128|4x16x16|1024| Something-Something-v2\n3D-VQ|L|16 frames x 128x128|4x16x16|1024| Something-Something-v2\n\n### MAGVIT transformers\n\nEach transformer model must be used with its corresponding 3D-VQ tokenizer of the same dataset and model size.\n\n**Model**|**Task**|**Size**|**Dataset**|**FVD**\n:-----:|:-----:|:-----:|:-----:|:-----:\nTransformer|Class-conditional|B|UCF-101 |159\nTransformer|Class-conditional|L|UCF-101 |76\nTransformer|Frame prediction | B | BAIR Robot Pushing |76 (48)\nTransformer|Frame prediction | L | BAIR Robot Pushing |62 (31)\nTransformer|Frame prediction (5) |B| Kinetics-600 |24.5\nTransformer|Frame prediction (5) |L| Kinetics-600 |9.9\nTransformer|Multi-task-8 | B | BAIR Robot Pushing |32.8\nTransformer|Multi-task-8 | L | BAIR Robot Pushing 
|22.8\nTransformer|Multi-task-10 | B | Something-Something-v2 | 43.4\nTransformer|Multi-task-10 | L | Something-Something-v2 | 27.3\n\n\u003c!-- ## Usage\n\n### Inference\nInference pretrained models in the [colab]().\n\n### Training new models\nInstructions for training new models can be [found here](). --\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Fmagvit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle-research%2Fmagvit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Fmagvit/lists"}