{"id":13994242,"url":"https://github.com/lucidrains/phenaki-pytorch","last_synced_at":"2025-04-11T23:16:07.398Z","repository":{"id":60944303,"uuid":"543297533","full_name":"lucidrains/phenaki-pytorch","owner":"lucidrains","description":"Implementation of Phenaki Video, which uses Mask GIT to produce text guided videos of up to 2 minutes in length, in Pytorch","archived":false,"fork":false,"pushed_at":"2024-07-29T23:21:44.000Z","size":272,"stargazers_count":769,"open_issues_count":13,"forks_count":81,"subscribers_count":37,"default_branch":"main","last_synced_at":"2025-04-11T23:15:53.192Z","etag":null,"topics":["artificial-intelligence","attention-mechanisms","deep-learning","imagination-machine","text-to-video","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lucidrains.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-09-29T19:58:11.000Z","updated_at":"2025-04-08T18:48:29.000Z","dependencies_parsed_at":"2024-03-24T08:33:39.405Z","dependency_job_id":"46ccbe9f-c969-45eb-90df-6d98d1d5c563","html_url":"https://github.com/lucidrains/phenaki-pytorch","commit_stats":{"total_commits":142,"total_committers":2,"mean_commits":71.0,"dds":"0.028169014084507005","last_synced_commit":"9415d4e6e808d941f827c2926e72f43a434eedb1"},"previous_names":[],"tags_count":74,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fphenaki-pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/lucidrains%2Fphenaki-pytorch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fphenaki-pytorch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fphenaki-pytorch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lucidrains","download_url":"https://codeload.github.com/lucidrains/phenaki-pytorch/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248492884,"owners_count":21113163,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","attention-mechanisms","deep-learning","imagination-machine","text-to-video","transformers"],"created_at":"2024-08-09T14:02:47.175Z","updated_at":"2025-04-11T23:16:07.375Z","avatar_url":"https://github.com/lucidrains.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cimg src=\"./phenaki.png\" width=\"450px\"\u003e\u003c/img\u003e\n\n## \u003ca href=\"https://en.wikipedia.org/wiki/Phenakistiscope\"\u003ePhenaki\u003c/a\u003e - Pytorch\n\nImplementation of \u003ca href=\"https://phenaki.video/\"\u003ePhenaki Video\u003c/a\u003e, which uses \u003ca href=\"https://arxiv.org/abs/2202.04200\"\u003eMask GIT\u003c/a\u003e to produce text guided videos of up to 2 minutes in length, in Pytorch. 
It will also combine another technique involving a \u003ca href=\"https://arxiv.org/abs/2209.04439\"\u003etoken critic\u003c/a\u003e for potentially even better generations\n\nPlease join \u003ca href=\"https://discord.gg/xBPBXfcFHd\"\u003e\u003cimg alt=\"Join us on Discord\" src=\"https://img.shields.io/discord/823813159592001537?color=5865F2\u0026logo=discord\u0026logoColor=white\"\u003e\u003c/a\u003e if you are interested in replicating this work in the open\n\n\u003ca href=\"https://www.youtube.com/watch?v=RYLomvaPWa4\"\u003eAI Coffeebreak explanation\u003c/a\u003e\n\n## Appreciation\n\n- \u003ca href=\"https://stability.ai/\"\u003eStability.ai\u003c/a\u003e for the generous sponsorship to work on cutting edge artificial intelligence research\n\n- \u003ca href=\"https://huggingface.co/\"\u003e🤗 Huggingface\u003c/a\u003e for their amazing transformers and accelerate libraries\n\n- \u003ca href=\"https://github.com/gmegh\"\u003eGuillem\u003c/a\u003e for his ongoing contributions\n\n- You? 
If you are a great machine learning engineer and / or researcher, feel free to contribute to the frontier of open source generative AI\n\n## Install\n\n```bash\n$ pip install phenaki-pytorch\n```\n\n## Usage\n\nC-ViViT\n\n```python\nimport torch\nfrom phenaki_pytorch import CViViT, CViViTTrainer\n\ncvivit = CViViT(\n    dim = 512,\n    codebook_size = 65536,\n    image_size = 256,\n    patch_size = 32,\n    temporal_patch_size = 2,\n    spatial_depth = 4,\n    temporal_depth = 4,\n    dim_head = 64,\n    heads = 8\n).cuda()\n\ntrainer = CViViTTrainer(\n    cvivit,\n    folder = '/path/to/images/or/videos',\n    batch_size = 4,\n    grad_accum_every = 4,\n    train_on_images = False,  # you can train on images first, before fine tuning on video, for sample efficiency\n    use_ema = False,          # recommended to be turned on (keeps exponential moving averaged cvivit) unless if you don't have enough resources\n    num_train_steps = 10000\n)\n\ntrainer.train()               # reconstructions and checkpoints will be saved periodically to ./results\n\n```\n\nPhenaki\n\n```python\nimport torch\nfrom phenaki_pytorch import CViViT, MaskGit, Phenaki\n\ncvivit = CViViT(\n    dim = 512,\n    codebook_size = 65536,\n    image_size = (256, 128),  # video with rectangular screen allowed\n    patch_size = 32,\n    temporal_patch_size = 2,\n    spatial_depth = 4,\n    temporal_depth = 4,\n    dim_head = 64,\n    heads = 8\n)\n\ncvivit.load('/path/to/trained/cvivit.pt')\n\nmaskgit = MaskGit(\n    num_tokens = 5000,\n    max_seq_len = 1024,\n    dim = 512,\n    dim_context = 768,\n    depth = 6,\n)\n\nphenaki = Phenaki(\n    cvivit = cvivit,\n    maskgit = maskgit\n).cuda()\n\nvideos = torch.randn(3, 3, 17, 256, 128).cuda() # (batch, channels, frames, height, width)\nmask = torch.ones((3, 17)).bool().cuda() # [optional] (batch, frames) - allows for co-training videos of different lengths as well as video and images in the same batch\n\ntexts = [\n    'a whale breaching from 
afar',\n    'young girl blowing out candles on her birthday cake',\n    'fireworks with blue and green sparkles'\n]\n\nloss = phenaki(videos, texts = texts, video_frame_mask = mask)\nloss.backward()\n\n# do the above for many steps, then ...\n\nvideo = phenaki.sample(texts = 'a squirrel examines an acorn', num_frames = 17, cond_scale = 5.) # (1, 3, 17, 256, 128)\n\n# so in the paper, they do not really achieve 2 minutes of coherent video\n# at each new scene with new text conditioning, they condition on the previous K frames\n# you can easily achieve this with this framework like so\n\nvideo_prime = video[:, :, -3:] # (1, 3, 3, 256, 128) # say K = 3\n\nvideo_next = phenaki.sample(texts = 'a cat watches the squirrel from afar', prime_frames = video_prime, num_frames = 14) # (1, 3, 14, 256, 128)\n\n# the total video\n\nentire_video = torch.cat((video, video_next), dim = 2) # (1, 3, 17 + 14, 256, 128)\n\n# and so on...\n```\n\nOr just import the `make_video` function\n\n```python\n# ... above code\n\nfrom phenaki_pytorch import make_video\n\nentire_video, scenes = make_video(phenaki, texts = [\n    'a squirrel examines an acorn buried in the snow',\n    'a cat watches the squirrel from a frosted window sill',\n    'zoom out to show the entire living room, with the cat residing by the window sill'\n], num_frames = (17, 14, 14), prime_lengths = (5, 5))\n\nentire_video.shape # (1, 3, 17 + 14 + 14 = 45, 256, 128)\n\n# scenes - List[Tensor[3]] - video segment of each scene\n```\n\nThat's it!\n\n## Token Critic\n\nA \u003ca href=\"https://arxiv.org/abs/2209.04439\"\u003enew paper\u003c/a\u003e suggests that instead of relying on the predicted probabilities of each token as a measure of confidence, one can train an extra critic to decide what to iteratively mask during sampling. 
You can optionally train this critic for potentially better generations as shown below\n\n```python\nimport torch\nfrom phenaki_pytorch import CViViT, MaskGit, TokenCritic, Phenaki\n\ncvivit = CViViT(\n    dim = 512,\n    codebook_size = 65536,\n    image_size = (256, 128),\n    patch_size = 32,\n    temporal_patch_size = 2,\n    spatial_depth = 4,\n    temporal_depth = 4,\n    dim_head = 64,\n    heads = 8\n)\n\nmaskgit = MaskGit(\n    num_tokens = 65536,\n    max_seq_len = 1024,\n    dim = 512,\n    dim_context = 768,\n    depth = 6,\n)\n\n# (1) define the critic\n\ncritic = TokenCritic(\n    num_tokens = 65536,\n    max_seq_len = 1024,\n    dim = 512,\n    dim_context = 768,\n    depth = 6,\n    has_cross_attn = True\n)\n\ntrainer = Phenaki(\n    maskgit = maskgit,\n    cvivit = cvivit,\n    critic = critic    # and then (2) pass it into Phenaki\n).cuda()\n\ntexts = [\n    'a whale breaching from afar',\n    'young girl blowing out candles on her birthday cake',\n    'fireworks with blue and green sparkles'\n]\n\nvideos = torch.randn(3, 3, 3, 256, 128).cuda() # (batch, channels, frames, height, width)\n\nloss = trainer(videos = videos, texts = texts)\nloss.backward()\n```\n\nOr even simpler, just reuse `MaskGit` itself as a \u003ca href=\"https://aclanthology.org/2021.naacl-main.409.pdf\"\u003eSelf Critic (Nijkamp et al)\u003c/a\u003e, by setting `self_token_critic = True` on the initialization of `Phenaki`\n\n```python\nphenaki = Phenaki(\n    ...,\n    self_token_critic= True  # set this to True\n)\n```\n\nNow your generations should be greatly improved!\n\n## Phenaki Trainer\n\nThis repository will also endeavor to allow the researcher to train on text-to-image and then text-to-video. Similarly, for unconditional training, the researcher should be able to first train on images and then fine tune on video. 
Below is an example for text-to-video\n\n\n```python\nimport torch\nfrom torch.utils.data import Dataset\nfrom phenaki_pytorch import CViViT, MaskGit, Phenaki, PhenakiTrainer\n\ncvivit = CViViT(\n    dim = 512,\n    codebook_size = 65536,\n    image_size = 256,\n    patch_size = 32,\n    temporal_patch_size = 2,\n    spatial_depth = 4,\n    temporal_depth = 4,\n    dim_head = 64,\n    heads = 8\n)\n\ncvivit.load('/path/to/trained/cvivit.pt')\n\nmaskgit = MaskGit(\n    num_tokens = 5000,\n    max_seq_len = 1024,\n    dim = 512,\n    dim_context = 768,\n    depth = 6,\n    unconditional = False\n)\n\nphenaki = Phenaki(\n    cvivit = cvivit,\n    maskgit = maskgit\n).cuda()\n\n# mock text video dataset\n# you will have to write your own dataset that returns the (\u003cvideo tensor\u003e, \u003ccaption\u003e) tuple\n\nclass MockTextVideoDataset(Dataset):\n    def __init__(\n        self,\n        length = 100,\n        image_size = 256,\n        num_frames = 17\n    ):\n        super().__init__()\n        self.num_frames = num_frames\n        self.image_size = image_size\n        self.len = length\n\n    def __len__(self):\n        return self.len\n\n    def __getitem__(self, idx):\n        video = torch.randn(3, self.num_frames, self.image_size, self.image_size)\n        caption = 'video caption'\n        return video, caption\n\ndataset = MockTextVideoDataset()\n\n# pass in the dataset\n\ntrainer = PhenakiTrainer(\n    phenaki = phenaki,\n    batch_size = 4,\n    grad_accum_every = 4,\n    train_on_images = False, # if your mock dataset above returns (images, caption) pairs, set this to True\n    dataset = dataset,       # pass in your dataset here\n    sample_texts_file_path = '/path/to/captions.txt' # one caption per line; captions are randomly drawn during sampling\n)\n\ntrainer.train()\n```\n\nUnconditional training is as follows\n\nex. 
unconditional images and video training\n\n```python\nimport torch\nfrom phenaki_pytorch import CViViT, MaskGit, Phenaki, PhenakiTrainer\n\ncvivit = CViViT(\n    dim = 512,\n    codebook_size = 65536,\n    image_size = 256,\n    patch_size = 32,\n    temporal_patch_size = 2,\n    spatial_depth = 4,\n    temporal_depth = 4,\n    dim_head = 64,\n    heads = 8\n)\n\ncvivit.load('/path/to/trained/cvivit.pt')\n\nmaskgit = MaskGit(\n    num_tokens = 5000,\n    max_seq_len = 1024,\n    dim = 512,\n    dim_context = 768,\n    depth = 6,\n    unconditional = True\n)\n\nphenaki = Phenaki(\n    cvivit = cvivit,\n    maskgit = maskgit\n).cuda()\n\n# pass in the folder to images or video\n\ntrainer = PhenakiTrainer(\n    phenaki = phenaki,\n    batch_size = 4,\n    grad_accum_every = 4,\n    train_on_images = True,                # for the sake of example, the path below is a folder of images\n    dataset = '/path/to/images/or/video'\n)\n\ntrainer.train()\n```\n\n## Todo\n\n- [x] pass mask probability into maskgit and auto-mask and get cross entropy loss\n- [x] cross attention + get t5 embeddings code from imagen-pytorch and get classifier free guidance wired up\n- [x] wire up full vqgan-vae for c-vivit, just take what is in parti-pytorch already, but make sure to use a stylegan discriminator as said in paper\n- [x] complete token critic training code\n- [x] complete first pass of maskgit scheduled sampling + token critic (optionally without if researcher does not want to do extra training)\n- [x] inference code that allows for sliding time + conditioning on K past frames\n- [x] alibi pos bias for temporal attention\n- [x] give spatial attention the most powerful positional bias\n- [x] make sure to use stylegan-esque discriminator\n- [x] 3d relative positional bias for maskgit\n- [x] make sure maskgit can also support training of images, and make sure it works on local machine\n- [x] also build option for token critic to be conditioned with the text\n- [x] should be able to train for text 
to image generation first\n- [x] make sure critic trainer can take in cvivit and automatically pass in video patch shape for relative positional bias - make sure critic also gets optimal relative positional bias\n- [x] training code for cvivit\n- [x] move cvivit into own file\n- [x] unconditional generations (both video and images)\n- [x] wire up accelerate for multi-gpu training for both c-vivit and maskgit\n- [x] add depthwise-convs to cvivit for position generating\n- [x] some basic video manipulation code, allow for sampled tensor to be saved as gif\n- [x] basic critic training code\n- [x] add position generating dsconv to maskgit too\n- [x] outfit customizable self attention blocks to stylegan discriminator\n- [x] add all top of the line research for stabilizing transformers training\n\n- [ ] get some basic critic sampling code, show comparison of with and without critic\n- [ ] bring in concatenative token shift (temporal dimension)\n- [ ] add a DDPM upsampler, either port from imagen-pytorch or just rewrite a simple version here\n- [ ] take care of masking in maskgit\n- [ ] test maskgit + critic alone on oxford flowers dataset\n- [ ] support rectangular sized videos\n- [ ] add flash attention as an option for all transformers and cite @tridao\n\n## Citations\n\n```bibtex\n@article{Villegas2022PhenakiVL,\n    title   = {Phenaki: Variable Length Video Generation From Open Domain Textual Description},\n    author  = {Ruben Villegas and Mohammad Babaeizadeh and Pieter-Jan Kindermans and Hernan Moraldo and Han Zhang and Mohammad Taghi Saffar and Santiago Castro and Julius Kunze and D. Erhan},\n    journal = {ArXiv},\n    year    = {2022},\n    volume  = {abs/2210.02399}\n}\n```\n\n```bibtex\n@article{Chang2022MaskGITMG,\n    title   = {MaskGIT: Masked Generative Image Transformer},\n    author  = {Huiwen Chang and Han Zhang and Lu Jiang and Ce Liu and William T. 
Freeman},\n    journal = {2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},\n    year    = {2022},\n    pages   = {11305-11315}\n}\n```\n\n```bibtex\n@article{Lezama2022ImprovedMI,\n    title   = {Improved Masked Image Generation with Token-Critic},\n    author  = {Jos{\\'e} Lezama and Huiwen Chang and Lu Jiang and Irfan Essa},\n    journal = {ArXiv},\n    year    = {2022},\n    volume  = {abs/2209.04439}\n}\n```\n\n```bibtex\n@misc{ding2021cogview,\n    title   = {CogView: Mastering Text-to-Image Generation via Transformers},\n    author  = {Ming Ding and Zhuoyi Yang and Wenyi Hong and Wendi Zheng and Chang Zhou and Da Yin and Junyang Lin and Xu Zou and Zhou Shao and Hongxia Yang and Jie Tang},\n    year    = {2021},\n    eprint  = {2105.13290},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@misc{shazeer2020glu,\n    title   = {GLU Variants Improve Transformer},\n    author  = {Noam Shazeer},\n    year    = {2020},\n    url     = {https://arxiv.org/abs/2002.05202}\n}\n```\n\n```bibtex\n@misc{press2021ALiBi,\n    title   = {Train Short, Test Long: Attention with Linear Biases Enable Input Length Extrapolation},\n    author  = {Ofir Press and Noah A. 
Smith and Mike Lewis},\n    year    = {2021},\n    url     = {https://ofir.io/train_short_test_long.pdf}\n}\n```\n\n```bibtex\n@article{Liu2022SwinTV,\n    title   = {Swin Transformer V2: Scaling Up Capacity and Resolution},\n    author  = {Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},\n    journal = {2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},\n    year    = {2022},\n    pages   = {11999-12009}\n}\n```\n\n```bibtex\n@inproceedings{Nijkamp2021SCRIPTSP,\n    title   = {SCRIPT: Self-Critic PreTraining of Transformers},\n    author  = {Erik Nijkamp and Bo Pang and Ying Nian Wu and Caiming Xiong},\n    booktitle = {North American Chapter of the Association for Computational Linguistics},\n    year    = {2021}\n}\n```\n\n```bibtex\n@misc{https://doi.org/10.48550/arxiv.2302.01327,\n    doi     = {10.48550/ARXIV.2302.01327},\n    url     = {https://arxiv.org/abs/2302.01327},\n    author  = {Kumar, Manoj and Dehghani, Mostafa and Houlsby, Neil},\n    title   = {Dual PatchNorm},\n    publisher = {arXiv},\n    year    = {2023},\n    copyright = {Creative Commons Attribution 4.0 International}\n}\n```\n\n```bibtex\n@misc{gilmer2023intriguing,\n    title  = {Intriguing Properties of Transformer Training Instabilities},\n    author = {Justin Gilmer and Andrea Schioppa and Jeremy Cohen},\n    year   = {2023},\n    status = {to be published - one attention stabilization technique is circulating within Google Brain, being used by multiple teams}\n}\n```\n\n```bibtex\n@misc{mentzer2023finite,\n    title   = {Finite Scalar Quantization: VQ-VAE Made Simple},\n    author  = {Fabian Mentzer and David Minnen and Eirikur Agustsson and Michael Tschannen},\n    year    = {2023},\n    eprint  = {2309.15505},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@misc{yu2023language,\n    title   = {Language Model Beats 
Diffusion -- Tokenizer is Key to Visual Generation},\n    author  = {Lijun Yu and José Lezama and Nitesh B. Gundavarapu and Luca Versari and Kihyuk Sohn and David Minnen and Yong Cheng and Agrim Gupta and Xiuye Gu and Alexander G. Hauptmann and Boqing Gong and Ming-Hsuan Yang and Irfan Essa and David A. Ross and Lu Jiang},\n    year    = {2023},\n    eprint  = {2310.05737},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fphenaki-pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucidrains%2Fphenaki-pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fphenaki-pytorch/lists"}