{"id":20661068,"url":"https://github.com/ai-forever/kandinskyvideo","last_synced_at":"2025-08-24T02:11:31.146Z","repository":{"id":208597773,"uuid":"720711884","full_name":"ai-forever/KandinskyVideo","owner":"ai-forever","description":"KandinskyVideo — multilingual end-to-end text2video latent diffusion model","archived":false,"fork":false,"pushed_at":"2024-05-28T11:44:45.000Z","size":405508,"stargazers_count":184,"open_issues_count":6,"forks_count":20,"subscribers_count":13,"default_branch":"main","last_synced_at":"2025-07-31T12:29:42.681Z","etag":null,"topics":["kandinsky","latent-diffusion","text-to-video","video-generation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ai-forever.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-19T11:11:13.000Z","updated_at":"2025-06-11T10:25:40.000Z","dependencies_parsed_at":"2024-05-28T11:39:32.426Z","dependency_job_id":null,"html_url":"https://github.com/ai-forever/KandinskyVideo","commit_stats":null,"previous_names":["ai-forever/kandinskyvideo"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ai-forever/KandinskyVideo","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ai-forever%2FKandinskyVideo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ai-forever%2FKandinskyVideo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ai-forever%2FKandinskyVideo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/ai-forever%2FKandinskyVideo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ai-forever","download_url":"https://codeload.github.com/ai-forever/KandinskyVideo/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ai-forever%2FKandinskyVideo/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271780403,"owners_count":24819292,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-24T02:00:11.135Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["kandinsky","latent-diffusion","text-to-video","video-generation"],"created_at":"2024-11-16T19:07:00.160Z","updated_at":"2025-08-24T02:11:31.104Z","avatar_url":"https://github.com/ai-forever.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Kandinsky Video 1.1 — a new text-to-video generation model \n## SoTA quality among open-source solutions on \u003ca href=\"https://evalcrafter.github.io/\"\u003eEvalCrafter\u003c/a\u003e benchmark\n\nThis repository is the official implementation of Kandinsky Video 1.1 model.\n\n[![Hugging Face Spaces](https://img.shields.io/badge/🤗-Huggingface-yello.svg)](https://huggingface.co/ai-forever/KandinskyVideo_1_1) | [Telegram-bot](https://t.me/kandinsky21_bot) | [Habr 
post](https://habr.com/ru/companies/sberbank/articles/817667/) | [Our text-to-image model](https://github.com/ai-forever/Kandinsky-3/tree/main) | [Project page](https://ai-forever.github.io/KandinskyVideo/K11/)\n\n\u003cp\u003e\n\u003c!-- \u003cimg src=\"__assets__/title.jpg\" width=\"800px\"/\u003e --\u003e\n\u003c!-- \u003cbr\u003e --\u003e\nOur \u003cB\u003eprevious\u003c/B\u003e model \u003ca href=\"https://github.com/ai-forever/KandinskyVideo/tree/kandinsky_video_1_0\"\u003eKandinsky Video 1.0\u003c/a\u003e divides the video generation process into two stages: it first generates keyframes at a low FPS and then creates interpolated frames between these keyframes to increase the FPS. In \u003cB\u003eKandinsky Video 1.1\u003c/B\u003e, we further break down keyframe generation into two steps: first, generating the initial frame of the video from the textual prompt using the text-to-image model \u003ca href=\"https://github.com/ai-forever/Kandinsky-3\"\u003eKandinsky 3.0\u003c/a\u003e, and then generating the subsequent keyframes based on the textual prompt and the previously generated first frame. This approach ensures more consistent content across the frames and significantly enhances the overall video quality. Furthermore, the approach allows animating any input image as an additional feature.\n\u003c/p\u003e\n\n\n\n## Pipeline\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"__assets__/pipeline.png\" width=\"800px\"/\u003e\n\u003cbr\u003e\n\u003cem\u003eIn \u003ca href=\"https://github.com/ai-forever/KandinskyVideo/tree/kandinsky_video_1_0\"\u003eKandinsky Video 1.0\u003c/a\u003e, the encoded text prompt enters the text-to-video U-Net3D keyframe generation model with temporal layers or blocks, and then the sampled latent keyframes are sent to the latent interpolation model to predict three interpolated frames between\ntwo keyframes. An image MoVQ-GAN decoder is used to obtain the final video result. 
In \u003cB\u003eKandinsky Video 1.1\u003c/B\u003e, text-to-video U-Net3D is also conditioned on text-to-image U-Net2D, which helps to improve the content quality. A temporal MoVQ-GAN decoder is used to decode the final video.\u003c/em\u003e\n\u003c/p\u003e\n\n\n**Architecture details**\n\n+ Text encoder (Flan-UL2) - 8.6B\n+ Latent Diffusion U-Net3D - 4.15B\n+ The interpolation model (Latent Diffusion U-Net3D) - 4.0B \n+ Image MoVQ encoder/decoder - 256M\n+ Video (temporal) MoVQ decoder - 556M\n\n## How to use\n\n\u003c!--Check our jupyter notebooks with examples in `./examples` folder --\u003e\n\n### 1. text2video\n\n```python\nfrom kandinsky_video import get_T2V_pipeline\n\ndevice_map = 'cuda:0'\nt2v_pipe = get_T2V_pipeline(device_map)\n\nprompt = \"A cat wearing sunglasses and working as a lifeguard at a pool.\"\n\nfps = 'medium' # ['low', 'medium', 'high']\nmotion = 'high' # ['low', 'medium', 'high']\n\nvideo = t2v_pipe(\n    prompt,\n    width=512, height=512, \n    fps=fps, \n    motion=motion,\n    key_frame_guidance_scale=5.0,\n    guidance_weight_prompt=5.0,\n    guidance_weight_image=3.0,\n)\n\npath_to_save = f'./__assets__/video.gif'\nvideo[0].save(\n    path_to_save,\n    save_all=True, append_images=video[1:], duration=int(5500/len(video)), loop=0\n)\n```\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"__assets__/video.gif\" raw=true\u003e\n    \u003cbr\u003e\u003cem\u003eGenerated video\u003c/em\u003e\n\u003c/p\u003e\n\n### 2. 
image2video\n\n```python\nfrom kandinsky_video import get_T2V_pipeline\n\ndevice_map = 'cuda:0'\nt2v_pipe = get_T2V_pipeline(device_map)\n\nfrom PIL import Image\nimport requests\nfrom io import BytesIO\n\nurl = 'https://media.cnn.com/api/v1/images/stellar/prod/gettyimages-1961294831.jpg'\nresponse = requests.get(url)\nimg = Image.open(BytesIO(response.content))\nimg.show()\n\nprompt = \"A panda climbs up a tree.\"\n\nfps = 'medium' # ['low', 'medium', 'high']\nmotion = 'medium' # ['low', 'medium', 'high']\n\nvideo = t2v_pipe(\n    prompt,\n    image=img,\n    width=640, height=384, \n    fps=fps, \n    motion=motion,\n    key_frame_guidance_scale=5.0,\n    guidance_weight_prompt=5.0,\n    guidance_weight_image=3.0,\n)\n\npath_to_save = f'./__assets__/video2.gif'\nvideo[0].save(\n    path_to_save,\n    save_all=True, append_images=video[1:], duration=int(5500/len(video)), loop=0\n)\n```\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://media.cnn.com/api/v1/images/stellar/prod/gettyimages-1961294831.jpg\" raw=true width=\"50%\"\u003e\u003cbr\u003e\n\u003cem\u003eInput image.\u003c/em\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"__assets__/video2.gif\" raw=true\u003e\u003cbr\u003e\n\u003cem\u003eGenerated Video.\u003c/em\u003e\n\u003c/p\u003e\n\n## Motion score and Noise Augmentation conditioning\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"__assets__/motion-score.gif\" raw=true\u003e\u003cbr\u003e\n\u003cem\u003eVariations in generations based on different motion scores and noise augmentation levels. 
The horizontal axis shows noise augmentation levels (NA), while the vertical axis displays motion scores (MS).\u003c/em\u003e\n\u003c/p\u003e\n\n## Results\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"__assets__/eval crafter.png\" raw=true align=\"center\" width=\"50%\"\u003e\n\u003cbr\u003e    \n\u003cem\u003e Kandinsky Video 1.1 ranks second overall and first among open-source models on the \u003ca href=\"https://evalcrafter.github.io/\"\u003eEvalCrafter\u003c/a\u003e text-to-video benchmark. VQ: visual quality, TVA: text-video alignment, MQ: motion quality, TC: temporal consistency, FAS: final average score.\n\u003c/em\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"__assets__/polygon.png\" raw=true align=\"center\" width=\"50%\"\u003e\n\u003cbr\u003e\n\u003cem\u003e Radar chart of the performance of Kandinsky Video 1.1 on the \u003ca href=\"https://evalcrafter.github.io/\"\u003eEvalCrafter\u003c/a\u003e benchmark.\n\u003c/em\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"__assets__/human eval.png\" raw=true align=\"center\" width=\"50%\"\u003e\n\u003cbr\u003e\n\u003cem\u003e Human evaluation study results. The bars in the plot correspond to the percentage of “wins” in the side-by-side comparison of model generations. 
We compare our model with \u003ca href=\"https://arxiv.org/abs/2304.08818\"\u003eVideo LDM\u003c/a\u003e.\n\u003c/em\u003e\n\u003c/p\u003e\n\n# Authors\n\n+ Zein Shaheen: [Github](https://github.com/zeinsh), [Google Scholar](https://scholar.google.ru/citations?user=bxlgMxMAAAAJ\u0026hl=en)\n+ Vladimir Arkhipkin: [Github](https://github.com/oriBetelgeuse), [Google Scholar](https://scholar.google.com/citations?user=D-Ko0oAAAAAJ\u0026hl=ru)\n+ Viacheslav Vasilev: [Github](https://github.com/vivasilev), [Google Scholar](https://scholar.google.com/citations?user=redAz-kAAAAJ\u0026hl=ru\u0026oi=sra)\n+ Igor Pavlov: [Github](https://github.com/boomb0om)\n+ Elizaveta Dakhova: [Github](https://github.com/LizaDakhova)\n+ Anastasia Lysenko: [Github](https://github.com/LysenkoAnastasia)\n+ Sergey Markov\n+ Denis Dimitrov: [Github](https://github.com/denndimitrov), [Google Scholar](https://scholar.google.com/citations?user=3JSIJpYAAAAJ\u0026hl=ru\u0026oi=ao)\n+ Andrey Kuznetsov: [Github](https://github.com/kuznetsoffandrey), [Google Scholar](https://scholar.google.com/citations?user=q0lIfCEAAAAJ\u0026hl=ru)\n\n\n## BibTeX\nIf you use our work in your research, please cite our publication:\n```\n@article{arkhipkin2023fusionframes,\n  title     = {FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline},\n  author    = {Arkhipkin, Vladimir and Shaheen, Zein and Vasilev, Viacheslav and Dakhova, Elizaveta and Kuznetsov, Andrey and Dimitrov, Denis},\n  journal   = {arXiv preprint arXiv:2311.13073},\n  year      = {2023}, \n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fai-forever%2Fkandinskyvideo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fai-forever%2Fkandinskyvideo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fai-forever%2Fkandinskyvideo/lists"}