{"id":13678441,"url":"https://github.com/ali-vilab/dreamtalk","last_synced_at":"2025-05-15T18:07:05.204Z","repository":{"id":214565446,"uuid":"736513336","full_name":"ali-vilab/dreamtalk","owner":"ali-vilab","description":"Official implementations for paper: DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models","archived":false,"fork":false,"pushed_at":"2024-01-15T19:11:17.000Z","size":33164,"stargazers_count":1704,"open_issues_count":44,"forks_count":206,"subscribers_count":32,"default_branch":"main","last_synced_at":"2025-04-07T23:07:41.954Z","etag":null,"topics":["audio-visual-learning","face-animation","talking-head","video-generation"],"latest_commit_sha":null,"homepage":"https://dreamtalk-project.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ali-vilab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-28T05:39:31.000Z","updated_at":"2025-04-06T17:56:02.000Z","dependencies_parsed_at":"2024-01-22T07:58:46.304Z","dependency_job_id":null,"html_url":"https://github.com/ali-vilab/dreamtalk","commit_stats":null,"previous_names":["ali-vilab/dreamtalk"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ali-vilab%2Fdreamtalk","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ali-vilab%2Fdreamtalk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ali-vilab%2Fdreamtalk/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/ali-vilab%2Fdreamtalk/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ali-vilab","download_url":"https://codeload.github.com/ali-vilab/dreamtalk/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254394720,"owners_count":22063984,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-visual-learning","face-animation","talking-head","video-generation"],"created_at":"2024-08-02T13:00:53.664Z","updated_at":"2025-05-15T18:07:05.168Z","avatar_url":"https://github.com/ali-vilab.png","language":"Python","funding_links":[],"categories":["Python","\u003cspan id=\"avatar\"\u003eAvatar\u003c/span\u003e","虚拟角色"],"sub_categories":["\u003cspan id=\"tool\"\u003eLLM (LLM \u0026 Tool)\u003c/span\u003e"],"readme":"\u003ch2 align=\"center\"\u003eDreamTalk: When Expressive Talking Head Generation \u003cbr\u003e Meets Diffusion Probabilistic Models\u003c/h2\u003e\n\u003cp align=\"center\"\u003e\n  \u003ca href='https://dreamtalk-project.github.io/'\u003e\u003cimg src='https://img.shields.io/badge/Project-Page-Green'\u003e\u003c/a\u003e \u003ca href='https://arxiv.org/abs/2312.09767'\u003e\u003cimg src='https://img.shields.io/badge/Paper-Arxiv-red'\u003e\u003c/a\u003e \u003ca href='https://youtu.be/VF4vlE6ZqWQ'\u003e\u003cimg src='https://badges.aleen42.com/src/youtube.svg'\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n![teaser](media/teaser.gif \"teaser\")\n\nDreamTalk is a diffusion-based audio-driven expressive talking head generation framework that can produce high-quality talking head 
videos across diverse speaking styles. DreamTalk exhibits robust performance with a diverse array of inputs, including songs, speech in multiple languages, noisy audio, and out-of-domain portraits.\n\n## News\n- __[2023.12]__ Release inference code and pretrained checkpoint.\n\n## Installation\n\n```\nconda create -n dreamtalk python=3.7.0\nconda activate dreamtalk\npip install -r requirements.txt\nconda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge\nconda update ffmpeg\n\npip install urllib3==1.26.6\npip install transformers==4.28.1\npip install dlib\n```\n\n## Download Checkpoints\nIn light of the social impact, we have ceased public download access to checkpoints. If you want to obtain the checkpoints, please request them by emailing mayf18@mails.tsinghua.edu.cn. It is important to note that sending this email implies your consent to use the provided method **solely for academic research purposes**.\n\nPut the downloaded checkpoints into the `checkpoints` folder.\n\n\n## Inference\nRun the script:\n\n```\npython inference_for_demo_video.py \\\n--wav_path data/audio/acknowledgement_english.m4a \\\n--style_clip_path data/style_clip/3DMM/M030_front_neutral_level1_001.mat \\\n--pose_path data/pose/RichardShelby_front_neutral_level1_001.mat \\\n--image_path data/src_img/uncropped/male_face.png \\\n--cfg_scale 1.0 \\\n--max_gen_len 30 \\\n--output_name acknowledgement_english@M030_front_neutral_level1_001@male_face\n```\n\n`wav_path` specifies the input audio. The input audio file extensions such as wav, mp3, m4a, and mp4 (video with sound) should all be compatible.\n\n`style_clip_path` specifies the reference speaking style and `pose_path` specifies head pose. They are 3DMM parameter sequences extracted from reference videos. You can follow [PIRenderer](https://github.com/RenYurui/PIRender) to extract 3DMM parameters from your own videos. Note that the video frame rate should be 25 FPS. 
Besides, videos used for head pose reference should first be cropped to $256\\times256$ using scripts in [FOMM video preprocessing](https://github.com/AliaksandrSiarohin/video-preprocessing).\n\n`image_path` specifies the input portrait. Its resolution should be larger than $256\\times256$. Frontal portraits, with the face directly facing forward and not tilted to one side, usually achieve satisfactory results. The input portrait will be cropped to $256\\times256$. If your portrait is already cropped to $256\\times256$ and you want to disable cropping, use the option `--disable_img_crop` like this:\n\n```\npython inference_for_demo_video.py \\\n--wav_path data/audio/acknowledgement_chinese.m4a \\\n--style_clip_path data/style_clip/3DMM/M030_front_surprised_level3_001.mat \\\n--pose_path data/pose/RichardShelby_front_neutral_level1_001.mat \\\n--image_path data/src_img/cropped/zp1.png \\\n--disable_img_crop \\\n--cfg_scale 1.0 \\\n--max_gen_len 30 \\\n--output_name acknowledgement_chinese@M030_front_surprised_level3_001@zp1\n```\n\n`cfg_scale` controls the scale of classifier-free guidance. It can adjust the intensity of speaking styles.\n\n`max_gen_len` is the maximum video generation duration, measured in seconds. If the input audio exceeds this length, it will be truncated.\n\nThe generated video will be named `$(output_name).mp4` and placed in the `output_video` folder. Intermediate results, including the cropped portrait, will be in the `tmp/$(output_name)` folder.\n\nSample inputs are provided in the `data` folder. Due to copyright issues, we are unable to include the songs we have used in this folder.\n\nIf you want to run this program on CPU, please add `--device=cpu` to the command line arguments. (Thanks to [lukevs](https://github.com/lukevs) for adding CPU support.)\n\n## Ad-hoc solutions to improve resolution\nThe main goal of this method is to achieve accurate lip-sync and produce vivid expressions across diverse speaking styles. 
The resolution was not considered in the initial design process. There are two ad-hoc solutions to improve resolution. The first option is to utilize [CodeFormer](https://github.com/sczhou/CodeFormer), which can achieve a resolution of $1024\\times1024$; however, it is relatively slow, processing only one frame per second on an A100 GPU, and suffers from issues with temporal inconsistency. The second option is to employ the Temporal Super-Resolution Model from [MetaPortrait](https://github.com/Meta-Portrait/MetaPortrait), which attains a resolution of $512\\times512$, offers a faster performance of 10 frames per second, and maintains temporal coherence. However, these super-resolution modules may reduce the intensity of facial emotions.\n\nThe sample results after super-resolution processing are in the `output_video` folder.\n\n## Acknowledgements\n\nWe extend our heartfelt thanks for the invaluable contributions made by preceding works to the development of DreamTalk. This includes, but is not limited to:\n[PIRenderer](https://github.com/RenYurui/PIRender)\n,[AVCT](https://github.com/FuxiVirtualHuman/AAAI22-one-shot-talking-face)\n,[StyleTalk](https://github.com/FuxiVirtualHuman/styletalk)\n,[Deep3DFaceRecon_pytorch](https://github.com/sicxu/Deep3DFaceRecon_pytorch)\n,[Wav2vec2.0](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english)\n,[diffusion-point-cloud](https://github.com/luost26/diffusion-point-cloud)\n,[FOMM video preprocessing](https://github.com/AliaksandrSiarohin/video-preprocessing). 
We are dedicated to advancing upon these foundational works with the utmost respect for their original contributions.\n\n## Citation\nIf you find this codebase useful for your research, please use the following entry.\n```BibTeX\n@article{ma2023dreamtalk,\n  title={DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models},\n  author={Ma, Yifeng and Zhang, Shiwei and Wang, Jiayu and Wang, Xiang and Zhang, Yingya and Deng, Zhidong},\n  journal={arXiv preprint arXiv:2312.09767},\n  year={2023}\n}\n```\n## Disclaimer\n\nThis method is intended for \u003cstrong\u003eRESEARCH/NON-COMMERCIAL USE ONLY\u003c/strong\u003e. \n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fali-vilab%2Fdreamtalk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fali-vilab%2Fdreamtalk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fali-vilab%2Fdreamtalk/lists"}