{"id":21692501,"url":"https://github.com/jdh-algo/JoyVASA","last_synced_at":"2025-07-18T08:31:26.695Z","repository":{"id":262834514,"uuid":"888450044","full_name":"jdh-algo/JoyVASA","owner":"jdh-algo","description":null,"archived":false,"fork":false,"pushed_at":"2025-01-13T14:01:53.000Z","size":40816,"stargazers_count":596,"open_issues_count":15,"forks_count":51,"subscribers_count":16,"default_branch":"main","last_synced_at":"2025-01-13T14:48:25.335Z","etag":null,"topics":["audio-driven-talking-face","generative-ai","lip-sync","portrait-animation","talking-head"],"latest_commit_sha":null,"homepage":"https://jdh-algo.github.io/JoyVASA/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jdh-algo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-14T12:23:27.000Z","updated_at":"2025-01-13T14:01:56.000Z","dependencies_parsed_at":"2025-01-13T14:44:16.065Z","dependency_job_id":null,"html_url":"https://github.com/jdh-algo/JoyVASA","commit_stats":null,"previous_names":["jdh-algo/joyvasa"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jdh-algo/JoyVASA","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdh-algo%2FJoyVASA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdh-algo%2FJoyVASA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdh-algo%2FJoyVASA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdh-algo%2FJoyVASA/manifests","owner_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/owners/jdh-algo","download_url":"https://codeload.github.com/jdh-algo/JoyVASA/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdh-algo%2FJoyVASA/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265728813,"owners_count":23818729,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-driven-talking-face","generative-ai","lip-sync","portrait-animation","talking-head"],"created_at":"2024-11-25T18:02:11.632Z","updated_at":"2025-07-18T08:31:26.669Z","avatar_url":"https://github.com/jdh-algo.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003ch1 align='center'\u003eJoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation\u003c/h1\u003e\n\n\u003cdiv align='center'\u003e\n    \u003ca href='https://github.com/xuyangcao' target='_blank'\u003eXuyang Cao\u003c/a\u003e\u003csup\u003e1*\u003c/sup\u003e\u0026emsp;\n    Guoxin Wang\u003csup\u003e12*\u003c/sup\u003e\u0026emsp;\n    \u003ca href='https://github.com/DBDXSS' target='_blank'\u003eSheng Shi\u003c/a\u003e\u003csup\u003e1*\u003c/sup\u003e\u0026emsp;\n    \u003ca href='https://github.com/zhaojun060708' target='_blank'\u003eJun Zhao\u003c/a\u003e\u003csup\u003e1\u003c/sup\u003e\u0026emsp;\n    Yang Yao\u003csup\u003e1\u003c/sup\u003e\n\u003c/div\u003e\n\u003cdiv align='center'\u003e\n    Jintao Fei\u003csup\u003e1\u003c/sup\u003e\u0026emsp;\n    Minyu 
Gao\u003csup\u003e1\u003c/sup\u003e\n\u003c/div\u003e\n\u003cdiv align='center'\u003e\n    \u003csup\u003e1\u003c/sup\u003eJD Health International Inc.  \u003csup\u003e2\u003c/sup\u003eZhejiang University\n\u003c/div\u003e\n\n\u003cbr\u003e\n\u003cdiv align='center'\u003e\n    \u003ca href='https://github.com/jdh-algo/JoyVASA'\u003e\u003cimg src='https://img.shields.io/github/stars/jdh-algo/JoyVASA?style=social'\u003e\u003c/a\u003e\n    \u003ca href='https://jdh-algo.github.io/JoyVASA'\u003e\u003cimg src='https://img.shields.io/badge/Project-HomePage-Green'\u003e\u003c/a\u003e\n    \u003ca href='https://arxiv.org/abs/2411.09209'\u003e\u003cimg src='https://img.shields.io/badge/Paper-Arxiv-red'\u003e\u003c/a\u003e\n    \u003ca href='https://huggingface.co/jdh-algo/JoyVASA'\u003e\u003cimg src='https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Model-yellow'\u003e\u003c/a\u003e\n    \u003c!-- \u003ca href='https://huggingface.co/spaces/jdh-algo/JoyHallo'\u003e\u003cimg src='https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Demo-yellow'\u003e\u003c/a\u003e --\u003e\n\u003c/div\u003e\n\u003cbr\u003e\n\n## 📖 Introduction\n\nAudio-driven portrait animation has made significant advances with diffusion-based models, improving video quality and lipsync accuracy. However, the increasing complexity of these models has led to inefficiencies in training and inference, as well as constraints on video length and inter-frame continuity. In this paper, we propose JoyVASA, a diffusion-based method for generating facial dynamics and head motion in audio-driven facial animation. Specifically, in the first stage, we introduce a decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations. This decoupling allows the system to generate longer videos by combining any static 3D facial representation with dynamic motion sequences. 
Then, in the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independent of character identity. Finally, a generator trained in the first stage uses the 3D facial representation and the generated motion sequences as inputs to render high-quality animations. With the decoupled facial representation and the identity-independent motion generation process, JoyVASA extends beyond human portraits to animate animal faces seamlessly. The model is trained on a hybrid dataset of private Chinese and public English data, enabling multilingual support. Experimental results validate the effectiveness of our approach. Future work will focus on improving real-time performance and refining expression control, further expanding the framework’s applications in portrait animation.\n\n## 🧳 Framework\n\n![Inference Pipeline](assets/imgs/pipeline_inference.png)\n\n**Inference Pipeline of the proposed JoyVASA.** Given a reference image, we first extract the 3D facial appearance feature using the appearance encoder in LivePortrait, and a series of learned 3D keypoints using the motion encoder. For the input speech, the audio features are first extracted using the wav2vec2 encoder. The audio-driven motion sequences are then sampled in a sliding-window fashion using the diffusion model trained in the second stage. The target keypoints are computed from the 3D keypoints of the reference image and the sampled target motion sequences. Finally, the 3D facial appearance feature is warped based on the source and target keypoints and rendered by a generator to produce the final output video.\n\n## ⚙️ Installation\n\n**System requirements:**\n\nUbuntu:\n\n- Tested on Ubuntu 20.04, CUDA 12.1\n- Tested GPUs: A100\n\nWindows:\n\n- Tested on Windows 11, CUDA 12.1\n- Tested GPUs: RTX 4060 Laptop 8GB VRAM GPU\n\n**Create environment:**\n\n```bash\n# 1. 
Create base environment\nconda create -n joyvasa python=3.10 -y\nconda activate joyvasa\n\n# 2. Install requirements\npip install -r requirements.txt\n\n# 3. Install ffmpeg\nsudo apt-get update\nsudo apt-get install ffmpeg -y\n\n# 4. Optional: Install MultiScaleDeformableAttention for animal image animation\ncd src/utils/dependencies/XPose/models/UniPose/ops\npython setup.py build install\ncd - # equivalent to cd ../../../../../../../\n```\n\n## 🎒 Prepare model checkpoints\n\nMake sure you have [git-lfs](https://git-lfs.com) installed and download all the following checkpoints to `pretrained_weights`:\n\n### 1. Download JoyVASA motion generator checkpoints\n\n```bash\ngit lfs install\ngit clone https://huggingface.co/jdh-algo/JoyVASA\n```\n\n### 2. Download audio encoder checkpoints\n\nWe support two audio encoders: [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base-960h) and [hubert-chinese](https://huggingface.co/TencentGameMate/chinese-hubert-base).\n\nRun the following commands to download the [hubert-chinese](https://huggingface.co/TencentGameMate/chinese-hubert-base) pretrained weights:\n\n```bash\ngit lfs install\ngit clone https://huggingface.co/TencentGameMate/chinese-hubert-base\n```\n\nTo get the [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base-960h) pretrained weights, run the following commands:\n\n```bash\ngit lfs install\ngit clone https://huggingface.co/facebook/wav2vec2-base-960h\n```\n\n\u003e [!NOTE]\n\u003e The motion generation model with the wav2vec2 encoder will be supported later.\n\n### 3. Download LivePortrait checkpoints\n\n```bash\n# !pip install -U \"huggingface_hub[cli]\"\nhuggingface-cli download KwaiVGI/LivePortrait --local-dir pretrained_weights --exclude \"*.git*\" \"README.md\" \"docs\"\n```\n\nRefer to [LivePortrait](https://github.com/KwaiVGI/LivePortrait/tree/main) for more download methods.\n\n### 4. 
`pretrained_weights` contents\n\nThe final `pretrained_weights` directory should look like this:\n\n```text\n./pretrained_weights/\n├── insightface                                                                                                                                                 \n│   └── models                                                                                                                                                  \n│       └── buffalo_l                                                                                                                                           \n│           ├── 2d106det.onnx                                                                                                                                   \n│           └── det_10g.onnx   \n├── JoyVASA\n│   ├── motion_generator\n│   │   └── iter_0020000.pt\n│   └── motion_template\n│       └── motion_template.pkl\n├── liveportrait\n│   ├── base_models\n│   │   ├── appearance_feature_extractor.pth\n│   │   ├── motion_extractor.pth\n│   │   ├── spade_generator.pth\n│   │   └── warping_module.pth\n│   ├── landmark.onnx\n│   └── retargeting_models\n│       └── stitching_retargeting_module.pth\n├── liveportrait_animals\n│   ├── base_models\n│   │   ├── appearance_feature_extractor.pth\n│   │   ├── motion_extractor.pth\n│   │   ├── spade_generator.pth\n│   │   └── warping_module.pth\n│   ├── retargeting_models\n│   │   └── stitching_retargeting_module.pth\n│   └── xpose.pth\n├── TencentGameMate:chinese-hubert-base\n│   ├── chinese-hubert-base-fairseq-ckpt.pt\n│   ├── config.json\n│   ├── gitattributes\n│   ├── preprocessor_config.json\n│   ├── pytorch_model.bin\n│   └── README.md\n└── wav2vec2-base-960h               \n    ├── config.json                  \n    ├── feature_extractor_config.json\n    ├── model.safetensors\n    ├── preprocessor_config.json\n    ├── pytorch_model.bin\n    ├── README.md\n    ├── special_tokens_map.json\n    ├── tf_model.h5\n    
├── tokenizer_config.json\n    └── vocab.json\n```\n\n\u003e [!NOTE]\n\u003e On Windows, rename the folder `TencentGameMate:chinese-hubert-base` to `chinese-hubert-base`.\n\n## 🚀 Inference\n\n### 1. Inference with command line\n\nAnimal:\n\n```bash\npython inference.py -r assets/examples/imgs/joyvasa_001.png -a assets/examples/audios/joyvasa_001.wav --animation_mode animal --cfg_scale 2.0\n```\n\nHuman:\n\n```bash\npython inference.py -r assets/examples/imgs/joyvasa_003.png -a assets/examples/audios/joyvasa_003.wav --animation_mode human --cfg_scale 2.0\n```\n\nYou can change `cfg_scale` to get results with different expressions and poses.\n\n\u003e [!NOTE]\n\u003e A mismatch between the animation mode and the reference image may produce incorrect results.\n\n### 2. Inference with web demo\n\nUse the following command to start the web demo:\n\n```bash\npython app.py\n```\n\nThe demo will be available at http://127.0.0.1:7862.\n\n\n## ⚓️ Train Motion Generator with Your Own Data\n\nThe motion generator should be trained using human talking face videos.\n\n\n### 1. Prepare train and validation data\n\nChange the `root_dir` in `01_extract_motions.py` to your own dataset path, then run the following commands to generate training and validation data:\n\n```bash\ncd src/prepare_data\npython 01_extract_motions.py\npython 05_extract_audio.py\npython 02_gen_labels.py\npython 03_merge_motions.py\npython 04_gen_template.py\n\nmv motion_templete.pkl motions.pkl train.json test.json ../../data\ncd ../..\n```\n\n### 2. 
Train\n\n```bash\npython train.py\n```\n\nThe experimental results are saved in `experiments/`.\n\n## 📝 Citations\n\nIf you find our work helpful, please consider citing us:\n\n```\n@misc{cao2024joyvasaportraitanimalimage,\n      title={JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation}, \n      author={Xuyang Cao and Guoxin Wang and Sheng Shi and Jun Zhao and Yang Yao and Jintao Fei and Minyu Gao},\n      year={2024},\n      eprint={2411.09209},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https://arxiv.org/abs/2411.09209}, \n}\n```\n\n## 🤝 Acknowledgments\n\nWe would like to thank the contributors to the [LivePortrait](https://github.com/KwaiVGI/LivePortrait), [Open Facevid2vid](https://github.com/zhanglonghao1992/One-Shot_Free-View_Neural_Talking_Head_Synthesis), [InsightFace](https://github.com/deepinsight/insightface), [X-Pose](https://github.com/IDEA-Research/X-Pose), [DiffPoseTalk](https://github.com/DiffPoseTalk/DiffPoseTalk), [Hallo](https://github.com/fudan-generative-vision/hallo), [wav2vec 2.0](https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec), [Chinese Speech Pretrain](https://github.com/TencentGameMate/chinese_speech_pretrain), [Q-Align](https://github.com/Q-Future/Q-Align), [Syncnet](https://github.com/joonson/syncnet_python), and [VBench](https://github.com/Vchitect/VBench) repositories for their open research and extraordinary work.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjdh-algo%2FJoyVASA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjdh-algo%2FJoyVASA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjdh-algo%2FJoyVASA/lists"}