Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/KangweiiLiu/Awesome_Audio-driven_Talking-Face-Generation
A curated list of resources of audio-driven talking face generation
List: Awesome_Audio-driven_Talking-Face-Generation
audio-driven-talking-face controllable-generation generative-adversarial-networks paperlist speech-driven-talking-face talking-face-generation
- Host: GitHub
- URL: https://github.com/KangweiiLiu/Awesome_Audio-driven_Talking-Face-Generation
- Owner: KangweiiLiu
- Created: 2022-08-13T09:43:44.000Z (about 2 years ago)
- Default Branch: master
- Last Pushed: 2022-08-25T03:21:00.000Z (about 2 years ago)
- Last Synced: 2024-05-23T07:20:45.937Z (5 months ago)
- Topics: audio-driven-talking-face, controllable-generation, generative-adversarial-networks, paperlist, speech-driven-talking-face, talking-face-generation
- Homepage:
- Size: 16.6 KB
- Stars: 111
- Watchers: 1
- Forks: 9
- Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-generative-ai - KangweiiLiu/Awesome_Audio-driven_Talking-Face-Generation - A curated list of resources of audio-driven talking face generation (Inbox: Text-to-speech (TTS) and avatars / Creative Uses of Generative AI Image Synthesis Tools)
- ultimate-awesome - Awesome_Audio-driven_Talking-Face-Generation - A curated list of resources of audio-driven talking face generation. (Other Lists / PowerShell Lists)
README
# Awesome Audio-driven Talking Face Generation
## 2D Encoder-Decoder Based
- StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN [F Yin 2022] [arXiv] [demo](https://feiiyin.github.io/StyleHEAT/) [project page](https://feiiyin.github.io/StyleHEAT/)
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [Hang Zhou 2021] [CVPR] [demo](https://www.youtube.com/watch?v=lNQQHIggnUg) [project page](https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS)
- Talking Head Generation with Audio and Speech Related Facial Action Units [S Chen 2021] [BMVC]
- Speech Driven Talking Face Generation from a Single Image and an Emotion Condition [SE Eskimez 2021] [arXiv] [project page](https://github.com/eeskimez/emotalkingface)
- HeadGAN: Video-and-Audio-Driven Talking Head Synthesis [MC Doukas 2021] [arXiv] [demo](https://crossminds.ai/video/headgan-video-and-audio-driven-talking-head-synthesis-6062842b40ac1ab106a4849e/)
- Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning [Hao Zhu 2020] [IJCAI]
- A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild [K R Prajwal 2020] [ACMMM] [demo](https://crossminds.ai/video/a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild-5fecb0d974cbe5b2a4175b62/) [project page](https://github.com/Rudrabha/Wav2Lip)
- Talking Face Generation with Expression-Tailored Generative Adversarial Network [D Zeng 2020] [ACMMM]
- Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis [KR Prajwal 2020] [CVPR] [demo](https://www.youtube.com/watch?v=HziA-jmlk_4) [project page](https://github.com/Rudrabha/Lip2Wav)
- Robust One Shot Audio to Video Generation [N Kumar 2020] [CVPRW] [demo](https://www.facebook.com/wdeepvision2020/videos/925563794582962/)
- Talking Face Generation by Adversarially Disentangled Audio-Visual Representation [Hang Zhou 2019] [AAAI] [demo](https://www.youtube.com/watch?v=-J2zANwdjcQ) [project page](https://github.com/Hangz-nju-cuhk/Talking-Face-Generation-DAVS)
- Talking face generation by conditional recurrent adversarial network [Yang Song 2019] [IJCAI] [demo](https://www.youtube.com/watch?v=Sr4smQo5BAQ) [project page](https://github.com/susanqq/Talking_Face_Generation)
- Realistic Speech-Driven Facial Animation with GANs [Konstantinos Vougioukas 2019] [IJCV] [demo](https://sites.google.com/view/facial-animation) [project page](https://github.com/DinoMan/speech-driven-animation)
- Animating Face using Disentangled Audio Representations [G Mittal 2019] [WACV]
- Lip Movements Generation at a Glance [Lele Chen 2018] [ECCV] [demo](https://www.youtube.com/watch?v=7IX_sIL5v0c) [project page](https://github.com/lelechen63/3d_gan)
- X2Face: A network for controlling face generation using images, audio, and pose codes [Olivia Wiles 2018] [ECCV] [demo](https://www.youtube.com/watch?v=q6dt-2izYM4) [project page](https://github.com/oawiles/X2Face)
- Generative Adversarial Talking Head: Bringing Portraits to Life with a Weakly Supervised Neural Network [HX Pham 2018] [arXiv] [demo](https://www.youtube.com/watch?v=Zr9MlAazPpo)
- You said that? [Chung 2017] [BMVC] [demo](https://www.youtube.com/watch?v=lXhkxjSJ6p8) [project page](https://github.com/joonson/yousaidthat)

## Landmark Based
- Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation [YUANXUN LU 2021] [SIGGRAPH] [demo](https://replicate.com/yuanxunlu/livespeechportraits) [project page](https://github.com/YuanxunLu/LiveSpeechPortraits)
- Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis [H Wu 2021] [ACMMM] [demo](https://github.com/wuhaozhe/style_avatar) [project page](https://github.com/wuhaozhe/style_avatar)
- MakeItTalk: Speaker-Aware Talking-Head Animation [YANG ZHOU 2020] [SIGGRAPH] [demo](https://www.youtube.com/watch?v=vUMGKASgbf8&) [project page](https://github.com/yzhou359/MakeItTalk)
- Speech-driven Facial Animation using Cascaded GANs for Learning of Motion and Texture [Dipanjan Das, Sandika Biswas 2020] [ECCV]
- A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors [R Zheng 2020] [ICPR]
- Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss [Lele Chen 2019] [CVPR] [demo](https://www.youtube.com/watch?v=eH7h_bDRX2Q&t=50s) [project page](https://github.com/lelechen63/ATVGnet)
- Speech-Driven Facial Reenactment Using Conditional Generative Adversarial Networks [SA Jalalifar 2018] [arXiv]
- Synthesizing Obama: learning lip sync from audio [SUPASORN SUWAJANAKORN 2017] [SIGGRAPH] [demo](https://www.youtube.com/watch?v=9Yq67CjDqvw)

## 3D Model Based
- Everybody’s Talkin’: Let Me Talk as You Want [Linsen Song 2022] [TIFS] [demo](https://www.youtube.com/watch?v=tNPuAnvijQk)
- One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning [Suzhen Wang 2022] [AAAI] [demo](https://www.youtube.com/watch?v=HHj-XCXXePY) [project page](https://github.com/FuxiVirtualHuman/AAAI22-one-shot-talking-face)
- FaceFormer: Speech-Driven 3D Facial Animation with Transformers [Y Fan 2022] [CVPR] [demo](https://www.youtube.com/watch?v=NYms53uf9YY) [project page](https://github.com/EvelynFan/FaceFormer)
- Iterative Text-based Editing of Talking-heads Using Neural Retargeting [Xinwei Yao 2021] [ICML] [demo](https://www.youtube.com/watch?v=oo4tB0f6uqQ)
- AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis [Yudong Guo 2021] [ICCV] [demo](https://www.youtube.com/watch?v=TQO2EBYXLyU) [project page](https://github.com/YudongGuo/AD-NeRF)
- Audio-driven emotional video portraits [X Ji 2021] [CVPR] [demo](https://www.youtube.com/watch?v=o6LQfLkizbw) [project page](https://github.com/jixinya/EVP)
- FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning [C Zhang 2021] [ICCV] [demo](https://www.youtube.com/watch?v=hl9ek3bUV1E) [project page](https://github.com/zhangchenxu528/FACIAL)
- Flow-guided One-shot Talking Face Generation with a High-resolution Audio-visual Dataset [Z Zhang 2021] [CVPR] [demo](https://www.youtube.com/watch?v=uJdBgWYBTww) [project page](https://github.com/MRzzm/HDTF)
- Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion [Suzhen Wang 2021] [IJCAI] [demo](https://www.youtube.com/watch?v=xvcBJ29l8rA) [project page](https://github.com/wangsuzhen/Audio2Head)
- MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement [A Richard 2021] [ICCV] [demo](https://www.facebook.com/MetaResearch/videos/251508987094387/) [project page](https://github.com/facebookresearch/meshtalk)
- 3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head [Q Wang 2021] [arXiv]
- Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation [L Li 2021] [AAAI] [demo](https://www.youtube.com/watch?v=weHA6LHv-Ew) [project page](https://github.com/FuxiVirtualHuman/Write-a-Speaker)
- Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary [S Zhang 2021] [ICASSP] [demo](https://twitter.com/_akhaliq/status/1389054381182570497) [project page](https://github.com/sibozhang/Text2Video)
- Neural Voice Puppetry: Audio-driven Facial Reenactment [Justus Thies 2020] [ECCV] [demo](https://www.youtube.com/watch?v=s74_yQiJMXA) [project page](https://github.com/miu200521358/NeuralVoicePuppetryMMD)
- Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose [Ran Yi 2020] [arXiv] [project page](https://github.com/yiranran/Audio-driven-TalkingFace-HeadPose)
- Talking-head Generation with Rhythmic Head Motion [Lele Chen 2020] [ECCV] [demo](https://www.youtube.com/watch?v=kToSgSFoRz8) [project page](https://github.com/lelechen63/Talking-head-Generation-with-Rhythmic-Head-Motion)
- Modality Dropout for Improved Performance-driven Talking Faces [Hussen Abdelaziz 2020] [ICMI]
- Audio- and Gaze-driven Facial Animation of Codec Avatars [A Richard 2020] [arXiv] [demo](https://www.youtube.com/watch?v=1nZjW_xoCDQ) [project page](https://research.facebook.com/videos/audio-and-gaze-driven-facial-animation-of-codec-avatars/)
- Text-based editing of talking-head video [OHAD FRIED 2019] [arXiv] [demo](https://www.youtube.com/watch?v=0ybLCfVeFL4)
- Capture, Learning, and Synthesis of 3D Speaking Styles [D Cudeiro 2019] [CVPR] [demo](https://www.youtube.com/watch?v=XceCxf_GyW4) [project page](https://github.com/TimoBolkart/voca)
- Visemenet: audio-driven animator-centric speech animation [YANG ZHOU 2018] [TOG] [demo](https://www.youtube.com/watch?v=kk2EnyMD3mo)
- Speech-Driven Expressive Talking Lips with Conditional Sequential Generative Adversarial Networks [N Sadoughi 2018] [TAC]
- Speech-driven 3D Facial Animation with Implicit Emotional Awareness: A Deep Learning Approach [Hai X. Pham 2017] [IEEE Trans. Syst. Man Cybern.: Syst.]
- Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion [TERO KARRAS 2017] [TOG] [demo](https://www.youtube.com/watch?v=lDzrfdpGqw4&t) [project page](https://research.nvidia.com/publication/2017-07_audio-driven-facial-animation-joint-end-end-learning-pose-and-emotion)
- A deep learning approach for generalized speech animation [SARAH TAYLOR 2017] [SIGGRAPH] [demo](https://www.youtube.com/watch?v=GwV1n8v_bpA)
- End-to-end Learning for 3D Facial Animation from Speech [HX Pham 2017] [ICMI]
- JALI: An Animator-Centric Viseme Model for Expressive Lip Synchronization [Pif Edwards 2016] [SIGGRAPH] [demo](https://www.youtube.com/watch?v=vniMsN53ZPI)

## Survey
- What comprises a good talking-head video generation?: A Survey and Benchmark [Lele Chen 2020] [paper](https://arxiv.org/abs/2005.03201)
- Deep Audio-Visual Learning: A Survey [Hao Zhu 2020] [paper](https://arxiv.org/abs/2001.04758)
- Handbook of Digital Face Manipulation and Detection [Yuxin Wang 2022] [paper](https://library.oapen.org/bitstream/handle/20.500.12657/52835/978-3-030-87664-7.pdf?sequence=1)
- Deep Learning for Visual Speech Analysis: A Survey [paper](https://arxiv.org/abs/2205.10839)
## Datasets
- GRID 2006 [project page](http://spandh.dcs.shef.ac.uk/avlombard/)
- TCD-TIMIT 2015 [project page](https://sigmedia.tcd.ie/)
- LRW 2016 [project page](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html)
- MODALITY 2017 [project page](http://www.modality-corpus.org/)
- ObamaSet 2017
- Voxceleb1 2017 [project page](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/)
- Voxceleb2 2018 [project page](https://www.robots.ox.ac.uk/~vgg/data/voxceleb2/)
- LRS2-BBC 2018 [project page](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html)
- LRS3-TED 2018 [project page](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html)
- HDTF 2020 [project page](https://github.com/MRzzm/HDTF)
- CREMA-D 2014 [project page](https://github.com/CheyneyComputerScience/CREMA-D)
- MSP-IMPROV 2016 [project page](https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Improv.html)
- RAVDESS 2018 [project page](https://sites.psychlabs.ryerson.ca/smartlab/resources/speech-song-database-ravdess/)
- MELD 2018 [project page](https://affective-meld.github.io/)
- MEAD 2020 [project page](https://wywu.github.io/projects/MEAD/MEAD.html)
- CAVSR1.0 1998
- HIT Bi-CAV 2005
- LRW-1000 2018 [project page](https://github.com/VIPL-Audio-Visual-Speech-Understanding/Lipreading-DenseNet3D)

## Metrics
| Metrics | Paper |
| ---------------------------------------------------- | ------------------------------------------------------------ |
| PSNR (peak signal-to-noise ratio) | - |
| SSIM (structural similarity index measure) | Image quality assessment: from error visibility to structural similarity. |
| CPBD (cumulative probability of blur detection) | A no-reference image blur metric based on the cumulative probability of blur detection |
| LPIPS (Learned Perceptual Image Patch Similarity) | The Unreasonable Effectiveness of Deep Features as a Perceptual Metric |
| NIQE (Natural Image Quality Evaluator) | Making a ‘Completely Blind’ Image Quality Analyzer |
| FID (Fréchet inception distance) | GANs trained by a two time-scale update rule converge to a local Nash equilibrium |
| LMD (landmark distance error) | Lip Movements Generation at a Glance |
| LRA (lip-reading accuracy) | Talking Face Generation by Conditional Recurrent Adversarial Network |
| WER (word error rate) | LipNet: end-to-end sentence-level lipreading. |
| LSE-D (Lip Sync Error - Distance) | Out of time: automated lip sync in the wild |
| LSE-C (Lip Sync Error - Confidence) | Out of time: automated lip sync in the wild |
| ACD (average content distance) | FaceNet: a unified embedding for face recognition and clustering. |
| CSIM (cosine similarity) | ArcFace: additive angular margin loss for deep face recognition. |
| EAR (eye aspect ratio) | Real-time eye blink detection using facial landmarks. In: Computer Vision Winter Workshop |
| ESD (emotion similarity distance) | What comprises a good talking-head video generation?: A Survey and Benchmark |
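
The full-reference image metrics above (PSNR, SSIM) can be reproduced with off-the-shelf libraries. Below is a minimal sketch, assuming scikit-image and NumPy are installed and that generated and ground-truth frames are available as same-sized uint8 RGB arrays; everything except the `skimage.metrics` calls (function names, the toy data) is illustrative, not taken from any paper in this list.

```python
# Minimal sketch: PSNR and SSIM between a generated frame and its ground-truth frame.
# Assumes scikit-image and NumPy are installed; frame loading/alignment is left to the reader.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def frame_psnr_ssim(generated: np.ndarray, reference: np.ndarray) -> tuple[float, float]:
    """Compute PSNR and SSIM for one pair of uint8 RGB frames of identical shape."""
    psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
    ssim = structural_similarity(reference, generated, channel_axis=-1, data_range=255)
    return psnr, ssim


if __name__ == "__main__":
    # Toy example with synthetic frames; a real evaluation would average over all video frames.
    rng = np.random.default_rng(0)
    ref = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
    noise = rng.integers(-10, 11, size=ref.shape)
    gen = np.clip(ref.astype(np.int16) + noise, 0, 255).astype(np.uint8)
    print(frame_psnr_ssim(gen, ref))
```

Perceptual and lip-sync metrics (LPIPS, FID, LSE-D/LSE-C) require pretrained networks from the cited papers and are not shown here.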