# Awesome talking face generation

# papers & codes
## 2023

| title | venue | paper | code | dataset | keywords |
| --- | ---| --- | --- | --- | --- |
|CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior|CVPR(23)|[paper](https://arxiv.org/abs/2301.02379)|[code](https://github.com/Doubiiu/CodeTalker)|BIWI, VOCA|3D|
|DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation|CVPR(23)|[paper](https://arxiv.org/abs/2301.03786)|[proj](https://sstzal.github.io/DiffTalk/)|HDTF|Diffusion|
|AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction|CVPR(23)|[paper](https://arxiv.org/abs/2304.13115)||Multiface|3D|
|Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert|CVPR(23)|[paper](https://arxiv.org/abs/2303.17480)|[code](https://github.com/Sxjdwang/TalkLip)|LRS2||
|LipFormer: High-fidelity and Generalizable Talking Face Generation with A Pre-learned Facial Codebook|CVPR(23)|[paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Wang_LipFormer_High-Fidelity_and_Generalizable_Talking_Face_Generation_With_a_Pre-Learned_CVPR_2023_paper.pdf)||LRS2, FFHQ||
|Parametric Implicit Face Representation for Audio-Driven Facial Reenactment|CVPR(23)|[paper](https://arxiv.org/abs/2306.07579)| |HDTF||
|Identity-Preserving Talking Face Generation with Landmark and Appearance Priors|CVPR(23)|[paper](https://arxiv.org/abs/2305.08293)|[code](https://github.com/Weizhi-Zhong/IP_LAP)|LRS2, LRS3||
|High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning|CVPR(23)|[paper](https://arxiv.org/abs/2305.02572)| |MEAD|emotion|
|Emotional Talking Head Generation based on Memory-Sharing and Attention-Augmented Networks|InterSpeech(23)|[paper](https://arxiv.org/abs/2306.03594)| | MEAD | emotion |
|EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation|ICCV(23)|[paper](https://arxiv.org/abs/2303.11089)|[code(not yet)](https://github.com/ZiqiaoPeng/EmoTalk)||emotion|
|Emotionally Enhanced Talking Face Generation||[paper](https://arxiv.org/pdf/2303.11548.pdf)|[code](https://github.com/sahilg06/EmoGen)|CREMA-D|emotion|
|DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video|AAAI(23)|[paper](https://fuxivirtualhuman.github.io/pdf/AAAI2023_FaceDubbing.pdf)|[code](https://github.com/MRzzm/DINet)|||
|GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis|ICLR(23)|[paper](https://arxiv.org/pdf/2301.13430.pdf)|[code](https://github.com/yerfor/GeneFace)||NeRF|
|OPT: One-shot Pose-Controllable Talking Head Generation||[paper](https://arxiv.org/pdf/2302.08197.pdf)||||
|LipNeRF: What is the right feature space to lip-sync a NeRF?||[paper](https://assets.amazon.science/00/58/6b3a5d7e417bae273191ed9ea1b2/lipnerf-what-is-the-right-feature-space-to-lip-sync-a-nerf.pdf)|||NeRF|
|Audio-Visual Face Reenactment | WACV (23)|[paper](https://arxiv.org/pdf/2210.02755.pdf)| [code](https://github.com/mdv3101/AVFR-Gan)| | |
|Towards Generating Ultra-High Resolution Talking-Face Videos With Lip Synchronization|WACV (23)|[paper](https://openaccess.thecvf.com/content/WACV2023/papers/Gupta_Towards_Generating_Ultra-High_Resolution_Talking-Face_Videos_With_Lip_Synchronization_WACV_2023_paper.pdf)| | | |
|StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles|AAAI(23)|[paper](https://arxiv.org/pdf/2301.01081.pdf)|[code](https://github.com/FuxiVirtualHuman/styletalk)|||
|Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation||[paper](https://mstypulkowski.github.io/diffusedheads/diffused_heads.pdf)|[proj](https://mstypulkowski.github.io/diffusedheads/)||Diffusion|
|Speech Driven Video Editing via an Audio-Conditioned Diffusion Model||[paper](https://arxiv.org/pdf/2301.04474.pdf)|[code](https://github.com/DanBigioi/DiffusionVideoEditing)||Diffusion|
|TalkCLIP: Talking Head Generation with Text-Guided Expressive Speaking Styles||[paper](https://arxiv.org/pdf/2304.00334.pdf)||Text-Annotated MEAD|Text|

## 2022

| title | venue | paper | code | dataset | keywords |
| --- | ---| --- | --- | --- | --- |
|Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors||[paper](https://arxiv.org/pdf/2212.04248v1.pdf)|[proj](https://zxyin.github.io/TH-PAD/)|||
|SPACE: Speech-driven Portrait Animation with Controllable Expression|ICCV(23)|[paper](https://arxiv.org/abs/2211.09809)|||Pose, Emotion|
|SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation|CVPR(23)|[paper](https://arxiv.org/pdf/2211.12194v1.pdf)|[code](https://github.com/OpenTalker/SadTalker)|||
|Compressing Video Calls using Synthetic Talking Heads | BMVC (22)|[paper](https://arxiv.org/pdf/2210.03692.pdf)| | |application |
|EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model | SIGGRAPH (22)|[paper](https://arxiv.org/pdf/2205.15278.pdf)|||emotion|
|Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis|ECCV(22)|[paper](https://arxiv.org/abs/2207.11770)|[code](https://github.com/sstzal/DFRF)|||
| Expressive Talking Head Generation with Granular Audio-Visual Control |CVPR(22)| [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Liang_Expressive_Talking_Head_Generation_With_Granular_Audio-Visual_Control_CVPR_2022_paper.pdf)| | | |
|Talking Face Generation With Multilingual TTS| CVPR(22)| [paper](https://arxiv.org/abs/2205.06421) | [code](https://huggingface.co/spaces/CVPR/ml-talking-face) | | - |
|Deep Learning for Visual Speech Analysis: A Survey| |[paper](https://arxiv.org/abs/2205.10839)| | |survey|
| StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN | | [paper](https://arxiv.org/abs/2203.04036) | [code](https://github.com/FeiiYin/StyleHEAT) | |stylegan|
|Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation|ECCV(22)|[paper](https://arxiv.org/pdf/2201.07786.pdf)|[code(coming soon)](https://github.com/alvinliu0/SSP-NeRF)||NeRF|
|Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation| |[paper](https://www.aaai.org/AAAI22Papers/AAAI-6163.YangC.pdf) | | | |
|SyncTalkFace: Talking Face Generation with Precise Lip-syncing via Audio-Lip Memory|AAAI(22)| [paper(temp)](https://www.aaai.org/AAAI22Papers/AAAI-7528.ParkS.pdf) | | LRW, LRS2, BBC News | |
|DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering||[paper](https://arxiv.org/abs/2201.00791)|||NeRF|
|Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos||[paper](https://arxiv.org/pdf/2206.04523.pdf)||||
|Dynamic Neural Textures: Generating Talking-Face Videos with Continuously Controllable Expressions||[paper](https://arxiv.org/pdf/2204.06180.pdf)||||
|DialogueNeRF: Towards Realistic Avatar Face-to-face Conversation Video Generation||[paper](https://arxiv.org/pdf/2203.07931.pdf)||||
|Talking Head Generation Driven by Speech-Related Facial Action Units and Audio Based on Multimodal Representation Fusion||[paper](https://arxiv.org/pdf/2204.12756.pdf)||||
|StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation| | [paper](https://arxiv.org/abs/2208.10922) | | - | |
|AutoLV: Automatic Lecture Video Generator||[paper](https://arxiv.org/pdf/2209.08795v1.pdf)||||
|Synthesizing Photorealistic Virtual Humans Through Cross-modal Disentanglement||[paper](https://arxiv.org/pdf/2209.01320v1.pdf)||||

## 2021
| title | venue | paper | code | dataset |
| --- | ---| --- | --- | --- |
|Depth-Aware Generative Adversarial Network for Talking Head Video Generation|CVPR(22)|[paper](https://arxiv.org/pdf/2203.06605v2.pdf)|[code](https://github.com/harlanhong/CVPR2022-DaGAN)| |
|Parallel and High-Fidelity Text-to-Lip Generation ||[paper](https://arxiv.org/pdf/2107.06831v2.pdf)|| |
|[Survey]Deep Person Generation: A Survey from the Perspective of Face, Pose and Cloth Synthesis|- |[paper](https://arxiv.org/abs/2109.02081)| | |
|FaceFormer: Speech-Driven 3D Facial Animation with Transformers|CVPR(22)|[paper](https://arxiv.org/abs/2112.05329)|[code](https://github.com/EvelynFan/FaceFormer)| |
| Voice2Mesh: Cross-Modal 3D Face Model Generation from Voices | | [paper](https://arxiv.org/pdf/2104.10299v1.pdf) | [code](https://github.com/choyingw/Voice2Mesh) | |
|FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning|ICCV|[paper](https://arxiv.org/pdf/2108.07938v1.pdf)|[code](https://github.com/zhangchenxu528/FACIAL)| |
|Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis| |[paper](https://arxiv.org/pdf/2111.00203v1.pdf)|[code](https://github.com/wuhaozhe/style_avatar) | |
|Audio-Driven Emotional Video Portraits|CVPR|[paper](https://arxiv.org/pdf/2104.07452.pdf)|[code](https://github.com/jixinya/EVP/)|MEAD, LRW|
|LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization|CVPR|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Lahiri_LipSync3D_Data-Efficient_Learning_of_Personalized_3D_Talking_Faces_From_Video_CVPR_2021_paper.pdf)| | |
|Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation|CVPR|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhou_Pose-Controllable_Talking_Face_Generation_by_Implicitly_Modularized_Audio-Visual_Representation_CVPR_2021_paper.pdf)|[code](https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS)|VoxCeleb2, LRW|
|Flow-guided One-shot Talking Face Generation with a High-resolution Audio-visual Dataset|CVPR|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhang_Flow-Guided_One-Shot_Talking_Face_Generation_With_a_High-Resolution_Audio-Visual_Dataset_CVPR_2021_paper.pdf)|[code](https://github.com/MRzzm/HDTF)|HDTF|
|MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement|ICCV| [paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Richard_MeshTalk_3D_Face_Animation_From_Speech_Using_Cross-Modality_Disentanglement_ICCV_2021_paper.pdf) | [code(coming soon)](https://github.com/facebookresearch/meshtalk) | |
|AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis|ICCV| [paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Guo_AD-NeRF_Audio_Driven_Neural_Radiance_Fields_for_Talking_Head_Synthesis_ICCV_2021_paper.pdf) | [code](https://github.com/YudongGuo/AD-NeRF) | |
|Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation |AAAI |[paper](https://arxiv.org/pdf/2104.07995.pdf) |[code(coming soon)](https://github.com/FuxiVirtualHuman/Write-a-Speaker)|Mocap dataset |
|Visual Speech Enhancement Without A Real Visual Stream | | [paper](https://openaccess.thecvf.com/content/WACV2021/papers/Hegde_Visual_Speech_Enhancement_Without_a_Real_Visual_Stream_WACV_2021_paper.pdf)| | |
|Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary| |[paper](https://arxiv.org/pdf/2104.14631v1.pdf)|[code](https://github.com/sibozhang/Text2Video)| |
|Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion|IJCAI|[paper](https://arxiv.org/pdf/2107.09293.pdf)|[code](https://github.com/wangsuzhen/Audio2Head) | VoxCeleb, GRID, LRW |
|3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head| |[paper](https://arxiv.org/pdf/2104.12051.pdf)| | |
|AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person| |[paper](https://arxiv.org/pdf/2108.04325v2.pdf)| | VoxCeleb2, Obama|

## 2020
| title | venue | paper | code | dataset |
| --- | ---| --- | --- | --- |
|[Survey]What comprises a good talking-head video generation?: A survey and benchmark| |[paper](https://arxiv.org/pdf/2005.03201.pdf)|[code](https://github.com/lelechen63/talking-head-generation-survey)| |
|One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing|CVPR(21)|[paper](https://arxiv.org/abs/2011.15126)|[code](https://github.com/NVlabs/imaginaire)| |
|Speech Driven Talking Face Generation from a Single Image and an Emotion Condition| |[paper](https://arxiv.org/pdf/2008.03592.pdf)|[code](https://github.com/eeskimez/emotalkingface)|CREMA-D|
|A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild | ACMMM|[paper](https://arxiv.org/pdf/2008.10010.pdf) |[code](https://github.com/Rudrabha/Wav2Lip) | LRS2 |
|Talking-head Generation with Rhythmic Head Motion |ECCV |[paper](https://arxiv.org/pdf/2007.08547.pdf) | [code](https://github.com/lelechen63/Talking-head-Generation-with-Rhythmic-Head-Motion)| CREMA-D, GRID, VoxCeleb, LRS3 |
|MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation|ECCV | [paper](https://wywu.github.io/projects/MEAD/support/MEAD.pdf)| [code](https://github.com/uniBruce/Mead)| VoxCeleb2, AffectNet |
|Neural Voice Puppetry: Audio-driven Facial Reenactment|ECCV|[paper](https://arxiv.org/pdf/1912.05566.pdf)| | |
|Fast Bi-layer Neural Synthesis of One-Shot Realistic Head Avatars| ECCV|[paper](https://arxiv.org/pdf/2008.10174v1.pdf) |[code](https://github.com/saic-violet/bilayer-model)| |
|HeadGAN: Video-and-Audio-Driven Talking Head Synthesis| |[paper](https://arxiv.org/pdf/2012.08261v1.pdf)| |VoxCeleb2|
|MakeItTalk: Speaker-Aware Talking Head Animation| |[paper](https://arxiv.org/pdf/2004.12992.pdf)|[code](https://github.com/adobe-research/MakeItTalk), [code](https://github.com/yzhou359/MakeItTalk)| VoxCeleb2, VCTK |
|Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose |- |[paper](https://arxiv.org/pdf/2002.10137.pdf)|[code](https://github.com/yiranran/Audio-driven-TalkingFace-HeadPose)| ImageNet, FaceWarehouse, LRW|
|Photorealistic Lip Sync with Adversarial Temporal Convolutional Networks| |[paper](https://arxiv.org/pdf/2002.08700.pdf)| | |
|Speech-Driven Facial Animation Using Polynomial Fusion of Features| |[paper](https://arxiv.org/pdf/1912.05833.pdf)| |LRW|
|Animating Face using Disentangled Audio Representations|WACV|[paper](https://arxiv.org/pdf/1910.00726.pdf)| | |
|Everybody’s Talkin’: Let Me Talk as You Want| |[paper](https://arxiv.org/pdf/2001.05201.pdf)| | |
|Multimodal Inputs Driven Talking Face Generation With Spatial-Temporal Dependency| |[paper](https://www.researchgate.net/profile/Jun_Yu42/publication/339224051_Multimodal_Inputs_Driven_Talking_Face_Generation_With_Spatial-Temporal_Dependency/links/5eae2c6a92851cb2676fa016/Multimodal-Inputs-Driven-Talking-Face-Generation-With-Spatial-Temporal-Dependency.pdf)| | |

## 2019
| title | venue | paper | code | dataset |
| --- | ---| --- | --- | --- |
|Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss|CVPR|[paper](https://arxiv.org/pdf/1905.03820.pdf)|[code](https://github.com/lelechen63/ATVGnet)|VGG Face, LRW|

## datasets
- MEAD [link](https://wywu.github.io/projects/MEAD/MEAD.html)
- HDTF [link](https://github.com/MRzzm/HDTF)
- CREMA-D [link](https://github.com/CheyneyComputerScience/CREMA-D)
- VoxCeleb [link](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/)
- LRS2 [link](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html)
- LRW [link](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html)
- GRID [link](http://spandh.dcs.shef.ac.uk/avlombard/)
- SAVEE [link](http://kahlan.eps.surrey.ac.uk/savee/Download.html)
- BIWI(3D) [link](https://data.vision.ee.ethz.ch/cvl/datasets/b3dac2.en.html)
- VOCA [link](https://voca.is.tue.mpg.de/)
- Multiface(3D) [link](https://github.com/facebookresearch/multiface)

## metrics
- PSNR (peak signal-to-noise ratio)
- SSIM (structural similarity index measure)
- LMD (landmark distance error)
- LRA (lip-reading accuracy) [-](https://arxiv.org/pdf/1804.04786.pdf)
- FID (Fréchet inception distance)
- LSE-D (Lip Sync Error - Distance)
- LSE-C (Lip Sync Error - Confidence)
- LPIPS (Learned Perceptual Image Patch Similarity) [-](https://arxiv.org/pdf/1801.03924.pdf)
- NIQE (Natural Image Quality Evaluator) [-](http://live.ece.utexas.edu/research/Quality/niqe_spl.pdf)
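For the frame-level metrics above (PSNR, SSIM, LMD), a minimal sketch of how they are typically computed is shown below. This is not from any of the listed papers; it assumes `numpy` and `scikit-image`, 8-bit RGB frames of identical size, and landmark arrays of shape `(frames, points, 2)`. LSE-D/LSE-C additionally require a pretrained SyncNet and are omitted here.

```python
# Hedged sketch of common reconstruction metrics for talking-face generation.
# Assumptions: aligned ground-truth/generated frames (uint8 RGB, same size)
# and matched landmark arrays; names and shapes here are illustrative only.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def psnr(gt_frame: np.ndarray, gen_frame: np.ndarray) -> float:
    """Peak signal-to-noise ratio between ground-truth and generated frames."""
    return peak_signal_noise_ratio(gt_frame, gen_frame, data_range=255)


def ssim(gt_frame: np.ndarray, gen_frame: np.ndarray) -> float:
    """Structural similarity; channel_axis=-1 treats the last axis as RGB."""
    return structural_similarity(gt_frame, gen_frame, channel_axis=-1, data_range=255)


def lmd(gt_landmarks: np.ndarray, gen_landmarks: np.ndarray) -> float:
    """Landmark distance: mean Euclidean distance over (frames, points, 2) arrays."""
    return float(np.linalg.norm(gt_landmarks - gen_landmarks, axis=-1).mean())


if __name__ == "__main__":
    # Toy example with random frames; real evaluation would loop over a video.
    gt = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
    gen = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
    print(f"PSNR: {psnr(gt, gen):.2f} dB, SSIM: {ssim(gt, gen):.4f}")
```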