
# Awesome talking face generation

# papers & codes
## 2023

| title | venue | paper | code | dataset | keywords |
| --- | --- | --- | --- | --- | --- |
|CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior|CVPR(23)|[paper](https://arxiv.org/abs/2301.02379)|[code](https://github.com/Doubiiu/CodeTalker)|BIWI, VOCA||
|DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation|CVPR(23)|[paper](https://arxiv.org/abs/2301.03786)|[proj](https://sstzal.github.io/DiffTalk/)|HDTF|Diffusion|
|AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction|CVPR(23)|[paper](https://arxiv.org/abs/2304.13115)||Multiface|3D|
|Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert|CVPR(23)|[paper](https://arxiv.org/abs/2303.17480)|[code](https://github.com/Sxjdwang/TalkLip)|LRS2||
|LipFormer: High-fidelity and Generalizable Talking Face Generation with A Pre-learned Facial Codebook|CVPR(23)|[paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Wang_LipFormer_High-Fidelity_and_Generalizable_Talking_Face_Generation_With_a_Pre-Learned_CVPR_2023_paper.pdf)||LRS2, FFHQ||
|Parametric Implicit Face Representation for Audio-Driven Facial Reenactment|CVPR(23)|[paper](https://arxiv.org/abs/2306.07579)| |HDTF||
|Identity-Preserving Talking Face Generation with Landmark and Appearance Priors|CVPR(23)|[paper](https://arxiv.org/abs/2305.08293)|[code](https://github.com/Weizhi-Zhong/IP_LAP)|LRS2, LRS3||
|High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning|CVPR(23)|[paper](https://arxiv.org/abs/2305.02572)| |MEAD|emotion|
|Emotional Talking Head Generation based on Memory-Sharing and Attention-Augmented Networks|InterSpeech(23)|[paper](https://arxiv.org/abs/2306.03594)| | MEAD | emotion |
|EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation|ICCV(23)|[paper](https://arxiv.org/abs/2303.11089)|[code(not yet)](https://github.com/ZiqiaoPeng/EmoTalk)||emotion|
|Emotionally Enhanced Talking Face Generation||[paper](https://arxiv.org/pdf/2303.11548.pdf)|[code](https://github.com/sahilg06/EmoGen)|CREMA-D|emotion|
|DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video|AAAI(23)|[paper](https://fuxivirtualhuman.github.io/pdf/AAAI2023_FaceDubbing.pdf)|[code](https://github.com/MRzzm/DINet)|||
|GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis|ICLR(23)|[paper](https://arxiv.org/pdf/2301.13430.pdf)|[code](https://github.com/yerfor/GeneFace)||NeRF|
|OPT: One-shot Pose-Controllable Talking Head Generation||[paper](https://arxiv.org/pdf/2302.08197.pdf)||||
|LipNeRF: What is the right feature space to lip-sync a NeRF?||[paper](https://assets.amazon.science/00/58/6b3a5d7e417bae273191ed9ea1b2/lipnerf-what-is-the-right-feature-space-to-lip-sync-a-nerf.pdf)|||NeRF|
|Audio-Visual Face Reenactment | WACV (23)|[paper](https://arxiv.org/pdf/2210.02755.pdf)| [code](https://github.com/mdv3101/AVFR-Gan)| | |
|Towards Generating Ultra-High Resolution Talking-Face Videos With Lip Synchronization|WACV (23)|[paper](https://openaccess.thecvf.com/content/WACV2023/papers/Gupta_Towards_Generating_Ultra-High_Resolution_Talking-Face_Videos_With_Lip_Synchronization_WACV_2023_paper.pdf)| | | |
|StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles|AAAI(23)|[paper](https://arxiv.org/pdf/2301.01081.pdf)|[code](https://github.com/FuxiVirtualHuman/styletalk)|||
|Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation||[paper](https://mstypulkowski.github.io/diffusedheads/diffused_heads.pdf)|[proj](https://mstypulkowski.github.io/diffusedheads/)||Diffusion|
|Speech Driven Video Editing via an Audio-Conditioned Diffusion Model||[paper](https://arxiv.org/pdf/2301.04474.pdf)|[code](https://github.com/DanBigioi/DiffusionVideoEditing)||Diffusion|
|TalkCLIP: Talking Head Generation with Text-Guided Expressive Speaking Styles||[paper](https://arxiv.org/pdf/2304.00334.pdf)||Text-Annotated MEAD|Text|

## 2022

| title | venue | paper | code | dataset | keywords |
| --- | --- | --- | --- | --- | --- |
|Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors||[paper](https://arxiv.org/pdf/2212.04248v1.pdf)|[proj](https://zxyin.github.io/TH-PAD/)|||
|SPACE: Speech-driven Portrait Animation with Controllable Expression|ICCV(23)|[paper](https://arxiv.org/abs/2211.09809)|||Pose, Emotion|
|SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation|CVPR(23)|[paper](https://arxiv.org/pdf/2211.12194v1.pdf)|[code](https://github.com/OpenTalker/SadTalker)|||
|Compressing Video Calls using Synthetic Talking Heads | BMVC (22)|[paper](https://arxiv.org/pdf/2210.03692.pdf)| | |application |
|EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model | SIGGRAPH (22)|[paper](https://arxiv.org/pdf/2205.15278.pdf)|||emotion|
|Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis|ECCV(22)|[paper](https://arxiv.org/abs/2207.11770)|[code](https://github.com/sstzal/DFRF)|||
| Expressive Talking Head Generation with Granular Audio-Visual Control |CVPR(22)| [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Liang_Expressive_Talking_Head_Generation_With_Granular_Audio-Visual_Control_CVPR_2022_paper.pdf)| | | |
|Talking Face Generation With Multilingual TTS| CVPR(22)| [paper](https://arxiv.org/abs/2205.06421) | [code](https://huggingface.co/spaces/CVPR/ml-talking-face) | | - |
|Deep Learning for Visual Speech Analysis: A Survey| |[paper](https://arxiv.org/abs/2205.10839)| | |survey|
| StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN | | [paper](https://arxiv.org/abs/2203.04036) | [code](https://github.com/FeiiYin/StyleHEAT) | |stylegan|
|Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation|ECCV(22)|[paper](https://arxiv.org/pdf/2201.07786.pdf)|[code(coming soon)](https://github.com/alvinliu0/SSP-NeRF)||NeRF|
|Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation| |[paper](https://www.aaai.org/AAAI22Papers/AAAI-6163.YangC.pdf) | | | |
|SyncTalkFace: Talking Face Generation with Precise Lip-syncing via Audio-Lip Memory|AAAI(22)| [paper(temp)](https://www.aaai.org/AAAI22Papers/AAAI-7528.ParkS.pdf) | | LRW, LRS2, BBC News | |
|DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering||[paper](https://arxiv.org/abs/2201.00791)|||NeRF|
|Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos||[paper](https://arxiv.org/pdf/2206.04523.pdf)| | | |
|Dynamic Neural Textures: Generating Talking-Face Videos with Continuously Controllable Expressions||[paper](https://arxiv.org/pdf/2204.06180.pdf)| | | |
|DialogueNeRF: Towards Realistic Avatar Face-to-face Conversation Video Generation||[paper](https://arxiv.org/pdf/2203.07931.pdf)| | | |
|Talking Head Generation Driven by Speech-Related Facial Action Units and Audio- Based on Multimodal Representation Fusion||[paper](https://arxiv.org/pdf/2204.12756.pdf)| | | |
|StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation| | [paper](https://arxiv.org/abs/2208.10922) | | - | |
|AutoLV: Automatic Lecture Video Generator||[paper](https://arxiv.org/pdf/2209.08795v1.pdf)||||
|Synthesizing Photorealistic Virtual Humans Through Cross-modal Disentanglement||[paper](https://arxiv.org/pdf/2209.01320v1.pdf)||||

## 2021
| title | venue | paper | code | dataset |
| --- | --- | --- | --- | --- |
|Depth-Aware Generative Adversarial Network for Talking Head Video Generation|CVPR(22)|[paper](https://arxiv.org/pdf/2203.06605v2.pdf)|[code](https://github.com/harlanhong/CVPR2022-DaGAN)| |
|Parallel and High-Fidelity Text-to-Lip Generation ||[paper](https://arxiv.org/pdf/2107.06831v2.pdf)|| |
|[Survey]Deep Person Generation: A Survey from the Perspective of Face, Pose and Cloth Synthesis|- |[paper](https://arxiv.org/abs/2109.02081)| | |
|FaceFormer: Speech-Driven 3D Facial Animation with Transformers|CVPR(22)|[paper](https://arxiv.org/abs/2112.05329)|[code](https://github.com/EvelynFan/FaceFormer)| |
| Voice2Mesh: Cross-Modal 3D Face Model Generation from Voices | | [paper](https://arxiv.org/pdf/2104.10299v1.pdf) | [code](https://github.com/choyingw/Voice2Mesh) | |
|FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning|ICCV|[paper](https://arxiv.org/pdf/2108.07938v1.pdf)|[code](https://github.com/zhangchenxu528/FACIAL)| |
|Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis| |[paper](https://arxiv.org/pdf/2111.00203v1.pdf)|[code](https://github.com/wuhaozhe/style_avatar) | |
|Audio-Driven Emotional Video Portraits|CVPR|[paper](https://arxiv.org/pdf/2104.07452.pdf)|[code](https://github.com/jixinya/EVP/)|MEAD, LRW|
|LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization|CVPR|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Lahiri_LipSync3D_Data-Efficient_Learning_of_Personalized_3D_Talking_Faces_From_Video_CVPR_2021_paper.pdf)| | |
|Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation|CVPR|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhou_Pose-Controllable_Talking_Face_Generation_by_Implicitly_Modularized_Audio-Visual_Representation_CVPR_2021_paper.pdf)|[code](https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS)|VoxCeleb2, LRW|
|Flow-guided One-shot Talking Face Generation with a High-resolution Audio-visual Dataset|CVPR|[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhang_Flow-Guided_One-Shot_Talking_Face_Generation_With_a_High-Resolution_Audio-Visual_Dataset_CVPR_2021_paper.pdf)|[code](https://github.com/MRzzm/HDTF)|HDTF|
|MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement|ICCV| [paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Richard_MeshTalk_3D_Face_Animation_From_Speech_Using_Cross-Modality_Disentanglement_ICCV_2021_paper.pdf) | [code(coming soon)](https://github.com/facebookresearch/meshtalk) | |
|AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis|ICCV| [paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Guo_AD-NeRF_Audio_Driven_Neural_Radiance_Fields_for_Talking_Head_Synthesis_ICCV_2021_paper.pdf) | [code](https://github.com/YudongGuo/AD-NeRF) | |
|Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation |AAAI |[paper](https://arxiv.org/pdf/2104.07995.pdf) |[code(coming soon)](https://github.com/FuxiVirtualHuman/Write-a-Speaker)|Mocap dataset |
|Visual Speech Enhancement Without A Real Visual Stream | | [paper](https://openaccess.thecvf.com/content/WACV2021/papers/Hegde_Visual_Speech_Enhancement_Without_a_Real_Visual_Stream_WACV_2021_paper.pdf)| | |
|Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary| |[paper](https://arxiv.org/pdf/2104.14631v1.pdf)|[code](https://github.com/sibozhang/Text2Video)| |
|Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion|IJCAI|[paper](https://arxiv.org/pdf/2107.09293.pdf)|[code](https://github.com/wangsuzhen/Audio2Head) | VoxCeleb, GRID, LRW |
|3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head| |[paper](https://arxiv.org/pdf/2104.12051.pdf)| | |
|AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person| |[paper](https://arxiv.org/pdf/2108.04325v2.pdf)| | VoxCeleb2, Obama|

## 2020
| title | venue | paper | code | dataset |
| --- | --- | --- | --- | --- |
|[Survey]What comprises a good talking-head video generation?: A survey and benchmark| |[paper](https://arxiv.org/pdf/2005.03201.pdf)|[code](https://github.com/lelechen63/talking-head-generation-survey)| |
|One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing|CVPR(21)|[paper](https://arxiv.org/abs/2011.15126)|[code](https://github.com/NVlabs/imaginaire)| |
|Speech Driven Talking Face Generation from a Single Image and an Emotion Condition| |[paper](https://arxiv.org/pdf/2008.03592.pdf)|[code](https://github.com/eeskimez/emotalkingface)|CREMA-D|
|A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild | ACMMM|[paper](https://arxiv.org/pdf/2008.10010.pdf) |[code](https://github.com/Rudrabha/Wav2Lip) | LRS2 |
|Talking-head Generation with Rhythmic Head Motion |ECCV |[paper](https://arxiv.org/pdf/2007.08547.pdf) | [code](https://github.com/lelechen63/Talking-head-Generation-with-Rhythmic-Head-Motion)| Crema, Grid, Voxceleb, Lrs3 |
|MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation|ECCV | [paper](https://wywu.github.io/projects/MEAD/support/MEAD.pdf)| [code](https://github.com/uniBruce/Mead)| VoxCeleb2, AffectNet |
|Neural Voice Puppetry: Audio-driven Facial Reenactment|ECCV|[paper](https://arxiv.org/pdf/1912.05566.pdf)| | |
|Fast Bi-layer Neural Synthesis of One-Shot Realistic Head Avatars| ECCV|[paper](https://arxiv.org/pdf/2008.10174v1.pdf) |[code](https://github.com/saic-violet/bilayer-model)| |
|HeadGAN: Video-and-Audio-Driven Talking Head Synthesis| |[paper](https://arxiv.org/pdf/2012.08261v1.pdf)| |VoxCeleb2|
|MakeItTalk: Speaker-Aware Talking Head Animation| |[paper](https://arxiv.org/pdf/2004.12992.pdf)|[code](https://github.com/adobe-research/MakeItTalk), [code](https://github.com/yzhou359/MakeItTalk)|VoxCeleb2, VCTK|
|Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose |- |[paper](https://arxiv.org/pdf/2002.10137.pdf)|[code](https://github.com/yiranran/Audio-driven-TalkingFace-HeadPose)| ImageNet, FaceWarehouse, LRW|
|Photorealistic Lip Sync with Adversarial Temporal Convolutional Networks| |[paper](https://arxiv.org/pdf/2002.08700.pdf)| | |
|Speech-Driven Facial Animation Using Polynomial Fusion of Features| |[paper](https://arxiv.org/pdf/1912.05833.pdf)| |LRW|
|Animating Face using Disentangled Audio Representations|WACV|[paper](https://arxiv.org/pdf/1910.00726.pdf)| | |
|Everybody’s Talkin’: Let Me Talk as You Want| |[paper](https://arxiv.org/pdf/2001.05201.pdf)| | |
|Multimodal Inputs Driven Talking Face Generation With Spatial-Temporal Dependency| |[paper](https://www.researchgate.net/profile/Jun_Yu42/publication/339224051_Multimodal_Inputs_Driven_Talking_Face_Generation_With_Spatial-Temporal_Dependency/links/5eae2c6a92851cb2676fa016/Multimodal-Inputs-Driven-Talking-Face-Generation-With-Spatial-Temporal-Dependency.pdf)| | |

## 2019
| title | venue | paper | code | dataset |
| --- | --- | --- | --- | --- |
|Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss|CVPR|[paper](https://arxiv.org/pdf/1905.03820.pdf)|[code](https://github.com/lelechen63/ATVGnet)|VGG Face, LRW|

## datasets
- MEAD [link](https://wywu.github.io/projects/MEAD/MEAD.html)
- HDTF [link](https://github.com/MRzzm/HDTF)
- CREMA-D [link](https://github.com/CheyneyComputerScience/CREMA-D)
- VoxCeleb [link](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/)
- LRS2 [link](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html)
- LRW [link](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html)
- GRID [link](http://spandh.dcs.shef.ac.uk/avlombard/)
- SAVEE [link](http://kahlan.eps.surrey.ac.uk/savee/Download.html)
- BIWI(3D) [link](https://data.vision.ee.ethz.ch/cvl/datasets/b3dac2.en.html)
- VOCA [link](https://voca.is.tue.mpg.de/)
- Multiface(3D) [link](https://github.com/facebookresearch/multiface)

## metrics
- PSNR (peak signal-to-noise ratio)
- SSIM (structural similarity index measure)
- LMD (landmark distance error)
- LRA (lip-reading accuracy) [-](https://arxiv.org/pdf/1804.04786.pdf)
- FID (Fréchet inception distance)
- LSE-D (Lip Sync Error - Distance)
- LSE-C (Lip Sync Error - Confidence)
- LPIPS (Learned Perceptual Image Patch Similarity) [-](https://arxiv.org/pdf/1801.03924.pdf)
- NIQE (Natural Image Quality Evaluator) [-](http://live.ece.utexas.edu/research/Quality/niqe_spl.pdf)
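
The full-reference image-quality metrics above (PSNR, SSIM) can be computed with off-the-shelf libraries. Below is a minimal sketch, assuming `numpy` and `scikit-image` are installed and that the generated and ground-truth frames are aligned, same-sized uint8 RGB arrays; the function and variable names are illustrative, not from any specific repository.

```python
# Minimal sketch: PSNR and SSIM between a generated frame and its ground-truth frame.
# Assumes aligned, same-sized uint8 RGB arrays (e.g. decoded video frames).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def frame_quality(generated: np.ndarray, reference: np.ndarray) -> dict:
    """Return PSNR (dB) and SSIM for a single frame pair."""
    psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
    ssim = structural_similarity(reference, generated, channel_axis=-1, data_range=255)
    return {"psnr": psnr, "ssim": ssim}


if __name__ == "__main__":
    # Toy example with synthetic frames; in practice, iterate over frames
    # decoded from the generated and ground-truth videos and average the scores.
    rng = np.random.default_rng(0)
    ref = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
    noise = rng.integers(-10, 10, size=ref.shape)
    gen = np.clip(ref.astype(np.int16) + noise, 0, 255).astype(np.uint8)
    print(frame_quality(gen, ref))
```

LPIPS is typically computed with the `lpips` PyPI package using a pretrained AlexNet or VGG backbone, and the lip-sync metrics LSE-D/LSE-C are usually obtained from the pretrained SyncNet evaluation scripts shipped with the Wav2Lip repository rather than re-implemented.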