# awesome-avatar
This is a repository for organizing papers, code, and other resources related to avatars (talking-face and talking-body).

#### 🔆 This project is ongoing; pull requests are welcome!
If you have any suggestions (missing papers, new papers, key researchers, or typos), please feel free to open a pull request.

#### News
- **2024.09.07**: add ASR and TTS tools
- **2024.08.24**: add background material on image/video generation
- **2024.08.24**: re-organize the paper list with table formatting
- **2024.08.24**: add works on full-body avatar synthesis

#### TO DO LIST

- [x] Main paper list
- [x] Researchers list
- [x] Toolbox for avatar
- [x] Add paper link
- [ ] Add [paper notes](https://github.com/Jason-cs18/awesome-avatar/tree/main/notes)
- [x] Add code links where available
- [x] Add project pages where available
- [x] Datasets and metrics
- [x] Related links

## Researchers and labs
1. [NVIDIA Research](https://www.nvidia.com/en-us/research/)
- Neural rendering models for human generation: [vid2vid NeurIPS'18](https://tcwang0509.github.io/vid2vid/), [fs-vid2vid NeurIPS'19](https://nvlabs.github.io/few-shot-vid2vid/), [EG3D CVPR'22](https://github.com/NVlabs/eg3d);
- Talking-face synthesis: [face-vid2vid CVPR'21](https://nvlabs.github.io/face-vid2vid/), [Implicit NeurIPS'22](https://research.nvidia.com/labs/dir/implicit_warping/), [SPACE ICCV'23](https://research.nvidia.com/labs/dir/space/), [One-shot Neural Head Avatar arXiv'23](https://research.nvidia.com/labs/lpr/one-shot-avatar/);
- Talking-body synthesis: [DreamPose ICCV'23](https://grail.cs.washington.edu/projects/dreampose/);
- Face enhancement (relighting, restoration, etc): [Lumos SIGGRAPH Asia 2022](https://research.nvidia.com/labs/dir/lumos/), [RANA ICCV'23](https://nvlabs.github.io/RANA/);
- Authorized use of synthetic videos: [Avatar Fingerprinting arXiv'23](https://research.nvidia.com/labs/nxp/avatar-fingerprinting/);
2. [Aliaksandr Siarohin @ Snap Research](https://research.snap.com/team/team-member.html#aliaksandr-siarohin)
- Neural rendering models for human generation (focus on flow-based generative models): [Unsupervised-Volumetric-Animation CVPR'23](https://github.com/snap-research/unsupervised-volumetric-animation), [3DAvatarGAN CVPR'23](https://arxiv.org/abs/2301.02700), [3D-SGAN ECCV'22](https://arxiv.org/abs/2112.01422), [Articulated-Animation CVPR'21](https://arxiv.org/abs/2104.11280), [Monkey-Net CVPR'19](https://arxiv.org/abs/1812.08861), [FOMM NeurIPS'19](http://papers.nips.cc/paper/8935-first-order-motion-model-for-image-animation);
3. [Ziwei Liu @ Nanyang Technological University](https://liuziwei7.github.io/index.html)
- Talking-face synthesis: [StyleSync CVPR'23](https://hangz-nju-cuhk.github.io/projects/StyleSync), [AV-CAT SIGGRAPH Asia 2022](https://hangz-nju-cuhk.github.io/projects/AV-CAT), [StyleGANX ICCV'23](https://www.mmlab-ntu.com/project/styleganex/), [StyleSwap ECCV'22](https://hangz-nju-cuhk.github.io/projects/StyleSwap), [PC-AVS CVPR'21](https://hangz-nju-cuhk.github.io/projects/PC-AVS), [Speech2Talking-Face IJCAI'21](https://www.ijcai.org/proceedings/2021/0141.pdf), [VToonify SIGGRAPH Asia 2022](https://www.youtube.com/watch?v=0_OmVhDgYuY);
- Talking-body synthesis: [MotionDiffuse arXiv'22](https://mingyuan-zhang.github.io/projects/MotionDiffuse.html);
- Face enhancement (relighting, restoration, etc): [Relighting4D ECCV'22](https://www.youtube.com/watch?v=NayAw89qtsY);
4. [Xiaodong Cun @ Tencent AI Lab](https://vinthony.github.io/academic/):
- Talking-face synthesis: [StyleHEAT ECCV'22](https://arxiv.org/abs/2203.04036), [VideoReTalking SIGGRAPH Asia'22](https://arxiv.org/abs/2211.14758), [ToonTalker ICCV'23](https://arxiv.org/abs/2308.12866), [DPE CVPR'23](https://arxiv.org/abs/2301.06281), [CodeTalker CVPR'23](https://arxiv.org/abs/2301.06281), [SadTalker CVPR'23](https://arxiv.org/abs/2211.12194);
- Talking-body synthesis: [LivelySpeaker ICCV'23](https://arxiv.org/abs/2306.00926);

5. Max Planck Institute for Intelligent Systems:
- 3D face models (*e.g.,* 3DMM): [FLAME SIGGRAPH Asia 2017](https://flame.is.tue.mpg.de/);

## Papers

### Image and video generation
|Model|Paper|Blog|Codebase|Note|
|:---:|:---:|:---:|:---:|:---:|
|StyleGANv3|[Alias-Free Generative Adversarial Networks](https://nvlabs.github.io/stylegan3/), NVIDIA, NeurIPS 2021|[The Evolution of StyleGAN: Introduction](https://blog.paperspace.com/evolution-of-stylegan/)|[Code](https://github.com/NVlabs/stylegan3)|high-fidelity face generation|
|Stable Diffusion|[High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/pdf/2112.10752), Heidelberg University, CVPR 2022|[What are Diffusion Models?](https://lilianweng.github.io/posts/2021-07-11-diffusion-models/)|[Code](https://github.com/CompVis/latent-diffusion)|diverse and high quality images|
|Stable Video Diffusion|[Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets](https://arxiv.org/abs/2311.15127), Stability AI, arXiv 2023|[Diffusion Models for Video Generation](https://lilianweng.github.io/posts/2024-04-12-diffusion-video/)|[Code](https://github.com/Stability-AI/generative-models)||
|DiT|[Scalable Diffusion Models with Transformers](https://arxiv.org/abs/2212.09748), Meta, ICCV 2023|[Diffusion Transformed](https://www.deeplearning.ai/the-batch/a-new-class-of-diffusion-models-based-on-the-transformer-architecture/)|[Code](https://github.com/facebookresearch/DiT)|magic behind OpenAI Sora|
|VQ-VAE|[Neural Discrete Representation Learning](https://arxiv.org/pdf/1711.00937), DeepMind, NIPS 2017|[OpenAI's DALL-E 2 and DALL-E 1 Explained](https://vaclavkosar.com/ml/openai-dall-e-2-and-dall-e-1)||magic behind OpenAI DALL-E|
|NeRF|[NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis](https://arxiv.org/abs/2003.08934), UC Berkeley, ECCV 2020|[NeRF Explosion 2020](https://dellaert.github.io/NeRF/)|[Code](https://github.com/yenchenlin/nerf-pytorch)|3D synthesis via volume rendering|
|3DGS|[3D Gaussian Splatting for Real-Time Radiance Field Rendering](https://arxiv.org/abs/2308.04079), Inria, SIGGRAPH 2023|[A Comprehensive Overview of Gaussian Splatting](https://towardsdatascience.com/a-comprehensive-overview-of-gaussian-splatting-e7d570081362)|[Code](https://github.com/graphdeco-inria/gaussian-splatting)|real-time 3d rendering|
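Many of the avatar methods below build on these generative backbones, so it can help to know how they are typically driven in code. Below is a minimal, hedged sketch of sampling from a latent diffusion model with the Hugging Face `diffusers` library; the checkpoint id, prompt, and sampling settings are placeholders, not values prescribed by any paper in this list.

```python
# Minimal sketch: sampling an image from a latent diffusion model with diffusers.
# Assumes `pip install diffusers transformers accelerate torch`; the model id below
# is only an example checkpoint, not one recommended by this list.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # use "cpu" if no GPU is available

image = pipe(
    "a studio portrait of a person, photorealistic",  # placeholder prompt
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("sample.png")
```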

### 3D Avatar (face+body)
|Conference|Paper|Affiliation|Codebase|Notes|
|:---:|:---:|:---:|:---:|:---:|
|CVPR 2021|[Function4D: Real-time Human Volumetric Capture from Very Sparse Consumer RGBD Sensors](https://www.liuyebin.com/Function4D/Function4D.html)|Tsinghua University|[Dataset](https://github.com/ytrock/THuman2.0-Dataset)||
|ECCV 2022|[HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling](https://caizhongang.com/projects/HuMMan/)|Shanghai Artificial Intelligence Laboratory|[Dataset](https://caizhongang.com/projects/HuMMan/)||
|SIGGRAPH 2023|[AvatarReX: Real-time Expressive Full-body Avatars](https://liuyebin.com/AvatarRex/)|Tsinghua University|[Dataset](https://github.com/lizhe00/AnimatableGaussians/blob/master/AVATARREX_DATASET.md)||
|arXiv 2024|[A Survey on 3D Human Avatar Modeling - From Reconstruction to Generation](https://arxiv.org/pdf/2406.04253)|The University of Hong Kong|||
|arXiv 2024|[From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations](https://people.eecs.berkeley.edu/~evonne_ng/projects/audio2photoreal/static/CCA.pdf)|Meta Reality Labs Research|[Code](https://github.com/facebookresearch/audio2photoreal/) ![Github stars](https://img.shields.io/github/stars/facebookresearch/audio2photoreal.svg) ![Github forks](https://img.shields.io/github/forks/facebookresearch/audio2photoreal.svg)|conversational avatar|
|CVPR 2024|[Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling](https://github.com/lizhe00/AnimatableGaussians?tab=readme-ov-file)|Tsinghua University|[Code](https://github.com/lizhe00/AnimatableGaussians?tab=readme-ov-file) ![Github stars](https://img.shields.io/github/stars/lizhe00/AnimatableGaussians.svg) ![Github forks](https://img.shields.io/github/forks/lizhe00/AnimatableGaussians.svg)||
|CVPR 2024|[4K4D: Real-Time 4D View Synthesis at 4K Resolution](https://drive.google.com/file/d/1Y-C6ASIB8ofvcZkyZ_Vp-a2TtbiPw1Yx/view?usp=sharing)|Zhejiang University|[Code](https://github.com/zju3dv/4K4D) ![Github stars](https://img.shields.io/github/stars/zju3dv/4K4D.svg) ![Github forks](https://img.shields.io/github/forks/zju3dv/4K4D.svg)|real-time synthesis with 3DGS|

### 2D talking-face synthesis

|Conference|Paper|Affiliation|Codebase|Training Code|Notes|
|:---:|:---:|:---:|:---:|:---:|:---|
|MM 2020|[Wav2Lip: Accurately Lip-sync Videos to Any Speech](https://arxiv.org/abs/2008.10010)|International Institute of Information Technology (IIIT), Hyderabad|[Code](https://github.com/Rudrabha/Wav2Lip) ![Github stars](https://img.shields.io/github/stars/Rudrabha/Wav2Lip.svg) ![Github forks](https://img.shields.io/github/forks/Rudrabha/Wav2Lip.svg)|✅|most accurate lip-sync model, but low video quality (`96*96`)|
|MM 2021|[Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis](https://hcsi.cs.tsinghua.edu.cn/Paper/Paper21/MM21-WUHAOZHE.pdf)|Tsinghua University|[Code](https://github.com/wuhaozhe/style_avatar), ![Github stars](https://img.shields.io/github/stars/wuhaozhe/style_avatar.svg) ![Github forks](https://img.shields.io/github/forks/wuhaozhe/style_avatar.svg)|||
|CVPR 2021|[Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation](https://arxiv.org/abs/2104.11116)|The Chinese University of Hong Kong|[Code](https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS) ![Github stars](https://img.shields.io/github/stars/Hangz-nju-cuhk/Talking-Face_PC-AVS.svg) ![Github forks](https://img.shields.io/github/forks/Hangz-nju-cuhk/Talking-Face_PC-AVS.svg)||contrastive learning on audio-lip|
|ICCV 2021|[PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering](https://arxiv.org/abs/2109.08379)|Peking University|[Code](https://github.com/RenYurui/PIRender) ![Github stars](https://img.shields.io/github/stars/RenYurui/PIRender.svg) ![Github forks](https://img.shields.io/github/forks/RenYurui/PIRender.svg)|||
|ECCV 2022|[StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN](https://arxiv.org/pdf/2203.04036.pdf)|Tsinghua University|[Code](https://github.com/OpenTalker/StyleHEAT) ![Github stars](https://img.shields.io/github/stars/OpenTalker/StyleHEAT.svg) ![Github forks](https://img.shields.io/github/forks/OpenTalker/StyleHEAT.svg)||high-fidelity synthesis via StyleGAN|
|SIGGRAPH Asia 2022|[VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild](https://github.com/OpenTalker/video-retalking)|Xidian University|[Code](https://github.com/OpenTalker/video-retalking) ![Github stars](https://img.shields.io/github/stars/OpenTalker/video-retalking.svg) ![Github forks](https://img.shields.io/github/forks/OpenTalker/video-retalking.svg)|||
|AAAI 2023|[DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video](https://fuxivirtualhuman.github.io/pdf/AAAI2023_FaceDubbing.pdf)|Virtual Human Group, Netease Fuxi AI Lab|[Code](https://github.com/MRzzm/DINet)![Github stars](https://img.shields.io/github/stars/MRzzm/DINet.svg) ![Github forks](https://img.shields.io/github/forks/MRzzm/DINet.svg)|✅|accurate lip-sync and high-quality synthesis (`256*256`)|
|CVPR 2023|[SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation](https://arxiv.org/pdf/2211.12194.pdf)|Xi'an Jiaotong University|[Code](https://github.com/Winfredy/SadTalker) ![Github stars](https://img.shields.io/github/stars/OpenTalker/SadTalker.svg) ![Github forks](https://img.shields.io/github/forks/OpenTalker/SadTalker.svg), [Note](https://github.com/Jason-cs18/awesome-avatar/blob/main/notes/sadtalker.md)|||
|arXiv 2023|[DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models](https://arxiv.org/abs/2312.09767)|Tsinghua University|[Code](https://github.com/ali-vilab/dreamtalk), ![Github stars](https://img.shields.io/github/stars/ali-vilab/dreamtalk.svg) ![Github forks](https://img.shields.io/github/forks/ali-vilab/dreamtalk.svg)||diffusion|
|Github repo|[MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting](https://github.com/TMElyralab/MuseTalk)|Tencent TMElyralab|[Code](https://github.com/TMElyralab/MuseTalk) ![Github stars](https://img.shields.io/github/stars/TMElyralab/MuseTalk.svg) ![Github forks](https://img.shields.io/github/forks/TMElyralab/MuseTalk.svg)|||
|arXiv 2024|[LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control](https://arxiv.org/abs/2407.03168)|Kuaishou Technology|[Code](https://github.com/KwaiVGI/LivePortrait) ![Github stars](https://img.shields.io/github/stars/KwaiVGI/LivePortrait.svg) ![Github forks](https://img.shields.io/github/forks/KwaiVGI/LivePortrait.svg) ||face reenactment with micro-expression|
|arXiv 2024|[EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions](https://arxiv.org/abs/2407.08136)|Ant Group|[Code](https://github.com/BadToBest/EchoMimic) ![Github stars](https://img.shields.io/github/stars/BadToBest/EchoMimic.svg) ![Github forks](https://img.shields.io/github/forks/BadToBest/EchoMimic.svg)||accurate lip-sync on Chinese speakers, diffusion, `512*512`|
|arXiv 2024|[Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation](https://arxiv.org/abs/2406.08801)|Fudan University|[Code](https://github.com/fudan-generative-vision/hallo) ![Github stars](https://img.shields.io/github/stars/fudan-generative-vision/hallo.svg) ![Github forks](https://img.shields.io/github/forks/fudan-generative-vision/hallo.svg)|✅|accurate lip-sync, diffusion, `512*512`|
|arXiv 2024|[Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency](https://loopyavatar.github.io/)|Zhejiang University and ByteDance||||

### 3D talking-face synthesis
|Conference|Paper|Affiliation|Codebase|Notes|
|:---:|:---:|:---:|:---:|:---:|
|ICCV 2021|[AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis](https://arxiv.org/pdf/2103.11078)|University of Science and Technology of China|[Code](https://github.com/YudongGuo/AD-NeRF)![Github stars](https://img.shields.io/github/stars/YudongGuo/AD-NeRF.svg)![Github forks](https://img.shields.io/github/forks/YudongGuo/AD-NeRF.svg)||
|ECCV 2022|[Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis](https://github.com/sstzal/DFRF/blob/show_page/images/DFRF_eccv2022.pdf)|Tsinghua University|[Code](https://github.com/sstzal/DFRF)![Github stars](https://img.shields.io/github/stars/sstzal/DFRF.svg)![Github forks](https://img.shields.io/github/forks/sstzal/DFRF.svg)||
|ICLR 2023|[GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis](https://arxiv.org/pdf/2301.13430)|Zhejiang University|[Code](https://github.com/yerfor/GeneFace)![Github stars](https://img.shields.io/github/stars/yerfor/GeneFace.svg)![Github forks](https://img.shields.io/github/forks/yerfor/GeneFace.svg)||
|ICCV 2023|[Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis](https://openaccess.thecvf.com/content/ICCV2023/html/Li_Efficient_Region-Aware_Neural_Radiance_Fields_for_High-Fidelity_Talking_Portrait_Synthesis_ICCV_2023_paper.html)|Beihang University|[Code](https://github.com/Fictionarry/ER-NeRF)![Github stars](https://img.shields.io/github/stars/Fictionarry/ER-NeRF.svg)![Github forks](https://img.shields.io/github/forks/Fictionarry/ER-NeRF.svg)||
|arXiv 2023|[GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation](https://arxiv.org/pdf/2305.00787)|Zhejiang University|[Code](https://github.com/yerfor/GeneFacePlusPlus)![Github stars](https://img.shields.io/github/stars/yerfor/GeneFacePlusPlus.svg)![Github forks](https://img.shields.io/github/forks/yerfor/GeneFacePlusPlus.svg)||
|CVPR 2024|[SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis](https://arxiv.org/pdf/2311.17590)|Renmin University of China|[Code](https://github.com/ziqiaopeng/SyncTalk)![Github stars](https://img.shields.io/github/stars/ziqiaopeng/SyncTalk.svg)![Github forks](https://img.shields.io/github/forks/ziqiaopeng/SyncTalk.svg)||
|ECCV 2024|[TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting](https://github.com/Fictionarry/TalkingGaussian)|Beihang University|[Code](https://github.com/Fictionarry/TalkingGaussian)![Github stars](https://img.shields.io/github/stars/Fictionarry/TalkingGaussian.svg)![Github forks](https://img.shields.io/github/forks/Fictionarry/TalkingGaussian.svg)||

### Talking-body synthesis

#### Pose2video

|Conference|Paper|Affiliation|Codebase|Notes|
|:---:|:---:|:---:|:---:|:---:|
|NeurIPS 2018|[Video-to-Video Synthesis](https://github.com/NVIDIA/vid2vid)|NVIDIA|[Code](https://github.com/NVIDIA/vid2vid) ![Github stars](https://img.shields.io/github/stars/NVIDIA/vid2vid.svg) ![Github forks](https://img.shields.io/github/forks/NVIDIA/vid2vid.svg)||
|ICCV 2019|[Everybody Dance Now](https://github.com/carolineec/EverybodyDanceNow)|UC Berkeley|[Code](https://github.com/carolineec/EverybodyDanceNow)![Github stars](https://img.shields.io/github/stars/carolineec/EverybodyDanceNow.svg)![Github forks](https://img.shields.io/github/forks/carolineec/EverybodyDanceNow.svg)||
|arXiv 2023|[Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation](https://arxiv.org/pdf/2311.17117.pdf)|Alibaba Group|[Code](https://github.com/HumanAIGC/AnimateAnyone)![Github stars](https://img.shields.io/github/stars/HumanAIGC/AnimateAnyone.svg)![Github forks](https://img.shields.io/github/forks/HumanAIGC/AnimateAnyone.svg)||
|CVPR 2024|[MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model](https://github.com/magic-research/magic-animate/blob/main/assets/preprint/MagicAnimate.pdf)|National University of Singapore|[Code](https://github.com/magic-research/magic-animate)![Github stars](https://img.shields.io/github/stars/magic-research/magic-animate.svg)![Github forks](https://img.shields.io/github/forks/magic-research/magic-animate.svg)||
|arXiv 2024|[Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance](https://arxiv.org/pdf/2403.14781)|Nanjing University|[Code](https://github.com/fudan-generative-vision/champ)![Github stars](https://img.shields.io/github/stars/fudan-generative-vision/champ.svg)![Github forks](https://img.shields.io/github/forks/fudan-generative-vision/champ.svg)||
|Github repo|[MuseV: Infinite-length and High Fidelity Virtual Human Video Generation with Visual Conditioned Parallel Denoising](https://github.com/TMElyralab/MuseV)|Tencent TMElyralab|[Code](https://github.com/TMElyralab/MuseV)![Github stars](https://img.shields.io/github/stars/TMElyralab/MuseV.svg)![Github forks](https://img.shields.io/github/forks/TMElyralab/MuseV.svg)||
|Github repo|[MusePose: a Pose-Driven Image-to-Video Framework for Virtual Human Generation](https://github.com/TMElyralab/MusePose)|Tencent|[Code](https://github.com/TMElyralab/MusePose)![Github stars](https://img.shields.io/github/stars/TMElyralab/MusePose.svg)![Github forks](https://img.shields.io/github/forks/TMElyralab/MusePose.svg) ⭐||
|arXiv 2024|[ControlNeXt: Powerful and Efficient Control for Image and Video Generation](https://pbihao.github.io/projects/controlnext/index.html)|The Chinese University of Hong Kong|[Code](https://github.com/dvlab-research/ControlNeXt)![Github stars](https://img.shields.io/github/stars/dvlab-research/ControlNeXt.svg)![Github forks](https://img.shields.io/github/forks/dvlab-research/ControlNeXt.svg)|stable video diffusion|
|arXiv 2024|[CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention](https://cyberhost.github.io/)|Zhejiang University and ByteDance|||

## Datasets

### Talking-face


**Audio-Visual Datasets for English Speakers**

|Dataset name|Environment|Year|Resolution|Subjects|Duration|Sentences|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|VoxCeleb1|Wild|2017|360p~720p|1251|352 hours|100k|
|VoxCeleb2|Wild|2018|360p~720p|6112|2442 hours|1128k|
|HDTF|Wild|2020|720p~1080p|300+|15.8 hours||
|LSP|Wild|2021|720p~1080p|4|18 minutes|100k|

**Audio-Visual Datasets for Chinese Speakers**

|Dataset name|Environment|Year|Resolution|Subjects|Duration|Sentences|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|CMLR|Lab|2019||11||102k|
|MAVD|Lab|2023|1920x1080|64|24 hours|12k|
|CN-Celeb|Wild|2020||3000|1200 hours||
|CN-Celeb-AV|Wild|2023||1136|660 hours||
|CN-CVS|Wild|2023||2500+|300+ hours||


## Metrics

### Talking-face


**Lip-Sync**

|Metric name|Description|Code/Paper|
|:---:|:---|:---:|
|LMD↓|Mouth landmark distance||
|MA↑|Intersection-over-Union (IoU) between the predicted mouth area and the ground-truth mouth area||
|Sync↑|Confidence score from SyncNet|[Wav2Lip](https://github.com/Rudrabha/Wav2Lip)|
|LSE-C↑|Lip Sync Error - Confidence|[Wav2Lip](https://github.com/Rudrabha/Wav2Lip)|
|LSE-D↓|Lip Sync Error - Distance|[Wav2Lip](https://github.com/Rudrabha/Wav2Lip)|

**Image Quality (identity preserving)**

|Metric name|Description|Code/Paper|
|:---:|:---|:---:|
|MAE↓|Mean Absolute Error between images|[mmagic](https://github.com/open-mmlab/mmagic)|
|MSE↓|Mean Squared Error between images|[mmagic](https://github.com/open-mmlab/mmagic)|
|PSNR↑|Peak Signal-to-Noise Ratio|[mmagic](https://github.com/open-mmlab/mmagic)|
|SSIM↑|Structural Similarity Index|[mmagic](https://github.com/open-mmlab/mmagic)|
|FID↓|Fréchet Inception Distance|[mmagic](https://github.com/open-mmlab/mmagic)|
|IS↑|Inception Score|[mmagic](https://github.com/open-mmlab/mmagic)|
|NIQE↓|Natural Image Quality Evaluator|[mmagic](https://github.com/open-mmlab/mmagic)|
|CSIM↑|Cosine similarity of identity embeddings|InsightFace|
|CPBD↑|Cumulative Probability Blur Detection|python-cpbd|

**Diversity**

|Metric name|Description|Code/Paper|
|:---:|:---|:---:|
|Diversity of head motions↑|Standard deviation of head-motion feature embeddings extracted from the generated frames with Hopenet (Ruiz et al., 2018)|SadTalker|
|Beat Align Score↑|Alignment between the audio and the generated head motions, computed as in Bailando (Siyao et al., 2022)|SadTalker|
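
Some of the metrics above are simple enough to compute directly from their definitions. The sketch below is an assumed reference implementation (not taken from any listed codebase) of PSNR and a mouth-landmark-distance-style LMD score using NumPy; in practice, most papers rely on toolkits such as mmagic for the full metric suite.

```python
# Minimal sketches of two metrics from the tables above
# (assumed reference implementations, not from any listed repo).
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio between two images of the same shape."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def mouth_landmark_distance(pred_lm: np.ndarray, gt_lm: np.ndarray) -> float:
    """LMD: mean Euclidean distance between predicted and ground-truth
    mouth landmarks, given arrays of shape (num_frames, num_landmarks, 2)."""
    return float(np.mean(np.linalg.norm(pred_lm - gt_lm, axis=-1)))

# Example with random data, just to show the call signatures.
rng = np.random.default_rng(0)
img_a = rng.integers(0, 256, size=(256, 256, 3))
img_b = rng.integers(0, 256, size=(256, 256, 3))
print("PSNR:", psnr(img_a, img_b))

lm_pred = rng.random((100, 20, 2)) * 256
lm_gt = lm_pred + rng.normal(0.0, 1.0, size=lm_pred.shape)
print("LMD:", mouth_landmark_distance(lm_pred, lm_gt))
```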

## Toolbox
1. A general toolbox for AIGC, including common metrics and models https://github.com/open-mmlab/mmagic
2. face3d: Python tools for processing 3D face https://github.com/yfeng95/face3d
3. 3DMM model fitting using Pytorch https://github.com/ascust/3DMM-Fitting-Pytorch
4. OpenFace: a facial behavior analysis toolkit https://github.com/TadasBaltrusaitis/OpenFace
5. autocrop: Automatically detects and crops faces from batches of pictures https://github.com/leblancfg/autocrop
6. OpenPose: Real-time multi-person keypoint detection library for body, face, hands, and foot estimation https://github.com/CMU-Perceptual-Computing-Lab/openpose
7. GFPGAN: Practical Algorithm for Real-world Face Restoration https://github.com/TencentARC/GFPGAN
8. CodeFormer: Robust Blind Face Restoration https://github.com/sczhou/CodeFormer
9. metahuman-stream: Real-time interactive streaming digital human https://github.com/lipku/metahuman-stream
10. EasyVolcap: a PyTorch library for accelerating neural volumetric video research https://github.com/zju3dv/EasyVolcap
11. 3D Model in gradio https://www.gradio.app/guides/how-to-use-3D-model-component
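
For item 11, here is a minimal, hedged sketch of gradio's `Model3D` component for viewing an avatar mesh in the browser; the echo function is only an illustrative placeholder, and a real demo would load or generate a mesh instead.

```python
# Minimal sketch of gradio's Model3D component (see item 11 above).
# The echo function is a placeholder; a real demo would produce an avatar mesh.
import gradio as gr

def show_model(model_path):
    # gradio passes the uploaded file path through; we simply display it back.
    return model_path

demo = gr.Interface(
    fn=show_model,
    inputs=gr.Model3D(label="Upload a .obj/.glb/.gltf avatar"),
    outputs=gr.Model3D(label="Rendered model"),
    title="3D avatar viewer",
)

if __name__ == "__main__":
    demo.launch()
```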

### Automatic Speech Recognition (ASR)
1. BELLE-2/Belle-whisper-large-v3-zh https://huggingface.co/BELLE-2/Belle-whisper-large-v3-zh
2. SenseVoice (multilingual) https://github.com/FunAudioLLM/SenseVoice 👍👍
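
Both ASR entries can be driven with standard inference code. The sketch below uses the Hugging Face `transformers` ASR pipeline with the Belle-whisper checkpoint listed above; the audio path is a placeholder, and generation settings (language, chunking) may need tuning for your own clips.

```python
# Minimal sketch: transcribing driving audio with the Belle-whisper checkpoint above.
# Assumes `pip install transformers torch`; "speech.wav" is a placeholder path.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="BELLE-2/Belle-whisper-large-v3-zh",
)

result = asr("speech.wav")  # any mono speech clip works
print(result["text"])
```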

### Text to Speech (TTS)
1. CosyVoice, Alibaba Tongyi SpeechTeam https://github.com/FunAudioLLM/CosyVoice 👍👍
2. FireRedTTS, FireRedTeam https://github.com/FireRedTeam/FireRedTTS
3. GPT-SoVITS https://github.com/RVC-Boss/GPT-SoVITS?tab=readme-ov-file

### Speech to Speech (GPT-4o-style)
1. Mini-Omni, Tsinghua University https://github.com/gpt-omni/mini-omni
2. Speech To Speech, HuggingFace https://github.com/huggingface/speech-to-speech

## Related Links
If you are interested in avatars and digital humans, we also recommend checking out these related collections:
- awesome digital human https://github.com/weihaox/awesome-digital-human