https://github.com/harlanhong/actalker
ACTalker: an end-to-end video diffusion framework for talking head synthesis that supports both single and multi-signal control (e.g., audio, expression).
- Host: GitHub
- URL: https://github.com/harlanhong/actalker
- Owner: harlanhong
- Created: 2025-03-20T02:08:54.000Z (7 months ago)
- Default Branch: master
- Last Pushed: 2025-04-19T10:51:40.000Z (6 months ago)
- Last Synced: 2025-06-02T13:58:19.475Z (5 months ago)
- Topics: avatar, diffusion-models, digitalhuman, face-animation, multi-modal, stablevideodiffusion, talking-head
- Homepage: https://harlanhong.github.io/publications/actalker/index.html
- Size: 41.8 MB
- Stars: 277
- Watchers: 43
- Forks: 17
- Open Issues: 3
Metadata Files:
- Readme: README.md
README
## :book: Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation
> [[Paper](https://arxiv.org/abs/2504.02542)] [[Project Page](https://harlanhong.github.io/publications/actalker/index.html)] [[HuggingFace](https://huggingface.co/papers/2504.02542)]
> [Fa-Ting Hong](https://harlanhong.github.io)<sup>1,2</sup>, Zunnan Xu<sup>2,3</sup>, Zixiang Zhou<sup>2</sup>, Jun Zhou<sup>2</sup>, Xiu Li<sup>3</sup>, Qin Lin<sup>2</sup>, Qinglin Lu<sup>2</sup>, [Dan Xu](https://www.danxurgb.net)<sup>1</sup>
> <sup>1</sup>The Hong Kong University of Science and Technology
> <sup>2</sup>Tencent
> <sup>3</sup>Tsinghua University
:triangular_flag_on_post: **Updates**
☑ The arXiv paper is released [here](https://arxiv.org/abs/2504.02542)!
## Framework

## TL;DR:
We propose ACTalker, an end-to-end video diffusion framework for talking head synthesis that supports both single- and multi-signal control (e.g., audio, pose, expression). ACTalker uses a parallel Mamba-based architecture with a gating mechanism that assigns each control signal to specific facial regions, ensuring fine-grained and conflict-free generation. A mask-drop strategy further enhances regional independence and control stability. Experiments show that ACTalker produces natural, well-synchronized talking head videos under various control combinations.
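To make the control flow concrete, below is a minimal, hypothetical PyTorch sketch of the idea described above: two parallel control branches (audio and expression), a learned per-token gate that decides how much each branch contributes to each spatial region, and a training-time mask-drop step that randomly restricts each signal's contribution to its assigned facial region. All class, parameter, and mask names are illustrative assumptions, and a plain GRU stands in for the masked selective state-space (Mamba) block; this is not the released ACTalker implementation.

```python
# Hypothetical sketch of parallel, gated control branches with mask-drop.
# Shapes and names are assumptions for illustration, not the ACTalker code.
import torch
import torch.nn as nn


class ControlBranch(nn.Module):
    """One control branch (a cheap stand-in for a selective-SSM / Mamba block)."""

    def __init__(self, dim: int, ctrl_dim: int):
        super().__init__()
        self.inject = nn.Linear(ctrl_dim, dim)            # project the control signal into the latent dim
        self.mixer = nn.GRU(dim, dim, batch_first=True)   # sequence mixer standing in for Mamba

    def forward(self, x, ctrl):
        # x: (B, N, D) latent video tokens; ctrl: (B, ctrl_dim) control embedding
        h = x + self.inject(ctrl).unsqueeze(1)            # condition every token on the control signal
        out, _ = self.mixer(h)
        return out


class GatedParallelControl(nn.Module):
    """Runs audio/expression branches in parallel and gates them per token (region)."""

    def __init__(self, dim: int, audio_dim: int, expr_dim: int):
        super().__init__()
        self.audio_branch = ControlBranch(dim, audio_dim)
        self.expr_branch = ControlBranch(dim, expr_dim)
        self.gate = nn.Linear(dim, 2)                     # per-token weights over the two branches

    def forward(self, x, audio, expr, audio_mask=None, expr_mask=None, drop_p=0.3):
        # audio_mask / expr_mask: (B, N, 1) soft region masks (e.g. mouth vs. upper face)
        a = self.audio_branch(x, audio)
        e = self.expr_branch(x, expr)

        # Mask-drop (training only): randomly restrict each branch's output to its
        # assigned facial region, encouraging regional independence of the signals.
        if self.training:
            if audio_mask is not None and torch.rand(()) < drop_p:
                a = a * audio_mask
            if expr_mask is not None and torch.rand(()) < drop_p:
                e = e * expr_mask

        w = torch.softmax(self.gate(x), dim=-1)           # (B, N, 2) region-wise gate
        return x + w[..., 0:1] * a + w[..., 1:2] * e      # residual fusion of the gated branches


if __name__ == "__main__":
    B, N, D = 2, 64, 128
    block = GatedParallelControl(dim=D, audio_dim=32, expr_dim=16).train()
    x = torch.randn(B, N, D)
    audio, expr = torch.randn(B, 32), torch.randn(B, 16)
    mouth = torch.zeros(B, N, 1); mouth[:, : N // 2] = 1.0   # toy "mouth region" mask
    upper = 1.0 - mouth
    y = block(x, audio, expr, audio_mask=mouth, expr_mask=upper)
    print(y.shape)  # torch.Size([2, 64, 128])
```

In the paper, the sequence mixer is a masked selective state-space (Mamba) block rather than the GRU stand-in used here; the per-region gate and the mask-drop step play the roles described in the TL;DR.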
## Expression Driven Samples
https://github.com/user-attachments/assets/fc46c7cd-d1b4-44a6-8649-2ef973107637
## Audio Driven Samples
https://github.com/user-attachments/assets/8f9e18a0-6fff-4a31-bbf4-c21702d4da38
## Audio-Visual Driven Samples
https://github.com/user-attachments/assets/3d8af4ef-edc7-4971-87b6-7a9c77ee0cb2
https://github.com/user-attachments/assets/2d12defd-de3d-4a33-8178-b5af30d7f0c2
### :e-mail: Contact
If you have any questions or collaboration needs (for research or commercial purposes), please email `fhongac@connect.ust.hk`.
# 📍Citation
Please feel free to leave a star⭐️⭐️⭐️ and cite our paper:
```bibtex
@article{hong2025audio,
  title={Audio-visual controlled video diffusion with masked selective state spaces modeling for natural talking head generation},
  author={Hong, Fa-Ting and Xu, Zunnan and Zhou, Zixiang and Zhou, Jun and Li, Xiu and Lin, Qin and Lu, Qinglin and Xu, Dan},
  journal={arXiv preprint arXiv:2504.02542},
  year={2025}
}
```