
## :book: Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation

> [[Paper](https://arxiv.org/abs/2504.02542)]   [[Project Page](https://harlanhong.github.io/publications/actalker/index.html)]   [[HuggingFace](https://huggingface.co/papers/2504.02542)]

> [Fa-Ting Hong](https://harlanhong.github.io)<sup>1,2</sup>, Zunnan Xu<sup>2,3</sup>, Zixiang Zhou<sup>2</sup>, Jun Zhou<sup>2</sup>, Xiu Li<sup>3</sup>, Qin Lin<sup>2</sup>, Qinglin Lu<sup>2</sup>, [Dan Xu](https://www.danxurgb.net)<sup>1</sup>

> <sup>1</sup>The Hong Kong University of Science and Technology

> <sup>2</sup>Tencent

> <sup>3</sup>Tsinghua University


:triangular_flag_on_post: **Updates**

☑ The arXiv paper is released [here](https://arxiv.org/abs/2504.02542)!

## Framework

## TL;DR
We propose ACTalker, an end-to-end video diffusion framework for talking head synthesis that supports both single and multi-signal control (e.g., audio, pose, expression). ACTalker uses a parallel Mamba-based architecture with a gating mechanism to assign different control signals to specific facial regions, ensuring fine-grained and conflict-free generation. A mask-drop strategy further enhances regional independence and control stability. Experiments show that ACTalker produces natural, synchronized talking head videos under various control combinations.
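To make the gating and mask-drop ideas concrete, here is a minimal PyTorch sketch. All class names, shapes, and hyperparameters are illustrative assumptions, not the paper's code; in particular, each branch in the paper is a selective state-space (Mamba) block, which a plain linear projection stands in for here to keep the example self-contained.

```python
import torch
import torch.nn as nn


class GatedControlBranch(nn.Module):
    """One parallel branch that injects a control signal (e.g., audio or
    expression features) into the video latent, restricted to one facial
    region by a mask and scaled by a learned gate. Hypothetical sketch:
    the paper's branch is a Mamba block, not a linear projection."""

    def __init__(self, latent_dim: int, control_dim: int):
        super().__init__()
        self.proj = nn.Linear(control_dim, latent_dim)  # control -> latent space
        self.gate = nn.Parameter(torch.zeros(1))        # learned scalar gate

    def forward(self, latent, control, region_mask, mask_drop_p=0.0):
        # latent:      (B, N, latent_dim)  flattened spatio-temporal tokens
        # control:     (B, N, control_dim) per-token control features
        # region_mask: (B, N, 1) soft mask selecting the facial region this
        #              signal is allowed to influence.
        feat = self.proj(control) * region_mask
        if self.training and mask_drop_p > 0:
            # Mask-drop: randomly zero this branch's contribution for some
            # samples so each signal stays independently controllable.
            keep = (torch.rand(latent.size(0), 1, 1, device=latent.device)
                    > mask_drop_p).float()
            feat = feat * keep
        return latent + torch.sigmoid(self.gate) * feat


# Parallel branches, one per control signal, each writing to its own region
# (dimensions and masks below are made up for illustration).
audio_branch = GatedControlBranch(latent_dim=64, control_dim=32)
expr_branch = GatedControlBranch(latent_dim=64, control_dim=16)

latent = torch.randn(2, 100, 64)    # B=2 samples, N=100 tokens
audio = torch.randn(2, 100, 32)
expr = torch.randn(2, 100, 16)
mouth_mask = torch.rand(2, 100, 1)  # soft region masks (illustrative)
face_mask = torch.rand(2, 100, 1)

latent = audio_branch(latent, audio, mouth_mask, mask_drop_p=0.1)
latent = expr_branch(latent, expr, face_mask, mask_drop_p=0.1)
```

Because each branch only adds a masked, gated residual, signals assigned to disjoint regions cannot overwrite one another, which is the intuition behind the conflict-free multi-signal control described above.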

## Expression Driven Samples
https://github.com/user-attachments/assets/fc46c7cd-d1b4-44a6-8649-2ef973107637

## Audio Driven Samples
https://github.com/user-attachments/assets/8f9e18a0-6fff-4a31-bbf4-c21702d4da38

## Audio-Visual Driven Samples
https://github.com/user-attachments/assets/3d8af4ef-edc7-4971-87b6-7a9c77ee0cb2

https://github.com/user-attachments/assets/2d12defd-de3d-4a33-8178-b5af30d7f0c2

### :e-mail: Contact

If you have any questions or collaboration needs (for research or commercial purposes), please email `fhongac@connect.ust.hk`.

## 📍 Citation
Please feel free to leave a star ⭐️⭐️⭐️ and cite our paper:
```bibtex
@article{hong2025audio,
  title={Audio-visual controlled video diffusion with masked selective state spaces modeling for natural talking head generation},
  author={Hong, Fa-Ting and Xu, Zunnan and Zhou, Zixiang and Zhou, Jun and Li, Xiu and Lin, Qin and Lu, Qinglin and Xu, Dan},
  journal={arXiv preprint arXiv:2504.02542},
  year={2025}
}
```