https://github.com/harlanhong/actalker
ACTalker: an end-to-end video diffusion framework for talking head synthesis that supports both single and multi-signal control (e.g., audio, expression).
- Host: GitHub
- URL: https://github.com/harlanhong/actalker
- Owner: harlanhong
- Created: 2025-03-20T02:08:54.000Z (7 months ago)
- Default Branch: master
- Last Pushed: 2025-04-19T10:51:40.000Z (6 months ago)
- Last Synced: 2025-06-02T13:58:19.475Z (5 months ago)
- Topics: avatar, diffusion-models, digitalhuman, face-animation, multi-modal, stablevideodiffusion, talking-head
- Homepage: https://harlanhong.github.io/publications/actalker/index.html
- Size: 41.8 MB
- Stars: 277
- Watchers: 43
- Forks: 17
- Open Issues: 3
Metadata Files:
- Readme: README.md
README
## :book: Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation
> [[Paper](https://arxiv.org/abs/2504.02542)] [[Project Page](https://harlanhong.github.io/publications/actalker/index.html)] [[HuggingFace](https://huggingface.co/papers/2504.02542)]
> [Fa-Ting Hong](https://harlanhong.github.io)<sup>1,2</sup>, Zunnan Xu<sup>2,3</sup>, Zixiang Zhou<sup>2</sup>, Jun Zhou<sup>2</sup>, Xiu Li<sup>3</sup>, Qin Lin<sup>2</sup>, Qinglin Lu<sup>2</sup>, [Dan Xu](https://www.danxurgb.net)<sup>1</sup>
> <sup>1</sup>The Hong Kong University of Science and Technology
> <sup>2</sup>Tencent
> <sup>3</sup>Tsinghua University
:triangular_flag_on_post: **Updates**
☑ The arXiv paper is released [here](https://arxiv.org/abs/2504.02542)!
## Framework

## TL;DR:
We propose ACTalker, an end-to-end video diffusion framework for talking head synthesis that supports both single- and multi-signal control (e.g., audio, pose, expression). ACTalker uses a parallel Mamba-based architecture with a gating mechanism that assigns each control signal to specific facial regions, ensuring fine-grained and conflict-free generation. A mask-drop strategy further enhances regional independence and control stability. Experiments show that ACTalker produces natural, well-synchronized talking head videos under various control combinations.
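To make the control flow concrete, below is a minimal, hypothetical PyTorch sketch of the idea described above: two parallel control branches (audio and expression), a learned per-token gate that decides how much each branch contributes to each spatial region, and a training-time mask-drop step that randomly restricts each signal's contribution to its assigned facial region. All class, parameter, and mask names are illustrative assumptions, and a plain GRU stands in for the masked selective state-space (Mamba) block; this is not the released ACTalker implementation.

```python
# Hypothetical sketch of parallel, gated control branches with mask-drop.
# Shapes and names are assumptions for illustration, not the ACTalker code.
import torch
import torch.nn as nn


class ControlBranch(nn.Module):
    """One control branch (a cheap stand-in for a selective-SSM / Mamba block)."""

    def __init__(self, dim: int, ctrl_dim: int):
        super().__init__()
        self.inject = nn.Linear(ctrl_dim, dim)            # project the control signal into the latent dim
        self.mixer = nn.GRU(dim, dim, batch_first=True)   # sequence mixer standing in for Mamba

    def forward(self, x, ctrl):
        # x: (B, N, D) latent video tokens; ctrl: (B, ctrl_dim) control embedding
        h = x + self.inject(ctrl).unsqueeze(1)            # condition every token on the control signal
        out, _ = self.mixer(h)
        return out


class GatedParallelControl(nn.Module):
    """Runs audio/expression branches in parallel and gates them per token (region)."""

    def __init__(self, dim: int, audio_dim: int, expr_dim: int):
        super().__init__()
        self.audio_branch = ControlBranch(dim, audio_dim)
        self.expr_branch = ControlBranch(dim, expr_dim)
        self.gate = nn.Linear(dim, 2)                     # per-token weights over the two branches

    def forward(self, x, audio, expr, audio_mask=None, expr_mask=None, drop_p=0.3):
        # audio_mask / expr_mask: (B, N, 1) soft region masks (e.g. mouth vs. upper face)
        a = self.audio_branch(x, audio)
        e = self.expr_branch(x, expr)

        # Mask-drop (training only): randomly restrict each branch's output to its
        # assigned facial region, encouraging regional independence of the signals.
        if self.training:
            if audio_mask is not None and torch.rand(()) < drop_p:
                a = a * audio_mask
            if expr_mask is not None and torch.rand(()) < drop_p:
                e = e * expr_mask

        w = torch.softmax(self.gate(x), dim=-1)           # (B, N, 2) region-wise gate
        return x + w[..., 0:1] * a + w[..., 1:2] * e      # residual fusion of the gated branches


if __name__ == "__main__":
    B, N, D = 2, 64, 128
    block = GatedParallelControl(dim=D, audio_dim=32, expr_dim=16).train()
    x = torch.randn(B, N, D)
    audio, expr = torch.randn(B, 32), torch.randn(B, 16)
    mouth = torch.zeros(B, N, 1); mouth[:, : N // 2] = 1.0   # toy "mouth region" mask
    upper = 1.0 - mouth
    y = block(x, audio, expr, audio_mask=mouth, expr_mask=upper)
    print(y.shape)  # torch.Size([2, 64, 128])
```

In the paper, the sequence mixer is a masked selective state-space (Mamba) block rather than the GRU stand-in used here; the per-region gate and the mask-drop step play the roles described in the TL;DR.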
## Expression Driven Samples
https://github.com/user-attachments/assets/fc46c7cd-d1b4-44a6-8649-2ef973107637
## Audio Driven Samples
https://github.com/user-attachments/assets/8f9e18a0-6fff-4a31-bbf4-c21702d4da38
## Audio-Visual Driven Samples
https://github.com/user-attachments/assets/3d8af4ef-edc7-4971-87b6-7a9c77ee0cb2
https://github.com/user-attachments/assets/2d12defd-de3d-4a33-8178-b5af30d7f0c2
### :e-mail: Contact
If you have any questions or collaboration needs (for research or commercial purposes), please email `fhongac@connect.ust.hk`.
# 📍Citation
Please feel free to leave a star⭐️⭐️⭐️ and cite our paper:
```bibtex
@article{hong2025audio,
  title={Audio-visual controlled video diffusion with masked selective state spaces modeling for natural talking head generation},
  author={Hong, Fa-Ting and Xu, Zunnan and Zhou, Zixiang and Zhou, Jun and Li, Xiu and Lin, Qin and Lu, Qinglin and Xu, Dan},
  journal={arXiv preprint arXiv:2504.02542},
  year={2025}
}
```