SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers
- Host: GitHub
- URL: https://github.com/skyworkai/skyreels-a1
- Owner: SkyworkAI
- License: other
- Created: 2025-02-13T02:37:51.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-04-23T15:09:10.000Z (about 2 months ago)
- Last Synced: 2025-04-23T16:25:08.900Z (about 2 months ago)
- Topics: condition-render, portrait-animation, video-diffusion-transformers
- Language: Python
- Homepage: https://www.skyreels.ai
- Size: 56.1 MB
- Stars: 488
- Watchers: 10
- Forks: 55
- Open Issues: 17
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers
Skywork AI, Kunlun Inc.
🔥 For more results, visit our homepage 🔥
👋 Join our Discord

This repo, named **SkyReels-A1**, contains the official PyTorch implementation of our paper [SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers](https://arxiv.org/abs/2502.10841).
## 🔥🔥🔥 News!!
* Apr 3, 2025: 🔥 We release [SkyReels-A2](https://github.com/SkyworkAI/SkyReels-A2). This is an open-sourced controllable video generation framework capable of assembling arbitrary visual elements.
* Mar 4, 2025: 🔥 We release the audio-driven portrait image animation pipeline. Try it out on the [Huggingface Spaces Demo](https://huggingface.co/spaces/Skywork/skyreels-a1-talking-head)!
* Feb 18, 2025: 👋 We release the inference code and model weights of SkyReels-A1. [Download](https://huggingface.co/Skywork/SkyReels-A1)
* Feb 18, 2025: 🎉 We have made our technical report available as open source. [Read](https://skyworkai.github.io/skyreels-a1.github.io/report.pdf)
* Feb 18, 2025: 🔥 Our online LipSync demo is now available on SkyReels! Try it out at [LipSync](https://www.skyreels.ai/home/tools/lip-sync?refer=navbar).
* Feb 18, 2025: 🔥 We have open-sourced the I2V video generation model [SkyReels-V1](https://github.com/SkyworkAI/SkyReels-V1), the first and most advanced open-source human-centric video foundation model.

## 📑 TODO List
- [x] Checkpoints
- [x] Inference Code
- [x] Web Demo (Gradio)
- [x] Audio-driven Portrait Image Animation Pipeline
- [x] Inference Code for Long Videos
- [ ] User-Level GPU Inference on RTX4090
- [ ] ComfyUI

## Getting Started 🏁
### 1. Clone the code and prepare the environment 🛠️
First, clone the repository and create the conda environment:
```bash
git clone https://github.com/SkyworkAI/SkyReels-A1.git
cd SkyReels-A1

# create env using conda
conda create -n skyreels-a1 python=3.10
conda activate skyreels-a1
```
Then, install the remaining dependencies:
```bash
pip install -r requirements.txt
```
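After installing the dependencies, it can be worth sanity-checking that PyTorch sees your GPU before downloading the weights. This is a generic check, not part of the SkyReels-A1 scripts:

```python
# generic environment sanity check (not part of the SkyReels-A1 scripts)
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```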
### 2. Download pretrained weights 📥

You can download the pretrained weights from HuggingFace:
```bash
# !pip install -U "huggingface_hub[cli]"
huggingface-cli download Skywork/SkyReels-A1 --local-dir local_path --exclude "*.git*" "README.md" "docs"
```
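If you prefer to do this from Python, the same download can be performed with `huggingface_hub.snapshot_download`; a minimal sketch, where the `local_dir` is an assumption and should match the layout shown below:

```python
# minimal sketch: download the SkyReels-A1 weights programmatically
# (local_dir is an assumption; point it wherever your setup expects the weights)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Skywork/SkyReels-A1",
    local_dir="pretrained_models",
    ignore_patterns=["*.git*", "README.md", "docs/*"],
)
```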
The FLAME, mediapipe, and smirk models are located in the `SkyReels-A1/extra_models` folder. The expected directory structure of the pretrained models is as follows:
```text
pretrained_models
├── FLAME
├── SkyReels-A1-5B
│   ├── pose_guider
│   ├── scheduler
│   ├── tokenizer
│   ├── siglip-so400m-patch14-384
│   ├── transformer
│   ├── vae
│   └── text_encoder
├── mediapipe
└── smirk
```
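A quick way to confirm that everything landed in the right place is to check for the expected paths; a small helper sketch based on the layout above (not part of the repo):

```python
# small helper sketch: verify the pretrained_models layout shown above
# (not part of the SkyReels-A1 repo)
from pathlib import Path

root = Path("pretrained_models")
expected = [
    "FLAME",
    "SkyReels-A1-5B/transformer",
    "SkyReels-A1-5B/vae",
    "SkyReels-A1-5B/text_encoder",
    "mediapipe",
    "smirk",
]
for rel in expected:
    status = "ok" if (root / rel).exists() else "MISSING"
    print(f"{status:8s} {root / rel}")
```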
#### Download DiffposeTalk assets and pretrained weights (For Audio-driven)
- We use [diffposetalk](https://github.com/DiffPoseTalk/DiffPoseTalk/tree/main) to generate FLAME coefficients from audio, thereby constructing the motion signals.
- Download the diffposetalk code and follow its README to download the weights and related data.
- Then place them in the specified directory.
```bash
cp -r ${diffposetalk_root}/style pretrained_models/diffposetalk
cp ${diffposetalk_root}/experiments/DPT/head-SA-hubert-WM/checkpoints/iter_0110000.pt pretrained_models/diffposetalk
cp ${diffposetalk_root}/datasets/HDTF_TFHP/lmdb/stats_train.npz pretrained_models/diffposetalk
```

- Or you can download the style files from [link](https://drive.google.com/file/d/1XT426b-jt7RUkRTYsjGvG-wS4Jed2U1T/view?usp=sharing) and stats_train.npz from [link](https://drive.google.com/file/d/1_I5XRzkMP7xULCSGVuaN8q1Upplth9xR/view?usp=sharing).
```text
pretrained_models
├── FLAME
├── SkyReels-A1-5B
├── mediapipe
├── diffposetalk
│   ├── style
│   ├── iter_0110000.pt
│   └── stats_train.npz
└── smirk
```
#### Download Frame interpolation Model pretrained weights (For Long Video Inference and Dynamic Resolution)
- We use [FILM](https://github.com/dajes/frame-interpolation-pytorch) to generate transition frames, making the video transitions smoother (set `use_interpolation` to `True`).
- Download [film_net_fp16.pt](https://github.com/dajes/frame-interpolation-pytorch/releases), and place it in the specified directory.
```text
pretrained_models
├── FLAME
├── SkyReels-A1-5B
├── mediapipe
├── diffposetalk
├── film_net
│   └── film_net_fp16.pt
└── smirk
```

### 3. Inference 🚀
You can run the inference scripts as follows:
```bash
python inference.py

# inference audio to video
python inference_audio.py
```

If the script runs successfully, you will get an output mp4 file containing the driving video, the input image or video, and the generated result.
#### Long Video Inference
Now, you can run the long video inference scripts to obtain portrait animations of any length:
```bash
python inference_long_video.py

# inference audio to video
python inference_audio_long_video.py
```

#### Dynamic Resolution
All inference scripts now support dynamic resolution: simply set `target_fps` to the desired fps. Recommended values are 12 fps (native), 24 fps, 48 fps, and 60 fps; other settings such as 25 fps and 30 fps may cause unstable frame rates.
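Note that all of the recommended values are integer multiples of the 12 fps native rate, which is presumably why they behave well. A purely illustrative helper (not part of the repo) can flag problematic settings before a long run:

```python
# illustrative helper: warn about target_fps values that are not integer
# multiples of the 12 fps native rate (not part of the SkyReels-A1 scripts)
NATIVE_FPS = 12

def check_target_fps(target_fps: int) -> None:
    if target_fps % NATIVE_FPS == 0:
        print(f"{target_fps} fps: OK (x{target_fps // NATIVE_FPS} of the native rate)")
    else:
        print(f"{target_fps} fps: not a multiple of {NATIVE_FPS} fps, "
              "frame rates may be unstable")

for fps in (12, 24, 25, 30, 48, 60):
    check_target_fps(fps)
```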
## Gradio Interface 🤗
We provide a [Gradio](https://huggingface.co/docs/hub/spaces-sdks-gradio) interface for a better experience; just run:
```bash
python app.py
```

The graphical interactive interface is shown below:

## Metric Evaluation 👓
We also provide scripts for automatically computing the metrics reported in the paper, including SimFace, FID, and the L1 distances for expression and motion.
All scripts can be found in the `eval` folder. After setting the path to the video results, run the following commands in sequence:
```bash
python arc_score.py
python expression_score.py
python pose_score.py
```
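As a rough illustration of the face-similarity idea behind `arc_score.py`: identity preservation is commonly measured as the cosine similarity between face-recognition (e.g. ArcFace) embeddings of the source and generated frames. A generic sketch, assuming the embeddings have already been extracted (this is not the repo's exact code):

```python
# generic sketch: cosine similarity between face embeddings
# (assumes ArcFace-style embeddings already extracted; not the repo's exact code)
import numpy as np

def face_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))

# toy example with random 512-d embeddings
rng = np.random.default_rng(0)
print(face_similarity(rng.normal(size=512), rng.normal(size=512)))
```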
## Acknowledgements 💐

We would like to thank the contributors of the [CogVideoX](https://github.com/THUDM/CogVideo), [finetrainers](https://github.com/a-r-r-o-w/finetrainers), and [DiffPoseTalk](https://github.com/DiffPoseTalk/DiffPoseTalk) repositories for their open research and contributions.

## Citation 💖
If you find SkyReels-A1 useful for your research, please 🌟 this repo and cite our work using the following BibTeX:
```bibtex
@article{qiu2025skyreels,
  title={Skyreels-a1: Expressive portrait animation in video diffusion transformers},
  author={Qiu, Di and Fei, Zhengcong and Wang, Rui and Bai, Jialin and Yu, Changqian and Fan, Mingyuan and Chen, Guibin and Wen, Xiang},
  journal={arXiv preprint arXiv:2502.10841},
  year={2025}
}
```

## Star History
[](https://www.star-history.com/#SkyworkAI/SkyReels-A1&Date)