Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://zixiangzhou916.github.io/UDE/
Last synced: about 1 month ago
- Host: GitHub
- URL: https://zixiangzhou916.github.io/UDE/
- Owner: zixiangzhou916
- Created: 2020-01-12T03:46:11.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2023-08-08T07:48:51.000Z (over 1 year ago)
- Last Synced: 2024-11-10T08:37:26.454Z (about 2 months ago)
- Language: Python
- Size: 47.5 MB
- Stars: 55
- Watchers: 6
- Forks: 3
- Open Issues: 3
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Project page of "UDE: A Unified Driving Engine for Human Motion Generation"
🔥🔥🔥 The demo code is now available!!!
✨✨✨ Our paper has been accepted by CVPR 2023!
---
## [Project Website](https://zixiangzhou916.github.io/UDE/), [Paper](http://arxiv.org/abs/2211.16016), [Demo](https://www.youtube.com/embed/CaG1PTvzkxA)
---
![plot](./assets/teaser.png)
Our shared Unified Driving Engine (UDE) supports both text-driven and audio-driven human motion generation. The left shows an example of a motion sequence driven by a text description, while the right shows an example driven by an LA Hiphop music clip.
# Abstract
#### Generating controllable and editable human motion sequences is a key challenge in 3D avatar generation. Generating and animating human motion has long been labor-intensive, until learning-based approaches were developed and applied recently. However, these approaches are still task-specific or modality-specific. In this paper, we propose "UDE", the first unified driving engine that enables generating human motion sequences from either natural language or audio sequences. Specifically, UDE consists of the following key components: 1) a motion quantization module based on VQ-VAE that represents a continuous motion sequence as discrete latent codes; 2) a modality-agnostic transformer encoder that learns to map modality-aware driving signals to a joint space; 3) a unified token transformer (GPT-like) network that predicts the quantized latent code indices in an auto-regressive manner; and 4) a diffusion motion decoder that takes the motion tokens as input and decodes them into motion sequences with high diversity. We evaluate our method on the HumanML3D and AIST++ benchmarks, and the experimental results demonstrate that our method achieves state-of-the-art performance.
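To make "discrete latent codes" concrete, here is a toy vector-quantization sketch in the spirit of component 1; the codebook size, latent dimension, and values below are illustrative assumptions, not the repository's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

codebook = rng.standard_normal((512, 64))        # toy codebook: 512 codes, 64-dim each
motion_latents = rng.standard_normal((120, 64))  # toy per-frame encoder outputs

# Vector quantization: replace each continuous latent with the index of
# its nearest codebook entry (L2 distance), yielding a discrete code sequence.
dists = ((motion_latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (120, 512)
token_ids = dists.argmin(axis=1)   # (120,) discrete motion tokens
quantized = codebook[token_ids]    # (120, 64) quantized latents fed downstream

print(token_ids[:10])
```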
# Overview
![plot](./assets/overview.png)
#### Our model consists of four key components. First, we train a codebook using a VQ-VAE; each code in the codebook represents a certain pattern of the motion sequence. Second, we introduce a Modality-Agnostic Transformer Encoder (MATE), which takes inputs of different modalities and transforms them into sequential embeddings in one joint space. The third component is a Unified Token Transformer (UTT); we feed it the sequential embeddings obtained by MATE and predict the motion token sequences in an auto-regressive manner. The fourth component is a Diffusion Motion Decoder (DMD). Unlike recent works, which are modality-specific, our DMD is modality-agnostic: given the motion token sequences, DMD encodes them into semantically rich embeddings and then decodes them into motion sequences in continuous space via the reverse diffusion process.
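To illustrate the data flow through MATE, UTT, and DMD, here is a heavily simplified, runnable sketch; every function name, shape, and "weight" below is a stand-in of our own, not the repository's actual API:

```python
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE, CODE_DIM = 512, 64                # toy values; real sizes are not stated here
codebook = rng.standard_normal((CODEBOOK_SIZE, CODE_DIM))

def mate_encode(signal):
    """MATE stand-in: project a modality-specific feature sequence
    (text or audio) into the shared joint embedding space."""
    proj = rng.standard_normal((signal.shape[-1], CODE_DIM))
    return signal @ proj                         # (T, CODE_DIM) sequential embedding

def utt_predict_tokens(embedding, num_tokens=8):
    """UTT stand-in: auto-regressively pick motion token indices.
    A real transformer would condition on `embedding` and past tokens;
    here the logits are random."""
    tokens = []
    for _ in range(num_tokens):
        logits = rng.standard_normal(CODEBOOK_SIZE)
        tokens.append(int(logits.argmax()))
    return tokens

def dmd_decode(tokens, num_joints=24):
    """DMD stand-in: decode token embeddings to a continuous motion
    sequence; the real model runs a reverse diffusion process."""
    z = codebook[np.array(tokens)]               # (N, CODE_DIM) semantic embeddings
    noise = rng.standard_normal((len(tokens), num_joints, 3))
    return noise + 0.1 * z.mean()                # placeholder for iterative denoising

text_features = rng.standard_normal((16, 300))   # stand-in for an encoded text prompt
motion = dmd_decode(utt_predict_tokens(mate_encode(text_features)))
print(motion.shape)                              # (8, 24, 3)
```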
# Demo
#### We show a short demo video of how our model can generate motion sequences from mixed modalities of input. To watch the full demo video, please visit: https://www.youtube.com/embed/CaG1PTvzkxA
# Getting started
This code was tested on Ubuntu 20.04 LTS and requires:
* Python 3.8
* Conda
* CUDA-capable GPU (a single GPU works!)

### 1. Setup environment
Clone this repo and change into its directory:

```
git clone https://github.com/zixiangzhou916/UDE.git
cd UDE
```

Create a conda environment and activate it.
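The exact conda commands are not given; a minimal sketch, assuming an environment named `ude` (the name is arbitrary) with the Python 3.8 listed above:

```
conda create -n ude python=3.8
conda activate ude
```

Then install the dependencies: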
```
pip install -r requirements.txt
```
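Once the dependencies are installed, you can quickly check that a CUDA-capable GPU is visible. This sketch assumes the requirements include PyTorch (an assumption; it is not listed explicitly above):

```python
import torch

# Should print True, then your GPU's name, if CUDA is set up correctly.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```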
### 2. Download the pretrained models
The pretrained checkpoints can be downloaded from [checkpoint](https://drive.google.com/drive/folders/13aLxNhgEOwxIkdT-taH7Ig-4j4ObJAaB?usp=sharing). After downloading, unpack the archives:
```
tar -xzvf checkpoints.tar.gz
tar -xzvf smpl_models.tar.gz
```

The unzipped checkpoint files are organized as:
```
checkpoints
    |--- ude_best.pth
    |--- dmd_best.pth
    |--- vqvae_best.pth
    |--- ViT-B-32.pt
```

Unzip the files and move the smpl models:
```
mv smpl networks
```
The smpl models should be organized as:
```
|networks
    |--- smpl
        |--- J_regressor_extra.npy
        |--- kintree_table.pkl
        |--- SMPL_FEMALE.pkl
        |--- SMPL_MALE.pkl
        |--- SMPL_NEUTRAL.pkl
```

### 3. Run the demo
We provide sample data for a quick demo. The sample data are organized as:

```
|sample_data
    |--- t2m
        |--- text_descriptions.json
    |--- a2m
        |--- gHO_sBM_cAll_d21_mHO5_ch01.npy
```
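If you want to peek at the sample inputs before running the demo, a minimal sketch (the file names are taken from the tree above; the files' internal structure is not documented here):

```python
import json
import numpy as np

# Text prompts for the text-to-motion (t2m) demo; structure not documented here.
with open("sample_data/t2m/text_descriptions.json") as f:
    texts = json.load(f)
print(type(texts))

# Sample clip for the audio-to-motion (a2m) demo; allow_pickle in case the
# .npy stores a pickled dict rather than a plain array.
audio = np.load("sample_data/a2m/gHO_sBM_cAll_d21_mHO5_ch01.npy", allow_pickle=True)
print(type(audio), getattr(audio, "shape", None))
```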
Run the following command to play with it:
```
sh demo.sh
```
### 4. Train your own UDE
Coming soon
### 5. Evaluate the model
Coming soon
# Citation
```bibtex
@InProceedings{Zhou_2023_CVPR,
    author    = {Zhou, Zixiang and Wang, Baoyuan},
    title     = {UDE: A Unified Driving Engine for Human Motion Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {5632-5641}
}
```