{"id":13656148,"url":"https://github.com/alvinliu0/HA2G","last_synced_at":"2025-04-23T17:31:18.754Z","repository":{"id":38023820,"uuid":"471211488","full_name":"alvinliu0/HA2G","owner":"alvinliu0","description":"[CVPR 2022] Code for \"Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation\"","archived":false,"fork":false,"pushed_at":"2023-03-16T09:02:44.000Z","size":2732,"stargazers_count":129,"open_issues_count":11,"forks_count":9,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-11-10T08:37:22.962Z","etag":null,"topics":["audio-visual-learning","co-speech-gesture","cvpr2022"],"latest_commit_sha":null,"homepage":"https://alvinliu0.github.io/projects/HA2G","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alvinliu0.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"license","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-03-18T02:32:23.000Z","updated_at":"2024-10-23T08:00:25.000Z","dependencies_parsed_at":"2024-11-10T08:31:22.494Z","dependency_job_id":"34b64422-0189-48ed-b968-092bde4b9279","html_url":"https://github.com/alvinliu0/HA2G","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alvinliu0%2FHA2G","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alvinliu0%2FHA2G/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alvinliu0%2FHA2G/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alvinliu0%2FHA2G/manifests","o
wner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alvinliu0","download_url":"https://codeload.github.com/alvinliu0/HA2G/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250480405,"owners_count":21437540,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-visual-learning","co-speech-gesture","cvpr2022"],"created_at":"2024-08-02T04:00:52.070Z","updated_at":"2025-04-23T17:31:16.116Z","avatar_url":"https://github.com/alvinliu0.png","language":"Python","funding_links":[],"categories":["Papers"],"sub_categories":["Audio-Driven motion generation"],"readme":"# Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation (CVPR 2022)\n\n[Xian Liu](https://alvinliu0.github.io/), [Qianyi Wu](https://qianyiwu.github.io/), [Hang Zhou](https://hangz-nju-cuhk.github.io/), [Yinghao Xu](https://justimyhxu.github.io/), [Rui Qian](https://shvdiwnkozbw.github.io/), [Xinyi Lin](https://alvinliu0.github.io/), [Xiaowei Zhou](https://xzhou.me/), [Wayne Wu](https://wywu.github.io/), [Bo Dai](http://daibo.info/), [Bolei Zhou](http://bzhou.ie.cuhk.edu.hk/).\n\n### [Project](https://alvinliu0.github.io/projects/HA2G) | [Paper](https://arxiv.org/pdf/2203.13161.pdf) | [Demo](https://www.youtube.com/watch?v=CG632W-nIWk) | [Data](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155165198_link_cuhk_edu_hk/EQhOOXYsZDhJs-oEVwA7oyABSrkwcTKC6kwX-A85r0-42g?e=BiIsV1)\n\nGenerating speech-consistent body and gesture movements is a long-standing problem in virtual avatar creation. 
Previous studies often synthesize pose movement in a holistic manner, where poses of all joints are generated simultaneously. Such a straightforward pipeline fails to generate fine-grained co-speech gestures. One observation is that the hierarchical semantics in speech and the hierarchical structures of human gestures can be naturally described at multiple granularities and associated together. To fully utilize the rich connections between speech audio and human gestures, we propose a novel framework named **Hierarchical Audio-to-Gesture (HA2G)** for co-speech gesture generation. In HA2G, a Hierarchical Audio Learner extracts audio representations across semantic granularities. A Hierarchical Pose Inferer subsequently renders the entire human pose gradually in a hierarchical manner. To enhance the quality of synthesized gestures, we develop a contrastive learning strategy based on audio-text alignment for better audio representations. Extensive experiments and human evaluation demonstrate that the proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.\n\n\u003cimg src='./misc/HA2G.png' width=800\u003e\n\n## Update\n\n- [2023/01/31] An evaluation bug on the BC metric is reported ([L424](https://github.com/alvinliu0/HA2G/blob/main/scripts/train.py#L424) of the scripts/train.py file and [L539](https://github.com/alvinliu0/HA2G/blob/main/scripts/train_expressive.py#L539) of the scripts/train_expressive.py file). Originally, the mean pose vectors were not added back to recover the correct skeleton in the main paper's reported BC evaluation results. We will update the quantitative results in the arXiv updates.\n\n## Environment\n\nThis project is developed and tested on Ubuntu 18.04, Python 3.6, PyTorch 1.10.2 and CUDA version 11.3. 
Since the repository is developed based on [Gesture Generation from Trimodal Context](https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context) of Yoon et al., the environment requirements, installation and dataset preparation process generally follow theirs.\n\n## Installation\n\n1. Clone this repository:\n   ```\n   git clone https://github.com/alvinliu0/HA2G.git\n   ```\n\n2. Install required python packages:\n   ```\n   pip install -r requirements.txt\n   ```\n\n3. Install Gentle for audio-transcript alignment. Download the source code from [Gentle github](https://github.com/lowerquality/gentle) and install the library via `install.sh`. And then, you can import gentle library by specifying the path to the library at `script/synthesize.py` line 27.\n\n4. Download pretrained fasttext model from [here](https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip) and put `crawl-300d-2M-subword.bin` and `crawl-300d-2M-subword.vec` at `data/fasttext/`.\n\n5. Download the pretrained co-speech gesture models, which include the following:\n\n* [TED Expressive Dataset Auto-Encoder](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155165198_link_cuhk_edu_hk/EWbBxUeuIHFDnBUgZFMCq1oBdiZSw6pOlmVxC8d9xS3HOg?e=IT1AoC), which is used to evaluate the FGD metric;\n\n* [TED Gesture Dataset Pretrained Model](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155165198_link_cuhk_edu_hk/EWNjGPct4vJFq1nXccRi8OsBYmy62FugwGE_eRRqt0siDw?e=lGbjxp), which is the HA2G model trained on the TED Gesture Dataset;\n\n* [TED Expressive Dataset Pretrained Model](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155165198_link_cuhk_edu_hk/EXtwgK2itnpGmE8fkMjrfccBmw4l3zsTDfhAb_PKw1aXdA), which is the HA2G model trained on the TED Expressive Dataset.\n\n## TED Expressive Dataset\n\nDownload [the preprocessed TED Expressive dataset](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155165198_link_cuhk_edu_hk/EQhOOXYsZDhJs-oEVwA7oyABSrkwcTKC6kwX-A85r0-42g?e=BiIsV1) (16GB) and 
extract the ZIP file into `data/ted_expressive_dataset`. \n\nDetails of the TED Expressive dataset are given [here](https://github.com/alvinliu0/HA2G/blob/main/dataset_script/README.md). The dataset pre-processing is extended from [youtube-gesture-dataset](https://github.com/youngwoo-yoon/youtube-gesture-dataset). Our dataset adds 3D upper-body keypoint annotations, including fine-grained fingers.\n\n## TED Gesture Dataset\n\nOur codebase also supports training and inference on the TED Gesture dataset of Yoon et al. Download [the preprocessed TED Gesture dataset](https://kaistackr-my.sharepoint.com/:u:/g/personal/zeroyy_kaist_ac_kr/EYAPLf8Hvn9Oq9GMljHDTK4BRab7rl9hAOcnjkriqL8qSg) (16GB) and extract the ZIP file into `data/ted_gesture_dataset`. Please refer to [here](https://github.com/youngwoo-yoon/youtube-gesture-dataset) for the details of the TED Gesture dataset.\n\n## Pretrained Models and Training Logs\n\nWe also provide the pretrained models and training logs for better reproducibility and further research in this community. Note that since this work was done during an internship at SenseTime Research, only the original training logs are provided, while the original pretrained models are unavailable. Instead, we provide newly trained models as well as the corresponding training logs. 
The new models outperform the evaluation results reported in the paper.\n\nPretrained models contain:\n\n* [TED Gesture Dataset Pretrained Model](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155165198_link_cuhk_edu_hk/EWNjGPct4vJFq1nXccRi8OsBYmy62FugwGE_eRRqt0siDw?e=lGbjxp), which is the HA2G model trained on the TED Gesture Dataset;\n\n* [TED Expressive Dataset Pretrained Model](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155165198_link_cuhk_edu_hk/EXtwgK2itnpGmE8fkMjrfccBmw4l3zsTDfhAb_PKw1aXdA), which is the HA2G model trained on the TED Expressive Dataset.\n\nTraining logs contain:\n\n* [ted_gesture_original.log](https://github.com/alvinliu0/HA2G/blob/main/training_logs/ted_gesture_original.log), which is the original HA2G training log on TED Gesture dataset;\n\n* [ted_gesture_new.log](https://github.com/alvinliu0/HA2G/blob/main/training_logs/ted_gesture_new.log), which is the newly trained HA2G log on TED Gesture dataset;\n\n* [ted_expressive_original.log](https://github.com/alvinliu0/HA2G/blob/main/training_logs/ted_expressive_original.log), which is the original HA2G training log on TED Expressive dataset;\n\n* [ted_expressive_new.log](https://github.com/alvinliu0/HA2G/blob/main/training_logs/ted_expressive_new.log), which is the newly trained HA2G log on TED Expressive dataset.\n\n## Synthesize from TED speech\n\nGenerate gestures from a clip in the **TED Gesture testset** using **baseline** models: \n\n```\npython scripts/synthesize.py from_db_clip [trained model path] [number of samples to generate]\n```\n\nYou would run like this:\n\n```\npython scripts/synthesize.py from_db_clip output/train_multimodal_context/multimodal_context_checkpoint_best.bin 10\n```\n\nGenerate gestures from a clip in the **TED Gesture testset** using **HA2G** models: \n\n```\npython scripts/synthesize_hierarchy.py from_db_clip [trained model path] [number of samples to generate]\n```\n\nYou would run like this:\n\n```\npython scripts/synthesize_hierarchy.py from_db_clip 
TED-Gesture-output/train_hierarchy/ted_gesture_hierarchy_checkpoint_best.bin 10\n```\n\nGenerate gestures from a clip in the **TED Expressive testset** using **HA2G** models: \n\n```\npython scripts/synthesize_expressive_hierarchy.py from_db_clip [trained model path] [number of samples to generate]\n```\n\nYou would run like this:\n\n```\npython scripts/synthesize_expressive_hierarchy.py from_db_clip TED-Expressive-output/train_hierarchy/ted_expressive_hierarchy_checkpoint_best.bin 10\n```\n\nThe first run takes several minutes to cache the dataset. After that, it runs quickly.   \nYou can find synthesized results in `output/generation_results`. There are MP4, WAV, and PKL files for visualized output, audio, and pickled raw results, respectively. Speaker IDs are randomly selected for each generation. The following shows sample MP4 files.\n\n![Generated Sample 1](./misc/sample1.gif)\n![Generated Sample 2](./misc/sample2.gif)\n\n## Training\n\nTrain the proposed HA2G model on the TED Gesture Dataset:\n```\npython scripts/train.py --config=config/hierarchy.yml\n```\n\nAnd the baseline models on the TED Gesture Dataset:\n\n```\npython scripts/train.py --config=config/seq2seq.yml\npython scripts/train.py --config=config/speech2gesture.yml\npython scripts/train.py --config=config/joint_embed.yml \npython scripts/train.py --config=config/multimodal_context.yml\n```\n\nFor the TED Expressive Dataset, you can train the HA2G model by:\n```\npython scripts/train_expressive.py --config=config_expressive/hierarchy.yml\n```\n\nAnd the baseline models on the TED Expressive Dataset:\n\n```\npython scripts/train.py --config=config_expressive/seq2seq.yml\npython scripts/train.py --config=config_expressive/speech2gesture.yml\npython scripts/train.py --config=config_expressive/joint_embed.yml \npython scripts/train.py --config=config_expressive/multimodal_context.yml\n```\n\nCaching the TED training set (`lmdb_train`) takes tens of minutes on your first run. 
Model checkpoints and sample results will be saved in subdirectories of the `./TED-Gesture-output` and `./TED-Expressive-output` folders.\n\nNote on reproducibility:  \nUnfortunately, we did not fix a random seed, so you will not be able to reproduce exactly the FGD reported in the paper. However, several runs with different random seeds mostly fell in a similar FGD range.\n\n### Fréchet Gesture Distance (FGD)\n\nYou can train the autoencoder used for FGD. However, please note that FGD will change if you train the autoencoder anew. We recommend sticking to the checkpoint that we shared.\n \n1. For the TED Gesture Dataset, we use the pretrained Auto-Encoder model provided by Yoon et al. for better reproducibility ([the ckpt in the train_h36m_gesture_autoencoder folder](https://kaistackr-my.sharepoint.com/:u:/g/personal/zeroyy_kaist_ac_kr/Ec1UIsDDLHtKia04_TTRbygBepXORv__kkq-C9IqZs32aA?e=bJGXQr)).\n\n2. For the TED Expressive Dataset, the pretrained Auto-Encoder model is provided [here](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155165198_link_cuhk_edu_hk/EWbBxUeuIHFDnBUgZFMCq1oBdiZSw6pOlmVxC8d9xS3HOg?e=IT1AoC). 
If you want to train the autoencoder anew, you could run the following training script:\n   \n```\npython scripts/train_feature_extractor_expressive.py --config=config_expressive/gesture_autoencoder.yml\n```\n\nThe model checkpoints will be saved in `./TED-Expressive-output/AE-cos1e-3`.\n\n## License\n\nWe follow the GPL-3.0 license, please see details [here](https://github.com/alvinliu0/HA2G/blob/main/license).\n\n## Citation\n\nIf you find our work useful, please kindly cite as:\n```\n@inproceedings{liu2022learning,\n  title={Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation},\n  author={Liu, Xian and Wu, Qianyi and Zhou, Hang and Xu, Yinghao and Qian, Rui and Lin, Xinyi and Zhou, Xiaowei and Wu, Wayne and Dai, Bo and Zhou, Bolei},\n  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},\n  pages={10462--10472},\n  year={2022}\n}\n```\n\n## Related Links\nIf you are interested in **Audio-Driven Co-Speech Gesture Generation**, we would also like to recommend you to check out our other related works:\n\n* Audio-Driven Co-Speech Gesture Video Generation, [ANGIE](https://alvinliu0.github.io/projects/ANGIE).\n\n* Taming Diffusion Model for Co-Speech Gesture, [DiffGesture](https://github.com/Advocate99/DiffGesture).\n\n## Acknowledgement\n* The codebase is developed based on [Gesture Generation from Trimodal Context](https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context) of Yoon et al.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falvinliu0%2FHA2G","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falvinliu0%2FHA2G","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falvinliu0%2FHA2G/lists"}