https://github.com/praveena2j/rjcaforspeakerverification
[FG 2024] "Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention"
- Host: GitHub
- URL: https://github.com/praveena2j/rjcaforspeakerverification
- Owner: praveena2j
- Created: 2024-01-08T18:30:15.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-05-14T16:41:30.000Z (7 months ago)
- Last Synced: 2024-05-15T11:33:58.767Z (7 months ago)
- Topics: attention, attention-model, audio-visual-learning, multimodal-learning, speaker-verification
- Language: Python
- Homepage:
- Size: 1 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
In this work, we present Recursive fusion of Joint Cross-Attention across audio and visual modalities for person verification.
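As a rough conceptual illustration only (this is not the repository's implementation; the module names, dimensions, and number of recursion steps are placeholders), recursive fusion of joint cross-attention can be sketched as each modality repeatedly attending to a joint audio-visual representation:

```python
import torch
import torch.nn as nn

class RecursiveJointCrossAttention(nn.Module):
    """Conceptual sketch: audio and visual features attend to their joint
    representation, and the attended features are fed back recursively."""

    def __init__(self, dim=512, heads=8, steps=2):
        super().__init__()
        self.steps = steps
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):
        # audio, visual: (batch, time, dim)
        for _ in range(self.steps):
            joint = torch.cat([audio, visual], dim=1)            # joint A-V representation
            audio, _ = self.audio_attn(audio, joint, joint)      # audio attends to joint
            visual, _ = self.visual_attn(visual, joint, joint)   # visual attends to joint
        # Pool over time and concatenate for a person embedding.
        return torch.cat([audio.mean(dim=1), visual.mean(dim=1)], dim=-1)

# Toy usage with random features (not real data):
model = RecursiveJointCrossAttention()
a = torch.randn(4, 20, 512)   # audio features
v = torch.randn(4, 20, 512)   # visual features
print(model(a, v).shape)      # torch.Size([4, 1024])
```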
## References
If you find this work useful in your research, please consider citing our work :pencil: and giving a star :star2: :
```bibtex
@article{praveen2024audio,
title={Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention},
author={Praveen, R Gnana and Alam, Jahangir},
journal={arXiv preprint arXiv:2403.04654},
year={2024}
}
```

There are three major blocks in this repository to reproduce the results of our paper. This code uses Mixed Precision Training (`torch.cuda.amp`). The dependencies and packages required to reproduce the environment of this repository can be found in the `environment.yml` file.
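For reference, a minimal sketch of a mixed-precision training step with `torch.cuda.amp` (the model, optimizer, and loss below are placeholders, not the repository's actual classes):

```python
import torch

# Placeholder model, optimizer, and loss -- not the repository's classes.
model = torch.nn.Linear(512, 2).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 underflow

def train_step(features, labels):
    optimizer.zero_grad()
    # Forward pass runs in mixed precision.
    with torch.cuda.amp.autocast():
        logits = model(features)
        loss = criterion(logits, labels)
    # Backward on the scaled loss, then unscale and step the optimizer.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```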
### Creating the environment
Create an environment using the `environment.yml` file:

`conda env create -f environment.yml`
### Models and Text Files
The pre-trained models of the audio and visual backbones can be obtained [here](https://drive.google.com/drive/u/0/folders/1bXyexxgspeOi6gFiP177pM-KhwSeVsTq). The fusion models trained using our fusion approach can be found [here](https://drive.google.com/file/d/1lB0YeZSIYKpCs6EZG0hYaaPyl8dtCAAO/view?usp=drive_link).
The text files can be found [here](https://drive.google.com/drive/u/0/folders/1NJicFlj9CeNzxvtrOHRIHy6HnoTszro7):
```
train_list : Train list
val_trials : Validation trials list
val_list : Validation list
test_trials : Vox1-O trials list
test_list : Vox1-O list
```
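The exact column layout of these files is not documented here; a common VoxCeleb-style trial line is `label enrol_utterance test_utterance`. Under that assumption, a minimal sketch for reading a trials file (file name is hypothetical):

```python
from pathlib import Path

def load_trials(path):
    """Parse a VoxCeleb-style trials file (assumed format: 'label enrol test' per line)."""
    trials = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        label, enrol, test = line.split()
        trials.append((int(label), enrol, test))
    return trials

# Example (hypothetical file name):
# trials = load_trials("val_trials.txt")
```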
## Table of Content
+ [Preprocessing](#DP)
+ [Step One: Download the dataset](#PD)
+ [Step Two: Preprocess the visual modality](#PV)
+ [Training](#Training)
+ [Training the fusion model](#TE)
+ [Inference](#R)
+ [Generating the results](#GR)
## Preprocessing
[Return to Table of Content](#Table_of_Content)

### Step One: Download the dataset
[Return to Table of Content](#Table_of_Content)
Please download the following.
+ The images of the VoxCeleb1 dataset can be downloaded [here](https://www.robots.ox.ac.uk/~vgg/research/CMBiometrics/).

### Step Two: Preprocess the visual modality
[Return to Table of Content](#Table_of_Content)
+ The downloaded images are not properly aligned, so they are aligned using [Insightface](https://github.com/TadasBaltrusaitis/OpenFace/releases). The preprocessing scripts are provided in the `preprocessing` folder; a rough alignment sketch is given after this step.
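As a rough sketch only, face detection and alignment with the `insightface` package might look like the following (the paths, model pack, and crop size are assumptions; the repository's actual scripts are in the `preprocessing` folder):

```python
import cv2
from insightface.app import FaceAnalysis
from insightface.utils import face_align

app = FaceAnalysis(name="buffalo_l")        # detection + landmark models
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0 selects the first GPU

def align_image(src_path, dst_path, size=112):
    img = cv2.imread(src_path)
    faces = app.get(img)                    # detect faces and 5-point landmarks
    if not faces:
        return False
    # Warp the largest detected face to a canonical, aligned crop.
    face = max(faces, key=lambda f: (f.bbox[2] - f.bbox[0]) * (f.bbox[3] - f.bbox[1]))
    aligned = face_align.norm_crop(img, landmark=face.kps, image_size=size)
    cv2.imwrite(dst_path, aligned)
    return True
```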
## Training
[Return to Table of Content](#Table_of_Content)
+ `sbatch run_train.sh`

## Inference
[Return to Table of Content](#Table_of_Content)
+ `sbatch run_eval.sh` (see the scoring sketch below)
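The repository's own result generation lives in its evaluation script; as a general reference, verification results on the Vox1-O trials are usually summarised as an Equal Error Rate (EER). A minimal sketch, assuming a list of similarity scores and 0/1 trial labels:

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """EER: the operating point where false-acceptance and false-rejection rates are equal."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

# Example with toy scores (not real results):
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.7, 0.4, 0.8]
print(f"EER: {compute_eer(labels, scores):.2%}")
```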
### 👍 Acknowledgments
Our code is based on [AVCleanse](https://github.com/TaoRuijie/AVCleanse).