https://github.com/praveena2j/rjcaforspeakerverification
[FG 2024] "Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention"
- Host: GitHub
- URL: https://github.com/praveena2j/rjcaforspeakerverification
- Owner: praveena2j
- Created: 2024-01-08T18:30:15.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-05-14T16:41:30.000Z (7 months ago)
- Last Synced: 2024-05-15T11:33:58.767Z (7 months ago)
- Topics: attention, attention-model, audio-visual-learning, multimodal-learning, speaker-verification
- Language: Python
- Homepage:
- Size: 1 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
In this work, we present Recursive fusion of Joint Cross-Attention across audio and visual modalities for person verification.
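As a rough conceptual illustration only (this is not the repository's implementation; the module names, dimensions, and number of recursion steps are placeholders), recursive fusion of joint cross-attention can be sketched as each modality repeatedly attending to a joint audio-visual representation:

```python
import torch
import torch.nn as nn

class RecursiveJointCrossAttention(nn.Module):
    """Conceptual sketch: audio and visual features attend to their joint
    representation, and the attended features are fed back recursively."""

    def __init__(self, dim=512, heads=8, steps=2):
        super().__init__()
        self.steps = steps
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):
        # audio, visual: (batch, time, dim)
        for _ in range(self.steps):
            joint = torch.cat([audio, visual], dim=1)            # joint A-V representation
            audio, _ = self.audio_attn(audio, joint, joint)      # audio attends to joint
            visual, _ = self.visual_attn(visual, joint, joint)   # visual attends to joint
        # Pool over time and concatenate for a person embedding.
        return torch.cat([audio.mean(dim=1), visual.mean(dim=1)], dim=-1)

# Toy usage with random features (not real data):
model = RecursiveJointCrossAttention()
a = torch.randn(4, 20, 512)   # audio features
v = torch.randn(4, 20, 512)   # visual features
print(model(a, v).shape)      # torch.Size([4, 1024])
```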
## References
If you find this work useful in your research, please consider citing our work :pencil: and giving a star :star2: :
```bibtex
@article{praveen2024audio,
title={Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention},
author={Praveen, R Gnana and Alam, Jahangir},
journal={arXiv preprint arXiv:2403.04654},
year={2024}
}
```

There are three major blocks in this repository to reproduce the results of our paper. This code uses Mixed Precision Training (`torch.cuda.amp`). The dependencies and packages required to reproduce the environment of this repository can be found in the `environment.yml` file.
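For reference, a minimal sketch of a mixed-precision training step with `torch.cuda.amp` (the model, optimizer, and loss below are placeholders, not the repository's actual classes):

```python
import torch

# Placeholder model, optimizer, and loss -- not the repository's classes.
model = torch.nn.Linear(512, 2).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 underflow

def train_step(features, labels):
    optimizer.zero_grad()
    # Forward pass runs in mixed precision.
    with torch.cuda.amp.autocast():
        logits = model(features)
        loss = criterion(logits, labels)
    # Backward on the scaled loss, then unscale and step the optimizer.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```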
### Creating the environment
Create an environment using the `environment.yml` file:

`conda env create -f environment.yml`
### Models and Text Files
The pre-trained models of the audio and visual backbones can be obtained [here](https://drive.google.com/drive/u/0/folders/1bXyexxgspeOi6gFiP177pM-KhwSeVsTq). The fusion models trained using our fusion approach can be found [here](https://drive.google.com/file/d/1lB0YeZSIYKpCs6EZG0hYaaPyl8dtCAAO/view?usp=drive_link).
The text files can be found [here](https://drive.google.com/drive/u/0/folders/1NJicFlj9CeNzxvtrOHRIHy6HnoTszro7):
```
train_list : Train list
val_trials : Validation trials list
val_list : Validation list
test_trials : Vox1-O trials list
test_list : Vox1-O list
```
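The exact column layout of these files is not documented here; a common VoxCeleb-style trial line is `label enrol_utterance test_utterance`. Under that assumption, a minimal sketch for reading a trials file (file name is hypothetical):

```python
from pathlib import Path

def load_trials(path):
    """Parse a VoxCeleb-style trials file (assumed format: 'label enrol test' per line)."""
    trials = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        label, enrol, test = line.split()
        trials.append((int(label), enrol, test))
    return trials

# Example (hypothetical file name):
# trials = load_trials("val_trials.txt")
```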
## Table of Content
+ [Preprocessing](#DP)
+ [Step One: Download the dataset](#PD)
+ [Step Two: Preprocess the visual modality](#PV)
+ [Training](#Training)
+ [Training the fusion model](#TE)
+ [Inference](#R)
+ [Generating the results](#GR)
## Preprocessing
[Return to Table of Content](#Table_of_Content)

### Step One: Download the dataset
[Return to Table of Content](#Table_of_Content)
Please download the following.
+ The images of the VoxCeleb1 dataset can be downloaded [here](https://www.robots.ox.ac.uk/~vgg/research/CMBiometrics/).

### Step Two: Preprocess the visual modality
[Return to Table of Content](#Table_of_Content)
+ The downloaded images are not properly aligned, so they are aligned using [Insightface](https://github.com/TadasBaltrusaitis/OpenFace/releases). The preprocessing scripts are provided in the `preprocessing` folder; a rough alignment sketch is given after this step.
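As a rough sketch only, face detection and alignment with the `insightface` package might look like the following (the paths, model pack, and crop size are assumptions; the repository's actual scripts are in the `preprocessing` folder):

```python
import cv2
from insightface.app import FaceAnalysis
from insightface.utils import face_align

app = FaceAnalysis(name="buffalo_l")        # detection + landmark models
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0 selects the first GPU

def align_image(src_path, dst_path, size=112):
    img = cv2.imread(src_path)
    faces = app.get(img)                    # detect faces and 5-point landmarks
    if not faces:
        return False
    # Warp the largest detected face to a canonical, aligned crop.
    face = max(faces, key=lambda f: (f.bbox[2] - f.bbox[0]) * (f.bbox[3] - f.bbox[1]))
    aligned = face_align.norm_crop(img, landmark=face.kps, image_size=size)
    cv2.imwrite(dst_path, aligned)
    return True
```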
## Training
[Return to Table of Content](#Table_of_Content)
+ `sbatch run_train.sh`

## Inference
[Return to Table of Content](#Table_of_Content)
+ `sbatch run_eval.sh` (see the scoring sketch below)
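The repository's own result generation lives in its evaluation script; as a general reference, verification results on the Vox1-O trials are usually summarised as an Equal Error Rate (EER). A minimal sketch, assuming a list of similarity scores and 0/1 trial labels:

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """EER: the operating point where false-acceptance and false-rejection rates are equal."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

# Example with toy scores (not real results):
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.7, 0.4, 0.8]
print(f"EER: {compute_eer(labels, scores):.2%}")
```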
### 👍 Acknowledgments
Our code is based on [AVCleanse](https://github.com/TaoRuijie/AVCleanse).