https://github.com/parth-shastri/adv_syn_data_aug_cs_lid

code for the paper "Adversarial synthesis based data-augmentation for code-switched spoken language identification"

deep-learning machine-learning


## Dataset
> The Hindi-English dataset is extracted from spoken tutorials. These tutorials cover a range of technical topics and the code-switching predominantly arises from the technical content of the lectures. The segments file in the baseline recipe provides sentence time-stamps. These time-stamps were used to derive segments from the audio file to be aligned with the transcripts given in the text file. Hindi-English train and test datasets contain 89.86 hours and 5.18 hours, respectively. All the audio files in both datasets are sampled at 16 kHz, 16 bits encoding. The vocabulary size for Hindi-English is 17877.

Dataset link: [Hindi-English code-switched data](https://www.openslr.org/104/)

Link to the trained models and the classification dataset: [Drive link](https://coepac-my.sharepoint.com/:f:/g/personal/shastripp18_extc_coep_ac_in/Ehd5lUFVASdEudEdc6_j0EkBDiwzKa56NAuPlFuyr2pOJQ?e=31Odlz)

# Methodology
## Using Generative Adversarial Networks for data augmentation
- GANs are a relatively recent class of models that have been highly successful in computer vision, especially in generative modelling.
- By treating audio spectrograms as images, we try to exploit these networks' representation learning capabilities to generate code-switched spectrograms.
## Objectives
- The objective of our project is to study the effect of representation learning on data augmentation.
- In representation learning, the model learns representations from the probability distribution of the data.
- By training a GAN on the code-switched data, we try to learn representations of the code-switched data space and later use them to augment our data for classification.
## Data preprocessing
#### We use a conditional GAN (Cond-GAN) to generate 128x128 mel-spectrograms; the conditioning factor is the log F0 (which represents the pitch)
###### To obtain an image of 128x128x1 we use:
- FFT size = 1024
- hop length = 256
- frame period = 16 ms
- sample rate = 16 kHz

These parameters are chosen to obtain the desired image size. Similar parameters are used to make the F0 contour a vector of length 128.


The F0 (fundamental frequency) represents the pitch information at each time frame.
Applying the same framing to the F0 contour extraction gives an F0 feature of shape (128, 1).
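
The relationship between the parameters listed above can be checked with a little arithmetic. This is only a sketch: the constant names are ours, and the exact number of frames per clip depends on how `data_transform.py` pads the signal, which is not shown here.

```python
# Parameters listed in this README.
SAMPLE_RATE = 16_000   # Hz
N_FFT = 1024           # FFT size in samples
HOP_LENGTH = 256       # samples between successive frame starts
N_FRAMES = 128         # target width of the spectrogram image (and F0 vector length)

# Frame period: time between the starts of successive frames.
frame_period_ms = 1000 * HOP_LENGTH / SAMPLE_RATE

# Approximate amount of audio spanned by 128 hops (padding conventions may shift this slightly).
segment_samples = N_FRAMES * HOP_LENGTH
segment_seconds = segment_samples / SAMPLE_RATE

# Number of non-redundant frequency bins produced by a real-valued FFT of size N_FFT,
# before projection onto the mel filter bank.
n_freq_bins = N_FFT // 2 + 1

print(f"frame period : {frame_period_ms} ms")
print(f"segment      : {segment_samples} samples, about {segment_seconds:.3f} s")
print(f"FFT bins     : {n_freq_bins}")
```

With a 256-sample hop at 16 kHz the frame period comes out at 16 ms, which matches the value listed above, and each 128-frame image covers roughly two seconds of audio.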
###### The data preprocessing steps, implemented in `data_transform.py`:
- Audio normalization
- Framing and blocking
- Compute the STFT and convert it to a power spectrum, i.e. (magnitude)^2
- Multiply it by the mel filter bank to obtain the mel-spectrogram
- Convert to log (dB) mel scale
- Normalize to the range [-1, 1]
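
The steps above can be sketched with NumPy alone. This is not the code from `data_transform.py`. The function names, the peak normalization, the Hann window, the 128 mel bands, and the per-image min-max scaling are all assumptions made for illustration.

```python
import numpy as np

SR, N_FFT, HOP, N_MELS = 16_000, 1024, 256, 128

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = hz_points[i:i + 3]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_image(audio, n_frames=128):
    """Turn a 1-D waveform into an (N_MELS, n_frames) image scaled to [-1, 1]."""
    needed = (n_frames - 1) * HOP + N_FFT
    if audio.size < needed:
        raise ValueError(f"need at least {needed} samples, got {audio.size}")
    # 1. Audio normalization (peak normalization is an assumption).
    audio = audio / (np.max(np.abs(audio)) + 1e-9)
    # 2. Frame and block: overlapping frames of N_FFT samples, HOP samples apart.
    frames = np.stack([audio[s:s + N_FFT] for s in HOP * np.arange(n_frames)])
    # 3. STFT of each windowed frame, then power spectrum = magnitude squared.
    power = np.abs(np.fft.rfft(frames * np.hanning(N_FFT), axis=1)) ** 2
    # 4. Project onto the mel filter bank.
    mel = mel_filterbank() @ power.T
    # 5. Convert to dB, with a small floor to avoid log(0).
    log_mel = 10.0 * np.log10(np.maximum(mel, 1e-10))
    # 6. Min-max scale to [-1, 1] (per-image scaling is an assumption).
    low, high = log_mel.min(), log_mel.max()
    return 2.0 * (log_mel - low) / (high - low + 1e-9) - 1.0

if __name__ == "__main__":
    noise = np.random.default_rng(0).standard_normal(40_000)  # stand-in for real audio
    print(log_mel_image(noise).shape)
```

Scaling to [-1, 1] pairs naturally with a generator whose output layer uses `tanh`, though the README does not state which output activation is used.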

## GAN Architecture
![img](/imgs/figure2a.png)
![img2](/imgs/figure2b.png)
#### GAN results
![img3](/imgs/Comparison%20figure.png)
#### Frechet Inception Distance
The Frechet Inception Distance (FID) is designed to capture both the variability (diversity) and the fidelity of the generated data in comparison with the real input data.
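
For reference, FID is commonly defined as the Fréchet distance between two Gaussians fitted to feature embeddings of the real and the generated samples. The symbols below are notation introduced here: $\mu_r, \Sigma_r$ are the mean and covariance of the real-data features, and $\mu_g, \Sigma_g$ those of the generated-data features. This README does not state which feature extractor was used for the spectrograms.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
$$

Lower values indicate that the generated distribution is closer to the real one. The first term measures the shift in the means and the second the mismatch in the covariances.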

![img](/imgs/figure3.png)
## Classifier
![img4](/imgs/figure5.png)
#### Dataset for LID
We treated LID as a 3-class classification problem.
There is no standard dataset for this task, so we combined utterances from existing datasets:

- NPTEL Indian English dataset - take utterances for the "English" class
- NISP dataset - take utterances for both the "Hindi" and the "English" classes
- MUCS Hindi-English dataset - take utterances for the "Hindi-English" class
- IIIT-H dataset - take utterances for the "Hindi" class

#### Experiments
We used 3 training setups to evaluate our scheme:
1. We trained the model on the imbalanced data.
2. We trained the model on data in which the SpecAugment technique was applied to the Hindi-English class.
3. We trained the model on data in which the Hindi-English class was augmented with spectrograms generated by our GAN model.
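
For readers unfamiliar with SpecAugment, its masking step can be sketched as follows. This is a simplified illustration, not the configuration used in these experiments. The function name, the mask widths, and the choice of fill value are hypothetical, and the time-warping step of the original SpecAugment method is omitted.

```python
import numpy as np

def spec_augment(spec, max_freq_mask=16, max_time_mask=16, rng=None):
    """Return a copy of `spec` (n_mels, n_frames) with one frequency band
    and one time span masked out."""
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    n_mels, n_frames = out.shape
    fill = out.min()  # assumption: masked cells take the lowest value in the image

    # Frequency mask: blank out `f` consecutive mel bands starting at `f0`.
    f = rng.integers(0, max_freq_mask + 1)
    f0 = rng.integers(0, n_mels - f + 1)
    out[f0:f0 + f, :] = fill

    # Time mask: blank out `t` consecutive frames starting at `t0`.
    t = rng.integers(0, max_time_mask + 1)
    t0 = rng.integers(0, n_frames - t + 1)
    out[:, t0:t0 + t] = fill
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spec = rng.uniform(-1.0, 1.0, size=(128, 128))  # stand-in for a mel-spectrogram
    augmented = spec_augment(spec, rng=rng)
    print(augmented.shape, int((augmented != spec).sum()), "cells masked")
```

Unlike the GAN-based approach, this kind of augmentation only perturbs existing utterances rather than sampling new ones.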

We performed 5-fold cross validation to select the best model.
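
A 5-fold split over class labels can be sketched in plain Python as shown below. The stratification by class is our assumption, chosen because the data are imbalanced. The README does not say how the folds were built, and `stratified_k_fold` is a hypothetical helper rather than a function from this repository.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, val_indices) pairs with similar class
    proportions in every fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, val

if __name__ == "__main__":
    # Toy labels only; the real class sizes are not given in this README.
    labels = ["Hindi"] * 10 + ["English"] * 10 + ["Hindi-English"] * 5
    for fold, (train, val) in enumerate(stratified_k_fold(labels)):
        print(fold, len(train), len(val))
```

Each sample lands in exactly one validation fold, so every utterance is used for validation once across the five runs.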

#### Results
We observed an accuracy improvement of about 4.7% for the models that used the proposed data augmentation, compared to the baseline.
![table img](/imgs/Accuracy%20table.png)
![uar comparison](/imgs/figure6.png)
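
The second figure is labelled as a UAR comparison. Assuming UAR stands for unweighted average recall (the mean of the per-class recalls), the sketch below shows why it is informative alongside accuracy on imbalanced data. The function names and the toy labels are illustrative only and are not results from this project.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally
    regardless of how many samples it has."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

if __name__ == "__main__":
    # Toy example: a classifier that always predicts the majority class.
    y_true = ["English"] * 8 + ["Hindi-English"] * 2
    y_pred = ["English"] * 10
    print("accuracy:", accuracy(y_true, y_pred))                 # 0.8
    print("UAR     :", unweighted_average_recall(y_true, y_pred))  # 0.5
```

In the toy example the majority-class predictor reaches 80% accuracy but only 50% UAR, because it never recovers the minority class.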

# Conclusion
We conclude that GANs, as a powerful representation learning tool, successfully learn the representation of the Hindi-English speech data.

Although this is a simple architecture relative to current progress in the domain of GANs, we saw an improvement in the metrics.

These results could be improved further by:
1. Using higher temporal resolution spectrograms
2. Using techniques such as Progressive Growing of GANs, StyleGAN, etc.
3. Using phonetic information to condition the GANs

This technique could also be applied in the domain of ASR, using phonetic information to produce realistic speech utterances.