Speech commands recognition with PyTorch | Kaggle 10th place solution in TensorFlow Speech Recognition Challenge

https://github.com/tugstugi/pytorch-speech-commands

Convolutional neural networks for [Google speech commands data set](https://research.googleblog.com/2017/08/launching-speech-commands-dataset.html)
with [PyTorch](http://pytorch.org/).

# General
We, [xuyuan](https://github.com/xuyuan) and [tugstugi](https://github.com/tugstugi), participated
in the Kaggle competition [TensorFlow Speech Recognition Challenge](https://www.kaggle.com/c/tensorflow-speech-recognition-challenge)
and reached 10th place. This repository contains a simplified and cleaned-up version of our team's code.

# Features
* `1x32x32` mel-spectrogram as network input (see the sketch after this list)
* a single network implementation for both the CIFAR10 and Google speech commands data sets
* faster audio data augmentation on the STFT
* Kaggle private LB scores evaluated on 150,000+ audio files
# Results
Due to the time limit of the competition, we trained most of the networks with `sgd` using `ReduceLROnPlateau` for 70 epochs.
For the training parameters and dependencies, see [TRAINING.md](TRAINING.md). Stopping the training earlier sometimes produces a better score on Kaggle.
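
The following is only a rough sketch of that setup (`sgd` plus `ReduceLROnPlateau`), with a placeholder model and dummy data so it runs standalone; the actual networks, loaders, and hyperparameters are in the repository and [TRAINING.md](TRAINING.md):

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Placeholder model and dummy batches; the real networks (VGG, ResNet, WRN,
# DenseNet, ...) and data loaders live in the repository.
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 12))
train_loader = [(torch.randn(8, 1, 32, 32), torch.randint(0, 12, (8,)))]
valid_loader = [(torch.randn(8, 1, 32, 32), torch.randint(0, 12, (8,)))]

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-2)
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=5)

for epoch in range(70):
    model.train()
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

    # the validation loss drives the learning-rate schedule
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in valid_loader) / len(valid_loader)
    scheduler.step(val_loss)
```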

| Model | CIFAR10 test set accuracy | Speech Commands test set accuracy | Speech Commands test set accuracy with crop | Speech Commands Kaggle private LB score | Speech Commands Kaggle private LB score with crop | Remarks |
| --- | --- | --- | --- | --- | --- | --- |
| VGG19 BN | 93.56% | 97.337235% | 97.527432% | 0.87454 | 0.88030 | |
| ResNet32 | - | 96.181419% | 96.196050% | 0.87078 | 0.87419 | |
| WRN-28-10 | - | 97.937089% | 97.922458% | 0.88546 | 0.88699 | |
| WRN-28-10-dropout | 96.22% | 97.702999% | 97.717630% | 0.89580 | 0.89568 | |
| WRN-52-10 | - | 98.039503% | 97.980980% | 0.88159 | 0.88323 | another trained model has 97.52%/0.89322 |
| ResNext29 8x64 | - | 97.190929% | 97.161668% | 0.89533 | 0.89733 | our best model during the competition |
| DPN92 | - | 97.190929% | 97.249451% | 0.89075 | 0.89286 | |
| DenseNet-BC (L=100, k=12) | 95.52% | 97.161668% | 97.147037% | 0.88946 | 0.89134 | |
| DenseNet-BC (L=190, k=40) | - | 97.117776% | 97.147037% | 0.89369 | 0.89521 | |

# Results with Mixup

After the competition, some of the networks were retrained using [mixup: Beyond Empirical Risk Minimization](https://arxiv.org/abs/1710.09412) by Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin and David Lopez-Paz.
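
For reference, a minimal mixup sketch in PyTorch; the helper names and the `alpha` value are illustrative assumptions, not the exact implementation used to produce the scores below:

```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup_batch(inputs, targets, alpha=1.0):
    """Mix a batch with a shuffled copy of itself; return both target sets and the weight."""
    lam = np.random.beta(alpha, alpha)
    index = torch.randperm(inputs.size(0))
    mixed = lam * inputs + (1.0 - lam) * inputs[index]
    return mixed, targets, targets[index], lam

def mixup_loss(outputs, targets_a, targets_b, lam):
    """Cross-entropy weighted between the two sets of targets."""
    return lam * F.cross_entropy(outputs, targets_a) + (1.0 - lam) * F.cross_entropy(outputs, targets_b)

# usage inside a training step (model and optimizer as in the earlier sketch):
# mixed, y_a, y_b, lam = mixup_batch(inputs, targets, alpha=1.0)
# loss = mixup_loss(model(mixed), y_a, y_b, lam)
```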

| Model | CIFAR10 test set accuracy | Speech Commands test set accuracy | Speech Commands test set accuracy with crop | Speech Commands Kaggle private LB score | Speech Commands Kaggle private LB score with crop | Remarks |
| --- | --- | --- | --- | --- | --- | --- |
| VGG19 BN | - | 97.483541% | 97.542063% | 0.89521 | 0.89839 | |
| WRN-52-10 | - | 97.454279% | 97.498171% | 0.90273 | 0.90355 | same score as the 16th place in Kaggle |