https://github.com/asappresearch/slue-toolkit
A toolkit for the Spoken Language Understanding Evaluation (SLUE) benchmark. Refer to the paper https://arxiv.org/abs/2111.10367 for more details. Official website: https://asappresearch.github.io/slue-toolkit/
- Host: GitHub
- URL: https://github.com/asappresearch/slue-toolkit
- Owner: asappresearch
- License: mit
- Created: 2021-10-07T18:27:18.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-02-26T16:07:05.000Z (about 1 year ago)
- Last Synced: 2025-04-05T11:11:12.193Z (28 days ago)
- Language: Python
- Homepage: https://asappresearch.github.io/slue-toolkit/
- Size: 2.35 MB
- Stars: 64
- Watchers: 3
- Forks: 17
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# slue-toolkit
We introduce the Spoken Language Understanding Evaluation (SLUE) benchmark. This toolkit provides code to download and pre-process the SLUE datasets, train the baseline models, and evaluate the SLUE tasks. Refer to [https://arxiv.org/abs/2111.10367](https://arxiv.org/abs/2111.10367) for more details.
## News
- Jan. 8, 2024: All test set labels have been released as Hugging Face datasets. You can find the SLUE and SLUE Phase-2 audio and labels at the links below. Please use these labels for evaluation and do not submit your results to us via email.
- SLUE dataset: https://huggingface.co/datasets/asapp/slue
- SLUE Phase-2 dataset: https://huggingface.co/datasets/asapp/slue-phase-2
- Jul. 28, 2022: We updated the data to v0.2, in which the dev set of SLUE-VoxCeleb has a sentiment distribution similar to that of the test set. All statistics and evaluation results in the arXiv paper and on the leaderboard were updated accordingly.
- Nov. 22, 2021: We released the SLUE paper on arXiv along with the slue-toolkit repository. The repository contains data processing and evaluation scripts. We will publish the scripts for training the baseline models soon.

## Installation
1. Clone this repository and install slue-toolkit in development (editable) mode:
```sh
git clone https://github.com/asappresearch/slue-toolkit.git
cd slue-toolkit/
pip install -e .
```
or install directly from GitHub:
```sh
pip install git+https://github.com/asappresearch/slue-toolkit.git
```
2. Install additional dependencies based on your use case (e.g., you need `fairseq` and `transformers` for the baselines). Last checked with fairseq commit `8e804cb`.
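For example, one possible way to install them is sketched below (illustrative only; fairseq installation details may vary on your system):
```sh
# Illustrative only: transformers from PyPI; fairseq pinned to the commit
# mentioned above (some pip versions may require the full commit hash).
pip install transformers
pip install git+https://github.com/facebookresearch/fairseq.git@8e804cb
```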
3. Additional dependencies required:
3a. [wav2letter](https://github.com/flashlight/wav2letter): For decoding ASR and E2E NER models
This version of wav2letter python bindings does not require flashlight installation.
```sh
git clone --recursive https://github.com/facebookresearch/wav2letter.git
cd wav2letter
git checkout 96f5f9d
cd bindings/python
pip install -e .
```
3b. [kenlm](https://github.com/kpu/kenlm): For training language models
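One common way to build kenlm from source is sketched below (check the kenlm README for the exact build requirements on your system):
```sh
# Sketch: build kenlm with CMake; the resulting binaries (e.g. lmplz) end up in build/bin.
git clone https://github.com/kpu/kenlm.git
cd kenlm
mkdir -p build && cd build
cmake ..
make -j 4
```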
## SLUE Tasks
### Automatic Speech Recognition (ASR)
Although this is not an SLU task, ASR can help analyze the performance of downstream SLU tasks on the same domain. Additionally, pipeline approaches depend on ASR outputs, making ASR relevant to SLU. ASR is evaluated using word error rate (WER).
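For reference, WER is the word-level edit distance divided by the reference length. A minimal illustration (not the toolkit's evaluation script):
```python
# Illustrative WER computation (edit distance over words).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # DP table for Levenshtein distance between word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("we introduce the slue benchmark", "we introduce slue benchmark"))  # 0.2
```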
### Named Entity Recognition (NER)
Named entity recognition involves detecting the named entities and their tags (types) in a given sentence. We evaluate performance using micro-averaged F1 and label-F1 scores. The F1 score evaluates an unordered list of named entity phrase and tag pairs predicted for each sentence. Only the tag predictions are considered for label-F1.
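As an illustration (not the toolkit's evaluation code), micro-averaged F1 over unordered (phrase, tag) pairs can be computed as sketched below; label-F1 applies the same routine to the tags alone. The example tags are hypothetical:
```python
# Sketch: micro-F1 over unordered (phrase, tag) pairs per sentence.
from collections import Counter

def micro_f1(gold_sents, pred_sents):
    tp = fp = fn = 0
    for gold, pred in zip(gold_sents, pred_sents):
        gold_c, pred_c = Counter(gold), Counter(pred)
        overlap = sum((gold_c & pred_c).values())  # multiset intersection
        tp += overlap
        fp += sum(pred_c.values()) - overlap
        fn += sum(gold_c.values()) - overlap
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = [[("european union", "ORG"), ("2021", "WHEN")]]
pred = [[("european union", "ORG"), ("brussels", "PLACE")]]
print(micro_f1(gold, pred))  # 0.5
```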
### Sentiment Analysis (SA)
Sentiment analysis refers to classifying a given speech segment as having negative, neutral, or positive sentiment. We evaluate SA using macro-averaged (unweighted) recall and F1 scores.
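For reference, macro-averaged (unweighted) recall and F1 can be computed with scikit-learn; this is only an illustration with made-up labels, not the toolkit's evaluation script:
```python
# Illustration: macro recall and F1 with scikit-learn.
from sklearn.metrics import f1_score, recall_score

y_true = ["positive", "neutral", "negative", "neutral"]
y_pred = ["positive", "negative", "negative", "neutral"]
print(recall_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="macro"))
```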
### Named Entity Localization (NEL)
Named entity localization involves detecting the time stamps of named entities in a given utterance. An NEL algorithm returns a list of time stamps, and we evaluate performance using two measures: frame-F1 and word-F1 scores. For the F1 score computation, we count the frames (or words) that are missed (false negatives), detected correctly (true positives), or detected incorrectly (false positives) within the detected time stamp boundaries.
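A minimal sketch of a frame-level F1, assuming a fixed frame rate and spans given as (start, end) in seconds; this is only illustrative and not the official metric implementation:
```python
# Sketch: frame-level F1 for NEL with hypothetical 20 ms frames.
def spans_to_frames(spans, total_dur, frame_sec=0.02):
    n_frames = int(round(total_dur / frame_sec))
    frames = [False] * n_frames
    for start, end in spans:
        lo = max(0, int(start / frame_sec))
        hi = min(n_frames, int(end / frame_sec))
        for i in range(lo, hi):
            frames[i] = True
    return frames

def frame_f1(gold_spans, pred_spans, total_dur):
    gold = spans_to_frames(gold_spans, total_dur)
    pred = spans_to_frames(pred_spans, total_dur)
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum(p and not g for g, p in zip(gold, pred))
    fn = sum(g and not p for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(frame_f1([(0.5, 1.0)], [(0.6, 1.1)], total_dur=2.0))
```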
### Datasets
| Corpus | Fine-tune: utts (hours) | Dev: utts (hours) | Test: utts (hours) | Tasks | License |
|---|---|---|---|---|---|
| SLUE-VoxPopuli | 5,000 (14.5) | 1,753 (5.0) | 1,842 (4.9) | ASR, NER | CC0 (check complete license here) |
| SLUE-VoxCeleb | 5,777 (12.8) | 1,454 (3.2) | 3,553 (7.8) | ASR, SA | CC-BY 4.0 (check complete license here) |
SLUE uses the [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) and [VoxPopuli](https://github.com/facebookresearch/voxpopuli) datasets. We carefully curated subsets of those datasets for fine-tuning and evaluation on the SLUE tasks, and we re-distribute those subsets, so you don't need to download the entire original datasets. The distributed data also includes the human annotations and transcriptions for the SLUE tasks. All you need to do is run the script below, which downloads and pre-processes the dataset.
#### Download and pre-process dataset
```sh
bash scripts/download_datasets.sh
```
Note: for NEL, the dataset is hosted on [HuggingFace](https://huggingface.co/datasets/asapp/slue-phase-2/viewer/vp_nel), so run the following command to prepare the manifest files.
```sh
python slue_toolkit/prepare/prepare_voxpopuli_nel.py create_manifest
```
## SLUE score evaluation
The test set data and annotations will be used for the official SLUE score evaluation; however, we will not release the test set annotations. Thus, the SLUE score can be evaluated by submitting your prediction results in tsv format. We will prepare a website to accept your submissions. Please stay tuned for this.

## Model development rule
To train a model, you can use the fine-tuning and dev sets (audio, transcriptions, and annotations) of the SLUE tasks, but not the test set. Additionally, you can use any kind of external dataset, whether labeled or unlabeled, for any purpose of training (e.g., pre-training and fine-tuning). For validation of your model, you can use the official dev set we provide, or you can make your own splits or cross-validation splits by mixing the fine-tuning and dev sets together.
## Baselines
### ASR
#### Fine-tuning
Assuming that the preprocessed manifest files are in `manifest/slue-voxceleb` and `manifest/slue-voxpopuli` for SLUE-VoxCeleb and SLUE-VoxPopuli, the following commands fine-tune a wav2vec 2.0 base model on these two datasets using one GPU.
```sh
bash baselines/asr/ft-w2v2-base.sh manifest/slue-voxceleb save/asr/w2v2-base-vc
bash baselines/asr/ft-w2v2-base.sh manifest/slue-voxpopuli save/asr/w2v2-base-vp
```
#### Evaluation
To evaluate the fine-tuned wav2vec 2.0 ASR models on the dev set, please run the following commands.
```sh
python slue_toolkit/eval/eval_w2v.py eval_ctc_model save/asr/w2v2-base-vc --data manifest/slue-voxceleb --subset dev
python slue_toolkit/eval/eval_w2v.py eval_ctc_model save/asr/w2v2-base-vp --data manifest/slue-voxpopuli --subset dev
```
The WER will be printed directly.
The predictions are saved in `save/asr/w2v2-base-vc/pred-dev.wrd` and `save/asr/w2v2-base-vp/pred-dev.wrd` and can be used for pipeline models. More detailed baseline experiments are described [here](baselines/asr/README.md).
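As a hypothetical illustration of such a pipeline (not the SLUE baseline), one could feed the saved hypotheses to a text model, assuming `pred-dev.wrd` stores one hypothesis per line:
```python
# Hypothetical pipeline sketch: run a text classifier on saved ASR hypotheses.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
with open("save/asr/w2v2-base-vc/pred-dev.wrd") as f:
    hypotheses = [line.strip() for line in f if line.strip()]
print(classifier(hypotheses[:3]))
```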
### NER
#### Fine-tuning End-to-end model
Assuming that the preprocessed manifest files for SLUE-VoxPopuli are in `manifest/slue-voxpopuli`, the following command fine-tunes a wav2vec 2.0 base model using one GPU.
```sh
bash baselines/ner/e2e_scripts/ft-w2v2-base.sh manifest/slue-voxpopuli/e2e_ner save/e2e_ner/w2v2-base
```
#### Evaluating End-to-End model
To evaluate the fine-tuned wav2vec 2.0 E2E NER model on the dev set (decoding without a language model), please run the following command.
```sh
bash baselines/ner/e2e_scripts/eval-ner.sh w2v2-base dev combined nolm
```
More detailed baseline experiments are described [here](baselines/ner/README.md).

### Sentiment Analysis
#### Fine-tuning
This command fine-tunes a wav2vec 2.0 base model on the SLUE-VoxCeleb dataset:
```sh
bash baselines/sentiment/e2e_scripts/ft-w2v2-base-senti.sh manifest/slue-voxceleb save/sentiment/w2v2-base
```
#### Evaluation
To evaluate the fine-tuned wav2vec 2.0 sentiment model, run the following command or run `baselines/sentiment/e2e_scripts/eval.sh`:
```sh
python3 slue_toolkit/eval/eval_w2v_sentiment.py --save-dir save/sentiment/w2v2-base --data manifest/slue-voxceleb --subset dev
```
More detailed baseline experiments are described [here](baselines/sentiment/README.md).

### NEL
The instructions below assume that the preprocessed manifest files for SLUE-VoxPopuli are in `manifest/slue-voxpopuli/nel`.
#### Fine-tuning End-to-end model
NEL does not have a train split, and no separate fine-tuning is done for NEL. The baseline NEL algorithm uses fine-tuned NER models, so follow the instructions for [fine-tuning an E2E NER model](https://github.com/asappresearch/slue-toolkit/tree/main#fine-tuning-end-to-end-model).

#### Evaluating End-to-End model
To evaluate the fine-tuned wav2vec 2.0 E2E NER model on the dev set (decoding without a language model), please run the following commands.
```sh
bash baselines/nel/decode.sh e2e_ner dev
bash baselines/nel/eval_nel.sh e2e
```
More detailed baseline experiments are described [here](baselines/nel/README.md).

## How to submit for your test set evaluation
See here https://asappresearch.github.io/slue-toolkit/how-to-submit.html