# End-to-End Simultaneous Speech Translation with Differentiable Segmentation

> **[Shaolei Zhang](https://zhangshaolei1998.github.io/), Yang Feng**

Source code for our ACL 2023 paper "[End-to-End Simultaneous Speech Translation with Differentiable Segmentation](https://aclanthology.org/2023.findings-acl.485.pdf)". **Differentiable Segmentation (DiSeg)** adaptively segments speech into word-level segments, learning the segmentation from the underlying translation model in an unsupervised manner.

![DiSeg](./DiSeg.png)

## Overview
- [Installation](#installation)
- [Quick Start](#quick-start)
  - [Data Pre-processing](#data-pre-processing)
  - [Training](#training)
    - [0. (optional) Pre-training on MT Data](#0-optional-pre-training-on-mt-data)
    - [1. Training DiSeg](#1-training-diseg)
  - [Inference](#inference)
    - [1. Offline Speech Translation with DiSeg](#1-offline-speech-translation-with-diseg)
    - [2. Simultaneous Speech Translation with DiSeg](#2-simultaneous-speech-translation-with-diseg)
    - [3. Segment Speech with DiSeg](#3-segment-speech-with-diseg)
- [Results](#results)
- [Citation](#citation)

## Installation

- DiSeg is implemented on top of the open-source toolkit [Fairseq](https://github.com/pytorch/fairseq). Install DiSeg from source:

```bash
git clone https://github.com/ictnlp/DiSeg.git
cd DiSeg
pip install --editable ./
```
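
As an optional sanity check, you can confirm that the editable install is visible to Python:

```bash
# Optional: verify the editable Fairseq/DiSeg install is importable.
python -c "import fairseq; print(fairseq.__version__)"
```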

## Quick Start

### Data Pre-processing

We use [MuST-C](https://ict.fbk.eu/must-c/) data from English TED talks. Download `MUSTC_v1.0_en-${LANG}.tar.gz` to the path `${MUSTC_ROOT}`, and then preprocess it with [`shell_scripts/prep.sh`](shell_scripts/prep.sh):

```bash
bash shell_scripts/prep.sh
```
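
Note: if `prep.sh` expects the archive to be unpacked beforehand (an assumption; check the script for your setup), extract it in place first:

```bash
# Assumption: MUSTC_v1.0_en-de.tar.gz has already been downloaded to ${MUSTC_ROOT}.
MUSTC_ROOT=path_to_mustc_data
tar -xzf ${MUSTC_ROOT}/MUSTC_v1.0_en-de.tar.gz -C ${MUSTC_ROOT}
```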

Finally, the directory `${MUSTC_ROOT}` should look like:

```
.
├── en-de/
│   ├── config_raw.yaml
│   ├── spm_unigram10000_raw.model
│   ├── spm_unigram10000_raw.txt
│   ├── spm_unigram10000_raw.vocab
│   ├── dev_raw_st.tsv
│   ├── tst-COMMON_raw_st.tsv
│   ├── train_raw.tsv
│   ├── tst-COMMON_raw.tsv
│   ├── tst-HE_raw.tsv
│   ├── docs/
│   └── data/
├── en-de-text/
│   ├── train.spm.en
│   ├── train.spm.de
│   ├── dev.spm.en
│   ├── dev.spm.de
│   ├── tst-COMMON.spm.en
│   └── tst-COMMON.spm.de
├── data-bin/
│   └── mustc_en_de_text/
│       ├── dict.en.txt
│       ├── dict.de.txt
│       ├── preprocess.log
│       ├── ***.bin
│       └── ***.idx
├── en-de-simuleval/
│   ├── tst-COMMON/
│   │   ├── tst-COMMON.de
│   │   ├── tst-COMMON.wav_list
│   │   ├── ted_****_**.wav
│   │   └── ...
│   └── dev/
│       ├── dev.de
│       ├── dev.wav_list
│       ├── ted_****_**.wav
│       └── ...
└── MUSTC_v1.0_en-de.tar.gz
```

- The config file `config_raw.yaml` should look like this:

```yaml
bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: ABS_PATH_TO_SENTENCEPIECE_MODEL
input_channels: 1
prepend_tgt_lang_tag: true
use_audio_input: true
vocab_filename: spm_unigram10000_raw.txt
```
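
A minimal way to confirm the file parses as expected (assuming PyYAML is installed) is to load and print it:

```bash
# Minimal YAML check (assumes PyYAML): surfaces indentation mistakes early.
python -c "import yaml; print(yaml.safe_load(open('config_raw.yaml')))"
```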

- The training manifest `train_raw.tsv` is tab-separated and should look like:

```
id audio n_frames src_text tgt_text speaker src_lang tgt_lang
ted_1_0 /data/zhangshaolei/datasets/MuSTC_new/en-de/data/train/wav/ted_1.wav:98720:460800 460800 And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful. I have been blown away by this conference, and I want to thank all of you for the many nice comments about what I had to say the other night. Vielen Dank, Chris. Es ist mir wirklich eine Ehre, zweimal auf dieser Bühne stehen zu dürfen. Tausend Dank dafür. Ich bin wirklich begeistert von dieser Konferenz, und ich danke Ihnen allen für die vielen netten Kommentare zu meiner Rede vorgestern Abend. spk.1 en de
ted_1_1 /data/zhangshaolei/datasets/MuSTC_new/en-de/data/train/wav/ted_1.wav:560160:219040 219040 And I say that sincerely, partly because (Mock sob) I need that. (Laughter) Das meine ich ernst, teilweise deshalb — weil ich es wirklich brauchen kann! (Lachen) Versetzen Sie sich mal in meine Lage! (Lachen) (Applaus) Ich bin bin acht Jahre lang mit der Air Force Two geflogen. spk.1 en de
ted_1_2 /data/zhangshaolei/datasets/MuSTC_new/en-de/data/train/wav/ted_1.wav:779200:367200 367200 Now I have to take off my shoes or boots to get on an airplane! (Laughter) (Applause) Jetzt muss ich meine Schuhe ausziehen, um überhaupt an Bord zu kommen! (Applaus) spk.1 en de
ted_1_3 /data/zhangshaolei/datasets/MuSTC_new/en-de/data/train/wav/ted_1.wav:1161600:65920 65920 I'll tell you one quick story to illustrate what that's been like for me. Ich erzähle Ihnen mal eine Geschichte, dann verstehen Sie mich vielleicht besser. spk.1 en de
ted_1_4 /data/zhangshaolei/datasets/MuSTC_new/en-de/data/train/wav/ted_1.wav:1235520:128320 128320 It's a true story — every bit of this is true. Soon after Tipper and I left the — (Mock sob) White House — Eine wahre Geschichte — kein Wort daran ist erfunden. spk.1 en de
......
```
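
Since the manifest is tab-separated, a quick sanity check (the path is illustrative; adjust to your `${MUSTC_ROOT}`) is to count the fields per row:

```bash
# Expect 8 tab-separated fields per row (the first row is the header).
head -n 5 ${MUSTC_ROOT}/en-de/train_raw.tsv | awk -F'\t' '{print NF, $1}'
```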

### Training

#### 0. (optional) Pre-training on MT Data

Pre-training on MT data can speed up the convergence of DiSeg. Note that MT pre-training is optional; you can skip to the next step and train DiSeg directly.

- Pre-train on MuST-C MT data, following [`shell_scripts/pretrain.sh`](shell_scripts/pretrain.sh):

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

MUSTC_ROOT=path_to_mustc_data
LANG=de

PRETRAIN_DIR=path_to_save_pretrained_checkpoints
W2V_MODEL=path_to_wav2vec_model

python train.py ${MUSTC_ROOT}/en-${LANG} --text-data ${MUSTC_ROOT}/data-bin/mustc_en_${LANG}_text --tgt-lang ${LANG} --ddp-backend=legacy_ddp \
--config-yaml config_raw.yaml \
--train-subset train \
--valid-subset dev \
--save-dir ${PRETRAIN_DIR} \
--max-tokens 2000000 --max-tokens-text 8192 \
--update-freq 1 \
--task speech_to_text_multitask \
--criterion speech_to_text_multitask \
--label-smoothing 0.1 \
--arch convtransformer_espnet_base_wav2vec \
--w2v2-model-path ${W2V_MODEL} \
--optimizer adam \
--lr 2e-3 \
--lr-scheduler inverse_sqrt \
--warmup-updates 8000 \
--clip-norm 10.0 \
--seed 1 \
--ext-mt-training \
--eval-task ext_mt \
--eval-bleu \
--eval-bleu-args '{"beam": 1,"prefix_size":1}' \
--eval-bleu-print-samples \
--best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
--keep-best-checkpoints 10 \
--save-interval-updates 1000 \
--keep-interval-updates 15 \
--max-source-positions 800000 \
--skip-invalid-size-inputs-valid-test \
--dropout 0.1 --activation-dropout 0.1 --attention-dropout 0.1 --layernorm-embedding \
--empty-cache-freq 1000 \
--ignore-prefix-size 1 \
--patience 10 \
--fp16

```

- Average the 10 best checkpoints:

```bash
python scripts/average_checkpoints.py \
--inputs ${PRETRAIN_DIR} \
--num-update-checkpoints 10 \
--output ${PRETRAIN_DIR}/mt_pretrain_model.pt \
--best True
```

#### 1. Training DiSeg

Download the pre-trained [wav2vec 2.0](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt) model to `${W2V_MODEL}`. Train DiSeg with [`shell_scripts/train.sh`](shell_scripts/train.sh):

- Multi-task learning: `--st-training`, `--mt-training`, `--asr-training`
- Segment speech inputs: `--seg-speech`
- Apply token-level contrastive learning: `--add-speech-seg-text-ctr`

*PS: We find that first training an offline ST model (i.e., without `--seg-speech`) and then fine-tuning it with `--seg-speech` achieves better results; a sketch of this two-stage recipe follows the training command below.*

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

MUSTC_ROOT=path_to_mustc_data
LANG=de

SAVE_DIR=path_to_save_checkpoints
W2V_MODEL=path_to_wav2vec_model

mean=0
var=3

# (optional) pre-train a mt encoder/decoder and load the pre-trained model with --load-pretrained-mt-encoder-decoder-from ${PRETRAIN_DIR}/mt_pretrain_model.pt
python train.py ${MUSTC_ROOT}/en-${LANG} --tgt-lang ${LANG} --ddp-backend=legacy_ddp \
--config-yaml config_raw.yaml \
--train-subset train_raw \
--valid-subset dev_raw \
--save-dir ${SAVE_DIR} \
--max-tokens 1500000 --batch-size 32 --max-tokens-text 4096 \
--update-freq 1 \
--num-workers 8 \
--task speech_to_text_multitask \
--criterion speech_to_text_multitask_with_seg \
--report-accuracy \
--arch convtransformer_espnet_base_wav2vec_seg \
--w2v2-model-path ${W2V_MODEL} \
--optimizer adam \
--lr 0.0001 \
--lr-scheduler inverse_sqrt \
--weight-decay 0.0001 \
--label-smoothing 0.1 \
--warmup-updates 4000 \
--clip-norm 10.0 \
--seed 1 \
--seg-encoder-layers 6 \
--noise-mean ${mean} --noise-var ${var} \
--st-training --mt-training --asr-training \
--seg-speech --add-speech-seg-text-ctr \
--eval-task st \
--eval-bleu \
--eval-bleu-args '{"beam": 1,"prefix_size":1}' \
--eval-bleu-print-samples \
--best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
--keep-best-checkpoints 20 \
--save-interval-updates 1000 \
--keep-interval-updates 30 \
--max-source-positions 800000 \
--skip-invalid-size-inputs-valid-test \
--dropout 0.1 --activation-dropout 0.1 --attention-dropout 0.1 --layernorm-embedding \
--empty-cache-freq 1000 \
--ignore-prefix-size 1 \
--fp16

```
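
The two-stage recipe from the note above, as a sketch: `--restore-file` and the `--reset-*` options are standard fairseq flags, but whether to reset the optimizer state for this fine-tuning step is our assumption (check `python train.py --help` for your setup):

```bash
# Stage 1: run the training command above WITHOUT --seg-speech
# (offline ST model), saving checkpoints to a separate directory.
OFFLINE_DIR=path_to_offline_checkpoints   # hypothetical path

# Stage 2: rerun the full command above WITH --seg-speech --add-speech-seg-text-ctr,
# warm-starting from the offline checkpoint by appending:
#   --restore-file ${OFFLINE_DIR}/checkpoint_best.pt \
#   --reset-optimizer --reset-lr-scheduler --reset-dataloader
```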

### Inference

#### 1. Offline Speech Translation with DiSeg

Perform offline speech translation with [`shell_scripts/test.offline.sh`](shell_scripts/test.offline.sh):

```bash
export CUDA_VISIBLE_DEVICES=0

MUSTC_ROOT=path_to_mustc_data
LANG=de

SAVE_DIR=path_to_save_checkpoints

python scripts/average_checkpoints.py \
--inputs ${SAVE_DIR} \
--num-update-checkpoints 5 \
--output ${SAVE_DIR}/average-model.pt \
--best True

python fairseq_cli/generate.py ${MUSTC_ROOT}/en-${LANG} --tgt-lang ${LANG} \
--config-yaml config_raw.yaml \
--gen-subset tst-COMMON_raw \
--task speech_to_text_multitask \
--path ${SAVE_DIR}/average-model.pt \
--max-tokens 1000000 \
--batch-size 250 \
--beam 1 \
--scoring sacrebleu \
--prefix-size 1 \
--max-source-positions 1000000 \
--eval-task st

```

#### 2. Simultaneous Speech Translation with DiSeg

Perform **simultaneous speech translation** with [SimulEval](https://github.com/facebookresearch/SimulEval), following [`shell_scripts/test.simuleval.sh`](shell_scripts/test.simuleval.sh):

- Install [SimulEval@2db1a59](https://github.com/facebookresearch/SimulEval/tree/2db1a590af11c28f6a3f67779568c4589b922cf1):

```bash
git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
git checkout 2db1a590af11c28f6a3f67779568c4589b922cf1
pip install -e .
```

- Simultaneous speech translation with agent [`diseg_agent.py`](diseg_agent.py):

```bash
export CUDA_VISIBLE_DEVICES=0

MUSTC_ROOT=path_to_mustc_data
LANG=de
EVAL_ROOT=path_to_save_simuleval_data
SAVE_DIR=path_to_save_checkpoints
OUTPUT_DIR=path_to_save_simuleval_results

lagging_seg=5 # number of lagging segments (k) in DiSeg

simuleval --agent diseg_agent.py \
--source ${EVAL_ROOT}/tst-COMMON/tst-COMMON.wav_list \
--target ${EVAL_ROOT}/tst-COMMON/tst-COMMON.${LANG} \
--data-bin ${MUSTC_ROOT}/en-${LANG} \
--config config_raw.yaml \
--model-path ${SAVE_DIR}/average-model.pt \
--output ${OUTPUT_DIR} \
--lagging-segment ${lagging_seg} \
--lang ${LANG} \
--scores --gpu --fp16 \
--port 12345

```
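
With `--scores`, SimulEval reports quality (BLEU) and latency (CW, AP, AL, DAL) when the run finishes; per-instance logs are also written under `${OUTPUT_DIR}` (file names vary across SimulEval versions), so you can inspect them afterwards:

```bash
# List whatever the evaluation wrote (e.g., per-instance logs and a score summary).
ls -l ${OUTPUT_DIR}
```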

#### 3. Segment Speech with DiSeg

You can **segment any speech with a trained DiSeg model**, following [`shell_scripts/seg.sh`](shell_scripts/seg.sh):

```bash
export CUDA_VISIBLE_DEVICES=0

MUSTC_ROOT=path_to_mustc_data
LANG=de
SAVE_DIR=path_to_save_checkpoints
OUTPUT_SEG=path_to_save_segment

WAV=path_to_wav_file

python segment.py ${MUSTC_ROOT}/en-${LANG} \
--task speech_to_text_multitask \
--config-yaml config_raw.yaml \
--ckpt ${SAVE_DIR}/average-model.pt \
--save-root ${OUTPUT_SEG} \
--wav ${WAV}
```

## Results

- DiSeg's performance on MuST-C English-to-German (`k` is the number of lagging segments; CW, AL, and DAL are latency in milliseconds; AP is a ratio):

| k | CW | AP | AL | DAL | BLEU | TER | chrF | chrF++ |
| :--: | :--: | :--: | :--: | :--: | :---: | :---: | :---: | :----: |
| 1 | 462 | 0.67 | 1102 | 1518 | 18.85 | 73.13 | 44.29 | 42.31 |
| 3 | 553 | 0.76 | 1514 | 1967 | 20.74 | 69.95 | 49.34 | 47.09 |
| 5 | 666 | 0.82 | 1928 | 2338 | 22.11 | 66.90 | 50.13 | 47.94 |
| 7 | 850 | 0.86 | 2370 | 2732 | 22.98 | 65.42 | 50.36 | 48.23 |
| 9 | 1084 | 0.90 | 2785 | 3115 | 23.01 | 65.48 | 50.24 | 48.13 |
| 11 | 1354 | 0.92 | 3168 | 3464 | 23.13 | 65.04 | 50.42 | 48.31 |
| 13 | 1632 | 0.94 | 3575 | 3846 | 23.05 | 64.85 | 50.53 | 48.41 |
| 15 | 1935 | 0.96 | 3801 | 4040 | 23.12 | 64.92 | 50.47 | 48.36 |

- DiSeg's performance on MuST-C English-to-Spanish:

| k | CW | AP | AL | DAL | BLEU | TER | chrF | chrF++ |
| :--: | :--: | :--: | :--: | :--: | :---: | :---: | :---: | :----: |
| 1 | 530 | 0.67 | 1144 | 1625 | 22.03 | 71.34 | 45.69 | 43.82 |
| 3 | 563 | 0.76 | 1504 | 2107 | 24.49 | 66.63 | 53.09 | 50.85 |
| 5 | 632 | 0.81 | 1810 | 2364 | 26.58 | 63.35 | 54.55 | 52.39 |
| 7 | 788 | 0.85 | 2249 | 2764 | 27.81 | 61.87 | 55.28 | 53.16 |
| 9 | 1010 | 0.89 | 2694 | 3164 | 28.33 | 60.98 | 55.51 | 53.40 |
| 11 | 1257 | 0.92 | 3108 | 3530 | 28.59 | 60.63 | 55.64 | 53.55 |
| 13 | 1534 | 0.94 | 3479 | 3855 | 28.72 | 60.49 | 55.61 | 53.53 |
| 15 | 1835 | 0.95 | 3819 | 4160 | 28.92 | 60.22 | 55.80 | 53.71 |

## Citation

If you have any questions, feel free to contact me at `[email protected]`.

If this repository is useful for you, please cite as:

```bibtex
@inproceedings{DiSeg,
    title = "End-to-End Simultaneous Speech Translation with Differentiable Segmentation",
    author = "Zhang, Shaolei and Feng, Yang",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.485",
    pages = "7659--7680",
}
```