{"id":13775249,"url":"https://github.com/ictnlp/streamspeech","last_synced_at":"2025-05-16T11:03:47.337Z","repository":{"id":242662547,"uuid":"810195157","full_name":"ictnlp/StreamSpeech","owner":"ictnlp","description":"StreamSpeech is an “All in One” seamless model for offline and simultaneous speech recognition, speech translation and speech synthesis.","archived":false,"fork":false,"pushed_at":"2024-08-24T09:01:06.000Z","size":19119,"stargazers_count":1053,"open_issues_count":14,"forks_count":80,"subscribers_count":13,"default_branch":"main","last_synced_at":"2025-04-09T05:05:16.055Z","etag":null,"topics":["all-in-one","asr","audio-processing","machine-translation","non-autoregressive","seamless","simultaneous-translation","speech","speech-enhancement","speech-processing","speech-recognition","speech-synthesis","speech-to-text","speech-translation","streaming-audio","text-to-audio","text-to-speech","translation","tts","voice"],"latest_commit_sha":null,"homepage":"https://ictnlp.github.io/StreamSpeech-site/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ictnlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-04T08:25:10.000Z","updated_at":"2025-04-08T00:04:21.000Z","dependencies_parsed_at":"2024-12-06T10:03:58.677Z","dependency_job_id":"a0e55531-94e6-49e3-a880-6f0156c6aa70","html_url":"https://github.com/ictnlp/StreamSpeech","commit_stats":null,"previous_names":["ictnlp/streamspeech"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/ictnlp%2FStreamSpeech","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ictnlp%2FStreamSpeech/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ictnlp%2FStreamSpeech/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ictnlp%2FStreamSpeech/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ictnlp","download_url":"https://codeload.github.com/ictnlp/StreamSpeech/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254518383,"owners_count":22084374,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["all-in-one","asr","audio-processing","machine-translation","non-autoregressive","seamless","simultaneous-translation","speech","speech-enhancement","speech-processing","speech-recognition","speech-synthesis","speech-to-text","speech-translation","streaming-audio","text-to-audio","text-to-speech","translation","tts","voice"],"created_at":"2024-08-03T17:01:35.871Z","updated_at":"2025-05-16T11:03:47.302Z","avatar_url":"https://github.com/ictnlp.png","language":"Python","funding_links":[],"categories":["Projekte"],"sub_categories":["🗣️ Voice"],"readme":"# 
StreamSpeech\n\n[![arXiv](https://img.shields.io/badge/arXiv-2406.03049-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2406.03049)\n[![project](https://img.shields.io/badge/%F0%9F%8E%A7%20Demo-Listen%20to%20StreamSpeech-orange.svg)](https://ictnlp.github.io/StreamSpeech-site/)\n[![model](https://img.shields.io/badge/%F0%9F%A4%97%20-StreamSpeech_Models-blue.svg)](https://huggingface.co/ICTNLP/StreamSpeech_Models/tree/main)\n[![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2Fictnlp%2FStreamSpeech\u0026count_bg=%2379C83D\u0026title_bg=%23555555\u0026icon=awesomelists.svg\u0026icon_color=%23E7E7E7\u0026title=Visitors\u0026edge_flat=false)](https://hits.seeyoufarm.com)\n\n[![twitter](https://img.shields.io/badge/Twitter-@Gorden%20Sun-black?logo=X\u0026logoColor=black)](https://x.com/Gorden_Sun/status/1798742796524007845) [![twitter](https://img.shields.io/badge/Twitter-@imxiaohu-black?logo=X\u0026logoColor=black)](https://x.com/imxiaohu/status/1798999363987124355)\n\n\u003e **Authors**: **[Shaolei Zhang](https://zhangshaolei1998.github.io/), [Qingkai Fang](https://fangqingkai.github.io/), [Shoutao Guo](https://scholar.google.com.hk/citations?user=XwHtPyAAAAAJ\u0026hl), [Zhengrui Ma](https://scholar.google.com.hk/citations?user=dUgq6tEAAAAJ), [Min Zhang](https://scholar.google.com.hk/citations?user=CncXH-YAAAAJ), [Yang Feng*](https://people.ucas.edu.cn/~yangfeng?language=en)**\n\n\nCode for ACL 2024 paper \"[StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning](https://arxiv.org/pdf/2406.03049)\".\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"./assets/streamspeech.png\" alt=\"StreamSpeech\" style=\"width: 70%; min-width: 300px; display: block; margin: auto;\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  🎧 Listen to \u003ca href=\"https://ictnlp.github.io/StreamSpeech-site/\"\u003eStreamSpeech's translated speech\u003c/a\u003e 🎧 \n\u003c/p\u003e\n\n💡**Highlight**:\n1. 
StreamSpeech achieves **SOTA performance** on both offline and simultaneous speech-to-speech translation.\n2. StreamSpeech performs **streaming ASR**, **simultaneous speech-to-text translation** and **simultaneous speech-to-speech translation** via an \"All in One\" seamless model.\n3. StreamSpeech can present intermediate results (i.e., ASR or translation results) during simultaneous translation, offering a more comprehensive low-latency communication experience.\n\n## 🔥News\n- [06.17] Added a [Web GUI demo](./demo); you can now try StreamSpeech in your local browser.\n- [06.05] [Paper](https://arxiv.org/pdf/2406.03049), [code](https://github.com/ictnlp/StreamSpeech), [models](https://huggingface.co/ICTNLP/StreamSpeech_Models/tree/main) and [demo](https://ictnlp.github.io/StreamSpeech-site/) of StreamSpeech are available!\n\n## ⭐Features\n\n### Support 8 Tasks\n- **Offline**: Speech Recognition (ASR)✅, Speech-to-Text Translation (S2TT)✅, Speech-to-Speech Translation (S2ST)✅, Speech Synthesis (TTS)✅\n- **Simultaneous**: Streaming ASR✅, Simultaneous S2TT✅, Simultaneous S2ST✅, Real-time TTS✅ at any latency (with a single model)\n\n### GUI Demo\n\nhttps://github.com/ictnlp/StreamSpeech/assets/34680227/4d9bdabf-af66-4320-ae7d-0f23e721cd71\n\u003cp align=\"center\"\u003e\n  Simultaneously provides ASR, translation, and synthesis results via a seamless model\n\u003c/p\u003e\n\n### Case\n\n\u003e **Speech Input**: [example/wavs/common_voice_fr_17301936.mp3](./example/wavs/common_voice_fr_17301936.mp3)\n\u003e\n\u003e **Transcription** (ground truth): jai donc lexpérience des années passées jen dirai un mot tout à lheure\n\u003e\n\u003e **Translation** (ground truth): i therefore have the experience of the passed years i'll say a few words about that later\n\n| StreamSpeech                                    | Simultaneous                                                 | Offline                                                      |\n| 
----------------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |\n| **Speech Recognition**                          | jai donc expérience des années passé jen dirairai un mot tout à lheure | jai donc lexpérience des années passé jen dirairai un mot tout à lheure |\n| **Speech-to-Text Translation**                  | i therefore have an experience of last years i will tell a word later | so i have the experience in the past years i'll say a word later |\n| **Speech-to-Speech Translation**                | \u003cvideo src='https://github.com/zhangshaolei1998/StreamSpeech_dev/assets/34680227/ed41ba13-353b-489b-acfa-85563d0cc2cb' width=\"30%\"/\u003e                          | \u003cvideo src='https://github.com/zhangshaolei1998/StreamSpeech_dev/assets/34680227/ca482ba6-76da-4619-9dfd-24aa2eb3339a' width=\"30%\"/\u003e                          |\n| **Text-to-Speech Synthesis** (*incrementally synthesize speech word by word*) | \u003cvideo src='https://github.com/zhangshaolei1998/StreamSpeech_dev/assets/34680227/294f1310-eace-4914-be30-5cd798e8592e' width=\"30%\"/\u003e                          | \u003cvideo src='https://github.com/zhangshaolei1998/StreamSpeech_dev/assets/34680227/52854163-7fc5-4622-a5a6-c133cbd99e58' width=\"30%\"/\u003e                          |\n\n\n\n## ⚙Requirements\n\n- Python == 3.10, PyTorch == 2.0.1, Install fairseq \u0026 SimulEval\n\n  ```bash\n  cd fairseq\n  pip install --editable ./ --no-build-isolation\n  cd SimulEval\n  pip install --editable ./\n  ```\n\n## 🚀Quick Start\n\n### 1. 
Model Download\n\n#### (1) StreamSpeech Models\n\n| Language | UnitY                                                        | StreamSpeech (offline)                                       | StreamSpeech (simultaneous)                                  |\n| -------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |\n| Fr-En    | unity.fr-en.pt [[Huggingface](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/unity.fr-en.pt)] [[Baidu](https://pan.baidu.com/s/10uGYgl0xTej9FP43iKx7Cg?pwd=nkvu)] | streamspeech.offline.fr-en.pt [[Huggingface](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.offline.fr-en.pt)] [[Baidu](https://pan.baidu.com/s/1GFckHGP5SNLuOEj6mbIWhQ?pwd=pwgq)] | streamspeech.simultaneous.fr-en.pt [[Huggingface](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.simultaneous.fr-en.pt)] [[Baidu](https://pan.baidu.com/s/1edCPFljogyDHgGXkUV8_3w?pwd=8gg3)] |\n| Es-En    | unity.es-en.pt [[Huggingface](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/unity.es-en.pt)] [[Baidu](https://pan.baidu.com/s/1RwIEHye8jjw3kiIgrCHA3A?pwd=hde4)] | streamspeech.offline.es-en.pt [[Huggingface](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.offline.es-en.pt)] [[Baidu](https://pan.baidu.com/s/1T89G4NC4J0Ofzcsc8Rt2Ww?pwd=yuhd)] | streamspeech.simultaneous.es-en.pt [[Huggingface](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.simultaneous.es-en.pt)] [[Baidu](https://pan.baidu.com/s/1NbLEVcYWHIdqqLD17P1s9g?pwd=p1pc)] |\n| De-En    | unity.de-en.pt [[Huggingface](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/unity.de-en.pt)] [[Baidu](https://pan.baidu.com/s/1Mg_PBeZ5acEDhl5wRJ_-7w?pwd=egvv)] | streamspeech.offline.de-en.pt 
[[Huggingface](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.offline.de-en.pt)] [[Baidu](https://pan.baidu.com/s/1mTE4eHuVLJPB7Yg9AackEg?pwd=6ga8)] | streamspeech.simultaneous.de-en.pt [[Huggingface](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.simultaneous.de-en.pt)] [[Baidu](https://pan.baidu.com/s/1DYPMg3mdDopLY70BYQTduQ?pwd=r7kw)] |\n\n#### (2) Unit-based HiFi-GAN Vocoder\n\n| Unit config       | Unit size | Vocoder language | Dataset                                             | Model                                                        |\n| ----------------- | --------- | ---------------- | --------------------------------------------------- | ------------------------------------------------------------ |\n| mHuBERT, layer 11 | 1000      | En               | [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) | [ckpt](https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/g_00500000), [config](https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/config.json) |\n\n### 2. Prepare Data and Config (only for test/inference)\n\n#### (1) Config Files\n\nReplace `/data/zhangshaolei/StreamSpeech` in files [configs/fr-en/config_gcmvn.yaml](./configs/fr-en/config_gcmvn.yaml) and [configs/fr-en/config_mtl_asr_st_ctcst.yaml](./configs/fr-en/config_mtl_asr_st_ctcst.yaml) with your local address of StreamSpeech repo.\n\n#### (2) Test Data\n\nPrepare test data following [SimulEval](https://github.com/facebookresearch/SimulEval) format. [example/](./example) provides an example:\n\n- [wav_list.txt](./example/wav_list.txt): Each line records the path of a source speech.\n- [target.txt](./example/target.txt): Each line records the reference text, e.g., target translation or source transcription (used to calculate the metrics).\n\n### 3. 
Inference with SimulEval\n\nUse these scripts to run StreamSpeech inference for streaming ASR, simultaneous S2TT and simultaneous S2ST.\n\n\u003e `--source-segment-size`: set the chunk size (in milliseconds) to any value to control the latency\n\n\u003cdetails\u003e\n\u003csummary\u003eSimultaneous Speech-to-Speech Translation\u003c/summary\u003e\n\n`--output-asr-translation`: whether to output the intermediate ASR and translated text results during simultaneous speech-to-speech translation.\n\n```shell\nexport CUDA_VISIBLE_DEVICES=0\n\nROOT=/data/zhangshaolei/StreamSpeech # path to StreamSpeech repo\nPRETRAIN_ROOT=/data/zhangshaolei/pretrain_models \nVOCODER_CKPT=$PRETRAIN_ROOT/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/g_00500000 # path to downloaded Unit-based HiFi-GAN Vocoder checkpoint\nVOCODER_CFG=$PRETRAIN_ROOT/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/config.json # path to downloaded Unit-based HiFi-GAN Vocoder config\n\nLANG=fr\nfile=streamspeech.simultaneous.${LANG}-en.pt # path to downloaded StreamSpeech model\noutput_dir=$ROOT/res/streamspeech.simultaneous.${LANG}-en/simul-s2st\n\nchunk_size=320 #ms\nPYTHONPATH=$ROOT/fairseq simuleval --data-bin ${ROOT}/configs/${LANG}-en \\\n    --user-dir ${ROOT}/researches/ctc_unity --agent-dir ${ROOT}/agent \\\n    --source example/wav_list.txt --target example/target.txt \\\n    --model-path $file \\\n    --config-yaml config_gcmvn.yaml --multitask-config-yaml config_mtl_asr_st_ctcst.yaml \\\n    --agent $ROOT/agent/speech_to_speech.streamspeech.agent.py \\\n    --vocoder $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG --dur-prediction \\\n    --output $output_dir/chunk_size=$chunk_size \\\n    --source-segment-size $chunk_size \\\n    --quality-metrics ASR_BLEU  --target-speech-lang en --latency-metrics AL AP DAL StartOffset EndOffset LAAL ATD NumChunks DiscontinuitySum DiscontinuityAve DiscontinuityNum RTF \\\n    --device gpu --computation-aware \\\n    --output-asr-translation True\n```\n\nYou should get the following 
outputs:\n\n```\nfairseq plugins loaded...\nfairseq plugins loaded...\nfairseq plugins loaded...\nfairseq plugins loaded...\n2024-06-06 09:45:46 | INFO     | fairseq.tasks.speech_to_speech | dictionary size: 1,004\nimport agents...\nRemoving weight norm...\n2024-06-06 09:45:50 | INFO     | agent.tts.vocoder | loaded CodeHiFiGAN checkpoint from /data/zhangshaolei/pretrain_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/g_00500000\n2024-06-06 09:45:50 | INFO     | simuleval.utils.agent | System will run on device: gpu.\n2024-06-06 09:45:50 | INFO     | simuleval.dataloader | Evaluating from speech to speech.\n  0%|                                                                                                                                                                              | 0/2 [00:00\u003c?, ?it/s]\nStreaming ASR: \nStreaming ASR: \nStreaming ASR: je\nSimultaneous translation: i would\nStreaming ASR: je voudrais\nSimultaneous translation: i would like to\nStreaming ASR: je voudrais soumettre\nSimultaneous translation: i would like to sub\nStreaming ASR: je voudrais soumettre cette\nSimultaneous translation: i would like to submit\nStreaming ASR: je voudrais soumettre cette idée\nSimultaneous translation: i would like to submit this\nStreaming ASR: je voudrais soumettre cette idée à la\nSimultaneous translation: i would like to submit this idea to\nStreaming ASR: je voudrais soumettre cette idée à la réflexion\nSimultaneous translation: i would like to submit this idea to the\nStreaming ASR: je voudrais soumettre cette idée à la réflexion de\nSimultaneous translation: i would like to submit this idea to the reflection\nStreaming ASR: je voudrais soumettre cette idée à la réflexion de lassemblée\nSimultaneous translation: i would like to submit this idea to the reflection of\nStreaming ASR: je voudrais soumettre cette idée à la réflexion de lassemblée nationale\nSimultaneous translation: i would like to submit this idea to the reflection of 
the\nStreaming ASR: je voudrais soumettre cette idée à la réflexion de lassemblée nationale\nSimultaneous translation: i would like to submit this idea to the reflection of the national assembly\n 50%|███████████████████████████████████████████████████████████████████████████████████                                                                                   | 1/2 [00:04\u003c00:04,  4.08s/it]\nStreaming ASR: \nStreaming ASR: \nStreaming ASR: \nStreaming ASR: \nStreaming ASR: jai donc\nSimultaneous translation: i therefore\nStreaming ASR: jai donc\nStreaming ASR: jai donc expérience des\nSimultaneous translation: i therefore have an experience\nStreaming ASR: jai donc expérience des années\nStreaming ASR: jai donc expérience des années passé\nSimultaneous translation: i therefore have an experience of last\nStreaming ASR: jai donc expérience des années passé jen\nSimultaneous translation: i therefore have an experience of last years\nStreaming ASR: jai donc expérience des années passé jen dirairai\nSimultaneous translation: i therefore have an experience of last years i will\nStreaming ASR: jai donc expérience des années passé jen dirairai un mot\nSimultaneous translation: i therefore have an experience of last years i will tell a\nStreaming ASR: jai donc expérience des années passé jen dirairai un mot tout à lheure\nSimultaneous translation: i therefore have an experience of last years i will tell a word\nStreaming ASR: jai donc expérience des années passé jen dirairai un mot tout à lheure\nSimultaneous translation: i therefore have an experience of last years i will tell a word later\n100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06\u003c00:00,  3.02s/it]\n2024-06-06 09:45:56 | WARNING  | simuleval.scorer.asr_bleu | Beta feature: Evaluating speech output. 
Faieseq is required.\n2024-06-06 09:46:12 | INFO | fairseq.tasks.audio_finetuning | Using dict_path : /data/zhangshaolei/.cache/ust_asr/en/dict.ltr.txt\nTranscribing predictions: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01\u003c00:00,  1.63it/s]\n2024-06-06 09:46:21 | INFO     | simuleval.sentence_level_evaluator | Results:\n ASR_BLEU       AL    AL_CA    AP  AP_CA      DAL  DAL_CA  StartOffset  StartOffset_CA  EndOffset  EndOffset_CA     LAAL  LAAL_CA      ATD   ATD_CA  NumChunks  NumChunks_CA  DiscontinuitySum  DiscontinuitySum_CA  DiscontinuityAve  DiscontinuityAve_CA  DiscontinuityNum  DiscontinuityNum_CA   RTF  RTF_CA\n   15.448 1724.895 2913.508 0.425  0.776 1358.812 3137.55       1280.0        2213.906     1366.0        1366.0 1724.895 2913.508 1440.146 3389.374        9.5           9.5             110.0                110.0              55.0                 55.0                 1                    1 1.326   1.326\n\n```\n\nLogs and evaluation results are stored in ` $output_dir/chunk_size=$chunk_size`:\n\n```\n$output_dir/chunk_size=$chunk_size\n├── wavs/\n│   ├── 0_pred.wav # generated speech\n│   ├── 1_pred.wav \n│   ├── 0_pred.txt # asr transcription for ASR-BLEU tookit\n│   ├── 1_pred.txt \n├── config.yaml\n├── asr_transcripts.txt # ASR-BLEU transcription results\n├── metrics.tsv\n├── scores.tsv\n├── asr_cmd.bash\n└── instances.log # logs of Simul-S2ST\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eSimultaneous Speech-to-Text Translation\u003c/summary\u003e\n\n```shell\nexport CUDA_VISIBLE_DEVICES=0\n\nROOT=/data/zhangshaolei/StreamSpeech # path to StreamSpeech repo\n\nLANG=fr\nfile=streamspeech.simultaneous.${LANG}-en.pt # path to downloaded StreamSpeech model\noutput_dir=$ROOT/res/streamspeech.simultaneous.${LANG}-en/simul-s2tt\n\nchunk_size=320 #ms\nPYTHONPATH=$ROOT/fairseq simuleval --data-bin 
${ROOT}/configs/${LANG}-en \\\n    --user-dir ${ROOT}/researches/ctc_unity --agent-dir ${ROOT}/agent \\\n    --source example/wav_list.txt --target example/target.txt \\\n    --model-path $file \\\n    --config-yaml config_gcmvn.yaml --multitask-config-yaml config_mtl_asr_st_ctcst.yaml \\\n    --agent $ROOT/agent/speech_to_text.s2tt.streamspeech.agent.py\\\n    --output $output_dir/chunk_size=$chunk_size \\\n    --source-segment-size $chunk_size \\\n    --quality-metrics BLEU  --latency-metrics AL AP DAL StartOffset EndOffset LAAL ATD NumChunks RTF \\\n    --device gpu --computation-aware \n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eStreaming ASR\u003c/summary\u003e\n\n```shell\nexport CUDA_VISIBLE_DEVICES=0\n\nROOT=/data/zhangshaolei/StreamSpeech # path to StreamSpeech repo\n\nLANG=fr\nfile=streamspeech.simultaneous.${LANG}-en.pt # path to downloaded StreamSpeech model\noutput_dir=$ROOT/res/streamspeech.simultaneous.${LANG}-en/streaming-asr\n\nchunk_size=320 #ms\nPYTHONPATH=$ROOT/fairseq simuleval --data-bin ${ROOT}/configs/${LANG}-en \\\n    --user-dir ${ROOT}/researches/ctc_unity --agent-dir ${ROOT}/agent \\\n    --source example/wav_list.txt --target example/source.txt \\\n    --model-path $file \\\n    --config-yaml config_gcmvn.yaml --multitask-config-yaml config_mtl_asr_st_ctcst.yaml \\\n    --agent $ROOT/agent/speech_to_text.asr.streamspeech.agent.py\\\n    --output $output_dir/chunk_size=$chunk_size \\\n    --source-segment-size $chunk_size \\\n    --quality-metrics BLEU  --latency-metrics AL AP DAL StartOffset EndOffset LAAL ATD NumChunks RTF \\\n    --device gpu --computation-aware \n```\n\u003c/details\u003e\n\n## 🎈Develop Your Own StreamSpeech\n\n### 1. Data Preprocess\n\n- Follow [`./preprocess_scripts`](./preprocess_scripts) to process CVSS-C data. \n\n### 2. 
Training\n\n\u003e [!Note]\n\u003e You can directly use the [downloaded StreamSpeech model](#1-model-download) for evaluation and skip training.\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"./assets/model.png\" alt=\"model\" style=\"width: 100%; min-width: 300px; display: block; margin: auto;\"\u003e\n\u003c/p\u003e\n\n- Follow [`researches/ctc_unity/train_scripts/train.simul-s2st.sh`](./researches/ctc_unity/train_scripts/train.simul-s2st.sh) to train StreamSpeech for simultaneous speech-to-speech translation.\n- Follow [`researches/ctc_unity/train_scripts/train.offline-s2st.sh`](./researches/ctc_unity/train_scripts/train.offline-s2st.sh) to train StreamSpeech for offline speech-to-speech translation.\n- We also provide some other StreamSpeech variants and baseline implementations.\n\n| Model             | --user-dir                 | --arch                            | Description                                                  |\n| ----------------- | -------------------------- | --------------------------------- | ------------------------------------------------------------ |\n| **Translatotron 2** | `researches/translatotron` | `s2spect2_conformer_modified`     | [Translatotron 2](https://proceedings.mlr.press/v162/jia22b.html) |\n| **UnitY**         | `researches/translatotron` | `unity_conformer_modified`        | [UnitY](https://aclanthology.org/2023.acl-long.872/)         |\n| **Uni-UnitY**     | `researches/uni_unity`     | `uni_unity_conformer`             | Change all encoders in UnitY into unidirectional             |\n| **Chunk-UnitY**   | `researches/chunk_unity`   | `chunk_unity_conformer`           | Change the Conformer in UnitY into Chunk-based Conformer     |\n| **StreamSpeech**  | `researches/ctc_unity`     | `streamspeech`                    | StreamSpeech                                                 |\n| **StreamSpeech (cascade)** | `researches/ctc_unity` | `streamspeech_cascade` | Cascaded StreamSpeech of S2TT and 
TTS. The TTS module can be used independently for real-time TTS given incremental text. |\n| **HMT**           | `researches/hmt`           | `hmt_transformer_iwslt_de_en`     | [HMT](https://openreview.net/forum?id=9y0HFvaAYD6): strong simultaneous text-to-text translation method |\n| **DiSeg**         | `researches/diseg`         | `convtransformer_espnet_base_seg` | [DiSeg](https://aclanthology.org/2023.findings-acl.485/): strong simultaneous speech-to-text translation method |\n\n\u003e [!Tip]\n\u003e The `train_scripts/` and `test_scripts/` folders in each `--user-dir` directory contain the training and testing scripts for that model.\n\u003e Refer to the official repos of [UnitY](https://github.com/facebookresearch/fairseq/blob/main/fairseq/models/speech_to_speech/s2s_conformer_unity.py), [Translatotron 2](https://github.com/facebookresearch/fairseq/blob/main/fairseq/models/speech_to_speech/s2s_conformer_translatotron2.py), [HMT](https://github.com/ictnlp/HMT) and [DiSeg](https://github.com/ictnlp/DiSeg) for more details.\n\n### 3. Evaluation\n\n#### (1) Offline Evaluation\n\nFollow [`pred.offline-s2st.sh`](./researches/ctc_unity/test_scripts/pred.offline-s2st.sh) to evaluate the offline performance of StreamSpeech on ASR, S2TT and S2ST.\n\n#### (2) Simultaneous Evaluation\n\nA trained StreamSpeech model can be used for streaming ASR, simultaneous speech-to-text translation and simultaneous speech-to-speech translation. 
We provide agents in [agent/](./agent) for these three tasks:\n\n- `agent/speech_to_speech.streamspeech.agent.py`: simultaneous speech-to-speech translation\n- `agent/speech_to_text.s2tt.streamspeech.agent.py`: simultaneous speech-to-text translation\n- `agent/speech_to_text.asr.streamspeech.agent.py`: streaming ASR\n\nFollow [`simuleval.simul-s2st.sh`](./researches/ctc_unity/test_scripts/simuleval.simul-s2st.sh), [`simuleval.simul-s2tt.sh`](./researches/ctc_unity/test_scripts/simuleval.simul-s2tt.sh) and [`simuleval.streaming-asr.sh`](./researches/ctc_unity/test_scripts/simuleval.streaming-asr.sh) to evaluate StreamSpeech.\n\n### 4. Our Results\n\nOur project page ([https://ictnlp.github.io/StreamSpeech-site/](https://ictnlp.github.io/StreamSpeech-site/)) provides samples of translated speech generated by StreamSpeech. Listen to them 🎧.\n\n#### (1) Offline Speech-to-Speech Translation  ( ASR-BLEU: quality )\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"./assets/offline_results.png\" alt=\"offline\" style=\"width: 100%; min-width: 300px; display: block; margin: auto;\"\u003e\n\u003c/p\u003e\n\n#### (2) Simultaneous Speech-to-Speech Translation  ( AL: latency  |  ASR-BLEU: quality )\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"./assets/simultaneous_results.png\" alt=\"simul\" style=\"width: 100%; min-width: 300px; display: block; margin: auto;\"\u003e\n\u003c/p\u003e\n\n#### (3) Simultaneous Speech-to-Text Translation  ( AL: latency  |  BLEU: quality )\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"./assets/s2tt.png\" alt=\"simul\" style=\"width: 38%; min-width: 300px; display: block; margin: auto;\"\u003e\n\u003c/p\u003e\n\n#### (4) Streaming ASR  ( AL: latency  |  WER: quality )\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"./assets/asr.png\" alt=\"simul\" style=\"width: 50%; min-width: 300px; display: block; margin: auto;\"\u003e\n\u003c/p\u003e\n\n## 🖋Citation\n\nIf you have any questions, please feel 
free to submit an issue or contact `zhangshaolei20z@ict.ac.cn`.\n\nIf our work is useful for you, please cite as:\n\n```\n@inproceedings{streamspeech,\n      title={StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning}, \n      author={Shaolei Zhang and Qingkai Fang and Shoutao Guo and Zhengrui Ma and Min Zhang and Yang Feng},\n      year={2024},\n      booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Long Papers)},\n      publisher = {Association for Computational Linguistics}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fictnlp%2Fstreamspeech","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fictnlp%2Fstreamspeech","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fictnlp%2Fstreamspeech/lists"}