https://github.com/picovoice/speech-to-text-benchmark
speech to text benchmark framework
- Host: GitHub
- URL: https://github.com/picovoice/speech-to-text-benchmark
- Owner: Picovoice
- License: apache-2.0
- Created: 2018-08-04T02:52:01.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2024-01-12T00:17:59.000Z (over 1 year ago)
- Last Synced: 2024-02-17T08:34:39.932Z (about 1 year ago)
- Topics: aws-transcribe, cheetah, deep-learning, deep-neural-networks, deepspeech, edge-ai, google-speech-to-text, mozilla-deepspeech, offline, picovoice, pocketsphinx, privacy, speech-recognition, speech-to-text, voice-recognition
- Language: Python
- Homepage: https://picovoice.ai/
- Size: 159 MB
- Stars: 577
- Watchers: 28
- Forks: 62
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# Speech-to-Text Benchmark
Made in Vancouver, Canada by [Picovoice](https://picovoice.ai)
This repo is a minimalist and extensible framework for benchmarking different speech-to-text engines.
## Table of Contents
- [Data](#data)
- [Metrics](#metrics)
- [Engines](#engines)
- [Usage](#usage)
- [Results](#results)

## Data
- [LibriSpeech](http://www.openslr.org/12/)
- [TED-LIUM](https://www.openslr.org/7/)
- [Common Voice](https://commonvoice.mozilla.org/en)
- [Multilingual LibriSpeech](https://openslr.org/94)
- [VoxPopuli](https://github.com/facebookresearch/voxpopuli)

## Metrics
### Word Error Rate
Word error rate (WER) is the ratio of the word-level edit distance between a reference transcript and the output
of the speech-to-text engine to the number of words in the reference transcript.

### Core-Hour
The Core-Hour metric measures the computational efficiency of a speech-to-text engine:
the number of CPU hours required to process one hour of audio. An engine with a lower
Core-Hour is more computationally efficient. We omit this metric for cloud-based engines.

### Model Size
The aggregate size of models (acoustic and language), in MB. We omit this metric for cloud-based engines.
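
As a rough illustration of the Word Error Rate and Core-Hour definitions above, the sketch below computes both in plain Python. The function names `word_error_rate` and `core_hour` are hypothetical and not taken from this repo. It assumes whitespace tokenization with no text normalization, and it assumes CPU time is supplied as total CPU seconds (for example, wall-clock seconds multiplied by the number of cores in use); the benchmark itself may normalize transcripts or aggregate errors differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def core_hour(cpu_seconds: float, audio_seconds: float) -> float:
    """CPU hours needed per hour of audio (assumed: total CPU time / audio duration)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return cpu_seconds / audio_seconds


if __name__ == "__main__":
    # One missing word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # 180 CPU seconds spent on 3600 seconds of audio.
    print(core_hour(180, 3600))
```

Under these assumptions, a Core-Hour of 0.05 corresponds to about three minutes of CPU time per hour of audio.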
## Engines
- [Amazon Transcribe](https://aws.amazon.com/transcribe/)
- [Azure Speech-to-Text](https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text/)
- [Google Speech-to-Text](https://cloud.google.com/speech-to-text)
- [IBM Watson Speech-to-Text](https://www.ibm.com/ca-en/cloud/watson-speech-to-text)
- [OpenAI Whisper](https://github.com/openai/whisper)
- [Picovoice Cheetah](https://picovoice.ai/)
- [Picovoice Leopard](https://picovoice.ai/)

## Usage
This benchmark has been developed and tested on `Ubuntu 22.04`.
- Install [FFmpeg](https://www.ffmpeg.org/)
- Download datasets.
- Install the requirements:

```console
pip3 install -r requirements.txt
```

In the following, we provide instructions for running the benchmark for each engine.
The supported datasets are:
`COMMON_VOICE`, `LIBRI_SPEECH_TEST_CLEAN`, `LIBRI_SPEECH_TEST_OTHER`, `TED_LIUM`, `MLS`, and `VOX_POPULI`.
The supported languages are:
`EN`, `FR`, `DE`, `ES`, `IT`, `PT_BR`, and `PT_PT`.

### Amazon Transcribe Instructions
Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset, `${LANGUAGE}` with the target language, and `${AWS_PROFILE}`
with the name of the AWS profile you wish to use.

```console
python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--language ${LANGUAGE} \
--engine AMAZON_TRANSCRIBE \
--aws-profile ${AWS_PROFILE}
```

### Azure Speech-to-Text Instructions
Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset, `${LANGUAGE}` with the target language,
and `${AZURE_SPEECH_KEY}` and `${AZURE_SPEECH_LOCATION}` with the key and location from your Azure account.

```console
python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--language ${LANGUAGE} \
--engine AZURE_SPEECH_TO_TEXT \
--azure-speech-key ${AZURE_SPEECH_KEY} \
--azure-speech-location ${AZURE_SPEECH_LOCATION}
```

### Google Speech-to-Text Instructions
Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset, `${LANGUAGE}` with the target language,
and `${GOOGLE_APPLICATION_CREDENTIALS}` with the credentials downloaded from Google Cloud Platform.

```console
python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--language ${LANGUAGE} \
--engine GOOGLE_SPEECH_TO_TEXT \
--google-application-credentials ${GOOGLE_APPLICATION_CREDENTIALS}
```

### IBM Watson Speech-to-Text Instructions
Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset,
and `${WATSON_SPEECH_TO_TEXT_API_KEY}`/`${WATSON_SPEECH_TO_TEXT_URL}` with the credentials from your IBM account.

```console
python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine IBM_WATSON_SPEECH_TO_TEXT \
--watson-speech-to-text-api-key ${WATSON_SPEECH_TO_TEXT_API_KEY} \
--watson-speech-to-text-url ${WATSON_SPEECH_TO_TEXT_URL}
```

### OpenAI Whisper Instructions
Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset, `${LANGUAGE}` with the target language,
and `${WHISPER_MODEL}` with the Whisper model type (`WHISPER_TINY`, `WHISPER_BASE`, `WHISPER_SMALL`,
`WHISPER_MEDIUM`, `WHISPER_LARGE_V1`, `WHISPER_LARGE_V2`, or `WHISPER_LARGE_V3`).

```console
python3 benchmark.py \
--engine ${WHISPER_MODEL} \
--dataset ${DATASET} \
--language ${LANGUAGE} \
--dataset-folder ${DATASET_FOLDER}
```

### Picovoice Cheetah Instructions
Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset, `${LANGUAGE}` with the target language,
and `${PICOVOICE_ACCESS_KEY}` with the AccessKey obtained from [Picovoice Console](https://console.picovoice.ai/).
If benchmarking a non-English language, include `--picovoice-model-path` and replace `${PICOVOICE_MODEL_PATH}` with the path to a model file acquired from the [Cheetah GitHub Repo](https://github.com/Picovoice/cheetah/tree/master/lib/common/).

```console
python3 benchmark.py \
--engine PICOVOICE_CHEETAH \
--dataset ${DATASET} \
--language ${LANGUAGE} \
--dataset-folder ${DATASET_FOLDER} \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY} \
--picovoice-model-path ${PICOVOICE_MODEL_PATH}
```

### Picovoice Leopard Instructions
Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset, `${LANGUAGE}` with the target language,
and `${PICOVOICE_ACCESS_KEY}` with the AccessKey obtained from [Picovoice Console](https://console.picovoice.ai/).
If benchmarking a non-English language, include `--picovoice-model-path` and replace `${PICOVOICE_MODEL_PATH}` with the path to a model file acquired from the [Leopard GitHub Repo](https://github.com/Picovoice/leopard/tree/master/lib/common/).

```console
python3 benchmark.py \
--engine PICOVOICE_LEOPARD \
--dataset ${DATASET} \
--language ${LANGUAGE} \
--dataset-folder ${DATASET_FOLDER} \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY} \
--picovoice-model-path ${PICOVOICE_MODEL_PATH}
```

## Results
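
Tables like the ones in this section require one `benchmark.py` run per engine and dataset. The per-engine commands in the Usage section share the same shape, so one way to script such a sweep is sketched below. The helper `build_benchmark_command` and the `/path/to/...` folders are hypothetical; only the flags and the engine, dataset, and language names come from the instructions above. This is not necessarily how the published results were produced.

```python
import shlex
from typing import Dict, List, Optional


def build_benchmark_command(
    engine: str,
    dataset: str,
    dataset_folder: str,
    language: Optional[str] = None,
    engine_options: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble the argument list for one benchmark.py run (hypothetical helper)."""
    command = [
        "python3", "benchmark.py",
        "--engine", engine,
        "--dataset", dataset,
        "--dataset-folder", dataset_folder,
    ]
    if language is not None:
        command += ["--language", language]
    # Engine-specific flags, e.g. {"--picovoice-access-key": "..."}.
    for flag, value in (engine_options or {}).items():
        command += [flag, value]
    return command


if __name__ == "__main__":
    # Placeholder dataset locations; replace with real paths.
    datasets = {
        "LIBRI_SPEECH_TEST_CLEAN": "/path/to/librispeech/test-clean",
        "TED_LIUM": "/path/to/tedlium",
    }
    for dataset, folder in datasets.items():
        command = build_benchmark_command("WHISPER_TINY", dataset, folder, language="EN")
        # Print the command; pass `command` to subprocess.run(command, check=True) to execute it.
        print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell-quoting problems when a dataset path contains spaces.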
### English
#### Word Error Rate

| Engine | LibriSpeech test-clean | LibriSpeech test-other | TED-LIUM | CommonVoice | Average |
|:------------------------------:|:----------------------:|:----------------------:|:--------:|:-----------:|:-------:|
| Amazon Transcribe | 2.6% | 5.6% | 3.8% | 8.7% | 5.2% |
| Azure Speech-to-Text | 2.8% | 6.2% | 4.6% | 8.9% | 5.6% |
| Google Speech-to-Text | 10.8% | 24.5% | 14.4% | 31.9% | 20.4% |
| Google Speech-to-Text Enhanced | 6.2% | 13.0% | 6.1% | 18.2% | 10.9% |
| IBM Watson Speech-to-Text | 10.9% | 26.2% | 11.7% | 39.4% | 22.0% |
| Whisper Large (Multilingual) | 3.7% | 5.4% | 4.6% | 9.0% | 5.7% |
| Whisper Medium | 3.3% | 6.2% | 4.6% | 10.2% | 6.1% |
| Whisper Small | 3.3% | 7.2% | 4.8% | 12.7% | 7.0% |
| Whisper Base | 4.3% | 10.4% | 5.4% | 17.9% | 9.5% |
| Whisper Tiny | 5.9% | 13.8% | 6.5% | 24.4% | 12.7% |
| Picovoice Cheetah | 5.4% | 12.0% | 6.8% | 17.3% | 10.4% |
| Picovoice Leopard | 5.1% | 11.1% | 6.4% | 16.1% | 9.7% |

#### Core-Hour & Model Size
To obtain these results, we ran the benchmark across the entire TED-LIUM dataset and recorded the processing time.
The measurement was carried out on an Ubuntu 22.04 machine with an AMD CPU (`AMD Ryzen 9 5900X (12) @ 3.70GHz`),
64 GB of RAM, and NVMe storage, using 10 cores simultaneously. We omit Whisper Large from this benchmark.

| Engine | Core-Hour | Model Size / MB |
|:-----------------:|:---------:|:---------------:|
| Whisper Medium | 1.50 | 1457 |
| Whisper Small | 0.89 | 462 |
| Whisper Base | 0.28 | 139 |
| Whisper Tiny | 0.15 | 73 |
| Picovoice Leopard | 0.05 | 36 |
| Picovoice Cheetah | 0.09 | 31 |
### French
#### Word Error Rate

| Engine | CommonVoice | Multilingual LibriSpeech | VoxPopuli | Average |
|:------------------------------:|:-----------:|:-------------------------:|:---------:|:-------:|
| Amazon Transcribe | 6.0% | 4.4% | 8.6% | 6.3% |
| Azure Speech-to-Text | 11.1% | 9.0% | 11.8% | 10.6% |
| Google Speech-to-Text | 14.3% | 14.2% | 15.1% | 14.5% |
| Whisper Large | 9.3% | 4.6% | 10.9% | 8.3% |
| Whisper Medium | 13.1% | 8.6% | 12.1% | 11.3% |
| Whisper Small | 19.2% | 13.5% | 15.3% | 16.0% |
| Whisper Base | 35.4% | 24.4% | 23.3% | 27.7% |
| Whisper Tiny | 49.8% | 36.2% | 32.1% | 39.4% |
| Picovoice Cheetah | 14.5% | 14.5% | 14.9% | 14.6% |
| Picovoice Leopard | 15.9% | 19.2% | 17.5% | 17.5% |

### German
#### Word Error Rate

| Engine | CommonVoice | Multilingual LibriSpeech | VoxPopuli | Average |
|:------------------------------:|:-----------:|:-------------------------:|:---------:|:-------:|
| Amazon Transcribe | 5.3% | 2.9% | 14.6% | 7.6% |
| Azure Speech-to-Text | 6.9% | 5.4% | 13.1% | 8.5% |
| Google Speech-to-Text | 9.2% | 13.9% | 17.2% | 13.4% |
| Whisper Large | 5.3% | 4.4% | 12.5% | 7.4% |
| Whisper Medium | 8.3% | 7.6% | 13.5% | 9.8% |
| Whisper Small | 13.8% | 11.2% | 16.2% | 13.7% |
| Whisper Base | 26.9% | 19.8% | 24.0% | 23.6% |
| Whisper Tiny | 39.5% | 28.6% | 33.0% | 33.7% |
| Picovoice Cheetah | 8.4% | 12.1% | 17.0% | 12.5% |
| Picovoice Leopard | 8.2% | 11.6% | 23.6% | 14.5% |

### Italian
#### Word Error Rate

| Engine | CommonVoice | Multilingual LibriSpeech | VoxPopuli | Average |
|:------------------------------:|:-----------:|:-------------------------:|:---------:|:-------:|
| Amazon Transcribe | 4.1% | 9.1% | 16.1% | 9.8% |
| Azure Speech-to-Text | 5.8% | 14.0% | 17.8% | 12.5% |
| Google Speech-to-Text | 5.5% | 19.6% | 18.7% | 14.6% |
| Whisper Large | 4.9% | 8.8% | 21.8% | 11.8% |
| Whisper Medium | 8.7% | 14.9% | 19.3% | 14.3% |
| Whisper Small | 15.4% | 20.6% | 22.7% | 19.6% |
| Whisper Base | 32.3% | 31.6% | 31.6% | 31.8% |
| Whisper Tiny | 48.1% | 43.3% | 43.5% | 45.0% |
| Picovoice Cheetah | 8.6% | 17.6% | 20.1% | 15.4% |
| Picovoice Leopard | 13.0% | 27.7% | 22.2% | 21.0% |

### Spanish
#### Word Error Rate

| Engine | CommonVoice | Multilingual LibriSpeech | VoxPopuli | Average |
|:------------------------------:|:-----------:|:-------------------------:|:---------:|:-------:|
| Amazon Transcribe | 3.9% | 3.3% | 8.7% | 5.3% |
| Azure Speech-to-Text | 6.3% | 5.8% | 9.4% | 7.2% |
| Google Speech-to-Text | 6.6% | 9.2% | 11.6% | 9.1% |
| Whisper Large | 4.0% | 2.9% | 9.7% | 5.5% |
| Whisper Medium | 6.2% | 4.8% | 9.7% | 6.9% |
| Whisper Small | 9.8% | 7.7% | 11.4% | 9.6% |
| Whisper Base | 20.2% | 13.0% | 15.3% | 16.2% |
| Whisper Tiny | 33.3% | 20.6% | 22.7% | 25.5% |
| Picovoice Cheetah | 8.3% | 8.0% | 11.4% | 9.2% |
| Picovoice Leopard | 7.6% | 14.9% | 14.1% | 12.2% |

### Portuguese
#### Word Error Rate

| Engine | CommonVoice | Multilingual LibriSpeech | Average |
|:------------------------------:|:-----------:|:-------------------------:|:-------:|
| Amazon Transcribe | 5.4% | 7.8% | 6.6% |
| Azure Speech-to-Text | 7.4% | 9.0% | 8.2% |
| Google Speech-to-Text | 8.8% | 14.2% | 11.5% |
| Whisper Large | 5.9% | 5.4% | 5.7% |
| Whisper Medium | 9.6% | 8.1% | 8.9% |
| Whisper Small | 15.6% | 13.0% | 14.3% |
| Whisper Base | 31.2% | 22.7% | 27.0% |
| Whisper Tiny | 47.7% | 34.6% | 41.2% |
| Picovoice Cheetah | 10.6% | 16.1% | 13.4% |
| Picovoice Leopard | 17.1% | 20.0% | 18.6% |

- For Amazon Transcribe, Azure Speech-to-Text, and Google Speech-to-Text, we report results with the language set to `PT-BR`, as this achieves better results compared to `PT-PT` across all engines.
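
The Average column in the Word Error Rate tables above appears to be the unweighted mean of the per-dataset WERs, rounded to one decimal place. The README does not state this; it is an inference from the published numbers, and the original averages may have been computed from unrounded values. A minimal sketch, with `average_wer` as a hypothetical helper name:

```python
from typing import Sequence


def average_wer(per_dataset_wer_percent: Sequence[float]) -> float:
    """Unweighted mean of per-dataset WER percentages, rounded to one decimal."""
    if not per_dataset_wer_percent:
        raise ValueError("at least one WER value is required")
    mean = sum(per_dataset_wer_percent) / len(per_dataset_wer_percent)
    return round(mean, 1)


if __name__ == "__main__":
    # Amazon Transcribe, English: LibriSpeech test-clean, test-other, TED-LIUM, CommonVoice.
    print(average_wer([2.6, 5.6, 3.8, 8.7]))  # prints 5.2, matching the table
```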