{"id":21423903,"url":"https://github.com/picovoice/speech-to-text-benchmark","last_synced_at":"2025-04-04T20:16:04.744Z","repository":{"id":41225800,"uuid":"143492492","full_name":"Picovoice/speech-to-text-benchmark","owner":"Picovoice","description":"speech to text benchmark framework","archived":false,"fork":false,"pushed_at":"2024-01-12T00:17:59.000Z","size":167219,"stargazers_count":577,"open_issues_count":1,"forks_count":62,"subscribers_count":28,"default_branch":"master","last_synced_at":"2024-02-17T08:34:39.932Z","etag":null,"topics":["aws-transcribe","cheetah","deep-learning","deep-neural-networks","deepspeech","edge-ai","google-speech-to-text","mozilla-deepspeech","offline","picovoice","pocketsphinx","privacy","speech-recognition","speech-to-text","voice-recognition"],"latest_commit_sha":null,"homepage":"https://picovoice.ai/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Picovoice.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-08-04T02:52:01.000Z","updated_at":"2024-05-30T01:29:36.374Z","dependencies_parsed_at":"2023-12-06T21:24:22.518Z","dependency_job_id":"4f94822b-c4b6-4147-8e7d-29d0e4a6812e","html_url":"https://github.com/Picovoice/speech-to-text-benchmark","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Picovoice%2Fspeech-to-text-benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Picovoice%2Fspeech-to-text-benchmark/tags","rele
ases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Picovoice%2Fspeech-to-text-benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Picovoice%2Fspeech-to-text-benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Picovoice","download_url":"https://codeload.github.com/Picovoice/speech-to-text-benchmark/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247242683,"owners_count":20907134,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws-transcribe","cheetah","deep-learning","deep-neural-networks","deepspeech","edge-ai","google-speech-to-text","mozilla-deepspeech","offline","picovoice","pocketsphinx","privacy","speech-recognition","speech-to-text","voice-recognition"],"created_at":"2024-11-22T21:18:51.917Z","updated_at":"2025-04-04T20:16:04.720Z","avatar_url":"https://github.com/Picovoice.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Speech-to-Text Benchmark\n\nMade in Vancouver, Canada by [Picovoice](https://picovoice.ai)\n\nThis repo is a minimalist and extensible framework for benchmarking different speech-to-text engines.\n\n## Table of Contents\n\n- [Data](#data)\n- [Metrics](#metrics)\n- [Engines](#engines)\n- [Usage](#usage)\n- [Results](#results)\n\n## Data\n\n- [LibriSpeech](http://www.openslr.org/12/)\n- [TED-LIUM](https://www.openslr.org/7/)\n- [Common Voice](https://commonvoice.mozilla.org/en)\n- [Multilingual LibriSpeech](https://openslr.org/94)\n- 
[VoxPopuli](https://github.com/facebookresearch/voxpopuli)\n\n## Metrics\n\n### Word Error Rate\n\nWord error rate (WER) is the ratio of edit distance between words in a reference transcript and the words in the output\nof the speech-to-text engine to the number of words in the reference transcript.\n\n### Core-Hour\n\nThe Core-Hour metric is used to evaluate the computational efficiency of the speech-to-text engine,\nindicating the number of CPU hours required to process one hour of audio. A speech-to-text\nengine with lower Core-Hour is more computationally efficient. We omit this metric for cloud-based engines.\n\n### Model Size\n\nThe aggregate size of models (acoustic and language), in MB. We omit this metric for cloud-based engines.\n\n## Engines\n\n- [Amazon Transcribe](https://aws.amazon.com/transcribe/)\n- [Azure Speech-to-Text](https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text/)\n- [Google Speech-to-Text](https://cloud.google.com/speech-to-text)\n- [IBM Watson Speech-to-Text](https://www.ibm.com/ca-en/cloud/watson-speech-to-text)\n- [OpenAI Whisper](https://github.com/openai/whisper)\n- [Picovoice Cheetah](https://picovoice.ai/)\n- [Picovoice Leopard](https://picovoice.ai/)\n\n## Usage\n\nThis benchmark has been developed and tested on `Ubuntu 22.04`.\n\n- Install [FFmpeg](https://www.ffmpeg.org/)\n- Download datasets.\n- Install the requirements:\n\n```console\npip3 install -r requirements.txt\n```\n\nIn the following, we provide instructions for running the benchmark for each engine. 
\nThe supported datasets are: \n`COMMON_VOICE`, `LIBRI_SPEECH_TEST_CLEAN`, `LIBRI_SPEECH_TEST_OTHER`, `TED_LIUM`, `MLS`, and `VOX_POPULI`.\nThe supported languages are:\n`EN`, `FR`, `DE`, `ES`, `IT`, `PT_BR`, and `PT_PT`.\n\n### Amazon Transcribe Instructions\n\nReplace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset, `${LANGUAGE}` with the target language, and `${AWS_PROFILE}`\nwith the name of the AWS profile you wish to use.\n\n```console\npython3 benchmark.py \\\n--dataset ${DATASET} \\\n--dataset-folder ${DATASET_FOLDER} \\\n--language ${LANGUAGE} \\\n--engine AMAZON_TRANSCRIBE \\\n--aws-profile ${AWS_PROFILE}\n```\n\n### Azure Speech-to-Text Instructions\n\nReplace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset, `${LANGUAGE}` with the target language,\nand `${AZURE_SPEECH_KEY}` and `${AZURE_SPEECH_LOCATION}` with the key and location from your Azure account.\n\n```console\npython3 benchmark.py \\\n--dataset ${DATASET} \\\n--dataset-folder ${DATASET_FOLDER} \\\n--language ${LANGUAGE} \\\n--engine AZURE_SPEECH_TO_TEXT \\\n--azure-speech-key ${AZURE_SPEECH_KEY} \\\n--azure-speech-location ${AZURE_SPEECH_LOCATION}\n```\n\n### Google Speech-to-Text Instructions\n\nReplace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset, `${LANGUAGE}` with the target language,\nand `${GOOGLE_APPLICATION_CREDENTIALS}` with the credentials downloaded from Google Cloud Platform.\n\n```console\npython3 benchmark.py \\\n--dataset ${DATASET} \\\n--dataset-folder ${DATASET_FOLDER} \\\n--language ${LANGUAGE} \\\n--engine GOOGLE_SPEECH_TO_TEXT \\\n--google-application-credentials ${GOOGLE_APPLICATION_CREDENTIALS}\n```\n\n### IBM Watson Speech-to-Text Instructions\n\nReplace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset,\nand `${WATSON_SPEECH_TO_TEXT_API_KEY}` and `${WATSON_SPEECH_TO_TEXT_URL}` with the credentials from your IBM 
account.\n\n```console\npython3 benchmark.py \\\n--dataset ${DATASET} \\\n--dataset-folder ${DATASET_FOLDER} \\\n--engine IBM_WATSON_SPEECH_TO_TEXT \\\n--watson-speech-to-text-api-key ${WATSON_SPEECH_TO_TEXT_API_KEY} \\\n--watson-speech-to-text-url ${WATSON_SPEECH_TO_TEXT_URL}\n```\n\n### OpenAI Whisper Instructions\n\nReplace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset, `${LANGUAGE}` with the target language,\nand `${WHISPER_MODEL}` with the Whisper model type (`WHISPER_TINY`, `WHISPER_BASE`, `WHISPER_SMALL`,\n`WHISPER_MEDIUM`, `WHISPER_LARGE_V1`, `WHISPER_LARGE_V2`, or `WHISPER_LARGE_V3`).\n\n```console\npython3 benchmark.py \\\n--engine ${WHISPER_MODEL} \\\n--dataset ${DATASET} \\\n--language ${LANGUAGE} \\\n--dataset-folder ${DATASET_FOLDER}\n```\n\n### Picovoice Cheetah Instructions\n\nReplace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset, `${LANGUAGE}` with the target language,\nand `${PICOVOICE_ACCESS_KEY}` with the AccessKey obtained from [Picovoice Console](https://console.picovoice.ai/).\nIf benchmarking a non-English language, include `--picovoice-model-path` and replace `${PICOVOICE_MODEL_PATH}` with the path to a model file acquired from the [Cheetah GitHub Repo](https://github.com/Picovoice/cheetah/tree/master/lib/common/).\n\n```console\npython3 benchmark.py \\\n--engine PICOVOICE_CHEETAH \\\n--dataset ${DATASET} \\\n--language ${LANGUAGE} \\\n--dataset-folder ${DATASET_FOLDER} \\\n--picovoice-access-key ${PICOVOICE_ACCESS_KEY} \\\n--picovoice-model-path ${PICOVOICE_MODEL_PATH}\n```\n\n### Picovoice Leopard Instructions\n\nReplace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset, `${LANGUAGE}` with the target language,\nand `${PICOVOICE_ACCESS_KEY}` with the AccessKey obtained from [Picovoice Console](https://console.picovoice.ai/).\nIf benchmarking a non-English language, include `--picovoice-model-path` and replace 
`${PICOVOICE_MODEL_PATH}` with the path to a model file acquired from the [Leopard GitHub Repo](https://github.com/Picovoice/leopard/tree/master/lib/common/).\n\n```console\npython3 benchmark.py \\\n--engine PICOVOICE_LEOPARD \\\n--dataset ${DATASET} \\\n--language ${LANGUAGE} \\\n--dataset-folder ${DATASET_FOLDER} \\\n--picovoice-access-key ${PICOVOICE_ACCESS_KEY} \\\n--picovoice-model-path ${PICOVOICE_MODEL_PATH}\n```\n\n## Results\n\n### English\n\n#### Word Error Rate\n\n![](results/plots/WER.png)\n\n|             Engine             | LibriSpeech test-clean | LibriSpeech test-other | TED-LIUM | CommonVoice | Average |\n|:------------------------------:|:----------------------:|:----------------------:|:--------:|:-----------:|:-------:|\n|       Amazon Transcribe        |          2.6%          |          5.6%          |   3.8%   |    8.7%     |  5.2%   |\n|      Azure Speech-to-Text      |          2.8%          |          6.2%          |   4.6%   |    8.9%     |  5.6%   |\n|     Google Speech-to-Text      |         10.8%          |         24.5%          |  14.4%   |    31.9%    |  20.4%  |\n| Google Speech-to-Text Enhanced |          6.2%          |         13.0%          |   6.1%   |    18.2%    |  10.9%  |\n|   IBM Watson Speech-to-Text    |         10.9%          |         26.2%          |  11.7%   |    39.4%    |  22.0%  |\n|  Whisper Large (Multilingual)  |          3.7%          |          5.4%          |   4.6%   |    9.0%     |  5.7%   |\n|         Whisper Medium         |          3.3%          |          6.2%          |   4.6%   |    10.2%    |  6.1%   |\n|         Whisper Small          |          3.3%          |          7.2%          |   4.8%   |    12.7%    |  7.0%   |\n|          Whisper Base          |          4.3%          |         10.4%          |   5.4%   |    17.9%    |  9.5%   |\n|          Whisper Tiny          |          5.9%          |         13.8%          |   6.5%   |    24.4%    |  12.7%  |\n|       Picovoice Cheetah        |       
   5.4%          |         12.0%          |   6.8%   |    17.3%    |  10.4%  |\n|       Picovoice Leopard        |          5.1%          |         11.1%          |   6.4%   |    16.1%    |  9.7%   |\n\n\n#### Core-Hour \u0026 Model Size\n\nTo obtain these results, we ran the benchmark across the entire TED-LIUM dataset and recorded the processing time.\nThe measurement is carried out on an Ubuntu 22.04 machine with AMD CPU (`AMD Ryzen 9 5900X (12) @ 3.70GHz`),\n64 GB of RAM, and NVMe storage, using 10 cores simultaneously. We omit Whisper Large from this benchmark.\n\n|      Engine       | Core-Hour | Model Size / MB |\n|:-----------------:|:---------:|:---------------:|\n|  Whisper Medium   |   1.50    |      1457       |\n|   Whisper Small   |   0.89    |       462       |\n|   Whisper Base    |   0.28    |       139       |\n|   Whisper Tiny    |   0.15    |       73        |\n| Picovoice Leopard |   0.05    |       36        |\n| Picovoice Cheetah |   0.09    |       31        |\n\n![](results/plots/cpu_usage_comparison.png)\n\n### French\n\n#### Word Error Rate\n\n![](results/plots/WER_FR.png)\n\n|             Engine             | CommonVoice | Multilingual LibriSpeech  | VoxPopuli | Average |\n|:------------------------------:|:-----------:|:-------------------------:|:---------:|:-------:|\n|       Amazon Transcribe        |    6.0%     |          4.4%             |   8.6%    |  6.3%   |\n|      Azure Speech-to-Text      |    11.1%    |          9.0%             |   11.8%   |  10.6%  |\n|     Google Speech-to-Text      |    14.3%    |          14.2%            |   15.1%   |  14.5%  |\n|         Whisper Large          |    9.3%     |          4.6%             |   10.9%   |  8.3%   |\n|         Whisper Medium         |    13.1%    |          8.6%             |   12.1%   |  11.3%  |\n|         Whisper Small          |    19.2%    |          13.5%            |   15.3%   |  16.0%  |\n|          Whisper Base          |    35.4%    |          24.4%            |   
23.3%   |  27.7%  |\n|          Whisper Tiny          |    49.8%    |          36.2%            |   32.1%   |  39.4%  |\n|       Picovoice Cheetah        |    14.5%    |          14.5%            |   14.9%   |  14.6%  |\n|       Picovoice Leopard        |    15.9%    |          19.2%            |   17.5%   |  17.5%  |\n\n### German\n\n#### Word Error Rate\n\n![](results/plots/WER_DE.png)\n\n|             Engine             | CommonVoice | Multilingual LibriSpeech  | VoxPopuli | Average |\n|:------------------------------:|:-----------:|:-------------------------:|:---------:|:-------:|\n|       Amazon Transcribe        |    5.3%     |          2.9%             |   14.6%   |  7.6%   |\n|      Azure Speech-to-Text      |    6.9%     |          5.4%             |   13.1%   |  8.5%   |\n|     Google Speech-to-Text      |    9.2%     |          13.9%            |   17.2%   |  13.4%  |\n|         Whisper Large          |    5.3%     |          4.4%             |   12.5%   |  7.4%   |\n|         Whisper Medium         |    8.3%     |          7.6%             |   13.5%   |  9.8%   |\n|         Whisper Small          |    13.8%    |          11.2%            |   16.2%   |  13.7%  |\n|          Whisper Base          |    26.9%    |          19.8%            |   24.0%   |  23.6%  |\n|          Whisper Tiny          |    39.5%    |          28.6%            |   33.0%   |  33.7%  |\n|       Picovoice Cheetah        |    8.4%     |          12.1%            |   17.0%   |  12.5%  |\n|       Picovoice Leopard        |    8.2%     |          11.6%            |   23.6%   |  14.5%  |\n\n### Italian\n\n#### Word Error Rate\n\n![](results/plots/WER_IT.png)\n\n|             Engine             | CommonVoice | Multilingual LibriSpeech  | VoxPopuli | Average |\n|:------------------------------:|:-----------:|:-------------------------:|:---------:|:-------:|\n|       Amazon Transcribe        |    4.1%     |          9.1%             |   16.1%   |  9.8%   |\n|      Azure Speech-to-Text     
 |    5.8%     |          14.0%            |   17.8%   |  12.5%  |\n|     Google Speech-to-Text      |    5.5%     |          19.6%            |   18.7%   |  14.6%  |\n|         Whisper Large          |    4.9%     |          8.8%             |   21.8%   |  11.8%  |\n|         Whisper Medium         |    8.7%     |          14.9%            |   19.3%   |  14.3%  |\n|         Whisper Small          |    15.4%    |          20.6%            |   22.7%   |  19.6%  |\n|          Whisper Base          |    32.3%    |          31.6%            |   31.6%   |  31.8%  |\n|          Whisper Tiny          |    48.1%    |          43.3%            |   43.5%   |  45.0%  |\n|       Picovoice Cheetah        |    8.6%     |          17.6%            |   20.1%   |  15.4%  |\n|       Picovoice Leopard        |    13.0%    |          27.7%            |   22.2%   |  21.0%  |\n\n### Spanish\n\n#### Word Error Rate\n\n![](results/plots/WER_ES.png)\n\n|             Engine             | CommonVoice | Multilingual LibriSpeech  | VoxPopuli | Average |\n|:------------------------------:|:-----------:|:-------------------------:|:---------:|:-------:|\n|       Amazon Transcribe        |    3.9%     |          3.3%             |   8.7%    |  5.3%   |\n|      Azure Speech-to-Text      |    6.3%     |          5.8%             |   9.4%    |  7.2%   |\n|     Google Speech-to-Text      |    6.6%     |          9.2%             |   11.6%   |  9.1%   |\n|         Whisper Large          |    4.0%     |          2.9%             |   9.7%    |  5.5%   |\n|         Whisper Medium         |    6.2%     |          4.8%             |   9.7%    |  6.9%   |\n|         Whisper Small          |    9.8%     |          7.7%             |   11.4%   |  9.6%   |\n|          Whisper Base          |    20.2%    |          13.0%            |   15.3%   |  16.2%  |\n|          Whisper Tiny          |    33.3%    |          20.6%            |   22.7%   |  25.5%  |\n|       Picovoice Cheetah        |    8.3%     |          
8.0%             |   11.4%   |  9.2%   |\n|       Picovoice Leopard        |    7.6%     |          14.9%            |   14.1%   |  12.2%  |\n\n### Portuguese\n\n#### Word Error Rate\n\n![](results/plots/WER_PT.png)\n\n|             Engine             | CommonVoice | Multilingual LibriSpeech  | Average |\n|:------------------------------:|:-----------:|:-------------------------:|:-------:|\n|       Amazon Transcribe        |    5.4%     |          7.8%             |  6.6%   |\n|      Azure Speech-to-Text      |    7.4%     |          9.0%             |  8.2%   |\n|     Google Speech-to-Text      |    8.8%     |          14.2%            |  11.5%  |\n|         Whisper Large          |    5.9%     |          5.4%             |  5.7%   |\n|         Whisper Medium         |    9.6%     |          8.1%             |  8.9%   |\n|         Whisper Small          |    15.6%    |          13.0%            |  14.3%  |\n|          Whisper Base          |    31.2%    |          22.7%            |  27.0%  |\n|          Whisper Tiny          |    47.7%    |          34.6%            |  41.2%  |\n|       Picovoice Cheetah        |    10.6%    |          16.1%            |  13.4%  |\n|       Picovoice Leopard        |    17.1%    |          20.0%            |  18.6%  |\n\n- For Amazon Transcribe, Azure Speech-to-Text, and Google Speech-to-Text, we report results with the language set to `PT-BR`, as this achieves better results compared to `PT-PT` across all engines.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpicovoice%2Fspeech-to-text-benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpicovoice%2Fspeech-to-text-benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpicovoice%2Fspeech-to-text-benchmark/lists"}