{"id":13579780,"url":"https://github.com/gooofy/zamia-speech","last_synced_at":"2025-04-05T14:07:59.910Z","repository":{"id":39916714,"uuid":"80461599","full_name":"gooofy/zamia-speech","owner":"gooofy","description":"Open tools and data for cloudless automatic speech recognition","archived":false,"fork":false,"pushed_at":"2021-03-30T05:25:41.000Z","size":201237,"stargazers_count":447,"open_issues_count":32,"forks_count":84,"subscribers_count":37,"default_branch":"master","last_synced_at":"2025-03-29T13:09:00.826Z","etag":null,"topics":["asr","cmu-sphinx","kaldi","language-model","lexicon","sequitur","speech-corpora","speech-recognition","voxforge"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gooofy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-01-30T20:42:45.000Z","updated_at":"2025-01-26T14:11:11.000Z","dependencies_parsed_at":"2022-09-22T23:30:27.134Z","dependency_job_id":null,"html_url":"https://github.com/gooofy/zamia-speech","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gooofy%2Fzamia-speech","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gooofy%2Fzamia-speech/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gooofy%2Fzamia-speech/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gooofy%2Fzamia-speech/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gooofy","download_url":"https://codeload.github.com
/gooofy/zamia-speech/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247345853,"owners_count":20924102,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asr","cmu-sphinx","kaldi","language-model","lexicon","sequitur","speech-corpora","speech-recognition","voxforge"],"created_at":"2024-08-01T15:01:43.111Z","updated_at":"2025-04-05T14:07:59.883Z","avatar_url":"https://github.com/gooofy.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Zamia Speech\n\nPython scripts to compute audio and language models from voxforge.org speech data and many sources.\nModels that can be built include:\n\n* Kaldi nnet3 chain audio models\n* KenLM language models in ARPA format\n* sequitur g2p models\n* wav2letter++ models\n\n*Important*: Please note that these scripts form in no way a complete application ready for end-user consumption.\nHowever, if you are a developer interested in natural language processing you may find some of them useful.\nContributions, patches and pull requests are very welcome.\n\nAt the time of this writing, the scripts here are focused on building the English and German VoxForge models. 
\nHowever, there is no reason why they couldn't be used to build other language models as well, feel free to \ncontribute support for those.\n\n\nTable of Contents\n=================\n\n   * [Zamia Speech](#zamia-speech)\n   * [Table of Contents](#table-of-contents)\n   * [Download](#download)\n      * [ASR Models](#asr-models)\n      * [IPA Dictionaries (Lexicons)](#ipa-dictionaries-lexicons)\n      * [G2P Models](#g2p-models)\n      * [Language Models](#language-models)\n      * [Code](#code)\n   * [Get Started with our Pre-Trained Models](#get-started-with-our-pre-trained-models)\n      * [Run Example Applications](#run-example-applications)\n         * [Wave File Decoding Demo](#wave-file-decoding-demo)\n         * [Live Mic Demo](#live-mic-demo)\n   * [Get Started with a Demo STT Service Packaged in Docker](#get-started-with-a-demo-stt-service-packaged-in-docker)\n   * [Requirements](#requirements)\n   * [Setup Notes](#setup-notes)\n      * [~/.speechrc](#speechrc)\n      * [tmp directory](#tmp-directory)\n   * [Speech Corpora](#speech-corpora)\n      * [Adding Artificial Noise or Other Effects](#adding-artificial-noise-or-other-effects)\n   * [Text Corpora](#text-corpora)\n   * [Language Model](#language-model)\n      * [English](#english)\n      * [German](#german)\n      * [French](#french)\n   * [Submission Review and Transcription](#submission-review-and-transcription)\n   * [Lexica/Dictionaries](#lexicadictionaries)\n      * [Sequitur G2P](#sequitur-g2p)\n      * [Manual Editing](#manual-editing)\n      * [Wiktionary](#wiktionary)\n   * [Kaldi Models (recommended)](#kaldi-models-recommended)\n      * [English NNet3 Chain Models](#english-nnet3-chain-models)\n      * [German NNet3 Chain Models](#german-nnet3-chain-models)\n      * [Model Adaptation](#model-adaptation)\n   * [wav2letter   models](#wav2letter-models)\n      * [English Wav2letter Models](#english-wav2letter-models)\n      * [German Wav2letter Models](#german-wav2letter-models)\n      * 
[auto-reviews using wav2letter](#auto-reviews-using-wav2letter)\n   * [Audiobook Segmentation and Transcription (Manual)](#audiobook-segmentation-and-transcription-manual)\n      * [(0/3) Convert Audio to WAVE Format](#03-convert-audio-to-wave-format)\n      * [(1/3) Convert Audio to 16kHz mono](#13-convert-audio-to-16khz-mono)\n      * [(2/3) Split Audio into Segments](#23-split-audio-into-segments)\n      * [(3/3) Transcribe Audio](#33-transcribe-audio)\n   * [Audiobook Segmentation and Transcription (kaldi)](#audiobook-segmentation-and-transcription-kaldi)\n      * [Directory Layout](#directory-layout)\n      * [(1/4) Preprocess the Transcript](#14-preprocess-the-transcript)\n      * [(2/4) Model adaptation](#24-model-adaptation)\n      * [(3/4) Auto-Segment using Kaldi](#34-auto-segment-using-kaldi)\n      * [(4/4) Retrieve Segmentation Result](#44-retrieve-segmentation-result)\n   * [Training Voices for Zamia-TTS](#training-voices-for-zamia-tts)\n      * [Tacotron 2](#tacotron-2)\n      * [Tacotron](#tacotron)\n   * [Model Distribution](#model-distribution)\n   * [License](#license)\n   * [Authors](#authors)\n\nCreated by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc.go)\n\nDownload\n========\n\nWe have various models plus source code and binaries for the tools used to build these models\navailable for download. Everything is free and open source.\n\nAll our model and data downloads can be found here: [Downloads](http://goofy.zamia.org/zamia-speech/)\n\nASR Models \n----------\n\nOur pre-built ASR models can be downloaded here: [ASR Models](http://goofy.zamia.org/zamia-speech/asr-models/)\n\n+ Kaldi ASR, English:\n    + `kaldi-generic-en-tdnn_f`\n      Large nnet3-chain factorized TDNN model, trained on ~1200 hours of audio. Has decent background noise resistance and can\n      also be used on phone recordings. 
Should provide the best accuracy but is a bit more resource intensive than the\n      other models.\n    + `kaldi-generic-en-tdnn_sp`\n      Large nnet3-chain model, trained on ~1200 hours of audio. Has decent background noise resistance and can\n      also be used on phone recordings. Less accurate but also slightly less resource intensive than the `tdnn_f` model.\n    + `kaldi-generic-en-tdnn_250`\n      Same as the larger models but less resource intensive, suitable for use in embedded applications (e.g. a Raspberry Pi 3).\n    + `kaldi-generic-en-tri2b_chain`\n      GMM Model, trained on the same data as the models above - meant for auto segmentation tasks.\n+ Kaldi ASR, German:\n    + `kaldi-generic-de-tdnn_f`\n      Large nnet3-chain model, trained on ~400 hours of audio. Has decent background noise resistance and can\n      also be used on phone recordings.\n    + `kaldi-generic-de-tdnn_250`\n      Same as the large model but less resource intensive, suitable for use in embedded applications (e.g. a Raspberry Pi 3).\n    + `kaldi-generic-de-tri2b_chain`\n      GMM Model, trained on the same data as the above two models - meant for auto segmentation tasks.\n+ wav2letter++, German:\n    + `w2l-generic-de`\n      Large model, trained on ~400 hours of audio. Has decent background noise resistance and can\n      also be used on phone recordings.\n\n*NOTE*: It is important to realize that these models can and should be adapted to your application domain. 
See \n        [Model Adaptation](#model-adaptation) for details.\n\nIPA Dictionaries (Lexicons)\n---------------------------\n\nOur dictionaries can be downloaded here: [Dictionaries](https://github.com/gooofy/zamia-speech/tree/master/data/src/dicts)\n\n+ IPA UTF-8, English:\n    + `dict-en.ipa`\n      Based on CMUDict with many additional entries generated via Sequitur G2P.\n+ IPA UTF-8, German:\n    + `dict-de.ipa`\n      Created manually from scratch with many additional auto-reviewed entries extracted from Wiktionary.\n\nG2P Models \n----------\n\nOur pre-built G2P models can be downloaded here: [G2P Models](http://goofy.zamia.org/zamia-speech/g2p/)\n\n+ Sequitur, English:\n    + ` sequitur-dict-en.ipa`\n      Sequitur G2P model trained on our English IPA dictionary (UTF8).\n+ Sequitur, German:\n    + ` sequitur-dict-de.ipa`\n      Sequitur G2P model trained on our German IPA dictionary (UTF8).\n\nLanguage Models\n---------------\n\nOur pre-built ARPA language models can be downloaded here: [Language Models](http://goofy.zamia.org/zamia-speech/lm/)\n\n+ KenLM, order 4, English, ARPA:\n    + `generic_en_lang_model_small`\n+ KenLM, order 6, English, ARPA:\n    + `generic_en_lang_model_large`\n+ KenLM, order 4, German, ARPA:\n    + `generic_de_lang_model_small`\n+ KenLM, order 6, German, ARPA:\n    + `generic_de_lang_model_large`\n\nCode\n----\n\n* [Zamia-Speech](https://github.com/gooofy/zamia-speech) \n    where we host all our scripts and other sources used to build our models. 
\n* [py-kaldi-asr](https://github.com/gooofy/py-kaldi-asr) \n    Python wrapper around Kaldi's nnet3-chain decoder complete with example\n    scripts on how to use our models in your application.\n* [Binary AI Packages](http://goofy.zamia.org/repo-ai/)\n    + [Raspbian APT Repo](http://goofy.zamia.org/repo-ai/raspbian/stretch/armhf)\n        Binary packages in Debian format for Raspbian 9 (stretch, armhf, Raspberry Pi 2/3)\n    + [Debian APT Repo](http://goofy.zamia.org/repo-ai/debian/stretch/amd64)\n        Binary packages in Debian format for Debian 9 (stretch, amd64)\n    + [CentOS YUM Repo](http://goofy.zamia.org/repo-ai/centos/7/x86_64/)\n        Binary packages in RPM format for CentOS 7 (x86_64)\n* [Source AI Packages](http://goofy.zamia.org/repo-ai/)\n    + [CentOS 7](http://goofy.zamia.org/repo-ai/centos/7/SRPMS/)\n        Source packages in SRPM format for CentOS 7\n\nGet Started with our Pre-Trained Models \n=======================================\n\nRun Example Applications\n------------------------\n\n### Wave File Decoding Demo\n\nDownload a few sample wave files\n\n```bash\n$ wget http://goofy.zamia.org/zamia-speech/misc/demo_wavs.tgz\n--2018-06-23 16:46:28--  http://goofy.zamia.org/zamia-speech/misc/demo_wavs.tgz\nResolving goofy.zamia.org (goofy.zamia.org)... 78.47.65.20\nConnecting to goofy.zamia.org (goofy.zamia.org)|78.47.65.20|:80... connected.\nHTTP request sent, awaiting response... 
200 OK\nLength: 619852 (605K) [application/x-gzip]\nSaving to: ‘demo_wavs.tgz’\n\ndemo_wavs.tgz                     100%[==========================================================\u003e] 605.32K  2.01MB/s    in 0.3s    \n\n2018-06-23 16:46:28 (2.01 MB/s) - ‘demo_wavs.tgz’ saved [619852/619852]\n```\n\nunpack them:\n\n```bash\n$ tar xfvz demo_wavs.tgz\ndemo1.wav\ndemo2.wav\ndemo3.wav\ndemo4.wav\n```\n\ndownload the demo program \n\n\n```bash\n$ wget http://goofy.zamia.org/zamia-speech/misc/kaldi_decode_wav.py\n--2018-06-23 16:47:53--  http://goofy.zamia.org/zamia-speech/misc/kaldi_decode_wav.py\nResolving goofy.zamia.org (goofy.zamia.org)... 78.47.65.20\nConnecting to goofy.zamia.org (goofy.zamia.org)|78.47.65.20|:80... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 2469 (2.4K) [text/plain]\nSaving to: ‘kaldi_decode_wav.py’\n\nkaldi_decode_wav.py               100%[==========================================================\u003e]   2.41K  --.-KB/s    in 0s      \n\n2018-06-23 16:47:53 (311 MB/s) - ‘kaldi_decode_wav.py’ saved [2469/2469]\n```\n\nnow run kaldi automatic speech recognition on the demo wav files:\n\n```bash\n$ python kaldi_decode_wav.py -v demo?.wav\nDEBUG:root:/opt/kaldi/model/kaldi-generic-en-tdnn_sp loading model...\nDEBUG:root:/opt/kaldi/model/kaldi-generic-en-tdnn_sp loading model... done, took 1.473226s.\nDEBUG:root:/opt/kaldi/model/kaldi-generic-en-tdnn_sp creating decoder...\nDEBUG:root:/opt/kaldi/model/kaldi-generic-en-tdnn_sp creating decoder... 
done, took 0.143928s.\nDEBUG:root:demo1.wav decoding took     0.37s, likelyhood: 1.863645\ni cannot follow you she said \nDEBUG:root:demo2.wav decoding took     0.54s, likelyhood: 1.572326\ni should like to engage just for one whole life in that \nDEBUG:root:demo3.wav decoding took     0.42s, likelyhood: 1.709773\nphilip knew that she was not an indian \nDEBUG:root:demo4.wav decoding took     1.06s, likelyhood: 1.715135\nhe also contented that better confidence was established by carrying no weapons \n```\n\n### Live Mic Demo\n\nDetermine the name of your pulseaudio mic source:\n\n```bash\n$ pactl list sources\nSource #0\n    State: SUSPENDED\n    Name: alsa_input.usb-C-Media_Electronics_Inc._USB_PnP_Sound_Device-00.analog-mono\n    Description: CM108 Audio Controller Analog Mono\n                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n```\n\ndownload and run demo:\n\n```bash\n$ wget 'http://goofy.zamia.org/zamia-speech/misc/kaldi_decode_live.py'\n\n$ python kaldi_decode_live.py -s 'CM108'\nKaldi live demo V0.2\nLoading model from /opt/kaldi/model/kaldi-generic-en-tdnn_250 ...\nPlease speak.\nhallo computer                      \nswitch on the radio please                      \nplease switch on the light                      \nwhat about the weather in stuttgart                     \nhow are you                      \nthank you                      \ngood bye \n```\n\nGet Started with a Demo STT Service Packaged in Docker\n======================================================\n\nTo start the STT service on your local machine, execute:\n\n```bash\n$ docker pull quay.io/mpuels/docker-py-kaldi-asr-and-model:kaldi-generic-en-tdnn_sp-r20180611\n$ docker run --rm -p 127.0.0.1:8080:80/tcp quay.io/mpuels/docker-py-kaldi-asr-and-model:kaldi-generic-en-tdnn_sp-r20180611\n```\n\nTo transfer an audio file for transcription to the service, in a second\nterminal, execute:\n\n```bash\n$ git clone https://github.com/mpuels/docker-py-kaldi-asr-and-model.git\n$ conda env create -f 
environment.yml\n$ source activate py-kaldi-asr-client\n$ ./asr_client.py asr.wav\nINFO:root: 0.005s:  4000 frames ( 0.250s) decoded, status=200.\n...\nINFO:root:19.146s: 152000 frames ( 9.500s) decoded, status=200.\nINFO:root:27.136s: 153003 frames ( 9.563s) decoded, status=200.\nINFO:root:*****************************************************************\nINFO:root:** wavfn         : asr.wav\nINFO:root:** hstr          : speech recognition system requires training where individuals to exercise political system\nINFO:root:** confidence    : -0.578844\nINFO:root:** decoding time :    27.14s\nINFO:root:*****************************************************************\n```\n\nThe Docker image in the example above is the result of stacking 4 images on top\nof each other:\n\n- docker-py-kaldi-asr-and-model: [Source](https://github.com/mpuels/docker-py-kaldi-asr-and-model), [Image](https://quay.io/repository/mpuels/docker-py-kaldi-asr-and-model)\n\n- docker-py-kaldi-asr: [Source](https://github.com/mpuels/docker-py-kaldi-asr), [Image](https://quay.io/repository/mpuels/docker-py-kaldi-asr)\n\n- docker-kaldi-asr: [Source](https://github.com/mpuels/docker-kaldi-asr), [Image](https://quay.io/repository/mpuels/docker-kaldi-asr)\n\n- debian:8: https://hub.docker.com/_/debian/\n\n\nRequirements\n============\n\n*Note*: probably incomplete.\n\n* Python 2.7 with nltk, numpy, ...\n* KenLM\n* kaldi\n* wav2letter++\n* py-nltools\n* sox\n* ffmpeg\n\n*Dependencies installation example for Debian*: \n\n    apt-get install build-essential pkg-config python-pip python-dev python-setuptools python-wheel ffmpeg sox libatlas-base-dev\n    \n    # Create a symbolic link because one of the pip packages expect atlas in this location: \n    ln -s /usr/include/x86_64-linux-gnu/atlas /usr/include/atlas\n    \n    pip install numpy nltk cython\n    pip install py-kaldi-asr py-nltools\n\nSetup Notes\n===========\n\nJust some rough notes on the environment needed to get these scripts to run. 
This is in no way a complete set of\ninstructions, just some hints to get you started.\n\n`~/.speechrc`\n-------------\n\n```ini\n[speech]\nvf_login              = \u003cyour voxforge login\u003e\n\nspeech_arc            = /home/bofh/projects/ai/data/speech/arc\nspeech_corpora        = /home/bofh/projects/ai/data/speech/corpora\n\nkaldi_root            = /apps/kaldi-cuda\n\n; facebook's wav2letter++\nw2l_env_activate      = /home/bofh/projects/ai/w2l/bin/activate\nw2l_train             = /home/bofh/projects/ai/w2l/src/wav2letter/build/Train\nw2l_decoder           = /home/bofh/projects/ai/w2l/src/wav2letter/build/Decoder\n\nwav16                 = /home/bofh/projects/ai/data/speech/16kHz\nnoise_dir             = /home/bofh/projects/ai/data/speech/corpora/noise\n\neuroparl_de           = /home/bofh/projects/ai/data/corpora/de/europarl-v7.de-en.de\nparole_de             = /home/bofh/projects/ai/data/corpora/de/German Parole Corpus/DE_Parole/\n\neuroparl_en           = /home/bofh/projects/ai/data/corpora/en/europarl-v7.de-en.en\ncornell_movie_dialogs = /home/bofh/projects/ai/data/corpora/en/cornell_movie_dialogs_corpus\nweb_questions         = /home/bofh/projects/ai/data/corpora/en/WebQuestions\nyahoo_answers         = /home/bofh/projects/ai/data/corpora/en/YahooAnswers\n\neuroparl_fr           = /home/bofh/projects/ai/data/corpora/fr/europarl-v7.fr-en.fr\nest_republicain       = /home/bofh/projects/ai/data/corpora/fr/est_republicain.txt\n\nwiktionary_de         = /home/bofh/projects/ai/data/corpora/de/dewiktionary-20180320-pages-meta-current.xml\n\n[tts]\nhost                  = localhost\nport                  = 8300\n```\n\ntmp directory\n-------------\n\nSome scripts expect a local `tmp` directory to be present, located in the same directory where all the scripts live, i.e.\n\n```bash\nmkdir tmp\n```\n\n\nSpeech Corpora\n==============\n\nThe following list contains speech corpora supported by this script collection.\n\n- [Forschergeist (German, 2 
hours)](http://goofy.zamia.org/zamia-speech/corpora/forschergeist/):\n    + Download all .tgz files into the directory `\u003c~/.speechrc:speech_arc\u003e/forschergeist` \n    + unpack them into the directory `\u003c~/.speechrc:speech_corpora\u003e/forschergeist`\n\n- [German Speechdata Package Version 2 (German, 148 hours)](http://www.repository.voxforge1.org/downloads/de/german-speechdata-package-v2.tar.gz):\n    + Unpack the archive such that the directories `dev`, `test`, and `train` are\n      direct subdirectories of `\u003c~/.speechrc:speech_arc\u003e/gspv2`. \n    + Then run the script `./import_gspv2.py` to convert the corpus to the VoxForge\n      format. The resulting corpus will be written to `\u003c~/.speechrc:speech_corpora\u003e/gspv2`. \n\n- [Noise](http://goofy.zamia.org/zamia-speech/corpora/noise.tar.xz):\n    + Download the tarball \n    + unpack it into the directory `\u003c~/.speechrc:speech_corpora\u003e/` (it will generate a `noise` subdirectory there)\n\n- [LibriSpeech ASR (English, 475 hours)](http://www.openslr.org/12/):\n    + Download the 360-hour \"clean\" speech tarball\n    + Unpack the archive such that the directory `LibriSpeech` is a direct \n      subdirectory of `\u003c~/.speechrc:speech_arc\u003e`. \n    + Then run the script `./import_librispeech.py` to convert the corpus to the VoxForge\n      format. The resulting corpus will be written to `\u003c~/.speechrc:speech_corpora\u003e/librispeech`. \n\n- [The LJ Speech Dataset (English, 24 hours)](https://keithito.com/LJ-Speech-Dataset/):\n    + Download the tarball\n    + Unpack the archive such that the directory `LJSpeech-1.1` is a direct \n      subdirectory of `\u003c~/.speechrc:speech_arc\u003e`. \n    + Then run the script `import_ljspeech.py` to convert the corpus to the VoxForge\n      format. The resulting corpus will be written to `\u003c~/.speechrc:speech_corpora\u003e/lindajohnson-11`. 
\n\n- [Mozilla Common Voice German (German, 140 hours)](https://voice.mozilla.org/en/datasets):\n    + Download `de.tar.gz`\n    + Unpack the archive such that the directory `cv_de` is a direct \n      subdirectory of `\u003c~/.speechrc:speech_arc\u003e`. \n    + Then run the script `./import_mozde.py` to convert the corpus to the VoxForge\n      format. The resulting corpus will be written to `\u003c~/.speechrc:speech_corpora\u003e/cv_de`. \n\n- [Mozilla Common Voice V1 (English, 252 hours)](https://voice.mozilla.org/en/data):\n    + Download `cv_corpus_v1.tar.gz`\n    + Unpack the archive such that the directory `cv_corpus_v1` is a direct \n      subdirectory of `\u003c~/.speechrc:speech_arc\u003e`. \n    + Then run the script `./import_mozcv1.py` to convert the corpus to the VoxForge\n      format. The resulting corpus will be written to `\u003c~/.speechrc:speech_corpora\u003e/cv_corpus_v1`. \n\n- [Munich Artificial Intelligence Laboratories GmbH (M-AILABS) Speech Dataset (English, 147 hours, German, 237 hours, French, 190 hours)](http://www.m-ailabs.bayern/en/):\n    + Download `de_DE.tgz`, `en_UK.tgz`, `en_US.tgz`, `fr_FR.tgz` ([Mirror](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/))\n    + Create a subdirectory `m_ailabs` in `\u003c~/.speechrc:speech_arc\u003e`\n    + Unpack the downloaded tarballs inside the `m_ailabs` subdirectory\n    + For French, create a directory `by_book` and move the `male` and `female` directories into it, as the French archive does not exactly follow the structure of the English and German archives\n    + Then run the script `./import_mailabs.py` to convert the corpus to the VoxForge\n      format. 
The resulting corpus will be written to `\u003c~/.speechrc:speech_corpora\u003e/m_ailabs_en`, `\u003c~/.speechrc:speech_corpora\u003e/m_ailabs_de` and `\u003c~/.speechrc:speech_corpora\u003e/m_ailabs_fr`.\n\n- [TED-LIUM Release 3 (English, 210 hours)](https://www.openslr.org/51/):\n    + Download `TEDLIUM_release-3.tgz`\n    + Unpack the archive such that the directory `TEDLIUM_release-3` is a direct \n      subdirectory of `\u003c~/.speechrc:speech_arc\u003e`. \n    + Then run the script `./import_tedlium3.py` to convert the corpus to the VoxForge\n      format. The resulting corpus will be written to `\u003c~/.speechrc:speech_corpora\u003e/tedlium3`. \n\n- [VoxForge (English, 75 hours)](http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Audio/Main/16kHz_16bit/):\n    + Download all .tgz files into the directory `\u003c~/.speechrc:speech_arc\u003e/voxforge_en` \n    + unpack them into the directory `\u003c~/.speechrc:speech_corpora\u003e/voxforge_en`\n\n- [VoxForge (German, 56 hours)](http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/):\n    + Download all .tgz files into the directory `\u003c~/.speechrc:speech_arc\u003e/voxforge_de` \n    + unpack them into the directory `\u003c~/.speechrc:speech_corpora\u003e/voxforge_de`\n\n- [VoxForge (French, 140 hours)](http://www.repository.voxforge1.org/downloads/fr/Trunk/Audio/Main/16kHz_16bit/):\n    + Download all .tgz files into the directory `\u003c~/.speechrc:speech_arc\u003e/voxforge_fr` \n    + unpack them into the directory `\u003c~/.speechrc:speech_corpora\u003e/voxforge_fr`\n\n- [Zamia (English, 5 minutes)](http://goofy.zamia.org/zamia-speech/corpora/zamia_en/):\n    + Download all .tgz files into the directory `\u003c~/.speechrc:speech_arc\u003e/zamia_en` \n    + unpack them into the directory `\u003c~/.speechrc:speech_corpora\u003e/zamia_en`\n\n- [Zamia (German, 18 hours)](http://goofy.zamia.org/zamia-speech/corpora/zamia_de/):\n    + Download all .tgz files into 
the directory `\u003c~/.speechrc:speech_arc\u003e/zamia_de` \n    + unpack them into the directory `\u003c~/.speechrc:speech_corpora\u003e/zamia_de`\n\n\n*Technical note*: For most corpora we have corrected transcripts in our databases which can be found\nin `data/src/speech/\u003ccorpus_name\u003e/transcripts_*.csv`. As these have been created by many hours of (semi-) \nmanual review they should be of higher quality than the original prompts so they will be used during\ntraining of our ASR models.\n\nOnce you have downloaded and, if necessary, converted a corpus you need to run\n\n```bash\n./speech_audio_scan.py \u003ccorpus name\u003e\n```\n\non it. This will add missing prompts to the CSV databases and convert audio files to 16kHz mono WAVE format.\n\nAdding Artificial Noise or Other Effects\n----------------------------------------\n\nTo improve noise resistance it is possible to derive corpora from existing ones with noise added:\n\n```bash\n./speech_gen_noisy.py zamia_de\n./speech_audio_scan.py zamia_de_noisy\ncp data/src/speech/zamia_de/spk2gender data/src/speech/zamia_de_noisy/\ncp data/src/speech/zamia_de/spk_test.txt data/src/speech/zamia_de_noisy/\n./auto_review.py -a zamia_de_noisy\n./apply_review.py -l de zamia_de_noisy review-result.csv \n```\n\nThe following script runs recordings through typical telephone codecs. 
Such a corpus can be used to train models\nthat support 8kHz phone recordings:\n\n```bash\n./speech_gen_phone.py zamia_de\n./speech_audio_scan.py zamia_de_phone\ncp data/src/speech/zamia_de/spk2gender data/src/speech/zamia_de_phone/\ncp data/src/speech/zamia_de/spk_test.txt data/src/speech/zamia_de_phone/\n./auto_review.py -a zamia_de_phone\n./apply_review.py -l de zamia_de_phone review-result.csv \n```\n\nText Corpora\n============\n\nThe following list contains text corpora that can be used to train language\nmodels with the scripts contained in this repository:\n\n- [Europarl](http://www.statmt.org/europarl/), specifically [parallel corpus German-English](http://www.statmt.org/europarl/v7/de-en.tgz) and [parallel corpus French-English](http://www.statmt.org/europarl/v7/fr-en.tgz): \n    + corresponding variable in `.speechrc`: `europarl_de`, `europarl_en`, `europarl_fr`\n    + sentences extraction: run `./speech_sentences.py europarl_de`, `./speech_sentences.py europarl_en` and `./speech_sentences.py europarl_fr`\n\n- [Cornell Movie--Dialogs Corpus](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html): \n    + corresponding variable in `.speechrc`: `cornell_movie_dialogs`\n    + sentences extraction: run `./speech_sentences.py cornell_movie_dialogs`\n\n- [German Parole Corpus](http://ota.ox.ac.uk/desc/2467): \n    + corresponding variable in `.speechrc`: `parole_de`\n    + sentences extraction: train punkt tokenizer using `./speech_train_punkt_tokenizer.py`, then run `./speech_sentences.py parole_de`\n\n- [WebQuestions](https://nlp.stanford.edu/software/sempre/): `web_questions`\n    + corresponding variable in `.speechrc`: `web_questions`\n    + sentences extraction: run `./speech_sentences.py web_questions`\n\n- [Yahoo! 
Answers dataset](https://cogcomp.org/page/resource_view/89): `yahoo_answers`\n    + corresponding variable in `.speechrc`: `yahoo_answers`\n    + sentences extraction: run `./speech_sentences.py yahoo_answers`\n\n- [CNRTL Est Républicain Corpus](http://cnrtl.fr/corpus/estrepublicain/), large corpus of news articles (4.3M headlines/paragraphs) available under a CC BY-NC-SA license. Download XML files and extract headlines and paragraphs to a text file with the following command: `xmllint --xpath '//*[local-name()=\"div\"][@type=\"article\"]//*[local-name()=\"p\" or local-name()=\"head\"]/text()' Annee*/*.xml | perl -pe 's/^  +//g ; s/^ (.+)/$1\\n/g ; chomp' \u003e est_republicain.txt`\n    + corresponding variable in `.speechrc`: `est_republicain`\n    + sentences extraction: run `./speech_sentences.py est_republicain`\n\nSentences can also be extracted from our speech corpora. To do that, run:\n\n- English Speech Corpora\n    + `./speech_sentences.py voxforge_en`\n    + `./speech_sentences.py librispeech`\n    + `./speech_sentences.py zamia_en`\n    + `./speech_sentences.py cv_corpus_v1`\n    + `./speech_sentences.py ljspeech`\n    + `./speech_sentences.py m_ailabs_en`\n    + `./speech_sentences.py tedlium3`\n\n- German Speech Corpora\n    + `./speech_sentences.py forschergeist`\n    + `./speech_sentences.py gspv2`\n    + `./speech_sentences.py voxforge_de`\n    + `./speech_sentences.py zamia_de`\n    + `./speech_sentences.py m_ailabs_de`\n    + `./speech_sentences.py cv_de`\n\nLanguage Model\n==============\n\nEnglish\n-------\n\nPrerequisites: \n- text corpora `europarl_en`, `cornell_movie_dialogs`, `web_questions`, and `yahoo_answers` are installed, sentences extracted (see instructions above).\n- sentences are extracted from speech corpora `librispeech`, `voxforge_en`, `zamia_en`, `cv_corpus_v1`, `ljspeech`, `m_ailabs_en`, `tedlium3`\n\nTo train a small, pruned English language model of order 4 using KenLM for use in both kaldi and wav2letter builds 
run:\n\n```bash\n./speech_build_lm.py generic_en_lang_model_small europarl_en cornell_movie_dialogs web_questions yahoo_answers librispeech voxforge_en zamia_en cv_corpus_v1 ljspeech m_ailabs_en tedlium3\n```\n\nto train a larger model of order 6 with less pruning:\n\n```bash\n./speech_build_lm.py -o 6 -p \"0 0 0 0 1\" generic_en_lang_model_large europarl_en cornell_movie_dialogs web_questions yahoo_answers librispeech voxforge_en zamia_en cv_corpus_v1 ljspeech m_ailabs_en tedlium3\n```\n\nto train a medium size model of order 5:\n\n```bash\n./speech_build_lm.py -o 5 -p \"0 0 1 2\" generic_en_lang_model_medium europarl_en cornell_movie_dialogs web_questions yahoo_answers librispeech voxforge_en zamia_en cv_corpus_v1 ljspeech m_ailabs_en tedlium3\n```\nGerman\n------\n\nPrerequisites: \n- text corpora `europarl_de` and `parole_de` are installed, sentences extracted (see instructions above).\n- sentences are extracted from speech corpora `forschergeist`, `gspv2`, `voxforge_de`, `zamia_de`, `m_ailabs_de`, `cv_de`\n\nTo train a small, pruned German language model of order 4 using KenLM for use in both kaldi and wav2letter builds run:\n\n```bash\n./speech_build_lm.py generic_de_lang_model_small europarl_de parole_de forschergeist gspv2 voxforge_de zamia_de m_ailabs_de cv_de\n```\nto train a larger model of order 6 with less pruning:\n\n```bash\n./speech_build_lm.py -o 6 -p \"0 0 0 0 1\" generic_de_lang_model_large europarl_de parole_de forschergeist gspv2 voxforge_de zamia_de m_ailabs_de cv_de\n```\n\nto train a medium size model of order 5:\n\n```bash\n./speech_build_lm.py -o 5 -p \"0 0 1 2\" generic_de_lang_model_medium europarl_de parole_de forschergeist gspv2 voxforge_de zamia_de m_ailabs_de cv_de\n```\n\nFrench\n------\n\nPrerequisites:\n- text corpora `europarl_fr` and `est_republicain` are installed, sentences extracted (see instructions above).\n- sentences are extracted from speech corpora `voxforge_fr` and `m_ailabs_fr`\n\nTo train a French language model 
using KenLM run:\n```bash\n./speech_build_lm.py generic_fr_lang_model europarl_fr est_republicain voxforge_fr m_ailabs_fr\n```\n\nSubmission Review and Transcription\n===================================\n\nThe main tool used for submission review, transcription and lexicon expansion is:\n\n```bash\n./speech_editor.py\n```\n\n\nLexica/Dictionaries\n===================\n\n*NOTE*: We use the terms lexicon and dictionary interchangeably in this documentation and our scripts.\n\nCurrently, we have two lexica, one for English and one for German (in `data/src/dicts`):\n\n- dict-en.ipa\n    + English\n    + originally based on The CMU Pronouncing Dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict)\n    + additional manual and Sequitur G2P based entries\n\n- dict-de.ipa\n    + started manually from scratch\n    + once enough entries existed to train a reasonable Sequitur G2P model, many entries were converted from German wiktionary (see below)\n\nThe native format of our lexica is in (UTF8) IPA with semicolons as separator. This format is then converted to\nwhatever format is used by the target ASR engine by the corresponding export scripts.\n\nSequitur G2P\n------------\n\nMany lexicon-related tools rely on Sequitur G2P to compute pronunciations for words missing from the dictionary. The\nnecessary models can be downloaded from our file server: http://goofy.zamia.org/zamia-speech/g2p/ . 
\nFor installation, download and unpack them and then put links to them under `data/models` like so:\n\n```bash\ndata/models/sequitur-dict-de.ipa-latest -\u003e \u003cyour model dir\u003e/sequitur-dict-de.ipa-r20180510\ndata/models/sequitur-dict-en.ipa-latest -\u003e \u003cyour model dir\u003e/sequitur-dict-en.ipa-r20180510\n```\n\nTo train your own Sequitur G2P models, use the export and train scripts provided, e.g.:\n\n```bash\n[guenter@dagobert speech]$ ./speech_sequitur_export.py -d dict-de.ipa\nINFO:root:loading lexicon...\nINFO:root:loading lexicon...done.\nINFO:root:sequitur workdir data/dst/dict-models/dict-de.ipa/sequitur done.\n[guenter@dagobert speech]$ ./speech_sequitur_train.sh dict-de.ipa\ntraining sample: 322760 + 16988 devel\niteration: 0\n...\n```\n\nManual Editing\n--------------\n\n```bash\n./speech_lex_edit.py word [word2 ...]\n```\n\nis the main curses based, interactive lexicon editor. It will automatically\nproduce candidate entries for new words using Sequitur G2P, MaryTTS and\neSpeakNG. The user can then edit these entries manually if necessary and check\nthem by listening to them being synthesized via MaryTTS in different voices.\n\nThe lexicon editor is also integrated into various other tools, `speech_editor.py` in particular\nwhich allows you to transcribe, review and add missing words for new audio samples\nwithin one tool - which is recommended.\n\n\nI also tend to review lexicon entries randomly from time to time. 
For that I have a small script which will pick 20\nrandom entries where Sequitur G2P disagrees with the current transcription in the lexicon:\n\n```bash\n./speech_lex_edit.py `./speech_lex_review.py`\n```\n\nAlso, I sometimes use this command to add missing words from transcripts in batch mode:\n\n```bash\n./speech_lex_edit.py `./speech_lex_missing.py`\n```\n\nWiktionary\n----------\n\nFor the German lexicon, entries can be extracted from the German wiktionary using a set of scripts.\nTo do that, the first step is to extract a set of candidate entries from a wiktionary XML dump:\n\n```bash\n./wiktionary_extract_ipa.py \n```\n\nThis will output extracted entries to `data/dst/speech/de/dict_wiktionary_de.txt`. We now need to \ntrain a Sequitur G2P model that translates these entries into our own IPA style and phoneme set:\n\n```bash\n./wiktionary_sequitur_export.py\n./wiktionary_sequitur_train.sh\n```\n\nFinally, we translate the entries and check them against the predictions from our regular Sequitur G2P model:\n\n```bash\n./wiktionary_sequitur_gen.py\n```\n\nThis script produces two output files: `data/dst/speech/de/dict_wiktionary_gen.txt` contains acceptable entries,\n`data/dst/speech/de/dict_wiktionary_rej.txt` contains rejected entries.\n\n\nKaldi Models (recommended)\n==========================\n\nEnglish NNet3 Chain Models\n--------------------------\n\nThe following recipe trains Kaldi models for English. 
\n\nBefore running it, make sure all prerequisites are met (see above for instructions on these):\n\n- language model `generic_en_lang_model_small` built\n- some or all speech corpora of `voxforge_en`, `librispeech`, `cv_corpus_v1`, `ljspeech`, `m_ailabs_en`, `tedlium3` and `zamia_en` are installed, converted and scanned.\n- optionally noise augmented corpora: `voxforge_en_noisy`, `voxforge_en_phone`, `librispeech_en_noisy`, `librispeech_en_phone`, `cv_corpus_v1_noisy`, `cv_corpus_v1_phone`, `zamia_en_noisy` and `zamia_en_phone`\n\n```bash\n./speech_kaldi_export.py generic-en-small dict-en.ipa generic_en_lang_model_small voxforge_en librispeech zamia_en \ncd data/dst/asr-models/kaldi/generic-en-small\n./run-chain.sh\n```\n\nexport run with noise augmented corpora included:\n\n```bash\n./speech_kaldi_export.py generic-en dict-en.ipa generic_en_lang_model_small voxforge_en cv_corpus_v1 librispeech ljspeech m_ailabs_en tedlium3 zamia_en voxforge_en_noisy librispeech_noisy cv_corpus_v1_noisy cv_corpus_v1_phone zamia_en_noisy voxforge_en_phone librispeech_phone zamia_en_phone\n```\n\nGerman NNet3 Chain Models\n-------------------------\n\nThe following recipe trains Kaldi models for German. 
\n\nBefore running it, make sure all prerequisites are met (see above for instructions on these):\n\n- language model `generic_de_lang_model_small` built\n- some or all speech corpora of `voxforge_de`, `gspv2`, `forschergeist`, `zamia_de`, `m_ailabs_de`, `cv_de` are installed, converted and scanned.\n- optionally noise augmented corpora: `voxforge_de_noisy`, `voxforge_de_phone`, `zamia_de_noisy` and `zamia_de_phone`\n\n```bash\n./speech_kaldi_export.py generic-de-small dict-de.ipa generic_de_lang_model_small voxforge_de gspv2 [ forschergeist zamia_de ...]\ncd data/dst/asr-models/kaldi/generic-de-small\n./run-chain.sh\n```\n\nexport run with noise augmented corpora included:\n\n```bash\n./speech_kaldi_export.py generic-de dict-de.ipa generic_de_lang_model_small voxforge_de gspv2 forschergeist zamia_de voxforge_de_noisy voxforge_de_phone zamia_de_noisy zamia_de_phone m_ailabs_de cv_de\n```\n\nModel Adaptation\n----------------\n\nFor a standalone kaldi model adaptation tool that does not require a complete zamia-speech setup, see\n\n[kaldi-adapt-lm](https://github.com/gooofy/kaldi-adapt-lm)\n\n\nExisting kaldi models (such as the ones we provide for download but also those you may train from scratch using our scripts)\ncan be adapted to (typically domain specific) language models, JSGF grammars and grammar FSTs.\n\nHere is an example of how to adapt our English model to a simple command and control JSGF grammar. Please note that this is just\na toy example - for real world usage you will probably want to add garbage phoneme loops to the grammar or produce a language\nmodel that has some noise resistance built in right away. 
\n\nHere is the grammar we will use: \n\n```jsgf\n#JSGF V1.0;\n\ngrammar org.zamia.control;\n\npublic \u003ccontrol\u003e = \u003cwake\u003e | \u003cpoliteCommand\u003e ;\n\n\u003cwake\u003e = ( good morning | hello | ok | activate ) computer;\n\n\u003cpoliteCommand\u003e = [ please | kindly | could you ] \u003ccommand\u003e [ please | thanks | thank you ];\n\n\u003ccommand\u003e = \u003conOffCommand\u003e | \u003cmuteCommand\u003e | \u003cvolumeCommand\u003e | \u003cweatherCommand\u003e;\n\n\u003conOffCommand\u003e = [ turn | switch ] [the] ( light | fan | music | radio ) (on | off) ;\n\n\u003cvolumeCommand\u003e = turn ( up | down ) the ( volume | music | radio ) ;\n\n\u003cmuteCommand\u003e = mute the ( music | radio ) ;\n\n\u003cweatherCommand\u003e = (what's | what) is the ( temperature | weather ) ;\n```\n\nthe next step is to set up a kaldi model adaptation experiment using this script:\n\n```bash\n./speech_kaldi_adapt.py data/models/kaldi-generic-en-tdnn_250-latest dict-en.ipa control.jsgf control-en\n```\n\nhere, `data/models/kaldi-generic-en-tdnn_250-latest` is the model to be adapted, `dict-en.ipa` is the dictionary which\nwill be used by the new model, `control.jsgf` is the JSGF grammar we want the model to be adapted to (you could specify an\nFST source file or a language model instead here) and `control-en` is the name of the new model that will be created.\n\nTo run the actual adaptation, change into the model directory and run the adaptation script there:\n\n```bash\ncd data/dst/asr-models/kaldi/control-en\n./run-adaptation.sh \n```\n\nfinally, you can create a tarball from the newly created model:\n\n```bash\ncd ../../../../..\n./speech_dist.sh control-en kaldi adapt\n```\n\n\nwav2letter++ models\n===================\n\nEnglish Wav2letter Models\n-------------------------\n\n```bash\n./wav2letter_export.py -l en -v generic-en dict-en.ipa generic_en_lang_model_large voxforge_en cv_corpus_v1 librispeech ljspeech m_ailabs_en tedlium3 zamia_en 
voxforge_en_noisy librispeech_noisy cv_corpus_v1_noisy cv_corpus_v1_phone zamia_en_noisy voxforge_en_phone librispeech_phone zamia_en_phone\npushd data/dst/asr-models/wav2letter/generic-en/\nbash run_train.sh\n```\n\nGerman Wav2letter Models\n------------------------\n\n```bash\n./wav2letter_export.py -l de -v generic-de dict-de.ipa generic_de_lang_model_large voxforge_de gspv2 forschergeist zamia_de voxforge_de_noisy voxforge_de_phone zamia_de_noisy zamia_de_phone m_ailabs_de cv_de\npushd data/dst/asr-models/wav2letter/generic-de/\nbash run_train.sh\n```\n\nauto-reviews using wav2letter\n-----------------------------\n\ncreate auto-review case:\n\n```bash\n./wav2letter_auto_review.py -l de w2l-generic-de-latest gspv2\n```\n\nrun it:\n```bash\npushd tmp/w2letter_auto_review\nbash run_auto_review.sh\npopd\n```\n\napply the results:\n```bash\n./wav2letter_apply_review.py\n```\n\n\nAudiobook Segmentation and Transcription (Manual)\n=================================================\n\nSome notes on how to segment and transcribe audiobooks or other audio sources (e.g. from librivox) using\nthe abook scripts provided:\n\n(0/3) Convert Audio to WAVE Format\n----------------------------------\n\nMP3:\n```bash\nffmpeg -i foo.mp3 foo.wav\n```\n\nMKV:\n```bash\nmkvextract tracks foo.mkv 0:foo.ogg\nopusdec foo.ogg foo.wav\n```\n\n(1/3) Convert Audio to 16kHz mono\n---------------------------------\n\n```bash\nsox foo.wav -r 16000 -c 1 -b 16 foo_16m.wav\n```\n\n\n(2/3) Split Audio into Segments\n-------------------------------\n\nThis tool will use silence detection to find good cut-points. 
You may want to adjust\nits settings to achieve a good balance of short-segments but few words split in half.\n\n\n```bash\n./abook-segment.py foo_16m.wav\n```\n\nsettings:\n\n```bash\n[guenter@dagobert speech]$ ./abook-segment.py -h\nUsage: abook-segment.py [options] foo.wav\n\nOptions:\n  -h, --help            show this help message and exit\n  -s SILENCE_LEVEL, --silence-level=SILENCE_LEVEL\n                        silence level (default: 2048 / 65536)\n  -l MIN_SIL_LENGTH, --min-sil-length=MIN_SIL_LENGTH\n                        minimum silence length (default:  0.07s)\n  -m MIN_UTT_LENGTH, --min-utt-length=MIN_UTT_LENGTH\n                        minimum utterance length (default:  2.00s)\n  -M MAX_UTT_LENGTH, --max-utt-length=MAX_UTT_LENGTH\n                        maximum utterance length (default:  9.00s)\n  -o OUTDIRFN, --out-dir=OUTDIRFN\n                        output directory (default: abook/segments)\n  -v, --verbose         enable debug output\n```\n\nby default, the resulting segments will end up in abook/segments\n\n(3/3) Transcribe Audio\n----------------------\n\nThe transcription tool supports up to two speakers which you can specify on the command line.\nThe resulting voxforge-packages will end up in abook/out by default.\n\n\n```bash\n./abook-transcribe.py -s speaker1 -S speaker2 abook/segments/\n```\n\nAudiobook Segmentation and Transcription (kaldi)\n================================================\n\nSome notes on how to segment and transcribe semi-automatically audiobooks or other audio sources (e.g. from librivox) using\nkaldi:\n\nDirectory Layout\n----------------\n\nOur scripts rely on a fixed directory layout. As segmentation of librivox recordings is one of the main\napplications of these scripts, their terminology of books and sections is used here. 
For each section of \na book two source files are needed: a wave file containing the audio and a text file containing the transcript.\n\nA fixed naming scheme is used for those which is illustrated by this example:\n\n\u003cpre\u003e\nabook/in/librivox/11442-toten-Seelen/evak-11442-toten-Seelen-1.txt\nabook/in/librivox/11442-toten-Seelen/evak-11442-toten-Seelen-1.wav\nabook/in/librivox/11442-toten-Seelen/evak-11442-toten-Seelen-2.txt\nabook/in/librivox/11442-toten-Seelen/evak-11442-toten-Seelen-2.wav\n...\n\u003c/pre\u003e\n\nThe `abook-librivox.py` script is provided to help with retrieval of librivox recordings and setting up the\ndirectory structure. Please note that for now, the tool will not retrieve transcripts automatically but\nwill create empty .txt files (according to the naming scheme) which you will have to fill in manually.\n\nThe tool will convert the retrieved audio to 16kHz mono wav format as required by the segmentation scripts, however.\nIf you intend to segment material from other sources, make sure to convert it to that format. For suggestions on\nwhat tools to use for this step, please refer to the manual segmentation instructions in the previous section.\n\n*NOTE*: As the kaldi process is parallelized for mass-segmentation, at least 4\naudio and prompt files are needed for the process to work.\n\n(1/4) Preprocess the Transcript\n-------------------------------\n\nThis tool will tokenize the transcript and detect OOV tokens. 
Those can then be either\nreplaced or added to the dictionary:\n\n```bash\n./abook-preprocess-transcript.py abook/in/librivox/11442-toten-Seelen/evak-11442-toten-Seelen-1.txt\n```\n\n(2/4) Model adaptation\n----------------------\n\nFor the automatic segmentation to work, we need a GMM model that is adapted to the current dictionary (which likely had\nto be expanded during transcript preprocessing) plus uses a language model that covers the prompts.\n\nFirst, we create a language model tuned for our purpose:\n\n```bash\n./abook-sentences.py abook/in/librivox/11442-toten-Seelen/*.prompt\n./speech_build_lm.py abook_lang_model abook abook abook parole_de\n```\n\nNow we can create an adapted model using this language model and our current dict:\n\n```bash\n./speech_kaldi_adapt.py data/models/kaldi-generic-de-tri2b_chain-latest dict-de.ipa data/dst/lm/abook_lang_model/lm.arpa abook-de\npushd data/dst/asr-models/kaldi/abook-de\n./run-adaptation.sh\npopd\n./speech_dist.sh -c abook-de kaldi adapt\ntar xfvJ data/dist/asr-models/kaldi-abook-de-adapt-current.tar.xz -C data/models/\n```\n\n(3/4) Auto-Segment using Kaldi\n------------------------------\n\nNext, we need to create the kaldi directory structure and files for auto-segmentation:\n\n```bash\n./abook-kaldi-segment.py data/models/kaldi-abook-de-adapt-current abook/in/librivox/11442-toten-Seelen\n```\n\nnow we can run the segmentation:\n\n```bash\npushd data/dst/speech/asr-models/kaldi/segmentation\n./run-segmentation.sh \npopd\n```\n\n(4/4) Retrieve Segmentation Result\n----------------------------------\n\nFinally, we can retrieve the segmentation result in voxforge format:\n\n```bash\n./abook-kaldi-retrieve.py abook/in/librivox/11442-toten-Seelen/\n```\n\nTraining Voices for Zamia-TTS\n=============================\n\nZamia-TTS is an experimental project that tries to train TTS voices based on (reviewed) Zamia-Speech data. 
Downloads here:\n\nhttps://goofy.zamia.org/zamia-speech/tts/\n\nTacotron 2\n----------\n\nThis section describes how to train voices for [NVIDIA's Tacotron 2 implementation](https://github.com/NVIDIA/tacotron2). \nThe resulting voices will have a sample rate of 16kHz as that is the default\nsample rate used for Zamia Speech ASR model training. This means that you will have to use a 16kHz waveglow model which you can find, along with pretrained voices and sample wavs here:\n\nhttps://goofy.zamia.org/zamia-speech/tts/tacotron2/\n\nnow with that out of the way, Tacotron 2 model training is pretty straightforward. First step is to export filelists for the voice you'd like to train, e.g.:\n\n```bash\n./speech_tacotron2_export.py -l en -o ../torch/tacotron2/filelists m_ailabs_en mailabselliotmiller\n```\nnext, change into your Tacotron 2 training directory\n\n```bash\ncd ../torch/tacotron2\n```\n\nand specify file lists, sampling rate and batch size in ''hparams.py'':\n\n```\ndiff --git a/hparams.py b/hparams.py\nindex 8886f18..75e89c9 100644\n--- a/hparams.py\n+++ b/hparams.py\n@@ -25,15 +25,19 @@ def create_hparams(hparams_string=None, verbose=False):\n         # Data Parameters             #\n         ################################\n         load_mel_from_disk=False,\n-        training_files='filelists/ljs_audio_text_train_filelist.txt',\n-        validation_files='filelists/ljs_audio_text_val_filelist.txt',\n-        text_cleaners=['english_cleaners'],\n+        training_files='filelists/mailabselliotmiller_train_filelist.txt',\n+        validation_files='filelists/mailabselliotmiller_val_filelist.txt',\n+        text_cleaners=['basic_cleaners'],\n \n         ################################\n         # Audio Parameters             #\n         ################################\n         max_wav_value=32768.0,\n-        sampling_rate=22050,\n+        #sampling_rate=22050,\n+        sampling_rate=16000,\n         filter_length=1024,\n         hop_length=256,\n         
win_length=1024,\n@@ -81,7 +85,8 @@ def create_hparams(hparams_string=None, verbose=False):\n         learning_rate=1e-3,\n         weight_decay=1e-6,\n         grad_clip_thresh=1.0,\n-        batch_size=64,\n+        # batch_size=64,\n+        batch_size=16,\n         mask_padding=True  # set model's padded outputs to padded values\n     )\n```\n\nand start the training:\n\n```bash\npython train.py --output_directory=elliot --log_directory=elliot/logs\n```\n\nTacotron\n--------\n\n* (1/2) Prepare a training data set\n\n```bash\n./ztts_prepare.py -l en m_ailabs_en mailabselliotmiller elliot\n```\n\n* (2/2) Run the training\n\n```bash\n./ztts_train.py -v elliot 2\u003e\u00261 | tee train_elliot.log\n```\n\nModel Distribution\n==================\n\nTo build tarballs from models, use the `speech_dist.sh` script, e.g.:\n\n\n```bash\n./speech_dist.sh generic-en kaldi tdnn_sp\n```\n\nLicense\n=======\n\nMy own scripts as well as the data I create (i.e. lexicon and transcripts) are\nLGPLv3 licensed unless otherwise noted in the scripts' copyright headers.\n\nSome scripts and files are based on works of others; in those cases it is my\nintention to keep the original license intact. Please make sure to check the\ncopyright headers inside for more information.\n\nAuthors\n=======\n\n* Guenter Bartsch \u003cguenter@zamia.org\u003e\n* Marc Puels \u003cmarc@zamia.org\u003e\n* Paul Guyot \u003cpguyot@kallisys.net\u003e\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooofy%2Fzamia-speech","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgooofy%2Fzamia-speech","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooofy%2Fzamia-speech/lists"}