Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Botium Speech Processing
https://github.com/codeforequity-at/botium-speech-processing
- Host: GitHub
- URL: https://github.com/codeforequity-at/botium-speech-processing
- Owner: codeforequity-at
- License: mit
- Created: 2019-12-03T21:06:46.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2024-10-11T14:25:01.000Z (4 months ago)
- Last Synced: 2025-02-07T12:07:55.620Z (8 days ago)
- Topics: botium, speech-to-text, text-to-speech
- Language: JavaScript
- Homepage: https://www.botium.at
- Size: 567 KB
- Stars: 944
- Watchers: 17
- Forks: 58
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- awesome-github-repos - codeforequity-at/botium-speech-processing - Botium Speech Processing (JavaScript)
README
# Botium Speech Processing
[![license](https://img.shields.io/github/license/mashape/apistatus.svg)]()
[![pullrequests](https://img.shields.io/badge/PR-welcome-green.svg)]()
[![awesome](https://img.shields.io/badge/Awesome-for%20sure!-green.svg)]()

Botium Speech Processing is a unified, developer-friendly API to the best available free and Open-Source Speech-To-Text and Text-To-Speech services.
## What is it?
Botium Speech Processing is a *get-shit-done*-style Open-Source software stack. The configuration options are rudimentary: it is highly opinionated about the included tools and just gets the shit done.
* With [Kaldi](https://kaldi-asr.org/), reasonable speech recognition performance is achievable with freely available data sources.
* [MaryTTS](http://mary.dfki.de/) is currently the best freely available speech synthesis software.
* [SoX](http://sox.sourceforge.net/) is the _swiss army knife_ for audio file processing.

While the included tools in most cases cannot compete with the big cloud-based products, for lots of applications the trade-off between price and quality is at least reasonable.
**Read about the project history [here](https://chatbotslife.com/full-blown-open-source-speech-processing-server-available-on-github-4fb88a54d338)**
### Possible Applications
Some examples of what you can do with this:
* Synthesize audio tracks for Youtube tutorials
* Build voice-enabled chatbot services (for example, IVR systems)
* see the [Rasa Custom Voice Channel](./connectors/rasa)
* Classification of audio file transcriptions
* [Automated Testing](https://wiki.botiumbox.com/how-to-guides/voice-app-testing/) of Voice services with [Botium](https://botium.ai)

## Installation
### Software and Hardware Requirements
* 8GB of RAM (accessible for Docker) and 40GB free HD space (for full installation)
* Internet connectivity
* [docker](https://docs.docker.com/)
* [docker-compose](https://docs.docker.com/compose/)

_Note: memory usage can be reduced if only one language for Kaldi is required - the default configuration comes with two languages._
### Full Installation (Prebuilt Docker Images)
Clone or download this repository and start with docker-compose:
> docker-compose up -d
This will download the latest released prebuilt images from Docker Hub. To download the latest developer images instead:
> docker-compose --env-file .env.develop up
Point your browser to http://127.0.0.1 to open the [Swagger UI](https://swagger.io/tools/swagger-ui/) and browse/use the API definition.
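To check that everything came up, you can inspect the containers and follow the logs while the speech models are loading (a sketch; the _frontend_ service name is taken from the docker-compose files in this repository, other service names may differ):

```shell
# List the containers started by docker-compose and check their state
docker-compose ps

# Follow the logs of the frontend service (assumed service name)
docker-compose logs -f frontend
```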
### Slim Cloud-Specific Installation (Prebuilt Docker Images)
For the major cloud providers there are additional docker-compose files. The installation is slimmer when using those, as only the *frontend* service is required. For instance, add your Azure subscription key and Azure region key to the file *docker-compose-azure.yml* and start the services:
> docker-compose -f docker-compose-azure.yml up -d
### Optional: Build Docker Images
You can optionally build your own docker images (if you made any changes in this repository, for instance to download the latest version of a model). Clone or download this repository and run docker-compose:
> docker-compose -f docker-compose-dev.yml up -d
This will take some time to build.
### Configuration
This repository includes a reasonable default configuration:
* Use MaryTTS for TTS
* Use Kaldi for STT
* Use SoX for audio file conversion
* Languages included:
  * German
  * English

Configuration changes are made with [environment variables](./frontend/resources/.env). See the comments in this file.
**Recommendation:** Do not change the _.env_ file but create a _.env.local_ file to overwrite the default settings. This prevents trouble on future _git pull_ runs.
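A minimal _.env.local_ might look like this. _BOTIUM_API_TOKENS_ is documented below; any other variable from _.env_ can be overridden the same way (the token value here is a made-up example):

```shell
# .env.local - local overrides, kept out of the way of git pull
# Restrict API access to this (hypothetical) token
BOTIUM_API_TOKENS=my-secret-token
```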
### Request-Specific Configuration
If there is a JSON-formatted request body, or a multipart request body, certain sections are considered:
* **credentials** to override the server default credentials for cloud services
* **config** to override the server default settings for the cloud API calls

*See samples below*
### Securing the API
The environment variable _BOTIUM_API_TOKENS_ contains a list of valid API tokens accepted by the server (separated by whitespace or comma). The HTTP header _BOTIUM_API_TOKEN_ is validated on each API call.
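A client then sends the token with every request, for example (a sketch assuming the default setup on 127.0.0.1 and a configured token _my-secret-token_):

```shell
# Hypothetical token; must be listed in BOTIUM_API_TOKENS on the server
curl -X GET "http://127.0.0.1/api/tts/en?text=hello%20world" \
  -H "BOTIUM_API_TOKEN: my-secret-token" \
  -o tts.wav
```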
### Caching
For performance improvements, the results of the speech-to-text and text-to-speech calls are cached (by MD5 hash of the audio content or input text). To force reprocessing, empty the cache directories:
* frontend/resources/.cache/stt
* frontend/resources/.cache/tts

## Testing
Point your browser to http://127.0.0.1/ to open Swagger UI to try out the API.
Point your browser to http://127.0.0.1/dictate/ to open a rudimentary [dictate.js-interface](https://github.com/Kaljurand/dictate.js) for testing speech recognition (_for Kaldi only_)
_Attention: in Google Chrome this only works with services published via HTTPS; you will have to take care of this yourself. For example, you could publish it via an ngrok tunnel._
Point your browser to http://127.0.0.1/tts to open a MaryTTS interface for testing speech synthesis.
### Real Time API
It is possible to stream audio for real-time decoding: call the **/api/sttstream/{language}** endpoint to open a websocket stream; it returns three URLs:
* wsUri - the websocket URL to stream your audio to. By default, it accepts wav-formatted audio chunks
* statusUri - check if the stream is still open
* endUri - end audio streaming and close the websocket

## File System Watcher
Place audio files in these folders to receive the transcript in the folder _watcher/stt_output_:
* watcher/stt_input_de
* watcher/stt_input_en

Place text files in these folders to receive the synthesized speech in the folder _watcher/tts_output_:
* watcher/tts_input_de
* watcher/tts_input_en

## API Definition
See [swagger.json](./frontend/src/swagger.json):
* HTTP POST to **/api/stt/{language}** for Speech-To-Text
> curl -X POST "http://127.0.0.1/api/stt/en" -H "Content-Type: audio/wav" -T sample.wav
* HTTP POST to **/api/stt/{language}** for Speech-To-Text with Google, including credentials
> curl -X POST "http://127.0.0.1/api/stt/en-US?stt=google" -F "google={\"credentials\": {\"private_key\": \"xxx\", \"client_email\": \"xxx\"}}" -F [email protected]
* HTTP POST to **/api/stt/{language}** for Speech-To-Text with Google, including switch to MP3 encoding
> curl -X POST "http://127.0.0.1/api/stt/en-US?stt=google" -F "google={\"config\": {\"encoding\": \"MP3\"}}" -F [email protected]
* HTTP POST to **/api/stt/{language}** for Speech-To-Text with IBM, including credentials
> curl -X POST "http://127.0.0.1/api/stt/en-US?stt=ibm" -F "ibm={\"credentials\": {\"apikey\": \"xxx\", \"serviceUrl\": \"xxx\"}}" -F [email protected]
* HTTP GET to **/api/tts/{language}?text=...** for Text-To-Speech
> curl -X GET "http://127.0.0.1/api/tts/en?text=hello%20world" -o tts.wav
* HTTP POST to **/api/convert/{profile}** for audio file conversion
> curl -X POST "http://127.0.0.1/api/convert/mp3tomonowav" -H "Content-Type: audio/mp3" -T sample.mp3 -o sample.wav
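The endpoints combine naturally. For instance, a simple round trip that synthesizes a text and transcribes it again (a sketch assuming the default local setup):

```shell
# Synthesize speech, then feed the result back into speech recognition
curl -X GET "http://127.0.0.1/api/tts/en?text=hello%20world" -o tts.wav
curl -X POST "http://127.0.0.1/api/stt/en" -H "Content-Type: audio/wav" -T tts.wav
```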
## Contributing
_To be done: contribution guidelines._
We are open to any kind of contributions and are happy to discuss, review and merge pull requests.
## Big Thanks
This project is standing on the shoulders of giants.
* **[Kaldi GStreamer server](https://github.com/alumae/kaldi-gstreamer-server)** and **[Docker images](https://github.com/jcsilva/docker-kaldi-gstreamer-server)**
* **[MaryTTS](http://mary.dfki.de/)**
* **SVOX Pico Text-to-Speech**
* **[Kaldi](https://kaldi-asr.org/)**
* **[Kaldi Tuda Recipe](https://github.com/uhh-lt/kaldi-tuda-de)**
* **[Zamia Speech](https://github.com/gooofy/zamia-speech)**
* **[Deepspeech](https://github.com/mozilla/DeepSpeech)** and **[Deepspeech German](https://github.com/AASHISHAG/deepspeech-german)**
* **[SoX](http://sox.sourceforge.net/)**
* **[dictate.js](https://github.com/Kaljurand/dictate.js)**

## Changelog
### 2022-03-06
* Voice effects to consider audio file length

### 2022-02-28
* Applied Security Best Practices (not run as root user)
### 2022-01-12
* Added support for Azure Speech Services
### 2021-12-07
* Added endpoints for streaming audio and responses
### 2021-12-01
* Added option to hand over cloud credentials in request body
### 2021-01-26
* Added several profiles for adding noise or other audio artifacts to your files
* Added custom channel for usage with Rasa

### 2020-12-18
* Adding support for Google Text-To-Speech
* Adding support for listing and using available TTS voices
* Added sample docker-compose configurations for PicoTTS and Google

### 2020-03-05
* Optional _start_/_end_ parameters for audio file conversion to trim an audio file by time codes formatted as mm:ss (_01:32_)
### 2020-02-22
* Additional endpoint to calculate the Word Error Rate (Levenshtein Distance) between two texts
* When giving the _hint_-parameter with the expected text to the STT-endpoint, the Word Error Rate will be calculated and returned
* When multiple STT- or TTS-engines are configured, select the one to use with the _stt_ or _tts_ parameter (in combination with the Word Error Rate calculation, this is useful for comparing the performance of two engines)

### 2020-01-31
* Using pre-trained models from [Zamia Speech](https://github.com/gooofy/zamia-speech) for speech recognition
* Using latest [Kaldi](https://kaldi-asr.org/) build
* Added _file system watcher_ to transcribe and synthesize audio files