https://github.com/rhasspy/wyoming
Peer-to-peer protocol for voice assistants
- Host: GitHub
- URL: https://github.com/rhasspy/wyoming
- Owner: rhasspy
- License: mit
- Created: 2023-09-29T16:37:53.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-08-09T20:41:47.000Z (9 months ago)
- Last Synced: 2024-09-04T10:45:05.226Z (8 months ago)
- Language: Python
- Size: 77.1 KB
- Stars: 103
- Watchers: 6
- Forks: 17
- Open Issues: 12
- Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.md
# Wyoming Protocol
A peer-to-peer protocol for voice assistants (basically [JSONL](https://jsonlines.org/) + PCM audio)
``` text
{ "type": "...", "data": { ... }, "data_length": ..., "payload_length": ... }\n
```
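For example, the framing can be produced in a few lines of Python. This is an illustrative sketch, not the API of the `wyoming` package; `write_event` is a made-up helper name:

```python
import json

def write_event(stream, event_type, data=None, payload=b""):
    """Frame one event: a JSON header line, then an optional binary payload."""
    header = {"type": event_type}
    if data is not None:
        header["data"] = data
    if payload:
        header["payload_length"] = len(payload)
    # The header is a single UTF-8 JSON object terminated by "\n"
    stream.write(json.dumps(header).encode("utf-8") + b"\n")
    if payload:
        stream.write(payload)
```

Reading is symmetric: parse the header line as JSON, then read exactly `data_length` and `payload_length` more bytes.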
Used in [Rhasspy](https://github.com/rhasspy/rhasspy3/) and [Home Assistant](https://www.home-assistant.io/integrations/wyoming) for communication with voice services.
[Open Home Foundation](https://www.openhomefoundation.org/)
## Wyoming Projects
* Voice satellites
* [Satellite](https://github.com/rhasspy/wyoming-satellite) for Home Assistant
* Audio input/output
* [mic-external](https://github.com/rhasspy/wyoming-mic-external)
* [snd-external](https://github.com/rhasspy/wyoming-snd-external)
* [SDL2](https://github.com/rhasspy/wyoming-sdl2)
* Wake word detection
* [openWakeWord](https://github.com/rhasspy/wyoming-openwakeword)
* [porcupine1](https://github.com/rhasspy/wyoming-porcupine1)
* [snowboy](https://github.com/rhasspy/wyoming-snowboy)
* [microWakeWord](https://github.com/rhasspy/wyoming-microwakeword)
* Speech-to-text
* [Faster Whisper](https://github.com/rhasspy/wyoming-faster-whisper)
* [Vosk](https://github.com/rhasspy/wyoming-vosk)
* [Whisper.cpp](https://github.com/rhasspy/wyoming-whisper-cpp)
* Text-to-speech
* [Piper](https://github.com/rhasspy/wyoming-piper)
* Intent handling
* [handle-external](https://github.com/rhasspy/wyoming-handle-external)

## Format
1. A JSON object header as a single line ending with `\n` (UTF-8, required)
* `type` - event type (string, required)
* `data` - event data (object, optional)
* `data_length` - bytes of additional data (int, optional)
* `payload_length` - bytes of binary payload (int, optional)
2. Additional data (UTF-8, optional)
* JSON object with additional event-specific data
* Merged on top of header `data`
* Exactly `data_length` bytes long
* Immediately follows header `\n`
3. Payload
* Typically PCM audio but can be any binary data
* Exactly `payload_length` bytes long
* Immediately follows additional data or header `\n` if no additional data

## Event Types
Available events with `type` and fields.
### Audio
Send raw audio and indicate begin/end of audio streams.
* `audio-chunk` - chunk of raw PCM audio
* `rate` - sample rate in hertz (int, required)
* `width` - sample width in bytes (int, required)
* `channels` - number of channels (int, required)
* `timestamp` - timestamp of audio chunk in milliseconds (int, optional)
* Payload is raw PCM audio samples
* `audio-start` - start of an audio stream
* `rate` - sample rate in hertz (int, required)
* `width` - sample width in bytes (int, required)
* `channels` - number of channels (int, required)
* `timestamp` - timestamp in milliseconds (int, optional)
* `audio-stop` - end of an audio stream
* `timestamp` - timestamp in milliseconds (int, optional)
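A receiver for the audio events above can be sketched against the framing described under Format. `read_event` below is an illustrative helper, not the `wyoming` package API:

```python
import json

def read_event(stream):
    """Read one framed event: header line, optional extra data, optional payload."""
    header = json.loads(stream.readline().decode("utf-8"))
    data = header.get("data", {})
    if header.get("data_length"):
        # Additional data is merged on top of the header's "data"
        extra = json.loads(stream.read(header["data_length"]).decode("utf-8"))
        data.update(extra)
    payload = stream.read(header["payload_length"]) if header.get("payload_length") else b""
    return header["type"], data, payload
```

For an `audio-chunk`, the returned `data` carries `rate`, `width`, and `channels`, and the payload holds the raw PCM samples.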
### Info
Describe available services.
* `describe` - request for available voice services
* `info` - response describing available voice services
* `asr` - list speech recognition services (optional)
* `models` - list of available models (required)
* `name` - unique name (required)
* `languages` - supported languages by model (list of string, required)
* `attribution` (required)
* `name` - name of creator (required)
* `url` - URL of creator (required)
* `installed` - true if currently installed (bool, required)
* `description` - human-readable description (string, optional)
* `version` - version of the model (string, optional)
* `tts` - list text to speech services (optional)
* `models` - list of available models (required)
* `name` - unique name (required)
* `languages` - supported languages by model (list of string, required)
* `speakers` - list of speakers (optional)
* `name` - unique name of speaker (required)
* `attribution` (required)
* `name` - name of creator (required)
* `url` - URL of creator (required)
* `installed` - true if currently installed (bool, required)
* `description` - human-readable description (string, optional)
* `version` - version of the model (string, optional)
* `wake` - list wake word detection services (optional)
* `models` - list of available models (required)
* `name` - unique name (required)
* `languages` - supported languages by model (list of string, required)
* `attribution` (required)
* `name` - name of creator (required)
* `url` - URL of creator (required)
* `installed` - true if currently installed (bool, required)
* `description` - human-readable description (string, optional)
* `version` - version of the model (string, optional)
* `handle` - list intent handling services (optional)
* `models` - list of available models (required)
* `name` - unique name (required)
* `languages` - supported languages by model (list of string, required)
* `attribution` (required)
* `name` - name of creator (required)
* `url` - URL of creator (required)
* `installed` - true if currently installed (bool, required)
* `description` - human-readable description (string, optional)
* `version` - version of the model (string, optional)
* `intent` - list intent recognition services (optional)
* `models` - list of available models (required)
* `name` - unique name (required)
* `languages` - supported languages by model (list of string, required)
* `attribution` (required)
* `name` - name of creator (required)
* `url` - URL of creator (required)
* `installed` - true if currently installed (bool, required)
* `description` - human-readable description (string, optional)
* `version` - version of the model (string, optional)
* `satellite` - information about voice satellite (optional)
* `area` - name of area where satellite is located (string, optional)
* `has_vad` - true if the end of voice commands will be detected locally (boolean, optional)
* `active_wake_words` - list of wake words that are actively being listened for (list of string, optional)
* `max_active_wake_words` - maximum number of local wake words that can be run simultaneously (number, optional)
* `supports_trigger` - true if satellite supports remotely-triggered pipelines (bool, optional)
* `mic` - list of audio input services (optional)
* `mic_format` - audio input format (required)
* `rate` - sample rate in hertz (int, required)
* `width` - sample width in bytes (int, required)
* `channels` - number of channels (int, required)
* `snd` - list of audio output services (optional)
* `snd_format` - audio output format (required)
* `rate` - sample rate in hertz (int, required)
* `width` - sample width in bytes (int, required)
* `channels` - number of channels (int, required)
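As a concrete illustration, a minimal `info` response advertising one speech-to-text model could be built like this (all names and values below are made up):

```python
# Sketch of a minimal "info" event answering a "describe" request.
# Field names follow the Info section above; the values are illustrative.
info_event = {
    "type": "info",
    "data": {
        "asr": [
            {
                "models": [
                    {
                        "name": "example-model",
                        "languages": ["en"],
                        "attribution": {
                            "name": "Example Author",
                            "url": "https://example.com",
                        },
                        "installed": True,
                    }
                ]
            }
        ]
    },
}
```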
### Speech Recognition
Transcribe audio into text.
* `transcribe` - request to transcribe an audio stream
* `name` - name of model to use (string, optional)
* `language` - language of spoken audio (string, optional)
* `context` - context from previous interactions (object, optional)
* `transcript` - response with transcription
* `text` - text transcription of spoken audio (string, required)
* `context` - context for next interaction (object, optional)

### Text to Speech
Synthesize audio from text.
* `synthesize` - request to generate audio from text
* `text` - text to speak (string, required)
* `voice` - use a specific voice (optional)
* `name` - name of voice (string, optional)
* `language` - language of voice (string, optional)
* `speaker` - speaker of voice (string, optional)
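Putting these fields together, a `synthesize` request selecting a specific voice might look like the following sketch (the voice name and text are illustrative):

```python
import json

# Header line for a "synthesize" request; the voice name is made up.
synthesize_event = {
    "type": "synthesize",
    "data": {
        "text": "The weather is sunny today.",
        "voice": {"name": "example-voice", "language": "en", "speaker": "default"},
    },
}

# On the wire this becomes a single JSON line terminated by "\n"
line = json.dumps(synthesize_event) + "\n"
```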
### Wake Word
Detect wake words in an audio stream.
* `detect` - request detection of specific wake word(s)
* `names` - wake word names to detect (list of string, optional)
* `detection` - response when detection occurs
* `name` - name of wake word that was detected (string, optional)
* `timestamp` - timestamp of audio chunk in milliseconds when detection occurred (int, optional)
* `not-detected` - response when audio stream ends without a detection

### Voice Activity Detection
Detects speech and silence in an audio stream.
* `voice-started` - user has started speaking
* `timestamp` - timestamp of audio chunk when speaking started in milliseconds (int, optional)
* `voice-stopped` - user has stopped speaking
* `timestamp` - timestamp of audio chunk when speaking stopped in milliseconds (int, optional)
### Intent Recognition
Recognizes intents from text.
* `recognize` - request to recognize an intent from text
* `text` - text to recognize (string, required)
* `context` - context from previous interactions (object, optional)
* `intent` - response with recognized intent
* `name` - name of intent (string, required)
* `entities` - list of entities (optional)
* `name` - name of entity (string, required)
* `value` - value of entity (any, optional)
* `text` - response for user (string, optional)
* `context` - context for next interactions (object, optional)
* `not-recognized` - response indicating no intent was recognized
* `text` - response for user (string, optional)
* `context` - context for next interactions (object, optional)

### Intent Handling
Handle structured intents or text directly.
* `handled` - response when intent was successfully handled
* `text` - response for user (string, optional)
* `context` - context for next interactions (object, optional)
* `not-handled` - response when intent was not handled
* `text` - response for user (string, optional)
* `context` - context for next interactions (object, optional)

### Audio Output
Play audio stream.
* `played` - response when audio finishes playing
### Voice Satellite
Control of one or more remote voice satellites connected to a central server.
* `run-satellite` - informs satellite that server is ready to run pipelines
* `pause-satellite` - informs satellite that server is not ready anymore to run pipelines
* `satellite-connected` - satellite has connected to the server
* `satellite-disconnected` - satellite has been disconnected from the server
* `streaming-started` - satellite has started streaming audio to the server
* `streaming-stopped` - satellite has stopped streaming audio to the server

Pipelines are run on the server, but can be triggered remotely from the server as well.
* `run-pipeline` - runs a pipeline on the server or asks the satellite to run it when possible
* `start_stage` - pipeline stage to start at (string, required)
* `end_stage` - pipeline stage to end at (string, required)
* `wake_word_name` - name of detected wake word that started this pipeline (string, optional)
* From client only
* `wake_word_names` - names of wake words to listen for (list of string, optional)
* From server only
* `start_stage` must be "wake"
* `announce_text` - text to speak on the satellite
* From server only
* `start_stage` must be "tts"
* `restart_on_end` - true if the server should re-run the pipeline after it ends (boolean, default is false)
* Only used for always-on streaming satellites

### Timers
* `timer-started` - a new timer has started
* `id` - unique id of timer (string, required)
* `total_seconds` - number of seconds the timer should run for (int, required)
* `name` - user-provided name for timer (string, optional)
* `start_hours` - hours the timer should run for as spoken by user (int, optional)
* `start_minutes` - minutes the timer should run for as spoken by user (int, optional)
* `start_seconds` - seconds the timer should run for as spoken by user (int, optional)
* `command` - optional command that the server will execute when the timer is finished
* `text` - text of command to execute (string, required)
* `language` - language of the command (string, optional)
* `timer-updated` - timer has been paused/resumed or time has been added/removed
* `id` - unique id of timer (string, required)
* `is_active` - true if timer is running, false if paused (bool, required)
* `total_seconds` - number of seconds that the timer should run for now (int, required)
* `timer-cancelled` - timer was cancelled
* `id` - unique id of timer (string, required)
* `timer-finished` - timer finished without being cancelled
* `id` - unique id of timer (string, required)

## Event Flow
* → is an event from client to server
* ← is an event from server to client

### Service Description
1. → `describe` (required)
2. ← `info` (required)

### Speech to Text
1. → `transcribe` event with `name` of model to use or `language` (optional)
2. → `audio-start` (required)
3. → `audio-chunk` (required)
* Send audio chunks until silence is detected
4. → `audio-stop` (required)
5. ← `transcript`
* Contains text transcription of spoken audio

### Text to Speech
1. → `synthesize` event with `text` (required)
2. ← `audio-start`
3. ← `audio-chunk`
* One or more audio chunks
4. ← `audio-stop`

### Wake Word Detection
1. → `detect` event with `names` of wake words to detect (optional)
2. → `audio-start` (required)
3. → `audio-chunk` (required)
* Keep sending audio chunks until a `detection` is received
4. ← `detection`
* Sent for each wake word detection
5. → `audio-stop` (optional)
* Manually end audio stream
6. ← `not-detected`
* Sent after `audio-stop` if no detections occurred
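The wake word flow above can be sketched as a client loop. The code below is a minimal sketch, assuming a `server` transport object with `send()` and a non-blocking `poll()`; helper names and the `ok_nabu` wake word are illustrative, not the `wyoming` package API:

```python
import json

def frame(event_type, data=None, payload=b""):
    """Encode one event per the wire format: JSON header line + optional payload."""
    header = {"type": event_type}
    if data is not None:
        header["data"] = data
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode("utf-8") + b"\n" + payload

def detect_wake_word(server, chunks, rate=16000, width=2, channels=1):
    """Stream audio chunks until the server reports a detection (or None)."""
    fmt = {"rate": rate, "width": width, "channels": channels}
    server.send(frame("detect", {"names": ["ok_nabu"]}))   # wake word name is illustrative
    server.send(frame("audio-start", fmt))
    for chunk in chunks:
        server.send(frame("audio-chunk", fmt, chunk))
        event = server.poll()                # non-blocking check for a response
        if event and event["type"] == "detection":
            return event["data"]["name"]
    server.send(frame("audio-stop"))         # manually end the stream
    return None                              # expect "not-detected" from server
```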
### Voice Activity Detection

1. → `audio-chunk` (required)
* Send audio chunks until silence is detected
2. ← `voice-started`
* When speech starts
3. ← `voice-stopped`
* When speech stops
### Intent Recognition

1. → `recognize` (required)
2. ← `intent` if successful
3. ← `not-recognized` if not successful

### Intent Handling
For structured intents:
1. → `intent` (required)
2. ← `handled` if successful
3. ← `not-handled` if not successful

For text only:
1. → `transcript` with `text` to handle (required)
2. ← `handled` if successful
3. ← `not-handled` if not successful
### Audio Output

1. → `audio-start` (required)
2. → `audio-chunk` (required)
* One or more audio chunks
3. → `audio-stop` (required)
4. ← `played`
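The audio output flow maps directly onto the framing from the Format section. Below is a minimal client sketch, assuming a `server` object with `send()` and a blocking `recv()`; the helper names are illustrative, not the `wyoming` package API:

```python
import json

def frame(event_type, data=None, payload=b""):
    """Encode one event per the wire format: JSON header line + optional payload."""
    header = {"type": event_type}
    if data is not None:
        header["data"] = data
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode("utf-8") + b"\n" + payload

def play_audio(server, pcm, rate=16000, width=2, channels=1, chunk_size=1024):
    """Send audio-start, the PCM in chunks, audio-stop; wait for 'played'."""
    fmt = {"rate": rate, "width": width, "channels": channels}
    server.send(frame("audio-start", fmt))
    for i in range(0, len(pcm), chunk_size):
        server.send(frame("audio-chunk", fmt, pcm[i:i + chunk_size]))
    server.send(frame("audio-stop"))
    return server.recv()["type"] == "played"   # True once playback finished
```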