
# Voice Assistant

An advanced voice assistant system leveraging ESP32 hardware and locally hosted ("cloud") infrastructure to deliver responsive, accurate speech recognition and command processing with minimal latency.

## Overview

This project implements a distributed voice assistant system with components running on both edge devices (ESP32 Korvo 1) and server infrastructure. It features wake-word detection, voice command recognition, and natural language processing capabilities through a sophisticated pipeline of services.

### Disclaimer

This was built entirely as a proof of concept/personal project and is not intended for use as a production system.

I saw a neat piece of hardware ([Korvo 1](https://www.espressif.com/en/products/devkits/esp32-korvo-du1906)). One thing led to another and I ended up building this.

## Architecture

### Hardware Components

- **Edge Device**: ESP32-based Korvo 1 development board with integrated microphone array for voice capture and initial processing
- **Server**: GPU-accelerated machine with NVIDIA 8GB graphics card for running speech recognition models

### Software Components

- **ESP32 Firmware** (voice_assistant): Handles audio capture, wake word detection, and preliminary command recognition
- **gRPC Client/Server**: Facilitates communication between edge devices and server components [inspired by schuettc/hosted-whisper-streaming](https://github.com/schuettc/hosted-whisper-streaming/blob/main/client/index.ts)
- **Whisper Server**: Containerized deployment of Faster-Whisper for high-quality speech-to-text conversion
- **Phrase Converter**: Converts text strings into structured commands for processing (Modified/expanded version of the [Espressif phrase converter](https://github.com/espressif/esp-sr/blob/afeade1ffcd8f48a198f4b944dc2db4b7f05e96c/tool/multinet_g2p.py))
- **MQTT Broker**: EMQX running in Docker for message distribution across system components
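
As a rough illustration of how these components might exchange messages over the broker, the sketch below frames an audio chunk from the device and decodes a transcription result from the Whisper side. The message fields and shapes are assumptions for illustration, not the project's actual wire format:

```python
import base64
import json

def frame_audio_chunk(seq, pcm_bytes):
    """Wrap a raw PCM chunk from the ESP32 into a JSON message for the broker.
    Field names ("type", "seq", "data") are hypothetical."""
    return json.dumps({
        "type": "audio",
        "seq": seq,
        "data": base64.b64encode(pcm_bytes).decode("ascii"),
    })

def parse_transcript(payload):
    """Decode a transcription result published by the Whisper server side."""
    msg = json.loads(payload)
    return msg.get("text", "").strip().lower()
```

Base64-encoding the PCM keeps binary audio safe inside a JSON envelope; a real deployment could just as well publish raw bytes on a dedicated topic.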

## Technology Stack

### Embedded Development

- ESP-IDF (Espressif IoT Development Framework)
- C/C++ for firmware development
- ESP Audio Front End (AFE) for audio processing
- ESP Speech Commands recognition engine

### Server-side Technologies

- Python for server components and NLP processing
- gRPC for efficient client-server communication
- Faster-Whisper for state-of-the-art speech recognition
- Docker for containerization and deployment
- EMQX MQTT broker for message handling

## Other Tools and Helpers

### hacli (Home Assistant CLI): Found in /systems/hacli

A command-line interface for interacting with Home Assistant:

- Allows querying devices, entities, areas, services, and states
- Uses Python's `subprocess` to interact with the Home Assistant CLI
- Can be used as an Ollama function-calling tool
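
A minimal sketch of those two roles, assuming the underlying CLI binary is `hass-cli` and that the subcommand layout below matches how hacli shells out (both are assumptions):

```python
import json
import subprocess

def build_hass_cli_args(kind):
    """Build the argv for a hass-cli listing query; `kind` is e.g. "entity"
    or "area". The exact subcommands are assumptions for illustration."""
    return ["hass-cli", "--output", "json", kind, "list"]

def query_home_assistant(kind):
    """Run the CLI via subprocess and decode its JSON output
    (requires hass-cli on PATH and a configured Home Assistant token)."""
    result = subprocess.run(
        build_hass_cli_args(kind), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)

# A tool definition in the shape Ollama's function calling expects, so the
# model can decide to invoke the query above. Names here are hypothetical.
HACLI_TOOL = {
    "type": "function",
    "function": {
        "name": "query_home_assistant",
        "description": "List Home Assistant entities, areas, services, or states",
        "parameters": {
            "type": "object",
            "properties": {
                "kind": {
                    "type": "string",
                    "enum": ["entity", "area", "service", "state"],
                }
            },
            "required": ["kind"],
        },
    },
}
```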

### mqttcmd: Found in /systems/mqttcmd

Command-line utility for sending MQTT commands to devices. Supports configuring audio streaming, setting log levels, and triggering feedback tones.
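
The kinds of messages such a utility publishes might look like the following. The `cmd` field and parameter names are invented for illustration, not mqttcmd's actual wire format:

```python
import json

def make_cmd(action, **params):
    """Build a JSON command message like those mqttcmd might publish
    to a device topic. Field names are hypothetical."""
    return json.dumps({"cmd": action, **params})

# Example commands mirroring the capabilities listed above.
set_log_level = make_cmd("set_log_level", level="debug")
start_stream = make_cmd("audio_stream", enabled=True)
play_tone = make_cmd("feedback_tone", tone="ack")
```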

### Phrase Converter:

Modified/expanded version of the [Espressif phrase converter](https://github.com/espressif/esp-sr/blob/afeade1ffcd8f48a198f4b944dc2db4b7f05e96c/tool/multinet_g2p.py)

Text phrase to phonetic string converter for ultra low latency on-device processing
Exports formatted results to ESP32 via MQTT broker
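
A toy sketch of the idea: map each word of a phrase to a phonetic token and package the results for the device. The mapping table below is entirely invented; the real converter derives its phonetic strings from Espressif's multinet G2P tables:

```python
import json

# Invented word-to-phoneme table for illustration only -- the real tool
# uses Espressif's grapheme-to-phoneme data, not a hand-written dict.
G2P = {"turn": "TkN", "on": "nN", "off": "eF", "the": "jc", "light": "LiT"}

def phrase_to_phonemes(phrase):
    """Convert a text phrase to a space-separated phonetic string;
    unknown words pass through unchanged."""
    return " ".join(G2P.get(word, word) for word in phrase.lower().split())

def export_payload(phrases):
    """Format converted phrases as a JSON blob (hypothetical shape)
    suitable for sending to the ESP32 over MQTT."""
    return json.dumps({str(i): phrase_to_phonemes(p) for i, p in enumerate(phrases)})
```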

### One-offs

.wav tone generator
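
A tone generator like this can be done with the standard library alone. The sketch below writes a mono 16-bit PCM sine tone (the default frequency, duration, and sample rate are arbitrary choices, not the project's actual values):

```python
import math
import struct
import wave

def write_tone(path, freq_hz=880.0, duration_s=0.25, rate=16000, amplitude=0.5):
    """Write a mono 16-bit PCM sine-wave .wav, e.g. a device feedback beep."""
    n_frames = int(rate * duration_s)
    samples = (
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / rate))
        for i in range(n_frames)
    )
    frames = b"".join(struct.pack("<h", s) for s in samples)  # little-endian int16
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)     # mono
        wf.setsampwidth(2)     # 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(frames)

if __name__ == "__main__":
    write_tone("ack.wav")  # short acknowledgement beep
```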

### General Workflow

Wake word detection (ESP32) ->

- Voice match found on-device -> Home Assistant -> Command execution

- Voice match not found on-device -> Audio stream -> MQTT -> Whisper Server -> Transcription -> Websocket -> Home Assistant -> Command execution

Upon startup, the gRPC server (the main orchestrator of the system) queries Home Assistant for a list of all devices, then generates a list of commands based on the devices known to work with the Home Assistant intent processor; these commands can be executed without additional processing.

It then sends the list of commands to the ESP32 via MQTT for on-device processing, which allows extremely quick responses to known voice commands.

If a command is received that is not on the list, the audio is streamed via MQTT to the gRPC server and forwarded to the Whisper server for transcription. (Streaming begins as soon as the wake word is detected, but is cancelled if a match is found on-device.)

Via a websocket connection from the gRPC server to Home Assistant, the streamed transcription is checked for potential command matches by the Home Assistant intent processor.

If a match is found, it will then be executed via Home Assistant.
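
The command-list generation step of this startup flow can be sketched roughly as below. The intent templates and payload shape are placeholders, not the actual templates the intent processor supports:

```python
import json

# Placeholder intent templates -- the real list would come from what the
# Home Assistant intent processor is known to handle.
TEMPLATES = ["turn on {name}", "turn off {name}"]

def generate_commands(devices):
    """Expand intent templates over device names queried from Home Assistant."""
    return [t.format(name=d) for d in devices for t in TEMPLATES]

def command_list_payload(devices):
    """JSON payload (hypothetical shape) pushed to the ESP32 over MQTT
    for fast on-device command matching."""
    return json.dumps({"commands": generate_commands(devices)})
```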

## Installation and Setup

### Prerequisites

- ESP-IDF development environment
- Docker and Docker Compose
- NVIDIA GPU with CUDA support
- Python 3.8+ for server components

### ESP32 Firmware Setup

1. Clone the repository:

```bash
git clone https://github.com/instance-id/esp32/assistant.git
cd esp32/assistant/voice_assistant
```

2. Configure the project:

```bash
idf.py menuconfig
```

Set the appropriate WiFi credentials and server details under the project configuration.

3. Build and flash:
```bash
idf.py build
idf.py -p [PORT] flash
```

### Server Components Setup

1. Set up the Whisper Server:

```bash
cd ../grpc_client_server/whisperServer
docker compose -f compose.gpu.yaml up -d
```

2. Start the MQTT broker:

```bash
docker run -d --name emqx -p 1883:1883 -p 8083:8083 -p 8084:8084 -p 8883:8883 -p 18083:18083 emqx/emqx
```

3. Configure and run the phrase converter:
```bash
cd ../phrase_converter
cp .env.example .env
# Edit .env with appropriate settings
./run.sh
```

## Key Features

- **Real-time Voice Processing**: Low-latency audio capture and processing on the ESP32
- **Wake Word Detection**: Custom wake word detection using ESP-SR on the edge device
- **Voice Command Recognition**: On-device recognition of common commands for quick response times
- **Advanced Speech-to-Text**: Server-side processing using Faster-Whisper for complex queries
- **Extensible Command System**: Easy addition of new voice commands via the API
- **Multi-language Support**: Capable of processing commands in multiple languages
- **Distributed Architecture**: Optimized workload distribution between edge and cloud

## Implementation Details

- **Button Controls**: Physical buttons for volume control and manual activation
- **LED Feedback**: Visual indicators for system status
- **Secure Communication**: HTTPS and secure MQTT options for sensitive deployments
- **Persistent Storage**: NVS Flash for storing configuration and command data
- **Robust Error Handling**: Comprehensive error detection and recovery mechanisms
- **Low Power Design**: Efficient power usage for battery-operated deployments

## Development

The project uses a modular architecture that allows for independent development of components:

- Use the `justfile` for common development tasks
- Follow the C style guide for firmware contributions
- Python components use uv for dependency management