https://github.com/zkhan93/respeaker-openai-assistant
Pi Realtime Voice - A voice assistant for Raspberry Pi with local hotword detection and OpenAI Realtime API integration. Features event-driven architecture, ReSpeaker 4-Mic Array support, and bidirectional voice conversations.
- Host: GitHub
- URL: https://github.com/zkhan93/respeaker-openai-assistant
- Owner: zkhan93
- Created: 2025-12-27T04:12:23.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2026-02-24T21:25:57.000Z (4 months ago)
- Last Synced: 2026-02-25T01:50:51.443Z (4 months ago)
- Topics: edge-computing, iot, openai, python, raspberry-pi, realtime-api, respeaker, respeaker-4mics-array, speech-recognition, voice-ai, voice-assistant, wake-word-detection
- Language: Python
- Homepage:
- Size: 159 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Voice Assistant for ReSpeaker 4-Mic Array
A voice assistant service for Raspberry Pi using the ReSpeaker 4-Mic Array, with local hotword detection ("alexa") and OpenAI integration.
## Features
- **Local Hotword Detection**: openWakeWord for offline "alexa" wake word detection
- **ReSpeaker 4-Mic Array**: Full support for AC108 device (paInt16 mono)
- **OpenAI Integration**: Speech-to-text with Whisper, future Realtime API support
- **Real-Time Audio**: Multi-consumer architecture with callback-based capture
- **Event-Driven**: Pub-sub system for decoupled components
- **Modern Python**: Built with Python 3.11+ using `uv` package manager
## Quick Start
```bash
# 1. Install dependencies
export PATH="$HOME/.local/bin:$PATH"
uv sync
# 2. Configure
cp config/config.yaml.example config/config.yaml
nano config/config.yaml # Add your OpenAI API key
# 3. Download models
uv run voice-assistant download-models
# 4. Test event system (no OpenAI needed)
uv run voice-assistant test-events
# 5. Test speech-to-text (requires OpenAI API key)
uv run voice-assistant test-stt
```
## Hardware Requirements
- Raspberry Pi 4B (2GB or more)
- ReSpeaker 4-Mic Array for Raspberry Pi
- Internet connection for OpenAI API
## Installation
### System Dependencies
These are usually available on Raspberry Pi OS; install any that are missing with `apt`:
- `portaudio19-dev` - Audio I/O
- `libasound2-dev` - ALSA support
- `python3-dev` - Python headers
- `libffi-dev` - FFI library
### Python Setup
```bash
cd /home/pi/llm-assistant/voice-assistant
export PATH="$HOME/.local/bin:$PATH"
uv sync
```
### Configuration
```bash
cp config/config.yaml.example config/config.yaml
nano config/config.yaml
```
Add your OpenAI API key:
```yaml
openai:
  api_key: "sk-..." # Your actual API key
```
### Download Hotword Models
```bash
uv run voice-assistant download-models
```
### Verify Installation
```bash
uv run voice-assistant verify
```
## CLI Commands
### Core Commands
```bash
# Run the voice assistant (future)
uv run voice-assistant run [--log-level DEBUG]
# Show configuration
uv run voice-assistant config
# Verify setup
uv run voice-assistant verify
# Download hotword models
uv run voice-assistant download-models
```
### Test Commands
```bash
# Monitor all events in real-time (diagnostic tool)
uv run voice-assistant test-events
# Test speech-to-text (event-driven demo)
uv run voice-assistant test-stt
# Test OpenAI Realtime API (full voice conversation)
uv run voice-assistant test-realtime
# Test hotword detection (records 5s after "alexa")
uv run voice-assistant test-hotword [--debug]
# Test with native paInt16 mono (verification)
uv run voice-assistant test-hotword-native
# Test audio recording (15s capture & playback)
uv run voice-assistant record [--duration 15]
# Test audio hardware
uv run voice-assistant test-audio
```
**Recommended Testing Flow:**
1. **`test-events`** - See all events in real-time (no API key needed)
   - Verify hotword detection works
   - Check voice activity detection
   - Understand event timing
2. **`test-stt`** - Test full STT pipeline (requires API key)
   - Verifies OpenAI integration
   - Tests complete event-driven flow
   - **How it works:**
     - Say "alexa" → System starts recording
     - Continue speaking → Audio captured
     - Stop speaking (~1s pause) → Recorded audio sent to OpenAI for transcription
     - Wait 1-3 seconds → Transcription displayed
   - **Important:** Only speech after "alexa" is transcribed
     - If you speak without saying "alexa", it's ignored (by design)
   - **Multiple hotwords:** If you say "alexa" again before stopping, the recording restarts (allows correction/new command)
3. **`test-realtime`** - Test OpenAI Realtime API (requires API key)
   - Full bidirectional voice conversation with AI
   - **How it works:**
     - Say "alexa" → Connects to OpenAI Realtime API
     - Speak your question/command → Streams audio to OpenAI
     - Stop speaking (~1s pause) → AI processes and responds
     - **AI speaks back!** → Plays audio response through speakers
   - **Features:**
     - Real-time audio streaming (no waiting for transcription)
     - AI responds with voice (not just text)
     - Natural conversation flow
     - Say "alexa" again to interrupt/start new command
   - **Best for:** Interactive conversations, questions that need spoken responses
## Architecture
### Overview
```
┌───────────────────────────────────────────────────────────┐
│                 Event-Driven Architecture                 │
├───────────────────────────────────────────────────────────┤
│                                                           │
│  Audio Stream (Callback Thread)                           │
│         ↓                                                 │
│  ┌────────────────────┐                                   │
│  │   AudioHandler     │  Emits VAD events                 │
│  │  (Producer + VAD)  │  Broadcasts audio to:             │
│  └─────┬──────────┬───┘   • hotword_queue (skip-ahead)    │
│        │          │       • audio_queue (buffered)        │
│        │ VAD      │ Audio                                 │
│        │ Events   │ Frames                                │
│        ↓          ↓                                       │
│  ┌─────────┐  ┌──────────────────┐                        │
│  │EventBus │  │VoiceDetection    │                        │
│  │         │←─│Service           │                        │
│  │         │  │(Hotword Loop)    │                        │
│  └────┬────┘  └──────────────────┘                        │
│       │                                                   │
│       │ Events:                                           │
│       │  • hotword_detected                               │
│       │  • voice_activity_started                         │
│       │  • voice_activity_stopped                         │
│       │                                                   │
│       ↓                                                   │
│  ┌────────────┬────────────┬────────────┐                 │
│  ↓            ↓            ↓            ↓                 │
│ Consumer1    Consumer2    Consumer3    Consumer4          │
│ (STT)        (Realtime)   (Recording)  (Custom)           │
│                                                           │
└───────────────────────────────────────────────────────────┘
```
### Key Concepts
#### 1. Multi-Consumer Audio
One audio stream broadcasts to multiple queues:
- **hotword_queue** (size=3): Small, skip-ahead for low latency detection
- **audio_queue** (size=100): Large, buffered for complete audio capture
```python
# Hotword detection (skip-ahead)
audio = audio_handler.read_hotword_chunk()
# Complete audio for streaming/transcription (buffered)
audio = audio_handler.read_audio_chunk()
```
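The two queue policies can be sketched with the standard library alone. This is a minimal illustration, not the project's `AudioHandler`: the helper names `put_skip_ahead`, `put_buffered`, and `broadcast` are hypothetical, and only the queue sizes (3 and 100) come from the description above.

```python
import queue


def put_skip_ahead(q: queue.Queue, frame: bytes) -> None:
    """Insert a frame, discarding the oldest one when the queue is full."""
    while True:
        try:
            q.put_nowait(frame)
            return
        except queue.Full:
            try:
                q.get_nowait()  # drop the stalest frame to make room
            except queue.Empty:
                pass  # a consumer drained the queue in the meantime; retry


def put_buffered(q: queue.Queue, frame: bytes) -> bool:
    """Insert a frame and report an overflow instead of hiding it."""
    try:
        q.put_nowait(frame)
        return True
    except queue.Full:
        return False


def broadcast(frame: bytes, hotword_q: queue.Queue, audio_q: queue.Queue) -> bool:
    """Fan one captured frame out to both consumers' queues."""
    put_skip_ahead(hotword_q, frame)
    return put_buffered(audio_q, frame)


hotword_queue = queue.Queue(maxsize=3)    # small: stay current
audio_queue = queue.Queue(maxsize=100)    # large: keep everything
for i in range(10):
    broadcast(bytes([i]), hotword_queue, audio_queue)
```

After ten frames the hotword queue holds only the three newest frames, while the audio queue still holds all ten in order.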
#### 2. Event-Driven
Components communicate via events, not direct calls:
**Available Events:**
1. **hotword_detected** - Wake word detected
```python
event = HotwordEvent(
    timestamp=now,
    hotword="alexa",
    score=0.95,
    audio_queue_size=42
)
event_bus.publish("hotword_detected", event)
```
2. **voice_activity_started** - User started speaking
```python
event = VoiceActivityEvent(
    timestamp=now,
    activity_type='started'
)
event_bus.publish("voice_activity_started", event)
```
3. **voice_activity_stopped** - User stopped speaking
```python
event = VoiceActivityEvent(
    timestamp=now,
    activity_type='stopped',
    duration=3.2  # seconds
)
event_bus.publish("voice_activity_stopped", event)
```
**Subscribing to Events:**
```python
# Example: capture the exact duration of user speech.
# The class name is illustrative, not part of the project.
class SpeechCapture:
    def __init__(self, event_bus):
        self.recording = False
        # Subscribe to events
        event_bus.subscribe("hotword_detected", self.on_hotword)
        event_bus.subscribe("voice_activity_stopped", self.on_voice_stopped)

    def on_hotword(self, event: HotwordEvent):
        self.recording = True
        # Start background thread to collect audio

    def on_voice_stopped(self, event: VoiceActivityEvent):
        self.recording = False
        # Transcribe collected audio (exact duration!)
```
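For readers who want to see what sits behind `subscribe` and `publish`, here is a minimal synchronous pub-sub sketch. It is an assumption about the general shape only: the project's `EventBus` in `core/event_bus.py` may dispatch on other threads or carry extra features, and the `DemoHotwordEvent` dataclass is a stand-in for the real event type.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class DemoHotwordEvent:
    """Stand-in event type for this sketch (hypothetical)."""
    timestamp: float
    hotword: str
    score: float


class EventBus:
    """Tiny synchronous pub-sub: callbacks run in the publisher's thread."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)

    def subscribe(self, name: str, callback: Callable[[Any], None]) -> None:
        self._subscribers[name].append(callback)

    def unsubscribe(self, name: str, callback: Callable[[Any], None]) -> None:
        if callback in self._subscribers[name]:
            self._subscribers[name].remove(callback)

    def publish(self, name: str, event: Any) -> None:
        # Iterate over a copy so a callback may unsubscribe itself safely.
        for callback in list(self._subscribers[name]):
            callback(event)


received = []
handler = received.append
bus = EventBus()
bus.subscribe("hotword_detected", handler)
bus.publish("hotword_detected", DemoHotwordEvent(0.0, "alexa", 0.95))
bus.unsubscribe("hotword_detected", handler)
bus.publish("hotword_detected", DemoHotwordEvent(1.0, "alexa", 0.90))
```

Only the first event reaches the handler, because the second is published after it unsubscribed. A synchronous bus keeps ordering simple, at the cost that a slow callback delays the publisher.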
#### 3. Real-Time Performance
- **Callback mode**: Audio captured in background thread (non-blocking)
- **Skip-ahead**: Hotword queue drops old frames to stay current
- **Parallel consumers**: All process independently, no blocking
#### 4. Voice Detection Service (Core Loop)
The `VoiceDetectionService` is a reusable orchestration component that:
- Runs the main detection loop (hotword detection)
- Publishes hotword events
- Integrates with AudioHandler (which publishes voice activity events)
- Can be used by any command to build different functionality
**Why separate from commands?**
- Commands are UI/entry points
- Core loop is reusable business logic
- Different commands can use the same detection service with different consumers
**Example Usage:**
```python
# Create core components
event_bus = EventBus()
audio_handler = AudioHandler(event_bus=event_bus) # VAD events enabled
hotword_detector = HotwordDetector()
detection_service = VoiceDetectionService(audio_handler, event_bus, hotword_detector)
# Register consumers (they subscribe to events)
stt_consumer = SpeechToTextConsumer(event_bus, audio_handler, api_key)
realtime_consumer = RealtimeConsumer(event_bus, audio_handler, api_key)
# Start audio stream
audio_handler.start_stream()
# Run detection loop (blocks until stopped)
detection_service.start()
```
### Code Structure
```
src/voice_assistant/
├── core/                     # Core components (producers & orchestration)
│   ├── audio_handler.py      # Audio capture + VAD event emission
│   ├── detection_service.py  # Detection loop (hotword + orchestration)
│   ├── event_bus.py          # Pub-sub event system
│   └── hotword_detector.py   # Wake word detection
│
├── consumers/                # Event subscribers
│   └── stt_consumer.py       # Speech-to-text consumer
│
├── services/                 # External services
│   ├── openai_client.py      # OpenAI Realtime API client
│   └── state_machine.py      # State management
│
├── commands/                 # CLI commands (use core components)
│   ├── run.py                # Main service command
│   ├── test_stt.py           # Test STT consumer
│   ├── test_hotword.py       # Hotword detection test
│   └── ...                   # Other utilities
│
├── cli.py                    # Command-line interface
├── config.py                 # Configuration management
└── main.py                   # Service orchestrator (future)
```
## Configuration
Edit `config/config.yaml`:
### Audio Settings
```yaml
audio:
  device: "ac108"     # ALSA device name
  sample_rate: 16000  # Hz
  channels: 1         # Mono (works best with openWakeWord)
  chunk_size: 1280    # 80ms chunks (required by openWakeWord)
```
### Hotword Detection
```yaml
hotword:
  model: "alexa"
  threshold: 0.5  # 0.0-1.0 (lower = more sensitive)
```
**Tuning**:
- Lower (0.3-0.4): More sensitive, may have false positives
- Higher (0.6-0.7): Less sensitive, may miss wake word
- Use `--debug` to see scores and tune
**Debouncing**: The system automatically prevents multiple hotword events for a single utterance using a 2-second cooldown period. This means after detecting "alexa" once, it won't fire another event for 2 seconds, even if the detection continues (which is normal as you speak the word).
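The threshold and cooldown combine into a single gate. The sketch below is one way to express it, assuming a per-frame score between 0 and 1; the class name `HotwordGate`, the injectable `clock`, and the choice that a score equal to the threshold fires are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional


class HotwordGate:
    """Fire at most once per cooldown window when the score clears the threshold."""

    def __init__(
        self,
        threshold: float = 0.5,
        cooldown_s: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._last_fired: Optional[float] = None

    def should_fire(self, score: float) -> bool:
        if score < self.threshold:
            return False
        now = self._clock()
        if self._last_fired is not None and now - self._last_fired < self.cooldown_s:
            return False  # still inside the cooldown after the previous event
        self._last_fired = now
        return True


# Drive the gate with a fake clock so the example is deterministic.
fake_now = [0.0]
gate = HotwordGate(clock=lambda: fake_now[0])
results = []
for t, score in [(0.00, 0.2), (0.08, 0.9), (0.16, 0.95), (1.0, 0.8), (2.5, 0.9)]:
    fake_now[0] = t
    results.append(gate.should_fire(score))
```

Here the high scores at 0.16 s and 1.0 s are suppressed because they fall within 2 seconds of the event at 0.08 s; the one at 2.5 s fires again.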
### Voice Activity Detection
```yaml
vad:
  aggressiveness: 3      # 0-3 (recommended: 3)
  speech_threshold: 3    # Consecutive frames (filters noise)
  silence_threshold: 15  # Frames before stopping (~1 second)
```
**Aggressiveness** (0-3):
- **0**: Least aggressive - detects any sound (not recommended)
- **3**: Most aggressive - only clear speech (recommended)
- Use 3 to avoid false triggers from taps, movements, background noise
**Speech Threshold** (consecutive frames):
- **3** (default): Requires 3 consecutive speech frames (~240ms)
- **Higher (5-7)**: Stricter - ignores very brief sounds
- **Lower (1-2)**: More sensitive - may trigger on brief noises
- Filters out taps, clicks, and momentary sounds
**Silence Threshold** (frames):
- **15** (default): ~1 second of silence before stopping
- **Higher (20-25)**: Waits longer before considering speech ended
- **Lower (10-12)**: Faster response but may cut off pauses
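How the two frame counters interact is easier to see as a small state machine. This is a sketch under the assumption that some VAD produces one speech/non-speech boolean per frame; the `VadTracker` class is hypothetical and does not reproduce the project's `AudioHandler` logic.

```python
from typing import Optional


class VadTracker:
    """Turn per-frame speech flags into started/stopped events."""

    def __init__(self, speech_threshold: int = 3, silence_threshold: int = 15) -> None:
        self.speech_threshold = speech_threshold
        self.silence_threshold = silence_threshold
        self.active = False
        self._speech_run = 0   # consecutive speech frames seen so far
        self._silence_run = 0  # consecutive silent frames seen so far

    def update(self, is_speech: bool) -> Optional[str]:
        if is_speech:
            self._speech_run += 1
            self._silence_run = 0
            if not self.active and self._speech_run >= self.speech_threshold:
                self.active = True
                return "voice_activity_started"
        else:
            self._silence_run += 1
            self._speech_run = 0
            if self.active and self._silence_run >= self.silence_threshold:
                self.active = False
                return "voice_activity_stopped"
        return None


# Two speech frames (a tap), one silent frame, real speech, then a long pause.
flags = [True] * 2 + [False] + [True] * 3 + [False] * 15
tracker = VadTracker()
events = []
for index, flag in enumerate(flags):
    name = tracker.update(flag)
    if name is not None:
        events.append((index, name))
```

The two-frame burst never reaches the speech threshold, so the start event only fires on the third consecutive speech frame (index 5), and the stop event fires after 15 silent frames (index 20). If each frame is one 80 ms chunk, as the ~240 ms figure for 3 frames implies, 15 frames corresponds to about 1.2 seconds.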
## How It Works
### Hotword Detection
1. Audio captured in background thread (callback mode)
2. Broadcast to `hotword_queue` (skip-ahead) and `audio_queue` (buffered)
3. Hotword detector reads from `hotword_queue`
4. When "alexa" is detected and the cooldown period has passed → publishes `HotwordEvent`
   - **Debouncing**: 2-second cooldown prevents duplicate events
   - Single word "alexa" = single event (even though detection spans multiple frames)
5. All subscribed consumers react independently
### Speech-to-Text Consumer
1. Subscribes to `hotword_detected` and `voice_activity_stopped` events
2. When hotword detected:
   - Starts recording from `audio_queue` in background thread
   - If another hotword detected: restarts recording (allows correction)
3. When voice activity stops:
   - Stops recording and sends audio to OpenAI Whisper API
   - Displays transcription
4. Behavior:
   - Only transcribes speech AFTER hotword
   - Speech without hotword is ignored
   - Multiple "alexa" → uses last one before voice stops
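The record, restart, and stop behavior can be sketched without any network calls. The example below collects paInt16 mono chunks and packs them into an in-memory WAV file, which is a common upload format for transcription; `UtteranceRecorder` and `frames_to_wav` are hypothetical names, and the actual upload to OpenAI is deliberately left out.

```python
import io
import wave
from typing import List, Optional


def frames_to_wav(frames: List[bytes], sample_rate: int = 16000) -> bytes:
    """Pack raw 16-bit mono PCM chunks into WAV bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples (paInt16)
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(frames))
    return buffer.getvalue()


class UtteranceRecorder:
    """Keep only the audio that follows the most recent hotword."""

    def __init__(self) -> None:
        self.recording = False
        self.frames: List[bytes] = []

    def on_hotword(self) -> None:
        self.frames = []         # a repeated hotword restarts the recording
        self.recording = True

    def on_audio(self, chunk: bytes) -> None:
        if self.recording:       # speech before any hotword is ignored
            self.frames.append(chunk)

    def on_voice_stopped(self) -> Optional[bytes]:
        if not self.recording:
            return None
        self.recording = False
        return frames_to_wav(self.frames)


chunk = b"\x00\x00" * 1280       # one 80 ms chunk of silence
rec = UtteranceRecorder()
rec.on_audio(chunk)              # ignored: no hotword yet
rec.on_hotword()
rec.on_audio(chunk)
rec.on_audio(chunk)
rec.on_hotword()                 # user corrects themselves
for _ in range(3):
    rec.on_audio(chunk)
wav_bytes = rec.on_voice_stopped()
```

The resulting WAV contains only the three chunks recorded after the second hotword.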
### Realtime API Consumer
1. Subscribes to `hotword_detected` and `voice_activity_stopped` events
2. When hotword detected:
   - Connects to OpenAI Realtime API via WebSocket
   - Starts streaming audio from `audio_queue` in real-time
   - If another hotword detected: cancels current response, restarts
3. When voice activity stops:
   - Commits audio buffer and requests AI response
   - Receives audio response from OpenAI
   - Plays audio back through speakers
4. Features:
   - Bidirectional audio streaming (send + receive)
   - Low latency (real-time processing)
   - AI speaks back with voice
   - Interruption support (say "alexa" to cancel/restart)
5. Architecture:
   - Runs async event loop in background thread
   - Audio streaming in separate thread
   - Audio playback in separate thread
   - All synchronized via event bus
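The stream, commit, and cancel steps map onto JSON messages sent over the WebSocket. The sketch below only builds those messages; it assumes the Realtime API client event names `input_audio_buffer.append`, `input_audio_buffer.commit`, `response.create`, and `response.cancel`, which should be checked against the current API reference. The helper function names are hypothetical, and connection handling is omitted.

```python
import base64
import json
from typing import List


def append_audio_message(pcm16: bytes) -> str:
    """Wrap one chunk of raw 16-bit PCM as an append event (base64 payload)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16).decode("ascii"),
    })


def commit_and_respond_messages() -> List[str]:
    """Messages to send once voice activity has stopped."""
    return [
        json.dumps({"type": "input_audio_buffer.commit"}),
        json.dumps({"type": "response.create"}),
    ]


def cancel_message() -> str:
    """Message to send when a new hotword interrupts the current response."""
    return json.dumps({"type": "response.cancel"})


outgoing = [append_audio_message(b"\x01\x02\x03\x04")]
outgoing += commit_and_respond_messages()
message_types = [json.loads(m)["type"] for m in outgoing]
```

Keeping message construction separate from the socket code makes the protocol steps easy to unit test without a network connection.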
### Adding Custom Consumers
```python
from voice_assistant.core import EventBus, HotwordEvent, AudioHandler
class MyConsumer:
    def __init__(self, event_bus, audio_handler):
        self.event_bus = event_bus
        self.audio_handler = audio_handler
        event_bus.subscribe("hotword_detected", self.on_hotword)

    def on_hotword(self, event: HotwordEvent):
        # React to hotword
        audio = self.audio_handler.read_audio_chunk()
        # ... process audio ...

    def cleanup(self):
        self.event_bus.unsubscribe("hotword_detected", self.on_hotword)
```
## Tuning Voice Activity Detection
### Problem: False Triggers (taps, movements, background noise)
**Symptoms:**
- Voice activity events when you tap the desk
- Events triggered by keyboard typing
- Background sounds causing false starts
**Solutions (in order of effectiveness):**
1. **Increase aggressiveness to 3** (default)
   ```yaml
   vad:
     aggressiveness: 3 # Most strict
   ```
2. **Increase speech threshold**
   ```yaml
   vad:
     speech_threshold: 5 # Require 5 consecutive frames (~400ms)
   ```
   - Brief sounds (taps, clicks) won't trigger
   - Real speech sustained longer than 400ms will trigger
3. **Test with `test-events`**
   ```bash
   uv run voice-assistant test-events
   # Tap desk, type, make noise
   # Should NOT trigger voice activity events
   # Only speaking should trigger
   ```
### Problem: Speech Not Detected
**Symptoms:**
- You speak but no voice activity event
- Hotword detected but no speech activity
**Solutions:**
1. **Speak louder/clearer**
   - VAD aggressiveness=3 requires clear speech
   - Move closer to microphone
2. **Lower speech threshold**
   ```yaml
   vad:
     speech_threshold: 2 # More sensitive
   ```
3. **Lower aggressiveness (not recommended)**
   ```yaml
   vad:
     aggressiveness: 2 # Less strict (may cause false triggers)
   ```
### Understanding the Settings
```
aggressiveness: 3      → Filters out non-speech sounds
        ↓
speech_threshold: 3    → Requires sustained sound (not brief tap)
        ↓
Voice Activity Started!
        ↓
(user speaks...)
        ↓
silence_threshold: 15  → Waits for pause before stopping
        ↓
Voice Activity Stopped!
```
**Recommended defaults** (already set):
- `aggressiveness: 3` - Only clear speech
- `speech_threshold: 3` - Filters taps/clicks (~240ms)
- `silence_threshold: 15` - Natural pause (~1 second)
## Troubleshooting
### Realtime API Errors
**Error: "Unknown parameter: 'session.modalities'"**
This error occurred in older versions. The fix:
- Removed `modalities` from session configuration
- The API now infers modalities from context
- **Already fixed** in current version
**Error: "Invalid authentication" or "401 Unauthorized"**
- Check your OpenAI API key in `config/config.yaml`
- Ensure you have access to the Realtime API (requires payment method)
- The model name should be `gpt-4o-realtime-preview-2024-12-17`
**No audio playback from AI:**
- Check speaker volume and connections
- Verify ALSA playback device is working: `speaker-test -t wav -c 2`
- OpenAI outputs 24kHz audio - ensure your playback device can open a stream at that sample rate
**High latency or delays:**
- Check internet connection speed
- Realtime API requires stable, low-latency connection
- Consider using Ethernet instead of WiFi
### Hotword Not Detecting
**Check scores**:
```bash
uv run voice-assistant test-hotword --debug
```
Look for lines like:
```
Debug: Max score = 0.0129 (alexa), threshold = 0.5
```
**If scores are always 0.0000**:
- Run `uv run voice-assistant download-models`
- Check model file: `ls -lh models/`
**If scores are low (0.01-0.3)**:
- Speak louder or closer to mic
- Lower threshold in config
- Check audio levels: `alsamixer`
**If scores are good but no detection**:
- Check threshold setting
- Ensure using correct audio format (paInt16 mono)
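The score ranges above can be turned into a small triage helper for `--debug` output. This is a sketch that assumes the log line looks like the sample shown earlier (`Max score = ... (alexa), threshold = ...`); the regular expression, the `diagnose` function, and its return strings are hypothetical.

```python
import re

# Matches the sample debug line format; the real log format may differ.
DEBUG_LINE = re.compile(
    r"Max score = (?P<score>\d+(?:\.\d+)?) "
    r"\((?P<model>[^)]+)\), "
    r"threshold = (?P<threshold>\d+(?:\.\d+)?)"
)


def diagnose(line: str) -> str:
    """Map one debug line to the troubleshooting advice in this section."""
    match = DEBUG_LINE.search(line)
    if match is None:
        return "unrecognized"
    score = float(match["score"])
    threshold = float(match["threshold"])
    if score == 0.0:
        return "zero: re-run download-models and check models/"
    if score >= threshold:
        return "ok: score clears the threshold, check format and cooldown"
    if score <= 0.3:
        return "low: speak closer, raise capture gain, or lower the threshold"
    return "near: consider lowering the threshold slightly"


advice = diagnose("Debug: Max score = 0.0129 (alexa), threshold = 0.5")
```

For the sample line the helper reports the low-score case, which points to microphone level or distance rather than a missing model.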
### Audio Issues
**Test audio capture**:
```bash
uv run voice-assistant record --duration 10
```
This records 10s and plays it back.
**Check audio device**:
```bash
arecord -l
uv run voice-assistant config
```
**Low audio levels**:
```bash
alsamixer
# Adjust "Capture" or "ADC" levels
```
### Import Errors After Reorganization
Update imports:
```python
# Old
from voice_assistant.audio_handler import AudioHandler
# New
from voice_assistant.core import AudioHandler
# or
from voice_assistant.core.audio_handler import AudioHandler
```
### Performance Issues
Check queue status in logs:
```
hotword_queue: 0-3 frames (good - skip-ahead working)
audio_queue: 10-50 frames (good - buffering)
```
If `audio_queue` grows beyond 80 frames, the consumers may be falling behind the capture thread.
### Audio Playback Issues
**Test speaker first:**
```bash
# Quick speaker test (plays a 440Hz beep)
uv run python test_speaker.py
# Or use system tools
speaker-test -t sine -f 440 -l 1
```
**No audio from AI response:**
1. Check if audio chunks are being received:
   - Look for `"Received audio delta"` in logs
   - Should see the `AI is responding...` message
2. Verify default output device:
   ```bash
   aplay -l # List playback devices
   ```
3. Check ALSA mixer:
   ```bash
   alsamixer
   # Press F6 to select sound card
   # Adjust "Master" or "PCM" volume
   ```
4. Test with different output device:
   ```python
   # In realtime_consumer.py, specify device index
   self.playback_stream = self.audio.open(
       ...
       output_device_index=0,  # Try different indices
   )
   ```
5. Check logs for OpenAI events:
   - `response.audio.delta` - Audio chunks received
   - `response.audio.done` - Audio response complete
   - `response.content_part.added` - Response structure
**Audio choppy or distorted:**
- Increase `frames_per_buffer` in playback stream (try 2048 or 4096)
- Check system load: `top` or `htop`
## Running as Service (Future)
```bash
# Install
sudo cp voice-assistant.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable voice-assistant
sudo systemctl start voice-assistant
# Monitor
sudo journalctl -u voice-assistant -f
```
## Development
### Project Setup
```bash
# Clone
git clone <repository-url>
cd voice-assistant
# Install
uv sync
# Run tests
uv run pytest
# Lint
uv run ruff check src/
```
### Code Style
- Use `ruff` for linting and formatting
- Follow PEP 8
- Type hints where appropriate
- Docstrings for public APIs
### Adding Features
1. **New Consumer**: Add to `src/voice_assistant/consumers/`
2. **New Service**: Add to `src/voice_assistant/services/`
3. **New Command**: Add to `src/voice_assistant/commands/` and `cli.py`
4. **Core Component**: Add to `src/voice_assistant/core/`
## Technical Details
### Audio Format
- **Capture**: paInt16 mono @ 16kHz
- **Chunks**: 1280 samples (80ms) - required by openWakeWord
- **Queues**: Separate for hotword (skip-ahead) and consumers (buffered)
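The chunk figures follow directly from the format. The arithmetic below uses only the values stated in this README (16 kHz, 1280 samples, 16-bit mono, queue sizes of 3 and 100); the helper names are illustrative.

```python
SAMPLE_RATE_HZ = 16_000
CHUNK_SAMPLES = 1280
BYTES_PER_SAMPLE = 2   # paInt16
CHANNELS = 1           # mono

chunk_ms = CHUNK_SAMPLES * 1000 / SAMPLE_RATE_HZ            # duration of one chunk
chunk_bytes = CHUNK_SAMPLES * BYTES_PER_SAMPLE * CHANNELS   # size of one chunk
chunks_per_second = SAMPLE_RATE_HZ / CHUNK_SAMPLES


def queue_seconds(frames: int) -> float:
    """How much audio a queue of the given depth represents."""
    return frames * chunk_ms / 1000


def is_valid_chunk(data: bytes) -> bool:
    """Check that a raw buffer holds exactly one full chunk."""
    return len(data) == chunk_bytes


hotword_backlog_s = queue_seconds(3)     # hotword_queue depth
audio_backlog_s = queue_seconds(100)     # audio_queue depth
```

One chunk is 80 ms and 2560 bytes, so the hotword queue can lag by at most about 0.24 seconds, while a full audio queue represents 8 seconds of speech.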
### Hotword Detection
- **Library**: openWakeWord (TensorFlow Lite)
- **Model**: alexa_v0.1.tflite
- **Input**: int16 numpy array (not float32!)
- **Stateful**: Needs every frame for context
### Real-Time Performance
**Before optimization**:
- Blocking read: 62ms
- Detection: 18ms
- Total: 80ms (falling behind 0.13ms/frame)
**After optimization**:
- Callback mode: ~0ms (background thread)
- Detection: 18ms
- Total: 18ms (real-time capable!)
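The before/after figures can be read as a per-frame time budget: each 80 ms chunk must be fully processed before the next one arrives. The short calculation below uses the rounded numbers listed above; the `utilization` helper is illustrative.

```python
FRAME_MS = 80.0  # one 1280-sample chunk at 16 kHz


def utilization(read_ms: float, detect_ms: float, frame_ms: float = FRAME_MS) -> float:
    """Fraction of each frame period spent on the detection thread."""
    return (read_ms + detect_ms) / frame_ms


before = utilization(read_ms=62, detect_ms=18)   # blocking read + detection
after = utilization(read_ms=0, detect_ms=18)     # capture moved to a callback
headroom_ms = FRAME_MS - 18                      # spare time per frame afterwards
```

With the rounded figures the blocking design uses 100% of the budget, so any extra delay makes it fall behind, whereas callback capture uses 22.5% and leaves about 62 ms of headroom per frame.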
## Known Issues
1. **NumPy 2.x incompatibility**: Constrained to numpy <2.0 for tflite-runtime
2. **ALSA warnings**: Harmless warnings about unavailable devices (ignore)
3. **GPU discovery warning**: Normal on Raspberry Pi (uses CPU)
## Future Enhancements
- [ ] OpenAI Realtime API consumer (bidirectional streaming)
- [ ] Recording consumer (save conversations)
- [ ] Analytics consumer (usage tracking)
- [ ] Web UI for monitoring
- [ ] Multi-hotword support
- [ ] Custom wake word training
## Credits
- [openWakeWord](https://github.com/dscripka/openWakeWord) - Local hotword detection
- [OpenAI](https://platform.openai.com/) - Whisper API & Realtime API
- [ReSpeaker 4-Mic Array](https://wiki.seeedstudio.com/ReSpeaker_4_Mic_Array_for_Raspberry_Pi/) - Hardware
## License
MIT