https://github.com/prakash-aryan/speech_command_server
A real-time voice command detection system that recognizes "play" and "pause" commands using Vosk speech recognition.
- Host: GitHub
- URL: https://github.com/prakash-aryan/speech_command_server
- Owner: prakash-aryan
- License: mit
- Created: 2025-02-25T05:01:00.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-02-25T05:04:03.000Z (7 months ago)
- Last Synced: 2025-02-25T06:18:27.072Z (7 months ago)
- Topics: fastapi, python, transcription, uvicorn, vosk, websockets
- Language: Python
- Homepage:
- Size: 50.8 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Speech Command Detector
A real-time voice command detection system that recognizes "play" and "pause" commands using Vosk speech recognition.
[Demo](https://github.com/user-attachments/assets/6401734d-8ede-4605-810c-c4d6c2280e5c)

## Features
- Real-time speech recognition with WebSockets
- Low-latency command detection
- Responsive web interface with visual feedback
- Standalone command-line interface option
- Works in modern browsers (Chrome, Firefox, Edge)

## System Architecture
As shown in the architecture diagram (`sysArch.png`), the system consists of three main parts:
1. **Client Side (Browser)**:
   - **Audio Capture**: Converts microphone input to 16kHz PCM format
   - **WebSocket Client**: Handles bidirectional communication with the server
   - **User Interface**: Displays command state and transcriptions
2. **Server Side (Python)**:
   - **WebSocket Server**: FastAPI and Uvicorn handle connections
   - **Audio Processor**: Buffers and processes incoming audio
   - **Speech Model (Vosk)**: Converts audio to text
   - **Command Handler**: Detects commands in the transcription
3. **Standalone CLI Version**:
   - Uses the same core components without WebSocket/UI layers

## Installation
1. Clone the repository:
```
git clone git@github.com:prakash-aryan/speech_command_server.git
cd speech_command_server
```

2. Create a virtual environment and install dependencies:
```
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

3. Download the Vosk speech recognition model:
```
mkdir -p models/data
cd models/data
wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip
cd ../..
```

## Usage
### Web Interface
1. Start the server:
```
python app.py
```

2. Open a web browser and navigate to:
```
http://localhost:8080
```

3. Click "Start Listening" and speak commands like "play" or "pause"
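
If the page does not load, one quick way to tell whether the server from step 1 is answering is to request the root URL from a script. The following is a minimal sketch using only the Python standard library; `server_is_up` is a hypothetical helper (not part of the repository), and the URL is the one given in step 2.

```python
import urllib.error
import urllib.request


def server_is_up(url: str = "http://localhost:8080", timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to `url` succeeds, False otherwise."""
    try:
        # urlopen raises for connection failures and for 4xx/5xx responses
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    print("server reachable" if server_is_up() else "server not reachable")
```

This only confirms that something is serving HTTP at that address; it does not exercise the WebSocket connection or the microphone.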
### Command Line Interface
For a standalone command-line interface without the web server:
```
python simple_command_detector.py
```

## How It Works
1. The browser captures audio from the microphone using the Web Audio API
2. Audio is processed, resampled to 16kHz, and converted to 16-bit PCM format
3. Audio data is sent to the server via WebSocket
4. The server processes the audio through several components:
   - Audio Processor prepares and buffers the data
   - Speech Model (Vosk) transcribes the audio to text
   - Command Handler detects "play" or "pause" commands
5. Commands are sent back to the browser, which updates the UI accordingly
6. Transcriptions are also sent back for real-time feedback

## Project Structure
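
Steps 2 and 4 of How It Works can be illustrated in a few lines of Python. This is a simplified sketch, not the project's code: in the real system the conversion happens in the browser, resampling to 16 kHz is left out here, and `float_to_pcm16` and `detect_command` are hypothetical names chosen for illustration.

```python
import struct


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    # Clip out-of-range values so they cannot overflow the 16-bit range
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def detect_command(transcript, commands=("play", "pause")):
    """Return the first command word found in the transcript, else None."""
    for word in transcript.lower().split():
        if word in commands:
            return word
    return None


if __name__ == "__main__":
    pcm = float_to_pcm16([0.0, 0.5, -1.0])
    print(len(pcm))  # 6 bytes: 3 samples x 2 bytes each
    print(detect_command("please pause the video"))  # pause
```

Matching on whole words keeps the check cheap enough to run on every transcription update, at the cost of missing multi-word phrases.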
```
speech_command_server/
│
├── app.py                      # Main FastAPI server application
├── simple_command_detector.py  # Standalone CLI tool
├── requirements.txt            # Python dependencies
├── README.md                   # Documentation
├── sysArch.png                 # System architecture diagram
│
├── models/
│   ├── __init__.py             # Makes models a package
│   ├── asr_model.py            # Speech recognition model
│   └── data/                   # Speech model data
│       └── vosk-model-small-en-us-0.15/
│
├── utils/
│   ├── __init__.py             # Makes utils a package
│   ├── audio_processor.py      # Audio processing utilities
│   └── command_handler.py      # Command detection logic
│
└── static/
    └── index.html              # Web interface
```

## Extending the System
This project can be extended in several ways:
### 1. Add More Voice Commands
To add new commands, modify the `CommandHandler` class in `utils/command_handler.py`:
```python
def __init__(self):
    """Initialize the command handler"""
    # Define command keywords and synonyms
    self.commands = {
        "play": ["play", "start", "begin", "resume", "go"],
        "pause": ["pause", "stop", "halt", "freeze", "wait"],
        # Add new commands here:
        "next": ["next", "skip", "forward"],
        "previous": ["previous", "back", "backward"],
        "volume_up": ["louder", "increase volume", "volume up"],
        "volume_down": ["quieter", "decrease volume", "volume down"],
    }
```

### 2. Integrate with External Systems
You can extend the command handling to control real applications:
```python
import subprocess
from typing import Optional

def _apply_cooldown(self, command: str) -> Optional[str]:
    """Apply cooldown logic and handle system integration"""
    # [Existing cooldown code]

    # Add integration with external systems
    # Example: use playerctl (if installed) to control a media player
    if command == "play":
        subprocess.run(["playerctl", "play"])
    elif command == "pause":
        subprocess.run(["playerctl", "pause"])
    return command
```

### 3. Improve Speech Recognition
You can improve recognition accuracy by:
1. Using a larger Vosk model
2. Limiting the recognizer to a custom command vocabulary:

```python
from typing import Optional

from vosk import KaldiRecognizer

def __init__(self, model_path: Optional[str] = None):
    # [Existing initialization code]

    # Create recognizer with model
    self.recognizer = KaldiRecognizer(self.vosk_model, self.sample_rate)

    # Include per-word details in final and partial results
    self.recognizer.SetWords(True)
    self.recognizer.SetPartialWords(True)

    # Restrict recognition to the command keywords;
    # "[unk]" absorbs speech that matches none of them
    grammar = '["play", "pause", "next", "previous", "stop", "[unk]"]'
    self.recognizer.SetGrammar(grammar)
```

### 4. Add Authentication
For multi-user applications, add authentication to the FastAPI server:
```python
from fastapi import Depends, FastAPI, HTTPException, WebSocket, status
from fastapi.security import OAuth2PasswordBearer, OAuth2PasswordRequestForm

# [Authentication code setup]

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket, current_user: User = Depends(get_current_user)):
    # Only authenticated users can use the WebSocket
    # [Existing WebSocket code]
    ...
```

### 5. Implement Voice Profiles
For improved accuracy, add voice profile training:
```python
class UserProfile:
    def __init__(self, user_id):
        self.user_id = user_id
        self.voice_samples = []

    def add_sample(self, audio_data):
        self.voice_samples.append(audio_data)

    def train(self):
        # Process voice samples to create user-specific model adjustments
        pass
```

## Troubleshooting
- **Browser doesn't detect microphone**: Make sure you're using a modern browser and accessing the site via `https://` or `localhost`; browsers only grant microphone access in secure contexts
- **No transcription appears**: Check that your microphone is working and properly selected in the browser
- **Server doesn't start**: Make sure the Vosk model is downloaded and extracted to the correct location
- **WebSocket disconnects**: Check your network connection and firewall settings

## Requirements
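
A short script can rule out two common causes of startup failures: an outdated Python version and a missing model directory. The sketch below assumes it is run from the repository root and that the server expects the model at the path created by the installation steps; `check_environment` is a hypothetical helper, not part of the repository.

```python
import sys
from pathlib import Path

# Directory produced by the unzip step in the installation instructions
MODEL_DIR = Path("models/data/vosk-model-small-en-us-0.15")


def check_environment(model_dir=MODEL_DIR, min_python=(3, 10)):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ is required, found {sys.version.split()[0]}")
    if not Path(model_dir).is_dir():
        problems.append(f"Vosk model directory not found: {model_dir}")
    return problems


if __name__ == "__main__":
    for message in check_environment() or ["Environment looks OK"]:
        print(message)
```

The check does not verify that the model files themselves are intact, only that the directory exists.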
- Python 3.10+
- Vosk speech recognition model
- Modern web browser
- Microphone

## License
[MIT License](LICENSE)