https://github.com/soufiiyane/whisper-streaming-websocket
Real-time speech-to-text transcription and translation system built exclusively with faster-whisper for optimal performance. Features WebSocket support for live audio streaming and optional Google Translate integration.
https://github.com/soufiiyane/whisper-streaming-websocket
faster-whisper transcription translation whisper whisper-streaming
Last synced: about 2 months ago
JSON representation
Real-time speech-to-text transcription and translation system built exclusively with faster-whisper for optimal performance. Features WebSocket support for live audio streaming and optional Google Translate integration.
- Host: GitHub
- URL: https://github.com/soufiiyane/whisper-streaming-websocket
- Owner: soufiiyane
- Created: 2025-07-31T13:16:57.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2025-08-04T11:18:44.000Z (2 months ago)
- Last Synced: 2025-08-04T15:31:39.275Z (2 months ago)
- Topics: faster-whisper, transcription, translation, whisper, whisper-streaming
- Language: Python
- Homepage: https://soufiiyane.github.io/whisper-streaming-websocket/
- Size: 42.7 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Whisper Streaming WebSocket
Real-time speech-to-text transcription and translation system built exclusively with **faster-whisper** for optimal performance. Features WebSocket support for live audio streaming and optional Google Translate integration.
## Installation
### Required Dependencies
```bash
pip install librosa soundfile websockets
```### Faster-Whisper Backend
This project uses **faster-whisper** exclusively for the best balance of speed and accuracy:```bash
pip install faster-whisper
```**For GPU Support (Recommended):**
- Install NVIDIA libraries: CUDNN 8.5.0 and CUDA 11.7
- Navigate to `whisper_online.py` and uncomment the GPU model line (around line 119)
- GPU provides significantly faster real-time processing### Optional Features
- **Voice Activity Detection**: `pip install torch torchaudio`
- **Translation**: `pip install requests` (for Google Translate integration)## Usage
### Quick Start - WebSocket Server
The easiest way to get started is using the enhanced launcher:
```bash
# Basic transcription only
python start_whisper.py# With translation enabled
python start_whisper.py --translate --vac --vad# Production setup
python start_whisper.py --host 0.0.0.0 --port 43007 --model large-v3 --translate --vac --vad
```Then open `index.html` in your browser or navigate to `http://localhost:43007` for the web interface.
### Command Line Options
#### start_whisper.py
```bash
python start_whisper.py --help
```**Server Configuration:**
- `--host HOST` - Server host address (default: 0.0.0.0)
- Use `0.0.0.0` to accept connections from any IP
- Use `localhost` for local-only access
- `--port PORT` - Server port number (default: 43007)
- Choose any available port for the WebSocket server**Model Settings:**
- `--model MODEL` - Whisper model size (default: large-v3)
- Options: `tiny`, `base`, `small`, `medium`, `large-v1`, `large-v2`, `large-v3`, `large-v3-turbo`
- Larger models = better accuracy but slower processing
- `large-v3` recommended for production use
- `--language LANG` - Source language code (default: en)
- Use `auto` for automatic language detection
- Use specific codes: `en`, `es`, `fr`, `de`, `it`, `pt`, `ja`, `ko`, `zh`, etc.**Audio Processing:**
- `--chunk-size SIZE` - Minimum audio chunk size in seconds (default: 0.3)
- Smaller values = lower latency but more processing overhead
- Larger values = higher latency but more efficient processing
- `--vac` - Enable Voice Activity Controller (recommended)
- Automatically detects speech start/end for better processing
- Reduces unnecessary processing during silence
- `--vad` - Enable Voice Activity Detection
- Filters out non-speech audio segments
- Improves accuracy by focusing on actual speech**Features:**
- `--translate` - Enable real-time translation using Google Translate
- Requires `requests` library: `pip install requests`
- Provides live translation alongside transcription**Advanced Options:**
- `--warmup-file FILE` - Audio file to warm up Whisper model
- Pre-loads model with sample audio for faster first chunk processing
- Use any WAV file for warming up
- `--log-level LEVEL` - Logging verbosity (default: INFO)
- Options: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`
- Use `DEBUG` for troubleshooting#### whisper_online.py (File Processing)
Process pre-recorded audio files for testing and development:```bash
python whisper_online.py demo.wav --language en --min-chunk-size 1
```**File Processing Options:**
- `--model MODEL` - Whisper model size (tiny to large-v3)
- `--language LANG` - Source language or "auto" for automatic detection
- `--min-chunk-size SIZE` - Minimum chunk size for processing (seconds)
- `--vac` - Enable Voice Activity Controller for better segmentation
- `--vad` - Enable Voice Activity Detection to filter non-speech
- `--task TASK` - Processing task: "transcribe" or "translate" (to English)
- `--buffer-trimming` - Buffer management strategy for long audio files### Web Interface Features
- **Language Selection**: Choose from 25+ supported languages for source and target
- **Real-time Language Switching**: Change languages during active recording without interruption
- **Voice Activity Indicator**: Visual feedback showing when speech is detected
- **Dual Output Panels**: Separate displays for original transcription and translation
- **WebSocket Connection**: Low-latency real-time audio streaming
- **Auto-scroll**: Automatically scrolls to show latest transcription results
- **Responsive Design**: Works on desktop and mobile devices### Chrome Extension
The included Chrome extension captures audio from any browser tab and provides real-time transcription and translation:
**Features:**
- **Tab Audio Capture**: Capture audio from YouTube, podcasts, videos, any web content
- **Side Panel Interface**: Persistent panel that stays open while browsing
- **Real-time Processing**: Live transcription and translation as audio plays
- **Language Controls**: Change source and target languages during recording
- **Text Accumulation**: Keeps all transcriptions for easy reading**Installation:**
1. Load the extension from `chrome_extension/` folder in Chrome Developer Mode
2. Start the Whisper server: `python start_whisper.py --translate`
3. Click extension icon to open side panel
4. Navigate to any tab with audio content and start translation### As a Python Module
```python
from whisper_online import *# Initialize faster-whisper ASR
asr = FasterWhisperASR("en", "large-v3")
asr.use_vad() # Enable voice activity detection# Create online processor
online = OnlineASRProcessor(asr)# Process audio chunks in real-time
while audio_available:
audio_chunk = get_audio_chunk() # Your audio source (16kHz, float32)
online.insert_audio_chunk(audio_chunk)
result = online.process_iter()
if result[2]: # If there's transcribed text
start_time, end_time, text = result
print(f"[{start_time:.2f}s - {end_time:.2f}s]: {text}")# Get final result when done
final_result = online.finish()
if final_result[2]:
print(f"Final: {final_result[2]}")
```## Project Structure
```
├── whisper_online.py # Core streaming processor with faster-whisper backend
├── whisper_websocket_server.py # WebSocket server with translation support
├── start_whisper.py # Enhanced launcher script with all options
├── index.html # Web interface for real-time transcription
├── silero_vad_iterator.py # Voice Activity Detection implementation
├── chrome_extension/ # Chrome extension for tab audio capture
│ ├── manifest.json # Extension configuration
│ ├── sidepanel.html # Extension UI
│ ├── sidepanel.js # Extension logic
│ ├── service-worker.js # Background script
│ └── offscreen.js # Audio processing
├── frontend/ # Documentation website
│ ├── index.html # Project documentation
│ └── styles.css # Documentation styling
└── demo.wav # Sample audio file for testing
```## Features
- **Real-time Streaming**: Processes audio chunks with low latency using WebSocket protocol
- **Faster-Whisper Backend**: Exclusively uses faster-whisper for optimal speed and accuracy
- **Voice Activity Detection**: Smart processing with VAD/VAC to reduce computational overhead
- **Live Translation**: Optional Google Translate integration for multilingual support
- **Web Interface**: Browser-based real-time transcription with responsive design
- **Chrome Extension**: Capture and translate audio from any browser tab
- **Language Switching**: Change source and target languages during active recording
- **GPU Acceleration**: Support for NVIDIA GPU acceleration for faster processing
- **Flexible Configuration**: Extensive command-line options for customization## Supported Languages
**Source Languages (Whisper):** All 99 Whisper-supported languages including:
- **European**: English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Turkish, Swedish, Danish, Norwegian, Finnish, Czech, Slovak, Hungarian, Romanian, Bulgarian, Croatian, Slovenian, Estonian, Latvian, Lithuanian, Ukrainian
- **Asian**: Japanese, Korean, Chinese (Mandarin), Hindi, Arabic, Russian, Thai, Vietnamese, Indonesian, Malay
- **Others**: Hebrew, Persian, Urdu, Bengali, Tamil, Telugu, Gujarati, Marathi, and many more**Translation Languages (Google Translate):** 100+ languages supported for real-time translation
**Language Detection:** Automatic language detection available with `--language auto` option
## Performance Notes
- **GPU vs CPU**: GPU acceleration provides 4-5x faster processing than CPU-only
- **Model Size Impact**:
- `tiny`: Fastest, lower accuracy
- `base/small`: Good balance for real-time use
- `medium`: Better accuracy, moderate speed
- `large-v3`: Best accuracy, requires more resources
- **Chunk Size**: Smaller chunks (0.1-0.3s) = lower latency, larger chunks (0.5-1.0s) = better accuracy
- **VAD Benefits**: Reduces processing by 30-50% by skipping silence periods