https://github.com/lpalbou/voicellm
A modular Python library for voice interactions with AI systems, featuring high-quality TTS, STT with Whisper, and memory persistence.
https://github.com/lpalbou/voicellm
ai llm python speech-to-text stt text-to-speech tts voice voice-assistant voice-interface voice-recognition whisper
Last synced: 10 months ago
JSON representation
A modular Python library for voice interactions with AI systems, featuring high-quality TTS, STT with Whisper, and memory persistence.
- Host: GitHub
- URL: https://github.com/lpalbou/voicellm
- Owner: lpalbou
- License: mit
- Created: 2025-04-26T11:51:19.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-27T01:08:41.000Z (about 1 year ago)
- Last Synced: 2025-07-05T23:06:05.549Z (11 months ago)
- Topics: ai, llm, python, speech-to-text, stt, text-to-speech, tts, voice, voice-assistant, voice-interface, voice-recognition, whisper
- Language: Python
- Homepage: https://pypi.org/project/voicellm/
- Size: 49.8 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# VoiceLLM
[](https://pypi.org/project/voicellm/)
[](https://pypi.org/project/voicellm/)
[](https://github.com/lpalbou/voicellm/blob/main/LICENSE)
[](https://github.com/lpalbou/voicellm/stargazers)
A modular Python library for voice interactions with AI systems, providing text-to-speech (TTS) and speech-to-text (STT) capabilities with interrupt handling.
While we provide CLI and WEB examples, VoiceLLM is designed to be integrated in other projects.
## Features
- **Text-to-Speech**: High-quality speech synthesis with adjustable speed
- **Speech-to-Text**: Accurate voice recognition using OpenAI's Whisper
- **Voice Activity Detection**: Efficient speech detection using WebRTC VAD
- **Interrupt Handling**: Stop TTS by speaking or using stop commands
- **Modular Design**: Easily integrate with any text generation system
## Installation
### Prerequisites
- Python 3.8+ (3.11 recommended)
- PortAudio for audio input/output
### Basic Installation
```bash
# Install from PyPI
pip install voicellm
# Or clone the repository
git clone https://github.com/lpalbou/voicellm.git
cd voicellm
pip install -e .
```
### Development Installation
```bash
# Install with development dependencies
pip install "voicellm[dev]"
```
### From Requirements File
```bash
# Install all dependencies including the package
pip install -r requirements.txt
```
## Quick Start
```python
from voicellm import VoiceManager
import time
# Initialize voice manager
voice_manager = VoiceManager(debug_mode=False)
# Text to speech
voice_manager.speak("Hello, I am an AI assistant. How can I help you today?")
# Wait for speech to complete
while voice_manager.is_speaking():
time.sleep(0.1)
# Speech to text with callback
def on_transcription(text):
print(f"User said: {text}")
if text.lower() != "stop":
# Process with your text generation system
response = f"You said: {text}"
voice_manager.speak(response)
# Start voice recognition
voice_manager.listen(on_transcription)
# Wait for user to say "stop" or press Ctrl+C
try:
while voice_manager.is_listening():
time.sleep(0.1)
except KeyboardInterrupt:
pass
# Clean up
voice_manager.cleanup()
```
## Running Examples
The package includes several examples that demonstrate different ways to use VoiceLLM.
### Voice Mode (Default)
If installed globally, you can launch VoiceLLM directly in voice mode:
```bash
# Start VoiceLLM in voice mode (TTS ON, STT ON)
voicellm
# With options
voicellm --debug --whisper base --model gemma3:latest --api http://localhost:11434/api/chat
```
Command line options:
- `--debug`: Enable debug mode with detailed logging
- `--api`: URL of the Ollama API (default: http://localhost:11434/api/chat)
- `--model`: Ollama model to use (default: granite3.3:2b)
- Other examples : cogito:3b, phi4-mini:latest, qwen2.5:latest, cogito:latest, gemma3:latest, etc.
- `--whisper`: Whisper model to use (tiny, base, small, medium, large)
- `--no-voice`: Start in text mode instead of voice mode
- `--system`: Custom system prompt
### Command-Line REPL
```bash
# Run the CLI example (TTS ON, STT OFF)
voicellm-cli cli
# With debug mode
voicellm-cli cli --debug
```
### Web API
```bash
# Run the web API example
voicellm-cli web
# With different host and port
voicellm-cli web --host 0.0.0.0 --port 8000
```
You can also run a simplified version that doesn't load the full models:
```bash
# Run the web API with simulation mode
voicellm-cli web --simulate
```
#### Troubleshooting Web API
If you encounter issues with the web API:
1. **404 Not Found**: Make sure you're accessing the correct endpoints (e.g., `/api/test`, `/api/tts`)
2. **Connection Issues**: Ensure no other service is using the port
3. **Model Loading Errors**: Try running with `--simulate` flag to test without loading models
4. **Dependencies**: Ensure all required packages are installed:
```bash
pip install flask soundfile numpy requests
```
5. **Test with a simple Flask script**:
```python
from flask import Flask
app = Flask(__name__)
@app.route('/')
def home():
return "Flask works!"
app.run(host='127.0.0.1', port=5000)
```
### Simple Demo
```bash
# Run the simple example
voicellm-cli simple
```
## Component Overview
### VoiceManager
The main class that coordinates TTS and STT functionality:
```python
# Initialize
manager = VoiceManager(tts_model="tts_models/en/ljspeech/tacotron2-DDC",
whisper_model="tiny", debug_mode=False)
# TTS
manager.speak(text, speed=1.0, callback=None)
manager.stop_speaking()
manager.is_speaking()
# STT
manager.listen(on_transcription, on_stop=None)
manager.stop_listening()
manager.is_listening()
# Configuration
manager.change_whisper_model(model_name)
manager.change_vad_aggressiveness(aggressiveness)
# Cleanup
manager.cleanup()
```
### TTSEngine
Handles text-to-speech synthesis:
```python
from voicellm.tts import TTSEngine
tts = TTSEngine(model_name="tts_models/en/ljspeech/tacotron2-DDC", debug_mode=False)
tts.speak(text, speed=1.0, callback=None)
tts.stop()
tts.is_active()
```
### VoiceRecognizer
Manages speech recognition with VAD:
```python
from voicellm.recognition import VoiceRecognizer
def on_transcription(text):
print(f"Transcribed: {text}")
def on_stop():
print("Stop command detected")
recognizer = VoiceRecognizer(transcription_callback=on_transcription,
stop_callback=on_stop,
whisper_model="tiny",
debug_mode=False)
recognizer.start(tts_interrupt_callback=None)
recognizer.stop()
recognizer.change_whisper_model("base")
recognizer.change_vad_aggressiveness(2)
```
## Integration with Text Generation Systems
VoiceLLM is designed to be used with any text generation system:
```python
from voicellm import VoiceManager
import requests
# Initialize voice manager
voice_manager = VoiceManager()
# Function to call text generation API
def generate_text(prompt):
response = requests.post("http://localhost:11434/api/chat", json={
"model": "granite3.3:2b",
"messages": [{"role": "user", "content": prompt}],
"stream": False
})
return response.json()["message"]["content"]
# Callback for speech recognition
def on_transcription(text):
if text.lower() == "stop":
return
print(f"User: {text}")
# Generate response
response = generate_text(text)
print(f"AI: {response}")
# Speak response
voice_manager.speak(response)
# Start listening
voice_manager.listen(on_transcription)
```
## Perspectives
This is a test project that I designed with examples to work with Ollama, but I will adapt the examples and voicellm to work with any LLM provider (anthropic, openai, etc).
## License and Acknowledgments
VoiceLLM is licensed under the [MIT License](LICENSE).
This project depends on several open-source libraries and models, each with their own licenses. Please see [ACKNOWLEDGMENTS.md](ACKNOWLEDGMENTS.md) for a detailed list of dependencies and their respective licenses.
Some dependencies, particularly certain TTS models, may have non-commercial use restrictions. If you plan to use VoiceLLM in a commercial application, please ensure you are using models that permit commercial use or obtain appropriate licenses.