https://github.com/rohanprichard/fastrtc-demo
A simple POC of FastRTC, a framework to use voice mode in python!
https://github.com/rohanprichard/fastrtc-demo
conversational-ai fastapi fastrtc generative-ai huggingface multimodal voice-activity-detection voice-assistant
Last synced: 2 months ago
JSON representation
A simple POC of FastRTC, a framework to use voice mode in python!
- Host: GitHub
- URL: https://github.com/rohanprichard/fastrtc-demo
- Owner: rohanprichard
- Created: 2025-03-05T09:50:21.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-02T09:48:53.000Z (about 1 year ago)
- Last Synced: 2025-04-02T10:37:22.251Z (about 1 year ago)
- Topics: conversational-ai, fastapi, fastrtc, generative-ai, huggingface, multimodal, voice-activity-detection, voice-assistant
- Language: TypeScript
- Homepage:
- Size: 89.8 KB
- Stars: 24
- Watchers: 1
- Forks: 9
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# FastRTC POC
A simple POC for a fast real-time voice chat application using FastAPI and FastRTC by [rohanprichard](https://github.com/rohanprichard). I wanted to make one as an example with more production-ready languages, rather than just Gradio.
## Setup
1. Set your OpenAI and ElevenLabs API key in an `.env` file based on the `.env.example` file
2. Create a virtual environment and install the dependencies
```bash
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```
For windows,
```bash
python -m venv env
.\env\Scripts\activate
pip install -r requirements.txt
```
3. Run the server
```bash
./run.sh
```
Windows:
```bash
uvicorn backend.server:app --host 0.0.0.0 --port 8000
```
4. Navigate into the frontend directory
```bash
cd frontend/fastrtc-demo
```
5. Run the frontend
```bash
npm install
npm run dev
```
6. Click the microphone icon and start chatting!
7. Reset chats by clicking the trash button on the bottom right
## Notes
- The STT is currently using the ElevenLabs API.
- The LLM is currently using the OpenAI API.
- The TTS is currently using the ElevenLabs API.
- The VAD is currently using the Silero VAD model.
- You may need to install ffmpeg if you get errors in STT
The prompt can be changed in the `backend/server.py` file and modified as you like.
### Audio Parameters
#### AlgoOptions
- **audio_chunk_duration**: Length of audio chunks in seconds. Smaller values allow for faster processing but may be less accurate.
- **started_talking_threshold**: If a chunk has more than this many seconds of speech, the system considers that the user has started talking.
- **speech_threshold**: After the user has started speaking, if a chunk has less than this many seconds of speech, the system considers that the user has paused.
#### SileroVadOptions
- **threshold**: Speech probability threshold (0.0-1.0). Values above this are considered speech. Higher values are more strict.
- **min_speech_duration_ms**: Speech segments shorter than this (in milliseconds) are filtered out.
- **min_silence_duration_ms**: The system waits for this duration of silence (in milliseconds) before considering speech to be finished.
- **speech_pad_ms**: Padding added to both ends of detected speech segments to prevent cutting off words.
- **max_speech_duration_s**: Maximum allowed duration for a speech segment in seconds. Prevents indefinite listening.
### Tuning Recommendations
- If the AI interrupts you too early:
- Increase `min_silence_duration_ms`
- Increase `speech_threshold`
- Increase `speech_pad_ms`
- If the AI is slow to respond after you finish speaking:
- Decrease `min_silence_duration_ms`
- Decrease `speech_threshold`
- If the system fails to detect some speech:
- Lower the `threshold` value
- Decrease `started_talking_threshold`
## Credits:
Credit for the UI components goes to Shadcn, Aceternity UI and Kokonut UI.