https://github.com/kiranbaby14/talkmateai
Real-time voice-controlled 3D avatar with multimodal AI - speak naturally and watch your AI companion respond with perfect lip-sync
- Host: GitHub
- URL: https://github.com/kiranbaby14/talkmateai
- Owner: kiranbaby14
- License: mit
- Created: 2025-06-29T19:02:57.000Z (4 months ago)
- Default Branch: master
- Last Pushed: 2025-06-29T20:28:36.000Z (4 months ago)
- Last Synced: 2025-06-29T21:28:55.245Z (4 months ago)
- Topics: fastapi, flash-attention-2, huggingface, kokoro-tts, multimodal-ai, nextjs, smolvlm, vlm, websocket, whisper-ai
- Language: TypeScript
- Homepage:
- Size: 3.41 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# TalkMateAI
**Real-time Voice-Controlled 3D Avatar with Multimodal AI**
> Your 3D AI companion that never stops listening, never stops caring.
> Transform conversations into immersive experiences with AI-powered 3D avatars that see, hear, and respond naturally.
[Python](https://python.org)
[FastAPI](https://fastapi.tiangolo.com)
[Next.js](https://nextjs.org)
[CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit)
[MIT License](LICENSE)
## Demo Video
[Watch the demo on YouTube](https://www.youtube.com/watch?v=dE_8TXmp2Sk)
## Features
### **Core Capabilities**
- **Real-time Voice Activity Detection** - Advanced VAD with configurable sensitivity
- **Speech-to-Text** - Powered by OpenAI Whisper (tiny model) for instant transcription (a minimal sketch follows this list)
- **Vision Understanding** - SmolVLM2-256M-Video-Instruct for multimodal comprehension
- **Natural Text-to-Speech** - Kokoro TTS with native word-level timing
- **3D Avatar Animation** - Lip-sync and emotion-driven animations using [TalkingHead](https://github.com/met4citizen/TalkingHead)
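A minimal sketch of the speech-to-text step, assuming the Hugging Face `transformers` ASR pipeline with `openai/whisper-tiny` as listed above (the `transcribe` helper and its parameters are illustrative, not the repository's actual code):
```python
# Sketch: transcribe a mono 16 kHz float32 buffer with openai/whisper-tiny.
# Assumes `transformers` and `torch` are installed; names are illustrative.
import numpy as np
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",
    device=0,  # GPU index; use -1 to run on CPU
)

def transcribe(audio: np.ndarray, sampling_rate: int = 16_000) -> str:
    """Return the transcript for a float32 mono waveform."""
    result = asr({"raw": audio, "sampling_rate": sampling_rate})
    return result["text"].strip()
```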
### **Advanced Features**
- **Camera Integration** - Real-time image capture with voice commands
- **Streaming Responses** - Chunked audio generation for minimal latency
- **Native Timing Sync** - Perfect lip-sync using Kokoro's native timing data
- **Draggable Camera View** - Floating, resizable camera interface
- **Real-time Analytics** - Voice energy visualization and transmission tracking
- **WebSocket Communication** - Low-latency bidirectional data flow (see the sketch after this list)
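A hedged sketch of what the low-latency WebSocket loop could look like on the FastAPI side. The `/ws/audio` path, the `process_utterance` helper, and the message shapes are assumptions for illustration, not the repository's actual protocol:
```python
# Hypothetical sketch of the bidirectional WebSocket loop (FastAPI).
# The endpoint path and process_utterance() are illustrative, not the repo's actual API.
from typing import AsyncIterator, Tuple
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

async def process_utterance(chunk: bytes) -> AsyncIterator[Tuple[bytes, list]]:
    """Stand-in for the real pipeline: VAD -> Whisper -> SmolVLM2 -> Kokoro TTS.
    Yields (audio_chunk, word_timings) pairs as synthesis progresses."""
    yield b"", []  # placeholder so the sketch is self-contained

@app.websocket("/ws/audio")
async def audio_socket(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            # The client pushes raw microphone frames as binary messages.
            mic_chunk = await ws.receive_bytes()
            # Stream synthesized audio and its word timings back as they become
            # ready, so the avatar can start speaking and lip-syncing with
            # minimal latency instead of waiting for the full response.
            async for audio_chunk, timings in process_utterance(mic_chunk):
                await ws.send_bytes(audio_chunk)
                await ws.send_json({"word_timings": timings})
    except WebSocketDisconnect:
        pass
```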
## Architecture
## Technology Stack
### Backend (Python)
- **AI Models from Hugging Face:**
  - `openai/whisper-tiny` - Speech recognition
  - `HuggingFaceTB/SmolVLM2-256M-Video-Instruct` - Vision-language understanding (loading sketch below)
  - `Kokoro TTS` - High-quality voice synthesis
- **Framework:** FastAPI with WebSocket support
- **Processing:** PyTorch, Transformers, Flash Attention 2
- **Audio:** SoundFile, NumPy for real-time processing
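The vision model plus Flash Attention 2 combination above can be illustrated with a loading sketch following the SmolVLM2 model card; the repository's actual loading and prompting code may differ, and the image path and prompt here are placeholders:
```python
# Sketch: load SmolVLM2-256M-Video-Instruct with Flash Attention 2
# (assumes a CUDA GPU and the flash-attn package; the repo's code may differ).
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
).to("cuda")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "frame.jpg"},         # placeholder camera frame
        {"type": "text", "text": "What do you see?"},  # placeholder user question
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```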
### Frontend (TypeScript/React)
- **Framework:** Next.js 15 with TypeScript
- **UI:** Tailwind CSS + shadcn/ui components
- **3D Rendering:** [TalkingHead](https://github.com/met4citizen/TalkingHead) library
- **Audio:** Web Audio API with AudioWorklet
- **Communication:** Native WebSocket with React Context
## Requirements
### Tested System
- **OS:** Windows 11 (Linux/macOS support coming soon via a Docker image)
- **GPU:** NVIDIA RTX 3070 (8GB VRAM)
## Quick Start
### 1. Prerequisites
- Node.js 20+
- PNPM
- Python 3.10
- UV (Python package manager)
### 2. Set up monorepo dependencies from root
```bash
# Sets up both frontend and backend (requires the prerequisites above)
pnpm run monorepo-setup
```
### 3. Run the Application
**Start Development Servers**
```bash
# Run both frontend and backend from root
pnpm dev
# Or run individually
pnpm dev:client # Frontend (http://localhost:3000)
pnpm dev:server # Backend (http://localhost:8000)
```
### 4. Initial Setup
1. **Allow microphone access** when prompted
2. **Enable camera** for multimodal interactions
3. **Click "Connect"** to establish WebSocket connection
4. **Start Voice Control** and begin speaking!
## Usage Guide
### Camera Controls
- **Drag** to move camera window
- **Resize** with maximize/minimize buttons
- **Toggle on/off** as needed
### Voice Settings
- **Energy Threshold:** Adjust sensitivity to background noise
- **Pause Duration:** How long to wait after speech stops before processing it
- **Min/Max Speech:** Control segment length limits (a simplified sketch of how these settings interact follows this list)
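These settings map naturally onto an energy-based VAD loop. The following simplified sketch shows how they could interact; the threshold values, names, and frame handling are illustrative, not the repository's defaults:
```python
# Illustrative energy-based VAD segmenter matching the settings above.
# Values and names are examples only, not the repo's defaults.
import numpy as np

ENERGY_THRESHOLD = 0.01   # "Energy Threshold": RMS level treated as speech
PAUSE_DURATION = 0.8      # "Pause Duration": seconds of silence that end a segment
MIN_SPEECH = 0.3          # "Min Speech": discard segments shorter than this
MAX_SPEECH = 15.0         # "Max Speech": force-flush segments longer than this

def rms(frame: np.ndarray) -> float:
    """Root-mean-square energy of one audio frame."""
    return float(np.sqrt(np.mean(frame ** 2)))

def segment(frames, frame_seconds: float = 0.03):
    """Yield speech segments from an iterable of float32 audio frames."""
    buf, silence, voiced = [], 0.0, 0.0
    for frame in frames:
        if rms(frame) >= ENERGY_THRESHOLD:
            buf.append(frame)
            voiced += frame_seconds
            silence = 0.0
        elif buf:
            buf.append(frame)           # keep trailing silence for context
            silence += frame_seconds
        # Flush when the speaker pauses long enough or the segment hits the cap.
        if buf and (silence >= PAUSE_DURATION or voiced >= MAX_SPEECH):
            if voiced >= MIN_SPEECH:
                yield np.concatenate(buf)
            buf, silence, voiced = [], 0.0, 0.0
```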
## Acknowledgments
- **TalkingHead** ([met4citizen](https://github.com/met4citizen/TalkingHead)) for 3D avatar rendering and lip-sync
- **yeyu2** ([Multimodal-local-phi4](https://github.com/yeyu2/Youtube_demos/tree/main/Multimodal-local-phi4)) for multimodal implementation inspiration
---
**⭐ Star this repo if you find it useful! ⭐**
Made with ❤️ by kiranbaby14