# 🎭 TalkMateAI

**Real-time Voice-Controlled 3D Avatar with Multimodal AI**

> Your 3D AI companion that never stops listening, never stops caring.

> Transform conversations into immersive experiences with AI-powered 3D avatars that see, hear, and respond naturally.

[![Python](https://img.shields.io/badge/Python-3.10-blue.svg)](https://python.org)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.115+-green.svg)](https://fastapi.tiangolo.com)
[![Next.js](https://img.shields.io/badge/Next.js-15+-black.svg)](https://nextjs.org)
[![CUDA](https://img.shields.io/badge/CUDA-12.4-76B900.svg)](https://developer.nvidia.com/cuda-toolkit)
[![License](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

## 🎥 Demo Video

[![TalkMateAI Demo](https://img.youtube.com/vi/dE_8TXmp2Sk/maxresdefault.jpg)](https://www.youtube.com/watch?v=dE_8TXmp2Sk)

## ✨ Features

### 🎯 Core Capabilities
- **🎤 Real-time Voice Activity Detection** - Advanced VAD with configurable sensitivity (see the sketch after this list)
- **🗣️ Speech-to-Text** - Powered by OpenAI Whisper (tiny model) for instant transcription
- **👁️ Vision Understanding** - SmolVLM2-256M-Video-Instruct for multimodal comprehension
- **🔊 Natural Text-to-Speech** - Kokoro TTS with native word-level timing
- **🎭 3D Avatar Animation** - Lip-sync and emotion-driven animations using [TalkingHead](https://github.com/met4citizen/TalkingHead)
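
A minimal sketch of the energy-threshold idea behind this kind of VAD; the function name and default value are illustrative, not TalkMateAI's actual implementation:

```python
import numpy as np

def is_speech(frame: np.ndarray, energy_threshold: float = 0.01) -> bool:
    """Treat a mono float32 frame as speech when its RMS energy exceeds
    a configurable threshold -- raising it makes the VAD less sensitive."""
    rms = float(np.sqrt(np.mean(np.square(frame))))
    return rms > energy_threshold

# Quick check: 20 ms of silence vs. a 440 Hz burst at 16 kHz
sr, n = 16000, 320
silence = np.zeros(n, dtype=np.float32)
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(n) / sr).astype(np.float32)
print(is_speech(silence), is_speech(tone))  # False True
```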

### 🚀 Advanced Features
- **📹 Camera Integration** - Real-time image capture with voice commands
- **⚡ Streaming Responses** - Chunked audio generation for minimal latency (sketched after this list)
- **🎬 Native Timing Sync** - Perfect lip-sync using Kokoro's native timing data
- **🎨 Draggable Camera View** - Floating, resizable camera interface
- **📊 Real-time Analytics** - Voice energy visualization and transmission tracking
- **🔄 WebSocket Communication** - Low-latency bidirectional data flow
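
To make the streaming idea concrete, here is a hedged sketch of a FastAPI WebSocket endpoint that sends audio chunk by chunk; the `/ws` path, the message shape, and the `synthesize_chunks` helper are illustrative assumptions, not the project's actual protocol:

```python
import base64
from fastapi import FastAPI, WebSocket

app = FastAPI()

def synthesize_chunks(text: str):
    """Hypothetical stand-in for the TTS stage: yield (pcm_bytes, word_timings)
    one sentence at a time -- the role Kokoro TTS plays in TalkMateAI."""
    for i, sentence in enumerate(filter(None, text.split(". "))):
        pcm = b"\x00\x00" * 1600  # 100 ms of silent 16-bit mono at 16 kHz
        timings = [{"word": w, "start": j * 0.2} for j, w in enumerate(sentence.split())]
        yield pcm, timings

@app.websocket("/ws")  # endpoint path is an assumption
async def talk(ws: WebSocket):
    await ws.accept()
    while True:
        text = await ws.receive_text()
        # Ship each chunk as soon as it exists instead of waiting for the
        # whole utterance -- this is what keeps perceived latency low.
        for pcm, timings in synthesize_chunks(text):
            await ws.send_json({
                "audio": base64.b64encode(pcm).decode("ascii"),
                "timings": timings,  # word timings drive the avatar's lip-sync
            })
        await ws.send_json({"done": True})
```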

## ๐Ÿ—๏ธ Architecture

## 🛠️ Technology Stack

### Backend (Python)
- **🧠 AI Models from HuggingFace 🤗:**
  - `openai/whisper-tiny` - Speech recognition
  - `HuggingFaceTB/SmolVLM2-256M-Video-Instruct` - Vision-language understanding
  - `Kokoro TTS` - High-quality voice synthesis
- **⚡ Framework:** FastAPI with WebSocket support
- **🔧 Processing:** PyTorch, Transformers, Flash Attention 2
- **🎵 Audio:** SoundFile, NumPy for real-time processing
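
For orientation, the two HuggingFace models can be loaded roughly as their model cards describe (a sketch, not necessarily this project's exact code; Flash Attention 2 needs a supported NVIDIA GPU and the `flash-attn` package, otherwise drop that argument):

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, pipeline

# Speech recognition: whisper-tiny through the ASR pipeline
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# Vision-language understanding: SmolVLM2 (256M, video/instruct variant)
model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
vlm = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
).to("cuda")
```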

### Frontend (TypeScript/React)
- **🖼️ Framework:** Next.js 15 with TypeScript
- **🎨 UI:** Tailwind CSS + shadcn/ui components
- **🎭 3D Rendering:** [TalkingHead](https://github.com/met4citizen/TalkingHead) library
- **🎙️ Audio:** Web Audio API with AudioWorklet
- **📡 Communication:** Native WebSocket with React Context

## 📋 Requirements

### Tested System
- **OS:** Windows 11 (Linux/macOS support coming soon via a Docker image)
- **GPU:** NVIDIA RTX 3070 (8GB VRAM)

## 🚀 Quick Start

### 1. Prerequisites
- Node.js 20+
- PNPM
- Python 3.10
- UV (Python package manager)

### 2. Set Up Monorepo Dependencies from the Root
```bash
# sets up both the frontend and backend; requires the prerequisites above
pnpm run monorepo-setup
```

### 3. Run the Application

**Start Development Servers**
```bash
# Run both frontend and backend from root
pnpm dev

# Or run individually
pnpm dev:client # Frontend (http://localhost:3000)
pnpm dev:server # Backend (http://localhost:8000)
```
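
Once both servers are up, a minimal client can confirm the backend socket answers. This sketch assumes the `websockets` package and a `/ws` endpoint path; check the backend's routes for the real one:

```python
import asyncio
import websockets  # pip install websockets

async def smoke_test():
    # The endpoint path is an assumption; adjust to the backend's actual route.
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        await ws.send("hello")
        print(await ws.recv())

asyncio.run(smoke_test())
```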

### 4. Initial Setup
1. **Allow microphone access** when prompted
2. **Enable camera** for multimodal interactions
3. **Click "Connect"** to establish WebSocket connection
4. **Start Voice Control** and begin speaking!

## 🎮 Usage Guide

### Camera Controls
- **Drag** to move camera window
- **Resize** with maximize/minimize buttons
- **Toggle on/off** as needed

### Voice Settings
- **Energy Threshold:** Adjust sensitivity to background noise
- **Pause Duration:** How long to wait before processing speech
- **Min/Max Speech:** Control segment length limits (see the sketch after this list)
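
A sketch of how these knobs typically interact in an energy-based segmenter; the names and defaults below are illustrative, not the app's actual configuration:

```python
from dataclasses import dataclass

@dataclass
class VoiceSettings:
    energy_threshold: float = 0.01  # raise to ignore more background noise
    pause_duration: float = 0.8     # seconds of silence that close a segment
    min_speech: float = 0.3         # drop segments shorter than this (seconds)
    max_speech: float = 15.0        # force a cut once a segment reaches this

def should_close(speech_secs: float, silence_secs: float, s: VoiceSettings) -> bool:
    """A segment ends when the speaker pauses long enough, or is cut
    unconditionally once it hits the maximum length."""
    return silence_secs >= s.pause_duration or speech_secs >= s.max_speech

def keep(speech_secs: float, s: VoiceSettings) -> bool:
    """Segments below the minimum length are treated as noise and dropped."""
    return speech_secs >= s.min_speech
```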

## 🙏 Acknowledgments

- **TalkingHead** ([met4citizen](https://github.com/met4citizen/TalkingHead)) for 3D avatar rendering and lip-sync
- **yeyu2** ([Multimodal-local-phi4](https://github.com/yeyu2/Youtube_demos/tree/main/Multimodal-local-phi4)) for multimodal implementation inspiration

---

**โญ Star this repo if you find it useful! โญ**

Made with ❤️ by kiranbaby14