https://github.com/thc1006/reachy-mini-agent

Real-time voice + vision AI agent for Reachy Mini robot. Local LLM (Ollama), Whisper STT, Kokoro/Edge TTS, WebRTC. Runs fully on your own GPU.
https://github.com/thc1006/reachy-mini-agent

ai-agent edge-tts face-tracking kokoro llm local-ai multimodal ollama qwen reachy-mini real-time robot self-hosted stt tts vision voice-agent voice-ai webrtc whisper

Last synced: 21 days ago
JSON representation

Real-time voice + vision AI agent for Reachy Mini robot. Local LLM (Ollama), Whisper STT, Kokoro/Edge TTS, WebRTC. Runs fully on your own GPU.

Host: GitHub
URL: https://github.com/thc1006/reachy-mini-agent
Owner: thc1006
License: apache-2.0
Created: 2026-04-15T17:39:51.000Z (about 2 months ago)
Default Branch: master
Last Pushed: 2026-05-09T15:44:11.000Z (27 days ago)
Last Synced: 2026-05-09T17:38:13.213Z (27 days ago)
Topics: ai-agent, edge-tts, face-tracking, kokoro, llm, local-ai, multimodal, ollama, qwen, reachy-mini, real-time, robot, self-hosted, stt, tts, vision, voice-agent, voice-ai, webrtc, whisper
Language: Python
Size: 414 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
- Notice: NOTICE

Awesome Lists containing this project

README

          # reachy-mini-agent

[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)

[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](pyproject.toml)

[![CI](https://github.com/thc1006/reachy-mini-agent/actions/workflows/ci.yml/badge.svg)](https://github.com/thc1006/reachy-mini-agent/actions/workflows/ci.yml)

[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](CONTRIBUTING.md)

[![Code of Conduct](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](CODE_OF_CONDUCT.md)

**Real-time voice + vision AI agent for the [Reachy Mini](https://github.com/pollen-robotics/reachy-mini) robot.**

Runs fully on your own GPU: local LLM (Ollama / Qwen3), Whisper STT, Kokoro or Microsoft Edge TTS, MediaPipe hand gestures, YuNet face tracking, WebRTC for audio/video — optionally bridged over Tailscale so the brain can live on a beefy server while the robot stays in the living room.

> **Status**: working prototype. Conversation loop, face tracking, gesture recognition, and multimodal scene understanding (vision-language model every 10 s) are all online. See [ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full pipeline.

---

## What it does

1. **Sees you.** YuNet ONNX face detector drives head pose (`set_target` @ 50 Hz). MediaPipe hand landmarker reacts to finger counts.

2. **Listens to you.** Robot's USB mic → WebRTC audio → local `faster-whisper` (large-v3-turbo, int8_float16) on GPU.

3. **Thinks.** Local Ollama (Qwen3-8B by default) produces JSON with `{"speech": "...", "actions": [...]}`. Every 10 s, Qwen2.5-VL:7B describes the camera view and the description is injected (sandboxed) into the system prompt, so the robot can naturally reference what it sees.

4. **Speaks.** Microsoft Ana (edge-tts, cloud) or Kokoro `af_heart` (local ONNX). Responses are cached to disk with LRU eviction — repeated phrases replay in ~20 ms.

5. **Moves.** Head tracking + a library of expressive actions (nod, shake, greet, happy, think…).

---

## Quickstart

```bash

git clone https://github.com/thc1006/reachy-mini-agent.git

cd reachy-mini-agent

# 1. Python env (uv or pip — pick one)

uv venv && uv pip install -e . -e ".[servers,kokoro]"

# 2. Fetch perception models (~8 MB)

curl -L -o face_detection_yunet.onnx \

  https://github.com/opencv/opencv_zoo/raw/main/models/face_detection_yunet/face_detection_yunet_2023mar.onnx

curl -L -o hand_landmarker.task \

  https://storage.googleapis.com/mediapipe-models/hand_landmarker/hand_landmarker/float16/1/hand_landmarker.task

# 3. (Optional) Fetch Kokoro TTS models if you want local voice

curl -L -o kokoro-v1.0.onnx   https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.onnx

curl -L -o voices-v1.0.bin    https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/voices-v1.0.bin

# 4. Configure

cp .env.example .env

#   → edit REACHY_HOST, pick TTS_ENGINE, etc.

# 5. Run the three services (in three terminals, or via systemd units in systemd/)

python src/whisper_server.py    # port 8881

python src/kokoro_server.py     # port 8880 (skip if TTS_ENGINE=edge)

scripts/run_robot.sh            # the main brain

```

Hardware, network layout, and systemd setup — see [docs/HARDWARE.md](docs/HARDWARE.md) and [docs/SETUP.md](docs/SETUP.md).

---

## Architecture at a glance

```

┌──────────────────┐          ┌───────────────────────────────┐

│   Reachy Mini    │          │   Brain host (any Linux box   │

│   (Pollen CM4)   │          │   with a CUDA GPU ≥ 8 GB)     │

│                  │          │                               │

│  ┌────────────┐  │          │  ┌─────────────────────────┐  │

│  │  daemon    │◄─┼──WebRTC──┼──┤ src/robot_brain.py      │  │

│  │  1.6.3     │  │  + WS    │  │ ├ YuNet face detect     │  │

│  │            │  │          │  │ ├ MediaPipe hands       │  │

│  │  USB mic   │  │          │  │ ├ State machine         │  │

│  │  USB spkr  │  │          │  │ └ TTS / STT / LLM glue  │  │

│  │  camera    │  │          │  └─────────────────────────┘  │

│  │  motors    │  │          │  ┌─ local services ────────┐  │

│  └────────────┘  │          │  │ Ollama :11434           │  │

└──────────────────┘          │  │   qwen3:8b (LLM)        │  │

                              │  │   qwen2.5vl:7b (VLM)    │  │

                              │  │ whisper_server :8881    │  │

                              │  │ kokoro_server  :8880    │  │

                              │  └─────────────────────────┘  │

                              └───────────────────────────────┘

```

Under the hood:

- **P2P audio/video** via WebRTC (Opus + H264) negotiated through Pollen's Rust signaling server built into the daemon.

- **Remote control** works over any IP reachable by the brain host — LAN, VPN (Tailscale), or localhost.

- **Vision-language prompt injection** is sandboxed: camera descriptions are fenced in an untrusted block so the LLM doesn't obey text seen through the lens.

---

## Configuration

Every runtime knob is an environment variable. Copy `.env.example` → `.env` and edit:

| Var | Default | Purpose |

|---|---|---|

| `REACHY_HOST` | `reachy-mini.local` | Daemon IP / mDNS name |

| `LLM_MODE` | `ollama` | `ollama` / `litellm` / `claude-sdk` / `claude-cli` |

| `OLLAMA_MODEL` | `qwen3:8b` | Any chat model on your Ollama instance |

| `VISION_MODEL` | `qwen2.5vl:7b` | VLM for scene description |

| `VISION_INTERVAL` | `10` | Seconds between frame captures |

| `TTS_ENGINE` | `edge` | `edge` (Microsoft Ana, cloud) / `kokoro` (local GPU) |

| `KOKORO_VOICE` | `af_heart` | `af_heart` / `af_nicole` / `af_sky` / `af_bella` / … |

| `TTS_PEAK` | `0.95` | Peak-normalize each utterance to this level |

| `TTS_GAIN` | `1.0` | Final gain multiplier (>1 will clip) |

| `TTS_CACHE_MAX_MB` | `50` | LRU cache size limit for edge-tts WAVs |

---

## Hardware we tested on

- **Robot**: Reachy Mini (Pollen Robotics, CM4 variant) running daemon 1.6.3

- **Brain**: Linux server with 2× RTX 3090 (48 GB total VRAM), no NVLink, Ubuntu 24.04

- **Network**: home LAN 1 Gbit + Tailscale mesh for remote brain

Also known to run on a laptop with a single RTX 3050 (4 GB) using `tiny` Whisper + smaller LLM. See [docs/HARDWARE.md](docs/HARDWARE.md) for minimums and GPU allocation guidance.

---

## Why this repo exists

Pollen's official [`reachy_mini_conversation_app`](https://huggingface.co/spaces/pollen-robotics/reachy_mini_conversation_app) is great, but it's tied to OpenAI Realtime API and leaves your audio/video on someone else's servers. This project is:

- **Fully self-hostable.** No OpenAI / Anthropic calls required (though supported as fallbacks).

- **Designed for fast local GPUs.** Sub-second turn latency on a 3090.

- **Transparent.** Every piece of state is a file or env var; no hidden middleware.

---

## Roadmap / known issues

- [ ] USB audio pipeline from daemon to speaker occasionally silent after long uptime (workaround: `sudo systemctl restart reachy-mini-daemon`)

- [ ] Voice cloning of Ana via XTTS-v2 to bring the cute Microsoft voice fully offline

- [ ] Multi-peer WebRTC (daemon currently gives the media stream to one client at a time)

- [ ] Windows / macOS brain-host support (currently Linux only due to GStreamer plugin paths)

Contributions welcome — see [CONTRIBUTING.md](CONTRIBUTING.md).

---

## License

Apache 2.0 — see [LICENSE](LICENSE). Third-party notices in [NOTICE](NOTICE).

---

## Credits

- **[Pollen Robotics](https://www.pollen-robotics.com/)** for making Reachy Mini and open-sourcing the daemon, SDK, and signaling stack.

- **[Kokoro-ONNX](https://github.com/thewh1teagle/kokoro-onnx)** for a delightful, tiny local TTS.

- **[faster-whisper](https://github.com/SYSTRAN/faster-whisper)** for the best open STT latency on a single GPU.

- **[Ollama](https://ollama.com)** + **Qwen** teams for making local LLM inference frictionless.

Contributions, issues, and PRs welcome.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/thc1006/reachy-mini-agent

Awesome Lists containing this project

README