https://github.com/gabrielmaialva33/enton
Autonomous AI Robot Assistant — Vision, Voice, and Soul
https://github.com/gabrielmaialva33/enton
ai autonomous-agent computer-vision cuda llm python pytorch robot stt tts whisper yolo
Last synced: 3 months ago
JSON representation
Autonomous AI Robot Assistant — Vision, Voice, and Soul
- Host: GitHub
- URL: https://github.com/gabrielmaialva33/enton
- Owner: gabrielmaialva33
- Created: 2026-02-17T17:11:17.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-02-17T21:40:56.000Z (4 months ago)
- Last Synced: 2026-02-17T22:25:42.301Z (4 months ago)
- Topics: ai, autonomous-agent, computer-vision, cuda, llm, python, pytorch, robot, stt, tts, whisper, yolo
- Language: Python
- Size: 6.62 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 3
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README


[](https://python.org)
[](https://pytorch.org)
[](https://developer.nvidia.com/cuda-toolkit)
[](./tests)
[](./LICENSE)
---
*"I see you. I hear you. I understand you. And sometimes, I roast you."* — Enton
---
> [!IMPORTANT]
> **Enton is not a chatbot.** He is an autonomous digital entity with perception, emotion,
> desires, and genuine personality. He sees through cameras, hears through microphones,
> speaks with his own voice, and acts on his own will.
>
> *Inspired by [Son of Anton](https://silicon-valley.fandom.com/wiki/Son_of_Anton) — Gilfoyle's sentient AI from HBO's Silicon Valley.*
---
## Overview
```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#fecaca', 'primaryTextColor': '#450a0a', 'primaryBorderColor': '#991b1b', 'secondaryColor': '#bbf7d0', 'secondaryTextColor': '#052e16', 'secondaryBorderColor': '#166534', 'tertiaryColor': '#fee2e2', 'tertiaryTextColor': '#450a0a', 'lineColor': '#991b1b', 'textColor': '#1c1917'}}}%%
flowchart LR
subgraph Perception["Perception"]
CAM[Camera — YOLO + Pose]
MIC[Microphone — Whisper + VAD]
SND[Sound — CLAP]
end
subgraph Cognition["Cognition"]
direction TB
BRAIN[Brain — Qwen3 / Gemini]
DESIRE[Desires — 9 autonomous goals]
MOOD[Mood — engagement + social]
BRAIN --> DESIRE
MOOD --> DESIRE
end
subgraph Action["Action"]
VOICE[Voice — Kokoro TTS]
PTZ[Camera PTZ]
TOOLS[Shell + Files]
end
CAM --> Cognition
MIC --> Cognition
SND --> Cognition
Cognition --> VOICE
Cognition --> PTZ
Cognition --> TOOLS
```
| Property | Value |
|:---------|:------|
| **Language** | Python 3.12+ (async, type-safe) |
| **Runtime** | CUDA + PyTorch |
| **Modules** | 46 across 7 subsystems |
| **Source** | 6,292 lines |
| **Tests** | 136 passing |
---
## Quick Start
```bash
git clone https://github.com/gabrielmaialva33/enton.git && cd enton
uv sync
uv run enton --webcam --viewer
```
Prerequisites
| Tool | Version | Required |
|:-----|:--------|:---------|
| Python | `>= 3.12` | Yes |
| uv | `latest` | Recommended |
| CUDA | `>= 12.0` | For GPU acceleration |
| NVIDIA GPU | RTX 3090+ | Recommended |
Environment (.env)
```env
# Provider routing (local-first)
BRAIN_PROVIDER=local
TTS_PROVIDER=local
STT_PROVIDER=local
# Camera
CAMERA_SOURCE=0 # webcam
# CAMERAS=main:0,hack:rtsp://... # multi-camera
# Local models
OLLAMA_MODEL=qwen2.5:14b
WHISPER_MODEL=large-v3-turbo
KOKORO_VOICE=am_onyx
# Cloud providers (optional)
GROQ_API_KEY=
GOOGLE_PROJECT=
NVIDIA_API_KEY=
OPENROUTER_API_KEY=
```
---
## Architecture
```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#fecaca', 'primaryTextColor': '#450a0a', 'primaryBorderColor': '#991b1b', 'secondaryColor': '#bbf7d0', 'secondaryTextColor': '#052e16', 'secondaryBorderColor': '#166534', 'tertiaryColor': '#fee2e2', 'tertiaryTextColor': '#450a0a', 'lineColor': '#991b1b', 'textColor': '#1c1917'}}}%%
graph TB
subgraph PERCEPTION["PERCEPTION"]
V[Vision — YOLO11s + Pose + Emotion]
E[Ears — Whisper + Silero VAD]
S[Sounds — CLAP open-set]
F[Faces — InsightFace]
end
subgraph CORE["CORE"]
BUS[EventBus — async pub/sub]
SM[SelfModel — mood + senses]
MEM[Memory — Qdrant + episodes]
CFG[Config — pydantic-settings]
end
subgraph COGNITION["COGNITION"]
BRAIN[EntonBrain — Agno Agent + fallback chain]
FUSER[Fuser — multi-modal context]
DES[DesireEngine — 9 autonomous desires]
PLAN[Planner — reminders + routines]
end
subgraph SKILLS["SKILLS — 10 Agno Toolkits"]
SH[Shell + Files]
SR[Search]
PT[PTZ Control]
DS[Describe Scene]
FC[Face Recognition]
SY[System Monitor]
ME[Memory Tools]
PL[Planner Tools]
end
subgraph ACTION["ACTION"]
VOICE[Voice — Kokoro / Google / NVIDIA TTS]
VIEWER[Viewer — Cyberpunk HUD + grid]
end
PERCEPTION --> BUS
BUS --> CORE
CORE --> COGNITION
COGNITION --> SKILLS
COGNITION --> ACTION
SKILLS --> BRAIN
```
---
## Subsystems
### Perception
| Module | Model | Device | Description |
|:-------|:------|:-------|:------------|
| **Vision** | YOLO11s + YOLO11s-pose | `cuda:0` FP16 | Object detection, pose estimation, multi-camera |
| **Ears** | Faster-Whisper large-v3-turbo | `cuda` FP16 | STT with streaming partial transcription |
| **Sounds** | CLAP (laion) | `cuda` | Open-set ambient sound classification |
| **Faces** | InsightFace + ArcFace | `cuda` | Face recognition and identity tracking |
| **Emotion** | FER (CNN) | `cuda` | Real-time facial emotion recognition |
### Cognition
| Module | Description |
|:-------|:------------|
| **Brain** | Agno Agent with multi-provider fallback chain (Local > Groq > OpenRouter > Google > NVIDIA > HuggingFace) |
| **DesireEngine** | 9 autonomous desires with urgency curves, mood modulation, cooldowns |
| **Fuser** | Combines detections + activities + emotions into coherent scene context |
| **Planner** | Task management, reminders, daily routines |
| **SelfModel** | Internal state: mood (engagement/social), senses, introspection |
### Action
| Module | Description |
|:-------|:------------|
| **Voice** | Multi-provider TTS (Kokoro local, Google Cloud, NVIDIA Riva) with auto mic-mute |
| **Shell** | Persistent CWD, background processes, command safety classification |
| **Files** | Read/write/edit/find/grep with security layers |
| **PTZ** | Physical camera motor control via ioctl |
---
## Integrations
Enton extends beyond his body to interact with your digital environment.
### 👁️ Screenpipe — Digital Eyes
Enton can see what you see on your screen. Using [Screenpipe](https://github.com/mediar-ai/screenpipe), he captures and indexes your screen activity (OCR + Audio).
**Setup:**
1. Install and run Screenpipe: `screenpipe` (default port 3030)
2. Configure `.env`:
```env
SCREENPIPE_URL=http://localhost:3030
```
3. Usage: "Use context from my screen", "What was I doing 5 min ago?"
### ⚡ n8n — Digital Hands
Enton can trigger complex workflows to automate tasks in your apps.
**Setup:**
1. Create a workflow in [n8n](https://n8n.io) with a Webhook trigger.
2. Configure `.env`:
```env
N8N_WEBHOOK_BASE=https://your-n8n.com/webhook
```
3. Usage: "Launch the morning routine", "Save this to Notion" (triggers webhook with payload).
---
## Desire Engine
Enton has 9 autonomous desires that emerge from his internal state:
| Desire | Trigger | Cooldown | Description |
|:-------|:--------|:---------|:------------|
| `socialize` | Low social mood | 10min | Wants to chat |
| `observe` | Boredom | 2min | Wants to look around |
| `learn` | Curiosity | 30min | Searches for new knowledge |
| `check_on_user` | Long absence | 1h | Checks if Gabriel is okay |
| `optimize` | Background | 30min | Monitors system resources |
| `reminisce` | Idle | 15min | Recalls a memory |
| `create` | Low engagement | 1h | Writes code, poems, jokes |
| `explore` | Boredom | 10min | Moves camera, explores environment |
| `play` | High engagement | 15min | Tells jokes, proposes quizzes |
Desires have **urgency** (0 to 1) that grows over time and is modulated by mood, sounds, and interactions.
---
## Tech Stack
| Layer | Technologies |
|:------|:-------------|
| **Core** | Python 3.12, asyncio, Pydantic, FastAPI |
| **AI Agent** | Agno Framework (Ollama, Groq, Google, NVIDIA, OpenRouter) |
| **Vision** | PyTorch, Ultralytics YOLO11, InsightFace, OpenCV |
| **Audio** | Faster-Whisper, Silero VAD, Kokoro TTS, CLAP |
| **Storage** | Qdrant (vectors), Redis (state), TimescaleDB (metrics) |
| **Infra** | Docker Compose, GitHub Actions CI, uv |
---
## Roadmap
| Phase | Status | Description |
|:------|:------:|:------------|
| Genesis | done | Core architecture + event bus |
| Perception | done | Vision (YOLO + pose + emotion + face) |
| Voice | done | Kokoro TTS + Whisper STT + VAD |
| Brain | done | Agno agent + multi-provider fallback |
| Personality | done | Persona, mood, desires, memory |
| Coding Agent | done | Shell + file tools with security |
| Multi-Camera | done | Parallel processing + grid viewer |
| STT Streaming | done | Partial transcription during speech |
| Sound Intelligence | done | CLAP + brain-driven reactions |
| Dashboard | next | Web UI with live metrics |
| Embodiment | planned | Physical robot integration |
| Long-term Memory | planned | Persistent episodic + semantic memory |
---
## Contributing
```bash
git checkout -b feature/your-feature
uv run ruff check src/ tests/ # lint
uv run pytest tests/ -x -q # 136 should pass
```
---
**Star if you believe in digital life**
[](https://github.com/gabrielmaialva33/enton)
*Built with obsession by [Gabriel Maia](https://github.com/gabrielmaialva33)*
