# Open Vision Agents by Stream

[![build](https://github.com/GetStream/Vision-Agents/actions/workflows/ci.yml/badge.svg)](https://github.com/GetStream/Vision-Agents/actions)
[![PyPI version](https://badge.fury.io/py/vision-agents.svg)](https://badge.fury.io/py/vision-agents)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/vision-agents.svg)
[![License](https://img.shields.io/github/license/GetStream/Vision-Agents)](https://github.com/GetStream/Vision-Agents/blob/main/LICENSE)
[![Discord](https://img.shields.io/discord/1108586339550638090)](https://discord.gg/RkhX9PxMS6)

---

## Build Real-Time Vision AI Agents


Watch the demo

### Multi-modal AI agents that watch, listen, and understand video.

Vision Agents give you the building blocks to create intelligent, low-latency video experiences powered by your models, your infrastructure, and your use cases.

### Key Highlights

- **Video AI:** Built for real-time video AI: combine YOLO, Roboflow, and other vision models with Gemini/OpenAI realtime models.
- **Low Latency:** Join calls in ~500 ms and keep audio/video latency under 30 ms using [Stream's edge network](https://getstream.io/video/).
- **Open:** Built by Stream, but works with any video edge network.
- **Native APIs:** Native SDK methods from OpenAI (`create response`), Gemini (`generate`), and Claude (`create message`) — always access the latest LLM capabilities.
- **SDKs:** SDKs for React, Android, iOS, Flutter, React Native, and Unity, powered by Stream's ultra-low-latency network.

---

## See It In Action

### Sports Coaching

This example shows how to build a golf coaching AI with YOLO and OpenAI realtime.
Combining a fast object detection model (like YOLO) with a full realtime AI is useful for many video AI use cases:
drone fire detection, sports/video-game coaching, physical therapy, workout coaching, Just Dance-style games, and more.

```python
# Partial example; full version: examples/02_golf_coach_example/golf_coach_example.py
agent = Agent(
    edge=getstream.Edge(),
    agent_user=agent_user,
    instructions="Read @golf_coach.md",
    llm=openai.Realtime(fps=10),
    # llm=gemini.Realtime(fps=1),  # careful: higher FPS can get expensive
    processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt")],
)
```
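In this pattern, the cheap detection model runs on every frame while the realtime LLM only needs a compact summary of what was detected. A library-independent toy sketch of that idea (the function and keypoint names are illustrative, not the vision-agents API):

```python
# Toy sketch: turn per-frame pose keypoints into a compact text summary
# that a realtime LLM could consume. Illustrative only -- this is not
# the vision-agents processor API.

def summarize_pose(keypoints: dict[str, tuple[float, float]]) -> str:
    """Describe a golfer's posture from named (x, y) keypoints."""
    lw, rw = keypoints["left_wrist"], keypoints["right_wrist"]
    hip = keypoints["hip"]
    # Hands above hip level (smaller y = higher in image coordinates)
    hands_up = lw[1] < hip[1] and rw[1] < hip[1]
    phase = "backswing or follow-through" if hands_up else "address or impact"
    return f"Golfer appears to be at {phase}."

frame_keypoints = {
    "left_wrist": (0.40, 0.30),
    "right_wrist": (0.45, 0.28),
    "hip": (0.50, 0.55),
}
print(summarize_pose(frame_keypoints))  # -> backswing or follow-through
```

A real pose processor would emit richer structured data, but the principle is the same: send the LLM a few tokens of state per frame instead of raw pixels.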



Golf Example

### Cluely-style Invisible Assistant (coming soon)

Apps like Cluely offer realtime coaching via an invisible overlay. This example shows how to build your own invisible assistant.
It combines Gemini realtime (which watches your screen and listens to audio) with a text-only response channel, so no audio is broadcast.
This approach is quite versatile and can be used for sales coaching, job interview cheating, or physical-world/on-the-job coaching with glasses.

Demo video

```python
agent = Agent(
    edge=StreamEdge(),      # low-latency edge; clients for React, iOS, Android, RN, Flutter, etc.
    agent_user=agent_user,  # the user object for the agent (name, image, etc.)
    instructions="You are silently helping the user pass this interview. See @interview_coach.md",
    # Gemini realtime: no need to set tts or stt (though that's also supported)
    llm=gemini.Realtime(),
)
```

## Quick Start

**Step 1: Install via uv**

`uv add vision-agents`

**Step 2: (Optional) Install with extra integrations**

`uv add "vision-agents[getstream, openai, elevenlabs, deepgram]"`

**Step 3: Obtain your Stream API credentials**

Get a free API key from [Stream](https://getstream.io/). Developers receive **333,000 participant minutes** per month, plus extra credits via the Maker Program.

## Features

| **Feature** | **Description** |
| ----------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| **True real-time via WebRTC** | Stream directly to model providers that support it for instant visual understanding. |
| **Interval/processor pipeline** | For providers without WebRTC, process frames with pluggable video processors (e.g., YOLO, Roboflow, or custom PyTorch/ONNX) before/after model calls. |
| **Turn detection & diarization** | Keep conversations natural; know when the agent should speak or stay quiet and who's talking. |
| **Voice activity detection (VAD)** | Trigger actions intelligently and use resources efficiently. |
| **Speech↔Text↔Speech** | Enable low-latency loops for smooth, conversational voice UX. |
| **Tool/function calling** | Execute arbitrary code and APIs mid-conversation. Create Linear issues, query weather, trigger telephony, or hit internal services. |
| **Built-in memory via Stream Chat** | Agents recall context naturally across turns and sessions. |
| **Text back-channel** | Message the agent silently during a call. |
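To make the VAD row above concrete: the simplest possible voice activity detector just thresholds short-term audio energy. A toy, stdlib-only sketch of the idea (real plugins such as Silero use trained neural models and are far more robust):

```python
# Toy energy-threshold VAD: flag audio frames whose RMS energy exceeds a
# threshold. Real VAD plugins (e.g. Silero) use trained models; this
# only illustrates the concept.
import math

def frame_rms(samples: list[float]) -> float:
    """Root-mean-square energy of one audio frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_speech(samples: list[float], threshold: float = 0.05) -> bool:
    """Crude speech/silence decision for a single frame."""
    return frame_rms(samples) > threshold

silence = [0.001] * 320  # near-zero samples -> below threshold
voiced = [0.2] * 320     # louder frame -> above threshold
print(is_speech(silence), is_speech(voiced))  # False True
```

In practice the agent framework feeds such a detector a rolling stream of frames and uses its output to decide when to transcribe, respond, or stay quiet.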

## Out-of-the-Box Integrations

| **Plugin Name** | **Description** | **Docs Link** |
|-------------|-------------|-----------|
| AWS Polly | TTS plugin using Amazon's cloud-based service with natural-sounding voices and neural engine support | [AWS Polly](https://visionagents.ai/integrations/aws-polly) |
| Cartesia | TTS plugin for realistic voice synthesis in real-time voice applications | [Cartesia](https://visionagents.ai/integrations/cartesia) |
| Deepgram | STT plugin for fast, accurate real-time transcription with speaker diarization | [Deepgram](https://visionagents.ai/integrations/deepgram) |
| ElevenLabs | TTS plugin with highly realistic and expressive voices for conversational agents | [ElevenLabs](https://visionagents.ai/integrations/elevenlabs) |
| Fish Audio | STT and TTS plugin with automatic language detection and voice cloning capabilities | [Fish Audio](https://visionagents.ai/integrations/fish) |
| Gemini | Realtime API for building conversational agents with support for both voice and video | [Gemini](https://visionagents.ai/integrations/gemini) |
| HeyGen | Realtime interactive avatars powered by [HeyGen](https://heygen.com/) | [Heygen](https://visionagents.ai/integrations/heygen) |
| Inworld | TTS plugin with high-quality streaming voices for real-time conversational AI agents | [Inworld](https://visionagents.ai/integrations/inworld) |
| Kokoro | Local TTS engine for offline voice synthesis with low latency | [Kokoro](https://visionagents.ai/integrations/kokoro) |
| Moondream | Moondream provides realtime detection and VLM capabilities. Developers can choose from using the hosted API or running locally on their CUDA devices. Vision Agents supports Moondream's Detect, Caption and VQA skills out-of-the-box. | [Moondream](https://visionagents.ai/integrations/moondream) |
| OpenAI | Realtime API for building conversational agents, with out-of-the-box support for real-time video over WebRTC, LLMs, and OpenAI TTS | [OpenAI](https://visionagents.ai/integrations/openai) |
| Smart Turn | Advanced turn detection system combining Silero VAD, Whisper, and neural models for natural conversation flow | [Smart Turn](https://visionagents.ai/integrations/smart-turn) |
| Vogent | Neural turn detection system for intelligent turn-taking in voice conversations | [Vogent](https://visionagents.ai/integrations/vogent) |
| Wizper | STT plugin with real-time translation capabilities powered by Whisper v3 | [Wizper](https://visionagents.ai/integrations/wizper) |

## Processors

Processors let your agent **manage state** and **handle audio/video** in real-time.

They take care of the hard stuff, like:

- Running smaller models
- Making API calls
- Transforming media

… so you can focus on your agent logic.
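The pattern behind processors can be sketched in a few lines: each stage receives a frame, enriches it, and passes it on. This is a minimal, library-independent illustration of that pipeline idea (the class and function names are assumptions, not the actual vision-agents base classes):

```python
# Minimal frame-processor pipeline sketch. Illustrates the pattern only;
# the real vision-agents processor interface may differ.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Frame:
    index: int
    annotations: dict = field(default_factory=dict)

Processor = Callable[[Frame], Frame]

def detect_objects(frame: Frame) -> Frame:
    # Stand-in for a small detection model like YOLO.
    frame.annotations["objects"] = ["person", "golf_club"]
    return frame

def add_caption(frame: Frame) -> Frame:
    # Stand-in for an API call that summarizes detections as text.
    frame.annotations["caption"] = "Detected: " + ", ".join(frame.annotations["objects"])
    return frame

def run_pipeline(frame: Frame, processors: list[Processor]) -> Frame:
    for proc in processors:
        frame = proc(frame)
    return frame

result = run_pipeline(Frame(index=0), [detect_objects, add_caption])
print(result.annotations["caption"])  # -> Detected: person, golf_club
```

The agent then only has to consume the enriched annotations, which is what keeps the agent logic itself small.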

## Documentation

Check out our getting started guide at [VisionAgents.ai](https://visionagents.ai/).

- **Quickstart:** [Building a Voice AI app](https://visionagents.ai/introduction/voice-agents)
- **Quickstart:** [Building a Video AI app](https://visionagents.ai/introduction/video-agents)
- **Tutorial:** [Building real-time sports coaching](https://github.com/GetStream/Vision-Agents/tree/main/examples/02_golf_coach_example)
- **Tutorial:** [Building a real-time meeting assistant](https://github.com/GetStream/Vision-Agents#)

## Development

See [DEVELOPMENT.md](DEVELOPMENT.md)

## Open Platform

Want to add your platform or provider? Reach out to **nash@getstream.io**.

## Awesome Video AI

Our favorite people & projects to follow for vision AI

- [@demishassabis](https://x.com/demishassabis): CEO @ Google DeepMind, Nobel Prize winner
- [@OfficialLoganK](https://x.com/OfficialLoganK): Product Lead @ Gemini, posts about robotics vision
- [@ultralytics](https://x.com/ultralytics): various fast vision AI models (pose, detect, segment, classify)
- [@skalskip92](https://x.com/skalskip92): Open Source Lead @ Roboflow, building tools for vision AI
- [@moondreamai](https://x.com/moondreamai): the tiny vision model that could; lightweight, fast, efficient
- [@kwindla](https://x.com/kwindla): Pipecat / Daily, sharing AI and vision insights
- [@juberti](https://x.com/juberti): Head of Realtime AI @ OpenAI, realtime AI systems
- [@romainhuet](https://x.com/romainhuet): Head of DX @ OpenAI, developer tooling & APIs
- [@thorwebdev](https://x.com/thorwebdev): ElevenLabs, voice and AI experiments
- [@mervenoyann](https://x.com/mervenoyann): Hugging Face, posts extensively about video AI
- [@stash_pomichter](https://x.com/stash_pomichter): spatial memory for robots, robotics & AI navigation
- [@Mentraglass](https://x.com/Mentraglass): open-source, hackable AR smart glasses with built-in AI
- [@vikhyatk](https://x.com/vikhyatk): AI engineer, creator of Moondream, open-source AI projects

## Inspiration

- LiveKit Agents: great syntax, but LiveKit only
- Pipecat: flexible, but more verbose
- OpenAI Agents: focused on OpenAI only

## Roadmap

### 0.1 – First Release - Oct

- Working TTS, Gemini & OpenAI

### 0.2 - Simplification - Nov

- Simplified the library & improved code quality
- Deepgram Nova 3, ElevenLabs Scribe 2, Fish, Moondream, Qwen3, Smart Turn, Vogent, Inworld, HeyGen, AWS, and more
- Improved OpenAI & Gemini realtime performance
- Audio & Video utilities

### 0.3 - Demos - Nov/Dec

### 0.4 - Deploys

- Tips on deploying agents at scale, monitoring them etc.

### Later

- [ ] Buffered video capture (for "catch the moment" scenarios)

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=GetStream/vision-agents&type=timeline&legend=top-left)](https://www.star-history.com/#GetStream/vision-agents&type=timeline&legend=top-left)