{"id":33264986,"url":"https://github.com/getstream/vision-agents","last_synced_at":"2026-04-07T19:01:17.053Z","repository":{"id":318706271,"uuid":"1036205124","full_name":"GetStream/Vision-Agents","owner":"GetStream","description":"Open Vision Agents by Stream. Build Vision Agents quickly with any model or video provider. Uses Stream's edge network for ultra-low latency.","archived":false,"fork":false,"pushed_at":"2026-04-03T01:18:30.000Z","size":142005,"stargazers_count":7631,"open_issues_count":15,"forks_count":620,"subscribers_count":53,"default_branch":"main","last_synced_at":"2026-04-03T03:39:12.903Z","etag":null,"topics":["agentic-ai","agents","ai","ai-agents","realtime","stt","tts","video-agents","video-ai","vision-ai","voice-ai"],"latest_commit_sha":null,"homepage":"https://visionagents.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GetStream.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-11T18:02:04.000Z","updated_at":"2026-04-02T22:57:30.000Z","dependencies_parsed_at":"2025-10-24T18:26:28.617Z","dependency_job_id":"6cb87733-3e2b-4935-b12c-a8d0344b8096","html_url":"https://github.com/GetStream/Vision-Agents","commit_stats":null,"previous_names":["getstream/vision-agents"],"tags_count":45,"template":false,"template_full_name":null,"purl":"pkg:github/GetStream/Vision-Agents","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GetStream%2FVisio
n-Agents","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GetStream%2FVision-Agents/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GetStream%2FVision-Agents/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GetStream%2FVision-Agents/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GetStream","download_url":"https://codeload.github.com/GetStream/Vision-Agents/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GetStream%2FVision-Agents/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31524531,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-07T16:28:08.000Z","status":"ssl_error","status_checked_at":"2026-04-07T16:28:06.951Z","response_time":105,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic-ai","agents","ai","ai-agents","realtime","stt","tts","video-agents","video-ai","vision-ai","voice-ai"],"created_at":"2025-11-17T06:00:50.186Z","updated_at":"2026-04-07T19:01:17.031Z","avatar_url":"https://github.com/GetStream.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg width=\"1280\" height=\"360\" alt=\"Readme\" 
src=\"https://github.com/user-attachments/assets/80c437dc-a80a-45da-bd18-0545740a3358\" /\u003e\n\n# Open Vision Agents by Stream\n\n[![build](https://github.com/GetStream/Vision-Agents/actions/workflows/ci.yml/badge.svg)](https://github.com/GetStream/Vision-Agents/actions)\n[![PyPI version](https://badge.fury.io/py/vision-agents.svg)](http://badge.fury.io/py/vision-agents)\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/vision-agents.svg)\n[![License](https://img.shields.io/github/license/GetStream/Vision-Agents)](https://github.com/GetStream/Vision-Agents/blob/main/LICENSE)\n[![Discord](https://img.shields.io/discord/1108586339550638090)](https://discord.gg/RkhX9PxMS6)\n\n---\n\n## Build Real-Time Vision AI Agents\n\n\u003ca href=\"https://youtu.be/Hpl5EcCpLw8\"\u003e\n  \u003cimg src=\"assets/demo_thumbnail.png\" alt=\"Watch the demo\" style=\"width:100%; max-width:900px;\"\u003e\n\u003c/a\u003e\n\n### Multi-modal AI agents that watch, listen, and understand video.\n\nVision Agents give you the building blocks to create intelligent, low-latency video experiences powered by your models, your infrastructure, and your use cases.\n\n### Key Highlights\n\n- **Video AI:** Built for real-time video AI. 
Combine YOLO, Roboflow, and others with Gemini/OpenAI in real-time.\n- **Low Latency:** Join quickly (500ms) and maintain audio/video latency under 30ms using [Stream's edge network](https://getstream.io/video/).\n- **Open:** Built by Stream, but works with any video edge network.\n- **Native APIs:** Native SDK methods from OpenAI (`create response`), Gemini (`generate`), and Claude (`create message`), so you can always access the latest LLM capabilities.\n- **SDKs:** SDKs for React, Android, iOS, Flutter, React Native, and Unity, powered by Stream's ultra-low-latency network.\n\n---\n\n## See It In Action\n\n### Sports Coaching\n\nThis example shows you how to build golf coaching AI with YOLO and OpenAI realtime.\nCombining a fast object detection model (like YOLO) with a full realtime AI is useful for many different video AI use cases.\nFor example: drone fire detection, sports/video game coaching, physical therapy, workout coaching, Just Dance-style games, etc.\n\n```python\n# partial example, full example: examples/02_golf_coach_example/golf_coach_example.py\nagent = Agent(\n    edge=getstream.Edge(),\n    agent_user=agent_user,\n    instructions=\"Read @golf_coach.md\",\n    llm=openai.Realtime(fps=10),\n    # llm=gemini.Realtime(fps=1),  # Careful: higher FPS can get expensive\n    processors=[ultralytics.YOLOPoseProcessor(model_path=\"yolo11n-pose.pt\")],\n)\n```\n\n\u003ca href=\"https://x.com/nash0x7e2/status/1950341779745599769\"\u003e\n  \u003cimg src=\"assets/golf_example_tweet.png\" alt=\"Golf Example\" style=\"width:100%; max-width:800px;\"\u003e\n\u003c/a\u003e\n\n### Cluely style Invisible Assistant (coming soon)\n\nApps like Cluely offer realtime coaching via an invisible overlay. 
This example shows you how you can build your own invisible assistant.\nIt uses Gemini realtime to watch your screen and listen to audio, and it responds with text only (no audio is broadcast). This approach\nis quite versatile and can be used for sales coaching, job interview cheating, and physical-world / on-the-job coaching with glasses.\n\nDemo video\n\n```python\nagent = Agent(\n    edge=StreamEdge(),  # low latency edge. clients for React, iOS, Android, RN, Flutter etc.\n    agent_user=agent_user,  # the user object for the agent (name, image etc)\n    instructions=\"You are silently helping the user pass this interview. See @interview_coach.md\",\n    # Gemini realtime; no need to set tts or stt (though that's also supported)\n    llm=gemini.Realtime()\n)\n```\n\n## Quick Start\n\n**Step 1: Install via uv**\n\n`uv add vision-agents`\n\n**Step 2: (Optional) Install with extra integrations**\n\n`uv add \"vision-agents[getstream, openai, elevenlabs, deepgram]\"`\n\n**Step 3: Obtain your Stream API credentials**\n\nGet a free API key from [Stream](https://getstream.io/). 
Developers receive **333,000 participant minutes** per month, plus extra credits via the Maker Program.\n\n## Features\n\n| **Feature**                         | **Description**                                                                                                                                       |\n| ----------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |\n| **True real-time via WebRTC**       | Stream directly to model providers that support it for instant visual understanding.                                                                  |\n| **Interval/processor pipeline**     | For providers without WebRTC, process frames with pluggable video processors (e.g., YOLO, Roboflow, or custom PyTorch/ONNX) before/after model calls. |\n| **Turn detection \u0026 diarization**    | Keep conversations natural; know when the agent should speak or stay quiet and who's talking.                                                         |\n| **Voice activity detection (VAD)**  | Trigger actions intelligently and use resources efficiently.                                                                                          |\n| **Speech↔Text↔Speech**              | Enable low-latency loops for smooth, conversational voice UX.                                                                                         |\n| **Tool/function calling**           | Execute arbitrary code and APIs mid-conversation. Create Linear issues, query weather, trigger telephony, or hit internal services.                   |\n| **Built-in memory via Stream Chat** | Agents recall context naturally across turns and sessions.                                                                                            |\n| **Text back-channel**               | Message the agent silently during a call.                                                       
                                                      |\n\n## Out-of-the-Box Integrations\n\n| **Plugin Name** | **Description** | **Docs Link** |\n|-------------|-------------|-----------|\n| AWS Polly | TTS plugin using Amazon's cloud-based service with natural-sounding voices and neural engine support | [AWS Polly](https://visionagents.ai/integrations/aws-polly) |\n| Cartesia | TTS plugin for realistic voice synthesis in real-time voice applications | [Cartesia](https://visionagents.ai/integrations/cartesia) |\n| Deepgram | STT plugin for fast, accurate real-time transcription with speaker diarization | [Deepgram](https://visionagents.ai/integrations/deepgram) |\n| ElevenLabs | TTS plugin with highly realistic and expressive voices for conversational agents | [ElevenLabs](https://visionagents.ai/integrations/elevenlabs) |\n| Fish Audio | STT and TTS plugin with automatic language detection and voice cloning capabilities | [Fish Audio](https://visionagents.ai/integrations/fish) |\n| Gemini | Realtime API for building conversational agents with support for both voice and video | [Gemini](https://visionagents.ai/integrations/gemini) |\n| HeyGen | Realtime interactive avatars powered by [HeyGen](https://heygen.com/) | [Heygen](https://visionagents.ai/integrations/heygen) |\n| Inworld | TTS plugin with high-quality streaming voices for real-time conversational AI agents | [Inworld](https://visionagents.ai/integrations/inworld) |\n| Kokoro | Local TTS engine for offline voice synthesis with low latency | [Kokoro](https://visionagents.ai/integrations/kokoro) |\n| Moondream | Moondream provides realtime detection and VLM capabilities. Developers can choose from using the hosted API or running locally on their CUDA devices. Vision Agents supports Moondream's Detect, Caption and VQA skills out-of-the-box. 
| [Moondream](https://visionagents.ai/integrations/moondream) |\n| OpenAI | Realtime API for building conversational agents with out-of-the-box support for real-time video directly over WebRTC, LLMs, and OpenAI TTS | [OpenAI](https://visionagents.ai/integrations/openai) |\n| Smart Turn | Advanced turn detection system combining Silero VAD, Whisper, and neural models for natural conversation flow | [Smart Turn](https://visionagents.ai/integrations/smart-turn) |\n| Vogent | Neural turn detection system for intelligent turn-taking in voice conversations | [Vogent](https://visionagents.ai/integrations/vogent) |\n| Wizper | STT plugin with real-time translation capabilities powered by Whisper v3 | [Wizper](https://visionagents.ai/integrations/wizper) |\n\n\n## Processors\n\nProcessors let your agent **manage state** and **handle audio/video** in real time.\n\nThey take care of the hard stuff, like:\n\n- Running smaller models\n- Making API calls\n- Transforming media\n\n… so you can focus on your agent logic.\n\n## Documentation\n\nCheck out our getting started guide at [VisionAgents.ai](https://visionagents.ai/).\n\n**Quickstart:** [Building a Voice AI app](https://visionagents.ai/introduction/voice-agents)  \n**Quickstart:** [Building a Video AI app](https://visionagents.ai/introduction/video-agents)  \n**Tutorial:** [Building real-time sports coaching](https://github.com/GetStream/Vision-Agents/tree/main/examples/02_golf_coach_example)  \n**Tutorial:** [Building a real-time meeting assistant](https://github.com/GetStream/Vision-Agents#)\n\n## Development\n\nSee [DEVELOPMENT.md](DEVELOPMENT.md)\n\n## Open Platform\n\nWant to add your platform or provider? 
Reach out to **nash@getstream.io**.\n\n## Awesome Video AI\n\nOur favorite people \u0026 projects to follow for vision AI\n\n|  [\u003cimg src=\"https://github.com/user-attachments/assets/9149e871-cfe8-4169-a4ce-4073417e645c\" width=\"80\"/\u003e](https://x.com/demishassabis)  |       [\u003cimg src=\"https://github.com/user-attachments/assets/2e1335d3-58af-4988-b879-1db8d862cd34\" width=\"80\"/\u003e](https://x.com/OfficialLoganK)        |            [\u003cimg src=\"https://github.com/user-attachments/assets/c9249ae9-e66a-4a70-9393-f6fe4ab5c0b0\" width=\"80\"/\u003e](https://x.com/ultralytics)            |\n| :----------------------------------------------------------------------------------------------------: | :----------------------------------------------------------------------------------------------------------------: | :----------------------------------------------------------------------------------------------------------------------: |\n| [@demishassabis](https://x.com/demishassabis)\u003cbr\u003eCEO @ Google DeepMind\u003cbr\u003e\u003csub\u003eWon a Nobel prize\u003c/sub\u003e | [@OfficialLoganK](https://x.com/OfficialLoganK)\u003cbr\u003eProduct Lead @ Gemini\u003cbr\u003e\u003csub\u003ePosts about robotics vision\u003c/sub\u003e | [@ultralytics](https://x.com/ultralytics)\u003cbr\u003eVarious fast vision AI models\u003cbr\u003e\u003csub\u003ePose, detect, segment, classify\u003c/sub\u003e |\n\n|         [\u003cimg src=\"https://github.com/user-attachments/assets/c1fe873d-6f41-4155-9be1-afc287ca9ac7\" width=\"80\"/\u003e](https://x.com/skalskip92)         |            [\u003cimg src=\"https://github.com/user-attachments/assets/43359165-c23d-4d5d-a5a6-1de58d71fabd\" width=\"80\"/\u003e](https://x.com/moondreamai)            |  [\u003cimg src=\"https://github.com/user-attachments/assets/490d349c-7152-4dfb-b705-04e57bb0a4ca\" width=\"80\"/\u003e](https://x.com/kwindla)   |\n| 
:---------------------------------------------------------------------------------------------------------------: | :----------------------------------------------------------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------: |\n| [@skalskip92](https://x.com/skalskip92)\u003cbr\u003eOpen Source Lead @ Roboflow\u003cbr\u003e\u003csub\u003eBuilding tools for vision AI\u003c/sub\u003e | [@moondreamai](https://x.com/moondreamai)\u003cbr\u003eThe tiny vision model that could\u003cbr\u003e\u003csub\u003eLightweight, fast, efficient\u003c/sub\u003e | [@kwindla](https://x.com/kwindla)\u003cbr\u003ePipecat / Daily\u003cbr\u003e\u003csub\u003eSharing AI and vision insights\u003c/sub\u003e |\n\n|   [\u003cimg src=\"https://github.com/user-attachments/assets/d7ade584-781f-4dac-95b8-1acc6db4a7c4\" width=\"80\"/\u003e](https://x.com/juberti)    |            [\u003cimg src=\"https://github.com/user-attachments/assets/00a1ed37-3620-426d-b47d-07dd59c19b28\" width=\"80\"/\u003e](https://x.com/romainhuet)            | [\u003cimg src=\"https://github.com/user-attachments/assets/eb5928c7-83b9-4aaa-854f-1d4f641426f2\" width=\"80\"/\u003e](https://x.com/thorwebdev) |\n| :-------------------------------------------------------------------------------------------------: | :---------------------------------------------------------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------: |\n| [@juberti](https://x.com/juberti)\u003cbr\u003eHead of Realtime AI @ OpenAI\u003cbr\u003e\u003csub\u003eRealtime AI systems\u003c/sub\u003e | [@romainhuet](https://x.com/romainhuet)\u003cbr\u003eHead of DX @ OpenAI\u003cbr\u003e\u003csub\u003eDeveloper tooling \u0026 APIs\u003c/sub\u003e |   [@thorwebdev](https://x.com/thorwebdev)\u003cbr\u003eEleven Labs\u003cbr\u003e\u003csub\u003eVoice and 
AI experiments\u003c/sub\u003e   |\n\n|    [\u003cimg src=\"https://github.com/user-attachments/assets/ab5ef918-7c97-4c6d-be10-2e2aeefec015\" width=\"80\"/\u003e](https://x.com/mervenoyann)    |        [\u003cimg src=\"https://github.com/user-attachments/assets/af936e13-22cf-4000-a35b-bfe30d44c320\" width=\"80\"/\u003e](https://x.com/stash_pomichter)         |            [\u003cimg src=\"https://pbs.twimg.com/profile_images/1893061651152121856/Op4W8mza_400x400.jpg\" width=\"80\"/\u003e](https://x.com/Mentraglass)            |\n| :------------------------------------------------------------------------------------------------------: | :-------------------------------------------------------------------------------------------------------------------: | :----------------------------------------------------------------------------------------------------------------------: |\n| [@mervenoyann](https://x.com/mervenoyann)\u003cbr\u003eHugging Face\u003cbr\u003e\u003csub\u003ePosts extensively about Video AI\u003c/sub\u003e | [@stash_pomichter](https://x.com/stash_pomichter)\u003cbr\u003eSpatial memory for robots\u003cbr\u003e\u003csub\u003eRobotics \u0026 AI navigation\u003c/sub\u003e | [@Mentraglass](https://x.com/Mentraglass)\u003cbr\u003eOpen-source smart glasses\u003cbr\u003e\u003csub\u003eOpen-Source, hackable AR glasses with AI capabilities built in\u003c/sub\u003e |\n\n|            [\u003cimg src=\"https://pbs.twimg.com/profile_images/1664559115581145088/UMD1vtMw_400x400.jpg\" width=\"80\"/\u003e](https://x.com/vikhyatk)            |\n| :----------------------------------------------------------------------------------------------------------------------: |\n| [@vikhyatk](https://x.com/vikhyatk)\u003cbr\u003eAI Engineer\u003cbr\u003e\u003csub\u003eOpen-source AI projects, Creator of Moondream AI\u003c/sub\u003e | \n\n## Inspiration\n\n- Livekit Agents: Great syntax, Livekit only\n- Pipecat: Flexible, but more verbose.\n- OpenAI Agents: Focused on openAI only\n\n## 
Roadmap\n\n### 0.1 - First Release - Oct\n\n- Working TTS, Gemini \u0026 OpenAI\n\n### 0.2 - Simplification - Nov\n\n- Simplified the library \u0026 improved code quality\n- Deepgram Nova 3, ElevenLabs Scribe 2, Fish, Moondream, Qwen3, Smart Turn, Vogent, Inworld, HeyGen, AWS, and more\n- Improved OpenAI \u0026 Gemini realtime performance\n- Audio \u0026 Video utilities\n\n### 0.3 - Demos - Nov/Dec\n\n### 0.4 - Deploys\n\n- Tips on deploying agents at scale, monitoring them, etc.\n\n### Later\n\n- [ ] Buffered video capture (for \"catch the moment\" scenarios)\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=GetStream/vision-agents\u0026type=timeline\u0026legend=top-left)](https://www.star-history.com/#GetStream/vision-agents\u0026type=timeline\u0026legend=top-left)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgetstream%2Fvision-agents","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgetstream%2Fvision-agents","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgetstream%2Fvision-agents/lists"}