https://github.com/videosdk-live/agents
Open-source framework for developing real-time multimodal conversational AI agents.
- Host: GitHub
- URL: https://github.com/videosdk-live/agents
- Owner: videosdk-live
- License: apache-2.0
- Created: 2025-05-02T06:49:27.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2026-02-17T10:23:31.000Z (2 months ago)
- Last Synced: 2026-02-17T11:35:47.298Z (2 months ago)
- Language: Python
- Homepage: https://docs.videosdk.live/ai_agents/introduction
- Size: 8.11 MB
- Stars: 592
- Watchers: 9
- Forks: 82
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.txt
- Code of conduct: CODE_OF_CONDUCT.md
- Notice: NOTICE.txt
Awesome Lists containing this project
- StarryDivineSky - videosdk-live/agents - videosdk-live/agents is an open-source framework for building real-time multimodal conversational AI agent systems. Its core goal is to enable more natural real-time human-machine interaction by combining voice, video, text, and other input modalities. The framework uses a modular design, letting developers flexibly compose modules such as speech recognition, facial expression analysis, and natural language processing to fit their needs. It is built on a real-time streaming architecture that synchronizes multimodal data through distributed processing, keeping inputs from different sensors timely and consistent. The project places particular emphasis on real-time, low-latency operation, optimizing transport protocols and parallel processing to keep interactions smooth in scenarios such as video conferencing and remote collaboration. Technically, the framework is compatible with mainstream AI models, supports quickly assembling agent systems from pre-trained models, and exposes extensible APIs for custom functionality. It currently ships basic voice-interaction and video-stream processing capabilities, captures multimodal data from cameras and microphones, and provides visual debugging tools to aid development. It suits scenarios that need real-time multimodal interaction, such as intelligent customer service, remote education, and virtual assistants, and developers can get started quickly from the sample code in the documentation. Because it is open source, community developers can extend the framework or build on top of it; the project is actively maintained and well suited to teams that need real-time AI interaction. (Speech Recognition & Synthesis - Other / Resource Transfer & Download)
README
# VideoSDK AI Agents
Open-source framework for building real-time multimodal conversational AI agents.

[PyPI Downloads](https://pepy.tech/projects/videosdk-agents) · [X (Twitter)](https://x.com/video_sdk) · [YouTube](https://www.youtube.com/c/VideoSDK) · [LinkedIn](https://www.linkedin.com/company/video-sdk/) · [Discord](https://discord.com/invite/f2WsNDN9S5) · [DeepWiki](https://deepwiki.com/videosdk-live/agents)
The **VideoSDK AI Agents framework** connects your infrastructure, agent worker, VideoSDK room, and user devices, enabling **real-time, natural voice and multimodal interactions** between users and intelligent agents.

## Overview
The AI Agent SDK is a Python framework built on top of the VideoSDK Python SDK that enables AI-powered agents to join VideoSDK rooms as participants. This SDK serves as a real-time bridge between AI models (like OpenAI or Gemini) and your users, facilitating seamless voice and media interactions.
- 🎙️ **Agent with Cascading Pipeline**: test an AI Voice Agent that uses a Cascading Pipeline for STT → LLM → TTS.
- 📞 **AI Telephony Agent**: test an AI Agent that answers and interacts over phone calls using SIP.
- 💻 **Agent Documentation**: the official VideoSDK Agent documentation.
- 📚 **SDK Reference**: reference docs for the Agents framework.
| # | Feature | Description |
|----|----------------------------------|-----------------------------------------------------------------------------|
| 1 | **🎤 Real-time Communication (Audio/Video)** | Agents can listen, speak, and interact live in meetings. |
| 2 | **📞 SIP & Telephony Integration** | Seamlessly connect agents to phone systems via SIP for call handling, routing, and PSTN access. |
| 3 | **🧍 Virtual Avatars** | Add lifelike avatars to enhance interaction and presence using Simli. |
| 4 | **🤖 Multi-Model Support** | Integrate with OpenAI, Gemini, AWS NovaSonic, and more. |
| 5 | **🧩 Cascading Pipeline** | Integrates with different providers of STT, LLM, and TTS seamlessly. |
| 6 | **⚡ Realtime Pipeline** | Use unified realtime models (OpenAI Realtime, AWS Nova, Gemini Live) for the lowest latency. |
| 7 | **🧠 Conversational Flow** | Manages turn detection and VAD for smooth interactions. |
| 8 | **🛠️ Function Tools** | Extend agent capabilities with event scheduling, expense tracking, and more. |
| 9 | **🌐 MCP Integration** | Connect agents to external data sources and tools using Model Context Protocol. |
| 10 | **🔗 A2A Protocol** | Enable agent-to-agent interactions for complex workflows. |
| 11 | **📊 Observability** | Built-in OpenTelemetry tracing and metrics collection. |
| 12 | **🚀 CLI Tool** | Run agents locally and test with the `videosdk` CLI. |
> [!IMPORTANT]
>
> **Star VideoSDK Repositories** ⭐️
>
> Get instant notifications for new releases and updates. Your support helps us grow and improve VideoSDK!
## Prerequisites
Before you begin, ensure you have:
- A VideoSDK authentication token (generate from [app.videosdk.live](https://app.videosdk.live))
- A VideoSDK meeting ID (you can generate one using the [Create Room API](https://docs.videosdk.live/api-reference/realtime-communication/create-room) or through the VideoSDK dashboard)
- Python 3.12 or higher
- Third-Party API Keys:
- API keys for the services you intend to use (e.g., OpenAI for LLM/STT/TTS, ElevenLabs for TTS, Google for Gemini, etc.).
## Installation
- Create and activate a virtual environment with Python 3.12 or higher.
macOS / Linux
```bash
python3 -m venv venv
source venv/bin/activate
```
Windows
```bash
python -m venv venv
venv\Scripts\activate
```
- Install the core VideoSDK AI Agent package
```bash
pip install videosdk-agents
```
- Install Optional Plugins. Plugins help integrate different providers for Realtime, STT, LLM, TTS, and more. Install what your use case needs:
```bash
# Example: Install the Turn Detector plugin
pip install videosdk-plugins-turn-detector
```
👉 Supported plugins (Realtime, LLM, STT, TTS, VAD, Avatar, SIP) are listed in the [Supported Libraries](#supported-libraries-and-plugins) section below.
## Generating a VideoSDK Meeting ID
Before your AI agent can join a meeting, you'll need to create a meeting ID. You can generate one using the VideoSDK Create Room API:
### Using cURL
```bash
curl -X POST https://api.videosdk.live/v2/rooms \
-H "Authorization: YOUR_JWT_TOKEN_HERE" \
-H "Content-Type: application/json"
```
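### Using Python
The same request can be made from Python. Below is a minimal sketch of the cURL call above using `aiohttp` (the same HTTP client used later in this README); the endpoint and `Authorization` header come straight from the cURL example, while the `roomId` response field is an assumption you should verify against the Create Room API docs:
```python
import asyncio
import aiohttp

VIDEOSDK_AUTH_TOKEN = "YOUR_JWT_TOKEN_HERE"  # same token as in the cURL example

async def create_room() -> str:
    """POST to the Create Room endpoint and return the new meeting ID."""
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "https://api.videosdk.live/v2/rooms",
            headers={
                "Authorization": VIDEOSDK_AUTH_TOKEN,
                "Content-Type": "application/json",
            },
        ) as response:
            response.raise_for_status()
            data = await response.json()
            # Assumption: the response body exposes the new meeting ID as "roomId"
            return data["roomId"]

if __name__ == "__main__":
    print(asyncio.run(create_room()))
```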
For more details on the Create Room API, refer to the [VideoSDK documentation](https://docs.videosdk.live/api-reference/realtime-communication/create-room).
## Getting Started: Your First Agent
### Quick Start
Now that you've installed the necessary packages, you're ready to build!
### Step 1: Creating a Custom Agent
First, let's create a custom voice agent by inheriting from the base `Agent` class:
```python title="main.py"
from videosdk.agents import Agent, function_tool

# External tool (defined in Step 2):
# async def get_weather(latitude: str, longitude: str): ...

class VoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant that can answer questions and help with tasks.",
            tools=[get_weather]  # You can register any external tool defined outside of this class
        )

    async def on_enter(self) -> None:
        """Called when the agent first joins the meeting"""
        await self.session.say("Hi there! How can I help you today?")

    async def on_exit(self) -> None:
        """Called when the agent exits the meeting"""
        await self.session.say("Goodbye!")
```
This code defines a basic voice agent with:
- Custom instructions that define the agent's personality and capabilities
- An entry greeting spoken when the agent joins the meeting
- A farewell message spoken when the agent exits the meeting
### Step 2: Implementing Function Tools
Function tools allow your agent to perform actions beyond conversation. There are two ways to define tools:
- **External Tools:** Defined as standalone functions outside the agent class and registered via the `tools` argument in the agent's constructor.
- **Internal Tools:** Defined as methods inside the agent class and decorated with `@function_tool`.
Below is an example of both:
```python
import aiohttp
from videosdk.agents import Agent, function_tool

# External function tool: a standalone async function decorated with @function_tool
@function_tool
async def get_weather(latitude: str, longitude: str):
    print(f"Getting weather for {latitude}, {longitude}")
    url = f"https://api.open-meteo.com/v1/forecast?latitude={latitude}&longitude={longitude}&current=temperature_2m"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            if response.status == 200:
                data = await response.json()
                return {
                    "temperature": data["current"]["temperature_2m"],
                    "temperature_unit": "Celsius",
                }
            else:
                raise Exception(
                    f"Failed to get weather data, status code: {response.status}"
                )

class VoiceAgent(Agent):
    # ... previous code ...

    # Internal function tool: a method decorated with @function_tool
    @function_tool
    async def get_horoscope(self, sign: str) -> dict:
        horoscopes = {
            "Aries": "Today is your lucky day!",
            "Taurus": "Focus on your goals today.",
            "Gemini": "Communication will be important today.",
        }
        return {
            "sign": sign,
            "horoscope": horoscopes.get(sign, "The stars are aligned for you today!"),
        }
```
- Use external tools for reusable, standalone functions (registered via `tools=[...]`).
- Use internal tools for agent-specific logic as class methods.
- Both must be decorated with `@function_tool` for the agent to recognize and use them.
### Step 3: Setting Up the Pipeline
The pipeline connects your agent to an AI model. Here, we use Google's Gemini for a [Real-time Pipeline](https://docs.videosdk.live/ai_agents/core-components/realtime-pipeline). You could also use a [Cascading Pipeline](https://docs.videosdk.live/ai_agents/core-components/cascading-pipeline); a hedged sketch of that variant follows the realtime example below.
```python
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
from videosdk.agents import RealTimePipeline, JobContext

async def start_session(context: JobContext):
    # Initialize the AI model
    model = GeminiRealtime(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        # When GOOGLE_API_KEY is set in .env, DON'T pass the api_key parameter
        api_key="AKZSXXXXXXXXXXXXXXXXXXXX",
        config=GeminiLiveConfig(
            voice="Leda",  # Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr
            response_modalities=["AUDIO"]
        )
    )

    pipeline = RealTimePipeline(model=model)

    # Continue to the next steps...
```
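For comparison, here is what the cascading variant might look like. This is a hedged sketch, not a verified implementation: the `CascadingPipeline` name follows the Cascading Pipeline docs linked above, but the plugin class names and constructor arguments (`DeepgramSTT`, `OpenAILLM`, `ElevenLabsTTS`) are assumptions; check each plugin's page in the [Supported Libraries](#supported-libraries-and-plugins) table for the actual signatures.
```python
# Hedged sketch of a Cascading Pipeline (STT -> LLM -> TTS).
# Class names below are assumptions; verify against each plugin's docs.
from videosdk.agents import CascadingPipeline, JobContext  # CascadingPipeline export assumed
from videosdk.plugins.deepgram import DeepgramSTT          # assumed plugin class
from videosdk.plugins.openai import OpenAILLM              # assumed plugin class
from videosdk.plugins.elevenlabs import ElevenLabsTTS      # assumed plugin class

async def start_cascading_session(context: JobContext):
    # Each stage can read its API key from the environment instead of taking it here.
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(),
    )
    # The rest of the session setup is identical to the realtime example above.
```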
### Step 4: Assembling and Starting the Agent Session
Now, let's put everything together and start the agent session:
```python
import asyncio
from videosdk.agents import AgentSession, WorkerJob, RoomOptions, JobContext

async def start_session(context: JobContext):
    # ... previous setup code ...

    # Create the agent session
    session = AgentSession(
        agent=VoiceAgent(),
        pipeline=pipeline
    )

    try:
        await context.connect()
        # Start the session
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="",     # Replace with your actual meeting ID
        auth_token="",  # When VIDEOSDK_AUTH_TOKEN is set in .env, DON'T pass auth_token
        name="Test Agent",
        playground=True,
        # vision=True   # Only available when using the Google Gemini Live API
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
### Step 5: Connecting with VideoSDK Client Applications
After setting up your AI Agent, you'll need a client application to connect with it. You can use any of the VideoSDK quickstart examples to create a client that joins the same meeting:
- [JavaScript](https://github.com/videosdk-live/quickstart/tree/main/js-rtc)
- [React](https://github.com/videosdk-live/quickstart/tree/main/react-rtc)
- [React Native](https://github.com/videosdk-live/quickstart/tree/main/react-native)
- [Android](https://github.com/videosdk-live/quickstart/tree/main/android-rtc)
- [Flutter](https://github.com/videosdk-live/quickstart/tree/main/flutter-rtc)
- [iOS](https://github.com/videosdk-live/quickstart/tree/main/ios-rtc)
- [Unity](http://github.com/videosdk-live/videosdk-rtc-unity-sdk-example)
- [IoT](https://github.com/videosdk-live/videosdk-rtc-iot-sdk-example)
When setting up your client application, make sure to use the same meeting ID that your AI Agent is using.
### Step 6: Running the Project
Once you have completed the setup, you can run your AI Voice Agent project using Python. Make sure your `.env` file is properly configured and all dependencies are installed.
```bash
python main.py
```
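For reference, a minimal `.env` for this walkthrough could look like the sketch below. The variable names `VIDEOSDK_AUTH_TOKEN` and `GOOGLE_API_KEY` come from the code comments in Steps 3 and 4; add keys only for the providers you actually use:
```bash
# .env - when these are set, omit the matching constructor arguments
VIDEOSDK_AUTH_TOKEN=your_videosdk_token
GOOGLE_API_KEY=your_gemini_api_key
```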
> [!TIP]
>
> **Test Your Agent Instantly with the CLI Tool**
>
> Run your agent locally using:
>
> ```bash
> python main.py console
> ```
>
> Experience real-time interactions right from your terminal - no meeting room required!
> Speak and listen through your system’s mic and speakers for quick testing and rapid development.
### Step 7: Deployment
For deployment options and guidance, check out the official documentation: [Deployment](https://docs.videosdk.live/ai_agents/deployments/introduction)
---
## Supported Libraries and Plugins
The framework supports integration with various AI models and tools, across multiple categories:
| Category | Services |
|--------------------------|----------|
| **Real-time Models** | [OpenAI](https://docs.videosdk.live/ai_agents/plugins/realtime/openai) \| [Gemini](https://docs.videosdk.live/ai_agents/plugins/realtime/google-live-api) \| [AWS Nova Sonic](https://docs.videosdk.live/ai_agents/plugins/realtime/aws-nova-sonic) \| [Azure Voice Live](https://docs.videosdk.live/ai_agents/plugins/realtime/azure-voice-live) |
| **Speech-to-Text (STT)** | [OpenAI](https://docs.videosdk.live/ai_agents/plugins/stt/openai) \| [Google](https://docs.videosdk.live/ai_agents/plugins/stt/google) \| [Azure AI Speech](https://docs.videosdk.live/ai_agents/plugins/stt/azure-ai-stt) \| [Azure OpenAI](https://docs.videosdk.live/ai_agents/plugins/stt/azureopenai) \| [Sarvam AI](https://docs.videosdk.live/ai_agents/plugins/stt/sarvam-ai) \| [Deepgram](https://docs.videosdk.live/ai_agents/plugins/stt/deepgram) \| [Cartesia](https://docs.videosdk.live/ai_agents/plugins/stt/cartesia-stt) \| [AssemblyAI](https://docs.videosdk.live/ai_agents/plugins/stt/assemblyai) \| [Navana](https://docs.videosdk.live/ai_agents/plugins/stt/navana) |
| **Language Models (LLM)**| [OpenAI](https://docs.videosdk.live/ai_agents/plugins/llm/openai) \| [Azure OpenAI](https://docs.videosdk.live/ai_agents/plugins/llm/azureopenai) \| [Google](https://docs.videosdk.live/ai_agents/plugins/llm/google-llm) \| [Sarvam AI](https://docs.videosdk.live/ai_agents/plugins/llm/sarvam-ai-llm) \| [Anthropic](https://docs.videosdk.live/ai_agents/plugins/llm/anthropic-llm) \| [Cerebras](https://docs.videosdk.live/ai_agents/plugins/llm/Cerebras-llm) |
| **Text-to-Speech (TTS)** | [OpenAI](https://docs.videosdk.live/ai_agents/plugins/tts/openai) \| [Google](https://docs.videosdk.live/ai_agents/plugins/tts/google-tts) \| [AWS Polly](https://docs.videosdk.live/ai_agents/plugins/tts/aws-polly-tts) \| [Azure AI Speech](https://docs.videosdk.live/ai_agents/plugins/tts/azure-ai-tts) \| [Azure OpenAI](https://docs.videosdk.live/ai_agents/plugins/tts/azureopenai) \| [Deepgram](https://docs.videosdk.live/ai_agents/plugins/tts/deepgram) \| [Sarvam AI](https://docs.videosdk.live/ai_agents/plugins/tts/sarvam-ai-tts) \| [ElevenLabs](https://docs.videosdk.live/ai_agents/plugins/tts/eleven-labs) \| [Cartesia](https://docs.videosdk.live/ai_agents/plugins/tts/cartesia-tts) \| [Resemble AI](https://docs.videosdk.live/ai_agents/plugins/tts/resemble-ai-tts) \| [Smallest AI](https://docs.videosdk.live/ai_agents/plugins/tts/smallestai-tts) \| [Speechify](https://docs.videosdk.live/ai_agents/plugins/tts/speechify-tts) \| [InWorld](https://docs.videosdk.live/ai_agents/plugins/tts/inworld-ai-tts) \| [Neuphonic](https://docs.videosdk.live/ai_agents/plugins/tts/neuphonic-tts) \| [Rime AI](https://docs.videosdk.live/ai_agents/plugins/tts/rime-ai-tts) \| [Hume AI](https://docs.videosdk.live/ai_agents/plugins/tts/hume-ai-tts) \| [Groq](https://docs.videosdk.live/ai_agents/plugins/tts/groq-ai-tts) \| [LMNT AI](https://docs.videosdk.live/ai_agents/plugins/tts/lmnt-ai-tts) \| [Papla Media](https://docs.videosdk.live/ai_agents/plugins/tts/papla-media) |
| **Voice Activity Detection (VAD)** | [SileroVAD](https://docs.videosdk.live/ai_agents/plugins/silero-vad) |
| **Turn Detection Model** | [Namo Turn Detector](https://docs.videosdk.live/ai_agents/plugins/namo-turn-detector) |
| **Virtual Avatar** | [Simli](https://docs.videosdk.live/ai_agents/core-components/avatar) |
| **Denoise** | [RNNoise](https://docs.videosdk.live/ai_agents/core-components/de-noise) |
> [!TIP]
> **Installation Examples**
>
> ```bash
> # Install with specific plugins
> pip install videosdk-agents[openai,elevenlabs,silero]
>
> # Install individual plugins
> pip install videosdk-plugins-anthropic
> pip install videosdk-plugins-deepgram
> ```
## Examples
Explore the following examples to see the framework in action:
🤖 **AI Voice Agent Use Cases**
- 📞 **AI Telephony Agent Quickstart**. Use case: hospital appointment booking via a voice-enabled agent.
- ✈️ **AI WhatsApp Agent Quickstart**. Use case: ask about available hotel rooms and book on the go.
- 👨‍🏫 **Multi-Agent System**. Use case: customer care agent that transfers loan-related queries to a Loan Specialist Agent.
- 🛒 **Agent with Knowledge (RAG)**. Use case: agent that answers questions based on documentation knowledge.
- 👨‍🏫 **Agent with MCP Server**. Use case: Stock Market Analyst Agent with real-time market data access.
- 🛒 **Virtual Avatar Agent**. Use case: a Virtual Avatar Agent that presents the weather forecast.
## Documentation
For comprehensive guides and API references:
- 📄 **Official Documentation**: complete framework documentation
- 📝 **API Reference**: detailed API documentation
- 📂 **Examples Directory**: additional code examples
## Contributing
We welcome contributions! Here's how you can help:
- 🐞 **Report Issues**: open an issue for bugs or feature requests
- 🔀 **Submit PRs**: create a pull request with improvements
- 🛠️ **Build Plugins**: follow our plugin development guide
- 💬 **Join Community**: connect with us on Discord
The framework is under active development, so contributions in the form of new plugins, features, bug fixes, or documentation improvements are highly appreciated.
### 🛠️ Building Custom Plugins
Want to integrate a new AI provider? Check out **[BUILD YOUR OWN PLUGIN](BUILD_YOUR_OWN_PLUGIN.md)** for:
- Step-by-step plugin creation guide
- Directory structure and file requirements
- Implementation examples for STT, LLM, and TTS
- Testing and submission guidelines
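As a rough orientation before reading the guide, a provider plugin typically ships as its own installable package alongside the core SDK. The layout below is purely illustrative; the names are hypothetical, and the guide's actual directory and file requirements take precedence:
```
videosdk-plugins-myprovider/        # hypothetical package name
├── pyproject.toml                  # package metadata; depends on videosdk-agents
└── videosdk/
    └── plugins/
        └── myprovider/
            ├── __init__.py         # exports e.g. MyProviderSTT / MyProviderTTS
            ├── stt.py              # STT implementation
            └── tts.py              # TTS implementation
```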
## Community & Support
Stay connected with VideoSDK:
- 💬 **Discord**: join our community
- 🐦 **Twitter**: [@video_sdk](https://x.com/video_sdk)
- ▶️ **YouTube**: VideoSDK channel
- 🔗 **LinkedIn**: VideoSDK company page
> [!TIP]
>
> **Support the Project!** ⭐️
> Star the repository, join the community, and help us improve VideoSDK by providing feedback, reporting bugs, or contributing plugins.
---
**Made with ❤️ by The VideoSDK Team**