# VideoSDK AI Agents
Open-source framework for building real-time multimodal conversational AI agents.

![PyPI - Version](https://img.shields.io/pypi/v/videosdk-agents)
[![PyPI Downloads](https://static.pepy.tech/badge/videosdk-agents/month)](https://pepy.tech/projects/videosdk-agents)
[![Twitter Follow](https://img.shields.io/twitter/follow/video_sdk)](https://x.com/video_sdk)
[![YouTube](https://img.shields.io/badge/YouTube-VideoSDK-red)](https://www.youtube.com/c/VideoSDK)
[![LinkedIn](https://img.shields.io/badge/LinkedIn-VideoSDK-blue)](https://www.linkedin.com/company/video-sdk/)
[![Discord](https://img.shields.io/badge/Discord-Join%20Us-7289DA)](https://discord.com/invite/f2WsNDN9S5)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/videosdk-live/agents)

The **VideoSDK AI Agents framework** connects your infrastructure, agent worker, VideoSDK room, and user devices, enabling **real-time, natural voice and multimodal interactions** between users and intelligent agents.

![VideoSDK AI Agents High Level Architecture](https://assets.videosdk.live/images/agent-architecture.png)

## Overview

The AI Agent SDK is a Python framework built on top of the VideoSDK Python SDK that enables AI-powered agents to join VideoSDK rooms as participants. This SDK serves as a real-time bridge between AI models (like OpenAI or Gemini) and your users, facilitating seamless voice and media interactions.



**Try it out & explore:**

- 🎙️ **Agent with Cascading Pipeline**: test an AI voice agent that uses a Cascading Pipeline for STT → LLM → TTS.
- 📞 **AI Telephony Agent**: test an AI agent that answers and interacts over phone calls using SIP.
- 💻 **Agent Documentation**: the official VideoSDK Agent documentation.
- 📚 **SDK Reference**: reference docs for the Agents framework.

| # | Feature | Description |
|----|----------------------------------|-----------------------------------------------------------------------------|
| 1 | **🎤 Real-time Communication (Audio/Video)** | Agents can listen, speak, and interact live in meetings. |
| 2 | **📞 SIP & Telephony Integration** | Seamlessly connect agents to phone systems via SIP for call handling, routing, and PSTN access. |
| 3 | **🧍 Virtual Avatars** | Add lifelike avatars to enhance interaction and presence using Simli. |
| 4 | **🤖 Multi-Model Support** | Integrate with OpenAI, Gemini, AWS NovaSonic, and more. |
| 5 | **🧩 Cascading Pipeline** | Integrates with different providers of STT, LLM, and TTS seamlessly. |
| 6 | **⚡ Realtime Pipeline** | Use unified realtime models (OpenAI Realtime, AWS Nova, Gemini Live) for the lowest latency. |
| 7 | **🧠 Conversational Flow** | Manages turn detection and VAD for smooth interactions. |
| 8 | **🛠️ Function Tools** | Extend agent capabilities with event scheduling, expense tracking, and more. |
| 9 | **🌐 MCP Integration** | Connect agents to external data sources and tools using Model Context Protocol. |
| 10 | **🔗 A2A Protocol** | Enable agent-to-agent interactions for complex workflows. |
| 11 | **📊 Observability** | Built-in OpenTelemetry tracing and metrics collection. |
| 12 | **🚀 CLI Tool** | Run agents locally and test with the `videosdk` CLI. |

> [!IMPORTANT]
>
> **Star VideoSDK Repositories** ⭐️
>
> Get instant notifications for new releases and updates. Your support helps us grow and improve VideoSDK!

## Pre-requisites

Before you begin, ensure you have:

- A VideoSDK authentication token (generate from [app.videosdk.live](https://app.videosdk.live))
- A VideoSDK meeting ID (you can generate one using the [Create Room API](https://docs.videosdk.live/api-reference/realtime-communication/create-room) or through the VideoSDK dashboard)
- Python 3.12 or higher
- Third-party API keys for the services you intend to use (e.g., OpenAI for LLM/STT/TTS, ElevenLabs for TTS, Google for Gemini, etc.)
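The quickstart code later in this README reads credentials such as `VIDEOSDK_AUTH_TOKEN` and `GOOGLE_API_KEY` from the environment (typically via a `.env` file). As a minimal sketch for catching a missing key before the agent starts (the `require_env` helper is illustrative, not part of the SDK):

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, failing fast if unset.

    Illustrative helper, not part of videosdk-agents.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Missing required environment variable: {name}. "
            "Set it in your shell or a .env file before starting the agent."
        )
    return value

if __name__ == "__main__":
    # Simulate a configured key for demonstration purposes
    os.environ.setdefault("VIDEOSDK_AUTH_TOKEN", "demo-token")
    print(require_env("VIDEOSDK_AUTH_TOKEN"))
```

Failing at startup with a named variable is easier to debug than an authentication error deep inside a session.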

## Installation

- Create and activate a virtual environment with Python 3.12 or higher.

macOS / Linux

```bash
python3 -m venv venv
source venv/bin/activate
```


Windows

```bash
python -m venv venv
venv\Scripts\activate
```


- Install the core VideoSDK AI Agent package
```bash
pip install videosdk-agents
```
- Install Optional Plugins. Plugins help integrate different providers for Realtime, STT, LLM, TTS, and more. Install what your use case needs:
```bash
# Example: Install the Turn Detector plugin
pip install videosdk-plugins-turn-detector
```
👉 Supported plugins (Realtime, LLM, STT, TTS, VAD, Avatar, SIP) are listed in the [Supported Libraries](#supported-libraries-and-plugins) section below.

## Generating a VideoSDK Meeting ID

Before your AI agent can join a meeting, you'll need to create a meeting ID. You can generate one using the VideoSDK Create Room API:

### Using cURL

```bash
curl -X POST https://api.videosdk.live/v2/rooms \
-H "Authorization: YOUR_JWT_TOKEN_HERE" \
-H "Content-Type: application/json"
```

For more details on the Create Room API, refer to the [VideoSDK documentation](https://docs.videosdk.live/api-reference/realtime-communication/create-room).
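The same call can be scripted with the standard library. In the sketch below, `build_create_room_request` and `extract_room_id` are illustrative helpers (not part of the SDK), and the `roomId` response field is assumed per the Create Room API documentation linked above:

```python
import json
import urllib.request

API_URL = "https://api.videosdk.live/v2/rooms"

def build_create_room_request(token: str) -> urllib.request.Request:
    # Same endpoint and headers as the cURL example above
    return urllib.request.Request(
        API_URL,
        method="POST",
        headers={"Authorization": token, "Content-Type": "application/json"},
    )

def extract_room_id(response_body: str) -> str:
    # The Create Room API returns the new meeting ID in the "roomId" field
    return json.loads(response_body)["roomId"]

# To actually create a room (requires a valid token and network access):
# with urllib.request.urlopen(build_create_room_request("YOUR_JWT_TOKEN_HERE")) as resp:
#     print(extract_room_id(resp.read().decode()))
```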

## Getting Started: Your First Agent

### Quick Start

Now that you've installed the necessary packages, you're ready to build!

### Step 1: Creating a Custom Agent

First, let's create a custom voice agent by inheriting from the base `Agent` class:

```python title="main.py"
from videosdk.agents import Agent, function_tool

# External tool (defined in Step 2):
# async def get_weather(latitude: str, longitude: str): ...

class VoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant that can answer questions and help with tasks.",
            tools=[get_weather]  # Register any external tool defined outside this class
        )

    async def on_enter(self) -> None:
        """Called when the agent first joins the meeting"""
        await self.session.say("Hi there! How can I help you today?")

    async def on_exit(self) -> None:
        """Called when the agent exits the meeting"""
        await self.session.say("Goodbye!")
```

This code defines a basic voice agent with:

- Custom instructions that define the agent's personality and capabilities
- An entry message spoken when joining a meeting
- A farewell message spoken when exiting the meeting

### Step 2: Implementing Function Tools

Function tools allow your agent to perform actions beyond conversation. There are two ways to define tools:

- **External Tools:** Defined as standalone functions outside the agent class and registered via the `tools` argument in the agent's constructor.
- **Internal Tools:** Defined as methods inside the agent class and decorated with `@function_tool`.

Below is an example of both:

```python
import aiohttp
from videosdk.agents import Agent, function_tool

# External function tool
@function_tool
async def get_weather(latitude: str, longitude: str):
    print(f"Getting weather for {latitude}, {longitude}")
    url = f"https://api.open-meteo.com/v1/forecast?latitude={latitude}&longitude={longitude}&current=temperature_2m"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            if response.status == 200:
                data = await response.json()
                return {
                    "temperature": data["current"]["temperature_2m"],
                    "temperature_unit": "Celsius",
                }
            else:
                raise Exception(
                    f"Failed to get weather data, status code: {response.status}"
                )

class VoiceAgent(Agent):
    # ... previous code ...

    # Internal function tool
    @function_tool
    async def get_horoscope(self, sign: str) -> dict:
        horoscopes = {
            "Aries": "Today is your lucky day!",
            "Taurus": "Focus on your goals today.",
            "Gemini": "Communication will be important today.",
        }
        return {
            "sign": sign,
            "horoscope": horoscopes.get(sign, "The stars are aligned for you today!"),
        }
```

- Use external tools for reusable, standalone functions (registered via `tools=[...]`).
- Use internal tools for agent-specific logic as class methods.
- Both must be decorated with `@function_tool` for the agent to recognize and use them.

### Step 3: Setting Up the Pipeline

The pipeline connects your agent to an AI model. Here, we are using Google's Gemini for a [Real-time Pipeline](https://docs.videosdk.live/ai_agents/core-components/realtime-pipeline). You could also use a [Cascading Pipeline](https://docs.videosdk.live/ai_agents/core-components/cascading-pipeline).

```python
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
from videosdk.agents import RealTimePipeline, JobContext

async def start_session(context: JobContext):
    # Initialize the AI model
    model = GeminiRealtime(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        # When GOOGLE_API_KEY is set in .env, DON'T pass the api_key parameter
        api_key="AKZSXXXXXXXXXXXXXXXXXXXX",
        config=GeminiLiveConfig(
            voice="Leda",  # Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr
            response_modalities=["AUDIO"]
        )
    )

    pipeline = RealTimePipeline(model=model)

    # Continue to the next steps...
```
### Step 4: Assembling and Starting the Agent Session

Now, let's put everything together and start the agent session:

```python
import asyncio
from videosdk.agents import AgentSession, WorkerJob, RoomOptions, JobContext

async def start_session(context: JobContext):
    # ... previous setup code ...

    # Create the agent session
    session = AgentSession(
        agent=VoiceAgent(),
        pipeline=pipeline
    )

    try:
        await context.connect()
        # Start the session
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="",  # Replace with your actual meeting ID
        auth_token="",  # When VIDEOSDK_AUTH_TOKEN is set in .env, DON'T pass auth_token
        name="Test Agent",
        playground=True,
        # vision=True  # Only available when using the Google Gemini Live API
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
### Step 5: Connecting with VideoSDK Client Applications

After setting up your AI Agent, you'll need a client application to connect with it. You can use any of the VideoSDK quickstart examples to create a client that joins the same meeting:

- [JavaScript](https://github.com/videosdk-live/quickstart/tree/main/js-rtc)
- [React](https://github.com/videosdk-live/quickstart/tree/main/react-rtc)
- [React Native](https://github.com/videosdk-live/quickstart/tree/main/react-native)
- [Android](https://github.com/videosdk-live/quickstart/tree/main/android-rtc)
- [Flutter](https://github.com/videosdk-live/quickstart/tree/main/flutter-rtc)
- [iOS](https://github.com/videosdk-live/quickstart/tree/main/ios-rtc)
- [Unity](http://github.com/videosdk-live/videosdk-rtc-unity-sdk-example)
- [IoT](https://github.com/videosdk-live/videosdk-rtc-iot-sdk-example)

When setting up your client application, make sure to use the same meeting ID that your AI Agent is using.

### Step 6: Running the Project
Once you have completed the setup, you can run your AI Voice Agent project using Python. Make sure your `.env` file is properly configured and all dependencies are installed.

```bash
python main.py
```
> [!TIP]
>
> **Test Your Agent Instantly with the CLI Tool**
>
> Run your agent locally using:
>
> ```bash
> python main.py console
> ```
>
> Experience real-time interactions right from your terminal - no meeting room required!
> Speak and listen through your system’s mic and speakers for quick testing and rapid development.

### Step 7: Deployment

For deployment options and guides, check out the official documentation: [Deployment](https://docs.videosdk.live/ai_agents/deployments/introduction)

---

## Supported Libraries and Plugins

The framework supports integration with various AI models and tools, across multiple categories:

| Category | Services |
|--------------------------|----------|
| **Real-time Models** | [OpenAI](https://docs.videosdk.live/ai_agents/plugins/realtime/openai) | [Gemini](https://docs.videosdk.live/ai_agents/plugins/realtime/google-live-api) | [AWS Nova Sonic](https://docs.videosdk.live/ai_agents/plugins/realtime/aws-nova-sonic) | [Azure Voice Live](https://docs.videosdk.live/ai_agents/plugins/realtime/azure-voice-live)|
| **Speech-to-Text (STT)** | [OpenAI](https://docs.videosdk.live/ai_agents/plugins/stt/openai) | [Google](https://docs.videosdk.live/ai_agents/plugins/stt/google) | [Azure AI Speech](https://docs.videosdk.live/ai_agents/plugins/stt/azure-ai-stt) | [Azure OpenAI](https://docs.videosdk.live/ai_agents/plugins/stt/azureopenai) | [Sarvam AI](https://docs.videosdk.live/ai_agents/plugins/stt/sarvam-ai) | [Deepgram](https://docs.videosdk.live/ai_agents/plugins/stt/deepgram) | [Cartesia](https://docs.videosdk.live/ai_agents/plugins/stt/cartesia-stt) | [AssemblyAI](https://docs.videosdk.live/ai_agents/plugins/stt/assemblyai) | [Navana](https://docs.videosdk.live/ai_agents/plugins/stt/navana) |
| **Language Models (LLM)**| [OpenAI](https://docs.videosdk.live/ai_agents/plugins/llm/openai) | [Azure OpenAI](https://docs.videosdk.live/ai_agents/plugins/llm/azureopenai) | [Google](https://docs.videosdk.live/ai_agents/plugins/llm/google-llm) | [Sarvam AI](https://docs.videosdk.live/ai_agents/plugins/llm/sarvam-ai-llm) | [Anthropic](https://docs.videosdk.live/ai_agents/plugins/llm/anthropic-llm) | [Cerebras](https://docs.videosdk.live/ai_agents/plugins/llm/Cerebras-llm) |
| **Text-to-Speech (TTS)** | [OpenAI](https://docs.videosdk.live/ai_agents/plugins/tts/openai) | [Google](https://docs.videosdk.live/ai_agents/plugins/tts/google-tts) | [AWS Polly](https://docs.videosdk.live/ai_agents/plugins/tts/aws-polly-tts) | [Azure AI Speech](https://docs.videosdk.live/ai_agents/plugins/tts/azure-ai-tts) | [Azure OpenAI](https://docs.videosdk.live/ai_agents/plugins/tts/azureopenai) | [Deepgram](https://docs.videosdk.live/ai_agents/plugins/tts/deepgram) | [Sarvam AI](https://docs.videosdk.live/ai_agents/plugins/tts/sarvam-ai-tts) | [ElevenLabs](https://docs.videosdk.live/ai_agents/plugins/tts/eleven-labs) | [Cartesia](https://docs.videosdk.live/ai_agents/plugins/tts/cartesia-tts) | [Resemble AI](https://docs.videosdk.live/ai_agents/plugins/tts/resemble-ai-tts) | [Smallest AI](https://docs.videosdk.live/ai_agents/plugins/tts/smallestai-tts) | [Speechify](https://docs.videosdk.live/ai_agents/plugins/tts/speechify-tts) | [InWorld](https://docs.videosdk.live/ai_agents/plugins/tts/inworld-ai-tts) | [Neuphonic](https://docs.videosdk.live/ai_agents/plugins/tts/neuphonic-tts) | [Rime AI](https://docs.videosdk.live/ai_agents/plugins/tts/rime-ai-tts) | [Hume AI](https://docs.videosdk.live/ai_agents/plugins/tts/hume-ai-tts) | [Groq](https://docs.videosdk.live/ai_agents/plugins/tts/groq-ai-tts) | [LMNT AI](https://docs.videosdk.live/ai_agents/plugins/tts/lmnt-ai-tts) | [Papla Media](https://docs.videosdk.live/ai_agents/plugins/tts/papla-media) |
| **Voice Activity Detection (VAD)** | [SileroVAD](https://docs.videosdk.live/ai_agents/plugins/silero-vad) |
| **Turn Detection Model** | [Namo Turn Detector](https://docs.videosdk.live/ai_agents/plugins/namo-turn-detector) |
| **Virtual Avatar** | [Simli](https://docs.videosdk.live/ai_agents/core-components/avatar) |
| **Denoise** | [RNNoise](https://docs.videosdk.live/ai_agents/core-components/de-noise) |

> [!TIP]
> **Installation Examples**
>
> ```bash
> # Install with specific plugins
> pip install videosdk-agents[openai,elevenlabs,silero]
>
> # Install individual plugins
> pip install videosdk-plugins-anthropic
> pip install videosdk-plugins-deepgram
> ```

## Examples

Explore the following examples to see the framework in action:

**🤖 AI Voice Agent Use Cases**

- 📞 **AI Telephony Agent Quickstart**: hospital appointment booking via a voice-enabled agent.
- ✈️ **AI WhatsApp Agent Quickstart**: ask about available hotel rooms and book on the go.
- 👨‍🏫 **Multi-Agent System**: a customer care agent that transfers loan-related queries to a Loan Specialist Agent.
- 🛒 **Agent with Knowledge (RAG)**: an agent that answers questions based on documentation knowledge.
- 👨‍🏫 **Agent with MCP Server**: a Stock Market Analyst agent with real-time market data access.
- 🛒 **Virtual Avatar Agent**: a virtual avatar agent that presents the weather forecast.

## Documentation

For comprehensive guides and API references:



- 📄 **Official Documentation**: complete framework documentation
- 📝 **API Reference**: detailed API documentation
- 📂 **Examples Directory**: additional code examples



## Contributing

We welcome contributions! Here's how you can help:



- 🐞 **Report Issues**: open an issue for bugs or feature requests
- 🔀 **Submit PRs**: create a pull request with improvements
- 🛠️ **Build Plugins**: follow our plugin development guide
- 💬 **Join Community**: connect with us on Discord



The framework is under active development, so contributions in the form of new plugins, features, bug fixes, or documentation improvements are highly appreciated.

### 🛠️ Building Custom Plugins

Want to integrate a new AI provider? Check out **[BUILD YOUR OWN PLUGIN](BUILD_YOUR_OWN_PLUGIN.md)** for:

- Step-by-step plugin creation guide
- Directory structure and file requirements
- Implementation examples for STT, LLM, and TTS
- Testing and submission guidelines

## Community & Support

Stay connected with VideoSDK:



- 💬 **Discord**: [Join our community](https://discord.com/invite/f2WsNDN9S5)
- 🐦 **Twitter/X**: [@video_sdk](https://x.com/video_sdk)
- ▶️ **YouTube**: [VideoSDK Channel](https://www.youtube.com/c/VideoSDK)
- 🔗 **LinkedIn**: [VideoSDK Company](https://www.linkedin.com/company/video-sdk/)



> [!TIP]
>
> **Support the Project!** ⭐️
> Star the repository, join the community, and help us improve VideoSDK by providing feedback, reporting bugs, or contributing plugins.

---



**Made with ❤️ by The VideoSDK Team**