https://github.com/videosdk-live/agents
Open-source framework for developing real-time multimodal conversational AI agents.
- Host: GitHub
- URL: https://github.com/videosdk-live/agents
- Owner: videosdk-live
- License: apache-2.0
- Created: 2025-05-02T06:49:27.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2026-02-17T10:23:31.000Z (2 months ago)
- Last Synced: 2026-02-17T11:35:47.298Z (2 months ago)
- Language: Python
- Homepage: https://docs.videosdk.live/ai_agents/introduction
- Size: 8.11 MB
- Stars: 592
- Watchers: 9
- Forks: 82
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.txt
- Code of conduct: CODE_OF_CONDUCT.md
- Notice: NOTICE.txt
Awesome Lists containing this project
- StarryDivineSky - videosdk-live/agents - videosdk-live/agents is an open-source framework for building real-time multimodal conversational AI agent systems. Its core goal is to enable more natural real-time human-machine interaction by combining voice, video, text, and other input modalities. The framework uses a modular design, letting developers flexibly compose modules such as speech recognition, facial expression analysis, and natural language processing to fit their needs. It is built on a real-time streaming architecture that synchronizes multimodal data through distributed processing, keeping inputs from different sensors timely and consistent. The project places particular emphasis on real-time, low-latency operation, optimizing transport protocols and parallel processing to keep interactions smooth in scenarios such as video conferencing and remote collaboration. Technically, the framework is compatible with mainstream AI models, supports quickly assembling agent systems from pre-trained models, and exposes extensible APIs for custom functionality. It currently ships basic voice-interaction and video-stream processing capabilities, captures multimodal data from cameras and microphones, and provides visual debugging tools to aid development. It suits scenarios that need real-time multimodal interaction, such as intelligent customer service, remote education, and virtual assistants, and developers can get started quickly from the sample code in the documentation. Because it is open source, community developers can extend the framework or build on top of it; the project is actively maintained and well suited to teams that need real-time AI interaction. (Speech Recognition & Synthesis - Other / Resource Transfer & Download)
README
# VideoSDK AI Agents
Open-source framework for building real-time multimodal conversational AI agents.

[PyPI Downloads](https://pepy.tech/projects/videosdk-agents) · [X (Twitter)](https://x.com/video_sdk) · [YouTube](https://www.youtube.com/c/VideoSDK) · [LinkedIn](https://www.linkedin.com/company/video-sdk/) · [Discord](https://discord.com/invite/f2WsNDN9S5) · [DeepWiki](https://deepwiki.com/videosdk-live/agents)
The **VideoSDK AI Agents framework** connects your infrastructure, agent worker, VideoSDK room, and user devices, enabling **real-time, natural voice and multimodal interactions** between users and intelligent agents.

## Overview
The AI Agent SDK is a Python framework built on top of the VideoSDK Python SDK that enables AI-powered agents to join VideoSDK rooms as participants. This SDK serves as a real-time bridge between AI models (like OpenAI or Gemini) and your users, facilitating seamless voice and media interactions.
- 🎙️ **Agent with Cascading Pipeline**: test an AI Voice Agent that uses a Cascading Pipeline for STT → LLM → TTS.
- 📞 **AI Telephony Agent**: test an AI Agent that answers and interacts over phone calls using SIP.
- 💻 **Agent Documentation**: the official VideoSDK Agent documentation.
- 📚 **SDK Reference**: reference docs for the Agents framework.
| # | Feature | Description |
|----|----------------------------------|-----------------------------------------------------------------------------|
| 1 | **🎤 Real-time Communication (Audio/Video)** | Agents can listen, speak, and interact live in meetings. |
| 2 | **📞 SIP & Telephony Integration** | Seamlessly connect agents to phone systems via SIP for call handling, routing, and PSTN access. |
| 3 | **🧍 Virtual Avatars** | Add lifelike avatars to enhance interaction and presence using Simli. |
| 4 | **🤖 Multi-Model Support** | Integrate with OpenAI, Gemini, AWS NovaSonic, and more. |
| 5 | **🧩 Cascading Pipeline** | Integrates with different providers of STT, LLM, and TTS seamlessly. |
| 6 | **⚡ Realtime Pipeline** | Use unified realtime models (OpenAI Realtime, AWS Nova, Gemini Live) for the lowest latency. |
| 7 | **🧠 Conversational Flow** | Manages turn detection and VAD for smooth interactions. |
| 8 | **🛠️ Function Tools** | Extend agent capabilities with event scheduling, expense tracking, and more. |
| 9 | **🌐 MCP Integration** | Connect agents to external data sources and tools using Model Context Protocol. |
| 10 | **🔗 A2A Protocol** | Enable agent-to-agent interactions for complex workflows. |
| 11 | **📊 Observability** | Built-in OpenTelemetry tracing and metrics collection. |
| 12 | **🚀 CLI Tool** | Run agents locally and test with the `videosdk` CLI. |
> [!IMPORTANT]
>
> **Star VideoSDK Repositories** ⭐️
>
> Get instant notifications for new releases and updates. Your support helps us grow and improve VideoSDK!
## Prerequisites
Before you begin, ensure you have:
- A VideoSDK authentication token (generate from [app.videosdk.live](https://app.videosdk.live))
- A VideoSDK meeting ID (you can generate one using the [Create Room API](https://docs.videosdk.live/api-reference/realtime-communication/create-room) or through the VideoSDK dashboard)
- Python 3.12 or higher
- Third-Party API Keys:
- API keys for the services you intend to use (e.g., OpenAI for LLM/STT/TTS, ElevenLabs for TTS, Google for Gemini, etc.).
## Installation
- Create and activate a virtual environment with Python 3.12 or higher.
macOS / Linux
```bash
python3 -m venv venv
source venv/bin/activate
```
Windows
```bash
python -m venv venv
venv\Scripts\activate
```
- Install the core VideoSDK AI Agent package
```bash
pip install videosdk-agents
```
- Install Optional Plugins. Plugins help integrate different providers for Realtime, STT, LLM, TTS, and more. Install what your use case needs:
```bash
# Example: Install the Turn Detector plugin
pip install videosdk-plugins-turn-detector
```
👉 Supported plugins (Realtime, LLM, STT, TTS, VAD, Avatar, SIP) are listed in the [Supported Libraries](#supported-libraries-and-plugins) section below.
## Generating a VideoSDK Meeting ID
Before your AI agent can join a meeting, you'll need to create a meeting ID. You can generate one using the VideoSDK Create Room API:
### Using cURL
```bash
curl -X POST https://api.videosdk.live/v2/rooms \
-H "Authorization: YOUR_JWT_TOKEN_HERE" \
-H "Content-Type: application/json"
```
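### Using Python
The same request can be made from Python. Below is a minimal sketch of the cURL call above using `aiohttp` (the same HTTP client used later in this README); the endpoint and `Authorization` header come straight from the cURL example, while the `roomId` response field is an assumption you should verify against the Create Room API docs:
```python
import asyncio
import aiohttp

VIDEOSDK_AUTH_TOKEN = "YOUR_JWT_TOKEN_HERE"  # same token as in the cURL example

async def create_room() -> str:
    """POST to the Create Room endpoint and return the new meeting ID."""
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "https://api.videosdk.live/v2/rooms",
            headers={
                "Authorization": VIDEOSDK_AUTH_TOKEN,
                "Content-Type": "application/json",
            },
        ) as response:
            response.raise_for_status()
            data = await response.json()
            # Assumption: the response body exposes the new meeting ID as "roomId"
            return data["roomId"]

if __name__ == "__main__":
    print(asyncio.run(create_room()))
```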
For more details on the Create Room API, refer to the [VideoSDK documentation](https://docs.videosdk.live/api-reference/realtime-communication/create-room).
## Getting Started: Your First Agent
### Quick Start
Now that you've installed the necessary packages, you're ready to build!
### Step 1: Creating a Custom Agent
First, let's create a custom voice agent by inheriting from the base `Agent` class:
```python title="main.py"
from videosdk.agents import Agent, function_tool

# External tool (defined in Step 2):
# async def get_weather(latitude: str, longitude: str): ...

class VoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant that can answer questions and help with tasks.",
            tools=[get_weather]  # You can register any external tool defined outside of this class
        )

    async def on_enter(self) -> None:
        """Called when the agent first joins the meeting"""
        await self.session.say("Hi there! How can I help you today?")

    async def on_exit(self) -> None:
        """Called when the agent exits the meeting"""
        await self.session.say("Goodbye!")
```
This code defines a basic voice agent with:
- Custom instructions that define the agent's personality and capabilities
- An entry greeting spoken when the agent joins the meeting
- A farewell message spoken when the agent exits the meeting
### Step 2: Implementing Function Tools
Function tools allow your agent to perform actions beyond conversation. There are two ways to define tools:
- **External Tools:** Defined as standalone functions outside the agent class and registered via the `tools` argument in the agent's constructor.
- **Internal Tools:** Defined as methods inside the agent class and decorated with `@function_tool`.
Below is an example of both:
```python
import aiohttp
from videosdk.agents import Agent, function_tool

# External function tool: a standalone async function decorated with @function_tool
@function_tool
async def get_weather(latitude: str, longitude: str):
    print(f"Getting weather for {latitude}, {longitude}")
    url = f"https://api.open-meteo.com/v1/forecast?latitude={latitude}&longitude={longitude}&current=temperature_2m"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            if response.status == 200:
                data = await response.json()
                return {
                    "temperature": data["current"]["temperature_2m"],
                    "temperature_unit": "Celsius",
                }
            else:
                raise Exception(
                    f"Failed to get weather data, status code: {response.status}"
                )

class VoiceAgent(Agent):
    # ... previous code ...

    # Internal function tool: a method decorated with @function_tool
    @function_tool
    async def get_horoscope(self, sign: str) -> dict:
        horoscopes = {
            "Aries": "Today is your lucky day!",
            "Taurus": "Focus on your goals today.",
            "Gemini": "Communication will be important today.",
        }
        return {
            "sign": sign,
            "horoscope": horoscopes.get(sign, "The stars are aligned for you today!"),
        }
```
- Use external tools for reusable, standalone functions (registered via `tools=[...]`).
- Use internal tools for agent-specific logic as class methods.
- Both must be decorated with `@function_tool` for the agent to recognize and use them.
### Step 3: Setting Up the Pipeline
The pipeline connects your agent to an AI model. Here, we use Google's Gemini for a [Real-time Pipeline](https://docs.videosdk.live/ai_agents/core-components/realtime-pipeline). You could also use a [Cascading Pipeline](https://docs.videosdk.live/ai_agents/core-components/cascading-pipeline); a hedged sketch of that variant follows the realtime example below.
```python
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
from videosdk.agents import RealTimePipeline, JobContext

async def start_session(context: JobContext):
    # Initialize the AI model
    model = GeminiRealtime(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        # When GOOGLE_API_KEY is set in .env, DON'T pass the api_key parameter
        api_key="AKZSXXXXXXXXXXXXXXXXXXXX",
        config=GeminiLiveConfig(
            voice="Leda",  # Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr
            response_modalities=["AUDIO"]
        )
    )

    pipeline = RealTimePipeline(model=model)

    # Continue to the next steps...
```
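For comparison, here is what the cascading variant might look like. This is a hedged sketch, not a verified implementation: the `CascadingPipeline` name follows the Cascading Pipeline docs linked above, but the plugin class names and constructor arguments (`DeepgramSTT`, `OpenAILLM`, `ElevenLabsTTS`) are assumptions; check each plugin's page in the [Supported Libraries](#supported-libraries-and-plugins) table for the actual signatures.
```python
# Hedged sketch of a Cascading Pipeline (STT -> LLM -> TTS).
# Class names below are assumptions; verify against each plugin's docs.
from videosdk.agents import CascadingPipeline, JobContext  # CascadingPipeline export assumed
from videosdk.plugins.deepgram import DeepgramSTT          # assumed plugin class
from videosdk.plugins.openai import OpenAILLM              # assumed plugin class
from videosdk.plugins.elevenlabs import ElevenLabsTTS      # assumed plugin class

async def start_cascading_session(context: JobContext):
    # Each stage can read its API key from the environment instead of taking it here.
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(),
    )
    # The rest of the session setup is identical to the realtime example above.
```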
### Step 4: Assembling and Starting the Agent Session
Now, let's put everything together and start the agent session:
```python
import asyncio
from videosdk.agents import AgentSession, WorkerJob, RoomOptions, JobContext

async def start_session(context: JobContext):
    # ... previous setup code ...

    # Create the agent session
    session = AgentSession(
        agent=VoiceAgent(),
        pipeline=pipeline
    )

    try:
        await context.connect()
        # Start the session
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="",     # Replace with your actual meeting ID
        auth_token="",  # When VIDEOSDK_AUTH_TOKEN is set in .env, DON'T pass auth_token
        name="Test Agent",
        playground=True,
        # vision=True   # Only available when using the Google Gemini Live API
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
### Step 5: Connecting with VideoSDK Client Applications
After setting up your AI Agent, you'll need a client application to connect with it. You can use any of the VideoSDK quickstart examples to create a client that joins the same meeting:
- [JavaScript](https://github.com/videosdk-live/quickstart/tree/main/js-rtc)
- [React](https://github.com/videosdk-live/quickstart/tree/main/react-rtc)
- [React Native](https://github.com/videosdk-live/quickstart/tree/main/react-native)
- [Android](https://github.com/videosdk-live/quickstart/tree/main/android-rtc)
- [Flutter](https://github.com/videosdk-live/quickstart/tree/main/flutter-rtc)
- [iOS](https://github.com/videosdk-live/quickstart/tree/main/ios-rtc)
- [Unity](http://github.com/videosdk-live/videosdk-rtc-unity-sdk-example)
- [IoT](https://github.com/videosdk-live/videosdk-rtc-iot-sdk-example)
When setting up your client application, make sure to use the same meeting ID that your AI Agent is using.
### Step 6: Running the Project
Once you have completed the setup, you can run your AI Voice Agent project using Python. Make sure your `.env` file is properly configured and all dependencies are installed.
```bash
python main.py
```
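For reference, a minimal `.env` for this walkthrough could look like the sketch below. The variable names `VIDEOSDK_AUTH_TOKEN` and `GOOGLE_API_KEY` come from the code comments in Steps 3 and 4; add keys only for the providers you actually use:
```bash
# .env - when these are set, omit the matching constructor arguments
VIDEOSDK_AUTH_TOKEN=your_videosdk_token
GOOGLE_API_KEY=your_gemini_api_key
```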
> [!TIP]
>
> **Test Your Agent Instantly with the CLI Tool**
>
> Run your agent locally using:
>
> ```bash
> python main.py console
> ```
>
> Experience real-time interactions right from your terminal - no meeting room required!
> Speak and listen through your system’s mic and speakers for quick testing and rapid development.
### Step 7: Deployment
For deployment options and guidance, check out the official documentation: [Deployment](https://docs.videosdk.live/ai_agents/deployments/introduction)
---
## Supported Libraries and Plugins
The framework supports integration with various AI models and tools, across multiple categories:
| Category | Services |
|--------------------------|----------|
| **Real-time Models** | [OpenAI](https://docs.videosdk.live/ai_agents/plugins/realtime/openai) \| [Gemini](https://docs.videosdk.live/ai_agents/plugins/realtime/google-live-api) \| [AWS Nova Sonic](https://docs.videosdk.live/ai_agents/plugins/realtime/aws-nova-sonic) \| [Azure Voice Live](https://docs.videosdk.live/ai_agents/plugins/realtime/azure-voice-live) |
| **Speech-to-Text (STT)** | [OpenAI](https://docs.videosdk.live/ai_agents/plugins/stt/openai) \| [Google](https://docs.videosdk.live/ai_agents/plugins/stt/google) \| [Azure AI Speech](https://docs.videosdk.live/ai_agents/plugins/stt/azure-ai-stt) \| [Azure OpenAI](https://docs.videosdk.live/ai_agents/plugins/stt/azureopenai) \| [Sarvam AI](https://docs.videosdk.live/ai_agents/plugins/stt/sarvam-ai) \| [Deepgram](https://docs.videosdk.live/ai_agents/plugins/stt/deepgram) \| [Cartesia](https://docs.videosdk.live/ai_agents/plugins/stt/cartesia-stt) \| [AssemblyAI](https://docs.videosdk.live/ai_agents/plugins/stt/assemblyai) \| [Navana](https://docs.videosdk.live/ai_agents/plugins/stt/navana) |
| **Language Models (LLM)**| [OpenAI](https://docs.videosdk.live/ai_agents/plugins/llm/openai) \| [Azure OpenAI](https://docs.videosdk.live/ai_agents/plugins/llm/azureopenai) \| [Google](https://docs.videosdk.live/ai_agents/plugins/llm/google-llm) \| [Sarvam AI](https://docs.videosdk.live/ai_agents/plugins/llm/sarvam-ai-llm) \| [Anthropic](https://docs.videosdk.live/ai_agents/plugins/llm/anthropic-llm) \| [Cerebras](https://docs.videosdk.live/ai_agents/plugins/llm/Cerebras-llm) |
| **Text-to-Speech (TTS)** | [OpenAI](https://docs.videosdk.live/ai_agents/plugins/tts/openai) \| [Google](https://docs.videosdk.live/ai_agents/plugins/tts/google-tts) \| [AWS Polly](https://docs.videosdk.live/ai_agents/plugins/tts/aws-polly-tts) \| [Azure AI Speech](https://docs.videosdk.live/ai_agents/plugins/tts/azure-ai-tts) \| [Azure OpenAI](https://docs.videosdk.live/ai_agents/plugins/tts/azureopenai) \| [Deepgram](https://docs.videosdk.live/ai_agents/plugins/tts/deepgram) \| [Sarvam AI](https://docs.videosdk.live/ai_agents/plugins/tts/sarvam-ai-tts) \| [ElevenLabs](https://docs.videosdk.live/ai_agents/plugins/tts/eleven-labs) \| [Cartesia](https://docs.videosdk.live/ai_agents/plugins/tts/cartesia-tts) \| [Resemble AI](https://docs.videosdk.live/ai_agents/plugins/tts/resemble-ai-tts) \| [Smallest AI](https://docs.videosdk.live/ai_agents/plugins/tts/smallestai-tts) \| [Speechify](https://docs.videosdk.live/ai_agents/plugins/tts/speechify-tts) \| [InWorld](https://docs.videosdk.live/ai_agents/plugins/tts/inworld-ai-tts) \| [Neuphonic](https://docs.videosdk.live/ai_agents/plugins/tts/neuphonic-tts) \| [Rime AI](https://docs.videosdk.live/ai_agents/plugins/tts/rime-ai-tts) \| [Hume AI](https://docs.videosdk.live/ai_agents/plugins/tts/hume-ai-tts) \| [Groq](https://docs.videosdk.live/ai_agents/plugins/tts/groq-ai-tts) \| [LMNT AI](https://docs.videosdk.live/ai_agents/plugins/tts/lmnt-ai-tts) \| [Papla Media](https://docs.videosdk.live/ai_agents/plugins/tts/papla-media) |
| **Voice Activity Detection (VAD)** | [SileroVAD](https://docs.videosdk.live/ai_agents/plugins/silero-vad) |
| **Turn Detection Model** | [Namo Turn Detector](https://docs.videosdk.live/ai_agents/plugins/namo-turn-detector) |
| **Virtual Avatar** | [Simli](https://docs.videosdk.live/ai_agents/core-components/avatar) |
| **Denoise** | [RNNoise](https://docs.videosdk.live/ai_agents/core-components/de-noise) |
> [!TIP]
> **Installation Examples**
>
> ```bash
> # Install with specific plugins
> pip install videosdk-agents[openai,elevenlabs,silero]
>
> # Install individual plugins
> pip install videosdk-plugins-anthropic
> pip install videosdk-plugins-deepgram
> ```
## Examples
Explore the following examples to see the framework in action:
🤖 **AI Voice Agent Use Cases**
- 📞 **AI Telephony Agent Quickstart**. Use case: hospital appointment booking via a voice-enabled agent.
- ✈️ **AI WhatsApp Agent Quickstart**. Use case: ask about available hotel rooms and book on the go.
- 👨‍🏫 **Multi-Agent System**. Use case: customer care agent that transfers loan-related queries to a Loan Specialist Agent.
- 🛒 **Agent with Knowledge (RAG)**. Use case: agent that answers questions based on documentation knowledge.
- 👨‍🏫 **Agent with MCP Server**. Use case: Stock Market Analyst Agent with real-time market data access.
- 🛒 **Virtual Avatar Agent**. Use case: a Virtual Avatar Agent that presents the weather forecast.
## Documentation
For comprehensive guides and API references:
- 📄 **Official Documentation**: complete framework documentation
- 📝 **API Reference**: detailed API documentation
- 📂 **Examples Directory**: additional code examples
## Contributing
We welcome contributions! Here's how you can help:
- 🐞 **Report Issues**: open an issue for bugs or feature requests
- 🔀 **Submit PRs**: create a pull request with improvements
- 🛠️ **Build Plugins**: follow our plugin development guide
- 💬 **Join Community**: connect with us on Discord
The framework is under active development, so contributions in the form of new plugins, features, bug fixes, or documentation improvements are highly appreciated.
### 🛠️ Building Custom Plugins
Want to integrate a new AI provider? Check out **[BUILD YOUR OWN PLUGIN](BUILD_YOUR_OWN_PLUGIN.md)** for:
- Step-by-step plugin creation guide
- Directory structure and file requirements
- Implementation examples for STT, LLM, and TTS
- Testing and submission guidelines
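As a rough orientation before reading the guide, a provider plugin typically ships as its own installable package alongside the core SDK. The layout below is purely illustrative; the names are hypothetical, and the guide's actual directory and file requirements take precedence:
```
videosdk-plugins-myprovider/        # hypothetical package name
├── pyproject.toml                  # package metadata; depends on videosdk-agents
└── videosdk/
    └── plugins/
        └── myprovider/
            ├── __init__.py         # exports e.g. MyProviderSTT / MyProviderTTS
            ├── stt.py              # STT implementation
            └── tts.py              # TTS implementation
```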
## Community & Support
Stay connected with VideoSDK:
- 💬 **Discord**: join our community
- 🐦 **Twitter**: [@video_sdk](https://x.com/video_sdk)
- ▶️ **YouTube**: VideoSDK channel
- 🔗 **LinkedIn**: VideoSDK company page
> [!TIP]
>
> **Support the Project!** ⭐️
> Star the repository, join the community, and help us improve VideoSDK by providing feedback, reporting bugs, or contributing plugins.
---
**Made with ❤️ by The VideoSDK Team**