https://github.com/svilupp/layercode-gym
Unofficial utilities for Layercode Voice Agents. Run hundreds of voice AI conversations concurrently. Test with text, audio files, or AI-driven personas.
evals generative-ai layercode voice-ai-agents
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/svilupp/layercode-gym
- Owner: svilupp
- License: mit
- Created: 2025-11-02T07:52:03.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2026-02-02T21:09:09.000Z (4 months ago)
- Last Synced: 2026-02-03T10:34:23.642Z (4 months ago)
- Topics: evals, generative-ai, layercode, voice-ai-agents
- Language: Python
- Homepage: http://siml.earth/layercode-gym/
- Size: 521 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Roadmap: docs/roadmap.md
# LayerCode Gym
A testing toolkit for voice AI agents built on [Layercode.com](https://layercode.com). Quickly spin up a testing environment to run through hundreds of scenarios and understand how your agent will perform in production.
**Note:** This is an unofficial, community-maintained project.
Perfect for regression testing, load testing, and automated evaluation of your voice AI agents.
## Features
- **Three User Simulator Types**: Fixed text, pre-recorded audio, or AI-driven personas
- **Smart Wait Handling**: AI personas intelligently wait when assistants need processing time
- **Captured Analytics**: Full transcripts with TTFAB (time to first audio byte), latency stats, and audio recordings
- **LogFire Integration**: Real-time observability and debugging
- **Batch Testing**: Run hundreds of conversations concurrently
- **CLI & Python API**: Quick testing via CLI or programmatic control, plus `api-agents` CLI to swap webhook URLs for CI
- **LLM-as-Judge**: Plug in your own quality evaluation with customizable criteria, attached as a conversation callback
- **GitHub Actions Integration**: Automated CI/CD testing with parallel persona execution
See `examples/` for reference!
## Quick Start
**Prerequisites:** Backend server configured in [Layercode dashboard](https://dash.layercode.com).
No server yet? Launch one quickly:
```bash
uvx layercode-create-app run --tunnel --unsafe-update-webhook
# Displays tunnel URL to enter in Layercode dashboard
```
> **Caution:** `--unsafe-update-webhook` automatically updates the webhook URL in the Layercode dashboard!
### CLI Quick Test (No Installation)
```bash
# Set environment
export SERVER_URL="http://localhost:8001"
export LAYERCODE_AGENT_ID="your_agent_id"
# Run instantly with uvx (no installation)
uvx layercode-gym run --text "Hello, I need help with my account"
# Multiple messages
uvx layercode-gym run --text "Hi" --text "Tell me more" --text "Goodbye"
# Audio file
uvx layercode-gym run --file recording.wav
# AI agent with persona
uvx layercode-gym run --agent \
--persona-background "You are a frustrated customer" \
--persona-intent "Cancel your subscription"
```
Run `uvx layercode-gym --help` to see available commands, or `uvx layercode-gym run --help` for all run options.
### Manage Agent Webhooks (for CI)
```bash
# List all agents
uvx layercode-gym api-agents list
# Get agent details (use --json for full pipeline config)
uvx layercode-gym api-agents get --agent-id ag-123
# Update webhook URL (useful for PR testing)
uvx layercode-gym api-agents update --agent-id ag-123 --webhook-url https://pr-backend.com/webhook
```
### Cloudflare Tunnel (for Local Development)
Quickly expose your local server to the internet with a Cloudflare tunnel. This is useful for testing webhooks without deploying your backend.
**Requires:** [cloudflared](https://developers.cloudflare.com/cloudflare-one/connections/connect-apps/install-and-setup/installation/) to be installed.
```bash
# Basic tunnel - displays URL to copy manually
uvx layercode-gym tunnel --port 8000
# Or specify a full URL directly
uvx layercode-gym tunnel --url http://localhost:8000
# Auto-update agent webhook (recommended for development)
uvx layercode-gym tunnel --port 8000 --unsafe-update-webhook
# Explicit agent ID (overrides LAYERCODE_AGENT_ID env var)
uvx layercode-gym tunnel --port 8000 --agent-id ag-123456 --unsafe-update-webhook
```
When using `--unsafe-update-webhook`:
1. The tunnel starts and gets a base URL (e.g., `https://random-words.trycloudflare.com`)
2. The agent path is appended to create the full webhook URL (e.g., `https://random-words.trycloudflare.com/api/agent`)
3. Your agent's webhook URL is automatically updated
4. When you stop the tunnel (Ctrl+C), the original webhook URL is restored
**Agent path resolution:** `--agent-path` flag → `LAYERCODE_AGENT_PATH` env var → path from existing webhook → default `/api/agent`
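The resolution order above can be sketched as a small function. This is an illustration only, not the tool's actual implementation: `resolve_agent_path` and `build_webhook_url` are hypothetical names, and treating a bare `/` path on the existing webhook as "no path" is an assumption.

```python
from __future__ import annotations

from urllib.parse import urlparse

DEFAULT_AGENT_PATH = "/api/agent"


def resolve_agent_path(
    flag: str | None,
    env: dict[str, str],
    existing_webhook: str | None,
) -> str:
    """Pick the agent path using the documented precedence order."""
    # 1. An explicit --agent-path value wins.
    if flag:
        return flag
    # 2. Then the LAYERCODE_AGENT_PATH environment variable.
    env_path = env.get("LAYERCODE_AGENT_PATH")
    if env_path:
        return env_path
    # 3. Then the path component of the webhook that is already configured.
    #    (Assumption: an empty or "/" path counts as no path.)
    if existing_webhook:
        path = urlparse(existing_webhook).path
        if path and path != "/":
            return path
    # 4. Finally the default.
    return DEFAULT_AGENT_PATH


def build_webhook_url(tunnel_base: str, agent_path: str) -> str:
    """Append the agent path to the tunnel base URL with exactly one slash."""
    return tunnel_base.rstrip("/") + "/" + agent_path.lstrip("/")


path = resolve_agent_path(None, {}, "https://old.example.com/hooks/voice")
print(build_webhook_url("https://random-words.trycloudflare.com", path))
```

Passing the environment in as a plain dictionary keeps the function easy to test; in real use you would pass `os.environ`.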
**Environment Variables:**
- `LAYERCODE_AGENT_ID` - Default agent ID for webhook updates
- `LAYERCODE_API_KEY` - API key for webhook updates (required for `--unsafe-update-webhook`)
- `LAYERCODE_AGENT_PATH` - Path to append to tunnel URL (default: extracted from existing webhook, or `/api/agent`)
> **Warning:** `--unsafe-update-webhook` modifies your agent's configuration. Only use with development/test agents, not production.
See [tunnel documentation](docs/tunnel.md) for more details.
### Python API
```bash
# Install
uv add layercode-gym
# Set environment
export SERVER_URL="http://localhost:8001"
export LAYERCODE_AGENT_ID="your_agent_id"
export OPENAI_API_KEY="sk-..." # For TTS and AI personas
```
```python
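import asyncio

from layercode_gym import LayercodeClient, UserSimulator

# Simple text messages
simulator = UserSimulator.from_text(
    messages=["Hello!", "Tell me about pricing", "Thank you"],
    send_as_text=True,
)

async def main() -> None:
    client = LayercodeClient(simulator=simulator)
    conversation_id = await client.run()
    print(conversation_id)

# client.run() is a coroutine, so it needs an event loop outside a notebook
asyncio.run(main())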
```
## Architecture
```
┌─────────────┐                     ┌──────────────┐
│  Your Test  │──1. Authorize──────▶│ Your Backend │
│    Code     │                     │    Server    │
└─────────────┘                     └──────────────┘
       │                                    │
       │                                2. Return
       │                           client_session_key
       │                                    │
       └──────3. Connect with key───────────┘
                         │
                         ▼
                 ┌──────────────┐
                 │  Layercode   │
                 │   Platform   │
                 └──────────────┘
```
**Flow:**
1. Client authorizes through YOUR backend server (`SERVER_URL`)
2. Backend returns `client_session_key` from LayerCode
3. Client connects to LayerCode WebSocket with that key
The client never requests a session from LayerCode's API directly. It always obtains the `client_session_key` through your backend first, and only then opens the WebSocket connection.
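The ordering of the flow can be illustrated with a minimal offline sketch. Everything here is hypothetical scaffolding rather than the library's API: `start_session`, `AuthResult`, and the two fake functions stand in for the real HTTP call to your backend and the real WebSocket connection.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AuthResult:
    client_session_key: str


def start_session(
    authorize: Callable[[str], AuthResult],
    connect: Callable[[str], str],
    agent_id: str,
) -> str:
    # Steps 1-2: ask YOUR backend to authorize; it returns the session key.
    auth = authorize(agent_id)
    # Step 3: only now connect to the platform, presenting that key.
    return connect(auth.client_session_key)


# Stand-ins for the real network calls, so the sketch runs offline.
calls: List[str] = []


def fake_backend_authorize(agent_id: str) -> AuthResult:
    calls.append("authorize")
    return AuthResult(client_session_key=f"key-for-{agent_id}")


def fake_websocket_connect(session_key: str) -> str:
    calls.append("connect")
    return f"connected:{session_key}"


result = start_session(fake_backend_authorize, fake_websocket_connect, "ag-123")
print(calls, result)
```

The point of the sketch is only the dependency: the connect step cannot run until the backend has handed over a key.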
## User Simulators
Three types for different testing needs:
### 1. Fixed Text Messages
Fastest option, perfect for regression testing:
```python
simulator = UserSimulator.from_text(
    messages=["Hello", "Tell me more", "Goodbye"],
    send_as_text=True,  # or False to use TTS
)
```
### 2. Pre-recorded Audio Files
Test transcription and audio handling:
```python
from pathlib import Path

simulator = UserSimulator.from_files(
    files=[Path("greeting.wav"), Path("question.wav")]
)
```
### 3. AI Agent Personas
Realistic, dynamic conversations using PydanticAI:
```python
from layercode_gym import Persona, UserSimulator

simulator = UserSimulator.from_agent(
    persona=Persona(
        background_context="You are a 35-year-old small business owner",
        intent="You want to understand pricing and features",
    ),
    model="openai:gpt-5-mini",
    max_turns=5,
)
```
## Examples
The `examples/` directory contains ready-to-run scripts:
- **01_text_messages.py** - Simple text conversation for quick testing
- **02_audio_file.py** - Stream pre-recorded audio to test transcription
- **03_agent_persona.py** - AI-driven user with dynamic responses
- **04_callbacks_judge.py** - CriteriaJudge for automated pass/fail evaluation
- **05_batch_evaluation.py** - Run multiple conversations concurrently
- **06_outdoor_shop_eval.py** - Custom data processor with domain-specific criteria
- **07_custom_judge.py** - Build your own judge with custom PydanticAI output types
- **08_long_running_task.py** - Testing agents with wait handling for slow operations
Run any example:
```bash
python examples/01_text_messages.py
```
See [full documentation](https://svilupp.github.io/layercode-gym/examples) for detailed explanations.
## LLM-as-Judge Evaluation
Evaluate conversations against pass/fail criteria using `CriteriaJudge`:
```python
from layercode_gym import CriteriaJudge, LayercodeClient, Settings

judge = CriteriaJudge(
    criteria=[
        "Did the agent answer all user questions?",
        "Was the agent polite and professional?",
        "Did the conversation flow naturally?",
    ],
    # Note: gpt-5-mini is fast/cheap for testing; use gpt-5 for production
    model="openai:gpt-5-mini",
)

async def on_end(log):
    # Runs once the conversation finishes: judge it and persist the verdict
    result = await judge.evaluate(log)
    print(f"Overall: {'PASS' if result.overall_pass else 'FAIL'}")
    judge.save_results(result, log.conversation_id, Settings.load().output_root)

client = LayercodeClient(
    simulator=simulator,  # any UserSimulator, e.g. one built with from_text
    conversation_callback=on_end,
)
```
Results are saved to `conversations//judge_evaluation.json` with full evaluation metadata:
```json
{
  "schema_version": "1.0",
  "evaluated_at": "2025-12-05T13:15:41.124793+00:00",
  "model": "openai:gpt-5-mini",
  "criteria": [{"id": 1, "criterion": "Did the agent answer all user questions?"}],
  "additional_context": "Optional context provided to the judge",
  "judgment": {
    "criteria_results": [{"criterion_id": 1, "passed": true}],
    "overall_pass": true,
    "reasoning": "The agent answered all questions clearly..."
  },
  "results_summary": [{"id": 1, "criterion": "...", "passed": true}]
}
```
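A saved evaluation can be post-processed with the standard library alone. The sketch below is a minimal example that relies only on the `judgment.overall_pass` and `results_summary` fields shown above; the `sample` document is hand-written for illustration (not real output), and `failed_criteria` is a hypothetical helper name.

```python
import json

# Hand-written sample in the shape documented above (not real judge output).
sample = """
{
  "schema_version": "1.0",
  "model": "openai:gpt-5-mini",
  "judgment": {"overall_pass": false, "reasoning": "..."},
  "results_summary": [
    {"id": 1, "criterion": "Did the agent answer all user questions?", "passed": true},
    {"id": 2, "criterion": "Was the agent polite and professional?", "passed": false}
  ]
}
"""


def failed_criteria(evaluation: dict) -> list:
    """Return the text of every criterion that did not pass."""
    return [r["criterion"] for r in evaluation["results_summary"] if not r["passed"]]


evaluation = json.loads(sample)
print("PASS" if evaluation["judgment"]["overall_pass"] else "FAIL")
for criterion in failed_criteria(evaluation):
    print("failed:", criterion)
```

In practice you would replace `json.loads(sample)` with `json.loads(path.read_text())` for the `judge_evaluation.json` file of the conversation you want to inspect.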
## Batch Testing
Run hundreds of conversations concurrently:
```python
import asyncio
from tqdm.asyncio import tqdm_asyncio

# run_conversation is your own coroutine that runs one conversation
async def main() -> None:
    scenarios = ["Message 1", "Message 2", "Message 3"]
    tasks = [run_conversation(msg) for msg in scenarios]
    results = await tqdm_asyncio.gather(*tasks, desc="Running conversations")

asyncio.run(main())
```
See `examples/05_batch_evaluation.py` for the complete pattern.
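When a batch grows to hundreds of conversations, it can help to cap how many run at once. The following is a minimal sketch of that idea using `asyncio.Semaphore`; it is not taken from the repository's examples. `run_conversation` here is a placeholder that only sleeps, and `run_batch` and `max_concurrent` are hypothetical names.

```python
import asyncio


async def run_conversation(message: str) -> str:
    # Placeholder for a real conversation (e.g. a LayercodeClient run).
    await asyncio.sleep(0.01)
    return f"done: {message}"


async def run_batch(messages, max_concurrent: int = 10):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(message: str):
        # At most max_concurrent conversations hold the semaphore at once.
        async with semaphore:
            try:
                return await run_conversation(message)
            except Exception as exc:
                # Return the error so one failure does not cancel the batch.
                return exc

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(m) for m in messages))


results = asyncio.run(run_batch([f"Message {i}" for i in range(25)], max_concurrent=5))
print(len(results), results[0])
```

Returning exceptions as values lets you separate failed scenarios from successful ones after the batch completes, rather than losing the whole run to the first error.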
## GitHub Actions CI/CD
Run automated tests in your CI pipeline with multiple personas in parallel:
```yaml
- uses: ./.github/actions/layercode-gym-test
  with:
    personas: |
      - background: You are a potential customer
        intent: Learn about pricing and features
      - background: You are a frustrated user
        intent: Get help with a problem
    judge-enabled: true
    judge-criteria: |
      - Did the agent provide clear and helpful responses?
    server-url: ${{ secrets.SERVER_URL }}
    layercode-agent-id: ${{ secrets.LAYERCODE_AGENT_ID }}
    openai-api-key: ${{ secrets.OPENAI_API_KEY }}
```
**Features:**
- Run multiple personas in parallel for maximum speed
- Automated quality evaluation with LLM judge
- Detailed artifacts with transcripts and audio recordings
- Optional LogFire observability integration
**Tip:** Use the `api-agents` CLI to update your agent's webhook URL for PR testing:
```bash
# Point agent to PR-specific backend before running tests
layercode-gym api-agents update --agent-id ag-123 --webhook-url https://pr-456.example.com/webhook
# Restore original after tests
layercode-gym api-agents update --agent-id ag-123 --webhook-url https://production.example.com/webhook
```
See [GitHub Actions documentation](docs/github-action.md) for complete setup guide, or [`api-agents` CLI docs](docs/api-agents.md) for webhook management.
## Conversation Outputs
After each conversation:
```
conversations//
├── transcript.json # Full log with timing metrics
├── conversation_mix.wav # Combined audio (user + assistant)
├── user_0.wav # Individual user turns
├── assistant_0.wav # Individual assistant turns
└── judge_evaluation.json # CriteriaJudge results (if enabled)
```
Transcript includes TTFAB, latency stats, turn counts, and full message history.
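After a batch run it can be useful to check that every conversation produced its core files. The sketch below assumes each conversation gets its own subdirectory under the output root and checks only for `transcript.json` and `conversation_mix.wav`; `missing_outputs` is a hypothetical helper, and the demo builds a throwaway directory tree instead of reading real output.

```python
from pathlib import Path
import tempfile

EXPECTED = ("transcript.json", "conversation_mix.wav")


def missing_outputs(output_root: Path) -> dict:
    """Map each conversation directory name to the expected files it lacks."""
    report = {}
    for conv_dir in sorted(p for p in output_root.iterdir() if p.is_dir()):
        missing = [name for name in EXPECTED if not (conv_dir / name).exists()]
        if missing:
            report[conv_dir.name] = missing
    return report


# Demo against a temporary tree: conv-a is complete, conv-b lacks the audio mix.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "conv-a").mkdir()
    (root / "conv-a" / "transcript.json").write_text("{}")
    (root / "conv-a" / "conversation_mix.wav").write_bytes(b"")
    (root / "conv-b").mkdir()
    (root / "conv-b" / "transcript.json").write_text("{}")
    report = missing_outputs(root)
    print(report)
```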
## Custom Implementations
### Custom TTS Engine
```python
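from pathlib import Path

from layercode_gym import UserSimulator
from layercode_gym.simulator import TTSEngineProtocol

class MyTTSEngine(TTSEngineProtocol):
    async def synthesize(self, text: str, **kwargs) -> Path:
        # Call your TTS service (ElevenLabs, Azure, etc.), write the audio
        # to disk, and return the path to the resulting file.
        audio_file_path = Path("output.wav")  # placeholder path
        return audio_file_path

simulator = UserSimulator.from_text(
    messages=["Hello!"],
    send_as_text=False,
    tts_engine=MyTTSEngine(),
)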
```
### Custom LLM for Agents
Use any LLM supported by PydanticAI. **Important:** You must define the system prompt with proper placeholders.
```python
from pydantic_ai import Agent
from textprompts import TextTemplates

# Load the required prompt template
templates = TextTemplates("src/layercode_gym/simulator/prompts")
system_prompt = templates.render(
    "basic_agent.txt",
    background_context="Your background",
    intent="Your intent",
)

# Create custom agent with proper system prompt
agent = Agent(
    "anthropic:claude-3-5-sonnet",
    system_prompt=system_prompt,
)

simulator = UserSimulator.from_agent(agent=agent, deps=my_deps)
```
**Available models:**
- `openai:gpt-5` / `openai:gpt-5-mini`
- `anthropic:claude-3-5-sonnet`
- `ollama:llama3` (local)
- `gemini:gemini-1.5-pro`
**Prompt requirements:** The system prompt must include `{background_context}` and `{intent}` placeholders. See `src/layercode_gym/simulator/prompts/basic_agent.txt` for the default template.
### Custom Simulator
Full control via protocol implementation:
```python
from layercode_gym.simulator import UserSimulatorProtocol, UserRequest, UserResponse

class MyCustomSimulator(UserSimulatorProtocol):
    async def get_response(self, request: UserRequest) -> UserResponse | None:
        # Your logic here
        return UserResponse(text="Hello!", audio_path=None, data=())
```
## Environment Variables
**Required:**
```bash
SERVER_URL="http://localhost:8001" # Your backend server
LAYERCODE_AGENT_ID="your_agent_id" # LayerCode agent ID
```
**Optional:**
```bash
OPENAI_API_KEY="sk-..." # For TTS and AI agents
OPENAI_TTS_MODEL="gpt-4o-mini-tts" # TTS model
OPENAI_TTS_VOICE="coral" # Voice (alloy, echo, fable, onyx, nova, shimmer, coral)
LAYERCODE_OUTPUT_ROOT="./conversations" # Save location
LOGFIRE_TOKEN="..." # Enable LogFire observability
```
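If you want to fail fast on a misconfigured environment before starting a run, a small pre-flight check is enough. This sketch is independent of the library's own `Settings` loader: `load_config` is a hypothetical helper, and the fallback values simply mirror the example values listed above (whether the library treats them as defaults is an assumption).

```python
import os

REQUIRED = ("SERVER_URL", "LAYERCODE_AGENT_ID")
# Fallbacks copied from the example values above (assumed, not confirmed defaults).
OPTIONAL_FALLBACKS = {
    "OPENAI_TTS_MODEL": "gpt-4o-mini-tts",
    "OPENAI_TTS_VOICE": "coral",
    "LAYERCODE_OUTPUT_ROOT": "./conversations",
}


def load_config(env=None) -> dict:
    """Validate required variables and fill in optional ones."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    for name, fallback in OPTIONAL_FALLBACKS.items():
        config[name] = env.get(name, fallback)
    return config


config = load_config(
    {"SERVER_URL": "http://localhost:8001", "LAYERCODE_AGENT_ID": "your_agent_id"}
)
print(config["LAYERCODE_OUTPUT_ROOT"])
```

Calling `load_config()` with no argument reads from `os.environ`; passing a dictionary, as in the demo, keeps the check testable.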
## LogFire Integration
Real-time observability and debugging with [LogFire](https://logfire.pydantic.dev/):
```bash
export LOGFIRE_TOKEN="your_token_here"
```
Automatically instruments PydanticAI and OpenAI calls, providing:
- Real-time conversation tracking
- Performance metrics and spans
- Error tracking with stack traces
- Beautiful UI for exploring conversations
## Type Safety
The codebase is checked with `mypy --strict` throughout. All event schemas use `TypedDict` or dataclasses.
```bash
uv run mypy src/layercode_gym
```
## Related Projects
- **[layercode-create-app](https://github.com/svilupp/layercode-create-app)** - CLI to scaffold LayerCode backends with tunneling
- **[layercode-examples](https://github.com/svilupp/layercode-examples)** - Agent patterns and integration recipes
## Documentation
Full documentation at [svilupp.github.io/layercode-gym](https://svilupp.github.io/layercode-gym)
- [Getting Started](https://svilupp.github.io/layercode-gym/getting-started)
- [Core Concepts](https://svilupp.github.io/layercode-gym/concepts)
- [Examples](https://svilupp.github.io/layercode-gym/examples)
- [API Reference](https://svilupp.github.io/layercode-gym/api-reference)
- [Advanced Usage](https://svilupp.github.io/layercode-gym/advanced)
## Contributing
This is a minimal, focused toolkit. Extend it through the existing extension points:
- Custom simulator strategies (implement `UserSimulatorProtocol`)
- Custom callbacks (implement `TurnCallback` or `ConversationCallback`)
- Custom TTS engines (implement `TTSEngineProtocol`)
Keep the core simple and extensible.
## License
MIT - See [LICENSE](LICENSE) file for details.