Testing an OpenAI Realtime Voice Model with Scenario
Audience: a developer whose "agent" is an OpenAI Realtime model with instructions and tools. There's no separate app layer — the model is the agent. They want regression tests.
How this differs from the hosted-provider path: ElevenLabs (and platforms like it) hosts the STT→LLM→TTS loop; you test the platform's agent. With OpenAI Realtime, the model itself is the agent — no hosting layer, the session is the product. Scenario treats both the same way via VoiceAgentAdapter, but the configuration surface differs.
Prerequisites
- Python 3.11+
OPENAI_API_KEYwithmodel.requestscope (for chat completions + Realtime + TTS + STT)
All-in-one key: unlike the ElevenLabs path, everything here runs on OpenAI: the Realtime model itself, the user simulator TTS, the judge LLM, and the default STT. One key, one provider.
Step 1 — Install
pip install scenarioNo extras flag. Voice is first-class.
Step 2 — Set env var
OPENAI_API_KEY=sk-...The key must have access to the Realtime API (most standard keys do).
Step 3 — Write a scenario
Create test_realtime_agent.py:
import pytest
import scenario
from scenario.voice import OpenAIRealtimeAgentAdapter
@pytest.mark.asyncio
async def test_realtime_handles_cancellation():
result = await scenario.run(
name="cancellation_flow",
description="Realtime agent must handle a subscription cancel without friction.",
agents=[
# The model IS the agent — instructions + voice + tools live here.
OpenAIRealtimeAgentAdapter(
model="gpt-4o-realtime-preview",
voice="alloy",
instructions=(
"You are a customer support agent. Help users cancel "
"subscriptions. If they change their mind, acknowledge "
"and move on — no upsells."
),
),
scenario.UserSimulatorAgent(voice="openai/nova"),
scenario.JudgeAgent(
criteria=[
"The agent processed the cancellation request",
"The agent did not attempt to upsell or retain the customer",
]
),
],
script=[
scenario.user("I want to cancel my subscription"),
scenario.agent(),
scenario.user("No, I'm sure"),
scenario.agent(),
scenario.judge(),
],
)
assert result.success, result.reasoningThe key difference from the ElevenLabs path: OpenAIRealtimeAgentAdapter takes instructions directly. You're configuring the agent at test time, not pointing at a deployed agent. This lets you A/B test instruction variations in the same suite.
Step 4 — Run it
pytest test_realtime_agent.py -vSame as any other voice scenario. Pass/fail + verdict + audio.
Step 5 — Tools
If your agent uses OpenAI Realtime tool calling, pass the tool schemas:
OpenAIRealtimeAgentAdapter(
model="gpt-4o-realtime-preview",
voice="alloy",
instructions="...",
tools=[
{
"type": "function",
"name": "cancel_subscription",
"description": "Cancel the user's subscription immediately.",
"parameters": {
"type": "object",
"properties": {
"user_id": {"type": "string"},
"reason": {"type": "string"},
},
"required": ["user_id"],
},
},
],
)Tool calls flow through as events on result.timeline — use a callable script step to assert the right tool fired with the right args:
def assert_cancellation_called(state):
calls = [e for e in state.timeline if e.type == "tool_call"]
assert any(
c.tool_name == "cancel_subscription" for c in calls
), "Expected cancel_subscription to fire"
script = [
scenario.user("Cancel my subscription"),
scenario.agent(),
assert_cancellation_called,
scenario.judge(),
]Step 6 — Simulate natural-prosody users
Normally scenario.user("text") runs the text through the OpenAI TTS (voice nova by default). For Realtime-specific tests you may want the user simulator itself to be a Realtime model with natural prosody, emotion, pacing:
from scenario.types import AgentRole
agents = [
OpenAIRealtimeAgentAdapter(
model="gpt-4o-realtime-preview",
voice="alloy",
instructions="You are a helpful support agent.",
role=AgentRole.AGENT, # the agent under test
),
OpenAIRealtimeAgentAdapter(
model="gpt-4o-realtime-preview",
voice="nova",
instructions="You are a confused elderly customer. Speak slowly.",
role=AgentRole.USER, # the user simulator
),
scenario.JudgeAgent(criteria=["The agent was patient and clear"]),
]When a scripted scenario.user("text") turn fires with a role=USER Realtime adapter present, the text is routed through that adapter's send_text (not the default TTS) — the Realtime model natively synthesises speech with the persona it's been given.
Step 7 — Add to CI
- run: pip install scenario
- run: pytest test_realtime_agent.py
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}One secret. One provider. Cheapest voice-regression CI you can set up.
Pattern: branched scenarios for A/B'ing instructions
Instruction-level changes to a Realtime agent ARE the behavior change. Testing them side-by-side is the point:
@pytest.mark.parametrize("instructions,expected_to_pass", [
("You are a support agent. Be friendly and solve problems.", True),
("You are a support agent. Prioritize upsells.", False), # should fail
])
async def test_instruction_variants(instructions, expected_to_pass):
result = await scenario.run(
agents=[
OpenAIRealtimeAgentAdapter(
model="gpt-4o-realtime-preview",
voice="alloy",
instructions=instructions,
),
scenario.UserSimulatorAgent(voice="openai/nova"),
scenario.JudgeAgent(
criteria=["The agent did not attempt to upsell"]
),
],
script=[scenario.user("Just processing my cancellation"),
scenario.agent(),
scenario.judge()],
)
assert result.success == expected_to_passThis is the thing you can't do with a hosted agent — you're configuring the agent at test time.
Getting LangWatch traces from your live app
Scenario tests give you rich LangWatch traces automatically — each scenario.run() call sets up the OTLP exporter and emits per-turn spans. Your live / production app gets nothing by default; the test scaffolding stays in the test runner.
Use realtime_langwatch_session to wrap your production Realtime session in the same span shape:
import os
import asyncio
import websockets
from scenario import realtime_langwatch_session
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
LANGWATCH_API_KEY = os.environ["LANGWATCH_API_KEY"]
async def run_realtime_session():
async with realtime_langwatch_session(
name="my-production-session",
model="gpt-4o-realtime-preview",
api_key=LANGWATCH_API_KEY,
) as session:
# Open your raw Realtime WebSocket here.
# After each agent turn, call session.log_turn() with the
# transcripts you receive from the Realtime event stream.
async with websockets.connect(
"wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
additional_headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
) as ws:
# ... send/receive events as normal ...
# When the agent finishes a turn:
await session.log_turn(
user_transcript="I want to cancel my subscription",
agent_transcript="Of course, I can help with that.",
model="gpt-4o-realtime-preview",
latency_ms=430,
)
asyncio.run(run_realtime_session())What log_turn records: a child LLM span with input (user transcript), output (agent transcript), model, and latency_ms — the same fields LangWatch renders in the trace explorer. Traces land at your LangWatch project and are queryable alongside your scenario-test traces.
When LANGWATCH_API_KEY is absent the context manager is a no-op — no error, no side effects. Safe to ship in environments without a key set.
If you already call langwatch.setup() in your app (e.g. to instrument other LLM calls), realtime_langwatch_session detects the existing provider and attaches to it — no duplicate initialization.
Next
- ElevenLabs (hosted provider) happy path: /voice/happy-path-elevenlabs
- Feature contract:
specs/voice-agents.feature - Capability matrix per adapter: /voice/capability-matrix
