Voice Agent Testing

At-scale automated testing for voice and chat agents

The open-source platform for voice agent QA — from pre-launch simulation to production monitoring.

Open source Python · TypeScript · Go Runs in your CI
0:00 / 0:00
The problem

Common voice agent testing challenges

Audio, latency, interruptions and tool calls all have to work together — in real time.

Manual testing doesn't scale

Shipping multiple agents a week? Manual QA can't keep up. Simulation scales coverage without scaling headcount.

We're shipping 5 new agents per week. Manual testing doesn't scale.

Production-readiness uncertainty

Validate behaviour across edge cases before you go live — not after.

We go live next week. How do I know this agent is production-ready?

Edge cases hide until production

Accents, noise, interruptions and API failures rarely make the test plan — so customers hit them first.

Our patient numbers are going from 50 to 50,000. What breaks at scale?

Regression on every prompt change

Every prompt change risks a regression. Code-first tests let you iterate fast and safely.

Every prompt change needs testing. We can't iterate fast enough.
How it works

From prompt to production in 4 steps

Connect, describe the caller and verdict, run in CI, monitor in production — one scenario.run() API.

Agent Adapternative transportUser Simulatordrives the callJudge Agentscores the runVerdictpass / fail + trace
1

Connect your agent

pip install langwatch-scenario and pick an adapter — or implement one call(). Connects over your agent's native transport.

OpenAIRealtimeAgentAdapter
2

Describe caller & verdict

A UserSimulatorAgent plays the caller; a JudgeAgent scores it against plain-language criteria. No scripts, no assertion spaghetti.

UserSimulatorAgent · JudgeAgent
3

Run the simulation in CI

scenario.run() drives scripted or autonomous calls — interruptions, DTMF, audio effects — in Vitest, Jest or pytest. Cached and deterministic.

scenario.run()
4

Watch it in production

Inspect runs in the visualizer, stream OTel traces and alerts, and turn any failing call into a new regression test.

OTel · Simulations visualizer
Built for engineers

Write a real voice test in 15 lines

No new DSL, no dashboard clicking. One code-first API powers text and voice — drop it into your test runner and gate every commit in CI.

  • Same scenario.run() for text & voiceOne API across modalities — scripted turns, autonomous simulation, or both.
  • Runs in Vitest, Jest & pytestTests live in your repo, diffable and reviewable, gating every commit in CI.
  • Open source · Python, TS & GoOSS framework with first-class SDKs, OTel-native under the hood.
  • MCP server with 10 toolsDrive tracing, scenario testing and evals from Claude Code, Cursor & Codex.
See the quickstart
test_voice_agent.py
1result = await scenario.run(2    name="voice_getting_started",3    agents=[4        scenario.OpenAIRealtimeAgentAdapter(5            voice="alloy",6            instructions="You are a helpful assistant.",7            role=AgentRole.AGENT,8        ),9        scenario.UserSimulatorAgent(voice="openai/nova"),10        scenario.JudgeAgent(11            criteria=[12                "The agent responded helpfully",13                "They exchanged real audio turns",14            ]15        ),16    ],17    script=[18        scenario.user("Hi, can you help me?"),19        scenario.agent(), scenario.judge(),20    ],21)
Ship voice agents with confidence

Catch regressions, benchmark performance, iterate faster

Small prompt changes cause big quality swings. Catch them before customers do.

Turn failures into test scenarios

Replay real production calls as regression scenarios — every failure becomes a permanent test.

Real-world conditions

Layer background noise, codec degradation and interruptions onto the simulator's audio — plus barge-in and DTMF where supported.

Latency & quality, in the trace

Capture latency and turn-taking on every run, with full OTel traces and token-level cost.

Red-teaming & adversarial

RedTeamAgent runs structured attacks — jailbreaks, prompt injection, PII probing — and writes a prioritized findings report.

Works with your voice stack

Native adapters over the exact transport you ship on — no infrastructure changes.

CI/CD & REST API

Run on every deploy via the SDK or REST API — GitHub Actions, Jenkins, any pipeline. Block bad prompts before production.

User Simulator

We mimic your customers

Describe the caller's goals and the User Simulator generates natural, multi-turn conversations — then layer real-world conditions onto the audio.

  • Natural, multi-turn behaviour from a goal — no scripted lines
  • Background noise & codec degradation layered onto audio
  • Interruptions and barge-in where the adapter supports it
  • Swap in RedTeamAgent for adversarial personas
UserSimulatorYourVoice Agentaudio inaudio outJudge AgentResponded helpfullyExchanged real audio
Use cases

Cross-industry expertise in voice agent testing

For high-stakes domains where a mistake means churn, compliance issues, or lost revenue.

Appointment Scheduling

Stop testing timezone edge cases by hand.

  • Manage multiple time zones
  • Verify appointment details accurately
  • Handle complex calendar conflicts

Customer Support

Test escalation flows without real customers.

  • Escalate appropriately to a human
  • Maintain compliance standards
  • Handle emotional conversations

Healthcare Follow-ups

Test patient interactions safely.

  • Communicate instructions accurately
  • Prescription reminders & instructions
  • Emergency escalation protocols

Drive-thru & Ordering

Simulate rush hour, accents and dietary needs.

  • Dietary restrictions & allergies
  • Background noise & accent variation
  • High-volume rush-hour conditions

Outbound Sales

Qualify and convert, on script.

  • Stay on-message and on-policy
  • Handle objections gracefully
  • Hand off qualified leads cleanly

Legal Intake

Capture the right details, every time.

  • Collect intake fields accurately
  • Respect confidentiality & disclaimers
  • Route to the right practice area
For the whole team

Built for cross-functional teams

Tests as code for engineers, a no-code builder for everyone else.

Engineering

Write scenarios as code, run them in Vitest/Jest/pytest, and gate every deploy in CI.

QA

Build scenarios in the no-code builder, run them with engineering's suite.

Product

Define what 'good' sounds like and review runs in the visualizer.

Security & Compliance

Red-team for jailbreaks, PII disclosure and policy violations, with prioritized findings reports.

What you'll catch

Issues caught before they reach customers

Simulation plus production monitoring catches issues before deployment and in live calls — from misinterpretations to routing and compliance risks.

AI misinterpretations

  • Voice assistant hallucinated non-existent promotions
  • Travel agent confused airport codes, suggesting wrong bookings
  • Ordering agent misread an allergy declaration

System & usability failures

  • A prompt update made the agent ignore the caller mid-conversation
  • Call routing looped, redirecting users repeatedly
  • Latency spikes caused premature call hang-ups

Language & voice issues

  • Couldn't separate multiple voices in a group order
  • Failed to recognize accents, alienating callers
  • Non-English languages were silently ignored
Integrations

Works with your voice stack

Test the agent you actually shipped — over the exact protocol it runs on. No re-platforming, no wrappers.

Realtime API

Google Live API

WebSocket bots

Hosted agents

Phone + DTMF

Real audio, native transport

Realtime APIWebSocketHosted-agent APIsPhone (Twilio + DTMF)

Plug into the realtime model session your agent runs on, or reach it on a live phone number. Real audio, end to end.

Built on OpenTelemetry — works with everything

LangGraphCrewAILiteLLMPydantic AIVercel AI SDKLlamaIndexOpenAI AgentsDSPy+ more

Python, TypeScript & Go SDKs, plus open telemetry from any framework or model provider you already use.

The Agent Quality Platform

From first test to production monitoring

Simulation is the start. LangWatch closes the loop with live observability, alerts and ROI metrics.

Performance & ROI metrics

Escalation rate, first-call resolution, containment, NPS and latency — the numbers leadership cares about.

Production monitoring

Trace, debug and alert on live calls so regressions surface before customers ever notice.

Visualized conversations

Step through any call turn-by-turn in the Simulations visualizer and jump to the exact failure point.

Datasets & fix loop

Turn passing runs into datasets and failing runs into fixes, delivered into your Git workflow.

No-code scenario builder

Let PMs, QA and domain experts build and refine scenarios alongside engineers.

Open & self-hosted

Open-source core, OpenTelemetry-native, deployable locally or fully self-hosted.

“LangWatch reminded me of how we used to evaluate models in classic ML. I knew immediately it was what we needed to ship conversational AI with confidence.”
AHAmit HuliHead of AI · Roojoom

Ship voice agents with confidence, not crossed fingers

Get up and running with LangWatch in as little as 10 minutes.

pip install langwatch-scenario

Ship agents with confidence, not crossed fingers

Get up and running with LangWatch in as little as 5 minutes.

Ship agents with confidence, not crossed fingers

Get up and running with LangWatch in as little as 5 minutes.