Voice agent testing

At-scale automated testing for voice and chat agents.

The open-source platform for voice agent QA, from pre-launch simulation to production monitoring.

Test your agents View documentation

Common voice agent testing challenges

Audio, latency, interruptions, and tool calls all have to work together in real time.

Manual testing doesn't scale

Shipping multiple agents a week? Manual QA can't keep up.

Production-readiness uncertainty

Validate behaviour across edge cases before going live.

Edge cases hide until production

Accents, noise, interruptions, and API failures rarely make the test plan.

Regression on every prompt change

Every prompt change risks a regression.

From prompt to production in four steps

Connect, describe the caller and verdict, run in CI, monitor in production. One scenario.run() API.

Step 01

Connect your agent

pip install, then point a thin adapter at the agent you already shipped.

Step 02

Describe caller & verdict

A UserSimulatorAgent plays the caller, a JudgeAgent decides what passing looks like.

Step 03

Run simulation in CI

scenario.run() drives interruptions, DTMF, and audio effects on every commit.

Step 04

Watch in production

The visualizer, OpenTelemetry traces, and alerts close the loop after launch.

Write a real voice test in fifteen lines

No new DSL, no dashboard clicking. One code-first API powers text and voice. Drop it into your test runner and gate every commit in CI.

test_voice_agent.pyillustrative

# test_voice_agent.py (illustrative)
import scenario

result = scenario.run(
    name="caller reschedules an appointment over the phone",
    description="A caller with background noise asks to move a booking.",
    agents=[
        my_voice_agent,                       # your agent, behind a thin adapter
        scenario.UserSimulatorAgent(
            persona="hurried caller, interrupts, mild accent",
        ),
        scenario.JudgeAgent(criteria=[
            "The agent confirms the new time in the caller's timezone",
            "The agent verifies identity before changing the booking",
        ]),
    ],
    audio=scenario.AudioConditions(noise="cafe", interruptions=True),
)
assert result.success

Same scenario.run() for text & voice

Runs in Vitest, Jest & pytest

Open source, Python/TS/Go

MCP server with 10 tools

Snippet is illustrative. See the docs for the current API.

Catch regressions, benchmark performance, iterate faster

One framework for everything that has to go right before a voice agent can answer the phone.

scenario.run() · voice4 scenarios

reschedule · cafe noise

6 turns340ms ttftpass

reschedule · hurried caller

5 turns410ms ttftpass

refund · angry escalation (DE)

9 turns520ms ttftflagged

support · accented caller

7 turns380ms ttftpass

Turn failures into test scenarios

Capture a production failure once, then replay it forever as a regression test.

Real-world conditions

Noise, codec degradation, and interruptions, so you test the call your users actually make.

Latency and quality, in the trace

Time to first token, turn latency, and judge verdicts land in one inspectable trace.

Red-teaming and adversarial testing

Probe for jailbreaks, prompt leaks, and off-policy answers before an attacker does.

Works with your voice stack

We mimic your customers: the User Simulator generates natural, multi-turn conversations over your real protocol.

CI/CD and REST API

Gate every release in your pipeline, or drive runs programmatically through the REST API.

Cross-industry voice agent testing

For high-stakes domains where a mistake means churn, compliance issues, or lost revenue.

Appointment scheduling

Timezones, identity verification, and double-booking conflicts, handled on the call.

Customer support

Escalation paths, compliance language, and emotional conversations that stay on script.

Healthcare follow-ups

Care instructions, prescriptions, and clean emergency escalation when it matters.

Drive-thru & ordering

Menu rules, background noise, and rush-hour pace without dropped items.

Outbound sales

On-message pitches, objection handling, and a clean lead handoff every time.

Legal intake

Field capture, confidentiality, and accurate routing to the right team.

Works with your voice stack

Test the agent you actually shipped, over the exact protocol it runs on. No re-platforming, no wrappers.

OpenAI Realtime

Gemini

ElevenLabs

Google Live API

Pipecat

WebSocket bots

Twilio

Phone + DTMF

Built on OpenTelemetry, works with 10+ frameworks.

From first test to production monitoring

Simulation is the start. LangWatch closes the loop with live observability, alerts, and ROI metrics.

Performance & ROI metrics

Tie agent quality to the numbers your business actually tracks.

Production monitoring

Watch live calls, catch drift, and get alerted when quality slips.

Visualized conversations

Replay any run as a full conversation you can inspect and share.

Datasets & fix loop

Turn real conversations into datasets, then close the loop on fixes.

No-code scenario builder

Let product and QA author scenarios without touching the codebase.

Open & self-hosted

Open source and self-hostable, so your call data stays yours.

“

LangWatch reminded me of how we used to evaluate models in classic ML. I knew immediately it was what we needed to ship conversational AI with confidence.

Head of AI, an enterprise voice AI team

Ship voice agents with confidence, not crossed fingers.

Get up and running with LangWatch in as little as ten minutes.

Start shipping Talk to us

$ pip install langwatch-scenario