At-scale automated testing for voice and chat agents
The open-source platform for voice agent QA — from pre-launch simulation to production monitoring.
Common voice agent testing challenges
Audio, latency, interruptions and tool calls all have to work together — in real time.
Manual testing doesn't scale
Shipping multiple agents a week? Manual QA can't keep up. Simulation scales coverage without scaling headcount.
“We're shipping 5 new agents per week. Manual testing doesn't scale.”
Production-readiness uncertainty
Validate behaviour across edge cases before you go live — not after.
“We go live next week. How do I know this agent is production-ready?”
Edge cases hide until production
Accents, noise, interruptions and API failures rarely make the test plan — so customers hit them first.
“Our patient numbers are going from 50 to 50,000. What breaks at scale?”
Regression on every prompt change
Every prompt change risks a regression. Code-first tests let you iterate fast and safely.
“Every prompt change needs testing. We can't iterate fast enough.”
From prompt to production in 4 steps
Connect, describe the caller and verdict, run in CI, monitor in production — one scenario.run() API.
Connect your agent
pip install langwatch-scenario and pick an adapter — or implement one call(). Connects over your agent's native transport.
OpenAIRealtimeAgentAdapterDescribe caller & verdict
A UserSimulatorAgent plays the caller; a JudgeAgent scores it against plain-language criteria. No scripts, no assertion spaghetti.
UserSimulatorAgent · JudgeAgentRun the simulation in CI
scenario.run() drives scripted or autonomous calls — interruptions, DTMF, audio effects — in Vitest, Jest or pytest. Cached and deterministic.
scenario.run()Watch it in production
Inspect runs in the visualizer, stream OTel traces and alerts, and turn any failing call into a new regression test.
OTel · Simulations visualizerWrite a real voice test in 15 lines
No new DSL, no dashboard clicking. One code-first API powers text and voice — drop it into your test runner and gate every commit in CI.
- Same scenario.run() for text & voiceOne API across modalities — scripted turns, autonomous simulation, or both.
- Runs in Vitest, Jest & pytestTests live in your repo, diffable and reviewable, gating every commit in CI.
- Open source · Python, TS & GoOSS framework with first-class SDKs, OTel-native under the hood.
- MCP server with 10 toolsDrive tracing, scenario testing and evals from Claude Code, Cursor & Codex.
1result = await scenario.run(2 name="voice_getting_started",3 agents=[4 scenario.OpenAIRealtimeAgentAdapter(5 voice="alloy",6 instructions="You are a helpful assistant.",7 role=AgentRole.AGENT,8 ),9 scenario.UserSimulatorAgent(voice="openai/nova"),10 scenario.JudgeAgent(11 criteria=[12 "The agent responded helpfully",13 "They exchanged real audio turns",14 ]15 ),16 ],17 script=[18 scenario.user("Hi, can you help me?"),19 scenario.agent(), scenario.judge(),20 ],21)Catch regressions, benchmark performance, iterate faster
Small prompt changes cause big quality swings. Catch them before customers do.
Turn failures into test scenarios
Replay real production calls as regression scenarios — every failure becomes a permanent test.
Real-world conditions
Layer background noise, codec degradation and interruptions onto the simulator's audio — plus barge-in and DTMF where supported.
Latency & quality, in the trace
Capture latency and turn-taking on every run, with full OTel traces and token-level cost.
Red-teaming & adversarial
RedTeamAgent runs structured attacks — jailbreaks, prompt injection, PII probing — and writes a prioritized findings report.
Works with your voice stack
Native adapters over the exact transport you ship on — no infrastructure changes.
CI/CD & REST API
Run on every deploy via the SDK or REST API — GitHub Actions, Jenkins, any pipeline. Block bad prompts before production.
We mimic your customers
Describe the caller's goals and the User Simulator generates natural, multi-turn conversations — then layer real-world conditions onto the audio.
- Natural, multi-turn behaviour from a goal — no scripted lines
- Background noise & codec degradation layered onto audio
- Interruptions and barge-in where the adapter supports it
- Swap in RedTeamAgent for adversarial personas
Cross-industry expertise in voice agent testing
For high-stakes domains where a mistake means churn, compliance issues, or lost revenue.
Appointment Scheduling
Stop testing timezone edge cases by hand.
- Manage multiple time zones
- Verify appointment details accurately
- Handle complex calendar conflicts
Customer Support
Test escalation flows without real customers.
- Escalate appropriately to a human
- Maintain compliance standards
- Handle emotional conversations
Healthcare Follow-ups
Test patient interactions safely.
- Communicate instructions accurately
- Prescription reminders & instructions
- Emergency escalation protocols
Drive-thru & Ordering
Simulate rush hour, accents and dietary needs.
- Dietary restrictions & allergies
- Background noise & accent variation
- High-volume rush-hour conditions
Outbound Sales
Qualify and convert, on script.
- Stay on-message and on-policy
- Handle objections gracefully
- Hand off qualified leads cleanly
Legal Intake
Capture the right details, every time.
- Collect intake fields accurately
- Respect confidentiality & disclaimers
- Route to the right practice area
Built for cross-functional teams
Tests as code for engineers, a no-code builder for everyone else.
Engineering
Write scenarios as code, run them in Vitest/Jest/pytest, and gate every deploy in CI.
QA
Build scenarios in the no-code builder, run them with engineering's suite.
Product
Define what 'good' sounds like and review runs in the visualizer.
Security & Compliance
Red-team for jailbreaks, PII disclosure and policy violations, with prioritized findings reports.
Issues caught before they reach customers
Simulation plus production monitoring catches issues before deployment and in live calls — from misinterpretations to routing and compliance risks.
AI misinterpretations
- Voice assistant hallucinated non-existent promotions
- Travel agent confused airport codes, suggesting wrong bookings
- Ordering agent misread an allergy declaration
System & usability failures
- A prompt update made the agent ignore the caller mid-conversation
- Call routing looped, redirecting users repeatedly
- Latency spikes caused premature call hang-ups
Language & voice issues
- Couldn't separate multiple voices in a group order
- Failed to recognize accents, alienating callers
- Non-English languages were silently ignored
Works with your voice stack
Test the agent you actually shipped — over the exact protocol it runs on. No re-platforming, no wrappers.
Realtime API
Google Live API
WebSocket bots
Hosted agents
Phone + DTMF
Real audio, native transport
Plug into the realtime model session your agent runs on, or reach it on a live phone number. Real audio, end to end.
Built on OpenTelemetry — works with everything
Python, TypeScript & Go SDKs, plus open telemetry from any framework or model provider you already use.
From first test to production monitoring
Simulation is the start. LangWatch closes the loop with live observability, alerts and ROI metrics.
Performance & ROI metrics
Escalation rate, first-call resolution, containment, NPS and latency — the numbers leadership cares about.
Production monitoring
Trace, debug and alert on live calls so regressions surface before customers ever notice.
Visualized conversations
Step through any call turn-by-turn in the Simulations visualizer and jump to the exact failure point.
Datasets & fix loop
Turn passing runs into datasets and failing runs into fixes, delivered into your Git workflow.
No-code scenario builder
Let PMs, QA and domain experts build and refine scenarios alongside engineers.
Open & self-hosted
Open-source core, OpenTelemetry-native, deployable locally or fully self-hosted.
“LangWatch reminded me of how we used to evaluate models in classic ML. I knew immediately it was what we needed to ship conversational AI with confidence.”
Ship voice agents with confidence, not crossed fingers
Get up and running with LangWatch in as little as 10 minutes.
pip install langwatch-scenario
