Quick setup? Copy the scenarios prompt into your coding agent to add simulation tests automatically.

What are Agent Simulations?

Agent simulations are a powerful approach to testing AI agents that goes beyond traditional evaluation methods. Unlike static input-output testing, simulations test your agent’s behavior in realistic, multi-turn conversations that mimic how real users would interact with your system. Powered by the open-source Scenario framework — the same scenario.run() entrypoint and judge work across text, voice, and adversarial testing.

What you can test

Conversational agents

Simulate real users across multi-turn conversations and judge behavior at any step — tool calls, recovery, and end-to-end outcomes.

Voice agents

Test over real audio through ElevenLabs, OpenAI Realtime, Twilio, Pipecat and more — with background noise, interruptions, and latency metrics.

Red teaming

Run multi-turn adversarial attacks (Crescendo escalation, refusal detection, backtracking) to find security gaps before attackers do.

The Three Levels of Agent Quality

For comprehensive agent testing, you need all three levels:

Level 1: Unit tests
Traditional unit and integration software tests to guarantee that e.g. the agent tools are working correctly from a software point of view
Level 2: Evals, Finetuning and Prompt Optimization
Measuring the performance of individual non-deterministic components of the agent, for example maximizing RAG accuracy with evals, or approximating human preference with GRPO
Level 3: Agent Simulations
End-to-end testing of the agent in different scenarios and edge cases, guaranteeing the whole agent achieves more than the sum of its parts, simulating a wide range of situations

Simulations complement evaluations by testing the agent as a whole system rather than isolated parts.

Why Traditional Evaluation Isn’t Enough for Agents

Most evaluations are based on dataset, with a static set of cases, those are hard to get specially when you are just getting started, they often require a great amount of examples to be valuable, and an expected answer to be provided, but more than anything, they are static, like input to output, or query to expected_contexts. Agents, however, aren’t simple input-output functions. They are processes. An agent behaves like a program, executing a sequence of operations, using tools, and maintaining state.

Evaluation dataset (single input-output pairs):

query	expected_answer
What is your refund policy?	We offer a 30-day money-back guarantee on all purchases.
How do I cancel my subscription?	You can cancel your subscription by logging into your account and clicking the “Cancel Subscription” button.

❌ Doesn’t consider the conversational flow
❌ Can’t specify how middle steps should be evaluated
❌ Hard to interpret and debug
❌ Ignores user experience aspects
❌ Hard to come up with a good dataset

Agent simulation (full multi-turn descriptions):

script=[
  scenario.user("hey I have a problem with my order"),
  scenario.agent(),
  expect_ticket_created()
  expect_ticket_label("ecommerce")
  scenario.user("i want a refund!"),
  scenario.agent()
  expect_tool_call("search_policy")
  scenario.user("this is ridiculous! let me talk to a human being")
  scenario.agent()
  expect_tool_call("escalate_to_human")
]

✅ Describes the entire conversation
✅ Explicitly evaluates in-between steps
✅ Easy to interpret and debug
✅ Easy to replicate and reproduce an issue found in production
✅ Can run in autopilot for simulating a variety of inputs This doesn’t mean you should stop doing evaluations, in fact, having evaluations and simulations together is what composes your full agent test suite:

Use evaluations for testing the smaller parts that compose the agent, where a more “machine learning” approach is required, for optimizing a specific LLM call or retrieval for example.
Use simulation-based testing for proving the agent’s behavior is correct end-to-end, replicate specific edge cases, and guide your agent’s development without regressions.

Why Use LangWatch Scenario?

Scenario is the most advanced agent testing framework available. It provides:

Powerful simulations - Test real agent behavior by simulating users in different scenarios and edge cases
Flexible evaluations - Judge agent behavior at any point in conversations, combine with evals, test error recovery, and complex workflows
Voice testing - Test voice agents end-to-end over real audio (ElevenLabs, OpenAI Realtime, Twilio, Pipecat) — see Voice Agents
Red teaming - Run multi-turn adversarial attacks to uncover security vulnerabilities — see Red Teaming
Framework agnostic - Works with any AI agent framework
Simple integration - Just implement one call() method
Multi-language support - Python, TypeScript, and Go

Visualizing Simulations in LangWatch

Once you’ve set up your agent tests with Scenario, LangWatch provides powerful visualization tools to:

Organize simulations into sets and batches
Debug agent behavior by stepping through conversations
Track performance over time with run history
Collaborate with your team on agent improvements

See the LangWatch Simulations visualizer to learn how to inspect, replay, and compare your runs.

Next Steps

Getting Started - Set up your first simulation
Voice Agents - Test voice agents end-to-end over real audio
Red Teaming - Run adversarial attacks against your agent
Scenario Documentation - Deep dive into the testing framework

Get Started

Agent Simulations

Observability

Evaluations

Prompt Management

Platform

Examples & Cookbooks

Help

Introduction to Agent Testing

What are Agent Simulations?

What you can test

Conversational agents

Voice agents

Red teaming

The Three Levels of Agent Quality

Why Traditional Evaluation Isn’t Enough for Agents

Evaluation dataset (single input-output pairs):

Agent simulation (full multi-turn descriptions):

Why Use LangWatch Scenario?

Visualizing Simulations in LangWatch

Next Steps

​What are Agent Simulations?

​What you can test

Conversational agents

Voice agents

Red teaming

​The Three Levels of Agent Quality

​Why Traditional Evaluation Isn’t Enough for Agents

​Evaluation dataset (single input-output pairs):

​Agent simulation (full multi-turn descriptions):

​Why Use LangWatch Scenario?

​Visualizing Simulations in LangWatch

​Next Steps

What are Agent Simulations?

What you can test

The Three Levels of Agent Quality

Why Traditional Evaluation Isn’t Enough for Agents

Evaluation dataset (single input-output pairs):

Agent simulation (full multi-turn descriptions):

Why Use LangWatch Scenario?

Visualizing Simulations in LangWatch

Next Steps