Skip to main content
Quick setup? Copy the scenarios prompt into your coding agent to add simulation tests automatically.

What are Agent Simulations?

Agent simulations are a powerful approach to testing AI agents that goes beyond traditional evaluation methods. Unlike static input-output testing, simulations test your agent’s behavior in realistic, multi-turn conversations that mimic how real users would interact with your system. Powered by the open-source Scenario framework — the same scenario.run() entrypoint and judge work across text, voice, and adversarial testing. Agent Simulations

What you can test

Conversational agents

Simulate real users across multi-turn conversations and judge behavior at any step — tool calls, recovery, and end-to-end outcomes.

Voice agents

Test over real audio through ElevenLabs, OpenAI Realtime, Twilio, Pipecat and more — with background noise, interruptions, and latency metrics.

Red teaming

Run multi-turn adversarial attacks (Crescendo escalation, refusal detection, backtracking) to find security gaps before attackers do.

The Three Levels of Agent Quality

For comprehensive agent testing, you need all three levels:
  • Level 1: Unit tests
    Traditional unit and integration software tests to guarantee that e.g. the agent tools are working correctly from a software point of view
  • Level 2: Evals, Finetuning and Prompt Optimization
    Measuring the performance of individual non-deterministic components of the agent, for example maximizing RAG accuracy with evals, or approximating human preference with GRPO
  • Level 3: Agent Simulations
    End-to-end testing of the agent in different scenarios and edge cases, guaranteeing the whole agent achieves more than the sum of its parts, simulating a wide range of situations
Simulations complement evaluations by testing the agent as a whole system rather than isolated parts.

Why Traditional Evaluation Isn’t Enough for Agents

Most evaluations are based on dataset, with a static set of cases, those are hard to get specially when you are just getting started, they often require a great amount of examples to be valuable, and an expected answer to be provided, but more than anything, they are static, like input to output, or query to expected_contexts. Agents, however, aren’t simple input-output functions. They are processes. An agent behaves like a program, executing a sequence of operations, using tools, and maintaining state.

Evaluation dataset (single input-output pairs):

queryexpected_answer
What is your refund policy?We offer a 30-day money-back guarantee on all purchases.
How do I cancel my subscription?You can cancel your subscription by logging into your account and clicking the “Cancel Subscription” button.
❌ Doesn’t consider the conversational flow
❌ Can’t specify how middle steps should be evaluated
❌ Hard to interpret and debug
❌ Ignores user experience aspects
❌ Hard to come up with a good dataset

Agent simulation (full multi-turn descriptions):

script=[
  scenario.user("hey I have a problem with my order"),
  scenario.agent(),
  expect_ticket_created()
  expect_ticket_label("ecommerce")
  scenario.user("i want a refund!"),
  scenario.agent()
  expect_tool_call("search_policy")
  scenario.user("this is ridiculous! let me talk to a human being")
  scenario.agent()
  expect_tool_call("escalate_to_human")
]
✅ Describes the entire conversation
✅ Explicitly evaluates in-between steps
✅ Easy to interpret and debug
✅ Easy to replicate and reproduce an issue found in production
✅ Can run in autopilot for simulating a variety of inputs
This doesn’t mean you should stop doing evaluations, in fact, having evaluations and simulations together is what composes your full agent test suite:
  • Use evaluations for testing the smaller parts that compose the agent, where a more “machine learning” approach is required, for optimizing a specific LLM call or retrieval for example.
  • Use simulation-based testing for proving the agent’s behavior is correct end-to-end, replicate specific edge cases, and guide your agent’s development without regressions.

Why Use LangWatch Scenario?

Scenario is the most advanced agent testing framework available. It provides:
  • Powerful simulations - Test real agent behavior by simulating users in different scenarios and edge cases
  • Flexible evaluations - Judge agent behavior at any point in conversations, combine with evals, test error recovery, and complex workflows
  • Voice testing - Test voice agents end-to-end over real audio (ElevenLabs, OpenAI Realtime, Twilio, Pipecat) — see Voice Agents
  • Red teaming - Run multi-turn adversarial attacks to uncover security vulnerabilities — see Red Teaming
  • Framework agnostic - Works with any AI agent framework
  • Simple integration - Just implement one call() method
  • Multi-language support - Python, TypeScript, and Go

Visualizing Simulations in LangWatch

Once you’ve set up your agent tests with Scenario, LangWatch provides powerful visualization tools to:
  • Organize simulations into sets and batches
  • Debug agent behavior by stepping through conversations
  • Track performance over time with run history
  • Collaborate with your team on agent improvements
See the LangWatch Simulations visualizer to learn how to inspect, replay, and compare your runs. Simulations Sets

Next Steps