
Simulation-Based Testing for AI Agents

Traditional evaluation methods designed for static, single-turn LLMs cannot adequately test agents. Agents are stateful, dynamic systems that make decisions over time, recover from errors, and adapt to new information. Building robust, autonomous agents requires evaluation with the same rigor used to design their architecture.

Scenario simulations test agents and create a feedback loop for continuous improvement.

The Limitations of Traditional Evaluation

Most evaluations rely on static datasets with fixed test cases. These datasets are difficult to create, require numerous examples, and demand expected answers. More importantly, they are static—simple input-output pairs or query-to-context mappings.

Agents are not simple input-output functions. They are processes that execute sequences of operations, use tools, and maintain state.

Evaluation dataset (single input-output pairs):

| query | expected_answer |
| --- | --- |
| What is your refund policy? | We offer a 30-day money-back guarantee on all purchases. |
| How do I cancel my subscription? | You can cancel your subscription by logging into your account and clicking the "Cancel Subscription" button. |

❌ Doesn't consider the conversational flow
❌ Can't specify how middle steps should be evaluated
❌ Hard to interpret and debug
❌ Ignores user experience aspects
❌ Hard to come up with a good dataset
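In practice, this kind of static evaluation reduces to a loop over input-output pairs. The sketch below illustrates the pattern; the dataset row, the stand-in `call_agent` function, and the naive token-overlap scorer are all illustrative assumptions, not a real evaluation framework:

```python
# Minimal sketch of static, single-turn evaluation over input-output pairs.
# call_agent and score are illustrative stand-ins, not a real library API.
dataset = [
    {
        "query": "What is your refund policy?",
        "expected_answer": "We offer a 30-day money-back guarantee on all purchases.",
    },
]

def call_agent(query: str) -> str:
    # Stand-in for a real LLM/agent call (assumption for this sketch).
    return "We offer a 30-day money-back guarantee on all purchases."

def score(answer: str, expected: str) -> float:
    # Naive token-overlap score; real evals would use an LLM judge or
    # task-specific metrics instead.
    a, e = set(answer.lower().split()), set(expected.lower().split())
    return len(a & e) / len(e) if e else 0.0

results = [score(call_agent(row["query"]), row["expected_answer"]) for row in dataset]
average = sum(results) / len(results)
```

Note how the whole evaluation collapses to a single number per query: there is nowhere in this loop to express "the agent should have created a ticket first" or "the second reply should have triggered an escalation."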

Agent simulation (full multi-turn descriptions):

script=[
  scenario.user("hey I have a problem with my order"),
  scenario.agent(),
  expect_ticket_created(),
  expect_ticket_label("ecommerce"),
  scenario.user("i want a refund!"),
  scenario.agent(),
  expect_tool_call("search_policy"),
  scenario.user("this is ridiculous! let me talk to a human being"),
  scenario.agent(),
  expect_tool_call("escalate_to_human"),
]

✅ Describes the entire conversation
✅ Explicitly evaluates in-between steps
✅ Easy to interpret and debug
✅ Easy to replicate and reproduce an issue found in production
✅ Can run in autopilot for simulating a variety of inputs
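Conceptually, a simulation runner is a small interpreter that walks the script, alternating user turns, agent turns, and in-between assertions. The sketch below shows that idea end to end with a toy agent; every name in it (the step helpers, `toy_agent`, the `Conversation` record) is hypothetical and not the real Scenario API:

```python
# Hypothetical sketch of a simulation runner consuming a scripted scenario.
# All names (user/agent/expect_tool_call, toy_agent) are illustrative.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Conversation:
    messages: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)

def user(text: str):
    # A user turn appends a user message to the conversation state.
    def step(conv: Conversation):
        conv.messages.append(("user", text))
    return step

def agent(respond: Callable[[Conversation], str]):
    # An agent turn lets the agent under test react to the full history.
    def step(conv: Conversation):
        conv.messages.append(("agent", respond(conv)))
    return step

def expect_tool_call(name: str):
    # An in-between assertion: the agent must have called this tool by now.
    def step(conv: Conversation):
        assert name in conv.tool_calls, f"expected tool call {name!r}"
    return step

def toy_agent(conv: Conversation) -> str:
    # Toy agent: escalates when the user asks for a human.
    last_user = [m for role, m in conv.messages if role == "user"][-1]
    if "human" in last_user:
        conv.tool_calls.append("escalate_to_human")
        return "Connecting you to a human agent."
    return "I'm sorry to hear that, let me check."

script = [
    user("this is ridiculous! let me talk to a human being"),
    agent(toy_agent),
    expect_tool_call("escalate_to_human"),
]

conv = Conversation()
for step in script:
    step(conv)  # raises AssertionError as soon as an expectation fails
```

Because assertions sit between turns, a failure pinpoints the exact step where the agent went wrong, rather than a single pass/fail score for the whole conversation.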

Note: Evaluations remain valuable. Combine evaluations and simulations for comprehensive agent testing:

  • Use evaluations to test the smaller components that compose the agent, where a more "machine learning" approach is required, for example optimizing a specific LLM call or a retrieval step.

  • Use simulation-based testing to prove the agent's behavior is correct end-to-end, replicate specific edge cases, and guide your agent's development without regressions.
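A component-level evaluation of the first kind can be very small. As an illustration, here is a sketch of measuring recall@1 for an agent's retrieval step in isolation; the corpus, the keyword-overlap retriever, and the labeled queries are all made up for this example:

```python
# Sketch of a component-level evaluation: recall@1 for a retrieval step.
# Corpus, retriever, and labeled queries are illustrative assumptions.
corpus = {
    "refund_policy": "We offer a 30-day money-back guarantee on all purchases.",
    "cancellation": "Cancel your subscription from your account settings.",
}

def retrieve(query: str, k: int = 1) -> list:
    # Naive keyword-overlap ranking, standing in for a real vector search.
    def overlap(doc_id):
        return len(set(query.lower().split()) & set(corpus[doc_id].lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

labeled = [
    ("How do I get a refund?", "refund_policy"),
    ("How do I cancel my subscription?", "cancellation"),
]

recall = sum(expected in retrieve(q) for q, expected in labeled) / len(labeled)
```

This metric tells you the retrieval component works, but says nothing about whether the agent then uses the retrieved policy correctly in conversation; that is what the simulation covers.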

Read More

There is much more to the reasoning behind simulation-based testing; check out these articles: