Intro to Scenario (Testing AI agents)

Tahmid, AI Researcher @ LangWatch
Jun 13, 2025
You’ve built a new conversational agent. It’s powered by the latest LLM, it can call APIs, and it nails the examples you’ve written for it. You ship the demo and everyone is impressed.
But if you’ve ever tried to move an agent from a cool demo to a reliable production system, you know the real challenge isn’t building the prototype; it’s making it robust. The real world isn’t a clean, linear conversation. Users change their minds, ask tangentially related questions, and trigger complex edge cases you never thought of.
How do you test for that?
Traditionally, we’ve relied on static, manually created benchmarks. Think of datasets like τ-bench, which have been invaluable for the community. But they have limitations. They are often small, a few hundred examples at most, and creating them is a slow, expensive, manual process. More importantly, they typically give you a single, coarse-grained score: success or failure. They tell you that your agent failed, but not why.
This is a massive bottleneck. We can build agents faster than ever, but we can't evaluate them at the scale and granularity needed for production. What if we could automatically generate thousands of diverse, realistic test cases? And what if we could get a fine-grained diagnosis of exactly where our agent struggles?
At LangWatch, we’re building a framework that makes this possible. Our approach turns agent testing from a manual bottleneck into a scalable, diagnostic system.
The Problem with "Pass/Fail" Evaluation
When an agent fails, the reason is rarely simple. Did it fail to understand the user's intent? Did it call the wrong tool? Did it violate a critical business policy, like failing to get user consent before modifying a reservation?
Standard benchmarks can’t easily answer these questions. They check the final state: was the flight booked correctly? But they miss the messy details of the conversation flow and policy adherence. This leaves developers guessing. You know your agent's accuracy is 65%, but you have no idea where to start improving it.
Using AI Agents to Test AI Agents
At LangWatch, we’ve built a multi-agent evaluation loop that simulates real conversations, tracks everything your agent does, and critiques its behavior against your business rules.
The pipeline has three core stages:
1. Policy-Aware Scenario Generation
Instead of asking users to write hundreds of test cases, we let them define the business policies their agent should follow — things like “don’t issue refunds without checking order status” or “verify identity before giving account details.”
From this list, we generate realistic multi-step scenarios that test how your agent handles these policies, including the tricky edge cases where policies overlap or conflict.
Each scenario comes with:
A goal (e.g., “cancel my flight and rebook with points”)
An initial world state (e.g., flight status, user profile, loyalty points)
An event trace describing which policies should be triggered
This lets us evaluate how the agent performs in a rich, grounded context.
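To make this concrete, here is a minimal sketch, in plain Python rather than the LangWatch API, of how a policy list and one generated scenario might be represented. The Policy and Scenario classes and their fields are illustrative assumptions, not the framework's actual types.

```python
# Illustrative sketch (assumed data shapes, not the LangWatch API):
# policies are natural-language rules; each generated scenario is grounded
# in a goal, an initial world state, and the policies it should exercise.
from dataclasses import dataclass


@dataclass
class Policy:
    id: str
    description: str  # the business rule the agent must follow


@dataclass
class Scenario:
    goal: str                     # what the simulated user is trying to do
    initial_state: dict           # world state the conversation starts from
    expected_policies: list[str]  # policy ids this scenario should trigger


policies = [
    Policy("refund-check", "Don't issue refunds without checking order status"),
    Policy("verify-identity", "Verify identity before giving account details"),
]

# One scenario a generator might produce from those policies
scenario = Scenario(
    goal="cancel my flight and rebook with points",
    initial_state={"flight_status": "confirmed", "loyalty_points": 12000},
    expected_policies=["verify-identity"],
)
```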
2. Simulation and Critique
For each scenario, we run a full conversation simulation:
A User Agent plays the role of the customer, trying to achieve the scenario goal
Your Test Agent (the chatbot) interacts using your tools and logic
A Critique Agent reviews the full trace and flags where your agent followed or violated specific policies
Unlike simple benchmarks, we don’t just care about the final result. We track every step of the conversation: Did the agent ask for consent before making changes? Did it follow escalation protocols? Did it call the right tool, in the right order?
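As a rough illustration of this loop, the sketch below wires a user simulator, the agent under test, and a critique step together around a shared transcript. The open, respond, next, and judge methods are assumed interfaces for the sake of the example, not the LangWatch API.

```python
# Illustrative sketch (assumed interfaces, not the LangWatch API): drive the
# user simulator against the agent under test, record every turn, then hand
# the full transcript to a per-policy critique step.
def simulate(user_agent, test_agent, scenario, max_turns=10):
    transcript = []
    user_msg = user_agent.open(scenario.goal)        # simulated user's opening message
    for _ in range(max_turns):
        transcript.append({"role": "user", "content": user_msg})
        agent_msg = test_agent.respond(transcript)   # agent reply, may include tool calls
        transcript.append({"role": "assistant", "content": agent_msg})
        user_msg, done = user_agent.next(transcript) # simulated user reacts or ends
        if done:
            break
    return transcript


def critique(critic, transcript, policies):
    # One verdict per policy: followed, violated, or not applicable,
    # ideally with the turn where the violation happened.
    return [critic.judge(transcript, policy) for policy in policies]
```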
3. Evaluation Metrics
You can now easily measure the things you really care about:
Which policies were followed
Which were violated
Where in the conversation things went wrong
How performance varies with scenario complexity
This gives your team a clear roadmap for improvement.
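For example, if each critique verdict is stored as a small record, the metrics above can be rolled up with something like the following sketch. The field names (policy, verdict, turn, complexity) are assumptions made for illustration.

```python
# Illustrative sketch: aggregate per-scenario critique verdicts into the
# metrics described above. `results` is assumed to be a list of dicts like
# {"policy": "verify-identity", "verdict": "violated", "turn": 3, "complexity": 2}.
from collections import Counter, defaultdict


def summarize(results):
    verdicts = Counter(r["verdict"] for r in results)  # followed vs violated overall
    violations_by_policy = Counter(
        r["policy"] for r in results if r["verdict"] == "violated"
    )
    by_complexity = defaultdict(lambda: [0, 0])        # [violations, total] per level
    for r in results:
        bucket = by_complexity[r["complexity"]]
        bucket[1] += 1
        if r["verdict"] == "violated":
            bucket[0] += 1
    violation_rate = {level: v[0] / v[1] for level, v in by_complexity.items()}
    return verdicts, violations_by_policy, violation_rate
```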
Why This Matters
1. You don’t have to write 1,000 tests
By defining policies instead of individual scenarios, you can test more, faster. We generate diverse, realistic evaluations automatically, tailored to your app’s logic.
2. You get more than a success rate
LangWatch gives you insight into why agents fail, not just that they failed. You’ll see if it’s an intent mismatch, a bad tool call, or a policy violation. That means you can fix the root cause, not just the symptom.
3. You can pick the right model for the job
Not all LLMs fail the same way. Some degrade faster with complex tasks. LangWatch surfaces this by showing how performance shifts as complexity increases, helping you pick the right model for your use case.
From Cool Demo to Production-Ready Agents
LLMs make it easy to build agent demos. But building reliable, trustworthy, and policy-aware agents takes more than a good prompt.
LangWatch helps you close that gap by giving you the testing infrastructure, simulation tools, and diagnostic feedback needed to ship confidently.
We’re building the agent testing layer for the next generation of AI products. If you’re building agents and want to make them production-ready, we’d love to talk.
Book a call with us via this link.
Find our Scenario repository here.