Simulations to Test My Agent

Tahmid, AI Researcher @ LangWatch
May 28, 2025
If you're building or buying an AI agent, you’ve probably asked: how do I evaluate it?
Traditional AI agent testing methods fall short when agents face complex, multi-turn scenarios. Here’s how I switched to using real-world agent simulations instead and why it changed everything.
At first, it’s tempting to look at accuracy scores or whether the agent followed the right tool sequence. That works fine for simple, one-off tasks. But it breaks down fast when you're dealing with multi-turn agents - ones that have to ask follow-ups, deal with missing info, and actually resolve problems.
The agent I’m building doesn’t follow scripts. It adapts. It plans. It figures things out. So it didn’t make sense to evaluate it using rigid rules.
Instead, I started simulating real conversations - and checking if the agent did what mattered.
Why the usual way of testing AI agents doesn’t work
Most companies still evaluate AI agents the way you’d evaluate a spreadsheet macro:
User says: “I want to cancel my order”
Agent is supposed to call:
get_user() → get_order() → cancel_order()
If it follows that sequence, it passes. If not, it fails.
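In code, that style of evaluation usually boils down to a hard-coded assertion on the tool trace. A minimal sketch (the trace format and function name here are illustrative, not any particular framework's API):

```python
# Rigid, script-style evaluation: pass only if the exact tool sequence matches.
EXPECTED_SEQUENCE = ["get_user", "get_order", "cancel_order"]

def evaluate_rigid(tool_trace: list[str]) -> bool:
    """Pass only if the agent called exactly these tools, in exactly this order."""
    return tool_trace == EXPECTED_SEQUENCE

# This agent skipped get_user() because the user was already authenticated.
# It still cancelled the order, but the eval marks it as a failure.
print(evaluate_rigid(["get_order", "cancel_order"]))  # False
```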
This works only if there’s one correct path. But in the real world, there isn’t.
Sometimes the agent can skip a step. Sometimes it can recover from missing data. Sometimes it finds a shortcut.
Judging it by a fixed sequence punishes creativity and misses the point.
The shift: from scripts to simulations
So here’s what I do now.
I write small simulated scenarios, like “user wants to cancel an order but doesn’t remember the order ID”, and let the agent handle it however it wants.
Then I define what success looks like using simple yes/no checks.
Did the agent...
Look up the right order?
Tell the user whether it could be cancelled?
Communicate clearly?
That’s it.
I don’t care which tools it used or in what order. I care whether the problem got solved.
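Concretely, here’s a minimal sketch of that setup. The harness below is hypothetical, not the code from the notebook linked later: your agent, a user simulator, and a judge are passed in as plain callables, and the only thing graded is the set of yes/no outcome checks on the final transcript.

```python
from dataclasses import dataclass
from typing import Callable

Transcript = list[tuple[str, str]]  # (role, message) pairs

@dataclass
class Scenario:
    description: str        # what the simulated user is trying to do
    checks: dict[str, str]  # check name -> yes/no question about the final transcript

def run_scenario(
    scenario: Scenario,
    agent: Callable[[str, Transcript], str],           # your agent: (user message, history) -> reply
    simulated_user: Callable[[str, Transcript], str],  # user simulator: (goal, history) -> next message
    judge: Callable[[str, Transcript], bool],          # grader: (yes/no question, transcript) -> verdict
    max_turns: int = 8,
) -> dict[str, bool]:
    """Let the agent handle the conversation however it wants, then grade only the outcome."""
    transcript: Transcript = []
    for _ in range(max_turns):
        user_msg = simulated_user(scenario.description, transcript)
        transcript.append(("user", user_msg))
        transcript.append(("agent", agent(user_msg, transcript)))
    # Binary outcome checks: no assertions on which tools were called, or in what order.
    return {name: judge(question, transcript) for name, question in scenario.checks.items()}

cancel_without_id = Scenario(
    description="User wants to cancel an order but doesn't remember the order ID.",
    checks={
        "found_order": "Did the agent look up the right order?",
        "clear_answer": "Did the agent tell the user whether it could be cancelled?",
        "clear_comms": "Did the agent communicate clearly?",
    },
)
# results = run_scenario(cancel_without_id, agent=my_agent, simulated_user=llm_user, judge=llm_judge)
```

In practice the simulated user and the judge are usually just LLM calls (the judge gets each question plus the transcript and answers yes or no), but simple programmatic checks work too.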
What changed when I did this
Two things happened immediately:
I stopped getting false negatives.
Agents that did the right thing no longer failed just because they took a different path.
I started getting useful signals.
It became obvious which scenarios the agent struggled with—things like ambiguity, missing context, or edge cases.
That feedback loop made it 10x easier to improve the system.
This isn’t just a better eval method. It’s a better way to think.
If you’ve got agents making decisions, you need evaluation to match reality. Not a checklist. Not a script. Reality.
Simulations + binary outcomes gave me that. It's not perfect—it takes a bit of thought to write good scenarios and clear success criteria. But it’s the only approach I’ve found that scales with complexity.
It answers the real question:
“Did the agent get the job done?”
Not “Did it follow the instructions?”
Whether you’re building agents or buying them, this kind of evaluation will become the new standard. I’ve open-sourced a simple notebook showing how it works: Try it here
❓ FAQ: Evaluating AI Agents with simulations
1. Why isn’t accuracy or tool usage enough to evaluate AI agents?
Traditional metrics like accuracy or exact tool sequences fail for multi-turn agents. These agents often solve problems in diverse, flexible ways—so rigid evaluations miss real-world success.
2. What’s a simulation-based agent evaluation?
It’s a method where you run agents through realistic scenarios (e.g., missing info, follow-ups) and check if they achieved the outcome, regardless of path. It’s scalable, adaptable, and human-centric.
3. How do I start testing my AI agents this way?
You can begin by designing a few realistic user scenarios, defining success with binary outcomes, and running them in a notebook. Try our open-source example here or book a call with us for guidance.
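As a rough sketch of what “running them in a notebook” can look like, here is a small suite built on the hypothetical Scenario/run_scenario harness from earlier; the second scenario and the my_agent, llm_user, and llm_judge callables are made-up placeholders for your own stack.

```python
# Run a small suite of scenarios and tally the binary outcomes per scenario.
scenarios = [
    Scenario(
        description="User wants to cancel an order but doesn't remember the order ID.",
        checks={
            "found_order": "Did the agent look up the right order?",
            "clear_answer": "Did the agent tell the user whether it could be cancelled?",
        },
    ),
    Scenario(
        description="User asks to cancel an order that has already shipped.",
        checks={
            "clear_answer": "Did the agent explain that the order could not be cancelled?",
            "next_step": "Did the agent offer a return or another next step?",
        },
    ),
]

for scenario in scenarios:
    results = run_scenario(scenario, agent=my_agent, simulated_user=llm_user, judge=llm_judge)
    print(f"{scenario.description}: {sum(results.values())}/{len(results)} checks passed -> {results}")
```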
I'm Tahmid, AI researcher @ LangWatch, with previous AI/ML experience at the secret service, corporate financial institutions, and my own AI start-up.
Schedule a call with me via this link to learn more about evaluating agents.