Simulation-Based Eval Framework

Tahmid - AI Researcher @ LangWatch
Jun 6, 2025
Introduction: A simulation-based evaluation framework
In AI-powered customer support, building an agent is just the start. The real challenge is making sure it works: reliably, accurately, and at scale. Manual spot-checking doesn’t cut it. You need systematic, repeatable evaluation. In this post, I break down a simulation-based evaluation framework I’ve built to rigorously test AI support agents in end-to-end workflows.
The core idea: simulate real user-agent conversations, then evaluate the output against known ground truths or domain-specific heuristics. This gives you hard metrics instead of gut feel.
Why Simulation?
When deploying AI agents in customer support, you face at least four persistent challenges:
Consistency: The agent should answer the same question the same way, regardless of when or how it’s asked.
Accuracy: Responses must contain correct information pulled from trusted sources or tools.
Scalability: You need to evaluate hundreds or thousands of interactions—not just a handful.
Iteration: Improvements should be measurable. If your agent gets better, your metrics should reflect it.
Traditional testing methods, such as manual QA, ad hoc prompts, or subjective evaluation, fail to scale and often miss failure patterns. Simulation addresses this by treating the agent like a system under test.
The Architecture
The simulation framework is built around three core agents:
Support Agent: The AI agent under evaluation.
Simulator Agent: Emulates realistic customer behavior and conversation flows.
Evaluator Agent: Scores the agent’s performance using pre-defined expectations or heuristics.
Each agent plays a well-defined role in the loop.
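Before going agent by agent, here is a minimal sketch of how the three pieces could be wired together. The respond and next_user_turn methods, and the agent objects themselves, are illustrative stand-ins rather than the framework's actual API; the Evaluator scores the finished transcript separately (see section 3).

def run_simulation(support_agent, simulator, seed_question, max_turns=6):
    # Seed the conversation with the simulated customer's opening question.
    transcript = [{"role": "user", "content": seed_question}]
    for _ in range(max_turns):
        # The Support Agent answers the latest user message (it may call tools internally).
        reply = support_agent.respond(transcript)
        transcript.append({"role": "assistant", "content": reply})
        # The Simulator decides how the customer reacts, or ends the conversation.
        follow_up = simulator.next_user_turn(transcript)
        if follow_up is None:
            break
        transcript.append({"role": "user", "content": follow_up})
    # Hand the finished transcript to the Evaluator for scoring.
    return transcript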
1. Support Agent
The Support Agent is the actual system you’re validating. It receives user queries, calls tools (e.g. search docs, update tickets, fetch orders), and returns answers. It may be a ReAct-style agent or a monolithic LLM with internal function routing. You can wire it up however you want.
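As a rough illustration (not the actual implementation), a bare-bones Support Agent with simple tool routing might look like the sketch below; call_llm and the tool functions are placeholders for whatever LLM client and tools you actually wire in.

class SupportAgent:
    def __init__(self, call_llm, tools):
        self.call_llm = call_llm   # any chat-completion callable: messages -> reply dict
        self.tools = tools         # e.g. {"search_docs": fn, "fetch_order": fn}

    def respond(self, transcript):
        # First pass: let the LLM decide whether it needs a tool.
        reply = self.call_llm(transcript, tool_names=list(self.tools))
        if reply.get("tool"):
            # Run the requested tool and feed the result back for the final answer.
            result = self.tools[reply["tool"]](**reply.get("args", {}))
            transcript = transcript + [{"role": "tool", "content": str(result)}]
            reply = self.call_llm(transcript)
        return reply["content"]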
2. Simulator Agent
The Simulator Agent acts as the user. It’s designed to:
Ask a query (e.g. “Where is my order?”)
Respond naturally to follow-ups from the support agent
Adapt to agent responses and generate multi-turn conversations
To achieve this, the simulator inverts roles during the conversation history, allowing it to take in the agent’s responses as if they came from a human:
def _filter_context(self, conversation_context: ConversationContext):
    """Flip user/assistant roles so the agent's replies read as incoming user messages."""
    filtered_context = []
    for msg in conversation_context.task_history:
        if isinstance(msg, dict) and 'role' in msg and msg['role'] in ["user", "assistant"]:
            # Swap roles: the Support Agent's answers become the simulator's "user" input.
            flipped_role = "user" if msg['role'] == "assistant" else "assistant"
            filtered_context.append({
                "role": flipped_role,
                "content": msg.get('content', '')
            })
    return filtered_context
This trick lets us maintain a natural dialogue flow while still controlling the simulation deterministically.
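For example (a hypothetical exchange, not taken from the real codebase), the flip turns the Support Agent's last answer into an incoming "user" message, so a plain chat-completion call on the flipped history naturally produces the customer's next turn:

history = [
    {"role": "user", "content": "Where is my order?"},                       # simulated customer
    {"role": "assistant", "content": "Could you share your order number?"},  # Support Agent
]
# After the role flip, the simulator's LLM sees:
#   [{"role": "assistant", "content": "Where is my order?"},
#    {"role": "user", "content": "Could you share your order number?"}]
# and, prompted as the customer, replies with something like "It's #12345".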
3. Evaluator Agent
The Evaluator compares the conversation transcript to a gold standard outcome. This could be:
A golden answer (e.g. retrieved from historical ticket resolutions)
A golden instruction (e.g. “The agent should verify the customer’s email before sharing tracking info”)
A rule-based or LLM-based heuristic (e.g. accuracy of tool usage, hallucination detection, tone checking)
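For the LLM-based variant, one simple implementation is an LLM-as-judge prompt. The sketch below assumes the openai Python client and a judge prompt I made up for illustration; it is not the exact prompt or model used in the framework.

from openai import OpenAI

client = OpenAI()

def judge(transcript_text: str, golden_answer: str) -> bool:
    # Ask the judge model for a binary verdict against the golden answer.
    prompt = (
        "You are grading a support conversation.\n"
        f"Golden answer:\n{golden_answer}\n\n"
        f"Conversation:\n{transcript_text}\n\n"
        "Did the support agent convey the golden answer correctly? Reply YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")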
The Experiment
For this experiment, I scraped the LangWatch FAQ, which contains 10 curated Q/A pairs. These served as golden references for evaluation. I configured the Support Agent with access to tools that could retrieve information from our documentation and website (excluding the FAQ page to avoid leakage).
The Evaluator Agent (powered by an LLM) was given the full conversation transcript between the Simulator and Support Agent, along with the golden answer. It was asked to judge whether the Support Agent correctly answered the user's question.
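Concretely, the experiment reduces to a loop like the sketch below, reusing the hypothetical run_simulation and judge helpers from the earlier sketches; support_agent, simulator, and faq_pairs are assumed to be set up already, and faq_pairs stands in for the 10 scraped Q/A pairs.

# Score the Support Agent against the scraped FAQ pairs (placeholder content).
faq_pairs = [
    {"question": "...", "golden_answer": "..."},   # the 10 scraped Q/A pairs go here
]

correct = 0
for pair in faq_pairs:
    transcript = run_simulation(support_agent, simulator, pair["question"])
    transcript_text = "\n".join(f"{m['role']}: {m['content']}" for m in transcript)
    if judge(transcript_text, pair["golden_answer"]):
        correct += 1

print(f"Accuracy: {correct}/{len(faq_pairs)} = {correct / len(faq_pairs):.0%}")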
The result: the agent correctly answered 7 to 8 out of 10 questions, an accuracy of 70–80%. This surfaced both coverage gaps and reasoning failures in the current tool-agent setup, and it gives me a baseline to beat as I add more tools and experiment with multi-agent structures.
Conclusion
With this setup, I can iterate with confidence: add new tools, add more agents (a multi-agent architecture), improve retrieval, and evaluate whether each change leads to measurable gains. It also lays the foundation for expanding beyond FAQ-style benchmarks toward a more realistic eval set grounded in real support scenarios.
I'm Tahmid, an AI researcher @ LangWatch, with previous AI/ML experience at the secret service, corporate financial institutions, and my own AI start-up. Schedule a call with me via this link to learn more about evaluating agents.