Simulations from First Principles (How to test your agents)

Tahmid, AI researcher @ LangWatch
Jun 12, 2025
We’re getting better at building AI agents. With access to tools and long-term memory, these systems can now reason, plan, and execute complex, multi-step tasks. But as we move from building proof-of-concepts to production systems, we face a critical engineering challenge: how do we know if they actually work?
More importantly, how do we know they’ll work reliably?
Traditional evaluation methods, designed for static, single-turn LLMs, are not enough. Agents are stateful, dynamic systems. They make decisions over time, recover from errors, and adapt to new information. To build robust, autonomous agents, we need to evaluate them with the same rigor we use to design their architecture.
This post introduces a framework for evaluating agents from first principles. We'll cover:
Why traditional benchmarks fail for agentic workflows.
Moving from static tests to dynamic, multi-step simulations.
A formal framework for building systematic evaluation pipelines.
Metrics that capture agent behavior, not just final accuracy.
The goal isn't just to test agents, but to build a feedback loop that drives continuous improvement.
Why Agent Evaluation Needs a New Approach
Testing a simple language model is often treated as a straightforward classification or generation problem: you provide an input, you check the output. Benchmarks like MMLU are built on this premise.
Agents, however, aren’t simple input-output functions. They are processes. An agent behaves like a program, executing a sequence of operations, using tools, and maintaining state.
Consider a customer support agent. A typical workflow might involve:
Looking up an order ID in a Shopify database.
Checking the refund policy based on the order details.
Deciding whether to issue the refund.
Updating the support ticket in Zendesk and notifying the user.
Success isn't measured by the final text message alone. It’s measured by the correctness of the entire process. A brittle agent might produce a correct answer by chance in a controlled test, only to fail spectacularly when a user changes their mind or provides a faulty order ID.
To build reliable agents, we must evaluate the process, not just the final output.
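To ground the rest of the post, here is a minimal sketch of what those four tools might look like as Python stubs. Every name, argument, and return shape here is an illustrative assumption, not a real Shopify or Zendesk integration:

```python
# Illustrative tool stubs for the support-agent workflow above.
# All names and data shapes are assumptions for the sake of example;
# real integrations would call the Shopify and Zendesk APIs.

def lookup_order(order_id: str) -> dict:
    """Fetch order details from the store backend."""
    return {"order_id": order_id, "total": 49.99, "days_since_delivery": 12}

def check_refund_policy(order: dict) -> bool:
    """Apply a business rule: refunds allowed within 30 days of delivery."""
    return order["days_since_delivery"] <= 30

def issue_refund(order_id: str) -> dict:
    """Trigger the refund and return a confirmation record."""
    return {"order_id": order_id, "refunded": True}

def update_ticket(ticket_id: str, resolution: str) -> None:
    """Record the outcome on the support ticket and notify the user."""
    print(f"Ticket {ticket_id} updated: {resolution}")
```

Each workflow step maps to one tool call; the evaluation question is whether the agent sequences and uses them correctly, not just whether each call works in isolation.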
From Static Checks to Dynamic Simulations
Traditional tests are static. They are the unit tests of the AI world: one fixed input, one expected output. They are fast, scalable, and excellent for catching basic functional errors in an agent’s individual tools or actions.
But agents live in dynamic environments. This is where simulations come in. Simulations are the integration tests and end-to-end tests for agentic systems. They create controlled, interactive environments where we can observe an agent’s behavior over time.
One of the most powerful simulation techniques is agent-to-agent simulation. Here, one agent (a `simulator_agent`) plays the role of a dynamic user, while an `evaluator_agent` observes and scores the interaction.
| Eval Type | Description | Strengths | Weaknesses |
|---|---|---|---|
| Static Tests | Fixed input-output checks (e.g., unit tests for agent actions) | Fast, scalable, good for catching basic errors | Misses long-term reasoning, tool use, or adaptability |
| Simulations | Dynamic setups where agents act, including agent-to-agent testing | Tests decision-making, tool use, and real-world behavior | Harder to design and score automatically |
Simulations allow us to ask more interesting questions. What happens if the `simulator_agent` escalates its tone? What if it provides incomplete information? We can observe how the `agent_under_test` handles ambiguity, retries failed actions, and navigates complex conversational flows.
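As a rough sketch of how such a loop can be wired up, the three roles can be treated as interchangeable callables; the `<END>` stop signal and the role labels below are our own conventions, not any particular framework's API:

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, text), e.g. ("user", "I want a refund.")

def simulate_conversation(
    simulator: Callable[[List[Message]], str],        # simulator_agent: plays the user
    agent_under_test: Callable[[List[Message]], str],
    evaluator: Callable[[List[Message]], dict],       # evaluator_agent: scores the transcript
    max_turns: int = 10,
) -> dict:
    transcript: List[Message] = []
    for _ in range(max_turns):
        user_msg = simulator(transcript)              # may escalate, correct itself, or stop
        if user_msg == "<END>":                       # simulator signals the conversation is over
            break
        transcript.append(("user", user_msg))
        transcript.append(("agent", agent_under_test(transcript)))
    return {"transcript": transcript, "verdict": evaluator(transcript)}
```

Because the simulator and evaluator are just callables, the same loop can replay scripted users, adversarial users, or fully LLM-driven ones without changing the harness.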
Simulation as a Formalization of Agent Behavior
To build a robust evaluation system, we need to be precise. We can formalize an agent’s behavior using a structure similar to the ReAct framework.
At each timestep $t$, the agent observes an input $o_t$, chooses an action $a_t$ from the action space $A$, and receives an updated observation $o_{t+1}$. The simulation continues until a terminal action is reached. The core components are:
A policy $\pi(o_t) \rightarrow a_t$ (typically an LLM)
An environment $E$ that handles actions and returns observations
A trace $T = [(o_1, a_1), \ldots, (o_n, a_n)]$ representing the full interaction
The trace is the most critical artifact for evaluation. It captures the agent’s reasoning path: how it decomposes a problem, uses its tools, and course-corrects. As the original ReAct paper showed, the trajectory often matters more than the final answer. Evals must capture this trace, not just the endpoint.
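That formalism maps almost one-to-one onto code. A minimal sketch, assuming the policy and environment are callables you supply (the `Step` and `run_episode` names are ours, for illustration only):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Observation = str
Action = str

@dataclass
class Step:
    observation: Observation   # o_t
    action: Action             # a_t

def run_episode(
    policy: Callable[[Observation], Action],                  # pi(o_t) -> a_t, typically an LLM
    environment: Callable[[Action], Tuple[Observation, bool]],  # executes a_t, returns (o_{t+1}, done)
    initial_observation: Observation,
    max_steps: int = 20,
) -> List[Step]:
    trace: List[Step] = []     # T = [(o_1, a_1), ..., (o_n, a_n)]
    obs = initial_observation
    for _ in range(max_steps):
        action = policy(obs)
        trace.append(Step(obs, action))
        obs, done = environment(action)
        if done:               # terminal action reached
            break
    return trace
```

The returned trace is exactly the artifact the rest of the pipeline evaluates.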
A Framework for Systematic Evaluation
A modern agent evaluation pipeline mirrors the principles of CI/CD in software engineering. It’s an automated, repeatable loop.
1. Scenario Generation: Define the task. This can be a static dataset of instructions or, in an agent-to-agent setup, a `simulator_agent` that generates dynamic and adversarial user behavior.
2. Simulation: Execute the scenario. The `agent_under_test` interacts with the environment (and potentially the `simulator_agent`) to complete the task.
3. Tracing: Log everything. Every observation, thought, and action is recorded into the trace. This is non-negotiable for diagnostics.
4. Evaluation: Score the trace. This can be done with rule-based heuristics (e.g., "Did the agent call the Shopify API before responding?") or an `evaluator_agent` that assesses the quality of the interaction based on a rubric.
5. Metrics Aggregation: Aggregate scores across thousands of runs to measure performance, consistency, and failure modes.
For example, the `simulator_agent` mimics a customer: "I want a refund." Then, it throws a curveball: "Oops, that was the wrong order ID. Try this one instead." The `agent_under_test` must adapt. The `evaluator_agent` scores the trace on whether the agent used the correct APIs, handled the user's correction gracefully, and adhered to business logic.
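One way to sketch that scoring step, assuming a trace of `Step` records as above, an action-string convention like `"lookup_order(...)"` / `"reply: ..."`, and an `llm_judge` callable you provide yourself (all illustrative choices, not a prescribed interface):

```python
# Rule-based heuristics over the trace, plus an LLM-scored rubric.
# `trace` is a list of Step records from the earlier sketch; `llm_judge` is any
# callable (rubric, trace) -> float in [0, 1], e.g. a wrapper around an LLM call.

RUBRIC = (
    "Score 0-1: did the agent use the correct APIs, handle the user's "
    "order-ID correction gracefully, and follow the refund policy?"
)

def called_order_lookup_first(trace) -> bool:
    """Heuristic: the order API must be hit before the first user-facing reply."""
    for step in trace:
        if step.action.startswith("lookup_order"):
            return True
        if step.action.startswith("reply:"):
            return False
    return False

def evaluate(trace, llm_judge) -> dict:
    return {
        "checked_order_first": called_order_lookup_first(trace),  # rule-based check
        "rubric_score": llm_judge(RUBRIC, trace),                  # evaluator_agent as judge
    }
```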
Metrics That Matter for Agents
Standard NLP metrics like BLEU or ROUGE are mostly irrelevant here. We need task-oriented metrics that measure behavioral quality, often scored by an `evaluator_agent`.
Task Success: Did the agent ultimately achieve its goal? (Binary or scaled score).
Tool Accuracy: Did it call the right tools with the correct arguments?
Efficiency: How many steps or tool calls were required? Fewer is usually better.
Groundedness: Did the agent’s responses rely on retrieved information, or did it hallucinate facts?
Adaptability: How well did it handle unexpected inputs or errors?
Consistency: Does the agent behave reliably across similar but distinct scenarios?
By running simulations at scale, we can measure the variance in these metrics to understand our agent's reliability.
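Aggregation itself is plain statistics. A small sketch, assuming each run produced a dict with `task_success`, `rubric_score`, and `num_tool_calls` fields (the field names are illustrative):

```python
from statistics import mean, pstdev

def aggregate(runs: list[dict]) -> dict:
    """Summarize per-run scores across many simulation runs."""
    rubric_scores = [r["rubric_score"] for r in runs]
    return {
        "task_success_rate": mean(1.0 if r["task_success"] else 0.0 for r in runs),
        "mean_rubric_score": mean(rubric_scores),
        "rubric_score_stddev": pstdev(rubric_scores),        # spread is the (in)consistency signal
        "mean_tool_calls_per_run": mean(r["num_tool_calls"] for r in runs),
    }
```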
The Future is Simulation-Driven Development
Simulations are more than just an evaluation tool. They are becoming a core component of the development lifecycle itself.
Agent-to-agent simulations create a dynamic curriculum for training. The `simulator_agent` can generate a diverse range of scenarios, and the `evaluator_agent` can provide real-time feedback, enabling a tight loop of interactive refinement. This is where evaluation closes the loop with training, moving us toward true simulation-driven development.
Implementing this requires robust tooling. Platforms like LangWatch are designed specifically for this challenge. LangWatch helps you capture and visualize the complex traces from agentic systems, run agent-to-agent simulations, and integrate automated `evaluator_agent`s directly into your CI/CD pipeline. It provides the observability and evaluation infrastructure needed to turn the framework described here into a practical reality.
Book a call with us via this link