Simulations from First Principles (How to test your agents)

Tahmid, AI researcher @ LangWatch

Jun 12, 2025

We’re getting better at building AI agents. With access to tools and long-term memory, these systems can now reason, plan, and execute complex, multi-step tasks. But as we move from building proof-of-concepts to production systems, we face a critical engineering challenge: how do we know if they actually work?

More importantly, how do we know they’ll work reliably?

Traditional evaluation methods, designed for static, single-turn LLMs, are not enough. Agents are stateful, dynamic systems. They make decisions over time, recover from errors, and adapt to new information. To build robust, autonomous agents, we need to evaluate them with the same rigor we use to design their architecture.

This post introduces a framework for evaluating agents from first principles. We'll cover:

  • Why traditional benchmarks fail for agentic workflows.

  • Moving from static tests to dynamic, multi-step simulations.

  • A formal framework for building systematic evaluation pipelines.

  • Metrics that capture agent behavior, not just final accuracy.

The goal isn't just to test agents, but to build a feedback loop that drives continuous improvement.

Why Agent Evaluation Needs a New Approach

Testing a simple language model is often treated as a straightforward classification or generation problem: you provide an input, you check the output. Benchmarks like MMLU are built on this premise.

Agents, however, aren’t simple input-output functions. They are processes. An agent behaves like a program, executing a sequence of operations, using tools, and maintaining state.

Consider a customer support agent. A typical workflow might involve:

  1. Looking up an order ID in a Shopify database.

  2. Checking the refund policy based on the order details.

  3. Deciding whether to issue the refund.

  4. Updating the support ticket in Zendesk and notifying the user.

Success isn't measured by the final text message alone. It’s measured by the correctness of the entire process. A brittle agent might produce a correct answer by chance in a controlled test, only to fail spectacularly when a user changes their mind or provides a faulty order ID.

To build reliable agents, we must evaluate the process, not just the final output.
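
To make that concrete, here is a minimal sketch of a process-level check: it compares the tools an agent actually called against the workflow above. The trace format and tool names are assumptions for illustration, not a specific framework's API.

```python
# Hypothetical trace recorded from one run of the support agent.
# Each step is (tool_name, arguments) -- the structure is illustrative only.
trace = [
    ("shopify.lookup_order", {"order_id": "A1001"}),
    ("policy.check_refund_eligibility", {"order_id": "A1001"}),
    ("shopify.issue_refund", {"order_id": "A1001"}),
    ("zendesk.update_ticket", {"ticket_id": "Z42", "status": "resolved"}),
]

# The process we expect, in order. A process-level check verifies the
# workflow itself, not just the final user-facing message.
expected_tool_sequence = [
    "shopify.lookup_order",
    "policy.check_refund_eligibility",
    "shopify.issue_refund",
    "zendesk.update_ticket",
]

def process_is_correct(trace, expected_sequence):
    """True if the agent called the expected tools in the expected order."""
    called = [tool for tool, _ in trace]
    return called == expected_sequence

assert process_is_correct(trace, expected_tool_sequence)
```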

From Static Checks to Dynamic Simulations

Traditional tests are static. They are the unit tests of the AI world: one fixed input, one expected output. They are fast, scalable, and excellent for catching basic functional errors in an agent’s individual tools or actions.
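
A static check for a single tool is just an ordinary unit test, as in this sketch (`lookup_order` is a hypothetical tool stub, not a real API):

```python
# A static, unit-test-style check for one tool: fixed input, expected output.
def lookup_order(order_id: str) -> dict:
    """Hypothetical tool wrapper, stubbed with an in-memory table."""
    orders = {"A1001": {"order_id": "A1001", "status": "delivered"}}
    return orders.get(order_id, {"order_id": order_id, "status": "not_found"})

def test_lookup_order_returns_known_order():
    assert lookup_order("A1001")["status"] == "delivered"

def test_lookup_order_handles_unknown_id():
    assert lookup_order("ZZZZ")["status"] == "not_found"
```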

But agents live in dynamic environments. This is where simulations come in. Simulations are the integration tests and end-to-end tests for agentic systems. They create controlled, interactive environments where we can observe an agent’s behavior over time.

One of the most powerful simulation techniques is agent-to-agent simulation. Here, one agent—a simulator_agent—plays the role of a dynamic user, while an evaluator_agent observes and scores the interaction.

| Eval Type | Description | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Static Tests | Fixed input-output checks (e.g., unit tests for agent actions) | Fast, scalable, good for catching basic errors | Misses long-term reasoning, tool use, or adaptability |
| Simulations | Dynamic setups where agents act, including agent-to-agent testing | Tests decision-making, tool use, and real-world behavior | Harder to design and score automatically |

Simulations allow us to ask more interesting questions. What happens if the simulator_agent escalates its tone? What if it provides incomplete information? We can observe how the agent_under_test handles ambiguity, retries failed actions, and navigates complex conversational flows.
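
A minimal sketch of that loop, with simple callables standing in for the three roles. The names `simulator_agent`, `agent_under_test`, and `evaluator_agent` mirror the text; their bodies are illustrative stubs, not any specific library's API.

```python
def simulator_agent(transcript):
    """Plays the user. A scripted stand-in; in a real setup this is an LLM."""
    script = [
        "I want a refund for order A1001.",
        "Actually, I'm not sure that's the right order ID.",
        "DONE",
    ]
    return script[min(len(transcript) // 2, len(script) - 1)]

def agent_under_test(transcript, user_message):
    """The agent being evaluated. A placeholder for the real system."""
    return f"(agent reply to: {user_message})"

def evaluator_agent(transcript):
    """Observes and scores the interaction. A trivial heuristic placeholder;
    in practice this would apply a rubric, often via another LLM."""
    agent_turns = [text for role, text in transcript if role == "agent"]
    return {
        "turns": len(agent_turns),
        "responded_to_every_user_message": len(agent_turns) * 2 == len(transcript),
    }

transcript = []
while True:
    user_msg = simulator_agent(transcript)
    if user_msg == "DONE":
        break
    reply = agent_under_test(transcript, user_msg)
    transcript += [("user", user_msg), ("agent", reply)]

print(evaluator_agent(transcript))
```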

Simulation as a Formalization of Agent Behavior

To build a robust evaluation system, we need to be precise. We can formalize an agent’s behavior using a structure similar to the ReAct framework.

At each timestep $t$, the agent observes an input $o_t$, chooses an action $a_t$ from the action space $A$, and receives an updated observation $o_{t+1}$. The simulation continues until a terminal action is reached. The core components are:

  • A policy $\pi(o_t) \to a_t$ (typically an LLM)

  • An environment $E$ that handles actions and returns observations

  • A trace $T = [(o_1, a_1), \ldots, (o_n, a_n)]$ representing the full interaction

The trace is the most critical artifact for evaluation. It captures the agent’s reasoning path: how it decomposes a problem, uses its tools, and course-corrects. As the original ReAct paper showed, the trajectory often matters more than the final answer. Evals must capture this trace, not just the endpoint.
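
A minimal sketch of this loop, with toy stand-ins for the policy and environment. In practice $\pi$ is an LLM call and $E$ wraps real tools; the names below are assumptions for illustration only.

```python
from typing import Callable, List, Tuple

Observation = str
Action = str

def run_simulation(policy: Callable[[Observation], Action],
                   env_step: Callable[[Action], Tuple[Observation, bool]],
                   initial_obs: Observation,
                   max_steps: int = 20) -> List[Tuple[Observation, Action]]:
    """Roll out one episode and return the trace T = [(o_1, a_1), ..., (o_n, a_n)]."""
    trace = []
    obs = initial_obs
    for _ in range(max_steps):
        action = policy(obs)          # a_t = pi(o_t), typically an LLM call
        trace.append((obs, action))
        obs, done = env_step(action)  # environment E returns o_{t+1}
        if done:                      # terminal action reached
            break
    return trace

# Trivial stand-ins, for illustration only.
def toy_policy(obs: Observation) -> Action:
    return "finish" if "order found" in obs else "lookup_order"

def toy_env(action: Action) -> Tuple[Observation, bool]:
    if action == "lookup_order":
        return "order found: A1001", False
    return "episode over", True

trace = run_simulation(toy_policy, toy_env, initial_obs="user: I want a refund")
print(trace)
```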

A Framework for Systematic Evaluation

A modern agent evaluation pipeline mirrors the principles of CI/CD in software engineering. It’s an automated, repeatable loop.

  1. Scenario Generation: Define the task. This can be a static dataset of instructions or, in an agent-to-agent setup, a simulator_agent that generates dynamic and adversarial user behavior.

  2. Simulation: Execute the scenario. The agent_under_test interacts with the environment (and potentially the simulator_agent) to complete the task.

  3. Tracing: Log everything. Every observation, thought, and action is recorded into the trace. This is non-negotiable for diagnostics.

  4. Evaluation: Score the trace. This can be done with rule-based heuristics (e.g., "Did the agent call the Shopify API before responding?") or an evaluator_agent that assesses the quality of the interaction based on a rubric.

  5. Metrics Aggregation: Aggregate scores across thousands of runs to measure performance, consistency, and failure modes.

For example, the simulator_agent mimics a customer: "I want a refund." Then, it throws a curveball: "Oops, that was the wrong order ID. Try this one instead." The agent_under_test must adapt. The evaluator_agent scores the trace on whether it used the correct APIs, handled the user’s correction gracefully, and adhered to business logic.
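
The rule-based side of step 4 can be as simple as assertions over the trace. Here is a sketch under an assumed trace layout; the step types, tool names, and order IDs are illustrative, not a specific platform's schema.

```python
# Hypothetical trace: a flat list of steps, each tagged with a type.
trace = [
    {"type": "user_message", "content": "I want a refund for order A1001."},
    {"type": "tool_call", "name": "shopify.lookup_order", "args": {"order_id": "A1001"}},
    {"type": "user_message", "content": "Oops, wrong order ID. Try A1002 instead."},
    {"type": "tool_call", "name": "shopify.lookup_order", "args": {"order_id": "A1002"}},
    {"type": "agent_message", "content": "Your refund for order A1002 has been issued."},
]

def called_tool_before_responding(trace, tool_name):
    """Rule: the agent must call `tool_name` before sending any user-facing reply."""
    for step in trace:
        if step["type"] == "tool_call" and step["name"] == tool_name:
            return True
        if step["type"] == "agent_message":
            return False
    return False

def handled_correction(trace):
    """Rule: after the user corrects the order ID, the agent must re-query with it."""
    return any(step["type"] == "tool_call" and step["args"].get("order_id") == "A1002"
               for step in trace)

scores = {
    "called_shopify_first": called_tool_before_responding(trace, "shopify.lookup_order"),
    "handled_correction": handled_correction(trace),
}
print(scores)
```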

Metrics That Matter for Agents

Standard NLP metrics like BLEU or ROUGE are mostly irrelevant here. We need task-oriented metrics that measure behavioral quality, often scored by an evaluator_agent.

  • Task Success: Did the agent ultimately achieve its goal? (Binary or scaled score).

  • Tool Accuracy: Did it call the right tools with the correct arguments?

  • Efficiency: How many steps or tool calls were required? Fewer is usually better.

  • Groundedness: Did the agent’s responses rely on retrieved information, or did it hallucinate facts?

  • Adaptability: How well did it handle unexpected inputs or errors?

  • Consistency: Does the agent behave reliably across similar but distinct scenarios?

By running simulations at scale, we can measure the variance in these metrics to understand our agent's reliability.
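
A sketch of that aggregation, assuming each run yields a dictionary of scores like the metrics above (the field names and values are illustrative):

```python
from statistics import mean, pstdev

# Hypothetical per-run scores produced by the evaluator over many simulations.
runs = [
    {"task_success": 1.0, "tool_accuracy": 1.0, "steps": 6},
    {"task_success": 0.0, "tool_accuracy": 0.5, "steps": 11},
    {"task_success": 1.0, "tool_accuracy": 1.0, "steps": 7},
]

def aggregate(runs):
    """Mean and spread per metric; the spread is a cheap proxy for consistency."""
    metrics = runs[0].keys()
    return {
        m: {"mean": mean(r[m] for r in runs), "stdev": pstdev(r[m] for r in runs)}
        for m in metrics
    }

for metric, stats in aggregate(runs).items():
    print(f"{metric}: mean={stats['mean']:.2f}, stdev={stats['stdev']:.2f}")
```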

The Future Is Simulation-Driven Development

Simulations are more than just an evaluation tool. They are becoming a core component of the development lifecycle itself.

Agent-to-agent simulations create a dynamic curriculum for training. The simulator_agent can generate a diverse range of scenarios, and the evaluator_agent can provide real-time feedback, enabling a tight loop of interactive refinement. This is where evaluation closes the loop with training, moving us toward true simulation-driven development.

Implementing this requires robust tooling. Platforms like LangWatch are designed specifically for this challenge. LangWatch helps you capture and visualize the complex traces from agentic systems, run agent-to-agent simulations, and integrate automated evaluator_agents directly into your CI/CD pipeline. It provides the observability and evaluation infrastructure needed to turn the framework described here into a practical reality.

Book a call with us via this link

Or sign up for a demo
