A Systematic way of Testing of AI Agents

Manouk

Nov 20, 2025

A practical guide to experiments, evaluations, and simulations in LangWatch.

AI agents introduce a new complexity layer beyond traditional prompt-response systems. They plan, reason across multiple steps, interact with tools, query APIs, and operate with memory. Outputs vary across runs, even with identical instructions. A single configuration change can subtly shift reasoning patterns, tool-calling behavior, hallucination rates, or decision boundaries.

This stochastic behavior makes conventional deterministic testing insufficient. What matters is not a single output—it’s the statistical behavior of the system across many controlled runs, and increasingly:
how the agent behaves in simulated environments representing real-world constraints.

Every model upgrade, system prompt rewrite, policy change, retrieval adjustment, or fine-tuning iteration is an experiment. Testing must quantify change, prevent regressions, and support iteration—not by intuition, but by evidence.

LangWatch experiments and simulations provide repeatable snapshots of agent performance across qualitative and quantitative dimensions: correctness, cost, latency, resilience, reasoning stability, compliance, safety, and behavioral consistency.

This guide describes a systematic approach for evaluating and improving agent quality using LangWatch. We’ll cover:

  • How to run experiments

  • How to incorporate simulations

  • How to interpret results

  • How to operationalize improvements

A mental model to keep in mind:

Experiments are the CI pipeline. Simulations are the functional QA environment. Interpretation is debugging.

1. Running an Experiment

Evaluations in LangWatch consist of three elements:
the agent (task), the dataset, and the evaluators.
Simulations extend this by adding environmental context.

1. Task

The task is the function, chain, or agent that is being tested. In an agent setting this includes:

  • Reasoning loop

  • Tool-use or function calls

  • Memory retrieval

  • Embedded policies (guardrails)

In our running example, production currently uses:

model="sonnet-4"
model="sonnet-4"
model="sonnet-4"

A proposed upgrade to:

model="sonnet-4.5"
model="sonnet-4.5"
model="sonnet-4.5"

is being evaluated.

2. Dataset or defined Scenario's

The dataset defines the inputs. This may be:

  • A static regression test set

  • Real user traffic samples

  • Synthetic edge case collections

  • Scenario templates for simulations

LangWatch datasets ensure output consistency across experimental runs.

3. Evaluators

Evaluators score LLM outputs, simulations score behaviors. These can be:

  • LLM-based rubric scoring

  • Exact match or similarity scoring

  • Chain-of-thought safety checks

  • Structured JSON validation

  • Behavioral scoring of tool call correctness (critical for agents)

  • Scenario's to judge if the agent asked a minimum of 2 follow-ups questions.

Evaluators transform subjective quality into structured, queryable signals.

4. Simulation Layer (Optional but recommended for agents)

Unlike prompts or classifiers, agents operate inside context. LangWatch simulations allow injecting:

  • Ambiguous user phrasing

  • Tool failures

  • Delays or malformed responses

  • Conflicting instructions

  • Multi-step dialogue flows

Simulation converts a static dataset into a dynamic environment, revealing robustness—not just correctness.

Example

# my_vegetarian_recipe_agent.py
import pytest
import scenario
import litellm
 
# Configure the default model for simulations
scenario.configure(default_model="openai/gpt-4.1")
 
@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_vegetarian_recipe_agent():
    # 1. Create your agent adapter
    class RecipeAgent(scenario.AgentAdapter):
        async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
            return vegetarian_recipe_agent(input.messages)
 
    # 2. Run the scenario
    result = await scenario.run(
        name="dinner recipe request",
        description="""
            It's saturday evening, the user is very hungry and tired,
            but have no money to order out, so they are looking for a recipe.
        """,
        agents=[
            RecipeAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent should not ask more than two follow-up questions",
                "Agent should generate a recipe",
                "Recipe should include a list of ingredients",
                "Recipe should include step-by-step cooking instructions",
                "Recipe should be vegetarian and not include any sort of meat",
            ])
        ],
        script=[
            scenario.user("quick recipe for dinner"),
            scenario.agent(),
            scenario.user(),
            scenario.agent(),
            scenario.judge(),
        ],
    )
 
    # 3. Assert the result
    assert result.success
 
# Example agent implementation using litellm
@scenario.cache()
def vegetarian_recipe_agent(messages) -> scenario.AgentReturnTypes:
    response = litellm.completion(
        model="openai/gpt-4.1",
        messages=[
            {
                "role": "system",
                "content": """
                    You are a vegetarian recipe agent.
                    Given the user request, ask AT MOST ONE follow-up question,
                    then provide a complete recipe. Keep your responses concise and focused.
                """,
            },
            *messages,
        ],
    )
    return response.choices[0].message
# my_vegetarian_recipe_agent.py
import pytest
import scenario
import litellm
 
# Configure the default model for simulations
scenario.configure(default_model="openai/gpt-4.1")
 
@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_vegetarian_recipe_agent():
    # 1. Create your agent adapter
    class RecipeAgent(scenario.AgentAdapter):
        async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
            return vegetarian_recipe_agent(input.messages)
 
    # 2. Run the scenario
    result = await scenario.run(
        name="dinner recipe request",
        description="""
            It's saturday evening, the user is very hungry and tired,
            but have no money to order out, so they are looking for a recipe.
        """,
        agents=[
            RecipeAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent should not ask more than two follow-up questions",
                "Agent should generate a recipe",
                "Recipe should include a list of ingredients",
                "Recipe should include step-by-step cooking instructions",
                "Recipe should be vegetarian and not include any sort of meat",
            ])
        ],
        script=[
            scenario.user("quick recipe for dinner"),
            scenario.agent(),
            scenario.user(),
            scenario.agent(),
            scenario.judge(),
        ],
    )
 
    # 3. Assert the result
    assert result.success
 
# Example agent implementation using litellm
@scenario.cache()
def vegetarian_recipe_agent(messages) -> scenario.AgentReturnTypes:
    response = litellm.completion(
        model="openai/gpt-4.1",
        messages=[
            {
                "role": "system",
                "content": """
                    You are a vegetarian recipe agent.
                    Given the user request, ask AT MOST ONE follow-up question,
                    then provide a complete recipe. Keep your responses concise and focused.
                """,
            },
            *messages,
        ],
    )
    return response.choices[0].message
# my_vegetarian_recipe_agent.py
import pytest
import scenario
import litellm
 
# Configure the default model for simulations
scenario.configure(default_model="openai/gpt-4.1")
 
@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_vegetarian_recipe_agent():
    # 1. Create your agent adapter
    class RecipeAgent(scenario.AgentAdapter):
        async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
            return vegetarian_recipe_agent(input.messages)
 
    # 2. Run the scenario
    result = await scenario.run(
        name="dinner recipe request",
        description="""
            It's saturday evening, the user is very hungry and tired,
            but have no money to order out, so they are looking for a recipe.
        """,
        agents=[
            RecipeAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent should not ask more than two follow-up questions",
                "Agent should generate a recipe",
                "Recipe should include a list of ingredients",
                "Recipe should include step-by-step cooking instructions",
                "Recipe should be vegetarian and not include any sort of meat",
            ])
        ],
        script=[
            scenario.user("quick recipe for dinner"),
            scenario.agent(),
            scenario.user(),
            scenario.agent(),
            scenario.judge(),
        ],
    )
 
    # 3. Assert the result
    assert result.success
 
# Example agent implementation using litellm
@scenario.cache()
def vegetarian_recipe_agent(messages) -> scenario.AgentReturnTypes:
    response = litellm.completion(
        model="openai/gpt-4.1",
        messages=[
            {
                "role": "system",
                "content": """
                    You are a vegetarian recipe agent.
                    Given the user request, ask AT MOST ONE follow-up question,
                    then provide a complete recipe. Keep your responses concise and focused.
                """,
            },
            *messages,
        ],
    )
    return response.choices[0].message

2. Interpreting Results

With the experiment complete, analysis follows a top-down sequence.

Step 1: Macro View

The experiments overview highlights average evaluator scores, cost, and latency across all recorded runs. This identifies whether the candidate version is:

  • A likely improvement

  • Neutral

  • A clear regression

Trends matter more than single scores. Agents evolve through iteration, not jumps.

Step 2: Baseline Diff

Comparing the candidate run (sonnet-4.5) to the baseline (sonnet-4) provides context.

Aggregate Metrics

Charts help identify whether improvements in reasoning quality justify cost or latency changes.

Examples:

  • +7% correctness

  • +14% tool failure resilience

  • +18% latency

Depending on product constraints, these may or may not be acceptable.

Item-Level Diff

Aggregates hide edge-case volatility. The diff table presents outputs side-by-side for each test case, simulation state, and evaluator.

Questions to ask:

  • Did quality improve uniformly?

  • Did regressions cluster around specific patterns (e.g., adversarial questions or unexpected tools)?

  • Are errors random or systematic?

Trace-Level Debugging

Selecting a regression case opens the full execution trace:

  • Thought steps

  • Tool invocation sequences

  • Retrieval context

  • Final answer

This is where behavior—not just output—is analyzed.

Step 3: Human Annotation

Automated evaluators detect that something changed. Humans determine why.

A typical flow involves:

  1. Engineer triage
    Mark items needing inspection: "review_status:fail".

  2. Domain expert analysis
    Classify failure patterns:
    hallucination, tool misuse, policy breach, insufficient reasoning, formatting error, RAG irrelevance.

Over time, this creates a failure taxonomy, which guides iteration more precisely than broad metrics.

3. Incorporating Simulations into Continuous Development

Simulations turn evaluation from static verification into behavioral testing.

Patterns emerge:

  • Does the agent persist through interruptions?

  • Does it degrade gracefully under uncertainty?

  • Does it follow policies under pressure?

  • Does it remain consistent across repeated runs?

Over multiple versions, simulation results form a behavioral signature of the system—allowing teams to detect drift early.

Takeaways

A reliable agent is not the result of intuition—it’s the product of systematic evaluation.

LangWatch provides the feedback loop:

  1. Run
    Execute the task against controlled datasets and optional simulated environments.

  2. Analyze
    Compare candidate and baseline versions top-down (aggregate → diff → trace).

  3. Act
    Convert ambiguous signals into structured annotations that guide focused improvement.

As agents become embedded in critical workflows—support automation, decision assistance, compliance enforcement—repeatable experiments and simulations become the backbone of responsible AI engineering.

Ship agents with confidence, not crossed fingers

Get up and running with LangWatch in as little as 5 minutes.

Ship agents with confidence, not crossed fingers

Get up and running with LangWatch in as little as 5 minutes.

Ship agents with confidence, not crossed fingers

Get up and running with LangWatch in as little as 5 minutes.