The Agent Testing Pyramid: a mental model for trustworthy agents

Manouk Draisma
Ever since we put tools in the hands of LLMs, one question has refused to go away: how do you systematically know your agent actually works, and that it's reliable enough to ship?
After building increasingly complex agents with teams in production — and after a recent evening in Amsterdam where a room full of practitioners argued about exactly this — a consistent pattern keeps surfacing. We call it the Agent Testing Pyramid: three layers of quality assurance, each answering a question the others can't. (Our CTO Rogerio first wrote it up in the Scenario docs; this is the field-tested version, with the lessons that have accumulated since.)
It's deliberately modeled on the classic software testing pyramid, but the layers are different because agents are different. Let's walk up it.
Why agents broke the old pyramid
In traditional software, a test exercises a function: same input, same output, every time. In classic ML, you evaluate one operation — a classifier, a search, a single button-click behind the scenes — and you can score it against a labeled set.
An agent is neither. When you build one, you have an LLM at the core, but that model isn't the thing touching your systems. Your tools are: the functions you wrote, the skills you loaded, the MCP servers you connected. Every turn, the model reaches into that toolbelt and decides what to do — load a skill, classify intent, fetch an order number, look up a refund policy, check whether it matches, reply. The complexity is multiplied by every combination of tools the agent might pick up.
And then there's a second axis the old pyramid never had to think about: time. An agent doesn't do one thing and stop; it works across many turns. The customer comes back, changes their mind, asks for one more thing, uploads a PDF halfway through. So the combinatorial explosion of tools is multiplied again by everything that can happen across a multi-turn conversation.
This is why "it worked when I tried it" is such a trap with agents. You vibe-check a happy path, it looks great, thumbs up — and without evaluations underneath, you're still shipping on vibes. The pyramid is how you climb out of that.
Layer 1 — Unit tests (the foundation)
Agents are probabilistic, but they're built on deterministic software that has to work every single time: API connections to external tools, data-transformation pipelines, memory storage and retrieval, authentication, rate limiting, and error handling.
When these fail, it doesn't matter how sophisticated your prompts or retrieval are — the agent fails, often in ways that aren't obvious. A broken tool surfaces as "the model is being dumb today" when really your auth token expired or a downstream schema changed. Fast unit-test feedback saves enormous debugging time and gives you a floor of reliability to build on.
A useful habit, widely reported by teams using AI coding tools: ask the model to write unit tests alongside the code, as a way to keep itself honest. The deterministic base is also where the operational realities live that nobody warns you about — a model you depend on gets deprecated by your cloud provider six months in, a schema shifts, a rate limit changes. Tests at this layer are what make those changes loud instead of silent.
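As a minimal sketch of what "loud instead of silent" can look like at this layer, here is a unit test around a tool's response parsing. The `parse_order` helper, the `Order` fields, and the `ToolSchemaError` exception are hypothetical names chosen for illustration; they are not part of any framework mentioned in this post.

```python
from dataclasses import dataclass


class ToolSchemaError(Exception):
    """Raised when a downstream payload no longer matches what the tool expects."""


@dataclass(frozen=True)
class Order:
    order_id: str
    status: str
    total_cents: int


def parse_order(payload: dict) -> Order:
    """Validate a raw API payload and convert it into an Order.

    Hypothetical tool helper: the field names are assumptions for illustration.
    """
    required = {"order_id": str, "status": str, "total_cents": int}
    for field_name, expected_type in required.items():
        if field_name not in payload:
            raise ToolSchemaError(f"missing field: {field_name}")
        if not isinstance(payload[field_name], expected_type):
            raise ToolSchemaError(f"wrong type for field: {field_name}")
    return Order(payload["order_id"], payload["status"], payload["total_cents"])


def test_parse_order_accepts_valid_payload():
    order = parse_order({"order_id": "A-1", "status": "shipped", "total_cents": 1999})
    assert order == Order("A-1", "shipped", 1999)


def test_parse_order_fails_loudly_on_schema_drift():
    # Simulates a downstream rename from "total_cents" to "total".
    try:
        parse_order({"order_id": "A-1", "status": "shipped", "total": 1999})
    except ToolSchemaError as err:
        assert "total_cents" in str(err)
    else:
        raise AssertionError("schema drift should raise ToolSchemaError")
```

With a test like this in place, a renamed field shows up as a failing test that names the missing field, rather than as an agent that seems to be "being dumb today".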
Layer 2 — Evals & optimization (the middle)
The middle layer is where you measure and optimize the individual probabilistic components:
RAG retrieval accuracy evaluations
LLM response-quality metrics
Prompt optimization (e.g. with frameworks like DSPy)
Fine-tuning for preference alignment
This layer has a data-science feel: you're working with datasets and optimizing toward a metric. Each component — parsing, retrieval, generation — is evaluated separately but compounds into overall system performance. A 5% retrieval gain plus a better prompt can compound into meaningfully better behavior. It also demands ML discipline: train/validation splits, carefully defined metrics, iterative cycles. Without it, teams chase their tails with random prompt tweaks that sometimes help and sometimes hurt.
This is exactly the layer LangWatch's evaluators target — from exact-match and semantic similarity to RAGAS-style retrieval metrics and LLM-as-judge scoring. But the layer is also where most teams quietly go wrong. A few hard-won lessons we keep seeing confirmed in the wild:
Start with observability, not metrics. Everything in this layer needs data, and the data comes from tracing. Before you can measure quality you need to be able to find any run, replay what happened, and spot patterns across many runs — the common failure modes, what users actually do versus what you imagined they'd do. Tracing is the substrate the whole pyramid sits on.
Make your judges binary, not continuous. This is the single biggest LLM-as-judge lesson, and it comes up every time. A continuous 0–1 "groundedness score" feels rigorous and is almost useless in practice: what do you do differently at 0.6 versus 0.7? Nobody knows, so the dashboard gets built, hidden behind a feature flag, and never looked at again. A binary check — is this answer correct for this case, yes or no? — is actionable. It tells you what to fix.
Define metrics from your product, not from your tooling's preset list. The failure mode is opening your eval tool, seeing a menu of generic evaluators, ticking three that sound relevant, and letting them run into a dashboard nobody reads. Flip it around. Start from the product outcome. For a document-Q&A feature serving, say, bid managers, trust is the product requirement: if a reviewer sees one fabricated contract detail, they stop trusting the whole system, and a missed clause can cost millions on a tender. So hallucination becomes a metric you care about — and now you can make it concrete and binary:
Extract every claim in the answer → for each claim, find supporting evidence in the source material → pass only if every claim is supported. Otherwise, fail.
That's a hallucination check defined from the product down, expressed as a binary check, instead of a vague continuous score clicked from a dropdown.
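The three-step check above can be sketched in a few lines. This is only an illustration of the control flow: in practice claim extraction and evidence lookup would likely be LLM calls, and the two stand-ins here (sentence splitting and substring matching) are deliberately naive assumptions, as are all function names.

```python
import re
from typing import Callable


def split_into_claims(answer: str) -> list[str]:
    """Naive stand-in for claim extraction: treat each sentence as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def is_supported_by_substring(claim: str, sources: list[str]) -> bool:
    """Naive stand-in for evidence lookup: case-insensitive substring match."""
    needle = claim.rstrip(".!?").lower()
    return any(needle in source.lower() for source in sources)


def hallucination_check(
    answer: str,
    sources: list[str],
    extract_claims: Callable[[str], list[str]] = split_into_claims,
    is_supported: Callable[[str, list[str]], bool] = is_supported_by_substring,
) -> tuple[bool, list[str]]:
    """Return (passed, unsupported_claims). Passes only if every claim is supported.

    Note: an answer with no claims passes vacuously in this sketch.
    """
    claims = extract_claims(answer)
    unsupported = [c for c in claims if not is_supported(c, sources)]
    return (len(unsupported) == 0, unsupported)


sources = ["The contract term is 24 months. Payment is due within 30 days of invoice."]
passed, unsupported = hallucination_check(
    "The contract term is 24 months. The penalty is capped at 5%.", sources
)
print(passed, unsupported)
```

Returning the list of unsupported claims alongside the verdict keeps the check binary while still telling you what to fix.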
Measure component by component. Different parts of the system fail in different ways, so they need different metrics. Hallucination is the right lens for document Q&A; for an extraction step you want more classical recall-style metrics. The point is that each AI-driven component gets metrics that make sense for that component, then a clear end-to-end picture: error analysis to find what typically breaks, a synthetic dataset to get started pre-production, a benchmark per metric, and a measured delta on every change.
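For the extraction case, a recall-style metric with a measured delta might look like the sketch below. The two extractors, the clause-number task, and the tiny synthetic dataset are all hypothetical, picked only to show a baseline being compared against a candidate change.

```python
import re


def recall(expected: set, extracted: set) -> float:
    """Fraction of expected items that were actually extracted."""
    if not expected:
        return 1.0
    return len(expected & extracted) / len(expected)


def mean_recall(cases: list, extract) -> float:
    """Average recall of an extractor over a dataset of cases."""
    scores = [recall(set(c["expected"]), extract(c["text"])) for c in cases]
    return sum(scores) / len(scores)


def baseline_extract(text: str) -> set:
    # Only handles two-level clause numbers such as "4.2".
    return set(re.findall(r"\b\d+\.\d+\b", text))


def candidate_extract(text: str) -> set:
    # Also handles deeper numbering such as "10.3.1".
    return set(re.findall(r"\b\d+(?:\.\d+)+\b", text))


# Small synthetic dataset, the kind you might write by hand before production.
CASES = [
    {"text": "See clauses 4.2 and 7.1.", "expected": ["4.2", "7.1"]},
    {"text": "Liability is capped under 10.3.1 and 12.4.", "expected": ["10.3.1", "12.4"]},
]

baseline = mean_recall(CASES, baseline_extract)
candidate = mean_recall(CASES, candidate_extract)
print(f"baseline={baseline:.2f} candidate={candidate:.2f} delta={candidate - baseline:+.2f}")
# prints "baseline=0.75 candidate=1.00 delta=+0.25"
```

The useful part is the last line: every change to the component gets a number relative to the previous benchmark, so a tweak that quietly hurts recall is visible immediately.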
Manage cost with a hybrid approach. LLM judges cost money and aren't deterministic, so they shouldn't be your default for everything. You don't need to run a judge over 30,000 traces a day. Sample. Run a cheap deterministic base check first and reserve the LLM judge for the cases that actually need it. Split the work; don't burn budget proving the same thing over and over.
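One way to sketch that split is below: sample the traces, run a deterministic check first, and call the judge only when the cheap check is inconclusive. The trace fields (`answer`, `expected`) and the `judge` callable are assumptions for illustration; a real judge would be an LLM call rather than the stub used here.

```python
import random
from typing import Callable, Optional


def cheap_check(trace: dict) -> Optional[bool]:
    """Deterministic pre-check.

    Returns True or False when the verdict is conclusive, or None when
    the case needs a judge.
    """
    answer = trace.get("answer", "").strip()
    if not answer:
        return False  # an empty answer is always a failure
    expected = trace.get("expected")
    if expected is not None:
        return answer == expected  # exact match is enough when we have a label
    return None


def evaluate_sample(
    traces: list,
    judge: Callable[[dict], bool],
    sample_rate: float,
    seed: int = 0,
) -> dict:
    """Evaluate a random sample of traces, calling the judge only when needed."""
    rng = random.Random(seed)
    sampled = [t for t in traces if rng.random() < sample_rate]
    results = {"sampled": len(sampled), "judge_calls": 0, "passed": 0, "failed": 0}
    for trace in sampled:
        verdict = cheap_check(trace)
        if verdict is None:
            verdict = judge(trace)
            results["judge_calls"] += 1
        results["passed" if verdict else "failed"] += 1
    return results


traces = [
    {"answer": ""},
    {"answer": "NL", "expected": "NL"},
    {"answer": "Your refund was approved."},
]
# Stub judge standing in for an LLM call.
print(evaluate_sample(traces, judge=lambda t: True, sample_rate=1.0))
```

Tracking `judge_calls` separately makes the cost of the expensive path visible, which is usually the number you want to watch when tuning the sample rate.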
Rule of thumb
If the thing you're testing has many tiny variations and you care about an accuracy number — routing to the right sub-agent, retrieving the right chunks, detecting the right language — it belongs in a dataset evaluation, not a simulation. This is the cleanest line between Layer 2 and Layer 3, and it's worth tattooing somewhere: too many tiny variations → eval; a few critical end-to-end journeys → simulation.
Layer 3 — Simulations (the peak)
At the top sit agent simulations: fewer in number, but crucial for end-to-end validation. The lower layers prove each component is good in isolation. But there's no point having the best search tool in the world if your agent reaches for it at the wrong moment — or doesn't reach for it at all. Simulations verify the pieces work together, as a system, across real multi-turn interactions.
How a simulation works is a simple idea that turns out to be powerful. With Scenario, our open-source framework, you have a user simulator that pretends to be your customer and talks to your agent. Your agent has no idea it's talking to a robot, so it behaves exactly as it would with a real user, exercising the whole path. A judge watches the conversation go by and, at the end, classifies success against criteria you define. You write the scenario, you define what "good" means.
A scenario has a name, a description of the situation ("the item arrived damaged and was charged twice"), the agents that participate, and the criteria the judge checks. And the criteria can go well beyond the final answer:
Final outcome — did the agent resolve the issue?
Behavior — did it gather the order number before investigating?
Ordering — did it check the refund policy before quoting it?
Escalation — did it hand off to a human instead of arguing?
That ordering point matters more than it looks. An agent might give you the right answer — "here's how our policy works" — while never actually checking the policy. This time it got lucky; next time that's a hallucination. Verifying that it did the proper lookup before replying is as important as verifying the reply is correct. Single-turn evals can't see that. Simulations can.
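To make the ordering criterion concrete, here is a small sketch that checks a recorded conversation for "policy looked up before policy quoted". It is not the Scenario API: the `Event` structure, the tool name `lookup_refund_policy`, and the keyword test are assumptions, and in a real simulation an LLM judge would typically assess this criterion from the full transcript.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str        # "tool_call" or "agent_reply"
    name: str = ""   # tool name, for tool calls
    text: str = ""   # message text, for agent replies


def looked_up_policy_before_quoting(events: list) -> bool:
    """Fail if the agent talks about the policy before calling the lookup tool.

    A conversation that never mentions the policy passes vacuously.
    """
    policy_checked = False
    for event in events:
        if event.kind == "tool_call" and event.name == "lookup_refund_policy":
            policy_checked = True
        elif event.kind == "agent_reply" and "policy" in event.text.lower():
            if not policy_checked:
                return False
    return True


# The agent got the answer right, but never checked: a lucky guess.
lucky = [
    Event("agent_reply", text="Our policy allows refunds within 30 days."),
]

# The agent gathered the order, checked the policy, then replied.
proper = [
    Event("tool_call", name="get_order"),
    Event("tool_call", name="lookup_refund_policy"),
    Event("agent_reply", text="Per our policy, you are eligible for a refund."),
]

print(looked_up_policy_before_quoting(lucky))   # False
print(looked_up_policy_before_quoting(proper))  # True
```

Both replies could be word-for-word correct; only the trace of what happened before the reply separates the lucky run from the reliable one.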
Simulations also extend to voice. Scenario can connect a user simulator to a phone number and literally call your agent, pretending to be a frustrated customer — and if you're using a multimodal model, the agent can pick up the emotion in the voice, not just the transcript. A practical pattern several teams have landed on: test the content in text first, where it's faster and cheaper to run hundreds of variations, then add voice only where voice itself is the risk (interruptions, latency, tone).
The power of binary outcomes
The most valuable shift teams report from simulation testing is moving from probabilistic averages to binary outcomes. Instead of "how accurate is the RAG system on average?", you ask: "Can the agent successfully help a customer cancel their order when they don't remember their order number?" The simulations layer maps directly to business value, builds trust through clear demonstrations of capability, pinpoints specific gaps, and communicates progress to non-technical stakeholders — the compliance officer who wants a traceable yes/no, not a 0.5 threshold.
But how do you trust a non-deterministic judge?
Fair question — the judge is an LLM too. The answer is fewer, calibrated scenarios, not more. The temptation is to run ten thousand simulations; that's counterproductive. A handful of scenarios covering your most critical cases already exercises the full breadth of the agent. Writing a good one is itself an exercise in iteration: you run it, you look at the judgment, you decide it's too strict or missed your domain nuance, you sharpen the criteria, you run again. After a few rounds it's solid and stays solid — especially at lower temperatures. The visual, anyone-can-contribute side of the platform exists precisely so a domain expert, not just an engineer, can do this calibration.
And the flakiness that scared us in classic software testing? It's handled differently here. Most teams use simulations as a gate before production: a small set of smoke tests that must be rock-solid, then a fuller sweep — sometimes with a human-in-the-loop check — before going live. Because people increasingly develop through a coding agent, the agent itself can read a flaky result, understand it, and harden the scenario. Some fuzziness is even acceptable when a human can sign off that the judge was simply being too harsh.
Where scenarios come from: the vibe-eval loop
The most common worry is "I don't have a dataset of conversations — where do I even get scenarios?" You already have them; they're just not written down. They show up constantly while you build, as edge cases you stumble on and bugs that get reported.
The discipline is the Vibe-Eval loop — TDD for agents. Vibe-checking is essential; only human perception catches the weird little issues. The problem is it doesn't scale. So you write the vibes down:
Play with your agent and vibe-debug until you find weird behavior.
Don't fix it yet. Write a scenario that reproduces it.
Run it and watch it fail — now you know the issue is real and the test catches it.
Implement the fix.
Run it again and watch it pass — now you've validated both the test and the fix.
From then on, you can refactor prompts and swap models knowing that if the issue ever returns, the test catches it automatically. Writing the test first is also just faster: you can try several fixes and know the moment you're actually done.
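The loop above, reduced to a toy example. The two agent functions are hypothetical stand-ins for your agent before and after a fix, and the scenario is a plain function rather than a real simulation with a user simulator and judge; the point is only the red-then-green sequence.

```python
def agent_before_fix(message: str) -> str:
    # Weird behavior found while vibe-checking: the agent insists on an
    # order number even when the customer says they don't have one.
    return "Please provide your order number."


def agent_after_fix(message: str) -> str:
    if "don't remember" in message.lower():
        return "No problem, I can look it up with your email address instead."
    return "Please provide your order number."


def scenario_cancel_without_order_number(agent) -> bool:
    """Pass only if the agent offers an alternative way to find the order."""
    reply = agent("I want to cancel my order but I don't remember my order number.")
    return "email" in reply.lower()


# Step 3: run it and watch it fail, so you know the test catches the issue.
print(scenario_cancel_without_order_number(agent_before_fix))  # False

# Step 5: run it again after the fix and watch it pass.
print(scenario_cancel_without_order_number(agent_after_fix))   # True
```

Keeping the scenario in the suite afterwards is what turns a one-off vibe check into a regression test.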
This is where the layers become a flywheel. Production traces and bug reports become new eval cases and new scenarios. Meeting transcripts, spec docs, and customer call notes can be fed through an AI assistant to draft scenarios for you — we even ship a skill that teaches Claude Code or Cursor to write them. Observability feeds evals; evals and production feed simulations; the whole stack compounds and gets harder to regress. That's what evaluation-driven development actually looks like in practice.
Testing is a team sport
One theme from our practitioners deserves its own line, because it's the part engineering-led teams underestimate: evals are not purely an engineering task. You need domain experts to define ground truth, the business to decide what "good enough" means and what each failure actually costs, and engineers to build the infrastructure that ties it together. There are soft metrics (hallucination, helpfulness), hard metrics (latency), and business metrics, and they have different owners. The thing that breaks teams isn't the tooling; it's not having decided, before something goes wrong, who is responsible for which layer. A collaborative process doesn't mean shared (i.e. nobody's) responsibility.
Finding the right balance
The pyramid isn't rigid. Early-stage agents might jump straight to a few simulations and skip some unit tests. Mature systems often invest heavily in the middle layer to squeeze out performance. But the structure holds: test both the components and their integration across multiple turns. New evaluation techniques will keep emerging, but that fundamental insight is unlikely to change.
Where to start this week
Inventory your agent against the three layers and find the gaps. Most teams are missing the top two.
Turn on tracing if you haven't. You can't measure what you can't see, and every layer above feeds on this data.
Pick the single flow that would hurt most if it broke. That's your first simulation candidate.
Pick the single component you tweak most by hand — a prompt, a retriever. That's your first eval candidate. Define its metric from the product outcome, and make it binary.
Next in the series: we make the eval layer real without leaving your coding tool, instrumenting an agent with LangWatch and letting Claude Code propose your first eval from real traces.

