LLM Evaluations Explained: Experiments, Online Evaluations, Guardrails, and When to Use Each in 2026

Rogerio Chaves
Feb 2, 2026
Ask five people what they mean by "I need to add evaluations to my agent" and you'll get five completely different answers:
"I changed the prompt, need to make sure I didn't break anything"
"I want to know when my support agent is escalating to humans when it shouldn't"
"We need to detect when the agent doesn't know something so we can add it to the knowledge base"
"The domain expert is tweaking prompts in the UI, we need a way for them to validate changes"
"I need to block jailbreak attempts before they reach users"
All valid. All "evaluations." But each requires a different process and different tooling.
After working with dozens of companies building agents at LangWatch, we've seen this confusion play out constantly. Teams waste weeks searching for the wrong thing because they googled "LLM evaluation" and landed on a solution designed for a completely different problem.
So let's fix that.
The four things people mean by "Evaluations"
1. Experiments (Offline Evaluations, Batch Testing)
What it is: Running your agent against a dataset of test cases before deploying to production.
When to use it:
You changed the system prompt and need to verify nothing broke
Anthropic released a new Claude Opus version and you want to validate the upgrade
Your domain expert tweaked the agent instructions in the UI and wants to test before publishing
CI/CD quality gates before merging prompt changes
Team evaluation workflows in 2026:
Prompt iteration — Someone (often a domain expert, not a dev) changes the agent behavior. Did that improve things? Did it break edge cases that were working before?
Model upgrades — Your provider releases a new version. Gemini 2.5 → 3.0, Claude Sonnet 4.5 → Claude Opus 4.5. You need to validate the upgrade doesn't regress quality before flipping the switch.
Golden dataset validation — For RAG systems, you have a set of questions with known-correct answers. Run them, measure F1, precision, recall. This is your source of truth.
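Golden-dataset scoring comes down to simple arithmetic. Here is a minimal sketch of precision, recall, and F1 for one question's retrieved documents — the doc IDs are made up for illustration:

```python
# Precision: fraction of retrieved docs that are relevant.
# Recall: fraction of relevant docs that were retrieved.
# F1: harmonic mean of the two.

def prf1(retrieved: set[str], relevant: set[str]) -> tuple[float, float, float]:
    tp = len(retrieved & relevant)  # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical retrieval result for one golden-dataset question:
retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc1", "doc2", "doc5"}
```

Aggregating these per-question scores across the whole golden dataset gives you the source-of-truth number to compare prompt and model variants against.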
Two personas, two workflows:
Domain experts use the Experiments UI. They tweak prompts, run tests against datasets, compare results visually. No code required.
Developers use the SDK. Experiments run in CI/CD, block merges if quality drops below threshold, integrate with existing test infrastructure.
Both are valid. Both are "experiments."
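The developer-side CI/CD workflow can be sketched in a few lines. `run_agent` and `score` below are hypothetical stand-ins for your own agent call and evaluator, not a real SDK:

```python
# Minimal sketch of a CI quality gate: run the agent over a test
# dataset, score each case, and block the merge below a threshold.

def run_agent(question: str) -> str:
    # Stand-in: a real implementation would call your agent/LLM.
    return "42" if "answer" in question else "I don't know"

def score(output: str, expected: str) -> float:
    # Stand-in evaluator: exact match. Real evaluators are semantic.
    return 1.0 if output == expected else 0.0

def quality_gate(dataset: list[tuple[str, str]], threshold: float = 0.8) -> bool:
    scores = [score(run_agent(q), expected) for q, expected in dataset]
    avg = sum(scores) / len(scores)
    return avg >= threshold  # False = fail the CI job

dataset = [
    ("What is the answer?", "42"),
    ("What is the answer to life?", "42"),
]
```

In a real pipeline, a `False` result exits non-zero so the merge is blocked until the prompt change is fixed or the dataset is updated.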
2. Online Evaluation (Production Monitoring)
What it is: Continuously scoring your agent's behavior in production, after responses are sent.
When to use it:
Tracking whether your support agent is resolving issues or escalating unnecessarily
Detecting "I don't know" responses that signal missing documentation
Measuring user satisfaction and frustration signals
Alerting when something breaks that your test suite didn't catch
What this actually looks like for agents:
Forget abstract metrics. Here's what companies actually monitor:
Customer support agents:
Escalation rate — Is the agent handing off to humans appropriately? Too much = agent isn't useful. Too little = agent is frustrating users who need a human.
Resolution detection — Did the conversation end with the user's problem solved?
Intent misunderstanding — Is the agent responding to what the user meant, or what they literally said?
Internal data/BI agents:
User frustration signals — Is the user rephrasing the same question? Saying "no that's not what I meant"? Abandoning the conversation?
Query accuracy — For SQL/data agents, is the generated query actually returning what the user asked for?
Clarification loops — How many back-and-forths before the agent understands the request?
RAG agents:
"I don't know" detection — When the agent admits it doesn't have the answer, that's a signal. Aggregate these, find patterns, update your knowledge base.
Source attribution — Is the agent citing the right documents? This is measurable.
Hallucination detection — Is the response grounded in retrieved context, or is the agent making things up?
This is observability for agents. You're not blocking anything, you're measuring what's happening so you can improve.
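The metrics above are aggregations over production traces. A minimal sketch, assuming a simple log format of final agent messages — the keyword checks are naive stand-ins for real semantic evaluators:

```python
# Score logged production messages asynchronously (after the fact)
# and aggregate the rates worth watching.

def is_escalation(msg: str) -> bool:
    # Naive check; a real evaluator would classify handoffs semantically.
    return "transfer you to a human" in msg.lower()

def is_idk(msg: str) -> bool:
    return "i don't know" in msg.lower()

def monitor(logs: list[str]) -> dict[str, float]:
    n = len(logs)
    return {
        "escalation_rate": sum(is_escalation(m) for m in logs) / n,
        "idk_rate": sum(is_idk(m) for m in logs) / n,
    }

logs = [
    "Here is your refund status.",
    "I don't know, let me transfer you to a human agent.",
    "I don't know the answer to that.",
    "Your order ships tomorrow.",
]
```

Because this runs after responses are sent, it adds zero latency to the agent; the output feeds dashboards and alerts, not the request path.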
3. Guardrails (Real-time Enforcement)
What it is: Evaluators that run synchronously and can block or modify responses before they reach users.
When to use it:
Blocking jailbreak attempts (yes, people still try)
Preventing PII from leaking in responses (compliance requirement)
Enforcing that the agent stays on-topic for your use case
Stopping the agent from making commitments it shouldn't ("I'll refund you 100%")
The performance reality:
Guardrails don't have to add latency if you structure them right. Run the guardrail check in parallel with your LLM call — fire both off, wait for both to return, then decide. Since LLM calls take way longer than ML-based guardrail checks (jailbreak detection, PII detection typically run in milliseconds), the guardrail finishes first and you're just waiting on the LLM anyway. Net impact: zero.
For input guardrails (checking the user message before it hits the LLM), you do need to wait. But these checks are fast. For output guardrails (checking the response), run them in parallel with generation.
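The parallel pattern looks like this in practice. Both functions are simulated stand-ins: the guardrail inspects the user message while the LLM generates, and the response is only released if the check passes:

```python
import asyncio

async def call_llm(prompt: str) -> str:
    await asyncio.sleep(0.2)   # simulate a slow LLM call
    return f"Response to: {prompt}"

async def pii_check(prompt: str) -> bool:
    await asyncio.sleep(0.01)  # fast ML-based check, milliseconds
    return "ssn" not in prompt.lower()  # True = safe to proceed

async def guarded_call(prompt: str) -> str:
    # Fire both at once; total latency is max(llm, guardrail),
    # which in practice is just the LLM call.
    response, safe = await asyncio.gather(call_llm(prompt), pii_check(prompt))
    return response if safe else "I can't help with that request."
```

Since the guardrail finishes long before the LLM, the wall-clock cost of adding it is effectively zero.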
What guardrails are NOT for:
Don't guardrail quality metrics. "Block all responses with faithfulness < 0.8" sounds reasonable until you realize you're now blocking 20% of responses and users are getting "I cannot respond to that" for totally normal questions.
Guardrails are for policy enforcement, not quality optimization. If something is merely suboptimal, monitor it. If something is actively harmful or forbidden, guardrail it.
4. Evaluators (The Scoring Functions)
What it is: The actual functions that score agent outputs. These are the building blocks used by experiments, monitors, and guardrails.
For agents in 2026, useful evaluators include:
| Category | Evaluator | What it catches |
|---|---|---|
| Business | Escalation Detection | Agent handing off when it shouldn't |
| Business | Resolution Detection | Whether the issue was actually solved |
| Business | User Frustration | Rephrasing, "that's wrong", abandonment signals |
| RAG | Faithfulness | Response grounded in retrieved context |
| RAG | "I Don't Know" Detection | Gaps in your knowledge base |
| RAG | Context Relevancy | Retrieval returning useful documents |
| Safety | Jailbreak Detection | Prompt injection attempts |
| Safety | PII Detection | Personal data in responses |
| Format | Valid JSON/SQL | Agent outputting parseable structured data |
You don't "run evaluators" in isolation. You use them within experiments, monitors, or guardrails.
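The building-block idea is easiest to see as a shared signature: every evaluator takes an output and returns a score plus a pass/fail, which experiments, monitors, and guardrails can all consume. The names below are illustrative, not a real API:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    score: float      # 0.0-1.0, used by experiments and monitors
    passed: bool      # binary verdict, used by guardrails and CI gates
    details: str = ""

def idk_detector(output: str) -> EvalResult:
    # Naive stand-in for a semantic "I don't know" classifier.
    hit = "i don't know" in output.lower()
    return EvalResult(
        score=0.0 if hit else 1.0,
        passed=not hit,
        details="knowledge gap" if hit else "",
    )
```

The same `idk_detector` can average its scores over a dataset (experiment), aggregate them over production traffic (monitor), or gate a single response (guardrail).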
The Mental Model
Evaluators are the scoring functions. Experiments run them against a dataset before you deploy. Monitors run them asynchronously on production traffic. Guardrails run them synchronously in the request path and can block. Same building blocks, three different places to apply them.
Common Mistakes
Mistake 1: Guardrailing quality metrics
"Let's block all responses where faithfulness is below 0.8."
Now 15% of your responses return "I cannot help with that" for totally reasonable questions. Users hate it. Quality metrics are for monitoring and improvement, not real-time blocking.
Mistake 2: Monitoring what should be guardrailed
"We'll track PII exposure rate in our dashboards."
Cool. You just exposed customer data 47 times last week and have a nice chart showing it. If something must not happen, guardrail it. Don't measure how often you failed.
Mistake 3: Only testing happy paths
Your experiment dataset has 500 examples of the agent doing its job correctly. Great. But what about:
Users who are frustrated and rude
Completely off-topic requests
Multi-turn conversations that go sideways
Edge cases from actual production logs
Pull failure cases from production into your test set. That's where the real edge cases live.
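Promoting production failures into your test set can be as simple as filtering traces by score. The trace schema here is an assumption for illustration, not a real LangWatch API:

```python
# Turn low-scoring production traces into experiment dataset entries
# so regressions on real edge cases get caught in CI next time.

def harvest_failures(traces: list[dict], threshold: float = 0.5) -> list[dict]:
    return [
        {"input": t["input"], "expected": None, "source": "production"}
        for t in traces
        if t["score"] < threshold  # keep only the cases that went badly
    ]

traces = [
    {"input": "cancel my sub plz!!!", "score": 0.2},
    {"input": "What is your refund policy?", "score": 0.9},
]
```

The `expected` field starts empty: a human reviews each harvested case and fills in what the agent should have done before it joins the golden dataset.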
Mistake 4: Measuring metrics nobody acts on
You're tracking 47 different scores. Dashboards everywhere. When faithfulness drops from 0.82 to 0.79, what do you do?
If you don't have an action for when a metric moves, you don't need that metric. Start with metrics tied to business outcomes:
Escalation rate → adjust agent capabilities or prompt
"I don't know" rate → update knowledge base
User frustration → investigate and fix root cause
Choosing the right approach
| You want to... | Use |
|---|---|
| Validate prompt changes before deploying | Experiments |
| Let domain experts test their own changes | Experiments (UI) |
| Run quality gates in CI/CD | Experiments (SDK) |
| Track escalation rates, resolution rates | Monitors |
| Find gaps in your knowledge base | Monitors (detect "I don't know") |
| Alert when quality regresses | Monitors |
| Block jailbreak attempts | Guardrails |
| Prevent PII exposure | Guardrails |
| Stop off-topic responses | Guardrails |
Building the Full Stack
If you're shipping an agent to production, you need all three:
1. Experiments in your workflow
Domain experts validate prompt changes in the UI before publishing
Developers run experiments in CI/CD to catch regressions
Golden datasets for RAG accuracy (F1, precision, recall)
2. Monitors in production
Business metrics: escalation rate, resolution rate, user satisfaction
Quality metrics: faithfulness, relevancy, frustration signals
Gap detection: aggregate "I don't know" responses, find missing docs
3. Guardrails at runtime
Safety: jailbreak detection, PII prevention
Policy: staying on-topic, not making unauthorized commitments
FAQ
What's the difference between offline and online LLM evaluations?
Offline (experiments) = Testing on a dataset before deployment. Batch processing, controlled conditions, compare prompt/model variations.
Online (monitors) = Scoring production traffic after responses are sent. Real users, real edge cases, async processing that doesn't slow down your agent.
Do guardrails add latency?
Not if you structure them right. Run guardrail checks in parallel with your LLM call — the guardrail (fast ML model) finishes way before the LLM, so you're not adding wait time. For input guardrails you do need to check before calling the LLM, but these are milliseconds. The "guardrails are slow" myth comes from naive sequential implementations.
How do I measure ROI of my agent?
Track business outcomes, not abstract scores:
Support agent: Escalation rate (lower = more issues resolved by agent = more savings)
Data agent: Task completion rate, user return rate
RAG agent: "I don't know" rate (signals knowledge gaps), user satisfaction
These tie directly to whether the agent is doing its job.
Who should own experiments, developers or domain experts?
Both. Domain experts iterate on prompts and test via UI. Developers own CI/CD integration and automated quality gates. The platform should support both workflows, not force everyone into one.
What evaluators should I start with?
Depends on your agent type:
Customer support: Escalation Detection, Resolution Detection, User Frustration
RAG/Q&A: Faithfulness, "I Don't Know" Detection, Context Relevancy
Data/SQL agents: Query Accuracy, User Frustration, Task Completion
All agents: Jailbreak Detection (guardrail), PII Detection (guardrail)
How is this different from traditional software testing?
Same concepts, different implementation:
Experiments ≈ Unit/integration tests
Monitors ≈ APM (Datadog, etc.)
Guardrails ≈ Input validation, middleware
The difference: agent outputs are non-deterministic and require semantic evaluation. You can't just assert output == expected. You need evaluators that understand meaning.
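A tiny example of why exact-match assertions break down: two correct answers rarely match byte-for-byte. The token-overlap score below is a crude stand-in for a real semantic evaluator (embeddings or LLM-as-judge); the 0.4 threshold is an arbitrary choice for illustration:

```python
def token_overlap(a: str, b: str) -> float:
    # Jaccard similarity over lowercased word sets.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

expected = "Refunds are processed within 5 business days"
output = "Your refund will be processed within 5 business days"

exact_match = output == expected                      # fails: wording differs
semantic_ok = token_overlap(output, expected) >= 0.4  # passes: same meaning
```

A traditional test would flag this correct answer as a failure; a semantic evaluator scores it as close enough.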
When should I build custom evaluators?
When built-in evaluators don't capture what you care about. Common cases:
Business-specific criteria ("did the agent correctly identify this as a billing issue?")
Domain expertise ("is this medical advice appropriate?")
Company policy ("did the agent follow our refund guidelines?")
LLM-as-Judge lets you define custom criteria without writing code: just describe what "good" looks like.
At LangWatch, we built experiments, monitors, and guardrails into one platform because this confusion kept costing teams time. Domain experts get a UI. Developers get SDKs. Everyone gets clarity on what "evaluations" actually means. Try it.

