LLM Evaluations Explained: Experiments, Online Evaluations, Guardrails, and When to Use Each in 2026

Rogerio Chaves

Feb 2, 2026

Ask five people what they mean by "I need to add evaluations to my agent" and you'll get five completely different answers:

  • "I changed the prompt, need to make sure I didn't break anything"

  • "I want to know when my support agent is escalating to humans when it shouldn't"

  • "We need to detect when the agent doesn't know something so we can add it to the knowledge base"

  • "The domain expert is tweaking prompts in the UI, we need a way for them to validate changes"

  • "I need to block jailbreak attempts before they reach users"

All valid. All "evaluations." But each is about a different process or tooling.

After working with dozens of companies building agents at LangWatch, we've seen this confusion play out constantly. Teams waste weeks searching for the wrong thing because they googled "LLM evaluation" and landed on a solution designed for a completely different problem.

So let's fix that.

The four things people mean by "Evaluations"

1. Experiments (Offline Evaluations, Batch Testing)

What it is: Running your agent against a dataset of test cases before deploying to production.

When to use it:

  • You changed the system prompt and need to verify nothing broke

  • Anthropic released a new Claude Opus version and you want to validate the upgrade

  • Your domain expert tweaked the agent instructions in the UI and wants to test before publishing

  • CI/CD quality gates before merging prompt changes

Team evaluation workflows in 2026:

  1. Prompt iteration — Someone (often a domain expert, not a dev) changes the agent behavior. Did that improve things? Did it break edge cases that were working before?

  2. Model upgrades — Your provider releases a new version. Gemini 2.5 → 3.0, Claude Sonnet 4.5 → Claude Opus 4.5. You need to validate the upgrade doesn't regress quality before flipping the switch.

  3. Golden dataset validation — For RAG systems, you have a set of questions with known-correct answers. Run them, measure F1, precision, recall. This is your source of truth.
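The golden-dataset step can be sketched as a small scorer. The dataset rows, field names, and document IDs below are illustrative assumptions, not any tool's schema; the point is the precision/recall/F1 arithmetic over retrieved document IDs.

```python
# Sketch: scoring a RAG golden dataset with precision, recall, and F1.
# Dataset format and field names are illustrative, not a specific tool's schema.

def score_retrieval(expected_ids, retrieved_ids):
    """Precision/recall/F1 over retrieved document IDs for one question."""
    expected, retrieved = set(expected_ids), set(retrieved_ids)
    hits = len(expected & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(expected) if expected else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

golden_set = [  # known-correct answers per question (illustrative)
    {"question": "How do I reset my password?",
     "expected": ["doc-12"], "retrieved": ["doc-12", "doc-40"]},
    {"question": "What is the refund window?",
     "expected": ["doc-7", "doc-8"], "retrieved": ["doc-7"]},
]

scores = [score_retrieval(r["expected"], r["retrieved"]) for r in golden_set]
avg_f1 = sum(f1 for _, _, f1 in scores) / len(scores)
print(f"avg F1: {avg_f1:.2f}")
```

Run this on every change and you have a number to compare against the previous run, which is exactly what makes it a source of truth.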

Two personas, two workflows:

  • Domain experts use the Experiments UI. They tweak prompts, run tests against datasets, compare results visually. No code required.

  • Developers use the SDK. Experiments run in CI/CD, block merges if quality drops below threshold, integrate with existing test infrastructure.

Both are valid. Both are "experiments."
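The developer workflow can be sketched as a CI quality gate. Everything here is an illustrative stand-in: `run_experiment`, its stubbed scores, the dataset name, and the 0.85 threshold are assumptions, not a specific SDK's API.

```python
# Sketch of a CI/CD quality gate: fail the pipeline if average quality drops
# below a threshold. `run_experiment` and the scores are illustrative stubs.
import sys

def run_experiment(dataset_name):
    """Hypothetical: runs the agent on each test case, returns per-case scores."""
    return [0.91, 0.88, 0.95, 0.79]  # stubbed quality scores

THRESHOLD = 0.85

scores = run_experiment("support_agent_golden_set")  # hypothetical dataset name
avg = sum(scores) / len(scores)
print(f"avg score: {avg:.2f} (threshold {THRESHOLD})")
if avg < THRESHOLD:
    sys.exit(1)  # non-zero exit blocks the merge in most CI systems
```

The non-zero exit code is the whole trick: any CI system treats it as a failed step, which is what blocks the merge.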

2. Online Evaluation (Production Monitoring)

What it is: Continuously scoring your agent's behavior in production, after responses are sent.

When to use it:

  • Tracking whether your support agent is resolving issues or escalating unnecessarily

  • Detecting "I don't know" responses that signal missing documentation

  • Measuring user satisfaction and frustration signals

  • Alerting when something breaks that your test suite didn't catch

What this actually looks like for agents:

Forget abstract metrics. Here's what companies actually monitor:

Customer support agents:

  • Escalation rate — Is the agent handing off to humans appropriately? Too much = agent isn't useful. Too little = agent is frustrating users who need a human.

  • Resolution detection — Did the conversation end with the user's problem solved?

  • Intent misunderstanding — Is the agent responding to what the user meant, or what they literally said?

Internal data/BI agents:

  • User frustration signals — Is the user rephrasing the same question? Saying "no that's not what I meant"? Abandoning the conversation?

  • Query accuracy — For SQL/data agents, is the generated query actually returning what the user asked for?

  • Clarification loops — How many back-and-forths before the agent understands the request?
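A minimal sketch of spotting the frustration signals above in a transcript. The correction phrases and the 0.8 similarity threshold are illustrative assumptions, and `difflib` rephrase matching is a deliberately crude stand-in for a semantic check.

```python
# Sketch: flagging user-frustration signals (explicit corrections, rephrased
# questions). Phrase list and threshold are illustrative assumptions.
from difflib import SequenceMatcher

CORRECTIONS = ("that's not what i meant", "no, i asked", "that's wrong")

def frustration_signals(user_messages):
    signals = []
    for i, msg in enumerate(user_messages):
        low = msg.lower()
        if any(c in low for c in CORRECTIONS):
            signals.append(("correction", msg))
        for prev in user_messages[:i]:
            # Crude lexical similarity as a stand-in for semantic matching
            if SequenceMatcher(None, low, prev.lower()).ratio() > 0.8:
                signals.append(("rephrase", msg))
                break
    return signals

convo = [
    "Show me revenue by region for Q3",
    "No, that's not what I meant, revenue by region",
    "Show me revenue by region for Q3 please",
]
for kind, msg in frustration_signals(convo):
    print(kind, "->", msg)
```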

RAG agents:

  • "I don't know" detection — When the agent admits it doesn't have the answer, that's a signal. Aggregate these, find patterns, update your knowledge base.

  • Source attribution — Is the agent citing the right documents? This is measurable.

  • Hallucination detection — Is the response grounded in retrieved context, or is the agent making things up?

This is observability for agents. You're not blocking anything; you're measuring what's happening so you can improve.
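The "I don't know" aggregation above can be sketched in a few lines. Phrase matching is a deliberately simple stand-in here; a production monitor would more likely use an LLM-based classifier. The log records are illustrative.

```python
# Sketch: detecting "I don't know" responses and aggregating them to surface
# knowledge-base gaps. Phrase matching is a simple stand-in for a classifier.
from collections import Counter

IDK_PHRASES = ("i don't know", "i'm not sure", "i don't have information")

def is_idk(response: str) -> bool:
    text = response.lower()
    return any(phrase in text for phrase in IDK_PHRASES)

logs = [  # illustrative production records
    {"question": "How do I export my data?", "response": "I'm not sure how exports work."},
    {"question": "How do I reset my password?", "response": "Go to Settings > Security."},
    {"question": "How do I export my data?", "response": "I don't know, sorry."},
]

gaps = Counter(log["question"] for log in logs if is_idk(log["response"]))
print(gaps.most_common(1))  # the most frequent unanswered question
```

The output of this monitor is a ranked list of questions your knowledge base can't answer, which is directly actionable: write the missing doc.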

3. Guardrails (Real-time Enforcement)

What it is: Evaluators that run synchronously and can block or modify responses before they reach users.

When to use it:

  • Blocking jailbreak attempts (yes, people still try)

  • Preventing PII from leaking in responses (compliance requirement)

  • Enforcing that the agent stays on-topic for your use case

  • Stopping the agent from making commitments it shouldn't ("I'll refund you 100%")

The performance reality:

Guardrails don't have to add latency if you structure them right. Run the guardrail check in parallel with your LLM call — fire both off, wait for both to return, then decide. Since LLM calls take way longer than ML-based guardrail checks (jailbreak detection, PII detection typically run in milliseconds), the guardrail finishes first and you're just waiting on the LLM anyway. Net impact: zero.

For input guardrails (checking the user message before it hits the LLM), you do need to wait. But these checks are fast. For output guardrails (checking the response), run them in parallel with generation.
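The parallel pattern can be sketched with `asyncio`. Both coroutines here are stubs: the sleeps stand in for a real model call and a real ML-based jailbreak detector.

```python
# Sketch: firing the guardrail check and the LLM call at the same time, then
# deciding once both return. Both coroutines are stubs; the sleeps stand in
# for a real model call and a real ML-based jailbreak detector.
import asyncio

async def call_llm(prompt: str) -> str:
    await asyncio.sleep(0.2)  # stand-in for a slow LLM call
    return "Your order #123 has shipped."

async def jailbreak_check(prompt: str) -> bool:
    await asyncio.sleep(0.005)  # ML guardrails typically return in milliseconds
    return "ignore previous instructions" in prompt.lower()

async def answer(prompt: str) -> str:
    # Total wait is max(llm, guardrail), not the sum: the guardrail finishes
    # first, so you were only ever waiting on the LLM anyway.
    response, flagged = await asyncio.gather(call_llm(prompt), jailbreak_check(prompt))
    return "I can't help with that." if flagged else response

print(asyncio.run(answer("Where is my order?")))  # normal request passes through
```

`asyncio.gather` returns both results in order, so the decision point is a single line after both complete.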

What guardrails are NOT for:

Don't guardrail quality metrics. "Block all responses with faithfulness < 0.8" sounds reasonable until you realize you're now blocking 20% of responses and users are getting "I cannot respond to that" for totally normal questions.

Guardrails are for policy enforcement, not quality optimization. If something is merely suboptimal, monitor it. If something is actively harmful or forbidden, guardrail it.

4. Evaluators (The Scoring Functions)

What it is: The actual functions that score agent outputs. These are the building blocks used by experiments, monitors, and guardrails.

For agents in 2026, useful evaluators include:

| Category | Evaluator | What it catches |
| --- | --- | --- |
| Business | Escalation Detection | Agent handing off when it shouldn't |
| Business | Resolution Detection | Whether the issue was actually solved |
| Business | User Frustration | Rephrasing, "that's wrong", abandonment signals |
| RAG | Faithfulness | Response grounded in retrieved context |
| RAG | "I Don't Know" Detection | Gaps in your knowledge base |
| RAG | Context Relevancy | Retrieval returning useful documents |
| Safety | Jailbreak Detection | Prompt injection attempts |
| Safety | PII Detection | Personal data in responses |
| Format | Valid JSON/SQL | Agent outputting parseable structured data |

You don't "run evaluators" in isolation. You use them within experiments, monitors, or guardrails.
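The shared shape of an evaluator can be sketched as a function from agent output to a score and a verdict. The `EvalResult` fields below are an illustrative convention, not a specific framework's interface; the example implements the Valid JSON format check from the table.

```python
# Sketch: an evaluator as a scoring function. The result fields are an
# illustrative convention, not a specific framework's interface.
import json
from dataclasses import dataclass

@dataclass
class EvalResult:
    score: float   # 0.0 .. 1.0
    passed: bool
    details: str = ""

def valid_json_evaluator(output: str) -> EvalResult:
    """Format evaluator from the table: is the agent output parseable JSON?"""
    try:
        json.loads(output)
        return EvalResult(score=1.0, passed=True)
    except json.JSONDecodeError as exc:
        return EvalResult(score=0.0, passed=False, details=str(exc))

print(valid_json_evaluator('{"intent": "refund"}').passed)  # True
print(valid_json_evaluator("not json").passed)              # False
```

The same function can be called from an experiment (over a dataset), a monitor (over production traces), or a guardrail (inline, with `passed` deciding whether to block).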

The Mental Model

Evaluators are the scoring functions. Experiments, monitors, and guardrails are the three contexts they run in: before deployment, after responses are sent, and in the request path.
Common Mistakes

Mistake 1: Guardrailing quality metrics

"Let's block all responses where faithfulness is below 0.8."

Now 15% of your responses return "I cannot help with that" for totally reasonable questions. Users hate it. Quality metrics are for monitoring and improvement, not real-time blocking.

Mistake 2: Monitoring what should be guardrailed

"We'll track PII exposure rate in our dashboards."

Cool. You just exposed customer data 47 times last week and have a nice chart showing it. If something must not happen, guardrail it. Don't measure how often you failed.

Mistake 3: Only testing happy paths

Your experiment dataset has 500 examples of the agent doing its job correctly. Great. But what about:

  • Users who are frustrated and rude

  • Completely off-topic requests

  • Multi-turn conversations that go sideways

  • Edge cases from actual production logs

Pull failure cases from production into your test set. That's where the real edge cases live.

Mistake 4: Measuring metrics nobody acts on

You're tracking 47 different scores. Dashboards everywhere. When faithfulness drops from 0.82 to 0.79, what do you do?

If you don't have an action for when a metric moves, you don't need that metric. Start with metrics tied to business outcomes:

  • Escalation rate → adjust agent capabilities or prompt

  • "I don't know" rate → update knowledge base

  • User frustration → investigate and fix root cause

Choosing the right approach

| You want to... | Use |
| --- | --- |
| Validate prompt changes before deploying | Experiments |
| Let domain experts test their own changes | Experiments (UI) |
| Run quality gates in CI/CD | Experiments (SDK) |
| Track escalation rates, resolution rates | Monitors |
| Find gaps in your knowledge base | Monitors (detect "I don't know") |
| Alert when quality regresses | Monitors |
| Block jailbreak attempts | Guardrails |
| Prevent PII exposure | Guardrails |
| Stop off-topic responses | Guardrails |

Building the Full Stack

If you're shipping an agent to production, you need all three:

1. Experiments in your workflow

  • Domain experts validate prompt changes in the UI before publishing

  • Developers run experiments in CI/CD to catch regressions

  • Golden datasets for RAG accuracy (F1, precision, recall)

2. Monitors in production

  • Business metrics: escalation rate, resolution rate, user satisfaction

  • Quality metrics: faithfulness, relevancy, frustration signals

  • Gap detection: aggregate "I don't know" responses, find missing docs

3. Guardrails at runtime

  • Safety: jailbreak detection, PII prevention

  • Policy: staying on-topic, not making unauthorized commitments

FAQ

What's the difference between offline and online LLM evaluations?

Offline (experiments) = Testing on a dataset before deployment. Batch processing, controlled conditions, compare prompt/model variations.

Online (monitors) = Scoring production traffic after responses are sent. Real users, real edge cases, async processing that doesn't slow down your agent.

Do guardrails add latency?

Not if you structure them right. Run guardrail checks in parallel with your LLM call — the guardrail (fast ML model) finishes way before the LLM, so you're not adding wait time. For input guardrails you do need to check before calling the LLM, but these are milliseconds. The "guardrails are slow" myth comes from naive sequential implementations.

How do I measure ROI of my agent?

Track business outcomes, not abstract scores:

  • Support agent: Escalation rate (lower = more issues resolved by agent = more savings)

  • Data agent: Task completion rate, user return rate

  • RAG agent: "I don't know" rate (signals knowledge gaps), user satisfaction

These tie directly to whether the agent is doing its job.

Who should own experiments, developers or domain experts?

Both. Domain experts iterate on prompts and test via UI. Developers own CI/CD integration and automated quality gates. The platform should support both workflows, not force everyone into one.

What evaluators should I start with?

Depends on your agent type:

  • Customer support: Escalation Detection, Resolution Detection, User Frustration

  • RAG/Q&A: Faithfulness, "I Don't Know" Detection, Context Relevancy

  • Data/SQL agents: Query Accuracy, User Frustration, Task Completion

  • All agents: Jailbreak Detection (guardrail), PII Detection (guardrail)

How is this different from traditional software testing?

Same concepts, different implementation:

  • Experiments ≈ Unit/integration tests

  • Monitors ≈ APM (Datadog, etc.)

  • Guardrails ≈ Input validation, middleware

The difference: agent outputs are non-deterministic and require semantic evaluation. You can't just assert output == expected. You need evaluators that understand meaning.
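The contrast can be shown in a few lines. `judge` here is a hypothetical stand-in: a real semantic evaluator would use embeddings or an LLM judge, not keyword overlap, but the control flow is the same.

```python
# Sketch: why exact-match assertions fail for agents, and the shape of the
# semantic check that replaces them. `judge` is a hypothetical stand-in for
# an embedding-similarity or LLM-as-judge evaluator.
expected = "You can reset your password from Settings > Security."
actual = "Head to Settings, then Security, to reset your password."

assert actual != expected  # exact match fails even though the answer is right

def judge(expected: str, actual: str) -> bool:
    """Hypothetical stand-in for a semantic evaluator."""
    keywords = {"settings", "security", "password"}
    words = set(actual.lower().replace(",", " ").replace(".", " ").split())
    return keywords <= words

assert judge(expected, actual)  # the semantic check passes
```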

When should I build custom evaluators?

When built-in evaluators don't capture what you care about. Common cases:

  • Business-specific criteria ("did the agent correctly identify this as a billing issue?")

  • Domain expertise ("is this medical advice appropriate?")

  • Company policy ("did the agent follow our refund guidelines?")

LLM-as-Judge lets you define custom criteria without writing code: just describe what "good" looks like.
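An LLM-as-Judge criterion is ultimately a rubric prompt plus a parse of the verdict. In this sketch, `call_llm` is a stub (a real implementation calls your model provider), and the refund-guidelines rubric is an illustrative example.

```python
# Sketch: LLM-as-Judge as a rubric prompt plus verdict parsing. `call_llm`
# is a stub; the rubric text is an illustrative example.
JUDGE_PROMPT = """You are grading a support agent's reply.
Criterion: did the agent follow our refund guidelines (no refunds over
30 days without manager approval)?
Reply with exactly PASS or FAIL.

User: {user}
Agent: {agent}
"""

def call_llm(prompt: str) -> str:
    return "FAIL"  # stub: a real judge model would grade the transcript

def judge(user: str, agent: str) -> bool:
    verdict = call_llm(JUDGE_PROMPT.format(user=user, agent=agent))
    return verdict.strip() == "PASS"

print(judge("I bought this 60 days ago, refund me.", "Sure, full refund issued!"))
```

The criterion lives in plain language inside the prompt, which is what lets a domain expert change it without touching code.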

At LangWatch, we built experiments, monitors, and guardrails into one platform because this confusion kept costing teams time. Domain experts get a UI. Developers get SDKs. Everyone gets clarity on what "evaluations" actually means. Try it.

Ship agents with confidence, not crossed fingers

Get up and running with LangWatch in as little as 5 minutes.
