The Agent Evaluation Lifecycle
Core Concepts
When to Use What
| Use Case | Solution |
|---|---|
| Test prompt changes before deploying | Experiments |
| Compare different models or configurations | Experiments |
| Run quality checks in CI/CD | Experiments (in CI/CD) |
| Monitor production quality over time | Online Evaluation |
| Block harmful or policy-violating content | Guardrails |
| Get alerts when quality drops | Online Evaluation + Triggers |
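As a sketch of the CI/CD row above, a pipeline step can run an experiment against a fixed dataset and fail the build when quality drops below a threshold. Everything here (`run_model`, the dataset, the exact-match scorer) is a hypothetical stand-in for your own model call and evaluator, not a specific SDK API:

```python
import sys

# Hypothetical evaluation dataset of (input, expected) pairs; in practice
# this would be loaded from the dataset backing your experiment.
DATASET = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
]

def run_model(prompt: str) -> str:
    # Placeholder for the model/agent call under test.
    return {"What is 2 + 2?": "4", "Capital of France?": "Paris"}[prompt]

def exact_match_score(dataset) -> float:
    """Fraction of examples where the model output matches exactly."""
    hits = sum(run_model(q) == expected for q, expected in dataset)
    return hits / len(dataset)

QUALITY_THRESHOLD = 0.9

def ci_gate(dataset) -> int:
    """Return a process exit code: 0 = pass, 1 = quality regression."""
    score = exact_match_score(dataset)
    print(f"quality score: {score:.2f}")
    return 0 if score >= QUALITY_THRESHOLD else 1

if __name__ == "__main__":
    sys.exit(ci_gate(DATASET))
```

A non-zero exit code is what lets the CI system block the deploy; the scoring function is the part you would swap for a real evaluator.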
Quick Start
1. Run Your First Experiment
Test your LLM on a dataset using Experiments, via the UI or via code:

- Platform
- Python
- TypeScript
Go to Experiments and click “New Experiment” to get started with the UI.
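For the code path, the shape of an experiment is: run your model over each dataset item, score the output with an evaluator, and collect per-item results. The sketch below is a generic illustration of that loop with toy stand-ins (`run_experiment`, the lambda model and evaluator are all hypothetical names, not this platform's SDK):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ExperimentResult:
    input: str
    output: str
    score: float

def run_experiment(dataset, model: Callable, evaluator: Callable):
    """Run the model over each dataset item and score it with the evaluator."""
    results = []
    for item in dataset:
        output = model(item["input"])
        score = evaluator(output, item["expected"])
        results.append(ExperimentResult(item["input"], output, score))
    return results

# Toy stand-ins for a real dataset, model call, and evaluator.
dataset = [{"input": "What is 2 + 2?", "expected": "4"}]
model = lambda prompt: "4"
evaluator = lambda out, exp: 1.0 if out == exp else 0.0

results = run_experiment(dataset, model, evaluator)
```

Comparing two models or prompt versions is then just two `run_experiment` calls over the same dataset with different `model` arguments.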
2. Set Up Online Evaluation
Monitor your production traffic with evaluators that run on every trace:

- Go to Monitors
- Create a new monitor with the “When a message arrives” trigger
- Select evaluators (e.g., PII Detection, Faithfulness)
- Enable monitoring
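Conceptually, the monitor configured above attaches an evaluator to every incoming trace. The sketch below illustrates that idea with a minimal regex-based PII check; the hook name, trace shape, and detector are all assumptions for illustration, not the platform's actual evaluator implementation:

```python
import re

# Hypothetical, deliberately simple PII patterns (email and US-style phone).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def pii_detected(text: str) -> bool:
    """Return True if the text appears to contain an email or phone number."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))

alerts = []

def on_message_arrived(trace: dict) -> None:
    """Evaluator hook run on every trace ('When a message arrives')."""
    if pii_detected(trace["output"]):
        alerts.append({"trace_id": trace["id"], "evaluator": "pii_detection"})

# Simulated production traces flowing through the monitor.
on_message_arrived({"id": "t1", "output": "Contact me at jane@example.com"})
on_message_arrived({"id": "t2", "output": "The answer is 42."})
```

In the hosted setup, the platform runs this loop for you; combined with a trigger, the `alerts` list is what would become a notification when quality drops.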