Evaluators are scoring functions that assess the quality of your LLM’s outputs. They’re the building blocks for experiments, online evaluation, and guardrails.

Choose Your Approach

There are three ways to evaluate your LLM outputs with LangWatch:

Which should I use?

| Approach | Slug Format | Best For |
| --- | --- | --- |
| Built-in Evaluators | provider/evaluator (e.g., ragas/faithfulness) | Quick setup, standard evaluation methods |
| Saved Evaluators | evaluators/{slug} (e.g., evaluators/my-checker) | Team collaboration, UI-based configuration |
| Custom Scoring | N/A (you send the score directly) | Proprietary logic, domain-specific metrics |
Do you have your own evaluation logic?
├─ Yes → Use Custom Scoring
└─ No → Do you want to configure via UI and reuse?
         ├─ Yes → Use Saved Evaluators
         └─ No → Use Built-in Evaluators

What is an Evaluator?

An evaluator takes inputs (like the user question, LLM response, and optionally context or expected output) and returns a score indicating quality along some dimension.
Input + Output + Context → Evaluator → Score

                              passed: true/false
                              score: 0.0 - 1.0
                              details: "explanation"
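
In code, that result can be read back directly. A minimal sketch (the passed, score, and details attribute names mirror the diagram above; confirm the exact result shape for your SDK version):
import langwatch

# Run a built-in evaluator and capture the returned result
result = langwatch.evaluation.evaluate(
    "ragas/faithfulness",
    name="Faithfulness Check",
    data={"input": user_input, "output": response, "contexts": contexts},
)

# Fields from the diagram above (attribute names assumed; verify against your SDK version)
print(result.passed)   # True / False
print(result.score)    # 0.0 - 1.0
print(result.details)  # explanation of the verdict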

Built-in Evaluator Categories

LangWatch provides a library of ready-to-use evaluators:
| Category | Examples | Use Case |
| --- | --- | --- |
| RAG Quality | Faithfulness, Context Precision, Context Recall | Evaluate retrieval-augmented generation |
| Safety | PII Detection, Jailbreak Detection, Content Moderation | Detect harmful content |
| Correctness | Exact Match, LLM Answer Match, Factual Match | Check answer accuracy |
| Format | Valid JSON, Valid Format, SQL Query Equivalence | Validate output structure |
| Custom Criteria | LLM-as-Judge (Boolean, Score, Category) | Custom evaluation prompts |
Browse all evaluators →

Quick Examples

Using a Built-in Evaluator

import langwatch

# Use directly by slug
langwatch.evaluation.evaluate(
    "ragas/faithfulness",  # Built-in evaluator
    name="Faithfulness Check",
    data={
        "input": user_input,
        "output": response,
        "contexts": contexts,
    },
)

Using a Saved Evaluator

import langwatch

# Use your saved evaluator by its slug
langwatch.evaluation.evaluate(
    "evaluators/my-tone-checker",  # Saved on platform
    name="Tone Check",
    data={
        "input": user_input,
        "output": response,
    },
)

Sending Custom Scores

import langwatch

# Run your own logic and send the result
score = my_custom_evaluator(user_input, response)

langwatch.get_current_span().add_evaluation(
    name="my_custom_metric",
    passed=score > 0.7,
    score=score,
)

Using Evaluators

In Experiments

Run evaluators on each row of your test dataset for batch evaluation:
experiment = langwatch.experiment.init("my-experiment")

for idx, row in experiment.loop(df.iterrows()):
    response = my_llm(row["input"])

    experiment.evaluate(
        "ragas/faithfulness",
        index=idx,
        data={
            "input": row["input"],
            "output": response,
            "contexts": row["contexts"],
        },
    )
Learn more about experiments →

In Online Evaluation (Monitors)

Run evaluators automatically on production traces:
  1. Create a monitor in LangWatch
  2. Select evaluators to run
  3. Configure when to trigger (all traces, sampled, filtered)
  4. Scores appear on traces and dashboards
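Monitors run server-side, so no evaluation code is needed in your application; the only requirement is that traces are being sent to LangWatch. A minimal sketch, assuming your app is instrumented with the Python SDK's @langwatch.trace() decorator and your LangWatch API key is already configured:
import langwatch

@langwatch.trace()  # records this call as a trace that monitors can evaluate
def answer(user_input: str) -> str:
    response = my_llm(user_input)  # your existing LLM call
    return response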
Learn more about online evaluation →

As Guardrails

Use evaluators to block harmful content in real time:
guardrail = langwatch.evaluation.evaluate(
    "azure/jailbreak",
    name="Jailbreak Detection",
    data={"input": user_input},
    as_guardrail=True,
)

if not guardrail.passed:
    return "I can't help with that request."
Learn more about guardrails →

Evaluator Inputs

Different evaluators require different inputs:
| Input | Description | Example Evaluators |
| --- | --- | --- |
| input | User question/prompt | Jailbreak Detection, Off-Topic |
| output | LLM response | PII Detection, Valid Format |
| contexts | Retrieved documents | Faithfulness, Context Precision |
| expected_output | Ground truth answer | Answer Correctness, Exact Match |
| conversation | Full conversation history | Conversation Relevancy |
Check each evaluator’s documentation for required and optional inputs.
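For example, correctness-style evaluators compare the response against a ground truth, so they need expected_output alongside input and output. A hedged sketch (the exact-match slug and the expected_answer variable are illustrative; look up the real slug in the evaluators list):
import langwatch

# "langevals/exact_match" is an assumed slug for illustration; check the evaluators list
langwatch.evaluation.evaluate(
    "langevals/exact_match",
    name="Exact Match Check",
    data={
        "input": user_input,
        "output": response,
        "expected_output": expected_answer,  # ground truth answer for this question
    },
)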

The name Parameter

Important: Always provide a descriptive name when running evaluators. This helps identify evaluation results in Analytics and traces.
# Good - descriptive name
langwatch.evaluation.evaluate(
    "langevals/llm_category",
    name="Answer Completeness Check",  # Descriptive!
    data={...},
)

# Bad - no name, hard to track
langwatch.evaluation.evaluate(
    "langevals/llm_category",
    data={...},
)

Next Steps