LangWatch provides a library of ready-to-use evaluators for common evaluation tasks. You can use these directly in your code without any setup on the platform.
When to use Built-in Evaluators:
  • You want to quickly add evaluation without platform configuration
  • You’re running experiments or online evaluations programmatically
  • You want to use well-tested, standardized evaluation methods
Available Evaluators

LangWatch offers evaluators across several categories:
| Category | Examples | Use Case |
| --- | --- | --- |
| RAG Quality | ragas/faithfulness, ragas/context_precision | Evaluate retrieval-augmented generation |
| Safety | presidio/pii_detection, azure/jailbreak | Detect PII, jailbreaks, harmful content |
| Correctness | langevals/exact_match, langevals/llm_boolean | Check answer accuracy |
| Custom Criteria | langevals/llm_boolean, langevals/llm_score | LLM-as-Judge for custom checks |
Browse all evaluators →

Using Built-in Evaluators

In Experiments

Run evaluators on your test dataset during batch evaluation:
import langwatch

df = langwatch.datasets.get_dataset("my-dataset").to_pandas()

experiment = langwatch.experiment.init("my-experiment")

for index, row in experiment.loop(df.iterrows()):
    # Your LLM call
    output = my_llm(row["input"])

    # Run built-in evaluator
    experiment.evaluate(
        "ragas/faithfulness",  # Built-in evaluator slug
        index=index,
        data={
            "input": row["input"],
            "output": output,
            "contexts": row["contexts"],
        },
    )

In Online Evaluation

Run evaluators on production traces in real time:
import langwatch

@langwatch.span()
def my_llm_step(user_input: str):
    # Your LLM call
    output = my_llm(user_input)

    # Run evaluator on production traffic
    result = langwatch.evaluation.evaluate(
        "presidio/pii_detection",  # Built-in evaluator slug
        name="PII Check",
        data={
            "input": user_input,
            "output": output,
        },
    )

    return output

As Guardrails

Use evaluators to block harmful content before responding:
import langwatch

@langwatch.span()
def my_llm_step(user_input: str):
    # Check input before processing
    guardrail = langwatch.evaluation.evaluate(
        "azure/jailbreak",  # Built-in evaluator slug
        name="Jailbreak Detection",
        data={"input": user_input},
        as_guardrail=True,
    )

    if not guardrail.passed:
        return "I can't help with that request."

    # Safe to proceed
    return my_llm(user_input)
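
The same pattern can gate the response itself before it is returned. A minimal sketch, assuming presidio/pii_detection accepts an output field (see the inputs table below) and that as_guardrail=True returns an object with a passed flag, as in the example above:

import langwatch

@langwatch.span()
def my_llm_step(user_input: str):
    output = my_llm(user_input)

    # Hypothetical output-side guardrail: block responses that would leak PII
    guardrail = langwatch.evaluation.evaluate(
        "presidio/pii_detection",  # Built-in evaluator slug
        name="Output PII Guardrail",
        data={"output": output},
        as_guardrail=True,
    )

    if not guardrail.passed:
        return "Sorry, I can't share that information."

    return output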

Evaluator Inputs

Different evaluators require different inputs. Check the evaluator list for each evaluator’s requirements.
| Input | Description | Example Evaluators |
| --- | --- | --- |
| input | User question/prompt | Jailbreak Detection, Off-Topic |
| output | LLM response | PII Detection, Valid Format |
| contexts | Retrieved documents (array) | Faithfulness, Context Precision |
| expected_output | Ground truth answer | Answer Correctness, Exact Match |
| conversation | Conversation history | Conversation Relevancy |
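
For example, correctness evaluators compare the output against a ground truth. A minimal sketch inside the experiment loop shown earlier, assuming your dataset has an expected_output column (the column name is an assumption):

    # Inside the experiment.loop(...) iteration from the batch example above
    experiment.evaluate(
        "langevals/exact_match",  # Needs output and expected_output (see table)
        index=index,
        data={
            "output": output,
            "expected_output": row["expected_output"],  # Assumed dataset column name
        },
    )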

Configuring Settings

Many evaluators accept configuration settings:
experiment.evaluate(
    "langevals/llm_boolean",
    index=index,
    data={"input": question, "output": response},
    settings={
        "model": "openai/gpt-4o-mini",
        "prompt": "Does this response fully answer the question? Reply true or false.",
    },
)

The name Parameter

Always provide a descriptive name when using evaluators in online evaluation. This helps track results in Analytics.
# Good - descriptive name
langwatch.evaluation.evaluate(
    "langevals/llm_category",
    name="Tone Checker",  # Shows up in Analytics
    data={...},
)

# Bad - no name, hard to track
langwatch.evaluation.evaluate(
    "langevals/llm_category",
    data={...},
)

Next Steps