LangWatch provides a library of ready-to-use evaluators for common evaluation tasks. You can use these directly in your code without any setup on the platform.
When to use Built-in Evaluators:
- You want to quickly add evaluation without platform configuration
- You’re running experiments or online evaluations programmatically
- You want to use well-tested, standardized evaluation methods
Available Evaluators
LangWatch offers evaluators across several categories:
| Category | Examples | Use Case |
|---|---|---|
| RAG Quality | ragas/faithfulness, ragas/context_precision | Evaluate retrieval-augmented generation |
| Safety | presidio/pii_detection, azure/jailbreak | Detect PII, jailbreaks, harmful content |
| Correctness | langevals/exact_match, langevals/llm_boolean | Check answer accuracy |
| Custom Criteria | langevals/llm_boolean, langevals/llm_score | LLM-as-Judge for custom checks |
Browse all evaluators →
Using Built-in Evaluators
In Experiments
Run evaluators on your test dataset during batch evaluation:
```python
import langwatch

df = langwatch.datasets.get_dataset("my-dataset").to_pandas()
experiment = langwatch.experiment.init("my-experiment")

for index, row in experiment.loop(df.iterrows()):
    # Your LLM call
    output = my_llm(row["input"])

    # Run built-in evaluator
    experiment.evaluate(
        "ragas/faithfulness",  # Built-in evaluator slug
        index=index,
        data={
            "input": row["input"],
            "output": output,
            "contexts": row["contexts"],
        },
    )
```
In Online Evaluation
Run evaluators on production traces in real time:
```python
import langwatch

@langwatch.span()
def my_llm_step(user_input: str):
    # Your LLM call
    output = my_llm(user_input)

    # Run evaluator on production traffic
    result = langwatch.evaluation.evaluate(
        "presidio/pii_detection",  # Built-in evaluator slug
        name="PII Check",
        data={
            "input": user_input,
            "output": output,
        },
    )

    return output
```
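The `@langwatch.span()` decorator records this step as part of an enclosing trace. If the calling code is not already traced, a minimal sketch, assuming the SDK's standard `@langwatch.trace()` decorator, would look like this:

```python
import langwatch

@langwatch.trace()
def handle_message(user_input: str) -> str:
    # The span created inside my_llm_step (and its evaluation result)
    # is attached to this trace when it is sent to LangWatch.
    return my_llm_step(user_input)
```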
As Guardrails
Use evaluators to block harmful content before responding:
```python
import langwatch

@langwatch.span()
def my_llm_step(user_input: str):
    # Check input before processing
    guardrail = langwatch.evaluation.evaluate(
        "azure/jailbreak",  # Built-in evaluator slug
        name="Jailbreak Detection",
        data={"input": user_input},
        as_guardrail=True,
    )
    if not guardrail.passed:
        return "I can't help with that request."

    # Safe to proceed
    return my_llm(user_input)
```
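The same pattern can guard the model's output before it is returned, for example to stop a response that leaks personal data. A sketch reusing the presidio/pii_detection evaluator from the safety category (the "Output PII Guardrail" label and the fallback message are illustrative):

```python
import langwatch

@langwatch.span()
def my_llm_step(user_input: str):
    output = my_llm(user_input)

    # Check the generated output before returning it to the user
    guardrail = langwatch.evaluation.evaluate(
        "presidio/pii_detection",  # Built-in evaluator slug
        name="Output PII Guardrail",
        data={"output": output},
        as_guardrail=True,
    )
    if not guardrail.passed:
        return "Sorry, I can't share that information."

    return output
```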
Different evaluators require different inputs. Check the evaluator list for each evaluator’s requirements.
| Input | Description | Example Evaluators |
|---|---|---|
| input | User question/prompt | Jailbreak Detection, Off-Topic |
| output | LLM response | PII Detection, Valid Format |
| contexts | Retrieved documents (array) | Faithfulness, Context Precision |
| expected_output | Ground-truth answer | Answer Correctness, Exact Match |
| conversation | Conversation history | Conversation Relevancy |
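For example, correctness evaluators such as langevals/exact_match compare the output against a ground truth, so the data dict needs an expected_output field. A sketch for the batch-evaluation loop shown earlier, assuming your dataset has an expected_output column:

```python
for index, row in experiment.loop(df.iterrows()):
    output = my_llm(row["input"])

    # Correctness checks need the ground-truth answer from the dataset
    experiment.evaluate(
        "langevals/exact_match",  # Built-in evaluator slug
        index=index,
        data={
            "output": output,
            "expected_output": row["expected_output"],  # assumed dataset column
        },
    )
```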
Configuring Settings
Many evaluators accept configuration settings:
```python
experiment.evaluate(
    "langevals/llm_boolean",
    index=index,
    data={"input": question, "output": response},
    settings={
        "model": "openai/gpt-4o-mini",
        "prompt": "Does this response fully answer the question? Reply true or false.",
    },
)
```
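The same pattern works for scoring evaluators such as langevals/llm_score from the table above. Available settings differ per evaluator, so treat the keys below as an example and check the evaluator list for the exact options:

```python
experiment.evaluate(
    "langevals/llm_score",
    index=index,
    data={"input": question, "output": response},
    settings={
        # Example settings; consult the evaluator's docs for the exact keys
        "model": "openai/gpt-4o-mini",
        "prompt": "Score from 0 to 1 how completely the response answers the question.",
    },
)
```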
The name Parameter
Always provide a descriptive name when using evaluators in online evaluation. This helps track results in Analytics.
```python
# Good - descriptive name
langwatch.evaluation.evaluate(
    "langevals/llm_category",
    name="Tone Checker",  # Shows up in Analytics
    data={...},
)

# Bad - no name, hard to track
langwatch.evaluation.evaluate(
    "langevals/llm_category",
    data={...},
)
```
Next Steps