Choose Your Approach
There are three ways to evaluate your LLM outputs with LangWatch:
Built-in Evaluators
Use LangWatch’s library of evaluators directly in your code.
Saved Evaluators
Create reusable evaluator configs on the platform.
Custom Scoring
Send scores from your own evaluation logic.
Which should I use?
| Approach | Slug Format | Best For |
|---|---|---|
| Built-in Evaluators | provider/evaluator (e.g., ragas/faithfulness) | Quick setup, standard evaluation methods |
| Saved Evaluators | evaluators/{slug} (e.g., evaluators/my-checker) | Team collaboration, UI-based configuration |
| Custom Scoring | N/A - you send the score directly | Proprietary logic, domain-specific metrics |
Decision flowchart
What is an Evaluator?
An evaluator takes inputs (like the user question, LLM response, and optionally context or expected output) and returns a score indicating quality along some dimension.
Built-in Evaluator Categories
LangWatch provides a library of ready-to-use evaluators:
| Category | Examples | Use Case |
|---|---|---|
| RAG Quality | Faithfulness, Context Precision, Context Recall | Evaluate retrieval-augmented generation |
| Safety | PII Detection, Jailbreak Detection, Content Moderation | Detect harmful content |
| Correctness | Exact Match, LLM Answer Match, Factual Match | Check answer accuracy |
| Format | Valid JSON, Valid Format, SQL Query Equivalence | Validate output structure |
| Custom Criteria | LLM-as-Judge (Boolean, Score, Category) | Custom evaluation prompts |
Quick Examples
Using a Built-in Evaluator
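A minimal sketch of calling a built-in evaluator by its provider/evaluator slug from Python. The langwatch.evaluations.evaluate call and the fields on its result are assumptions for illustration, not a confirmed SDK signature; check the Python SDK reference for the exact API.

```python
import langwatch

# Hypothetical sketch: the evaluate call below is an assumed API,
# see the SDK reference for the exact signature.
result = langwatch.evaluations.evaluate(
    "ragas/faithfulness",  # built-in evaluators use the provider/evaluator slug format
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    contexts=["France's capital city is Paris."],
)

print(result.score)   # numeric quality score
print(result.passed)  # pass/fail verdict, when the evaluator provides one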
Using a Saved Evaluator
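The same hypothetical call works for a saved evaluator; the only difference is the evaluators/{slug} format, which points at the config your team created on the platform.

```python
import langwatch

# Hypothetical sketch, same assumed API as above: a saved evaluator is
# referenced by evaluators/{slug} instead of provider/evaluator.
result = langwatch.evaluations.evaluate(
    "evaluators/my-checker",  # slug of the evaluator saved in the LangWatch UI
    input="Summarize our refund policy.",
    output="Refunds are available within 30 days of purchase.",
)
print(result.score, result.passed)
```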
Sending Custom Scores
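For custom scoring you run your own evaluation logic and attach the resulting score to the trace yourself. A minimal sketch, assuming the SDK's trace decorator and an add_evaluation method on the current trace; the method name and its fields are assumptions, so check the SDK reference.

```python
import langwatch

def my_llm_call(question: str) -> str:
    return "stub answer"  # placeholder for your own generation logic

@langwatch.trace()
def answer(question: str) -> str:
    response = my_llm_call(question)

    # Run any proprietary, domain-specific evaluation logic you like...
    score = len(response) / 1000  # toy metric for illustration

    # ...then attach the score to the current trace. The add_evaluation
    # method name and fields are assumptions; see the SDK reference.
    langwatch.get_current_trace().add_evaluation(
        name="my-custom-metric",
        score=score,
        passed=score > 0.5,
    )
    return response
```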
Using Evaluators
In Experiments
Run evaluators on each row of your test dataset for batch evaluation, as in the sketch below.
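A minimal batch-evaluation sketch, reusing the hypothetical evaluate call from the examples above; the langevals/llm_answer_match slug is assumed here, and the full Experiments workflow is covered in the Experiments docs.

```python
import langwatch

def my_llm_call(question: str) -> str:
    return "stub answer"  # placeholder for your own generation logic

dataset = [
    {"input": "What is LangWatch?", "expected_output": "An LLM observability platform."},
    # ...more test rows
]

for row in dataset:
    output = my_llm_call(row["input"])
    # Hypothetical call, consistent with the sketches above; exact API may differ.
    result = langwatch.evaluations.evaluate(
        "langevals/llm_answer_match",  # assumed slug for the LLM Answer Match evaluator
        input=row["input"],
        output=output,
        expected_output=row["expected_output"],
    )
    print(row["input"], "->", result.score)
```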
In Online Evaluation (Monitors)
Run evaluators automatically on production traces:
- Create a monitor in LangWatch
- Select evaluators to run
- Configure when to trigger (all traces, sampled, filtered)
- Scores appear on traces and dashboards
As Guardrails
Use evaluators to block harmful content in real time, as in the sketch below.
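A minimal guardrail sketch: run a safety evaluator on the user input before generating, and refuse if it fails. The as_guardrail flag and the azure/jailbreak slug are assumptions for illustration; see the guardrails docs for the exact API.

```python
import langwatch

def my_llm_call(question: str) -> str:
    return "stub answer"  # placeholder for your own generation logic

@langwatch.trace()
def respond(user_input: str) -> str:
    # Hypothetical guardrail call; the as_guardrail flag is an assumption,
    # see the guardrails docs for the exact API.
    guardrail = langwatch.evaluations.evaluate(
        "azure/jailbreak",  # assumed slug for a jailbreak detection evaluator
        input=user_input,
        as_guardrail=True,
    )
    if not guardrail.passed:
        return "Sorry, I can't help with that request."
    return my_llm_call(user_input)
```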
Evaluator Inputs
Different evaluators require different inputs:
| Input | Description | Example Evaluators |
|---|---|---|
| input | User question/prompt | Jailbreak Detection, Off-Topic |
| output | LLM response | PII Detection, Valid Format |
| contexts | Retrieved documents | Faithfulness, Context Precision |
| expected_output | Ground truth answer | Answer Correctness, Exact Match |
| conversation | Full conversation history | Conversation Relevancy |
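Several of these inputs can be supplied in a single call. A sketch consistent with the hypothetical examples above, using ragas/context_precision as an evaluator that consumes retrieved contexts alongside the question and answer; the exact API may differ.

```python
import langwatch

# Hypothetical sketch: a RAG evaluator taking several of the inputs listed above.
result = langwatch.evaluations.evaluate(
    "ragas/context_precision",
    input="How do I reset my password?",
    output="Click 'Forgot password' on the login page.",
    contexts=["Users can reset their password via the 'Forgot password' link."],
    expected_output="Use the 'Forgot password' link on the login screen.",
)
print(result.score)
```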