Online evaluation lets you continuously score your LLM's production traffic. Unlike experiments, which test before deployment, online evaluation monitors your live application to catch quality issues, detect regressions, and ensure safety.
In the LangWatch platform, online evaluation is implemented through Monitors - automated rules that score incoming traces based on evaluators you configure.
How It Works
- Your application sends traces to LangWatch (via SDK integration)
- Monitors evaluate incoming traces using your configured evaluators
- Scores are recorded and displayed on dashboards
- Optionally trigger alerts when scores drop below thresholds
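The four steps above can be sketched in plain Python. This is an illustrative model of the flow, not the LangWatch SDK: `Trace`, `Monitor`, and `pii_evaluator` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Trace:
    """Sketch of an incoming trace (step 1: sent by your application)."""
    trace_id: str
    input: str
    output: str
    scores: dict = field(default_factory=dict)

def pii_evaluator(trace: Trace) -> float:
    """Toy evaluator: score 0.0 if the output looks like it leaks an email."""
    return 0.0 if "@" in trace.output else 1.0

@dataclass
class Monitor:
    name: str
    evaluator: Callable[[Trace], float]
    alert_threshold: float = 0.5

    def on_trace(self, trace: Trace) -> None:
        score = self.evaluator(trace)      # step 2: evaluate the incoming trace
        trace.scores[self.name] = score    # step 3: record the score
        if score < self.alert_threshold:   # step 4: optionally alert
            print(f"ALERT: {self.name} scored {score} on {trace.trace_id}")

monitor = Monitor(name="pii_check", evaluator=pii_evaluator)
t = Trace(trace_id="t1", input="hi", output="contact me at bob@example.com")
monitor.on_trace(t)
print(t.scores)  # {'pii_check': 0.0}
```

In the real platform the evaluator, threshold, and alert destination are configured in the Monitor UI rather than in code.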
When to Use Online Evaluation
| Use Case | Example |
|---|---|
| Quality monitoring | Track faithfulness, relevance, or custom quality metrics over time |
| Safety monitoring | Detect PII leakage, jailbreak attempts, or policy violations |
| Regression detection | Get alerts when quality metrics drop after deployments |
| Dataset building | Automatically add low-scoring traces to datasets for improvement |
Monitors vs Guardrails
Both use evaluators, but they serve different purposes:

| Monitors | Guardrails |
|---|---|
| Measure quality asynchronously | Block harmful content in real-time |
| Run after the response is sent | Run before/during response generation |
| Feed dashboards and alerts | Return errors or safe responses to users |
| For observability | For enforcement |
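The contrast in the table can be sketched as two request handlers sharing one evaluator. This is illustrative pseudocode, not the LangWatch API: the handler and evaluator names are invented for the example.

```python
def moderation_evaluator(text: str) -> float:
    """Toy content-moderation score: 1.0 is safe, 0.0 is unsafe."""
    return 0.0 if "forbidden" in text else 1.0

def handle_with_guardrail(generate, user_input: str) -> str:
    """Guardrail: evaluate BEFORE returning; block unsafe output."""
    draft = generate(user_input)
    if moderation_evaluator(draft) < 0.5:
        return "Sorry, I can't help with that."  # safe response to the user
    return draft

def handle_with_monitor(generate, user_input: str, scores: list) -> str:
    """Monitor: return the response immediately; score it for dashboards."""
    response = generate(user_input)
    scores.append(moderation_evaluator(response))  # in practice, queued async
    return response

gen = lambda q: f"answer about {q}"
scores = []
print(handle_with_guardrail(gen, "forbidden topic"))        # blocked
print(handle_with_monitor(gen, "forbidden topic", scores))  # sent, but scored
```

The guardrail trades latency for enforcement; the monitor adds no latency but only observes.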
Getting Started
Set Up Monitors
Evaluation by Thread
Quick Setup
1. Ensure traces are being sent
First, make sure your application is sending traces to LangWatch:
- Python
- TypeScript
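As a rough Python sketch, a trace is a JSON payload of spans sent to the LangWatch collector. The payload shape and endpoint below are assumptions for illustration, not the documented API; in practice use the official SDK (`pip install langwatch`).

```python
import json

def build_trace_payload(trace_id: str, user_input: str, llm_output: str) -> dict:
    """Assemble a minimal trace with one LLM span (assumed shape, for illustration)."""
    return {
        "trace_id": trace_id,
        "spans": [
            {
                "type": "llm",
                "input": {"type": "text", "value": user_input},
                "output": {"type": "text", "value": llm_output},
            }
        ],
    }

payload = build_trace_payload("trace-123", "What is RAG?", "RAG is ...")
body = json.dumps(payload)

# To actually send it you would POST with your API key, e.g.:
# import requests, os
# requests.post(
#     "https://app.langwatch.ai/api/collector",  # assumed endpoint
#     headers={"X-Auth-Token": os.environ["LANGWATCH_API_KEY"]},
#     data=body,
# )
```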
2. Create a Monitor
- Go to Evaluations in LangWatch
- Click New Evaluation
- Select Real-time evaluation (this creates a Monitor)
- Choose “When a message arrives” as the trigger
- Select evaluators (e.g., PII Detection, Faithfulness)
- Configure any filters (optional)
- Enable monitoring
3. View Results
Once enabled, scores will appear on:
- Traces - individual trace scores visible in trace details
- Analytics - Aggregate metrics over time
- Alerts - Configure automations for low scores
Adding Scores via Code
You can also add scores programmatically during request processing:
- Python
- TypeScript
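A minimal sketch of the idea in Python, with invented names (`add_score`, `handle_request`, the in-memory `current_trace_scores` store) standing in for the SDK's actual score-reporting call:

```python
# In-memory stand-in for LangWatch's score storage, for illustration only.
current_trace_scores: dict = {}

def add_score(trace_id: str, name: str, score: float, passed: bool) -> None:
    """Attach a named evaluation score to a trace (hypothetical helper)."""
    current_trace_scores.setdefault(trace_id, {})[name] = {
        "score": score,
        "passed": passed,
    }

def handle_request(trace_id: str, question: str) -> str:
    answer = f"stub answer to {question}"  # your LLM call goes here
    # Toy relevancy check: does the answer mention the question's first word?
    relevance = 1.0 if question.split()[0].lower() in answer.lower() else 0.3
    add_score(trace_id, "answer_relevancy", relevance, passed=relevance >= 0.5)
    return answer

handle_request("trace-42", "what is a monitor?")
print(current_trace_scores["trace-42"])
```

Scores added this way feed the same dashboards and alerts as Monitor-computed scores.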
Available Evaluators
Monitors can use any evaluator from the LangWatch library:
- Quality: Faithfulness, Answer Relevancy, Coherence
- Safety: PII Detection, Jailbreak Detection, Content Moderation
- RAG: Context Precision, Context Recall, Groundedness
- Custom: LLM-as-Judge with your own criteria