Online evaluation lets you continuously score your LLM’s production traffic. Unlike experiments, which test changes before deployment, online evaluation monitors your live application to catch quality issues, detect regressions, and ensure safety.

In the LangWatch platform, online evaluation is implemented through Monitors: automated rules that score incoming traces using the evaluators you configure.
## How It Works
```
User Request → Your LLM → Response → LangWatch Trace → Monitor → Score
                                                                   ↓
                                                          Dashboard & Alerts
```
- Your application sends traces to LangWatch (via SDK integration)
- Monitors evaluate incoming traces using your configured evaluators
- Scores are recorded and displayed on dashboards
- Optionally, alerts fire when scores drop below configured thresholds
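A minimal sketch of the application side of this flow, assuming placeholder names (`answer_question`, `generate_response`) for your own code: the only thing you write is the tracing itself; Monitors score the trace on the LangWatch side after it arrives.

```python
import langwatch


def generate_response(user_input: str) -> str:
    # Placeholder for your actual LLM call.
    return "..."


@langwatch.trace()  # every call to this function produces a LangWatch trace
def answer_question(user_input: str) -> str:
    return generate_response(user_input)


# The trace is sent to LangWatch in the background once the function
# returns; any enabled Monitor then scores it asynchronously. No
# scoring code runs in your application.
answer_question("What is online evaluation?")
```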
## When to Use Online Evaluation
| Use Case | Example |
|---|---|
| Quality monitoring | Track faithfulness, relevance, or custom quality metrics over time |
| Safety monitoring | Detect PII leakage, jailbreak attempts, or policy violations |
| Regression detection | Get alerts when quality metrics drop after deployments |
| Dataset building | Automatically add low-scoring traces to datasets for improvement |
## Monitors vs Guardrails
Both use evaluators, but serve different purposes:
| Monitors | Guardrails |
|---|---|
| Measure quality asynchronously | Block harmful content in real-time |
| Run after the response is sent | Run before/during response generation |
| Feed dashboards and alerts | Return errors or safe responses to users |
| For observability | For enforcement |
If you need to block harmful content before it reaches users, see Guardrails.
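To make the distinction concrete, here is an illustrative sketch, not the platform Guardrails API (see the Guardrails docs for that). `handle_request`, `generate_response`, and `looks_like_pii` are hypothetical placeholders; only the `add_evaluation` call mirrors the SDK usage shown later on this page.

```python
import langwatch


def generate_response(user_input: str) -> str:
    # Stand-in for your actual LLM call.
    return "..."


def looks_like_pii(text: str) -> bool:
    # Hypothetical stand-in for a real PII detector.
    return "@" in text


@langwatch.trace()
def handle_request(user_input: str) -> str:
    response = generate_response(user_input)

    # Guardrail-style check: runs before the user sees anything and can
    # block or replace the response (enforcement).
    if looks_like_pii(response):
        return "Sorry, I can't share that information."

    # Monitor-style scoring: records a score on the trace for dashboards
    # and alerts without changing what the user receives (observability).
    # Platform Monitors do this automatically on incoming traces.
    langwatch.get_current_span().add_evaluation(
        name="pii_check",
        passed=True,
        score=1.0,
        details="No PII detected in the response",
    )
    return response
```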
## Getting Started
### Quick Setup
#### 1. Ensure traces are being sent
First, make sure your application is sending traces to LangWatch:
```python
import langwatch

@langwatch.trace()
def my_llm_app(user_input):
    # Your LLM logic here
    response = generate_response(user_input)
    return response
```
```typescript
import { LangWatch } from "langwatch";

const langwatch = new LangWatch();
const trace = langwatch.getTrace();
// Your LLM logic here
trace.end();
```
#### 2. Create a Monitor
- Go to Evaluations in LangWatch
- Click New Evaluation
- Select Real-time evaluation (this creates a Monitor)
- Choose “When a message arrives” as the trigger
- Select evaluators (e.g., PII Detection, Faithfulness)
- Configure any filters (optional)
- Enable monitoring
#### 3. View Results
Once enabled, scores will appear on:
- Traces: Individual trace scores visible in trace details
- Analytics: Aggregate metrics over time
- Alerts: Configure triggers for low scores
## Adding Scores via Code
You can also add scores programmatically during request processing:
```python
import langwatch

@langwatch.trace()
def my_llm_app(user_input):
    response = generate_response(user_input)

    # Add a custom score
    langwatch.get_current_span().add_evaluation(
        name="response_quality",
        passed=True,
        score=0.95,
        details="High quality response"
    )
    return response
```
```typescript
const trace = langwatch.getTrace();

// After generating response
trace.addEvaluation({
  name: "response_quality",
  passed: true,
  score: 0.95,
  details: "High quality response",
});
```
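The `passed` flag and numeric `score` don't have to be hard-coded: you can derive them from whatever check your application runs. A small sketch, assuming a hypothetical `estimate_quality` heuristic that returns a value between 0 and 1 (`generate_response` is again a stand-in for your LLM call):

```python
import langwatch


def generate_response(user_input: str) -> str:
    # Stand-in for your actual LLM call.
    return "..."


def estimate_quality(user_input: str, response: str) -> float:
    # Hypothetical heuristic; replace with your own scoring logic.
    return 0.0 if not response.strip() else min(1.0, len(response) / 500)


@langwatch.trace()
def my_llm_app(user_input):
    response = generate_response(user_input)

    quality = estimate_quality(user_input, response)
    langwatch.get_current_span().add_evaluation(
        name="response_quality",
        passed=quality >= 0.7,  # mark the trace as failing below the threshold
        score=quality,
        details=f"Heuristic quality estimate: {quality:.2f}",
    )
    return response
```

Low-scoring traces then show up on the same dashboards and can feed the dataset-building workflow mentioned above.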
## Available Evaluators
Monitors can use any evaluator from the LangWatch library:
- Quality: Faithfulness, Answer Relevancy, Coherence
- Safety: PII Detection, Jailbreak Detection, Content Moderation
- RAG: Context Precision, Context Recall, Groundedness
- Custom: LLM-as-Judge with your own criteria
See the full Evaluators List.
## Next Steps