Online evaluation lets you continuously score your LLM’s production traffic. Unlike experiments, which test changes before deployment, online evaluation monitors your live application to catch quality issues, detect regressions, and ensure safety.
In the LangWatch platform, online evaluation is implemented through Monitors: automated rules that score incoming traces using the evaluators you configure.

How It Works

User Request → Your LLM → Response → LangWatch Trace → Monitor → Score
                                                                   ↓
                                                         Dashboard & Alerts
  1. Your application sends traces to LangWatch (via SDK integration)
  2. Monitors evaluate incoming traces using your configured evaluators
  3. Scores are recorded and displayed on dashboards
  4. Optionally trigger alerts when scores drop below thresholds
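
To make the flow concrete, here is a purely illustrative sketch of the scoring loop a monitor applies to each incoming trace. Every name in it (EvalResult, record_score, send_alert, monitor_incoming_trace, ALERT_THRESHOLD) is hypothetical; LangWatch runs this logic server-side, so you never write it yourself.

# Illustrative sketch only: LangWatch runs this server-side when a monitor is enabled.
# Every name below is hypothetical and not part of the LangWatch SDK.
from dataclasses import dataclass
from typing import Callable

ALERT_THRESHOLD = 0.7  # hypothetical threshold configured on the monitor

@dataclass
class EvalResult:
    name: str
    score: float

def record_score(trace_id: str, result: EvalResult) -> None:
    # Stand-in for persisting the score so dashboards can aggregate it
    print(f"trace={trace_id} {result.name}={result.score:.2f}")

def send_alert(trace_id: str, result: EvalResult) -> None:
    # Stand-in for the optional alerting step
    print(f"ALERT: {result.name} dropped to {result.score:.2f} on trace {trace_id}")

def monitor_incoming_trace(trace_id: str, trace: dict,
                           evaluators: list[Callable[[dict], EvalResult]]) -> None:
    for evaluator in evaluators:
        result = evaluator(trace)          # step 2: run each configured evaluator
        record_score(trace_id, result)     # step 3: scores feed traces and dashboards
        if result.score < ALERT_THRESHOLD:
            send_alert(trace_id, result)   # step 4: optional alerting on low scores

# Example: a toy "relevance" evaluator scoring a single trace
monitor_incoming_trace(
    "trace-123",
    {"input": "What is LangWatch?", "output": "An LLM observability platform."},
    [lambda t: EvalResult("relevance", 0.92)],
)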

When to Use Online Evaluation

Use case                 Example
Quality monitoring       Track faithfulness, relevance, or custom quality metrics over time
Safety monitoring        Detect PII leakage, jailbreak attempts, or policy violations
Regression detection     Get alerts when quality metrics drop after deployments
Dataset building         Automatically add low-scoring traces to datasets for improvement

Monitors vs Guardrails

Both use evaluators, but they serve different purposes:

Monitors                           Guardrails
Measure quality asynchronously     Block harmful content in real time
Run after the response is sent     Run before/during response generation
Feed dashboards and alerts         Return errors or safe responses to users
For observability                  For enforcement
If you need to block harmful content before it reaches users, see Guardrails.
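
To illustrate the difference in the request path, here is a rough sketch; every function in it is a hypothetical placeholder rather than a LangWatch API.

# Illustrative only: where a guardrail sits versus a monitor.
# All functions here are hypothetical placeholders, not LangWatch APIs.

def generate_response(user_input: str) -> str:
    return f"Echo: {user_input}"  # stand-in for your LLM call

def check_pii_guardrail(text: str) -> bool:
    return "@" in text  # toy check; a real guardrail would run an evaluator

def queue_for_monitoring(user_input: str, response: str) -> None:
    pass  # stand-in for the trace being sent and scored asynchronously

def handle_request(user_input: str) -> str:
    response = generate_response(user_input)

    # Guardrail: synchronous, can block or replace the response before the user sees it
    if check_pii_guardrail(response):
        return "Sorry, I can't help with that."

    # Monitor: scoring happens off the request path after the response is sent,
    # feeding dashboards and alerts rather than changing this response
    queue_for_monitoring(user_input, response)
    return response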

Getting Started

Quick Setup

1. Ensure traces are being sent

First, make sure your application is sending traces to LangWatch:
import langwatch

@langwatch.trace()
def my_llm_app(user_input):
    # Your LLM logic here
    response = generate_response(user_input)  # replace with your model or chain call
    return response
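
For example, if your LLM logic is an OpenAI chat call, the traced function might look like the sketch below. The OpenAI client and model are just one possible setup, and the LangWatch SDK also needs your LangWatch API key configured (typically via the LANGWATCH_API_KEY environment variable; see the SDK setup docs for your integration).

import langwatch
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

@langwatch.trace()
def my_llm_app(user_input: str) -> str:
    # One possible implementation of "your LLM logic": a single chat completion
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_input}],
    )
    return completion.choices[0].message.content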

2. Create a Monitor

  1. Go to Evaluations in LangWatch
  2. Click New Evaluation
  3. Select Real-time evaluation (this creates a Monitor)
  4. Choose “When a message arrives” as the trigger
  5. Select evaluators (e.g., PII Detection, Faithfulness)
  6. Configure any filters (optional)
  7. Enable monitoring

3. View Results

Once enabled, scores will appear on:
  • Traces - Individual trace scores visible in trace details
  • Analytics - Aggregate metrics over time
  • Alerts - Configure triggers for low scores

Adding Scores via Code

You can also add scores programmatically during request processing:
import langwatch

@langwatch.trace()
def my_llm_app(user_input):
    response = generate_response(user_input)
    
    # Add a custom score
    langwatch.get_current_span().add_evaluation(
        name="response_quality",
        passed=True,
        score=0.95,
        details="High quality response"
    )
    
    return response
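
The same call can also record a failing check. Below is a minimal variant of the example above; the length heuristic stands in for whatever real check you run, and generate_response is the same placeholder as before.

import langwatch

def generate_response(user_input):
    return "..."  # placeholder for your LLM call, as in the example above

@langwatch.trace()
def my_llm_app(user_input):
    response = generate_response(user_input)

    # Record a failed evaluation when a simple (purely illustrative) heuristic fires
    if len(response) < 20:
        langwatch.get_current_span().add_evaluation(
            name="response_length_check",
            passed=False,
            score=0.0,
            details="Response shorter than 20 characters",
        )

    return response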

Available Evaluators

Monitors can use any evaluator from the LangWatch library:
  • Quality: Faithfulness, Answer Relevancy, Coherence
  • Safety: PII Detection, Jailbreak Detection, Content Moderation
  • RAG: Context Precision, Context Recall, Groundedness
  • Custom: LLM-as-Judge with your own criteria
See the full Evaluators List.
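
The built-in LLM-as-Judge evaluator is configured in the monitor itself, but if you prefer to judge in code you can combine your own judge call with the add_evaluation pattern shown earlier. The sketch below does that; judge_relevance is a hypothetical helper standing in for however you prompt and parse your judge model, and generate_response is the same placeholder used above.

import langwatch

def judge_relevance(question: str, answer: str) -> float:
    # Hypothetical judge: in practice, prompt an LLM with your own criteria and
    # parse a numeric score from its reply; this toy check just looks for the
    # question's first word in the answer
    return 1.0 if question.split()[0].lower() in answer.lower() else 0.3

@langwatch.trace()
def my_llm_app(user_input):
    response = generate_response(user_input)  # your LLM call, as above

    score = judge_relevance(user_input, response)
    langwatch.get_current_span().add_evaluation(
        name="custom_relevance_judge",
        passed=score >= 0.5,
        score=score,
        details="Scored by a custom LLM-as-judge check",
    )
    return response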

Next Steps