# Judge Agent

## Overview
The Judge Agent is an LLM-powered evaluator that automatically determines whether your agent under test meets defined success criteria. Instead of writing complex assertion logic, you describe what success looks like in natural language, and the judge evaluates each conversation turn to decide whether to continue, succeed, or fail the test.
After each agent response, the judge:
- Reviews the entire conversation history
- Evaluates against your defined criteria
- Decides whether to continue, succeed, or fail
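
Conceptually, each evaluation yields one of three verdicts plus an explanation. The sketch below illustrates that decision shape only; the class and field names are assumptions for illustration, not the library's actual internals.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Verdict(Enum):
    CONTINUE = "continue"  # criteria not yet decidable; let the conversation proceed
    SUCCESS = "success"    # all positive criteria met, no negative constraint violated
    FAILURE = "failure"    # a criterion was violated or can no longer be met


@dataclass
class JudgeDecision:
    verdict: Verdict
    reasoning: str           # the judge's natural-language explanation
    criteria_met: List[str]  # criteria the judge considers satisfied so far
```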
## Use Case Example
Let's test a customer support agent handling billing inquiries:
```python
import pytest
import scenario


@pytest.mark.asyncio
async def test_billing_inquiry_quality():
    result = await scenario.run(
        name="billing inquiry handling",
        description="""
            User received an unexpected charge on their credit card and is
            concerned but polite. They have their account information ready.
        """,
        agents=[
            CustomerSupportAgent(),  # your agent under test, defined elsewhere
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent asks for account information to investigate",
                "Agent explains the charge clearly",
                "Agent offers a solution or next steps",
                "Agent maintains a helpful and empathetic tone",
                "Agent should not make promises about refunds without verification",
            ]),
        ],
        max_turns=8,
    )

    assert result.success
```

## Configuration Reference
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `criteria` | `List[str]` | No | `[]` | Success criteria to evaluate. Include positive requirements and negative constraints. |
| `model` | `str` | No | Global config | LLM model identifier (e.g., `"openai/gpt-4o"`). |
| `temperature` | `float` | No | `0.0` | Sampling temperature (0.0-1.0). Use 0.0-0.2 for consistent evaluation. |
| `max_tokens` | `int` | No | Model default | Maximum tokens for judge reasoning and explanations. |
| `system_prompt` | `str` | No | Built-in | Custom system prompt to override the default judge behavior. |
| `api_base` | `str` | No | Global config | Base URL for custom API endpoints. |
| `api_key` | `str` | No | Environment | API key for the model provider. |
| `**extra_params` | `dict` | No | `{}` | Additional LiteLLM parameters (headers, timeout, client). |
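
For example, a judge pinned to a specific model with deterministic sampling might look like the following. This is a minimal sketch: the model identifier and token limit are illustrative values, and the parameter names follow the table above.

```python
import scenario

# Illustrative judge configuration; values are examples, not recommendations.
judge = scenario.JudgeAgent(
    criteria=[
        "Agent asks for account information to investigate",
        "Agent should not make promises about refunds without verification",
    ],
    model="openai/gpt-4o",  # any LiteLLM-compatible model identifier
    temperature=0.0,        # low temperature keeps evaluations consistent
    max_tokens=512,         # cap the judge's reasoning and explanations
)
```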
## Writing Effective Criteria
Good criteria are specific, measurable, relevant, and actionable:
```python
# Good - specific and measurable
scenario.JudgeAgent(criteria=[
    "Agent asks for the user's order number",
    "Agent provides a tracking link",
    "Agent offers to help with anything else",
    "Agent should not promise delivery dates without checking the system",
])

# Avoid vague criteria
scenario.JudgeAgent(criteria=[
    "Agent is helpful",             # Too vague
    "Agent does everything right",  # Not measurable
])
```

## Next Steps
- User Simulator Agent - Configure realistic user behavior
- Writing Scenarios - Best practices for scenario design
- Scripted Simulations - Combine judges with precise flow control
- Configuration - Set global defaults for all judges
