# Judge Agent

## Overview
The Judge Agent is an LLM-powered evaluator that automatically determines whether your agent under test meets defined success criteria. Instead of writing complex assertion logic, you describe what success looks like in natural language, and the judge evaluates each conversation turn to decide whether to continue, succeed, or fail the test.
After each agent response, the judge:
- Reviews the entire conversation history
- Evaluates against your defined criteria
- Decides whether to continue, succeed, or fail
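
Conceptually, each evaluation yields one of three verdicts plus an explanation. The sketch below illustrates that decision shape only; the class and field names are assumptions for illustration, not the library's actual internals.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Verdict(Enum):
    CONTINUE = "continue"  # criteria not yet decidable; let the conversation proceed
    SUCCESS = "success"    # all positive criteria met, no negative constraint violated
    FAILURE = "failure"    # a criterion was violated or can no longer be met


@dataclass
class JudgeDecision:
    verdict: Verdict
    reasoning: str           # the judge's natural-language explanation
    criteria_met: List[str]  # criteria the judge considers satisfied so far
```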
## Use Case Example
Let's test a customer support agent handling billing inquiries:
```python
import pytest
import scenario


@pytest.mark.asyncio
async def test_billing_inquiry_quality():
    result = await scenario.run(
        name="billing inquiry handling",
        description="""
            User received an unexpected charge on their credit card and is
            concerned but polite. They have their account information ready.
        """,
        agents=[
            CustomerSupportAgent(),  # your agent under test, defined elsewhere
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent asks for account information to investigate",
                "Agent explains the charge clearly",
                "Agent offers a solution or next steps",
                "Agent maintains a helpful and empathetic tone",
                "Agent should not make promises about refunds without verification",
            ]),
        ],
        max_turns=8,
    )

    assert result.success
```

## Configuration Reference
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `criteria` | `List[str]` | No | `[]` | Success criteria to evaluate. Include positive requirements and negative constraints. |
| `model` | `str` | No | Global config | LLM model identifier (e.g., `"openai/gpt-4o"`). |
| `temperature` | `float` | No | `0.0` | Sampling temperature (0.0-1.0). Use 0.0-0.2 for consistent evaluation. |
| `max_tokens` | `int` | No | Model default | Maximum tokens for judge reasoning and explanations. |
| `system_prompt` | `str` | No | Built-in | Custom system prompt to override the default judge behavior. |
| `api_base` | `str` | No | Global config | Base URL for custom API endpoints. |
| `api_key` | `str` | No | Environment | API key for the model provider. |
| `**extra_params` | `dict` | No | `{}` | Additional LiteLLM parameters (headers, timeout, client). |
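
For example, a judge pinned to a specific model with deterministic sampling might look like the following. This is a minimal sketch: the model identifier and token limit are illustrative values, and the parameter names follow the table above.

```python
import scenario

# Illustrative judge configuration; values are examples, not recommendations.
judge = scenario.JudgeAgent(
    criteria=[
        "Agent asks for account information to investigate",
        "Agent should not make promises about refunds without verification",
    ],
    model="openai/gpt-4o",  # any LiteLLM-compatible model identifier
    temperature=0.0,        # low temperature keeps evaluations consistent
    max_tokens=512,         # cap the judge's reasoning and explanations
)
```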
## Writing Effective Criteria
Good criteria are specific, measurable, relevant, and actionable:
```python
# Good - specific and measurable
scenario.JudgeAgent(criteria=[
    "Agent asks for the user's order number",
    "Agent provides a tracking link",
    "Agent offers to help with anything else",
    "Agent should not promise delivery dates without checking the system",
])

# Avoid vague criteria
scenario.JudgeAgent(criteria=[
    "Agent is helpful",             # Too vague
    "Agent does everything right",  # Not measurable
])
```

## Next Steps
- User Simulator Agent - Configure realistic user behavior
- Writing Scenarios - Best practices for scenario design
- Scripted Simulations - Combine judges with precise flow control
- Configuration - Set global defaults for all judges
