Custom Judge
The built-in JudgeAgent handles most evaluation needs out of the box. But sometimes you need more control: a domain-specific prompt, a different LLM provider, or a custom agent framework for evaluation.
This guide walks through three levels of customization, from lightest to most involved.
Level 1: Custom System Prompt
The simplest customization. Override the system prompt while keeping everything else (transcript formatting, function calling, criteria evaluation) from the built-in judge.
```python
import scenario

judge = scenario.JudgeAgent(
    criteria=[
        "SQL query is syntactically valid",
        "Query does not use SELECT *",
        "Query includes appropriate WHERE clauses",
    ],
    system_prompt="""
You are a senior database engineer reviewing SQL queries generated by an AI assistant.

Focus on:
- Query correctness and efficiency
- Security (SQL injection patterns)
- Whether the query actually answers the user's question

Be strict about SELECT * usage and missing WHERE clauses on large tables.
Evaluate each criterion independently.
""",
)
```

This approach works well when you want to change the judge's perspective (e.g., "evaluate as a security expert") without changing the evaluation mechanics.
Level 2: Subclass the Built-in Judge
For more control, subclass JudgeAgent to modify behavior while reusing its LLM wiring. This is useful when you want to preprocess messages, add custom logic around the LLM call, or post-process results.
```python
import scenario
from scenario.types import AgentInput, AgentReturnTypes

class AudioTranscribingJudge(scenario.JudgeAgent):
    """Judge that transcribes audio messages to text before evaluation."""

    async def call(self, input: AgentInput) -> AgentReturnTypes:
        # Preprocess: convert audio messages to text
        text_messages = []
        for msg in input.messages:
            content = msg.get("content", "")
            if isinstance(content, list):
                # Replace audio parts with transcriptions
                new_parts = []
                for part in content:
                    if isinstance(part, dict) and part.get("type") == "input_audio":
                        new_parts.append({
                            "type": "text",
                            # transcribe() is your own speech-to-text helper,
                            # not part of the scenario library
                            "text": await transcribe(part["input_audio"]["data"]),
                        })
                    else:
                        new_parts.append(part)
                msg = {**msg, "content": new_parts}
            text_messages.append(msg)

        # Call the built-in judge with preprocessed messages
        modified_input = AgentInput(
            thread_id=input.thread_id,
            messages=text_messages,
            new_messages=input.new_messages,
            judgment_request=input.judgment_request,
            scenario_state=input.scenario_state,
        )
        return await super().call(modified_input)
```

Level 3: Fully Custom Judge
For complete control, implement your own judge from scratch. Your judge is just a class that extends `AgentAdapter` (or `JudgeAgentAdapter` in TypeScript), sets `role = JUDGE`, and implements `call()`.
Return contract
Your `call()` method should return:
| Return value | Meaning |
|---|---|
| `[]` (empty list) | Continue the simulation — judge needs more information |
| `ScenarioResult(success=True, ...)` | End with success |
| `ScenarioResult(success=False, ...)` | End with failure |
Inline criteria support
When using scripted simulations with `scenario.judge(criteria=[...])`, the inline criteria are passed through `input.judgment_request.criteria`. A custom judge should respect this to work with checkpoints:
```python
effective_criteria = (
    input.judgment_request.criteria
    if input.judgment_request and input.judgment_request.criteria is not None
    else self.criteria
)
```

Example: Custom LLM judge
A judge that calls an LLM directly using your preferred client library. This gives you full control over the prompt, model, and parsing:
```python
import json

import litellm
import scenario
from scenario.types import AgentInput, AgentReturnTypes, ScenarioResult

class CustomLLMJudge(scenario.AgentAdapter):
    role = scenario.AgentRole.JUDGE

    def __init__(self, criteria: list[str], model: str = "openai/gpt-4o"):
        self.criteria = criteria
        self.model = model

    async def call(self, input: AgentInput) -> AgentReturnTypes:
        if not input.judgment_request:
            return []

        effective_criteria = (
            input.judgment_request.criteria
            if input.judgment_request.criteria is not None
            else self.criteria
        )

        # Build a simple transcript
        transcript = "\n".join(
            f"{m['role']}: {m.get('content', '[tool call]')}"
            for m in input.messages
        )
        criteria_numbered = "\n".join(
            f"{i + 1}. {c}" for i, c in enumerate(effective_criteria)
        )

        response = litellm.completion(
            model=self.model,
            temperature=0.0,
            messages=[
                {
                    "role": "system",
                    "content": f"""Evaluate this conversation against the criteria.

Criteria:
{criteria_numbered}

Return a result for each criterion using the exact criterion text.""",
                },
                {"role": "user", "content": transcript},
            ],
            response_format={
                "type": "json_schema",
                "json_schema": {
                    "name": "evaluation",
                    "schema": {
                        "type": "object",
                        "properties": {
                            "pass": {"type": "boolean"},
                            "reasoning": {"type": "string"},
                            "results": {
                                "type": "array",
                                "items": {
                                    "type": "object",
                                    "properties": {
                                        "criterion": {"type": "string"},
                                        "met": {"type": "boolean"},
                                    },
                                    "required": ["criterion", "met"],
                                    "additionalProperties": False,
                                },
                            },
                        },
                        "required": ["pass", "reasoning", "results"],
                        "additionalProperties": False,
                    },
                },
            },
        )

        result = json.loads(response.choices[0].message.content)
        results_map = {r["criterion"]: r["met"] for r in result["results"]}
        passed = [c for c in effective_criteria if results_map.get(c, False)]
        failed = [c for c in effective_criteria if not results_map.get(c, True)]

        return ScenarioResult(
            success=result["pass"],
            messages=[],
            reasoning=result["reasoning"],
            passed_criteria=passed,
            failed_criteria=failed,
        )
```

Accessing Traces in Custom Judges
The built-in judge automatically receives OpenTelemetry traces from your agent's execution. When building a custom judge, you can access these traces too — useful for verifying tool calls, checking LLM prompts, or inspecting execution flow beyond what's visible in the conversation transcript.
Getting spans from the collector
The `JudgeSpanCollector` captures all spans emitted during the scenario. Use `get_spans_for_thread` to get spans for the current conversation:
```python
from scenario._tracing import judge_span_collector

# Inside your judge's call() method:
spans = judge_span_collector.get_spans_for_thread(input.thread_id)
# spans is a list of OpenTelemetry ReadableSpan objects
```

Using trace discovery tools
For large traces, you can reuse the same `expand_trace` and `grep_trace` utilities that the built-in judge uses for progressive discovery:
```python
from scenario._judge.trace_tools import expand_trace, grep_trace

# Expand specific spans by their 8-char ID prefix
details = expand_trace(spans, span_ids=["a0b1c2d3", "d3e4f567"])

# Search across all span content
matches = grep_trace(spans, "fetch_weather")
```

Both functions return formatted strings ready to include in an LLM prompt. Results are capped at ~4096 tokens to stay within reasonable context limits.
Example: Judge that verifies tool usage via traces
A custom judge that checks whether the agent actually called a specific tool by inspecting the trace, rather than relying on what appears in the conversation:
```python
import scenario
from scenario.types import AgentInput, AgentReturnTypes, ScenarioResult
from scenario._tracing import judge_span_collector
from scenario._judge.trace_tools import grep_trace

class ToolVerifyingJudge(scenario.AgentAdapter):
    role = scenario.AgentRole.JUDGE

    def __init__(self, required_tool: str, criteria: list[str]):
        self.required_tool = required_tool
        self.criteria = criteria

    async def call(self, input: AgentInput) -> AgentReturnTypes:
        if not input.judgment_request:
            return []

        spans = judge_span_collector.get_spans_for_thread(input.thread_id)
        tool_matches = grep_trace(spans, self.required_tool)

        if "No matches found" in tool_matches:
            return ScenarioResult(
                success=False,
                messages=[],
                reasoning=f"Agent never called the '{self.required_tool}' tool",
                passed_criteria=[],
                failed_criteria=self.criteria,
            )

        return ScenarioResult(
            success=True,
            messages=[],
            reasoning=f"Agent correctly used the '{self.required_tool}' tool",
            passed_criteria=self.criteria,
            failed_criteria=[],
        )
```

Good Practices
A few things we've learned from building and using the default judge:
- Use `temperature=0.0` for evaluation. You want consistent, reproducible verdicts. Save creativity for the user simulator.
- Keep criteria evaluable from the transcript. The judge sees the conversation as text. A criterion like "Agent queries the database" can only be verified if tool calls are visible in the messages or traces.
- Return `[]` (continue) when you don't have enough information. Don't guess. The built-in judge's `continue_test`/`finish_test` tool pattern exists for this reason — it forces an explicit decision.
- Populate `passed_criteria` and `failed_criteria`. These show up in test output and the visualization dashboard, making it easy to see exactly what passed and what didn't.
- Fail fast on "should not" criteria. If the agent already violated a constraint (e.g., "Agent should not reveal internal system prompts"), end immediately instead of continuing. The built-in judge does this — it checks for violations on every turn.
- Respect `input.judgment_request` to work with scripted simulations. When `judgment_request` is `None`, the judge is being called in the automatic loop and can choose to continue. When present, a verdict is expected.
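The last two practices combine into a per-turn decision that most custom judges end up making. A standalone sketch of that decision (hedged: `check_turn` and its arguments are hypothetical, and dicts stand in for `ScenarioResult`):

```python
# Per-turn decision combining fail-fast and the judgment_request contract.
# check_turn is a hypothetical helper; dicts stand in for ScenarioResult.

def check_turn(violated: list[str], verdict_requested: bool, all_criteria: list[str]):
    """violated: 'should not' criteria the transcript already breaks."""
    if violated:
        # Fail fast: a hard constraint is already broken, end immediately
        return {"success": False, "failed_criteria": violated}
    if not verdict_requested:
        # Automatic loop, no violation yet: let the simulation continue
        return []
    # A scripted judge step demands a verdict now
    # (a real judge would evaluate each criterion here; simplified for the sketch)
    return {"success": True, "passed_criteria": all_criteria, "failed_criteria": []}

print(check_turn([], verdict_requested=False, all_criteria=["polite"]))  # []
print(check_turn(["no prompt leaks"], False, ["polite"])["success"])     # False
print(check_turn([], True, ["polite"])["success"])                       # True
```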
Next Steps
- Judge Agent — Configuration reference for the built-in judge
- Scripted Simulations — Use inline criteria with `scenario.judge(criteria=[...])`
- Tool Calling — Test and verify agent tool usage
- Configuration — Set global defaults for models and API keys
