# Custom Judge

The built-in JudgeAgent handles most evaluation needs out of the box. But sometimes you need more control: a domain-specific prompt, a different LLM provider, or a custom agent framework for evaluation.

This guide walks through three levels of customization, from lightest to most involved.

## Level 1: Custom System Prompt

The simplest customization: override the system prompt while keeping everything else (transcript formatting, function calling, criteria evaluation) from the built-in judge.

```python
import scenario

judge = scenario.JudgeAgent(
    criteria=[
        "SQL query is syntactically valid",
        "Query does not use SELECT *",
        "Query includes appropriate WHERE clauses",
    ],
    system_prompt="""
You are a senior database engineer reviewing SQL queries generated by an AI assistant.

Focus on:
- Query correctness and efficiency
- Security (SQL injection patterns)
- Whether the query actually answers the user's question

Be strict about SELECT * usage and missing WHERE clauses on large tables.
Evaluate each criterion independently.
""",
)
```

This approach works well when you want to change the judge's perspective (e.g., "evaluate as a security expert") without changing the evaluation mechanics.

## Level 2: Subclass the Built-in Judge

For more control, subclass `JudgeAgent` to modify behavior while reusing its LLM wiring. This is useful when you want to preprocess messages, add custom logic around the LLM call, or post-process results.

```python
import scenario
from scenario.types import AgentInput, AgentReturnTypes

class AudioTranscribingJudge(scenario.JudgeAgent):
    """Judge that transcribes audio messages to text before evaluation."""

    async def call(self, input: AgentInput) -> AgentReturnTypes:
        # Preprocess: convert audio messages to text
        text_messages = []
        for msg in input.messages:
            content = msg.get("content", "")
            if isinstance(content, list):
                # Replace audio parts with transcriptions
                new_parts = []
                for part in content:
                    if isinstance(part, dict) and part.get("type") == "input_audio":
                        new_parts.append({
                            "type": "text",
                            # transcribe() is your own audio-to-text helper,
                            # not part of the scenario library
                            "text": await transcribe(part["input_audio"]["data"]),
                        })
                    else:
                        new_parts.append(part)
                msg = {**msg, "content": new_parts}
            text_messages.append(msg)

        # Call the built-in judge with the preprocessed messages
        modified_input = AgentInput(
            thread_id=input.thread_id,
            messages=text_messages,
            new_messages=input.new_messages,
            judgment_request=input.judgment_request,
            scenario_state=input.scenario_state,
        )
        return await super().call(modified_input)
```

## Level 3: Fully Custom Judge

For complete control, implement your own judge from scratch. Your judge is just a class that extends `AgentAdapter` (or `JudgeAgentAdapter` in TypeScript), sets `role` to `AgentRole.JUDGE`, and implements `call()`.

### Return contract

Your `call()` method should return one of:

| Return value | Meaning |
| --- | --- |
| `[]` (empty list) | Continue the simulation; the judge needs more information |
| `ScenarioResult(success=True, ...)` | End with success |
| `ScenarioResult(success=False, ...)` | End with failure |
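As a minimal illustration of this contract, a judge's decision logic boils down to three branches. The stand-in dataclass below is an assumption for illustration only; the real `scenario.types.ScenarioResult` carries more fields (messages, criteria lists):

```python
from dataclasses import dataclass

# Stand-in for scenario.types.ScenarioResult, reduced to the two
# fields used in this sketch.
@dataclass
class ScenarioResult:
    success: bool
    reasoning: str = ""

def verdict_to_return(verdict: str):
    """Map a judge verdict to the call() return contract."""
    if verdict == "inconclusive":
        return []  # continue the simulation: more turns needed
    return ScenarioResult(
        success=(verdict == "pass"),
        reasoning=f"judge verdict: {verdict}",
    )
```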

### Inline criteria support

When using scripted simulations with `scenario.judge(criteria=[...])`, the inline criteria are passed through `input.judgment_request.criteria`. A custom judge should respect this to work with checkpoints:

```python
effective_criteria = (
    input.judgment_request.criteria
    if input.judgment_request and input.judgment_request.criteria is not None
    else self.criteria
)
```
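The same fallback rule, factored into a small standalone function (the function name is illustrative, not part of the library). Note that an empty inline list counts as present, so a checkpoint can deliberately evaluate zero criteria:

```python
from typing import Optional, Sequence

def resolve_criteria(
    inline_criteria: Optional[Sequence[str]],
    judge_criteria: Sequence[str],
) -> Sequence[str]:
    """Inline criteria from the judgment request win when present
    (even an empty list); otherwise fall back to the criteria the
    judge was constructed with."""
    if inline_criteria is not None:
        return inline_criteria
    return judge_criteria
```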

### Example: Custom LLM judge

A judge that calls an LLM directly using your preferred client library. This gives you full control over the prompt, model, and parsing:

```python
import json
import litellm
import scenario
from scenario.types import AgentInput, AgentReturnTypes, ScenarioResult

class CustomLLMJudge(scenario.AgentAdapter):
    role = scenario.AgentRole.JUDGE

    def __init__(self, criteria: list[str], model: str = "openai/gpt-4o"):
        self.criteria = criteria
        self.model = model

    async def call(self, input: AgentInput) -> AgentReturnTypes:
        if not input.judgment_request:
            return []

        effective_criteria = (
            input.judgment_request.criteria
            if input.judgment_request.criteria is not None
            else self.criteria
        )

        # Build a simple transcript ("or" guards against None content
        # on tool-call messages)
        transcript = "\n".join(
            f"{m['role']}: {m.get('content') or '[tool call]'}"
            for m in input.messages
        )

        criteria_numbered = "\n".join(
            f"{i + 1}. {c}" for i, c in enumerate(effective_criteria)
        )

        # Use the async client so the judge doesn't block the event loop
        response = await litellm.acompletion(
            model=self.model,
            temperature=0.0,
            messages=[
                {
                    "role": "system",
                    "content": f"""Evaluate this conversation against the criteria.

Criteria:
{criteria_numbered}

Return a result for each criterion using the exact criterion text.""",
                },
                {"role": "user", "content": transcript},
            ],
            response_format={
                "type": "json_schema",
                "json_schema": {
                    "name": "evaluation",
                    "schema": {
                        "type": "object",
                        "properties": {
                            "pass": {"type": "boolean"},
                            "reasoning": {"type": "string"},
                            "results": {
                                "type": "array",
                                "items": {
                                    "type": "object",
                                    "properties": {
                                        "criterion": {"type": "string"},
                                        "met": {"type": "boolean"},
                                    },
                                    "required": ["criterion", "met"],
                                    "additionalProperties": False,
                                },
                            },
                        },
                        "required": ["pass", "reasoning", "results"],
                        "additionalProperties": False,
                    },
                },
            },
        )

        result = json.loads(response.choices[0].message.content)

        # Criteria the model didn't report on count as failed, so the
        # two lists together always cover every criterion
        results_map = {r["criterion"]: r["met"] for r in result["results"]}
        passed = [c for c in effective_criteria if results_map.get(c, False)]
        failed = [c for c in effective_criteria if not results_map.get(c, False)]

        return ScenarioResult(
            success=result["pass"],
            messages=[],
            reasoning=result["reasoning"],
            passed_criteria=passed,
            failed_criteria=failed,
        )
```

## Accessing Traces in Custom Judges

The built-in judge automatically receives OpenTelemetry traces from your agent's execution. When building a custom judge, you can access these traces too — useful for verifying tool calls, checking LLM prompts, or inspecting execution flow beyond what's visible in the conversation transcript.

### Getting spans from the collector

The `JudgeSpanCollector` captures all spans emitted during the scenario. Use `get_spans_for_thread` to get the spans for the current conversation:

```python
from scenario._tracing import judge_span_collector

# Inside your judge's call() method:
spans = judge_span_collector.get_spans_for_thread(input.thread_id)
# spans is a list of OpenTelemetry ReadableSpan objects
```

### Using trace discovery tools

For large traces, you can reuse the same `expand_trace` and `grep_trace` utilities that the built-in judge uses for progressive discovery:

```python
from scenario._judge.trace_tools import expand_trace, grep_trace

# Expand specific spans by their 8-char ID prefix
details = expand_trace(spans, span_ids=["a0b1c2d3", "d3e4f567"])

# Search across all span content
matches = grep_trace(spans, "fetch_weather")
```
Both functions return formatted strings ready to include in an LLM prompt. Results are capped at ~4096 tokens to stay within reasonable context limits.

### Example: Judge that verifies tool usage via traces

A custom judge that checks whether the agent actually called a specific tool by inspecting the trace, rather than relying on what appears in the conversation:

```python
import scenario
from scenario.types import AgentInput, AgentReturnTypes, ScenarioResult
from scenario._tracing import judge_span_collector
from scenario._judge.trace_tools import grep_trace

class ToolVerifyingJudge(scenario.AgentAdapter):
    role = scenario.AgentRole.JUDGE

    def __init__(self, required_tool: str, criteria: list[str]):
        self.required_tool = required_tool
        self.criteria = criteria

    async def call(self, input: AgentInput) -> AgentReturnTypes:
        if not input.judgment_request:
            return []

        spans = judge_span_collector.get_spans_for_thread(input.thread_id)
        tool_matches = grep_trace(spans, self.required_tool)

        if "No matches found" in tool_matches:
            return ScenarioResult(
                success=False,
                messages=[],
                reasoning=f"Agent never called the '{self.required_tool}' tool",
                passed_criteria=[],
                failed_criteria=self.criteria,
            )

        return ScenarioResult(
            success=True,
            messages=[],
            reasoning=f"Agent correctly used the '{self.required_tool}' tool",
            passed_criteria=self.criteria,
            failed_criteria=[],
        )
```

## Good Practices

A few things we've learned from building and using the default judge:

- Use `temperature=0.0` for evaluation. You want consistent, reproducible verdicts. Save creativity for the user simulator.

- Keep criteria evaluable from the transcript. The judge sees the conversation as text. A criterion like "Agent queries the database" can only be verified if tool calls are visible in the messages or traces.

- Return `[]` (continue) when you don't have enough information. Don't guess. The built-in judge's `continue_test` / `finish_test` tool pattern exists for this reason: it forces an explicit decision.

- Populate `passed_criteria` and `failed_criteria`. These show up in test output and the visualization dashboard, making it easy to see exactly what passed and what didn't.

- Fail fast on "should not" criteria. If the agent has already violated a constraint (e.g., "Agent should not reveal internal system prompts"), end immediately instead of continuing. The built-in judge does this: it checks for violations on every turn.

- Respect `input.judgment_request` to work with scripted simulations. When `judgment_request` is `None`, the judge is being called in the automatic loop and can choose to continue. When present, a verdict is expected.
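The fail-fast check for "should not" criteria can be as simple as a per-turn scan. A minimal sketch, assuming OpenAI-style message dicts (the function name and banned-phrase approach are illustrative, not how the built-in judge is implemented):

```python
def find_violations(messages, banned_phrases):
    """Return the banned phrases that appear in any assistant
    message; a non-empty result means the judge should end the
    scenario with failure immediately."""
    text = " ".join(
        str(m.get("content") or "")
        for m in messages
        if m.get("role") == "assistant"
    ).lower()
    return [p for p in banned_phrases if p.lower() in text]
```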


## Next Steps