In this cookbook, we'll explore a more effective approach to evaluating multi-turn customer support agents. Traditional evaluation methods that use a single input-output pair are insufficient for agents that need to adapt their tool usage as conversations evolve. Instead, we'll implement a simulation-based approach where an LLM evaluates our agent against specific success criteria.
The Problem with Traditional Evaluation
Traditional evaluation methods for customer support agents often use a dataset where:
- Input: Customer ticket/query
- Output: Expected sequence of tool calls
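For instance, a single row in such a dataset might look like the following sketch; the field names and values are illustrative, not taken from the original:

```python
# One row of a traditional, tool-sequence-based evaluation dataset (illustrative).
example_case = {
    "input": "My order ORD-1001 arrived damaged. I'd like a refund.",
    "expected_tool_calls": ["get_order_status", "issue_refund"],  # one fixed, assumed path
}
```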
This approach has several limitations:
- It assumes a fixed, predetermined path to resolution
- It doesn't account for new information discovered during the conversation
- It focuses on the exact sequence of tools rather than achieving the desired outcome
A Better Approach: Simulation-Based Evaluation
Instead of predicting exact tool sequences, we'll define success criteria that focus on what the agent must accomplish, regardless of the specific path taken (see the example criteria after this list). This approach:
- Focuses on outcomes rather than specific steps
- Allows for multiple valid solution paths
- Better reflects real-world customer support scenarios
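As referenced above, a hypothetical set of success criteria for a refund ticket might look like this; the wording and structure are assumptions made for illustration:

```python
# Outcome-focused success criteria: each is a binary check over the whole
# conversation, with no assumption about which tools run or in what order.
success_criteria = [
    "The agent verified the customer's order before taking any action",
    "A refund was issued for the correct order",
    "The agent clearly communicated the resolution to the customer",
]
```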
Requirements
Before we start, make sure you have the necessary packages installed:
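The exact dependencies depend on your setup; a minimal install assuming the OpenAI Python SDK (and, given the source, the LangWatch SDK) would be:

```bash
pip install openai langwatch
```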
Define Tools

Let's implement this simulation-based evaluation approach using mock tools for an e-commerce customer support scenario.
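A minimal sketch of such mock tools, assuming a simple order-lookup/refund scenario; the tool names, schemas, and data here are illustrative, not the cookbook's exact definitions:

```python
import json

# Mock order database; real tools would call backend services instead.
ORDERS = {
    "ORD-1001": {"status": "delivered", "item": "jacket", "refundable": True},
    "ORD-1002": {"status": "processing", "item": "sneakers", "refundable": False},
}

def get_order_status(order_id: str) -> str:
    """Look up an order and return its details as a JSON string."""
    order = ORDERS.get(order_id)
    if order is None:
        return json.dumps({"error": f"order {order_id} not found"})
    return json.dumps({"order_id": order_id, **order})

def issue_refund(order_id: str, reason: str) -> str:
    """Issue a refund (mocked: succeeds only for refundable, existing orders)."""
    order = ORDERS.get(order_id)
    if order is None or not order["refundable"]:
        return json.dumps({"error": f"cannot refund order {order_id}"})
    return json.dumps({"refunded": True, "order_id": order_id, "reason": reason})

# Registry the executor will use to dispatch planned tool calls.
TOOLS = {"get_order_status": get_order_status, "issue_refund": issue_refund}
```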
Define Agents

Now we'll define our agents. We'll create both a Planner and an Executor agent. The Planner agent is responsible for creating a plan to achieve the user's goal, while the Executor agent is responsible for executing the plan. We also define a helper function to generate a response from the tool outputs.
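A sketch of the Planner/Executor split plus the response helper, assuming the OpenAI chat completions API; the prompts, model name, and JSON plan format are assumptions made for this sketch:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"  # assumed model; any chat-capable model works

PLANNER_PROMPT = (
    "You are the planner for a customer support agent. Given the conversation, "
    "output a JSON array of tool calls, e.g. "
    '[{"tool": "get_order_status", "args": {"order_id": "ORD-1001"}}]. '
    "Available tools: get_order_status, issue_refund. "
    "Output [] if no tool is needed. Output only JSON."
)

def run_planner(conversation: list[dict]) -> str:
    """Plan the next tool calls for the current state of the conversation."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": PLANNER_PROMPT}, *conversation],
    )
    return response.choices[0].message.content

def run_executor(plan: str) -> list[str]:
    """Execute each planned tool call against the mock tool registry."""
    return [TOOLS[step["tool"]](**step["args"]) for step in json.loads(plan)]

def generate_response(conversation: list[dict], tool_outputs: list[str]) -> str:
    """Draft a customer-facing reply grounded in the tool outputs."""
    system = (
        "You are a helpful e-commerce support agent. Answer the customer using "
        "these tool outputs:\n" + "\n".join(tool_outputs)
    )
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system}, *conversation],
    )
    return response.choices[0].message.content
```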
Evaluator Agent

The Evaluator Agent evaluates our multi-turn agent behavior using binary success criteria over full simulated conversations. This method moves beyond traditional input/output pair evaluation, addressing the stochastic and flexible nature of agent workflows.
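A sketch of an LLM-as-judge evaluator that returns a true/false verdict per criterion; the prompt and the JSON-object response format are assumptions:

```python
import json

def evaluate_conversation(transcript: list[dict], criteria: list[str]) -> dict:
    """Ask an LLM judge for a binary verdict on each success criterion."""
    rendered = "\n".join(f"{m['role']}: {m['content']}" for m in transcript)
    response = client.chat.completions.create(
        model=MODEL,
        response_format={"type": "json_object"},  # constrain the judge to JSON
        messages=[
            {
                "role": "system",
                "content": (
                    "You evaluate customer support conversations. For each "
                    "criterion, decide whether the conversation satisfies it. "
                    "Respond with a JSON object mapping each criterion string "
                    "to true or false."
                ),
            },
            {
                "role": "user",
                "content": f"Conversation:\n{rendered}\n\nCriteria:\n"
                + "\n".join(criteria),
            },
        ],
    )
    return json.loads(response.choices[0].message.content)
```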
Simulation Function

Below we define a method to simulate conversations between our agent and a user. The outputs will be evaluated by our Evaluator Agent.
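A sketch of the simulation loop, where a second LLM plays the customer and the loop ends when the customer signals resolution; the turn limit and the "DONE" stop convention are assumptions:

```python
def simulate_conversation(ticket: str, max_turns: int = 5) -> list[dict]:
    """Run an agent/simulated-customer conversation starting from a ticket."""
    transcript = [{"role": "user", "content": ticket}]
    for _ in range(max_turns):
        # Agent turn: plan tool calls, execute them, reply to the customer.
        plan = run_planner(transcript)
        tool_outputs = run_executor(plan)
        reply = generate_response(transcript, tool_outputs)
        transcript.append({"role": "assistant", "content": reply})
        # Customer turn: roles are swapped so the customer model sees the
        # agent's messages as coming from its interlocutor.
        swapped = [
            {"role": "user" if m["role"] == "assistant" else "assistant",
             "content": m["content"]}
            for m in transcript
        ]
        customer = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": (
                    "You are a customer talking to a support agent about your "
                    "ticket. Stay in character. If your issue is fully "
                    "resolved, reply with exactly DONE."
                )},
                *swapped,
            ],
        )
        customer_msg = customer.choices[0].message.content
        if customer_msg.strip() == "DONE":
            break
        transcript.append({"role": "user", "content": customer_msg})
    return transcript
```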
Running the Simulation

Now, let's define a test ticket and our success criteria, then run the simulation:
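Putting it together, a run might be wired up like this; the ticket text and criteria are illustrative:

```python
ticket = (
    "Hi, I ordered a jacket last week (order ORD-1001), but it arrived "
    "damaged. I'd like a refund, please."
)
criteria = [
    "The agent looked up the order before taking any action",
    "A refund was issued for order ORD-1001",
    "The agent confirmed the resolution with the customer",
]

transcript = simulate_conversation(ticket)
results = evaluate_conversation(transcript, criteria)
for criterion, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}: {criterion}")
```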
Simulation Output

Here's an example of the output you would see when running this simulation:
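Actual transcripts vary from run to run because the models are stochastic; with the sketch above, the final verdicts have roughly this shape (illustrative, not real output):

```text
PASS: The agent looked up the order before taking any action
PASS: A refund was issued for order ORD-1001
PASS: The agent confirmed the resolution with the customer
```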
Conclusion

Traditional evaluation methods that rely on fixed input-output pairs are insufficient for multi-turn conversational agents. By simulating complete conversations and evaluating against outcome-based criteria, we can better assess an agent's ability to handle real-world customer support scenarios. Key benefits of this approach include:
- Flexibility in solution paths: The agent can take different valid approaches to solve the same problem
- Focus on outcomes: Evaluation is based on what the agent accomplishes, not how it gets there
- Adaptability to new information: The agent can adjust its strategy based on information discovered during the conversation
- Realistic assessment: The evaluation better reflects how agents would perform in real-world scenarios