From Scenario to Finished: How to Test AI Agents with Domain-Driven TDD

Andrew Joia

Sep 26, 2025

TL;DR

Traditional testing breaks down when applied to AI agents and LLMs. That’s why we built Scenario, a framework for running agent simulations and LLM evaluations in real-world business contexts.

I validated Scenario's domain-driven, test-first approach to AI agent testing by building a flight booking agent from scratch. Starting with scenario tests that defined business capabilities, I let failing tests drive every step, from basic conversation handling to full booking logic.

The experiment shows how Scenario enables LLM agent testing and agent simulations through domain-driven TDD: each test failure uncovers missing domain knowledge, scenarios act as living documentation, and the result is an AI agent you can ship with confidence.

The Challenge of Testing AI Systems

Problem: Traditional testing approaches fall apart when your system's behavior is inherently unpredictable—yet shipping untested AI products is like flying blind.

How confident are you that your AI product is actually working as expected? Especially in all of the various scenarios that your agent will be exposed to? LLMs are essentially nondeterministic, and when dealing with a chat-based agent (or multi-agent system), getting actual test coverage can be difficult, if not seemingly impossible.

Here at LangWatch, we've been spending a lot of time working on Scenario and figuring out all of the great applications for it in the AI space. Our goal is to provide services, tools, and knowledge to our clients and the wider AI community that allow teams to ship AI products with 100% confidence.

We believe that Scenario is the answer to this problem.

Key Insight: Scenario testing enables Domain-Driven Design for AI agents. Your scenarios become your domain model, defining the language, boundaries, and rules that both your agents and stakeholders understand. This domain-first approach then drives every implementation decision through TDD.

Introducing Scenario: A New Paradigm for Agent Testing

Traditionally, confidence has meant a comprehensive testing suite. Unit tests give you confidence at the lowest level of your codebase. Integration tests and traditional end-to-end tests give you confidence that your whole system works front to back.

But AI systems are different. Systems running on LLMs are by their nature nondeterministic—the same input can produce different outputs, making traditional assertion-based testing unreliable. That's why we built Scenario to run in your CI alongside your existing tests, but with a new paradigm designed specifically for AI agents.

With Scenario, you can still write unit tests and integration tests within your existing testing framework (Vitest, pytest, etc.). And if you're using Scenario's full scripting capabilities that provide access to step-by-step messages and tool calls, you can even write unit and integration tests within Scenario itself. But since you're defining scenarios—complete user journeys and business capabilities—Scenario becomes a powerful method for taking end-to-end testing even further, capturing the nuanced behaviors that matter most in AI systems.
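For example, a scripted run can pin down individual turns instead of letting the simulation free-run. The sketch below reuses the run() configuration shown later in this post, but the user()/agent()/judge() step helpers are my reading of Scenario's scripting API, so treat it as an assumption-laden sketch rather than a reference:

const result = await scenario.run({
  setId: "booking-agent-demo",
  name: "Scripted greeting check",
  description: "User opens the conversation and asks about booking a flight",
  agents: [
    scenario.userSimulatorAgent(),
    agentAdapter, // the HTTP adapter, defined later in this post
    scenario.judgeAgent({
      criteria: ["The agent should greet the user politely"],
    }),
  ],
  // Scripted, step-by-step control: a fixed user message, one agent turn,
  // then a judge verdict. Only proceed() appears elsewhere in this post;
  // the other step helpers are assumptions from the Scenario docs.
  // A custom function step could inspect intermediate messages or tool calls here.
  script: [
    scenario.user("Hi, I'd like to book a flight."),
    scenario.agent(),
    scenario.judge(),
  ],
});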

Domain-Driven Design Meets Test-Driven Development

Scenario is a framework that we built to introduce a new paradigm for testing agent systems—one that starts with domain modeling through scenarios. Rather than beginning with technical implementation details, Scenario forces you to think in terms of business capabilities, user journeys, and domain events.

The scenarios you write become your domain model: they define the language your agents speak, the boundaries of their responsibilities, and the contracts they must fulfill. This domain-first approach then drives your entire TDD cycle—every failing test is a domain requirement waiting to be implemented.

The breakthrough with Scenario isn't just about testing—it's about how it enables Domain-Driven Design (DDD) to drive your entire development process. When you write scenario tests, you're not just testing behavior; you're defining your domain language, identifying bounded contexts, and establishing the vocabulary that both your AI agents and your business stakeholders will use.

This domain-first approach then naturally drives TDD: your scenario tests become living specifications that guide every implementation decision. Instead of guessing what your agent should do, the domain requirements—expressed through scenarios—tell you exactly what to build.

As an engineer, I've been a big fan of TDD (Test-Driven Development) for a long time—TDD helps you focus on what you need to build, and when your tests pass, you can have the confidence to refactor, knowing that you won't break your product and ruin the experience for your users.

So, I wanted to put our testing framework to the test. My plan was to write the scenario test first and see what that process looks like.

Domain-Driven TDD for AI Agents

Domain-Driven Test-Driven Development applies TDD principles to generative AI. Scenarios capture business rules, domain events, and vocabulary, guiding every implementation step and aligning code with business language.

Choosing the Right Domain: Flight booking AI agent example with Scenario

I decided on a booking agent—one that could book flights, hotels, manage reservations, etc. This domain was perfect for demonstrating DDD principles because it has clear bounded contexts (flight booking, hotel booking, reservation management), well-defined domain events (booking created, reservation confirmed), and a rich vocabulary that both users and the system understand.

The booking domain also has clear aggregate roots (Booking, Reservation) and value objects (FlightDetails, PassengerInfo) that would naturally emerge from scenario-driven development.
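To make that concrete, here is a minimal sketch of how those concepts might look as TypeScript types. The names mirror the aggregates and value objects above, but the exact shapes are my own illustration, not code from the demo repository.

// Value objects: immutable details with no identity of their own.
interface FlightDetails {
  departureAirport: string; // IATA code, e.g. "JFK"
  destinationAirport: string; // IATA code, e.g. "LHR"
  departureDate: string; // ISO date
  returnDate?: string;
  serviceClass: "economy" | "premium_economy" | "business" | "first";
}

interface PassengerInfo {
  name: string;
  email: string;
  passengerCount: number;
  specialRequests?: string;
}

// Aggregate root: the Booking owns its value objects and enforces invariants.
interface Booking {
  id: string;
  status: "pending" | "confirmed" | "cancelled";
  flight: FlightDetails;
  passenger: PassengerInfo;
  createdAt: Date;
}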

I chose NestJS as my backend framework because I find it to be a delight to work with—it's modular by design, easy to spin up, well documented, and its dependency injection system makes testing really straightforward.

Domain-Driven TDD: A Step-by-Step Approach

The beauty of Scenario testing lies in its simplicity: you start with business scenarios and let failing tests drive your implementation. Here’s the flow, told briefly as a story and grounded in concrete steps.

Step 1: Write a Scenario Test for Your AI Agent

Begin by defining what your agent should accomplish from a business perspective. Don't worry about implementation details—focus on the user journey and business outcomes.

I began with a black-box approach: one invocation endpoint, no internal leaks, just enough to send user messages to the agent. My first “hello world” route proved the wiring before touching any agent logic.
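That first route was just a NestJS controller returning a canned reply. A minimal sketch looks something like this (the route path and request shape are my own placeholders, not the demo's actual code):

import { Body, Controller, Post } from "@nestjs/common";

// Hypothetical request shape: the conversation so far as role/content pairs.
interface InvokeAgentDto {
  messages: { role: "user" | "assistant"; content: string }[];
}

@Controller("agent")
export class AgentController {
  @Post("invoke")
  async invoke(@Body() body: InvokeAgentDto) {
    // A static placeholder response: just enough to prove the wiring.
    return { message: "Hello! How can I help you today?" };
  }
}

The first scenario test pointed at that endpoint looked like this: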

const result = await scenario.run({
  setId: "booking-agent-demo",
  name: "Basic greeting test",
  description: "User wants to book a flight and expects polite interaction",
  maxTurns: 5,
  agents: [
    scenario.userSimulatorAgent(),
    agentAdapter,
    scenario.judgeAgent({
      criteria: [
        "The agent should greet the user politely",
        "The agent should understand the user wants to book a flight",
      ],
    }),
  ],
  script: [scenario.proceed(5)],
});

Step 2: Run the Scenario to Simulate User Journeys and Validate Outcomes

Execute your scenario test immediately—it will fail, and that's exactly what you want. The failure reveals what's missing from your system.

I ran the test expecting red. That first failure told me the harness and HTTP path worked—and that the agent itself had no behavior yet.

npm test
# Expected: Test fails because no agent implementation exists yet
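In this project, npm test runs Vitest, and each scenario lives inside an ordinary test so the runner (and later CI) can fail the build on it. A minimal sketch, assuming the @langwatch/scenario package and that the run result exposes a success flag as in Scenario's docs:

import { describe, expect, it } from "vitest";
import scenario from "@langwatch/scenario";
// agentAdapter is the HTTP adapter sketched in Step 3.
import { agentAdapter } from "./agent.adapter";

describe("booking agent", () => {
  it("greets the user politely", async () => {
    const result = await scenario.run({
      setId: "booking-agent-demo",
      name: "Basic greeting test",
      description: "User wants to book a flight and expects polite interaction",
      maxTurns: 5,
      agents: [
        scenario.userSimulatorAgent(),
        agentAdapter,
        scenario.judgeAgent({
          criteria: ["The agent should greet the user politely"],
        }),
      ],
      script: [scenario.proceed(5)],
    });

    // Fail the test (and the build) if the judge did not pass the scenario.
    expect(result.success).toBe(true);
  });
});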

Step 3: Let Failing Tests Drive Your Implementation

Each failing test reveals missing behavior—this is where LLM testing overlaps with domain-driven rules. The test failure tells you exactly what to build next.

Those first reds became my backlog. I added a controller/service pair and a tiny adapter that relayed Scenario messages to the HTTP endpoint and back (a sketch of that adapter follows the list below).

First failure: No agent endpoint exists

  • Action: Create a basic HTTP endpoint

  • Domain insight: Your agent needs a communication interface

Second failure: Agent responds with static text

  • Action: Integrate an LLM for dynamic responses

  • Domain insight: Your agent needs intelligence to understand context

Third failure: Agent lacks conversation memory

  • Action: Implement chat history management

  • Domain insight: Your agent needs state to maintain coherent conversations
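Here's a minimal sketch of that adapter. It assumes Scenario's AgentAdapter interface (a role plus a call(input) hook that receives the conversation messages) and a locally running NestJS endpoint at /agent/invoke; both are my reading of the SDK and my own placeholders rather than the demo's exact code.

import { AgentRole, type AgentAdapter } from "@langwatch/scenario";

// Relays the conversation Scenario has built up so far to the HTTP endpoint
// and returns the agent's reply as the next assistant message.
export const agentAdapter: AgentAdapter = {
  role: AgentRole.AGENT,
  call: async (input) => {
    const response = await fetch("http://localhost:3000/agent/invoke", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ messages: input.messages }),
    });
    const { message } = await response.json();
    return message; // a plain string is treated as the agent's response
  },
};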

Step 4: Build Domain Knowledge Through Test Failures

As you implement each failing test, you're not just fixing code—you're building domain understanding. Each test failure teaches you something new about your business domain.

Plugging in Vercel’s AI SDK with OpenAI gpt-4o-mini got the greeting test to green without a custom prompt. Adding a single criterion—“coherent conversation”—turned it red again, which I fixed by introducing conversation memory. Green returned, and the domain intent became clearer.
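Concretely, the agent service ended up shaped roughly like the sketch below: a generateText call via Vercel's AI SDK with gpt-4o-mini, plus a simple per-conversation history. The in-memory map and the conversationId parameter are illustrative assumptions, not necessarily how the demo manages state.

import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";

type ChatMessage = { role: "user" | "assistant"; content: string };

// Naive in-memory conversation store, keyed by conversation id.
// Illustrative only: a real service would persist this properly.
const histories = new Map<string, ChatMessage[]>();

export async function chat(conversationId: string, userMessage: string) {
  const history = histories.get(conversationId) ?? [];
  history.push({ role: "user", content: userMessage });

  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    messages: history,
  });

  history.push({ role: "assistant", content: text });
  histories.set(conversationId, history);
  return text;
}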

Step 5: Scale Up to Complex Business Scenarios

Once basic functionality works, expand to full business scenarios that test complete user journeys:

With the core loop green, I moved to a full booking journey—letting the scenario judge outcomes, not just turns of phrase.

const result = await scenario.run({
  setId: "booking-agent-scenario-demo",
  name: "Complete flight booking",
  description:
    "User books a flight from New York to London with all required details",
  maxTurns: 100,
  agents: [
    scenario.userSimulatorAgent(),
    agentAdapter,
    scenario.judgeAgent({
      criteria: [
        "The agent should collect all required booking information",
        "The agent should create a valid booking in the system",
        "The agent should confirm the booking to the user",
      ],
    }),
  ],
  script: [scenario.proceed(100)],
});

Step 6: Validate Real Business Outcomes

Don't just test what the agent says—verify what it actually accomplishes. Check your database, confirm bookings are created, and validate that real business value is delivered.

I validated the outcome in the database with TypeORM. First pass: no bookings—because I hadn’t built the tools yet. After adding them, I hit a new red (airport codes didn’t match). Perfect: the scenario surfaced a domain rule I’d missed.

Scaling Up: The Full Flight Booking Test

I know Scenario is powerful, so I confidently moved on to a full scenario test for a complete flight booking.

Looking at the criteria I defined, I can see the domain model emerging:

  • Aggregates: FlightBooking (the main aggregate root)

  • Value Objects: Airport codes, dates, passenger counts, service classes

  • Domain Events: Booking initiated, details collected, booking confirmed

  • Business Rules: All required information must be collected before booking

  • Bounded Context: Flight booking (separate from hotel booking or car rental)

const result = await scenario.run({
  setId: "booking-agent-scenario-demo",
  name: "Book a flight",
  description: `The user (email: test-customer@test.com) wants to book a flight from New York to London.
    IMPORTANT: The conversation can be long.
    CRITICAL: DO NOT end the conversation or judge prematurely and allow the agent to complete the conversation.
    The agent should say goodbye to the user when the conversation is complete.`,
  maxTurns: 100,
  agents: [
    scenario.userSimulatorAgent(),
    agentAdapter,
    scenario.judgeAgent({
      criteria: [
        "The agent should get the user's name",
        "The agent should get the departure airport",
        "The agent should get the destination airport",
        "The agent should get the departure date",
        "The agent should get the return date",
        "The agent should get the number of passengers",
        "The agent should get the class of service",
        "The agent should get the special requests",
        "The agent should make the booking",
        "The agent should confirm the booking",
        "The agent should say goodbye to the user when the conversation is complete",
      ],
    }),
  ],
  script: [scenario.proceed(100)],
});

Domain-Driven TDD in Practice

Each criterion I defined wasn't just a test requirement—it was a domain rule that emerged from thinking about what "booking a flight" actually means in business terms. I was acting as both domain expert and developer, using scenarios to capture the business language and rules.

The TDD cycle became domain-driven: each failing test represented a missing piece of domain knowledge or capability. When I updated the agent's prompt, I wasn't just fixing code—I was implementing domain logic that the scenario had specified.

I ran the test, which obviously failed. Each time I ran it and got failing results, I updated the agent's prompt, usually with a single simple line.

1, 2, 3, Green!

Verifying the Results

But how do we know if it actually created the booking correctly?

Normally, with a lower-level test, we'd let Scenario manage the conversation history and state for our agent, and we could use Scenario's API to inspect the messages and tool calls directly. But since this is more of a black box, we can't do that as easily.

But that's not a major problem. We can just check our database directly. Since I'm using TypeORM, I can use the DataSource to look up the entities and assert their existence. This approach validates the real business outcome—whether the agent actually created a booking in the system—rather than just checking if it said the right things.
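A sketch of that check, assuming a Booking entity with a customerEmail column (both are placeholders for whatever the demo actually persists) and a Vitest assertion:

import { expect } from "vitest";
import { DataSource } from "typeorm";
import { Booking } from "./booking.entity"; // hypothetical TypeORM entity

// After scenario.run() finishes, assert the real business outcome:
// a booking row exists for the simulated customer.
export async function expectBookingCreated(dataSource: DataSource) {
  const bookings = await dataSource.getRepository(Booking).find({
    where: { customerEmail: "test-customer@test.com" },
  });

  expect(bookings.length).toBeGreaterThan(0);
}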

And… the liar! The agent hadn't actually created any bookings. Well, obviously this was going to happen; I hadn't written any tools for it yet 🙂

Building the Complete System

I wrote the services and the tools to give the LLM the ability to call them. I could have written unit/integration tests for these internal services, but I wanted to focus on the highest-level end-to-end (E2E) scenario test; if the scenario test is green, I have my confidence, and that's the goal.

Once the services and tools were in place, I ran the test again. It created the bookings! However, the booking data, specifically the airport codes, didn't match what I was expecting. Honestly, I hadn't even thought about this! It was a great example of how Scenario can help you discover domain concepts you hadn't initially considered, and since the test was now red again, I had a clear path forward for what to build next.
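The tools themselves were thin wrappers around the booking service. Here's a sketch of one using the AI SDK's tool() helper with a Zod schema; the service method and field names are stand-ins for the demo's actual code, and the schema key is inputSchema in AI SDK v5 (parameters in v4).

import { tool } from "ai";
import { z } from "zod";
import type { BookingService } from "./booking.service"; // hypothetical NestJS service

export const createBookingTool = (bookingService: BookingService) =>
  tool({
    description:
      "Create a flight booking once all required details have been collected.",
    inputSchema: z.object({
      customerEmail: z.string(),
      departureAirport: z.string().describe("IATA code, e.g. JFK"),
      destinationAirport: z.string().describe("IATA code, e.g. LHR"),
      departureDate: z.string(),
      returnDate: z.string().optional(),
      passengers: z.number().int().positive(),
      serviceClass: z.enum(["economy", "premium_economy", "business", "first"]),
      specialRequests: z.string().optional(),
    }),
    // Delegate to the domain service; the LLM only decides when to call it.
    execute: async (details) => {
      const booking = await bookingService.createBooking(details);
      return { bookingId: booking.id, status: booking.status };
    },
  });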

Domain Evolution Through Testing

As I implemented each failing test, I noticed something powerful: the domain model was evolving and becoming more precise. The scenario test wasn't just validating behavior—it was helping me discover domain concepts I hadn't initially considered.

For example, the test revealed that "special requests" was a domain concept I needed to model, even though I hadn't thought of it initially. The scenario-driven approach forced me to think more deeply about the booking domain and its boundaries.

The Power of Confidence

Since this was just an investigative prototype, I stopped here. It was a really satisfying experience: my test was clearly showing me the next steps, along with the confidence that what I'd already built still worked. This is exactly what I wanted.

The beauty of this technique is that I'm now free to update the prompt, swap models, or build out a full multi-agent system, with the confidence that if I make a change that compromises the integrity of my agent or service, I'll know before I ship. I can even write scenarios that test across different languages if I want to release to different markets, and scenarios that cover many different use cases, without needing to manually test everything each time.

Conclusion: Validating Scenario's Domain-Driven Approach

In the end, building with confidence is what it's all about—and Scenario proved it can deliver exactly that. The experiment validated that Scenario doesn't just help you test your agent; it enables you to model your business domain, establish clear boundaries, and create a shared language between your code and your business requirements.

The beauty of what I discovered is that Scenario makes domain-driven TDD not just possible, but natural. Your tests become living documentation of your business domain. When requirements change, your scenarios change first, driving the implementation. When new team members join, they can read your scenarios to understand not just what the system does, but why it does it.

Here's what the experiment proved:

  • Scenario enables domain modeling: Scenarios naturally define your business language and rules

  • Failing tests drive domain discovery: Each test failure reveals missing domain knowledge that Scenario helps you capture

  • Scenarios serve as living documentation: They work both as tests and domain specifications

  • Business capabilities drive implementation: Scenario ensures your domain drives your code, not the other way around

This validation shows that Scenario can transform how we build AI systems—not just faster, but smarter, with deep domain understanding baked into every line of code.

Ready to try domain-driven TDD with your own AI agents? Scenario is open source, and the patterns I've shown here will work with any LLM or agent architecture. Start by writing a scenario that captures your domain's core business capabilities, then let the failing tests guide your implementation.

You can explore the complete booking agent demo repository to see all the code in action.

❓ FAQ

How do you test AI agents?
Traditional unit tests can’t handle the nondeterminism of LLMs. Scenario enables AI agent testing by simulating full user journeys. It uses a judge agent and user simulator to evaluate whether business rules are met, not just whether the output string matches.

What’s the difference between LLM testing and agent simulations?

  • LLM testing checks model outputs against evaluation criteria (e.g., factual accuracy, toxicity).

  • Agent simulations test end-to-end workflows, where the agent must gather information, call tools, and satisfy business requirements. Scenario combines both, so you can run LLM evaluations inside full agent simulations.

Why is Scenario different from prompt-level test harnesses?
Prompt-level tests only validate snippets of behavior. Scenario is domain-driven: scenarios capture business rules, bounded contexts, and expected outcomes. This makes it a framework for aligning LLM behavior with product requirements, not just prompt correctness.

Can Scenario integrate with CI/CD pipelines?
Yes. Scenario runs alongside pytest, Vitest, and other testing frameworks. You can fail the build if scenarios don’t pass, giving you continuous assurance that your agents behave correctly before deployment.

What kinds of agents can Scenario test?
Any LLM-powered system: customer support bots, booking agents, fintech assistants, compliance checkers, or multi-agent systems. If it’s an AI agent that relies on conversation, tools, or workflows, Scenario can test it.

Terms

AI Agent Testing
The practice of validating how an AI-powered agent behaves in real-world user journeys. Unlike traditional testing, it focuses on outcomes, tool usage, and business rules, ensuring agents meet domain requirements under varied scenarios.

Agent Simulations
Full end-to-end recreations of conversations between AI agents and simulated users. They validate whether the agent can gather information, call tools, and reach the correct outcome, not just generate plausible text.

LLM Evaluations
Performance checks that measure specific criteria like factual accuracy, safety, or tone. LLM evaluations are usually applied at the model or component level and can be run online (in real time) or offline (during development). They complement agent simulations to give both micro- and macro-level testing coverage.

Domain-Driven TDD for AI
A methodology that combines Test-Driven Development (TDD) with Domain-Driven Design (DDD). Developers first write scenario tests that capture business rules and domain language, then let failing tests drive implementation, aligning AI behavior with product requirements.

👉 Ready to try? Scenario is open source

Start by writing a scenario for your own AI agent, and let the failing tests guide the way.

Ship agents with confidence, not crossed fingers

Get up and running with LangWatch in as little as 5 minutes.
