How Freeday Cut Agent Implementation Time with LangWatch Scenario Testing

Freeday's implementation team went from manually testing every agent change to running comprehensive scenario suites in minutes, dramatically reducing onboarding time and time-to-fix for their enterprise customers.

Company: Freeday 
Industry: Enterprise AI Automation 
Use Case: AI agent testing and quality assurance for customer implementations 

About Freeday

Freeday is an enterprise AI agent platform. Their "digital employees" handle complete workflows end-to-end across customer service, finance, and compliance, resolving up to 92.6% of inquiries autonomously for customers like TUI, Bitvavo, and Erasmus MC. Freeday's platform connects to 100+ business applications and deploys AI agents that understand context, make decisions, handle exceptions, and execute processes without human intervention.

But building great AI agents is only half the challenge. Getting them to work for each individual customer is where the real complexity begins.

The Challenge: Every customer is different

Freeday's AI agents serve industries from travel to healthcare to fintech. Each customer operates differently — with unique business rules, expectations, and edge cases that their agents need to handle correctly.

To bridge this gap, Freeday built a team of implementation managers: technical specialists responsible for taking the platform, understanding each customer's requirements, configuring the agents, and getting the customer live and successful.

This process is inherently complex. Implementation managers need to deeply understand a customer's business, translate those requirements into agent configurations, and then validate that everything works as expected. Customers need to see it working. They need to trust it.

And that's where the bottleneck lived: testing.

Before LangWatch, testing was almost entirely manual. Every time an agent was configured or modified, implementation managers would interact with it by hand: typing messages, checking responses, verifying behavior. When support tickets came in ("the agent said the wrong thing in this scenario," "this edge case wasn't handled"), the cycle repeated: manually reproduce the issue, fix the configuration, manually test again, then go back to the customer.

The result was long onboarding times and a slow support loop that didn't scale with Freeday's growing customer base.

The OKR: Making testing efficient

Recognizing the bottleneck, Freeday's leadership made testing efficiency one of their top OKRs. The objective was clear: add scenario tests for every new feature, every new implementation, and every support request. The goals behind it were twofold:

  1. Reduce time spent on testing, during both initial onboarding and ongoing support

  2. Prevent regressions, ensuring that fixing one issue doesn't break something else

The engineer leading this initiative started by experimenting with LangWatch's agent simulation framework hands-on, writing initial test scenarios manually to understand what worked.

The solution: LangWatch Scenario + Claude Code skills

Finding the right approach

An early and critical insight shaped the entire strategy. When the team first tried generating test scenarios with AI, the results were too generic: surface-level tests that didn't reflect the real complexity of each customer's business. A generic "customer asks about a refund" scenario doesn't capture the nuance of how a specific travel company handles cancellations during peak season, or how a fintech's compliance rules differ by jurisdiction.

What Freeday needed was specificity: scenarios grounded in real customer requirements, real conversations, and the actual edge cases that are hardest to get right.

The breakthrough came from combining LangWatch Scenario with rich context. By feeding in customer requirement documents, real conversation logs, and specific business rules, LangWatch could generate scenarios that genuinely tested the problems that mattered. The simulated users behaved like real customers with real problems, not generic placeholders.
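To make the difference concrete, here is a minimal sketch of what a grounded scenario can look like, assuming LangWatch Scenario's Python SDK. The freeday_travel_agent stub, the travel-policy details, and the judge criteria are illustrative assumptions, not Freeday's actual configuration; the point is that customer-specific rules live in the scenario description and the judge criteria rather than in a generic prompt.

    import pytest
    import scenario

    # Model used for the simulated user and the judge (choice is illustrative).
    scenario.configure(default_model="openai/gpt-4.1-mini")


    def freeday_travel_agent(messages: list) -> str:
        """Hypothetical stand-in for the customer's configured Freeday agent."""
        return (
            "Peak-season bookings are non-refundable, but I can rebook "
            "you on different dates for a EUR 50 fee."
        )


    class TravelAgentAdapter(scenario.AgentAdapter):
        """Wraps the agent under test so the simulated user can talk to it."""

        async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
            return freeday_travel_agent(input.messages)


    @pytest.mark.agent_test
    @pytest.mark.asyncio
    async def test_peak_season_cancellation():
        # The description and criteria encode (hypothetical) customer-specific
        # business rules instead of a generic "customer asks about a refund".
        result = await scenario.run(
            name="peak-season cancellation",
            description=(
                "A customer booked a package holiday departing in 10 days, "
                "inside the peak-season window where cancellations are "
                "non-refundable but rebooking is allowed for a fee."
            ),
            agents=[
                TravelAgentAdapter(),
                scenario.UserSimulatorAgent(),
                scenario.JudgeAgent(
                    criteria=[
                        "Agent does not promise a cash refund",
                        "Agent offers the rebooking option and states the fee",
                        "Agent stays within the customer's cancellation policy",
                    ]
                ),
            ],
        )
        assert result.success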

Scaling with Claude Code Skills

Once the team validated the approach, they packaged it into a Claude Code skill, a reusable workflow that any implementation manager could invoke. This was a deliberate choice: implementation managers are technical, but they're not developers. They needed a tool that met them where they were.

Now, an implementation manager can:

  1. Provide context — customer docs, requirements, real conversation logs

  2. Run a command — Claude Code, powered by the custom skill, generates targeted scenario tests using LangWatch Scenario (see the sketch below)

  3. Execute the suite — scenarios run against the configured agent, validating behavior across dozens of cases in minutes

  4. Get fix suggestions — when scenarios fail, the system proposes configuration changes, which are then reviewed and approved by an engineer

No manual back-and-forth. No hoping you remembered to test that one edge case. No regressions slipping through.
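What the skill produces is, in effect, a test file. The sketch below shows one plausible shape for such a generated suite, under the same assumptions as the earlier example; the scenario names, descriptions, and criteria are invented for illustration, and TravelAgentAdapter is reused from the sketch above.

    import pytest
    import scenario

    scenario.configure(default_model="openai/gpt-4.1-mini")

    # One entry per generated scenario: a grounded situation plus the judge
    # criteria that define "correct" for this customer. Scenarios created from
    # support tickets are appended here, so the full suite reruns on every change.
    SCENARIOS = [
        (
            "group booking name change",
            "A customer needs to change one traveller's name on a group "
            "booking two days before departure.",
            [
                "Agent explains the name-change deadline",
                "Agent collects the booking reference before acting",
            ],
        ),
        (
            "duplicate charge (generated from a support ticket)",
            "A customer reports being charged twice for the same booking.",
            [
                "Agent verifies the charge before promising a refund",
                "Agent never reveals another customer's booking details",
            ],
        ),
    ]


    @pytest.mark.agent_test
    @pytest.mark.asyncio
    @pytest.mark.parametrize("name, description, criteria", SCENARIOS)
    async def test_customer_scenarios(name, description, criteria):
        result = await scenario.run(
            name=name,
            description=description,
            agents=[
                TravelAgentAdapter(),  # the adapter from the earlier sketch
                scenario.UserSimulatorAgent(),
                scenario.JudgeAgent(criteria=criteria),
            ],
        )
        assert result.success

A single pytest command then exercises every case at once, which is what makes step 4 practical: a proposed configuration fix can be validated against the whole suite rather than one conversation at a time.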

The results

The impact was immediate and significant:

  • Dramatically reduced onboarding time — new customers go live faster because testing no longer requires days of manual interaction

  • Faster time-to-fix on support tickets — when an issue comes in, the team creates a scenario for it, fixes the configuration, reruns the full suite, and confirms nothing else broke — all in a fraction of the previous time

  • Comprehensive regression prevention — instead of testing one scenario at a time (as was done manually), the team now runs entire suites, catching issues that would have previously gone unnoticed until a customer reported them

  • Empowered implementation team — every implementation manager now has access to simulation-driven testing through Claude Code, turning what was a developer-only capability into a team-wide workflow

What's next: Closing the feedback loop

With the testing OKR delivering strong results, Freeday is already looking ahead to their next objective: closing the feedback loop between production agent behavior and testing.

The roadmap includes:

  • LangWatch trace integration — adopting LangWatch's tracing capabilities to verify that the right tools were called at the right time, and that agent behavior in production matches what was validated in testing (a rough sketch follows at the end of this section)

  • Online evaluations — moving beyond pre-deployment testing to continuous monitoring of agent quality in production

  • Customer-facing reporting — giving Freeday's customers direct visibility into how their agents are performing, building trust through transparency

  • Industry-specific scenario baselines — by analyzing patterns across customers in the same vertical, Freeday can pre-build scenario suites for common edge cases. When a new travel customer onboards, they already benefit from lessons learned across every previous travel implementation — including issues the customer themselves may not have anticipated

This creates a flywheel: production data informs better tests, better tests catch more issues before they reach production, and the accumulated knowledge across customers makes every new implementation faster and more robust.
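As a rough sketch of what the trace-integration piece could look like, assuming the LangWatch Python SDK's decorator-style tracing (the function names, the booking flow, and the span type are illustrative): instrumenting tool calls as spans means the same question asked in testing, "was the right tool called at the right time?", can also be asked of production traffic.

    import langwatch

    # One-time setup; the SDK is expected to read LANGWATCH_API_KEY
    # from the environment.
    langwatch.setup()


    # Hypothetical tool the agent calls. Recording it as a "tool" span lets
    # LangWatch check that it was invoked when the scenario suite expects it.
    @langwatch.span(type="tool")
    def lookup_booking(booking_ref: str) -> dict:
        return {"booking_ref": booking_ref, "status": "confirmed"}


    # Each customer message becomes one trace, so production behavior can be
    # compared against what the scenario suite validated before deployment.
    @langwatch.trace()
    def handle_message(message: str) -> str:
        booking = lookup_booking("ABC123")  # illustrative flow
        return f"Your booking {booking['booking_ref']} is {booking['status']}."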

Key takeaways

  • Generic AI-generated tests aren't enough. Grounding scenarios in real customer data — documents, conversations, and specific business rules — is what makes simulation testing actually valuable.

  • Meet your team where they are. By packaging the LangWatch Scenario workflow into a Claude Code skill, Freeday made advanced agent testing accessible to their entire implementation team, not just developers.

  • Testing is a competitive advantage. Faster onboarding, faster fixes, and fewer regressions directly translate to better customer experience and a more scalable business.

LangWatch is the AI agent testing and evaluation platform. Test agents with simulated users, prevent regressions, and ship with confidence. Learn more at langwatch.ai.