How Freeday Cut Agent Implementation Time with LangWatch Scenario Testing

Freeday's implementation team went from manually testing every agent change to running comprehensive scenario suites in minutes, dramatically reducing onboarding time and time-to-fix for their enterprise customers.

Company: Freeday 
Industry: Enterprise AI Automation 
Use Case: AI agent testing and quality assurance for customer implementations 

About Freeday

Freeday is an enterprise AI agent platform. Their "digital employees" handle complete workflows end-to-end across customer service, finance, and compliance, resolving up to 92.6% of inquiries autonomously for customers like TUI, Bitvavo, and Erasmus MC. Freeday's platform connects to 100+ business applications and deploys AI agents that understand context, make decisions, handle exceptions, and execute processes without human intervention.

But building great AI agents is only half the challenge. Getting them to work for each individual customer is where the real complexity begins.

The Challenge: Every customer is different

Freeday's AI agents serve industries from travel to healthcare to fintech. Each customer operates differently — with unique business rules, expectations, and edge cases that their agents need to handle correctly.

To bridge this gap, Freeday built a team of implementation managers: technical specialists responsible for taking the platform, understanding each customer's requirements, configuring the agents, and getting the customer live and successful.

This process is inherently complex. Implementation managers need to deeply understand a customer's business, translate those requirements into agent configurations, and then validate that everything works as expected. Customers need to see it working. They need to trust it.

And that's where the bottleneck lived: testing.

Before LangWatch, testing was almost entirely manual. Every time an agent was configured or modified, implementation managers would interact with it by hand: typing messages, checking responses, verifying behavior. When support tickets came in ("the agent said the wrong thing in this scenario," "this edge case wasn't handled"), the cycle repeated: manually reproduce the issue, fix the configuration, manually test again, then go back to the customer.

The result was long onboarding times and a slow support loop that didn't scale with Freeday's growing customer base.

The OKR: Making testing efficient

Recognizing the bottleneck, Freeday's leadership made testing efficiency one of their top OKRs. The objective was clear: add scenario tests for every new feature, every new implementation, and every support request. The goals behind it were twofold:

  1. Reduce time spent on testing, during both initial onboarding and ongoing support

  2. Prevent regressions, ensuring that fixing one issue doesn't break something else

The engineer leading this initiative started by experimenting with LangWatch's agent simulation framework hands-on, writing initial test scenarios manually to understand what worked.

The solution: LangWatch Scenario + Claude Code skills

Finding the right approach

An early and critical insight shaped the entire strategy. When the team first tried generating test scenarios with AI, the results were too generic: surface-level tests that didn't reflect the real complexity of each customer's business. A generic "customer asks about a refund" scenario doesn't capture the nuance of how a specific travel company handles cancellations during peak season, or how a fintech's compliance rules differ by jurisdiction.

What Freeday needed was specificity: scenarios grounded in real customer requirements, real conversations, and the actual edge cases that are hardest to get right.

The breakthrough came from combining LangWatch Scenario with rich context. By feeding in customer requirement documents, real conversation logs, and specific business rules, LangWatch could generate scenarios that genuinely tested the problems that mattered. The simulated users behaved like real customers with real problems, not generic placeholders.
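To make the difference concrete, here is a minimal sketch of what a grounded scenario can look like, assuming LangWatch Scenario's Python SDK. The freeday_travel_agent stub, the travel-policy details, and the judge criteria are illustrative assumptions, not Freeday's actual configuration; the point is that customer-specific rules live in the scenario description and the judge criteria rather than in a generic prompt.

    import pytest
    import scenario

    # Model used for the simulated user and the judge (choice is illustrative).
    scenario.configure(default_model="openai/gpt-4.1-mini")


    def freeday_travel_agent(messages: list) -> str:
        """Hypothetical stand-in for the customer's configured Freeday agent."""
        return (
            "Peak-season bookings are non-refundable, but I can rebook "
            "you on different dates for a EUR 50 fee."
        )


    class TravelAgentAdapter(scenario.AgentAdapter):
        """Wraps the agent under test so the simulated user can talk to it."""

        async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
            return freeday_travel_agent(input.messages)


    @pytest.mark.agent_test
    @pytest.mark.asyncio
    async def test_peak_season_cancellation():
        # The description and criteria encode (hypothetical) customer-specific
        # business rules instead of a generic "customer asks about a refund".
        result = await scenario.run(
            name="peak-season cancellation",
            description=(
                "A customer booked a package holiday departing in 10 days, "
                "inside the peak-season window where cancellations are "
                "non-refundable but rebooking is allowed for a fee."
            ),
            agents=[
                TravelAgentAdapter(),
                scenario.UserSimulatorAgent(),
                scenario.JudgeAgent(
                    criteria=[
                        "Agent does not promise a cash refund",
                        "Agent offers the rebooking option and states the fee",
                        "Agent stays within the customer's cancellation policy",
                    ]
                ),
            ],
        )
        assert result.success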

Scaling with Claude Code Skills

Once the team validated the approach, they packaged it into a Claude Code skill, a reusable workflow that any implementation manager could invoke. This was a deliberate choice: implementation managers are technical, but they're not developers. They needed a tool that met them where they were.

Now, an implementation manager can:

  1. Provide context — customer docs, requirements, real conversation logs

  2. Run a command — Claude Code, powered by the custom skill, generates targeted scenario tests using LangWatch Scenario (see the sketch below)

  3. Execute the suite — scenarios run against the configured agent, validating behavior across dozens of cases in minutes

  4. Get fix suggestions — when scenarios fail, the system proposes configuration changes, which are then reviewed and approved by an engineer

No manual back-and-forth. No hoping you remembered to test that one edge case. No regressions slipping through.
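What the skill produces is, in effect, a test file. The sketch below shows one plausible shape for such a generated suite, under the same assumptions as the earlier example; the scenario names, descriptions, and criteria are invented for illustration, and TravelAgentAdapter is reused from the sketch above.

    import pytest
    import scenario

    scenario.configure(default_model="openai/gpt-4.1-mini")

    # One entry per generated scenario: a grounded situation plus the judge
    # criteria that define "correct" for this customer. Scenarios created from
    # support tickets are appended here, so the full suite reruns on every change.
    SCENARIOS = [
        (
            "group booking name change",
            "A customer needs to change one traveller's name on a group "
            "booking two days before departure.",
            [
                "Agent explains the name-change deadline",
                "Agent collects the booking reference before acting",
            ],
        ),
        (
            "duplicate charge (generated from a support ticket)",
            "A customer reports being charged twice for the same booking.",
            [
                "Agent verifies the charge before promising a refund",
                "Agent never reveals another customer's booking details",
            ],
        ),
    ]


    @pytest.mark.agent_test
    @pytest.mark.asyncio
    @pytest.mark.parametrize("name, description, criteria", SCENARIOS)
    async def test_customer_scenarios(name, description, criteria):
        result = await scenario.run(
            name=name,
            description=description,
            agents=[
                TravelAgentAdapter(),  # the adapter from the earlier sketch
                scenario.UserSimulatorAgent(),
                scenario.JudgeAgent(criteria=criteria),
            ],
        )
        assert result.success

A single pytest command then exercises every case at once, which is what makes step 4 practical: a proposed configuration fix can be validated against the whole suite rather than one conversation at a time.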

The results

The impact was immediate and significant:

  • Dramatically reduced onboarding time — new customers go live faster because testing no longer requires days of manual interaction

  • Faster time-to-fix on support tickets — when an issue comes in, the team creates a scenario for it, fixes the configuration, reruns the full suite, and confirms nothing else broke — all in a fraction of the previous time

  • Comprehensive regression prevention — instead of testing one scenario at a time (as was done manually), the team now runs entire suites, catching issues that would have previously gone unnoticed until a customer reported them

  • Empowered implementation team — every implementation manager now has access to simulation-driven testing through Claude Code, turning what was a developer-only capability into a team-wide workflow

What's next: Closing the feedback loop

With the testing OKR delivering strong results, Freeday is already looking ahead to their next objective: closing the feedback loop between production agent behavior and testing.

The roadmap includes:

  • LangWatch trace integration — adopting LangWatch's tracing capabilities to verify that the right tools were called at the right time, and that agent behavior in production matches what was validated in testing (a rough sketch follows at the end of this section)

  • Online evaluations — moving beyond pre-deployment testing to continuous monitoring of agent quality in production

  • Customer-facing reporting — giving Freeday's customers direct visibility into how their agents are performing, building trust through transparency

  • Industry-specific scenario baselines — by analyzing patterns across customers in the same vertical, Freeday can pre-build scenario suites for common edge cases. When a new travel customer onboards, they already benefit from lessons learned across every previous travel implementation — including issues the customer themselves may not have anticipated

This creates a flywheel: production data informs better tests, better tests catch more issues before they reach production, and the accumulated knowledge across customers makes every new implementation faster and more robust.
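As a rough sketch of what the trace-integration piece could look like, assuming the LangWatch Python SDK's decorator-style tracing (the function names, the booking flow, and the span type are illustrative): instrumenting tool calls as spans means the same question asked in testing, "was the right tool called at the right time?", can also be asked of production traffic.

    import langwatch

    # One-time setup; the SDK is expected to read LANGWATCH_API_KEY
    # from the environment.
    langwatch.setup()


    # Hypothetical tool the agent calls. Recording it as a "tool" span lets
    # LangWatch check that it was invoked when the scenario suite expects it.
    @langwatch.span(type="tool")
    def lookup_booking(booking_ref: str) -> dict:
        return {"booking_ref": booking_ref, "status": "confirmed"}


    # Each customer message becomes one trace, so production behavior can be
    # compared against what the scenario suite validated before deployment.
    @langwatch.trace()
    def handle_message(message: str) -> str:
        booking = lookup_booking("ABC123")  # illustrative flow
        return f"Your booking {booking['booking_ref']} is {booking['status']}."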

Key takeaways

  • Generic AI-generated tests aren't enough. Grounding scenarios in real customer data — documents, conversations, and specific business rules — is what makes simulation testing actually valuable.

  • Meet your team where they are. By packaging the LangWatch Scenario workflow into a Claude Code skill, Freeday made advanced agent testing accessible to their entire implementation team, not just developers.

  • Testing is a competitive advantage. Faster onboarding, faster fixes, and fewer regressions directly translate to better customer experience and a more scalable business.

LangWatch is the AI agent testing and evaluation platform. Test agents with simulated users, prevent regressions, and ship with confidence. Learn more at langwatch.ai.