Customer Story: How Roojoom automates AI Agent Quality Control with LangWatch Scenario

Manouk Draisma
Jun 25, 2026
Customer: Roojoom
Interviewee: Amit, Leading AI at Roojoom
Use case: Automating offline evaluations, testing prompt regressions, multi-step agent workflows, getting enterprise ready
LangWatch features in use: Scenario Testing, Agent Simulations, Observability, Judge integration, Custom success/failure logic
Background
Roojoom delivers advanced AI-powered customer journey orchestration, powering complex agent-driven workflows across industries. As these workflows scaled, so did the challenge: how to continuously test and validate AI behavior at scale — before it impacts real users.
As part of its expanding AI innovation portfolio, Roojoom recently launched PickMyCall.ai — an AI-powered voice assistant and productivity platform designed to help small businesses grow. Leveraging Roojoom's proven expertise in orchestrating conversations and AI agents, PickMyCall.ai brings enterprise-level voice automation to everyday business tasks.
The scaling challenge: As Roojoom's client base grew and AI workflows became more sophisticated, the team recognized the need to evolve from traditional QA approaches to enterprise-grade, automated testing infrastructure. With complex multi-step agents serving critical business functions, ensuring consistent quality at scale became paramount.
When Amit joined as Head of AI, he brought extensive experience from traditional ML environments and immediately saw an opportunity to implement best-in-class evaluation frameworks that would position Roojoom as an industry leader in AI reliability.
"When I saw LangWatch for the first time, it reminded me of how we used to evaluate models in classic machine learning.I knew this was exactly what we needed to maintain our high standards at enterprise scale"
— Amit, Roojoom

From Traditional QA to structured evaluations
The opportunity: Roojoom's rapid growth meant scaling beyond manual testing approaches that worked well for smaller deployments. The team needed enterprise-grade automation that could match their sophisticated AI capabilities.
"We were successful with traditional QA methods, but as we scaled to serve larger enterprise clients, we needed testing infrastructure that could match the complexity and reliability expectations of our AI solutions."
— Amit
Using LangWatch Scenarios, Amit and the team built a daily automation script (sketched below) to:
Evaluate prompts and workflows every morning
Use LLM judges to determine pass/fail outcomes
Compare results to a version-controlled baseline
Detect regressions early before deploying changes
"Now we have enterprise-grade baselines. When we deploy new versions, we instantly validate performance against our quality standards — it's given us tremendous confidence in our releases.”
— Amit
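
For illustration, a morning regression run along these lines might look like the sketch below. It assumes the open-source Scenario Python SDK; the agent adapter, the scenario list, the model name, and the baseline.json file are hypothetical placeholders, and exact class names may differ from the SDK version you run.

```python
# daily_scenario_check.py: hedged sketch of a morning regression run.
# Assumes the open-source Scenario Python SDK; the adapter, scenario
# definitions, model name, and baseline.json are illustrative placeholders.
import asyncio
import json
import sys

import scenario

scenario.configure(default_model="openai/gpt-4.1-mini")  # example model name


class RoojoomAgentAdapter(scenario.AgentAdapter):
    """Thin wrapper so the user simulator can call the agent turn by turn."""

    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # In production this would invoke the real agent; stubbed for the sketch.
        return "Thanks for reaching out. I'll take care of that for you."


# Hypothetical set of critical workflows to re-check every morning.
SCENARIOS = {
    "missed_call_followup": "A small-business owner asks the assistant to follow up on a missed call.",
    "billing_question": "A customer asks why last month's invoice was higher than usual.",
}


async def run_all() -> dict[str, bool]:
    results: dict[str, bool] = {}
    for name, description in SCENARIOS.items():
        result = await scenario.run(
            name=name,
            description=description,
            agents=[
                RoojoomAgentAdapter(),
                scenario.UserSimulatorAgent(),
                scenario.JudgeAgent(
                    criteria=["The agent resolves the request politely and accurately."]
                ),
            ],
        )
        results[name] = result.success
    return results


if __name__ == "__main__":
    with open("baseline.json") as f:  # version-controlled expected outcomes
        baseline: dict[str, bool] = json.load(f)
    today = asyncio.run(run_all())
    regressions = [name for name, ok in baseline.items() if ok and not today.get(name, False)]
    if regressions:
        print(f"Regressions detected: {regressions}")
        sys.exit(1)  # non-zero exit flags the regression before anything ships
    print("All scenarios match the baseline.")
```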
Why LangWatch to evaluate your AI Agents?
Roojoom's engineering team evaluated multiple solutions, seeking enterprise-grade LLM observability, structured evaluations, and explainability that could match their background in production ML systems.
LangWatch offered Roojoom a fundamentally different approach to testing LLM applications.
LangWatch Scenarios simulate real interactions between a user and an agent. They go beyond simple prompt-response testing to validate complete workflows, across multiple turns, tools, and states. Each scenario includes a user simulator, the agent-under-test, and a judge agent that determines whether the outcome was a success or failure based on business rules and soft criteria.
Roojoom’s agents aren’t just static chatbots: they make decisions, call APIs, perform logic, and adapt based on previous turns. For this, they needed more than string-matching assertions. They needed intelligent, reproducible, criteria-based simulations, and LangWatch delivered exactly that.
“LangWatch gives you flexibility and visibility. It helps you understand exactly how your models behave in production scenarios, especially with sophisticated multi-step agents.”
— Amit
What is LangWatch Scenario, exactly?

At its core, a LangWatch Scenario is an agent-based simulation test.
Each test includes:
User Simulator Agent – Mimics a real user’s inputs and reactions.
Agent Under Test – Roojoom’s live agent code, wrapped for testing.
Judge Agent – An LLM or scripted function that evaluates whether the scenario met defined success criteria.
ScenarioExecutor – Orchestrates the test flow, tracks the state, and determines pass/fail.
LangWatch offers two styles of testing:
Scripted Scenarios: Define every message and agent response, step by step. Ideal for critical edge cases (see the sketch after this list).
Goal-Based Scenarios: Let the user simulator pursue a goal (e.g., “cancel my subscription”), and test whether the agent reaches it successfully.
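
As a rough illustration of the scripted style, the sketch below pins every turn down explicitly. It leans on the Scenario Python SDK's scripting helpers (scenario.user, scenario.agent, scenario.judge); those helper names, the adapter, and its canned reply are illustrative and may differ from the SDK version you run.

```python
# Hedged sketch of a scripted scenario: every turn is pinned down explicitly.
# Helper names (scenario.user / scenario.agent / scenario.judge) follow the
# open-source Scenario SDK but may differ in your installed version.
import asyncio

import scenario

scenario.configure(default_model="openai/gpt-4.1-mini")  # example model name


class RoojoomAgentAdapter(scenario.AgentAdapter):
    """Placeholder wrapper; in practice this calls the real agent."""

    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        return "Our refund window is 30 days, so I can't process this refund, but here are your options."


async def refund_edge_case() -> None:
    result = await scenario.run(
        name="refund outside window",
        description="A customer demands a refund for an order placed six months ago.",
        agents=[
            RoojoomAgentAdapter(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(
                criteria=["The agent explains the refund policy without promising a refund."]
            ),
        ],
        script=[
            scenario.user("I want a refund for an order I placed six months ago."),
            scenario.agent(),  # let the agent-under-test answer that exact message
            scenario.judge(),  # ask the judge to rule on this specific exchange
        ],
    )
    assert result.success


asyncio.run(refund_edge_case())
```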
Each scenario can include evaluation criteria like:
"Was the tone polite?"
"Did the agent use the correct API call?"
"Was the final answer accurate?"
These are automatically evaluated by a Judge Agent, an LLM with structured reasoning capabilities.
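
Putting those pieces together, a goal-based scenario judged against criteria like the ones above might look like this hedged sketch (Scenario Python SDK assumed; the adapter and its canned reply are placeholders, not Roojoom's real agent):

```python
# Hedged sketch: a goal-based scenario judged against Roojoom-style criteria.
# Assumes the open-source Scenario Python SDK; the adapter is a placeholder.
import asyncio

import scenario

scenario.configure(default_model="openai/gpt-4.1-mini")  # example model name


class RoojoomAgentAdapter(scenario.AgentAdapter):
    """Placeholder wrapper; in practice this calls the production agent."""

    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        return "Of course. I've scheduled the cancellation and emailed you a confirmation."


async def main() -> None:
    result = await scenario.run(
        name="cancel subscription",
        description="A customer wants to cancel their subscription before the next billing cycle.",
        agents=[
            RoojoomAgentAdapter(),
            scenario.UserSimulatorAgent(),  # plays the customer and pursues the goal
            scenario.JudgeAgent(
                criteria=[
                    "The tone is polite.",
                    "The agent uses the correct API call.",
                    "The final answer is accurate.",
                ]
            ),
        ],
    )
    assert result.success  # the judge's verdict decides pass/fail


asyncio.run(main())
```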
Production Results: From Good to Industry-Leading
Before LangWatch:
Manual QA processes suitable for initial scale
Reactive issue identification through client feedback
Standard testing approaches for traditional software
After adopting LangWatch Scenarios:
A daily script runs critical scenario tests automatically.
Each scenario simulates a realistic user conversation.
Judges evaluate responses based on Roojoom’s criteria.
Results are compared to a version-controlled baseline.
Regressions are flagged early, before they reach production.
“Now we have industry-leading quality baselines. When we deploy new versions, we instantly validate performance.”
— Amit
This shift gave Roojoom the ability to run offline evaluations continuously, gaining the kind of feedback loops that traditional software teams enjoy with unit tests, but tailored to the fuzziness and flexibility of LLM-based systems.
Roojoom integrated LangWatch into every layer of their development and QA process.
CI/CD Testing for Every Prompt
Whenever a developer opens a pull request that touches agent logic or prompt templates, LangWatch Scenarios run in CI via the open-source Scenario library, so the team can treat agent tests just like unit tests.
If a scenario fails, the PR is blocked.
If a regression is detected, the baseline diff is shown.
Engineers get visual feedback showing which criteria failed.
“We want every prompt change to trigger a LangWatch test in CI, just like a unit test. That’s how we code now.”
— Amit
This integration enables faster development cycles with enterprise-grade quality assurance.
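
As an illustration of that CI gate, a scenario can be written as an ordinary pytest test, so a failing judge verdict fails the suite and blocks the pull request. The sketch below assumes the Scenario Python SDK plus pytest-asyncio; the test, adapter, and criteria are hypothetical rather than Roojoom's actual suite.

```python
# tests/test_prompt_regressions.py: hedged sketch of a scenario run as a CI unit test.
# Assumes the open-source Scenario Python SDK and pytest-asyncio; CI simply runs
# `pytest`, so a failed judge verdict fails the job and blocks the pull request.
import pytest
import scenario

scenario.configure(default_model="openai/gpt-4.1-mini")  # example model name


class RoojoomAgentAdapter(scenario.AgentAdapter):
    """Placeholder wrapper around the agent code touched by the pull request."""

    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        return "I've booked the callback for tomorrow at 10:00."


@pytest.mark.asyncio
async def test_callback_booking_flow():
    result = await scenario.run(
        name="callback booking",
        description="A small-business customer asks the voice assistant to schedule a callback.",
        agents=[
            RoojoomAgentAdapter(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(
                criteria=[
                    "The agent confirms a concrete date and time.",
                    "The tone stays professional and concise.",
                ]
            ),
        ],
    )
    # Assert like a unit test: a failed scenario fails CI and blocks the merge.
    assert result.success
```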
Key benefits of using LangWatch to test AI agents
| Benefit | Description |
| --- | --- |
| Structured evaluation | Clear, repeatable tests for complex agent workflows |
| CI integration | Automatic testing on every code or prompt change |
| Observability | Full visibility into agent decisions, tool calls, and conversations |
| Regression detection | Instant alerts when a model or prompt performs worse |
| Cross-team collaboration | Product, QA, and engineering aligned around shared criteria |
| Faster iterations | Developers can ship safely without breaking core flows |
Ready to try LangWatch Scenarios to test your AI agents?
"LangWatch is our go-to tool for working with multi-step agents. For anyone moving from traditional ML into LLMs, it’s a very natural fit."
— Amit, Roojoom
Roojoom’s journey with LangWatch shows how simulation testing, intelligent evaluation, and automation can dramatically improve LLM product reliability without slowing down innovation. By combining open-source testing with a powerful visual platform, LangWatch helps AI-native companies and enterprises building AI agents go from reactive bug-fixing to proactive quality assurance for every AI release.
Whether you're validating agents, testing prompts, or automating LLM evaluations, LangWatch lets you move faster with confidence.
👉 Try LangWatch Scenarios today or Book a demo to get started.