Customer Story: How Roojoom automates AI Agent Quality Control with LangWatch Scenario

Manouk Draisma

Jun 25, 2026

Customer: Roojoom
Interviewee: Amit, Head of AI at Roojoom
Use case: Automating offline evaluations, testing prompt regressions, multi-step agent workflows, getting enterprise-ready
LangWatch features in use: Scenario Testing, Agent Simulations, Observability, Judge integration, Custom success/failure logic

Background

Roojoom delivers advanced AI-powered customer journey orchestration, powering complex agent-driven workflows across industries. As these workflows scaled, so did the challenge: how to continuously test and validate AI behavior at scale — before it impacts real users.

As part of its expanding AI innovation portfolio, Roojoom recently launched PickMyCall.ai — an AI-powered voice assistant and productivity platform designed to help small businesses grow. Leveraging Roojoom's proven expertise in orchestrating conversations and AI agents, PickMyCall.ai brings enterprise-level voice automation to everyday business tasks.

The scaling challenge: As Roojoom's client base grew and AI workflows became more sophisticated, the team recognized the need to evolve from traditional QA approaches to enterprise-grade, automated testing infrastructure. With complex multi-step agents serving critical business functions, ensuring consistent quality at scale became paramount.

When Amit joined as Head of AI, he brought extensive experience from traditional ML environments and immediately saw an opportunity to implement best-in-class evaluation frameworks that would position Roojoom as an industry leader in AI reliability.

"When I saw LangWatch for the first time, it reminded me of how we used to evaluate models in classic machine learning.I knew this was exactly what we needed to maintain our high standards at enterprise scale"
Amit, Roojoom

From traditional QA to structured evaluations

The opportunity: Roojoom's rapid growth meant scaling beyond manual testing approaches that worked well for smaller deployments. The team needed enterprise-grade automation that could match their sophisticated AI capabilities.

"We were successful with traditional QA methods, but as we scaled to serve larger enterprise clients, we needed testing infrastructure that could match the complexity and reliability expectations of our AI solutions."
Amit

Using LangWatch Scenarios, Amit and the team built a daily automation script (a sketch follows the list below) to:

  • Evaluate prompts and workflows every morning

  • Use LLM judges to determine pass/fail outcomes

  • Compare results to a version-controlled baseline

  • Detect regressions early before deploying changes
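
To make that concrete, here is a minimal sketch of what such a daily script can look like, built on the open-source scenario Python package that powers LangWatch Scenarios. The suite definition, the AgentUnderTest stub, and the evals/baseline.json path are hypothetical placeholders, and the API names (scenario.run, UserSimulatorAgent, JudgeAgent) follow the library's documented pattern but should be checked against the version you install; this is a sketch, not Roojoom's actual pipeline.

```python
"""Daily regression check (illustrative sketch, not Roojoom's actual code).

Assumes the open-source `scenario` package (pip install langwatch-scenario);
the SUITE definition, AgentUnderTest stub, and evals/baseline.json path are
hypothetical placeholders.
"""
import asyncio
import json
import sys
from pathlib import Path

import scenario

# Model used by the user simulator and the LLM judge.
scenario.configure(default_model="openai/gpt-4.1-mini")

# Version-controlled baseline: maps scenario name -> whether it passed last time.
BASELINE = Path("evals/baseline.json")


class AgentUnderTest(scenario.AgentAdapter):
    """Stub standing in for the production agent; call your real agent here."""

    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        return "Thanks for reaching out! Let me look into that for you."


SUITE = [
    {
        "name": "cancel subscription",
        "description": "A frustrated customer wants to cancel their subscription.",
        "criteria": [
            "The agent stays polite throughout",
            "The agent offers a retention option before cancelling",
        ],
    },
    # ...one entry per critical workflow
]


async def main() -> int:
    results: dict[str, bool] = {}
    for case in SUITE:
        run = await scenario.run(
            name=case["name"],
            description=case["description"],
            agents=[
                AgentUnderTest(),
                scenario.UserSimulatorAgent(),                   # plays the customer
                scenario.JudgeAgent(criteria=case["criteria"]),  # LLM judge -> pass/fail
            ],
        )
        results[case["name"]] = run.success

    baseline = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
    regressions = [name for name, ok in results.items() if baseline.get(name) and not ok]
    if regressions:
        print("Regressions vs baseline:", ", ".join(regressions))
        return 1
    print("All scenarios at or above baseline.")
    return 0


if __name__ == "__main__":
    sys.exit(asyncio.run(main()))
```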

"Now we have enterprise-grade baselines. When we deploy new versions, we instantly validate performance against our quality standards — it's given us tremendous confidence in our releases.”
Amit

Why LangWatch to evaluate your AI agents?

Roojoom's engineering team evaluated multiple solutions, seeking enterprise-grade LLM observability, structured evaluations, and explainability that could match their background in production ML systems.

LangWatch offered Roojoom a fundamentally different approach to testing LLM applications.

LangWatch Scenarios simulate real interactions between a user and an agent. They go beyond simple prompt-response testing to validate complete workflows, across multiple turns, tools, and states. Each scenario includes a user simulator, the agent-under-test, and a judge agent that determines whether the outcome was a success or failure based on business rules and soft criteria.

Roojoom’s agents aren’t just static chatbots: they make decisions, call APIs, perform logic, and adapt based on previous turns. For this, they needed more than string-matching assertions. They needed intelligent, reproducible, criteria-based simulations, and LangWatch delivered exactly that.

“LangWatch gives you flexibility and visibility. It helps you understand exactly how your models behave in production scenarios, especially with sophisticated multi-step agents.”
— Amit

What is LangWatch Scenario, exactly?

At its core, a LangWatch Scenario is an agent-based simulation test.

Each test includes the following four components (a code sketch follows the list):

  1. User Simulator Agent – Mimics a real user’s inputs and reactions.

  2. Agent Under Test – Roojoom’s live agent code, wrapped for testing.

  3. Judge Agent – An LLM or scripted function that evaluates whether the scenario met defined success criteria.

  4. ScenarioExecutor – Orchestrates the test flow, tracks the state, and determines pass/fail.
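
Mapped onto code, those four pieces look roughly like the sketch below, using the open-source scenario Python package behind LangWatch Scenarios. The BookingAgent class and its canned reply are hypothetical stand-ins for a real agent, and the API names follow the library's documented pattern rather than Roojoom's actual code.

```python
# Illustrative anatomy of a single Scenario test (names are assumptions, not Roojoom's code).
import asyncio

import scenario

scenario.configure(default_model="openai/gpt-4.1-mini")


# 2. Agent Under Test: the live agent code, wrapped in a thin adapter.
class BookingAgent(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # A real adapter would call the production agent (prompts, tools, APIs, state).
        return "I can book that for you. Which day works best?"


async def main() -> None:
    # 4. scenario.run plays the role of the executor: it orchestrates the turns,
    #    tracks conversation state, and returns a pass/fail result.
    result = await scenario.run(
        name="book an appointment",
        description="A small-business customer wants to book a haircut for Friday afternoon.",
        agents=[
            BookingAgent(),
            scenario.UserSimulatorAgent(),   # 1. mimics the customer's inputs and reactions
            scenario.JudgeAgent(             # 3. rules success or failure against the criteria
                criteria=[
                    "The agent confirms a concrete day and time",
                    "The agent does not invent services the business does not offer",
                ]
            ),
        ],
    )
    print("PASS" if result.success else "FAIL")


asyncio.run(main())
```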

LangWatch offers two styles of testing (a scripted-style sketch follows the list):

  • Scripted Scenarios: Define every message and agent response, step-by-step. Ideal for critical edge cases.

  • Goal-Based Scenarios: Let the user simulator pursue a goal (e.g., “cancel my subscription”), and test whether the agent reaches it successfully.
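
The goal-based style is what the previous sketch uses. For the scripted style, the scenario package exposes a script parameter that pins down each turn; the helpers shown here (scenario.user, scenario.agent, scenario.judge) are taken from the library's documented examples and should be treated as assumptions, and the support agent stub is hypothetical.

```python
# Scripted-style sketch: every turn is pinned down, ideal for a known edge case.
import asyncio

import scenario

scenario.configure(default_model="openai/gpt-4.1-mini")


class SupportAgent(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        return "I'm sorry about the double charge. I've started the refund and cancellation."


async def main() -> None:
    result = await scenario.run(
        name="double charge then cancel (scripted edge case)",
        description="A customer was charged twice and immediately wants to cancel.",
        agents=[
            SupportAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=["The agent apologises and explains the refund path"]),
        ],
        script=[
            scenario.user("You charged me twice this month. Cancel my subscription now."),
            scenario.agent(),   # let the agent under test answer this exact message
            scenario.judge(),   # ask the judge to rule on the exchange right here
        ],
    )
    assert result.success


asyncio.run(main())
```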

Each scenario can include evaluation criteria like:

  • "Was the tone polite?"

  • "Did the agent use the correct API call?"

  • "Was the final answer accurate?"

These are automatically evaluated by a Judge Agent, an LLM with structured reasoning capabilities.
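
In code, those criteria are plain natural-language strings handed to the judge. A minimal sketch, assuming the JudgeAgent constructor from the scenario package:

```python
import scenario

# Hypothetical judge configuration mirroring the criteria above; the LLM judge
# reasons over the full conversation before returning a pass/fail verdict.
judge = scenario.JudgeAgent(
    criteria=[
        "The tone of the agent is polite throughout the conversation",
        "The agent uses the correct API call for the user's request",
        "The final answer given to the user is accurate",
    ]
)
```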

Production Results: From Good to Industry-Leading

Before LangWatch:

  • Manual QA processes suitable for initial scale

  • Reactive issue identification through client feedback

  • Standard testing approaches for traditional software

After adopting LangWatch Scenarios:

  • A daily script runs critical scenario tests automatically.

  • Each scenario simulates a realistic user conversation.

  • Judges evaluate responses based on Roojoom’s criteria.

  • Results are compared to a version-controlled baseline.

  • Regressions are flagged early, before they reach production.

“Now we have industry-leading quality baselines. When we deploy new versions, we instantly validate performance.”
— Amit

This shift gave Roojoom the ability to run offline evaluations continuously, gaining the kind of feedback loops that traditional software teams enjoy with unit tests, but tailored to the fuzziness and flexibility of LLM-based systems.

Roojoom integrated LangWatch into every layer of their development and QA process.

CI/CD Testing for Every Prompt

Whenever a developer opens a pull request that touches agent logic or prompt templates, LangWatch Scenarios run in CI through the open-source scenario library, so agent tests are treated just like unit tests (a sketch of such a test follows the list below).

  • If a scenario fails, the PR is blocked.

  • If a regression is detected, the baseline diff is shown.

  • Engineers get visual feedback showing which criteria failed.
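
A sketch of what such a CI gate can look like when scenarios are written as ordinary pytest tests: the file name, marker, and agent stub are hypothetical, and pytest-asyncio is assumed for the async test. The point is that a failing assertion fails the job, which in turn blocks the pull request.

```python
# tests/test_prompt_regressions.py (hypothetical): runs on every PR touching prompts or agent logic.
import pytest
import scenario

scenario.configure(default_model="openai/gpt-4.1-mini")


class PromptedAgent(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        return "Sure, I can help with that refund."  # stand-in for the real prompted agent


@pytest.mark.agent_test   # marker used to select agent tests in the CI job
@pytest.mark.asyncio      # requires the pytest-asyncio plugin
async def test_refund_flow_does_not_regress() -> None:
    result = await scenario.run(
        name="refund flow",
        description="A customer asks for a refund on last week's order.",
        agents=[
            PromptedAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=["The agent explains the refund steps correctly"]),
        ],
    )
    # A failing assertion fails the pytest run, and the CI job then blocks the PR.
    assert result.success
```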

“We want every prompt change to trigger a LangWatch test in CI, just like a unit test. That’s how we code now.”
— Amit

This integration enables faster development cycles with enterprise-grade quality assurance.

Key benefits of using LangWatch to test AI agents 

  • Structured evaluation – Clear, repeatable tests for complex agent workflows

  • CI integration – Automatic testing on every code or prompt change

  • Observability – Full visibility into agent decisions, tool calls, and conversations

  • Regression detection – Instant alerts when a model or prompt performs worse

  • Cross-team collaboration – Product, QA, and engineering aligned around shared criteria

  • Faster iterations – Developers can ship safely without breaking core flows

Ready to try LangWatch Scenarios to test your AI agents?

"LangWatch is our go-to tool for working with multi-step agents. For anyone moving from traditional ML into LLMs, it’s a very natural fit."
Amit, Roojoom

Roojoom’s journey with LangWatch shows how simulation testing, intelligent evaluation, and automation can dramatically improve LLM product reliability without slowing down innovation. By combining open-source testing with a powerful visual platform, LangWatch helps AI-native companies and enterprises building AI agents go from reactive bug-fixing to proactive quality assurance for every AI release.

Whether you're validating agents, testing prompts, or automating LLM evaluations, LangWatch lets you move faster with confidence.

👉 Try LangWatch Scenarios today or Book a demo to get started.
