Customer Story: How Roojoom automates AI Agent Quality Control with LangWatch Scenario

Manouk Draisma
Jun 25, 2026
Customer: Roojoom
Interviewee: Amit, Leading AI at Roojoom
Use case: Automating offline evaluations, testing prompt regressions, multi-step agent workflows, getting enterprise ready
LangWatch features in use: Scenario Testing, Agent Simulations, Observability, Judge integration, Custom success/failure logic
Background
Roojoom delivers advanced AI-powered customer journey orchestration, powering complex agent-driven workflows across industries. As these workflows scaled, so did the challenge: how to continuously test and validate AI behavior at scale — before it impacts real users.
As part of its expanding AI innovation portfolio, Roojoom recently launched PickMyCall.ai — an AI-powered voice assistant and productivity platform designed to help small businesses grow. Leveraging Roojoom's proven expertise in orchestrating conversations and AI agents, PickMyCall.ai brings enterprise-level voice automation to everyday business tasks.
The scaling challenge: As Roojoom's client base grew and AI workflows became more sophisticated, the team recognized the need to evolve from traditional QA approaches to enterprise-grade, automated testing infrastructure. With complex multi-step agents serving critical business functions, ensuring consistent quality at scale became paramount.
When Amit joined as Head of AI, he brought extensive experience from traditional ML environments and immediately saw an opportunity to implement best-in-class evaluation frameworks that would position Roojoom as an industry leader in AI reliability.
"When I saw LangWatch for the first time, it reminded me of how we used to evaluate models in classic machine learning.I knew this was exactly what we needed to maintain our high standards at enterprise scale"
— Amit, Roojoom

From Traditional QA to structured evaluations
The opportunity: Roojoom's rapid growth meant scaling beyond manual testing approaches that worked well for smaller deployments. The team needed enterprise-grade automation that could match their sophisticated AI capabilities.
"We were successful with traditional QA methods, but as we scaled to serve larger enterprise clients, we needed testing infrastructure that could match the complexity and reliability expectations of our AI solutions."
— Amit
Using LangWatch Scenarios, Amit and the team built a daily automation script (sketched below) to:
Evaluate prompts and workflows every morning
Use LLM judges to determine pass/fail outcomes
Compare results to a version-controlled baseline
Detect regressions early before deploying changes
"Now we have enterprise-grade baselines. When we deploy new versions, we instantly validate performance against our quality standards — it's given us tremendous confidence in our releases.”
— Amit
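
For illustration, a morning regression run along these lines might look like the sketch below. It assumes the open-source Scenario Python SDK; the agent adapter, the scenario list, the model name, and the baseline.json file are hypothetical placeholders, and exact class names may differ from the SDK version you run.

```python
# daily_scenario_check.py: hedged sketch of a morning regression run.
# Assumes the open-source Scenario Python SDK; the adapter, scenario
# definitions, model name, and baseline.json are illustrative placeholders.
import asyncio
import json
import sys

import scenario

scenario.configure(default_model="openai/gpt-4.1-mini")  # example model name


class RoojoomAgentAdapter(scenario.AgentAdapter):
    """Thin wrapper so the user simulator can call the agent turn by turn."""

    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # In production this would invoke the real agent; stubbed for the sketch.
        return "Thanks for reaching out. I'll take care of that for you."


# Hypothetical set of critical workflows to re-check every morning.
SCENARIOS = {
    "missed_call_followup": "A small-business owner asks the assistant to follow up on a missed call.",
    "billing_question": "A customer asks why last month's invoice was higher than usual.",
}


async def run_all() -> dict[str, bool]:
    results: dict[str, bool] = {}
    for name, description in SCENARIOS.items():
        result = await scenario.run(
            name=name,
            description=description,
            agents=[
                RoojoomAgentAdapter(),
                scenario.UserSimulatorAgent(),
                scenario.JudgeAgent(
                    criteria=["The agent resolves the request politely and accurately."]
                ),
            ],
        )
        results[name] = result.success
    return results


if __name__ == "__main__":
    with open("baseline.json") as f:  # version-controlled expected outcomes
        baseline: dict[str, bool] = json.load(f)
    today = asyncio.run(run_all())
    regressions = [name for name, ok in baseline.items() if ok and not today.get(name, False)]
    if regressions:
        print(f"Regressions detected: {regressions}")
        sys.exit(1)  # non-zero exit flags the regression before anything ships
    print("All scenarios match the baseline.")
```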
Why LangWatch to evaluate your AI Agents?
Roojoom's engineering team evaluated multiple solutions, seeking enterprise-grade LLM observability, structured evaluations, and explainability that could match their background in production ML systems.
LangWatch offered Roojoom a fundamentally different approach to testing LLM applications.
LangWatch Scenarios simulate real interactions between a user and an agent. They go beyond simple prompt-response testing to validate complete workflows, across multiple turns, tools, and states. Each scenario includes a user simulator, the agent-under-test, and a judge agent that determines whether the outcome was a success or failure based on business rules and soft criteria.
Roojoom’s agents aren’t just static chatbots: they make decisions, call APIs, perform logic, and adapt based on previous turns. For this, they needed more than string-matching assertions. They needed intelligent, reproducible, criteria-based simulations, and LangWatch delivered exactly that.
“LangWatch gives you flexibility and visibility. It helps you understand exactly how your models behave in production scenarios, especially with sophisticated multi-step agents.”
— Amit
What is LangWatch Scenario, exactly?

At its core, a LangWatch Scenario is an agent-based simulation test.
Each test includes:
User Simulator Agent – Mimics a real user’s inputs and reactions.
Agent Under Test – Roojoom’s live agent code, wrapped for testing.
Judge Agent – An LLM or scripted function that evaluates whether the scenario met defined success criteria.
ScenarioExecutor – Orchestrates the test flow, tracks the state, and determines pass/fail.
LangWatch offers two styles of testing:
Scripted Scenarios: Define every message and agent response, step by step. Ideal for critical edge cases (see the sketch after this list).
Goal-Based Scenarios: Let the user simulator pursue a goal (e.g., “cancel my subscription”), and test whether the agent reaches it successfully.
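
As a rough illustration of the scripted style, the sketch below pins every turn down explicitly. It leans on the Scenario Python SDK's scripting helpers (scenario.user, scenario.agent, scenario.judge); those helper names, the adapter, and its canned reply are illustrative and may differ from the SDK version you run.

```python
# Hedged sketch of a scripted scenario: every turn is pinned down explicitly.
# Helper names (scenario.user / scenario.agent / scenario.judge) follow the
# open-source Scenario SDK but may differ in your installed version.
import asyncio

import scenario

scenario.configure(default_model="openai/gpt-4.1-mini")  # example model name


class RoojoomAgentAdapter(scenario.AgentAdapter):
    """Placeholder wrapper; in practice this calls the real agent."""

    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        return "Our refund window is 30 days, so I can't process this refund, but here are your options."


async def refund_edge_case() -> None:
    result = await scenario.run(
        name="refund outside window",
        description="A customer demands a refund for an order placed six months ago.",
        agents=[
            RoojoomAgentAdapter(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(
                criteria=["The agent explains the refund policy without promising a refund."]
            ),
        ],
        script=[
            scenario.user("I want a refund for an order I placed six months ago."),
            scenario.agent(),  # let the agent-under-test answer that exact message
            scenario.judge(),  # ask the judge to rule on this specific exchange
        ],
    )
    assert result.success


asyncio.run(refund_edge_case())
```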
Each scenario can include evaluation criteria like:
"Was the tone polite?"
"Did the agent use the correct API call?"
"Was the final answer accurate?"
These are automatically evaluated by a Judge Agent, an LLM with structured reasoning capabilities.
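
Putting those pieces together, a goal-based scenario judged against criteria like the ones above might look like this hedged sketch (Scenario Python SDK assumed; the adapter and its canned reply are placeholders, not Roojoom's real agent):

```python
# Hedged sketch: a goal-based scenario judged against Roojoom-style criteria.
# Assumes the open-source Scenario Python SDK; the adapter is a placeholder.
import asyncio

import scenario

scenario.configure(default_model="openai/gpt-4.1-mini")  # example model name


class RoojoomAgentAdapter(scenario.AgentAdapter):
    """Placeholder wrapper; in practice this calls the production agent."""

    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        return "Of course. I've scheduled the cancellation and emailed you a confirmation."


async def main() -> None:
    result = await scenario.run(
        name="cancel subscription",
        description="A customer wants to cancel their subscription before the next billing cycle.",
        agents=[
            RoojoomAgentAdapter(),
            scenario.UserSimulatorAgent(),  # plays the customer and pursues the goal
            scenario.JudgeAgent(
                criteria=[
                    "The tone is polite.",
                    "The agent uses the correct API call.",
                    "The final answer is accurate.",
                ]
            ),
        ],
    )
    assert result.success  # the judge's verdict decides pass/fail


asyncio.run(main())
```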
Production Results: From Good to Industry-Leading
Before LangWatch:
Manual QA processes suitable for initial scale
Reactive issue identification through client feedback
Standard testing approaches for traditional software
After adopting LangWatch Scenarios:
A daily script runs critical scenario tests automatically.
Each scenario simulates a realistic user conversation.
Judges evaluate responses based on Roojoom’s criteria.
Results are compared to a version-controlled baseline.
Regressions are flagged early, before they reach production.
“Now we have industry-leading quality baselines. When we deploy new versions, we instantly validate performance.”
— Amit
This shift gave Roojoom the ability to run offline evaluations continuously, gaining the kind of feedback loops that traditional software teams enjoy with unit tests, but tailored to the fuzziness and flexibility of LLM-based systems.
Roojoom integrated LangWatch into every layer of their development and QA process.
CI/CD Testing for Every Prompt
Whenever a developer opens a pull request that touches agent logic or prompt templates, LangWatch Scenarios run in CI via the open-source Scenario library, so the team can treat agent tests just like unit tests.
If a scenario fails, the PR is blocked.
If a regression is detected, the baseline diff is shown.
Engineers get visual feedback showing which criteria failed.
“We want every prompt change to trigger a LangWatch test in CI, just like a unit test. That’s how we code now.”
— Amit
This integration enables faster development cycles with enterprise-grade quality assurance.
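
As an illustration of that CI gate, a scenario can be written as an ordinary pytest test, so a failing judge verdict fails the suite and blocks the pull request. The sketch below assumes the Scenario Python SDK plus pytest-asyncio; the test, adapter, and criteria are hypothetical rather than Roojoom's actual suite.

```python
# tests/test_prompt_regressions.py: hedged sketch of a scenario run as a CI unit test.
# Assumes the open-source Scenario Python SDK and pytest-asyncio; CI simply runs
# `pytest`, so a failed judge verdict fails the job and blocks the pull request.
import pytest
import scenario

scenario.configure(default_model="openai/gpt-4.1-mini")  # example model name


class RoojoomAgentAdapter(scenario.AgentAdapter):
    """Placeholder wrapper around the agent code touched by the pull request."""

    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        return "I've booked the callback for tomorrow at 10:00."


@pytest.mark.asyncio
async def test_callback_booking_flow():
    result = await scenario.run(
        name="callback booking",
        description="A small-business customer asks the voice assistant to schedule a callback.",
        agents=[
            RoojoomAgentAdapter(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(
                criteria=[
                    "The agent confirms a concrete date and time.",
                    "The tone stays professional and concise.",
                ]
            ),
        ],
    )
    # Assert like a unit test: a failed scenario fails CI and blocks the merge.
    assert result.success
```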
Key benefits of using LangWatch to test AI agents
| Benefit | Description |
| --- | --- |
| Structured evaluation | Clear, repeatable tests for complex agent workflows |
| CI integration | Automatic testing on every code or prompt change |
| Observability | Full visibility into agent decisions, tool calls, and conversations |
| Regression detection | Instant alerts when a model or prompt performs worse |
| Cross-team collaboration | Product, QA, and engineering aligned around shared criteria |
| Faster iterations | Developers can ship safely without breaking core flows |
Ready to try LangWatch Scenarios to test your AI agents?
"LangWatch is our go-to tool for working with multi-step agents. For anyone moving from traditional ML into LLMs, it’s a very natural fit."
— Amit, Roojoom
Roojoom’s journey with LangWatch shows how simulation testing, intelligent evaluation, and automation can dramatically improve LLM product reliability without slowing down innovation. By combining open-source testing with a powerful visual platform, LangWatch helps AI-native companies and enterprises building AI agents go from reactive bug-fixing to proactive quality assurance for every AI release.
Whether you're validating agents, testing prompts, or automating LLM evaluations, LangWatch lets you move faster with confidence.
👉 Try LangWatch Scenarios today or Book a demo to get started.