Product managers and leaders are running agent simulations now, and it's changing how AI ships

Sergio Cardenas

Mar 25, 2026

How non-technical leaders use LangWatch and Claude Desktop to test AI agents before a single customer sees them

Here's something that would've sounded absurd eighteen months ago.

The CEO of a tech company — not a developer, not a data scientist — opened Claude Desktop, described what his AI agent should do when a customer support rep asks for a client risk analysis, and pushed 10+ test scenarios to LangWatch. Organized by category. Labeled by type. Ready to run against the agent his engineering team is still building.

No sprint planning. No Jira ticket. No "can someone from the team write tests for this?"

Just a CEO who knows exactly what his customers need, describing it in plain language to an AI assistant that structures it into executable simulation scenarios.

This isn't hypothetical. This is happening right now with LangWatch and Claude Desktop.

The shift nobody predicted

If you've been following the agentic AI space, you've seen the pattern: developers dabbling with coding assistants, then AI platforms that let non-technical users build agent workflows, and, in the near future, multi-agent orchestration.

But there's a transition happening that nobody's writing about:

Product managers, product owners, executives, and founders at 200+ companies are writing the use cases FOR AI agents themselves.

Not reviewing them. Not approving them. Writing them.

Why? Because the people who understand the business domain better than anyone (the PM who's talked to 200 customers, the founder who built the company on a specific domain insight) are the same people best positioned to describe what "correct agent behavior" actually looks like.

The bottleneck was never knowledge. It was tooling.

That bottleneck is gone.

Two tools, two roles, one workflow

Here's how it works in practice. No terminal required.

For the CEO and Product People: Claude Desktop + MCP

Claude Desktop is an AI workspace. You chat with Claude, and it has access to tools via MCP (Model Context Protocol) — including direct access to LangWatch.

The CEO opens Claude Desktop. Through the LangWatch MCP integration, Claude can read current traces and existing scenarios, create new ones, organize them by label, and push them to the LangWatch platform. The CEO describes what the agent should do in natural language. Claude structures it into properly formatted scenarios with evaluation criteria. Done.

Just a conversation that produces production-grade test scenarios.

The interface looks like any chat app — a conversation on the left, a structured document on the right showing the scenarios being created, organized by categories like happy-path, edge-case, data-validation, report-accuracy. The CEO reviews, adjusts, and pushes.
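What does a "properly formatted scenario" amount to? The sketch below is illustrative only; the dataclass fields and category labels mimic the shape described above, not LangWatch's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical structure for illustration only, not LangWatch's real schema.
@dataclass
class Scenario:
    name: str
    category: str              # e.g. "happy-path", "edge-case", "report-accuracy"
    user_message: str          # what the simulated user asks the agent
    criteria: list[str] = field(default_factory=list)  # plain-language pass conditions

risk_report = Scenario(
    name="client-risk-analysis",
    category="report-accuracy",
    user_message="Give me a risk analysis for client Acme for Q3.",
    criteria=[
        "Report covers the full requested period",
        "Anomalies are flagged with timestamps",
        "No invented data points",
    ],
)

def by_category(scenarios):
    """Group scenarios under their category label, as in the LangWatch UI."""
    grouped = {}
    for s in scenarios:
        grouped.setdefault(s.category, []).append(s)
    return grouped
```

The point is that the artifact produced by the conversation is ordinary structured data: something a platform can store, label, and run, and something a non-engineer can read back and correct.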

In Practice: How Agranimo's CEO tests the agent before customers see it

Let's make this concrete.

Agranimo is a B2B company that processes large volumes of sensor and satellite data for its clients. They're building an internal AI agent that automates one of their most time-consuming processes: collecting thousands of data points from IoT sensors, cross-referencing them against operational events, and generating structured risk analysis reports for their customer support team.

The specifics are Agranimo's, but the shape of the problem is universal. Any business that processes high-volume data and delivers analytical reports to clients (finance, healthcare, logistics, legal) faces the same challenge.

The problem every data-heavy business recognizes

Here's what this report requires without the agent:

  • 17,500+ data points per client, per period — collected around the clock from IoT sensors

  • 50+ operational events — each with timestamps, durations, and contextual metadata

  • Manual cross-referencing of every data point against every event to identify anomalies, risks, and patterns

Someone on the team would manually pull this data, load it into Excel, identify patterns from raw readings, calculate derived metrics, flag risk periods, and compile everything into a report the client can act on.
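The cross-referencing step itself is mechanical, which is why it automates well. A minimal sketch of that logic, where the field names and the glitch threshold are assumptions for illustration:

```python
from datetime import datetime, timedelta

# Illustrative sketch of the cross-referencing the agent automates.
# Field names and the 40-minute glitch window are assumptions, not Agranimo's spec.
def flag_risks(readings, events, glitch_window=timedelta(minutes=40)):
    """Match anomalous readings to operational events; short anomalies with
    no matching event are flagged as possible sensor glitches instead."""
    flags = []
    for r in readings:
        if not r["anomalous"]:
            continue
        matched = [
            e for e in events
            if e["start"] <= r["time"] <= e["start"] + e["duration"]
        ]
        if matched:
            flags.append({"time": r["time"], "kind": "risk", "events": matched})
        elif r.get("span", timedelta(0)) <= glitch_window:
            flags.append({"time": r["time"], "kind": "possible-glitch", "events": []})
    return flags
```

Distinguishing a real risk period from a sensor glitch is exactly the kind of rule a domain expert states in one sentence but an engineer would never guess from the raw data.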

But can you trust the agent?

This is where it gets interesting. The agent isn't customer-facing yet. It's internal. The CEO and the Customer Support Manager are the ones using it.

The question isn't "does the code run?" The question is: "Would I send this to my most important client?"

That's not a question an engineer can answer. That's a question for the person who's been on calls with the client. Who knows which metrics that client actually cares about. Who knows that a 40-minute data anomaly looks suspicious and might be a sensor glitch, not a real event. Who knows what "good enough to send" actually means for this specific account.

So the CEO opens Claude Desktop, connects to LangWatch via MCP, and starts writing scenarios.

The real unlock: hours saved

Here's the part that makes this story different from the usual "AI testing" narrative.

Agranimo's agent hasn't launched yet. It's internal. The CEO and CS Manager are the primary users. And they're already saving hours.

Before the agent: someone manually processes thousands of data points per client, cross-references dozens of operational events, and builds the analysis in Excel. Hours per report. Days per client review cycle.

With the agent: the data collection and cross-referencing is automated. The team reviews the output and fills in the interpretation — the part that requires human judgment about what matters to each specific client.

With LangWatch Scenarios: the CEO / Product Manager / Support Manager encodes that human judgment into repeatable tests. "This is what a good report looks like." "This is what the agent should flag." "This is what should raise a red flag." Now the agent improves against the standard the business leader set, not just against generic metrics.

The agent saves hours on data processing. The scenarios ensure those hours aren't wasted on bad output.

This pattern applies everywhere AI agents produce client-facing analysis — financial services, healthcare, logistics, legal. The domain expert who knows what "good" looks like is rarely the engineer who builds the pipeline.
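Encoding that judgment can start as simply as a set of named checks on the report text. The rules below are invented examples for illustration, not LangWatch evaluators:

```python
# Illustrative only: a domain expert's "good enough to send" judgment
# expressed as repeatable, named checks on the agent's report text.
RULES = {
    "has risk section": lambda r: "risk" in r.lower(),
    "flags anomalies": lambda r: "anomal" in r.lower(),
    "no placeholder text": lambda r: "TODO" not in r,
}

def review(report: str) -> dict:
    """Run every rule against a report and return {rule name: passed}."""
    return {name: check(report) for name, check in RULES.items()}
```

Each rule is one sentence of business judgment made repeatable: once it exists, every new version of the agent is measured against the same bar, automatically.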

The second use-case: Fintech product owner testing multi-agent systems

Now let’s look at a completely different environment: a large fintech company.

Here, the complexity isn’t just data; it’s systems talking to systems.

A Product Owner (PO) is responsible for an agentic system that:

  • Calls multiple internal tools (KYC checks, transaction history, fraud scoring APIs)

  • Orchestrates multi-step workflows

  • Handles multi-turn conversations with internal analysts

  • Produces decisions or recommendations that may impact compliance and financial risk

This isn’t a single prompt. It’s a multi-agent, multi-tool pipeline.

If something goes wrong, it’s not just a bad answer.

It’s:

  • A compliance violation

  • A missed fraud signal

  • A wrong client classification

  • Or worse — a decision that gets audited later

And here’s the key:

The PO understands these risks better than anyone. Not the engineers.

What the product owner actually tests

Before LangWatch, testing this system meant:

  • Waiting for engineering to write test cases

  • Reviewing logs after something broke

  • Or manually trying flows in staging

Now?

The PO opens Claude and writes:

“A user asks: ‘Can I onboard this client?’ The agent must:

  • Call KYC tool

  • Check transaction history

  • Run fraud scoring

  • Ask follow-up if data is incomplete

  • Return a clear decision with justification”
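The checklist above can become a mechanical assertion over the agent's trace. This is a simplified sketch; a real trace carries far more structure, and the tool names here are hypothetical:

```python
# Sketch: verify the agent called every required tool, in order.
# Tool names and the flat trace-of-strings format are assumptions.
REQUIRED_ORDER = ["kyc_check", "transaction_history", "fraud_scoring"]

def tools_called_in_order(trace, required=REQUIRED_ORDER):
    """True if every required tool appears in the trace, in the given order
    (other calls may be interleaved between them)."""
    it = iter(trace)
    return all(tool in it for tool in required)
```

The iterator trick makes this a subsequence check: extra tool calls in between are fine, but skipping or reordering a required step fails the scenario.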

Edge case scenario

“KYC returns partial data, fraud score is high, transaction history missing
→ Agent must NOT approve
→ Must request additional documents
→ Must flag as high-risk internally”

Multi-turn scenario

“User challenges decision (‘but this is a trusted client’)
→ Agent should re-evaluate
→ But not override compliance logic
→ Should explain reasoning clearly without leaking internal scoring thresholds”

Tool failure scenario

“Fraud API times out
→ Agent must retry once
→ Then fallback
→ Then clearly communicate uncertainty”
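The retry-then-fallback policy is easy to pin down in code. A minimal sketch, with the fraud API call passed in as a stand-in parameter:

```python
# Sketch of the policy above: one retry on timeout, then a fallback answer
# that states the uncertainty explicitly. The API call is a stand-in.
def score_with_fallback(call_fraud_api, retries=1):
    for _attempt in range(retries + 1):
        try:
            return {"score": call_fraud_api(), "certain": True}
        except TimeoutError:
            continue
    return {
        "score": None,
        "certain": False,
        "note": "Fraud scoring unavailable; decision deferred pending manual review.",
    }
```

Notice that the PO's scenario pins down behavior, not implementation: any agent that retries once and then communicates uncertainty passes, however it's built.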

These aren’t unit tests. These are real-world simulations of how the system behaves under pressure.

LangWatch is open source with 5.6k+ GitHub stars and processes 900k+ daily evaluations. Get started at langwatch.ai or explore the Skills directory.

Ship agents with confidence, not crossed fingers

Get up and running with LangWatch in as little as 5 minutes.
