<- Back to overview

Launch Week Day 5: Better Agents CLI: The reliability layer for the next wave of agent development

Rogerio Chaves

Nov 26, 2025

Building AI agents was supposed to feel like progress, the next stage after chatbots, RAG, and copilots. Instead, for many teams, it has felt like starting over.

Today, more than 95% of enterprise agent projects fail to reach production due to a lack of reliability, evaluation discipline, and trust. Most never make it past the prototype phase.

The promise is powerful, but without AI agent testing, evaluations, and rigor, outcomes vary wildly. One execution succeeds, the next spirals into hallucinations, broken workflows, or infinite tool loops.

The industry doesn’t need more agent frameworks.

We need better agents.

Over the past two years, LangWatch has worked with AI engineering teams building some of the earliest real-world agents in fintech, retail, logistics, and security. We’ve seen what works, what fails, and where the discipline must evolve.

Today, we’re bringing those worlds together.

Introducing Better Agents CLI by LangWatch

Better Agents is a CLI toolkit and emerging standard for building reliable, testable, production-grade agents, independent of which framework you use.

Use your preferred stack, Agno, Mastra, Vercel AI, Google ADK, or anything else.

Use your preferred coding assistant.

Use your preferred development workflow.

Better Agents doesn’t replace your stack, it stabilizes it.

It is the reliability layer:

Agent structure guidance
Scenario testing
Automated evaluations
Prompt versioning
Observability
Best practices encoded into tooling

Better Agents becomes the AI systems engineer embedded in your repo, discovering MCP tools, wiring capabilities, and aligning implementation to best practices automatically.

No more guessing.

No more brittle demos held together by optimism and duct tape.

Why we built Better Agents

Early agent systems behave like software in the early web PHP era, duct taped together, manually tested, difficult to debug, and pushing to prod with only hopes for it to work.

To move agents into the professional era, we need:

Agent testing frameworks
Evaluation workflows
Versioned prompts
Observability and instrumentation
Structured approach to quality

For AI agents to enter production, reliability can’t be an afterthought, it must be the foundation.

Better Agents makes reliability default, not optional.

The Better Agents structure

Every project generated via Better Agents CLI follows a tested, scalable, maintainable layout:

Every directory has a purpose.

`scenarios/` — The core of Real Agent reliability

These aren’t unit tests.

They are conversational test cases that simulate real tasks and validate agent behavior across iterations, updates, or model swaps.

They answer the most important question in AI engineering:

Does the agent still behave the way we expect?

With scenario testing, teams avoid regressions, undefined behavior, and silent capability drift.

`evaluations/` — Structured benchmarking for Logic, retrieval, and Reasoning

Evaluations allow teams to quantitatively test components like:

RAG correctness
Retrieval F1 score
Classification accuracy
Routing accuracy

This makes AI development feel less like experimentation and more like engineering.

`prompts/` — Versioned, reviewable, documented

Prompts are no longer ad-hoc artifacts.

They become tracked, shared, and collaboratively improved—like real software, controlled by a prompts-lock.json file to be versioned along with your codebase while also synced to the platform playground for greater collaboration.