Launch Week Day 5: Better Agents CLI: The reliability layer for the next wave of agent development

Rogerio Chaves

Nov 26, 2025

Building AI agents was supposed to feel like progress, the next stage after chatbots, RAG, and copilots. Instead, for many teams, it has felt like starting over.

Today, more than 95% of enterprise agent projects fail to reach production due to a lack of reliability, evaluation discipline, and trust. Most never make it past the prototype phase.

The promise is powerful, but without AI agent testing, evaluations, and rigor, outcomes vary wildly. One execution succeeds, the next spirals into hallucinations, broken workflows, or infinite tool loops.

The industry doesn’t need more agent frameworks.

We need better agents.

Over the past two years, LangWatch has worked with AI engineering teams building some of the earliest real-world agents in fintech, retail, logistics, and security. We’ve seen what works, what fails, and where the discipline must evolve.

Today, we’re bringing those worlds together.

Introducing Better Agents CLI by LangWatch

Better Agents is a CLI toolkit and emerging standard for building reliable, testable, production-grade agents, independent of which framework you use.

Use your preferred stack, Agno, Mastra, Vercel AI, Google ADK, or anything else.

Use your preferred coding assistant.

Use your preferred development workflow.

Better Agents doesn’t replace your stack, it stabilizes it.

It is the reliability layer:

  • Agent structure guidance

  • Scenario testing

  • Automated evaluations

  • Prompt versioning

  • Observability

  • Best practices encoded into tooling

Better Agents becomes the AI systems engineer embedded in your repo, discovering MCP tools, wiring capabilities, and aligning implementation to best practices automatically.

No more guessing.

No more brittle demos held together by optimism and duct tape.

Why we built Better Agents

Early agent systems behave like software in the early web PHP era, duct taped together, manually tested, difficult to debug, and pushing to prod with only hopes for it to work.

To move agents into the professional era, we need:

  • Agent testing frameworks

  • Evaluation workflows

  • Versioned prompts

  • Observability and instrumentation

  • Structured approach to quality

For AI agents to enter production, reliability can’t be an afterthought, it must be the foundation.

Better Agents makes reliability default, not optional.

The Better Agents structure

Every project generated via Better Agents CLI follows a tested, scalable, maintainable layout:


Every directory has a purpose.

scenarios/ — The core of Real Agent reliability

These aren’t unit tests.

They are conversational test cases that simulate real tasks and validate agent behavior across iterations, updates, or model swaps.

They answer the most important question in AI engineering:

Does the agent still behave the way we expect?

With scenario testing, teams avoid regressions, undefined behavior, and silent capability drift.

evaluations/ — Structured benchmarking for Logic, retrieval, and Reasoning

Evaluations allow teams to quantitatively test components like:

  • RAG correctness

  • Retrieval F1 score

  • Classification accuracy

  • Routing accuracy

This makes AI development feel less like experimentation and more like engineering.

prompts/ — Versioned, reviewable, documented

Prompts are no longer ad-hoc artifacts.

They become tracked, shared, and collaboratively improved—like real software, controlled by a prompts-lock.json file to be versioned along with your codebase while also synced to the platform playground for greater collaboration.

Why this matters

Agent development is entering the same maturity curve we saw with CI/CD, Infrastructure-as-Code, and automated testing.

Organizations that adopt evaluation-driven agent development will ship faster, and safely.

Organizations that rely on experimentation, intuition, and trial-and-error will stall.

The difference isn't talent or budget.

It's discipline.

Get Started

The next generation of agent systems won’t be defined by the flashiest demo, they’ll be defined by consistency, reliability, and trust.

Better Agents is now available:

👉 https://github.com/langwatch/better-agents

Generate new agent projects following best practices from day one, built for reliability, maintainability, and repeatability.

Agents aren't just code.

They are systems.

Let’s build them like it.

Ship agents with confidence, not crossed fingers

Get up and running with LangWatch in as little as 5 minutes.

Ship agents with confidence, not crossed fingers

Get up and running with LangWatch in as little as 5 minutes.

Ship agents with confidence, not crossed fingers

Get up and running with LangWatch in as little as 5 minutes.