Better Agents is a CLI tool and a set of standards for developing agents that we’ve come to value (GitHub). On this page you can read the manifesto that guides our philosophy.
Systematic Approach to Quality over Vibe Checking
Few Well-Thought Scenarios over Thousands of Auto-Generated ones
Business Metrics over Technical ones
Incremental Agent Improvements over Premature AGI
That is, while there is value in the items on the right, we value the items on the left more.
Systematic Approach to Quality over Vibe Checking
Agent development in the industry today is still largely based on manual testing: sending messages after every change to the agent and eyeballing the replies for a subjective feeling of quality, to decide whether it is better than the previous iteration.
Agents are highly non-deterministic systems where traditional software testing practices don’t fit, but that doesn’t mean their quality can’t be controlled and improved consistently. We believe a combination of simulation tests, evaluations, monitoring, and defined iteration processes for your team is essential to consistently iterate and create better agents.
An important part of the manifesto, however, is the last line: there is value in the items on the right.
We believe “Vibe Checking” is still an essential part of the process, because it captures human expertise and intuition about what is off. It becomes part of the systematic approach when its findings feed back into the simulations and evaluations as new insights for continuous improvement.
Few Well-Thought Scenarios over Thousands of Auto-Generated ones
When dealing with AI, where everything is easy to generate, one may be tempted to have it auto-generate thousands of simulations for your agent and assess their quality entirely on its own. We believe, however, that defining what is important for your business, the scenarios and the metrics, is exactly where the whole value lies.
More importantly, a few well-thought-out and well-cared-for scenarios enable oversight by developers, domain experts, and the business: reading through a humanly feasible number of key simulated conversations that test the agent’s critical edges creates real trust in the agent and its progress.
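As an illustration, a small hand-written scenario suite could look like the sketch below. The `refund_agent`, the scenario names, and the pass criteria are hypothetical stand-ins for this example, not part of the Better Agents CLI; in practice the agent would be backed by an LLM and the checks written by domain experts.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    user_message: str
    passes: Callable[[str], bool]  # expert-written check on the agent's reply

def refund_agent(message: str) -> str:
    # Stand-in for a real LLM-backed agent, so the sketch is runnable.
    if "refund" in message.lower():
        return "I can help with that refund. Could you share your order number?"
    return "Sorry, I can only help with refund questions."

# A handful of scenarios, each short enough for a human to read and challenge.
SCENARIOS = [
    Scenario(
        name="asks for order number before refunding",
        user_message="I want a refund for my broken blender",
        passes=lambda reply: "order number" in reply.lower(),
    ),
    Scenario(
        name="declines out-of-scope requests",
        user_message="Write me a poem about blenders",
        passes=lambda reply: "refund" in reply.lower(),
    ),
]

def run_suite(agent: Callable[[str], str]) -> dict[str, bool]:
    # Run every scenario and report pass/fail per scenario name.
    return {s.name: s.passes(agent(s.user_message)) for s in SCENARIOS}
```

Because the whole suite fits on one screen, developers, domain experts, and business stakeholders can all review exactly what the agent is being held to.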
In contrast, millions of auto-generated simulated conversations will never be read. Having another AI tell you what good quality means for your agent and then evaluate itself against that definition doesn’t actually build trust.
Then again, for the last line of the manifesto, there is value in the items on the right, as automatic explorations are extremely valuable for finding new cases for your test suite, uncovering dozens of edge cases and finding problems at scale.
Business Metrics over Technical ones
The value of agents over workflows is precisely having the flexibility to handle never-seen-before situations. Simulations and evaluations help prevent regressions, but the value the agent is bringing in real life should be the main guiding principle for agent quality.
We found that in many situations value and ROI can be hard to define for agent projects due to the cognitive nature of the work they are automating, but the teams that focus on it regardless deliver better agents. Proxy metrics such as escalation rates or user acceptance rate of suggestions can already bring enormous value.
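As a sketch of how such proxy metrics can be computed directly from conversation logs: the log fields below (`escalated`, `suggestion_accepted`) are illustrative assumptions for this example, not a LangWatch schema.

```python
# Hypothetical conversation log entries; in practice these would come
# from your monitoring pipeline.
conversations = [
    {"escalated": False, "suggestion_accepted": True},
    {"escalated": True,  "suggestion_accepted": False},
    {"escalated": False, "suggestion_accepted": True},
    {"escalated": False, "suggestion_accepted": False},
]

# Share of conversations handed off to a human.
escalation_rate = sum(c["escalated"] for c in conversations) / len(conversations)

# Share of agent suggestions the user accepted.
acceptance_rate = sum(c["suggestion_accepted"] for c in conversations) / len(conversations)
```

Tracking these two numbers over agent iterations gives a business-facing signal of progress even when a full ROI figure is hard to pin down.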
Of course, technical metrics are still enormously important as the foundation of your agent’s correct functioning, and they can sometimes contribute directly to business performance, such as latency in a customer support voice agent.
Incremental Agent Improvements over Premature AGI
We found that teams that start simple and expand their agent with incremental changes have much higher success rates than teams that start with an already very capable multi-agent system and try to refine it afterward.
Starting with a single LLM plus one tool, launching, proving it does one small job very well, and expanding from there is actually faster than starting with an orchestrator agent, 15 subagents, 42 tools, voice, and UI generation, and then trying to rein it all back under control. We call this “Premature AGI”.
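A minimal sketch of that “single LLM + tool” starting point: `call_model` below is a stand-in for a real LLM client, and `lookup_order_status` is a single hypothetical tool; both are assumptions for illustration, not a prescribed API.

```python
def lookup_order_status(order_id: str) -> str:
    # The one tool: a hypothetical order-status lookup.
    orders = {"A123": "shipped", "B456": "processing"}
    return orders.get(order_id, "unknown")

def call_model(message: str) -> dict:
    # Stand-in for an LLM deciding whether to call the tool.
    # A real implementation would use a model provider's tool-calling API.
    for token in message.split():
        order_id = token.strip("?.!,")
        if order_id in ("A123", "B456"):
            return {"tool": "lookup_order_status", "arg": order_id}
    return {"reply": "Please share your order id."}

def agent(message: str) -> str:
    # One model call, at most one tool call: small enough to fully test.
    decision = call_model(message)
    if "tool" in decision:
        status = lookup_order_status(decision["arg"])
        return f"Your order is {status}."
    return decision["reply"]
```

An agent this small can be exhaustively covered by a handful of scenarios, which is what makes each subsequent increment safe to ship.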
Again, to balance it out, bigger steps can sometimes pragmatically lead to much better results and uncover new directions to explore in agent development, so occasional experiments with hyped new approaches should also be part of the systematic approach to quality.
The LangWatch Team
Alex Forbes-Reed
Andrew Garde Joia
Aryan Sharma
Manouk Draisma
Richard Huth
Rogerio Chaves
Tariq Valente

