Are evals dead?

Rogerio Chaves

Sep 7, 2025

tl;dr: https://en.wikipedia.org/wiki/Betteridge's_law_of_headlines

This week, the trending topic in the AI community is “Evals are dead”. It seems that after killing RAG, evals are being murdered too, with very passionate comments such as Helicone’s CEO saying “f*** evals!”[1] and AgentOps’ CEO saying “Evals are a scam”[2].

Of course, in this attention economy you need to make bold claims to avoid getting lost in the noise, so I don’t blame them. My goal with this post is to understand and organize the common claims being made and the reasoning behind them, add to the conversation, and bring some nuance of my own. Let me start with the least controversial one:

Claim #1: Ready-to-Use Evals are Useless

I remember Hamel[3] as the first to boldly claim that ready-to-use evals are useless, and I fully agree. These are metrics like “helpfulness”, “coherence”, “quality”, etc., which require no effort on your part and promise to just give you a number to “measure your AI quality”.

That doesn’t really add value, and pretty much everyone either agrees with it already or will find out quickly by themselves. Not only because these metrics are generic and uninterpretable, but because even if they really did measure “helpfulness”, the big AI labs like OpenAI, Anthropic and Google have already optimized the hell out of our LLMs to make them as “helpful” and “coherent” as possible.

So this is not a very controversial claim at this point. There is an overlap with some of the other arguments, though: some “Evals are dead” claims say the same is true for basically every other metric, as in “the labs will take care of it anyway, so you don’t need to bother”. Now that is simply not true, unless 1. you are not adding any value on top of what the labs are doing and OpenAI will swallow your startup any time now, or 2. you expect the labs to own every domain ever.

The real takeaway here, then, is that you need to take care of your own evals, which is true: there is no free lunch. The knowledge of what quality means in your domain is exactly the value your own AI brings over plain ChatGPT. But let’s expand on that in the next (much spicier) claim.

Claim #2: All evals are useless, top applications don’t do it

Now this is a very, very interesting discussion; I love it. What if we just keep going with zero evals? I love all the thought experiments it brings about the best ways to manage and grow an AI project. I love it especially because I was deep into an equivalent discussion before, back in 2020, having experienced going from 100% test coverage, TDD all the way, in a very successful company, to zero tests, YOLO-push-to-prod, in an also very successful company[4].

First of all, there is some lack of nuance there. Some argue that even the people claiming “all evals are useless” are still evaluating one way or another, because they are testing their AI by hand anyway, which is just a form of manual evaluation. Or they let their users do it, which means looking at production metrics instead.

I don’t want to make the petty argument that “manual evaluation is evaluation too”; I really want to argue about “should we write down evals” vs “should we not”.

Then another big nuance is missing in this space. The main example used to support the claim that “all evals are useless” is the fact that Claude Code and many other coding agent products have no evals.

Now, notice the emphasis on coding agents. Well, not everyone is building a coding agent, not every agent operates on similar principles, and there are a few main reasons why coding agents are succeeding without evals:

  1. Code, along with Math, is perhaps the only domain that can be optimized via verifiable RL, so labs can really push the boundaries and make LLMs awesome at it; this is not true for almost any other domain.

  2. These are general coding agents, for any codebase. If you were building even just a prompt that migrates code that only exists in your corporate microservices, you’d probably want at least a couple of evals for that, as it’s very specific to you (see the sketch after this list).

  3. There is a human in the loop! Big time. Humans are doing the evaluations: devs are very vocal, create leaderboards, and promptly give feedback to anyone who is listening. The “manual eval” is really powerful and plentiful here.

  4. It is not risky, the model makes mistakes, you correct it, it tells you “you are absolutely right”, you laugh a bit, nobody gets hurt.
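To make point 2 concrete, here is a minimal sketch of what “a couple of evals” for that hypothetical migration prompt could look like. Everything here is invented for illustration: the `migrate_snippet` placeholder, the `LegacyHttpClient` and `BillingServiceClient` names, and the specific rules. The substance is that the checks encode knowledge only your team has.

```python
# Hypothetical eval cases for a prompt that migrates code off an internal
# legacy HTTP client; the API names below are made up for this sketch.
MIGRATION_CASES = [
    # (legacy snippet, substring the migrated code must contain, legacy API that must be gone)
    (
        "client = LegacyHttpClient(base_url=BILLING_URL)",
        "BillingServiceClient",
        "LegacyHttpClient",
    ),
    (
        "resp = client.get('/v1/invoices')",
        "list_invoices(",
        "/v1/invoices",
    ),
]


def migrate_snippet(snippet: str) -> str:
    """Placeholder for the LLM call that performs the migration."""
    raise NotImplementedError("call your migration prompt here")


def test_migration_prompt():
    # Runs with pytest; fails loudly if the migrated output keeps the legacy API
    # or misses the new one.
    for snippet, must_have, must_not_have in MIGRATION_CASES:
        migrated = migrate_snippet(snippet)
        assert must_have in migrated, f"expected {must_have!r} in output for: {snippet!r}"
        assert must_not_have not in migrated, f"legacy API {must_not_have!r} still present"
```

Even two or three cases like these catch regressions the moment you swap models or tweak the prompt, precisely because no lab will ever ship an eval for your internal billing client.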

The main point here is not to dismiss “not having evals”, but rather to see for yourself whether you are in a similar position. If your traffic is plentiful and mistakes are not costly, that may indeed be a great approach; perhaps you focus more on A/B testing, as some have suggested, and keep evals more as smoke tests or similar. If you are not in this situation, then maybe you do have to shift left[5] more.

At the risk of being redundant, the key word here is nuance: different problems require different testing practices, and classifying the problem you are in helps much more than searching for absolutes. One slide I’ve been using recently, although very simple, has been very helpful in getting folks to recognize which class of problem they are dealing with.

The end goal is to have a systematic approach to quality that speeds you up. For most teams this will look like a set of unit/smoke tests, offline evals and some end-to-end scenarios[6], then monitoring with realtime evals, while still having human oversight and annotations. It’s not one or the other, but a set of tools on your toolbelt that you apply in different quantities depending on the task at hand; even “not having evals and relying on users” can be a tool on your toolbelt.
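As a rough illustration of that layering, here is a sketch under some assumptions: the `answer_question` pipeline, the example cases and the 10% sampling rate are all hypothetical, and a real setup would send results to your monitoring and annotation queues instead of printing them. The point is that the same checks can serve as an offline smoke suite in CI and as realtime evals over sampled production traffic, with humans reviewing whatever gets flagged.

```python
import random

# Tiny, hypothetical eval set; in practice this grows out of error analysis.
CASES = [
    {"question": "What is your refund window?", "must_mention": "30 days"},
    {"question": "Do you support SSO?", "must_mention": "SAML"},
]


def answer_question(question: str) -> str:
    """Placeholder for the agent or pipeline under test."""
    raise NotImplementedError


def passes(case: dict, answer: str) -> bool:
    # Deterministic check; an LLM-as-judge could slot in here for fuzzier criteria.
    return case["must_mention"].lower() in answer.lower()


def test_smoke_suite():
    # Offline layer: runs in CI on every change, fails the build on regressions.
    failures = [c for c in CASES if not passes(c, answer_question(c["question"]))]
    assert not failures, f"{len(failures)} eval(s) failed: {failures}"


def score_in_production(question: str, answer: str, sample_rate: float = 0.1) -> None:
    # Realtime layer: score a sample of live traffic and log the result for
    # monitoring and human annotation, without blocking the response.
    if random.random() >= sample_rate:
        return
    for case in CASES:
        if case["question"] == question:
            print({"question": question, "passed": passes(case, answer)})
```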

Like I said at the beginning, I do like these discussions, including the heat of them. They push people to debate the subject deeply and think differently, and I think that’s healthy: it forces us to be pragmatic. And to be fair, it was not as heated as I imagined it would be; there were a lot of very chill rebuttals, some support, and a nice discussion overall.

Claim #3: You can’t outsource your QA

This is another one I fully agree with: you can’t outsource your QA.

Part of the “Evals are dead” claims actually argue this is the case because “you are the one who needs to write your evals anyway” and “therefore tools are useless”. Now, I agree with the first part of that sentence but not the second.

Let’s untangle the first part. This is the same argument as ready-to-use evals being no good, or the AI labs not being able to know your domain expertise. Evaluating quality is hard, especially in AI land, where so much domain expertise and nuance is required. You are by far the best positioned to define the evals of your own product. They are akin to product requirements: if you don’t own them, you are not owning the core part of your business. This is what we have always believed at LangWatch.

What I can’t agree with is that tools are useless, not one bit. I do agree that building your own evaluation tool is invaluable learning, and that the process matters above all, but as the industry matures and the tools cover more and more use cases, eventually the right tool for the right job will speed you up more than not, and even guide those who are just starting to learn. That has always been the case in software.

But no matter how much the tools evolve, you will have to be the one wielding them, doing the error analysis and writing your own evals. You wouldn’t delegate your unit tests or product requirements to anybody else.
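For example, a domain-specific eval you write yourself often ends up looking like an LLM-as-judge check that encodes one of your product requirements. The sketch below assumes a made-up refund-policy rule and uses the OpenAI Python client with a placeholder model name; the rubric is the part only you can write, whatever tool ends up running it.

```python
from openai import OpenAI

client = OpenAI()

# The rubric encodes a hypothetical product requirement; this is the part
# that only your team can write, no matter which tool executes it.
RUBRIC = (
    "You are grading a customer-support answer about refunds. "
    "Reply PASS only if the answer states the 30-day refund window and does not "
    "promise cash refunds; otherwise reply FAIL. Reply with a single word."
)


def judge_refund_answer(answer: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": answer},
        ],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")


if __name__ == "__main__":
    print(judge_refund_answer(
        "You can request a refund within 30 days and we'll credit your account."
    ))
```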

Conclusion

The reports of Evals’ death were greatly exaggerated.

But the whole discussion does help keep us more pragmatic.

Evals are a key tool in your AI quality toolbelt: they shouldn’t be seen as a panacea, but neither should they be dismissed. Which practices will really benefit you, in which shape and format, at which stage of the pipeline, depends a lot on your situation. Nuance is key here, and perhaps that is the main takeaway.

So we will keep going on our journey at LangWatch of empowering you to find the best way to systematically improve your AI agents; whether that means writing evals or not, we can discuss.

Thank you for reading!


Ship agents with confidence, not crossed fingers

Get up and running with LangWatch in as little as 5 minutes.
