Does your AI actually work? Lessons from LangWatch x Altura in Amsterdam

Manouk Draisma


Yesterday, we hosted LangWatch x Altura at Altura’s Amsterdam office for an evening focused on one of the most important questions in AI right now: does your AI actually work in production?

The event brought together presentations from Martijn and Rogerio, followed by a panel with people working on evaluation, legal AI, compliance-heavy systems, and AI consulting. The discussion made one thing very clear: evaluation has moved from a niche concern to a core operational function for teams building with AI.

Evaluation is finally getting the attention it deserves

A recurring theme from the panel was how much the conversation has changed over the past year. Evaluation used to be something teams postponed while they focused on shipping. Now more teams actively seek it out. As our notes from the evening put it, teams are now knocking on evaluators’ doors, whereas a year ago evals were often deprioritized.

That shift makes sense. As more AI products move into real workflows, the limits of intuition become obvious. It is no longer enough to say a demo looked good. Teams need to know whether their systems are reliable, useful, and safe under real conditions. The panel emphasized that the right starting point is not the tooling itself, but the product outcome you want to drive. Metrics should come from what matters to users and the business — not just from what a dashboard happens to make easy to measure.

Speed is up. So the pressure on evaluation is up too.

Another strong theme was pace. AI teams are moving fast, and that creates tension. The notes describe the core challenge as the pace of development itself: more people across the team are contributing code, and it is hard to combine that kind of velocity with rigorous evaluation processes.

This is where practical evaluation patterns matter. The panel shared several ways teams are managing cost and complexity:

  • sampling traces instead of evaluating everything,

  • using hybrid evaluation setups where rule-based checks run first and LLM judges are used only when needed,

  • and setting tiered thresholds that distinguish between must-pass metrics, nice-to-have improvements, and cosmetic issues.

This felt like one of the most useful takeaways of the evening. Mature evaluation is not about trying to score every possible interaction perfectly. It is about building systems that are economically sustainable and operationally useful.
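
To make the three patterns concrete, here is a minimal Python sketch of trace sampling, a rules-first hybrid evaluator, and a tiered release gate. It is an illustration under our own assumptions, not something shown at the event and not the LangWatch API: the Trace fields, the rule set, the metric names, and the threshold values are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Trace:
    trace_id: str
    answer: str
    expected: Optional[str] = None  # reference answer, if one exists


def sample_traces(traces: list[Trace], rate: float, seed: int = 0) -> list[Trace]:
    """Keep roughly `rate` of the traces, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [t for t in traces if rng.random() < rate]


def rule_check(trace: Trace) -> Optional[bool]:
    """Cheap deterministic checks: True/False when decisive, None when unsure."""
    if not trace.answer.strip():
        return False  # empty answers always fail
    if trace.expected is not None and trace.answer.strip() == trace.expected.strip():
        return True  # an exact match needs no judge
    return None


def evaluate(trace: Trace, judge: Callable[[Trace], bool]) -> tuple[bool, str]:
    """Run rules first; call the (expensive) judge only when rules are undecided."""
    verdict = rule_check(trace)
    if verdict is not None:
        return verdict, "rule"
    return judge(trace), "judge"


# Hypothetical tiers: metric name -> (tier, minimum acceptable score).
THRESHOLDS = {
    "groundedness": ("must_pass", 0.90),
    "helpfulness": ("nice_to_have", 0.75),
    "tone": ("cosmetic", 0.60),
}


def release_gate(scores: dict[str, float]) -> dict[str, list[str]]:
    """Group failing metrics by tier; a missing score counts as a failure."""
    failures: dict[str, list[str]] = {"must_pass": [], "nice_to_have": [], "cosmetic": []}
    for metric, (tier, minimum) in THRESHOLDS.items():
        if scores.get(metric, 0.0) < minimum:
            failures[tier].append(metric)
    return failures
```

In this sketch the judge is just a callable, so it can be an LLM call in production and a stub in tests. A release would be blocked only when the must_pass list is non-empty; the other two tiers feed a backlog rather than a gate.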

Reliability is not just a tooling problem

The panel also pushed beyond tooling and into organizational design. One of the clearest ideas was that ownership structure matters as much as tooling. According to the notes, domain experts provide ground truth, managers assess business impact, and engineers set up infrastructure — but someone has to connect those layers. Responsibility cannot just be vaguely shared across the organization.

That point matters even more for agents and production systems. The panel distinguished between three separate categories of metrics: soft metrics like hallucination and groundedness, hard metrics like latency, and business impact metrics. Those categories need to be tracked independently rather than compressed into a single number.
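
One way to keep those categories apart in practice is to store them separately and never average across them. The sketch below is a hypothetical in-memory log, assuming the three category names from the panel; the class name and metric names are ours.

```python
from collections import defaultdict
from statistics import mean

# The three categories the panel distinguished.
CATEGORIES = ("soft", "hard", "business")


class MetricLog:
    """Collects metric values per category and summarizes each one on its own."""

    def __init__(self) -> None:
        self._values = {c: defaultdict(list) for c in CATEGORIES}

    def record(self, category: str, name: str, value: float) -> None:
        if category not in self._values:
            raise ValueError(f"unknown category: {category}")
        self._values[category][name].append(value)

    def summary(self) -> dict[str, dict[str, float]]:
        # Averages are computed per metric, inside a category, never across categories.
        return {
            category: {name: mean(values) for name, values in metrics.items()}
            for category, metrics in self._values.items()
        }
```

The summary deliberately has no "overall score" field, so a dashboard built on it has to show groundedness, latency, and business impact side by side.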

There was also a refreshingly realistic discussion about test flakiness. In AI engineering, flaky tests are often accepted as part of reality. The stronger pattern is to keep smoke tests highly reliable, use broader sweeps before production gates, and maintain human review for borderline cases. That requires teams to get comfortable with fuzzier pass/fail boundaries than in traditional software.
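
A fuzzy boundary can still be explicit. The sketch below runs a nondeterministic check several times and sorts the pass rate into three bands, with the middle band routed to a person. The band edges (0.9 and 0.6) and the function names are assumptions for illustration, not values from the panel.

```python
from typing import Callable


def pass_rate(check: Callable[[], bool], runs: int = 10) -> float:
    """Run a nondeterministic check `runs` times and return the share that passed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return sum(1 for _ in range(runs) if check()) / runs


def classify(rate: float, pass_at: float = 0.9, fail_below: float = 0.6) -> str:
    """Map a pass rate to pass, fail, or a borderline band for human review."""
    if rate >= pass_at:
        return "pass"
    if rate < fail_below:
        return "fail"
    return "needs_human_review"
```

For smoke tests, one option is to set both edges to 1.0 so that any failure is a hard fail; the wider bands then apply only to the broader pre-production sweeps.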

In legal AI, hallucinations are existential

One of the sharpest moments in the discussion came from the legal AI perspective. The panel noted that specialized legal AI tools often outperform general-purpose setups for most legal teams — not only because of model performance, but because access to proprietary legal databases is a major differentiator. At the same time, smaller firms are increasingly experimenting directly with tools like Claude Code and seeing strong results.

But the bigger point was the cost of failure. The notes describe hallucinations in legal work as existential, and include a striking example: Deloitte refunded €400k to the Australian government and the same amount again to the Dutch government after hallucinated references appeared in submitted reports. In Australia, the issue was only caught because the cited author happened to read the report.

The mitigation approach discussed was equally telling: every AI-generated statement is linked back to a source document, and lawyers are trained to validate every finding. In high-stakes domains, evaluation is not just about scoring outputs. It is about building trust, traceability, and reviewability into the workflow itself.
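
As a rough illustration of statement-to-source linking, the sketch below attaches a source id and a supporting quote to each generated statement and flags anything whose quote cannot be found in the cited document. The data model and the exact-substring check are our assumptions; the panel did not describe an implementation, and a check like this only narrows what a lawyer has to review rather than replacing that review.

```python
from dataclasses import dataclass


@dataclass
class Statement:
    text: str       # the AI-generated claim
    source_id: str  # id of the document it cites
    quote: str      # passage in that document the claim relies on


def flag_unsupported(statements: list[Statement], sources: dict[str, str]) -> list[Statement]:
    """Return statements whose quote is empty, or missing from the cited source."""
    flagged = []
    for statement in statements:
        document = sources.get(statement.source_id)
        quote = statement.quote.strip()
        if document is None or not quote or quote not in document:
            flagged.append(statement)
    return flagged
```

A hallucinated reference shows up here as either an unknown source_id or a quote that does not occur in the document, which is exactly the case a reviewer should see first.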

Compliance needs more than thresholds

The compliance angle reinforced this further. In industries like banking and pharma, top-down frameworks are needed — not just threshold scores. The notes say compliance officers want a full trace and audit trail, not simply a cutoff like “0.5 means pass.”
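
The difference between a cutoff and an audit trail is easy to show in code. Instead of returning only pass or fail, the sketch below records what was scored, by which evaluator, against which threshold, on what evidence, and when. The field names are hypothetical, and the checksum only detects accidental changes to a stored record; it is not a tamper-proof signature.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(
    trace_id: str,
    metric: str,
    score: float,
    threshold: float,
    evaluator: str,
    evidence: list[str],
    now: Optional[datetime] = None,
) -> dict:
    """Build a reviewable record of one evaluation decision."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "trace_id": trace_id,
        "metric": metric,
        "score": score,
        "threshold": threshold,
        "decision": "pass" if score >= threshold else "fail",
        "evaluator": evaluator,  # e.g. judge prompt or rule-set version
        "evidence": evidence,    # ids of the documents or spans consulted
        "timestamp": timestamp,
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["checksum"] = hashlib.sha256(payload).hexdigest()
    return record


def is_intact(record: dict) -> bool:
    """Recompute the checksum over everything except the checksum itself."""
    body = {key: value for key, value in record.items() if key != "checksum"}
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest() == record.get("checksum")
```

With a record like this, a compliance officer can ask why a trace passed and get the evaluator version and evidence, not just the number.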

This gets at a deeper truth in enterprise AI: measurement only becomes useful when it fits the reality of governance. Teams are not just trying to optimize model behavior. They are trying to translate legal, risk, and compliance requirements into engineering systems without bringing innovation to a halt. That translation layer is now one of the most important parts of deploying AI responsibly.

A practical view of where the field is going

The audience questions added a few more concrete signals about where the field is heading. The notes mention:

  • multimodal voice systems can capture emotion natively, while text-only pipelines need explicit emotion tagging,

  • LLM judges improve with careful calibration on a small number of high-value scenarios,

  • and evaluation-driven development works best when scenarios are defined upfront as test specs that agents can check against before release.

Together, these examples point to a broader trend: evaluation is becoming embedded earlier in the product lifecycle. It is moving closer to specification, iteration, and release gates — rather than living as a separate reporting layer after the fact.
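
Scenarios defined upfront as test specs might look like the sketch below: each scenario states an input plus phrases the reply must and must not contain, and a runner checks any agent against the full list before release. The Scenario fields and the simple phrase matching are assumptions chosen to keep the example small; real specs would likely use richer checks, including LLM judges.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    name: str
    user_input: str
    must_include: tuple[str, ...] = ()
    must_not_include: tuple[str, ...] = ()


def run_scenarios(agent: Callable[[str], str], scenarios: list[Scenario]) -> list[str]:
    """Return human-readable failures; an empty list means every scenario passed."""
    failures = []
    for scenario in scenarios:
        reply = agent(scenario.user_input).lower()
        for phrase in scenario.must_include:
            if phrase.lower() not in reply:
                failures.append(f"{scenario.name}: missing '{phrase}'")
        for phrase in scenario.must_not_include:
            if phrase.lower() in reply:
                failures.append(f"{scenario.name}: contains forbidden '{phrase}'")
    return failures
```

Because the agent is passed in as a plain callable, the same scenario list can run against a stub during development and against the deployed agent at the release gate.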

What we took away at LangWatch

From the LangWatch perspective, the evening reinforced a conviction we hear echoed across teams: AI quality cannot be reduced to one score, one benchmark, or one dashboard. The real work is in connecting product outcomes, operational signals, human review, and business context in a way that teams can actually use.

What made this event especially valuable was that the conversation stayed grounded in production reality. Not abstract debates, but practical tradeoffs: how to evaluate without killing velocity, how to assign ownership, how to handle fuzzy outcomes, and how to make systems auditable in the environments where failure really matters.

Thanks again to everyone who joined us at Altura and contributed to the discussion. These are exactly the conversations the industry needs more of as AI moves from experimentation into real operational use.


Ship agents with confidence, not crossed fingers

Get up and running with LangWatch in as little as 5 minutes.
