Altura rebuilt AI quality with LangWatch

About Altura

Altura builds the leading AI-powered tender workspace for facility management and cleaning companies. Their platform helps bid teams respond to government tenders (submissions that can involve up to 200 documents each) by automating document analysis, answer generation, and requirement tracking. Altura is on a mission to make the tender process faster, smarter, and more consistent.

At the heart of the platform is the Bid Companion, a conversational interface that lets users ask questions about tender documents, extract structured data, and take actions directly inside the workspace. Behind the scenes, Altura runs LangGraph-based agents that handle document threading, SQL-based enrichments, and multi-step question-answering. They also parse dense Dutch government documents, such as a Programma van Eisen (program of requirements) packed with requirements in table format, into structured data using tools like Langextract.

In other words: high-stakes, multi-step, agentic AI, where a single missed requirement in a contract can cost a customer the tender.

The challenge: "I have no clue where to look"

When engineer Martijn joined Altura, he inherited a system with effectively zero AI metrics. If a customer reported that something had gone wrong, reproducing it meant diving into raw Azure log streams or hoping the right data had landed in the database. It usually hadn't.

"There was no single dramatic incident. It was just a constant 'what is going on under the hood?' and the more complex the LLM workflows got, the more that question came up."

So his first move was observability. He introduced Langfuse, swapping a single import line to auto-capture traces without polluting the database with custom logs. For about a year, it served as the backbone for both observability and prompt management. The prompt management, built on Jinja templates with if/else logic, worked well enough.

But as the product grew more agentic, two questions kept coming back from product and leadership that observability alone couldn't answer:

Offline: Before we deploy, is this actually going to work, or are we regressing other parts of the system?
Online: Once real people start using it in production, how good is it really?

The team's first attempt at answering them was the same one most teams reach for: turn on a set of generic, out-of-the-box LLM-as-judge evaluators and let them run. It didn't work. The judges were generic, lived in a dashboard nobody opened, and, most importantly, produced continuous scores that no one could act on.

"The number one rule we learned for LLM-as-judge: make it binary. What's the difference between a 0.6 and a 0.7? That's just not actionable."

Martijn turned off the LLM judges that had been quietly running in the background for half a year. Nobody had looked at them once.

As he put it, summarizing a pattern he sees across a lot of teams at Altura's stage:

"Without any evaluations in place, you're always shipping on vibes. That was us a year ago."

A harder problem than tooling

Altura's domain made this especially difficult. The platform serves wildly different industries: one tender might be about implementing an IT system, the next about constructing a station building. Vocabulary, structure, requirements, and certifications barely overlap. A metric that looks great for one industry can be meaningless for another.

That created a second, organizational challenge: translating engineering metrics into something leadership could actually steer on.

"Leadership just wants to hear 'we went from 70% to 80%.' But inside that 70%, there's so much compounding going on. Finding evals that genuinely resonate with the business is the hard part, and there's no blueprint for it."

What "good" needed to look like

As the team mapped what they actually needed, the requirements converged on a handful of concrete jobs:

Benchmark new models on production data. Most of Altura's workload runs on Azure with OpenAI models, but the team was actively investigating whether moving some workloads to Gemini or Anthropic would improve quality, and needed an ongoing, repeatable way to replay production data and compare, not a one-off experiment.
Evaluate prompt changes without guesswork. Adjust a prompt, run it against golden-standard scenarios, check for regressions, in minutes, not as a manual leap of faith.
Multi-turn evaluation for the Bid Companion. Verify the assistant holds expected behaviour across full conversations, not just single turns.
Pass/fail gates on golden standards. For document parsing, reference documents with known-correct requirement numbers and text meant clear pass/fail signals when something regressed, caught in CI, not by a customer.
Reliable metrics for CI/CD. The blocker was stability: rerunning the same LLM-as-judge test four times produced four different scores. You can't gate a deployment on a number that won't sit still.
Compare search and retrieval approaches. Test the current engine against alternatives like PageIndex, and experiment with embedding and chunking strategies, as structured experiments, not side projects that never get prioritized.
Hand quality work to product. When a PM suspected hallucination rates had crept up on certain topics, fixing it shouldn't require filing an engineering ticket and writing custom evaluators every time.
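As an illustration of the pass/fail gate on golden standards described above, here is a minimal sketch in Python. It is not Altura's implementation: the `Requirement` type, the `golden_gate` function, the whitespace and case normalization, and the example requirement numbers are all assumptions made for the example. The point is only that a deterministic comparison against a reference document returns the same verdict on every run, which is what a CI gate needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    number: str  # hypothetical numbering scheme, e.g. "R-1"
    text: str


def normalize(text: str) -> str:
    """Collapse whitespace and case so cosmetic differences do not fail the gate."""
    return " ".join(text.split()).lower()


def golden_gate(
    extracted: list[Requirement], golden: list[Requirement]
) -> tuple[bool, list[str]]:
    """Return (passed, problems).

    The gate passes only if every golden requirement was extracted with
    matching text and nothing unexpected was extracted.
    """
    got = {r.number: normalize(r.text) for r in extracted}
    want = {r.number: normalize(r.text) for r in golden}

    problems = []
    for number, text in want.items():
        if number not in got:
            problems.append(f"missing requirement {number}")
        elif got[number] != text:
            problems.append(f"text mismatch in requirement {number}")
    for number in got.keys() - want.keys():
        problems.append(f"unexpected requirement {number}")

    return (not problems, sorted(problems))


# Example: a parser run that silently dropped one requirement.
golden = [
    Requirement("R-1", "The supplier holds ISO 9001 certification."),
    Requirement("R-2", "Cleaning takes place outside office hours."),
]
extracted = [Requirement("R-1", "The supplier holds  ISO 9001 certification.")]

passed, problems = golden_gate(extracted, golden)
print(passed, problems)
```

In a CI job, a failed gate would typically translate into a non-zero exit code, with the list of problems printed so the regression can be traced to a specific requirement.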

One engineer captured the bar perfectly:

"I want to log in at 08:30 with my coffee and see if anything went bottoms-up overnight, and get an email when traces show things deteriorating in production."

Coffee-and-dashboard simple. CI/CD-pipeline serious.

There was one more requirement that turned out to be decisive, and that barely existed as a category when the evaluation started: the tooling had to work natively with Claude Code and spec-driven development. Altura's engineering had moved heavily toward agentic, spec-first workflows, and Martijn had a particular allergy to manual prompt management.

"I don't like writing prompts. I let Claude Code write them for me. So the tooling had to fit that workflow, MCP-native, not a separate place I have to babysit."

Why Altura chose LangWatch

Altura evaluated several platforms. What set LangWatch apart wasn't a longer feature list. It was the opposite.

An opinionated platform in a space with no map

For a team that openly describes AI engineering as a "blank space" with no established thought leaders, an opinionated platform was a feature, not a limitation.

"AI is such a blank space right now. Having a vendor that picks one clear path lets us focus our learning instead of being paralyzed by options. This space is extremely new, the LangWatch team has experience with so many cases, the more guidance we have up front, the better."

Metrics you can actually trust and act on

LangWatch's approach matched the lessons Altura had learned the hard way: prefer binary judgments over continuous scores, lean on calculated metrics (like n-gram overlap against golden answers) where an LLM isn't needed, and make every judge inspectable. Instead of a meaningless "response quality: 0.85," the team can validate judgments one by one, see where the judge agrees or disagrees with a human, and tune it until it's trustworthy, because an evaluator nobody trusts adds no value.
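To make two of these ideas concrete, the sketch below shows a calculated n-gram overlap metric turned into a binary verdict, and a simple agreement rate between a judge's verdicts and a human's. It is plain Python and does not use the LangWatch API. The function names, the choice of bigrams, the recall-style definition of overlap, and the 0.5 threshold are all assumptions for illustration, not values from Altura or LangWatch.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(answer: str, golden: str, n: int = 2) -> float:
    """Fraction of the golden answer's n-grams that also appear in the answer."""
    gold = ngrams(golden, n)
    if not gold:
        return 0.0
    return len(gold & ngrams(answer, n)) / len(gold)


def passes(answer: str, golden: str, threshold: float = 0.5, n: int = 2) -> bool:
    """Binary verdict: the threshold is an illustrative assumption to be tuned."""
    return ngram_overlap(answer, golden, n) >= threshold


def judge_agreement(judge_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Share of cases where a binary judge and a human reviewer agree."""
    if len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not judge_verdicts:
        raise ValueError("need at least one verdict to compare")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)


golden_answer = "cleaning takes place outside office hours on weekdays"
model_answer = "cleaning takes place outside office hours"

print(ngram_overlap(model_answer, golden_answer))
print(passes(model_answer, golden_answer))
print(judge_agreement([True, False, True, True], [True, False, False, True]))
```

Surface overlap cannot tell whether a single changed word (a certification number, say) makes an answer wrong, which is one reason to pair a calculated metric like this with inspectable judges rather than rely on it alone.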

That same stability is what finally made qualitative metrics like groundedness and answer relevance usable as deployment gates rather than directional noise.

Model benchmarking on real data

LangWatch lets Altura replay production traces against new models, turning the OpenAI-vs-Gemini-vs-Anthropic question from a gut call into an evidence-based proposal.

Built for where Altura is heading: agent simulation

As Altura rolls out end-to-end agentic execution (agents that autonomously process hundreds of documents, extract requirements, and draft responses), evaluating components in isolation isn't enough. LangWatch's Scenario simulation framework lets the team test the Bid Companion across full, multi-turn conversations: a simulated user talks to the real agent, and a judge checks not just the final answer but the behaviour along the way. Did it verify the order before acting? Did it actually check the source before answering? Did it hold its guardrails across the conversation? That closes the gap between "the parts work" and "the whole agent works."
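The shape of such a test can be sketched in a few lines of plain Python. This is deliberately not the Scenario framework's API: the `run_scenario` harness, the `Turn` record, the `search_documents` tool name, the guardrail phrase, and the stub agent are all hypothetical. The sketch also simplifies in two ways, by using a scripted list of user messages instead of an LLM-driven simulated user, and deterministic checks instead of an LLM judge. What it keeps is the core idea: the checks run over the whole transcript, including what the agent did along the way, and each check yields a binary result.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Turn:
    user: str
    reply: str
    tool_calls: list[str] = field(default_factory=list)


# Hypothetical agent interface: takes a user message and returns
# (reply text, names of the tools it called while producing the reply).
Agent = Callable[[str], tuple[str, list[str]]]
Check = Callable[[list[Turn]], bool]


def run_scenario(
    agent: Agent, user_messages: list[str], checks: dict[str, Check]
) -> dict[str, bool]:
    """Play scripted user messages against the agent, then run every check
    over the full transcript and return one pass/fail result per check."""
    transcript: list[Turn] = []
    for message in user_messages:
        reply, tool_calls = agent(message)
        transcript.append(Turn(message, reply, list(tool_calls)))
    return {name: check(transcript) for name, check in checks.items()}


def checked_source_every_turn(transcript: list[Turn]) -> bool:
    """Behavioural check: the agent consulted the documents on every turn."""
    return all("search_documents" in turn.tool_calls for turn in transcript)


def held_guardrail(transcript: list[Turn]) -> bool:
    """Guardrail check: a hypothetical forbidden phrase never appears in a reply."""
    return all("internal pricing" not in turn.reply.lower() for turn in transcript)


def stub_agent(message: str) -> tuple[str, list[str]]:
    """Stand-in for the real agent so the sketch runs on its own."""
    return (f"Based on the tender documents: answer to '{message}'", ["search_documents"])


results = run_scenario(
    stub_agent,
    ["What is the submission deadline?", "Which certifications are required?"],
    {"checked_source": checked_source_every_turn, "held_guardrail": held_guardrail},
)
print(results)
```

Keeping each check binary and named means a failing scenario points at a specific behaviour (skipped the source lookup, broke a guardrail) rather than at an aggregate score.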

Short loops and a team that shows up

Finally, there was the human factor. Early on, Martijn worried about betting on a smaller team. In practice, the responsiveness became one of LangWatch's biggest advantages.

"The loops are just short. I could practically shout something into our shared channel and get it addressed. With the bigger players, I was told to wait two weeks for a new MCP version, and I can't wait forever."

Product-team autonomy

LangWatch gives PMs and product colleagues direct access to dashboards, evaluation results, and trace analysis. When a PM suspects hallucination rates have shifted, they can investigate themselves, checking conciseness, topic patterns, and whether the assistant guides users to next steps, without waiting on engineering.

"As product, we want to take work off engineering's plate. Once there's an AI step in the flow, we can own the quality side and start fine-tuning it ourselves."

The migration

The core integration was smooth. The harder parts (bringing existing OpenLLMetry traces into LangWatch, and waiting on the new labeling system before migrating prompt management) were eased by the fact that Altura never had to do a hard cutover from Langfuse. They migrated deliberately, with LangWatch's team in the loop the whole way.

Martijn's honest summary of the evals journey as a whole:

"Getting somewhere with evals was pretty easy. Getting to the right place was the hard part, and we're still climbing. What's made the difference is that LangWatch gives direction on the eval parts, and they're around the corner when we need them."

Looking ahead

Altura's move to LangWatch marks a shift from reactive, manual quality checks to proactive, systematic evaluation baked into a spec-driven development workflow. The team is building toward fully autonomous tender management and rebuilding parts of the platform from scratch, spec-first. In that context, evaluation infrastructure that matches the complexity of the product isn't a nice-to-have. It's the foundation that makes shipping with confidence possible.

The vision is straightforward: engineers ship AI features, product teams own quality, agent simulations guard the critical paths, and everyone can check in with their morning coffee and trust that what they're looking at is real.

No more shipping on vibes.

LangWatch is the AI evaluation and agent testing platform for teams building with LLMs and agents. Test changes, trace behavior, prevent regressions, and ship with confidence.
