LLM evaluations at SWIS for Dutch government projects by LangWatch

Manouk

Apr 3, 2025

SWIS is a Dutch digital agency that’s been building software for over 25 years. They’re known for large-scale Drupal platforms for public organizations like the Netherlands Enterprise Agency (RVO), the Dutch Data Protection Authority, and Sanquin.

Recently, they launched Vragen.ai — a retrieval-augmented generation (RAG) system that provides answer summaries with source references, based entirely on their clients’ content. It’s built to be easy to integrate, GDPR-compliant, and hosted entirely in the EU.
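
Vragen.ai's internals aren't public, but the general shape of a RAG answer with source references can be sketched with a small toy: retrieve passages from the client's own content, generate a grounded summary, and return the summary together with the sources it came from. Everything below (content, helpers, URLs) is hypothetical illustration, not their implementation.

```python
# Illustrative toy only: retrieve passages from the client's own content,
# generate a summary grounded in them, and return both answer and references.
from dataclasses import dataclass

@dataclass
class Passage:
    title: str
    url: str
    text: str

CONTENT = [  # stand-in for the client's indexed content
    Passage("Subsidy conditions", "https://example.org/subsidy", "Applications close on 1 May."),
    Passage("Contact", "https://example.org/contact", "Call us on working days between 9:00 and 17:00."),
]

def retrieve(question: str, top_k: int = 2) -> list[Passage]:
    # Toy keyword overlap; a real system would query a vector index.
    terms = set(question.lower().split())
    scored = sorted(CONTENT, key=lambda p: -len(terms & set(p.text.lower().split())))
    return scored[:top_k]

def generate(question: str, passages: list[Passage]) -> str:
    # Placeholder for an LLM call instructed to answer only from the passages.
    return " ".join(p.text for p in passages)

def answer_with_references(question: str) -> dict:
    passages = retrieve(question)
    return {
        "answer": generate(question, passages),
        "references": [{"title": p.title, "url": p.url} for p in passages],
    }

print(answer_with_references("When do subsidy applications close?"))
```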

We talked to Dominique (Software Architect) and Joris (AI Lead) about how they handle evaluation and testing of AI outputs — especially in high-stakes environments like government platforms.

The problem - LLM output testing

Vragen.ai was working well, but SWIS ran into a familiar problem:

“How do we objectively know if the AI output is good?”

Some key issues they faced:

  • No fast, repeatable way to evaluate RAG-generated answers

  • Lack of clarity on how to present quality metrics to internal stakeholders or clients

  • Too much reliance on manual spot checks or subjective review

For public sector deployments like RVO, they needed to move from “looks OK” to “this meets specific quality criteria.”

The approach - Structured evaluation framework

SWIS brought in LangWatch to build a more structured QA process around their AI stack.

Using LangWatch, they built a ground-truth evaluation pipeline tailored to the RVO project. This allowed them to:

  • Define clear metrics like Faithfulness, Factual Match, and Relevance

  • Automatically test generated answers against curated expected answers

  • Visualize where and why an output fails — whether hallucination, wrong scope, or missing context

“We didn’t want a black-box score — we wanted something the dev team and product team could both interpret.”
Dominique, Software Architect
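
LangWatch ships its own evaluators and reporting, so the snippet below is only a generic sketch of what a ground-truth evaluation pipeline like this does: curated question/expected-answer pairs, a toy metric standing in for evaluators such as Faithfulness or Factual Match, and per-case results that show where and why an output fails. All helper names are hypothetical, not LangWatch's actual API.

```python
# Generic sketch of a ground-truth evaluation run (not LangWatch's actual API):
# each curated test case pairs a question with an expected answer, the RAG system
# is called, and per-case scores are recorded so failures can be inspected.
from dataclasses import dataclass

@dataclass
class TestCase:
    question: str
    expected_answer: str

def factual_match(generated: str, expected: str) -> float:
    # Toy stand-in for a real metric such as Faithfulness or Factual Match:
    # fraction of expected-answer tokens that appear in the generated answer.
    expected_tokens = set(expected.lower().split())
    generated_tokens = set(generated.lower().split())
    return len(expected_tokens & generated_tokens) / max(len(expected_tokens), 1)

def run_evaluation(cases: list[TestCase], answer_fn, threshold: float = 0.7) -> list[dict]:
    results = []
    for case in cases:
        generated = answer_fn(case.question)
        score = factual_match(generated, case.expected_answer)
        results.append({
            "question": case.question,
            "score": round(score, 2),
            "passed": score >= threshold,
            "generated": generated,  # kept so reviewers can see why a case failed
        })
    return results
```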

What it changed - LLM evaluation reports & feedback loops

For the RVO deployment, this meant a few key improvements:

  • Faster feedback loops: LLM evaluation reports are generated automatically during internal test runs — the team knows if something is broken before a release.

  • No-answer detection: LangWatch helped detect cases where the model should not generate an answer due to missing data — a critical case for RVO’s domain.

  • Less manual LLM testing: With structured tests in place, developers can run diff checks and regression testing more easily.

“We now have clear indicators when a change introduces regression in answer quality — not just when the code breaks.”
Joris, AI Lead
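
As a rough illustration of the checks described above (not SWIS's actual test suite), a no-answer check and a baseline regression gate might look like the sketch below, reusing the results shape from the earlier evaluation snippet. The refusal markers and tolerance value are assumptions for the example.

```python
# Hedged sketch of two checks: (1) no-answer detection, where the ground truth
# marks a question as unanswerable and the system is expected to refuse rather
# than guess; (2) regression detection, comparing the current evaluation run
# against a stored baseline so a quality drop fails the run even if the code "works".
import json
from pathlib import Path

REFUSAL_MARKERS = ("i don't know", "no answer available", "geen antwoord")  # assumed markers

def check_no_answer(generated: str) -> bool:
    # True if the model correctly declined to answer.
    return any(marker in generated.lower() for marker in REFUSAL_MARKERS)

def check_regression(results: list[dict], baseline_path: Path, tolerance: float = 0.02) -> bool:
    current = sum(r["score"] for r in results) / max(len(results), 1)
    if not baseline_path.exists():
        baseline_path.write_text(json.dumps({"mean_score": current}))
        return True
    baseline = json.loads(baseline_path.read_text())["mean_score"]
    return current >= baseline - tolerance  # fail the run if quality drops noticeably
```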

Bonus: Integration with the Dev Stack - OpenTelemetry

One thing that clicked was the LangWatch Optimization Studio — essentially a workflow builder with custom code support.

SWIS integrated it into their internal pipelines:

  • Pull data from their test environments

  • Run evaluation workflows before deploying changes

  • Track model performance across iterations and versions

  • While the first integration ran via the APIs, they are now fully integrated with the LangWatch OpenTelemetry SDK (a minimal setup is sketched below)

“When Optimization Studio launched, we could finally wire it up to our own code. It became part of our dev workflow, not an extra tool on the side.”
Dominique
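
The exact LangWatch SDK calls are best taken from their documentation, but the OpenTelemetry side of such an integration follows the standard OTLP pattern: configure an exporter pointing at the LangWatch collector and wrap each RAG request in spans. The code below uses the standard OpenTelemetry Python SDK; the endpoint URL and auth header are placeholders, not verified LangWatch values.

```python
# Standard OpenTelemetry setup showing the shape of such an integration.
# Endpoint and auth header are placeholders; consult the LangWatch OpenTelemetry
# docs for the exact collector URL and authentication header.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://app.langwatch.ai/api/otel/v1/traces",  # placeholder URL
            headers={"Authorization": "Bearer <LANGWATCH_API_KEY>"},  # placeholder header
        )
    )
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("vragen-ai")

# Each RAG request becomes a trace; retrieval and generation become child spans,
# so evaluation results can be tied back to the exact inputs that produced them.
with tracer.start_as_current_span("rag.request") as span:
    span.set_attribute("rag.question", "When do subsidy applications close?")
    with tracer.start_as_current_span("rag.retrieve"):
        pass  # call the retriever here
    with tracer.start_as_current_span("rag.generate"):
        pass  # call the LLM here
```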

What’s Next

SWIS plans to expose LangWatch-powered evaluations directly to clients via their dashboards, so non-technical users can also see how output quality evolves over time.

In the meantime, LangWatch gives their dev team something most AI projects lack:
Visibility. Repeatability. Confidence in quality.

Start for free today!

Boost your LLM's performance today

Get up and running with LangWatch in as little as 10 minutes.
