LLM evaluations at SWIS for Dutch government projects with LangWatch

Manouk
Apr 3, 2025
SWIS is a Dutch digital agency that’s been building software for over 25 years. They’re known for large-scale Drupal platforms for public organizations like the Netherlands Enterprise Agency (RVO), the Dutch Data Protection Authority, and Sanquin.
Recently, they launched Vragen.ai — a retrieval-augmented generation (RAG) system that provides answer summaries with source references, based entirely on their clients’ content. It’s built to be easy to integrate, GDPR-compliant, and hosted entirely in the EU.
We talked to Dominique (Software Architect) and Joris (AI Lead) about how they handle evaluation and testing of AI outputs — especially in high-stakes environments like government platforms.
The problem - LLM output testing
Vragen.ai was working well, but SWIS ran into a familiar problem:
“How do we objectively know if the AI output is good?”
Some key issues they faced:
No fast, repeatable way to evaluate RAG-generated answers
Lack of clarity on how to present quality metrics to internal stakeholders or clients
Too much reliance on manual spot checks or subjective review
For public sector deployments like RVO, they needed to move from “looks OK” to “this meets specific quality criteria.”
The approach - Structured evaluation framework
SWIS brought in LangWatch to build a more structured QA process around their AI stack.
Using LangWatch, they built a ground-truth evaluation pipeline tailored to the RVO project. This allowed them to:
Define clear metrics like Faithfulness, Factual Match, and Relevance
Automatically test generated answers against curated expected answers
Visualize where and why an output fails — whether hallucination, wrong scope, or missing context
“We didn’t want a black-box score — we wanted something the dev team and product team could both interpret.”
— Dominique, Software Architect
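As a rough illustration of what a ground-truth pipeline like this can look like, the sketch below scores each generated answer against a curated expected answer using an LLM-as-judge call. It is not the LangWatch SDK or SWIS's actual setup: the dataset, the judge prompt, the judge model and the generate_answer callable are all assumptions for the example.

```python
# Minimal sketch of a ground-truth evaluation loop (illustrative only; this is
# not the LangWatch SDK). Assumes an OpenAI-compatible judge model and a small
# curated dataset of question / expected-answer pairs.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DATASET = [
    {
        "question": "Hoe vraag ik een subsidie aan?",
        "expected_answer": "Via het RVO-portaal, met eHerkenning niveau 3.",
    },
]

JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Expected answer: {expected}
Generated answer: {generated}
Reply with JSON: {{"factual_match": 0-1, "relevance": 0-1, "comment": "..."}}"""


def judge(question: str, expected: str, generated: str) -> dict:
    """Score one generated answer against the curated expected answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any judge model works here
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, expected=expected, generated=generated)}],
    )
    return json.loads(response.choices[0].message.content)


def run_eval(generate_answer) -> list[dict]:
    """Run the RAG system over the dataset and collect per-item scores."""
    results = []
    for item in DATASET:
        generated = generate_answer(item["question"])  # hypothetical RAG call
        scores = judge(item["question"], item["expected_answer"], generated)
        results.append({**item, "generated": generated, **scores})
    return results
```

The value of a structure like this is that each failing item carries its own scores and comment, so a reviewer can see whether the miss was a hallucination, a scope problem, or missing context rather than just a low aggregate number.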
What it changed - LLM evaluation reports & feedback loops
For the RVO deployment, this meant a few key improvements:
Faster feedback loops: LLM evaluation reports are generated automatically during internal test runs — they know if something is broken before a release.
No-answer detection: LangWatch helped detect cases where the model should not generate an answer due to missing data — a critical case for RVO’s domain.
Less manual LLM testing: With structured tests in place, developers can run diff checks and regression testing more easily.
“We now have clear indicators when a change introduces regression in answer quality — not just when the code breaks.”
— Joris, AI Lead
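To make that regression signal concrete, here is a minimal sketch of how such checks can sit in a test suite. The thresholds, the eval_pipeline module, the my_rag_pipeline entry point and the no-answer sentinel are all hypothetical; it reuses run_eval from the sketch above.

```python
# Sketch of a quality gate in a pytest suite (assumed setup, not SWIS's actual
# pipeline). Reuses run_eval from the previous sketch; my_rag_pipeline stands
# in for the real RAG entry point.
import statistics

from eval_pipeline import run_eval, my_rag_pipeline  # hypothetical module


def test_answer_quality_regression():
    results = run_eval(my_rag_pipeline)

    avg_factual = statistics.mean(r["factual_match"] for r in results)
    avg_relevance = statistics.mean(r["relevance"] for r in results)

    # Thresholds are illustrative; in practice they are agreed per project.
    assert avg_factual >= 0.80, f"Factual match regressed: {avg_factual:.2f}"
    assert avg_relevance >= 0.85, f"Relevance regressed: {avg_relevance:.2f}"


def test_declines_when_sources_are_missing():
    # No-answer case: for a question with no supporting content, the pipeline
    # should return its "no answer" sentinel instead of hallucinating one.
    answer = my_rag_pipeline("Vraag zonder bronmateriaal in de kennisbank")
    assert answer is None  # assuming None is the pipeline's no-answer sentinel
```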
Bonus: Integration with the Dev Stack - OpenTelemetry
One thing that clicked was the LangWatch Optimization Studio — essentially a workflow builder with custom code support.
SWIS integrated it into their internal pipelines:
Pull data from their test environments
Run evaluation workflows before deploying changes
Track model performance across iterations and versions
While the first integration ran via the LangWatch APIs, SWIS is now fully integrated through the LangWatch OpenTelemetry SDK.
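For reference, exporting traces to LangWatch over OpenTelemetry comes down to pointing an OTLP exporter at the LangWatch collector. The sketch below uses the standard OpenTelemetry Python SDK; the endpoint path, the auth header and the span attributes are assumptions to verify against the LangWatch docs.

```python
# Minimal sketch: exporting traces over OpenTelemetry (OTLP/HTTP) to LangWatch.
# The endpoint path and Authorization header below are assumptions; check the
# LangWatch documentation for the exact values for your deployment.
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "vragen-ai"}))
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://app.langwatch.ai/api/otel/v1/traces",  # assumed endpoint
            headers={"Authorization": f"Bearer {os.environ['LANGWATCH_API_KEY']}"},
        )
    )
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("vragen-ai")

# Wrap a RAG call in a span so the question and answer show up as a trace.
with tracer.start_as_current_span("rag.generate_answer") as span:
    span.set_attribute("rag.question", "Hoe vraag ik een subsidie aan?")
    # ... call the RAG pipeline here and record the answer as an attribute ...
```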
“When Optimization Studio launched, we could finally wire it up to our own code. It became part of our dev workflow, not an extra tool on the side.”
— Dominique
What’s Next
SWIS plans to expose LangWatch-powered evaluations directly to clients via their dashboards, so non-technical users can also see how output quality evolves over time.
In the meantime, LangWatch gives their dev team something most AI projects lack:
Visibility. Repeatability. Confidence in quality.
Start for free today!