How Hyperfox built confidence in AI order processing with LangWatch
Hyperfox went from limited visibility and ad hoc benchmarking to a repeatable workflow for evaluating new AI features and tracing production behavior, giving their team more confidence in every release and saving weeks of internal tooling work.

Company: Hyperfox
Industry: AI-powered order automation
Use Case: LLM observability, benchmarking, and evaluation for document and order processing
Results at a Glance
2+ Weeks Saved: Engineering time avoided by not building and maintaining an internal benchmarking tool
100% Evaluation Coverage for New Features: Every new feature, model change, and fallback strategy can now be benchmarked before release
Faster Root Cause Analysis: Production failures can now be traced through full LLM and agent workflows instead of manual log digging
Higher Release Confidence: Every feature, bug fix, and model update ships with measurable validation instead of intuition
“Every increment we ship now—whether it’s a feature or even a bug fix—we have much more confidence.”
— Clara De Smet, AI Lead at Hyperfox
About Hyperfox
Hyperfox automates order processing. Their platform takes incoming orders in unstructured formats — such as emails, documents, and spreadsheets — and transforms them into structured data that can be pushed into ERP and TMS systems.
At the core of the product is AI. Hyperfox uses models and agents to extract, match, and validate information across complex order flows. That means quality matters at every step. If a field is extracted incorrectly, if a fallback model behaves differently, or if an agent makes the wrong tool call, the problem does not stay contained in an experiment. It affects live operations.
As Hyperfox expanded its AI capabilities, especially into more agentic workflows, the challenge was no longer just building these systems. It was knowing whether they were actually working as expected before and after release.
The Challenge: Limited visibility and no reliable evaluation workflow
At Hyperfox, Clara De Smet leads AI development. Her responsibility covers the AI-powered document and order processing layer of the product. When she started using LangWatch, she was looking for two things: observability and evaluations.
Before that, both were missing.
On the observability side, the team had very little visibility into what happened inside production LLM workflows. If something failed, Clara and her team often had to go digging through logs or try to reproduce the issue locally. That was difficult enough on its own, but even harder when the issue came from transient provider or model behavior that had already disappeared by the time the investigation started.
On the evaluation side, the team had no robust way to track whether model quality was improving or degrading over time. They could run experiments, but there was no strong benchmarking process and no easy way to compare a baseline against a new feature, a new model, or a fallback strategy.
They had built a lightweight internal tool to help, but it was limited and clunky. It added maintenance overhead without really solving the problem.
For her team, and especially for a company scaling quickly while balancing product work and reliability work, this became a real bottleneck.
The breaking point: A fallback worked in theory, but failed in production
One of the clearest examples came when Hyperfox was working on a model fallback strategy.
They had a workflow using Gemini and wanted the ability to fall back to OpenAI in case of provider issues. On paper, the two implementations looked similar enough. In practice, they behaved differently. Structured output handling and schema behavior were just different enough that some errors were thrown silently and not handled correctly downstream.
The result was a production issue.
The team discovered the problem only after it had already impacted the live workflow. At that point, they had neither the testing coverage to catch it earlier nor the observability needed to quickly understand exactly what had gone wrong.
“We assumed it would work the same—but silently errors were being thrown, and the whole process fell apart.”
This was the type of issue that made the gap impossible to ignore. Hyperfox needed a way to evaluate changes before release and understand failures once something was live.
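The failure mode above — a fallback provider whose structured output drifts just enough to break downstream handling — can be guarded against by validating every provider's output against one shared schema, so divergence raises an error instead of passing silently. A minimal sketch of that idea (the schema, field names, and exception are illustrative, not Hyperfox's actual code):

```python
# Hypothetical sketch: both the primary and fallback provider paths must pass
# the same schema check, so structured-output drift fails loudly, not silently.

REQUIRED_FIELDS = {"order_id": str, "customer": str, "lines": list}

class SchemaMismatch(Exception):
    """Raised when a provider's structured output deviates from the shared schema."""

def validate_order(payload: dict, provider: str) -> dict:
    """Check that a provider's output carries every required field with the right type."""
    for field, expected in REQUIRED_FIELDS.items():
        if field not in payload:
            raise SchemaMismatch(f"{provider}: missing field {field!r}")
        if not isinstance(payload[field], expected):
            raise SchemaMismatch(
                f"{provider}: {field!r} is {type(payload[field]).__name__}, "
                f"expected {expected.__name__}"
            )
    return payload

# Usage: run the same validator regardless of which provider answered.
good = {"order_id": "A1", "customer": "Acme", "lines": [{"sku": "X", "qty": 2}]}
validate_order(good, "gemini")  # passes
# A fallback that returns "lines" as a string instead of a list would raise
# SchemaMismatch here rather than corrupting the downstream workflow.
```

The point of the sketch is that the validation sits outside any single provider's SDK, so "similar enough on paper" is tested rather than assumed.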
The Goal: Confidence before shipping, visibility after launch
What Hyperfox needed was not just a dashboard.
They needed a workflow that would let them answer questions like:
Is this new model actually better than the current one?
Does a fallback behave the same way under real conditions?
Did this feature improve quality or introduce regressions?
What exactly happened when an agent or tool call went wrong in production?
That meant two capabilities had to come together:
reliable benchmark-style evaluations before deployment
clear tracing and observability once features were live
The Solution: LangWatch for evaluations and production traces
Hyperfox adopted LangWatch to cover both sides of the workflow.
For production observability, they started capturing traces from their LLM systems. This included simple completion calls through the Gemini API, as well as more agentic workflows built with Google ADK, where multiple steps, handoffs, and tool calls needed to be understood in sequence.
For evaluations, Clara began using LangWatch as her benchmarking layer during development. When building a new feature, she runs representative test orders through her local environment, sends both the outputs and the known ground truths into LangWatch, and evaluates the results there.
This created a much more structured way to compare versions and validate changes before release.
Instead of relying on intuition, Hyperfox could now measure.
How Hyperfox uses LangWatch Experimentation today
Offline benchmark testing before release
When Clara develops a new feature, she runs it on a controlled test set of customer orders. Typically that means processing dozens of orders across multiple customers in a local development setup.
Once the outputs are generated, she compares them against known ground truth data in LangWatch. That lets her assess whether the new implementation performs better than the current baseline before anything ships to production.
This has become a practical evaluation workflow for model selection, new features, and benchmark comparisons.
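The shape of that workflow — run a fixed test set through both the baseline and the candidate implementation, then compare each against ground truth — can be sketched in a few lines. All names here are hypothetical placeholders, not LangWatch APIs or Hyperfox's code:

```python
# Generic sketch of an offline benchmark loop: score two extraction
# implementations against known ground truth before anything ships.

def field_accuracy(predicted: dict, truth: dict) -> float:
    """Fraction of ground-truth fields the prediction got exactly right."""
    if not truth:
        return 1.0
    correct = sum(1 for key, value in truth.items() if predicted.get(key) == value)
    return correct / len(truth)

def benchmark(extract, test_set) -> float:
    """Average field accuracy of an extraction function over (input, truth) pairs."""
    scores = [field_accuracy(extract(text), truth) for text, truth in test_set]
    return sum(scores) / len(scores)

# Usage: the same test set scored for baseline and candidate makes
# "is the new version actually better?" a number instead of a feeling.
# winner = max((baseline_extract, candidate_extract),
#              key=lambda fn: benchmark(fn, test_set))
```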
Custom metrics for real-world order accuracy
Hyperfox did not stop at out-of-the-box metrics.
They started with exact matching, but that quickly proved too rigid. In order processing, some outputs can be correct even when they are not formatted identically. For example, order lines may be returned in a different sequence while still representing the same result.
To reflect that, Clara moved beyond exact match and experimented with LLM-as-a-judge approaches. That helped in some cases, but introduced variability on larger orders. For Hyperfox, evaluation needed to be more deterministic.
So they built custom metrics inside LangWatch.
These metrics look at the structure of the output JSON, score leaf nodes, and also allow weighted evaluation depending on what matters most in the workflow. Certain free-text fields can be excluded where strict correctness is harder to define.
That gave Hyperfox the flexibility to measure quality in a way that matches the realities of their product, rather than forcing everything into one generic scoring method.
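A deterministic metric along these lines might look like the following simplified sketch. The field names, weights, and exclusions are illustrative only; Hyperfox's actual metrics run inside LangWatch and differ in detail:

```python
# Simplified sketch of a deterministic, weighted order-accuracy metric:
# score leaf fields against ground truth, weight the fields that matter most,
# exclude free-text fields, and treat reordered line items as equivalent.

EXCLUDED = {"notes"}                       # free-text fields left out of scoring
WEIGHTS = {"order_id": 3.0, "lines": 2.0}  # fields that matter most weigh more

def score_order(predicted: dict, truth: dict) -> float:
    """Weighted share of ground-truth leaf fields the prediction matches.

    Order lines are compared as multisets, so the same lines in a
    different sequence still count as fully correct.
    """
    total = earned = 0.0
    for field, expected in truth.items():
        if field in EXCLUDED:
            continue
        weight = WEIGHTS.get(field, 1.0)
        total += weight
        got = predicted.get(field)
        if field == "lines":
            # order-insensitive comparison: normalize each line item to a
            # sorted tuple of (key, value) pairs, then sort the collection
            def norm(lines):
                return sorted(tuple(sorted(line.items())) for line in lines)
            if isinstance(got, list) and norm(got) == norm(expected):
                earned += weight
        elif got == expected:
            earned += weight
    return earned / total if total else 1.0
```

Because every comparison is structural, the same inputs always produce the same score — the determinism that LLM-as-a-judge could not guarantee on larger orders.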
Production observability for agent workflows
As Hyperfox began using Google ADK for more agentic workflows, LangWatch also became a debugging and insight tool.
One useful example came from tracing tool usage. By reviewing a sequence diagram inside LangWatch, Clara discovered that one tool was being called far more often than expected. That insight led her to inspect the agent’s reasoning flow more closely and identify behavior she would likely not have caught otherwise.
For Hyperfox, observability is not just about postmortems. It is also a way to understand how agents behave in practice and spot inefficiencies or unexpected patterns early.
The Results
Weeks of internal tooling work avoided
One of the clearest gains was the time Hyperfox did not have to spend building and maintaining its own benchmarking system.
Without LangWatch, Clara estimates she would have spent at least two weeks building a rudimentary internal tool, and even then it would likely have produced a limited, hard-to-maintain result.
For a lean team, that matters. Those are weeks that can now go into improving the product itself instead of creating infrastructure around it.
More confidence in every change
The biggest impact was not raw development speed. It was confidence.
Hyperfox now has a better process for validating changes before they go live, and much better visibility if something behaves unexpectedly afterward.
That applies not only to large features, but also to small increments and bug fixes. The team can now ship with much more confidence that changes have been tested and that failures will be easier to understand if they happen.
Better understanding of agent behavior
Tracing has also given Hyperfox more insight into how its agent systems behave internally.
Instead of treating agents as black boxes, the team can now inspect the path taken through tools and handoffs. That makes debugging easier, but it also supports product development by revealing where behavior differs from expectations.
What’s Next: Closing the loop between production validation and human review
Hyperfox already has a clear next step in mind.
Today, they use LangWatch heavily for offline evaluations and production tracing. The next opportunity is connecting production outputs with delayed human validation. In their workflow, processed orders often sit in a validation screen until a user reviews and approves them. That means there can be a meaningful delay between AI output and final ground truth.
If those two stages can be linked cleanly, Hyperfox can move toward a much stronger online evaluation loop: not just testing before release, but measuring real-world quality over time based on validated production outcomes.
They are also interested in going further with LangWatch Scenarios to simulate AI integration tests for agents. That includes testing handoffs between extraction and matching agents, validating how malformed tool calls are handled, and checking larger end-to-end flows such as progressive processing of large files.
The goal is straightforward: treat agent behavior with the same discipline as software behavior, and build repeatable tests around it.
Key Takeaways
Hyperfox’s experience highlights a few things that many AI teams run into as they mature.
First, you cannot rely on intuition when model behavior changes. If you are switching models, adding fallbacks, or shipping new features, you need a benchmark workflow that tells you what actually improved.
Second, observability matters just as much as evaluation. Even with good testing, production systems will always surprise you. When that happens, traces are the difference between guessing and knowing.
Third, for lean teams, buying the right tooling is often far more valuable than building it yourself. The time saved is not just operational. It translates directly into more focus on the product and faster learning loops.
Hyperfox is building AI into operationally critical workflows. LangWatch helps them do that with more structure, more visibility, and more confidence.
LangWatch is the AI evaluation and agent testing platform for teams building with LLMs and agents. Test changes, trace behavior, prevent regressions, and ship with confidence.