How Struck.build improved AI performance and visibility with LangWatch
Struck.build moved from limited visibility into LLM behavior to full monitoring, evaluations, and production insight across their AI applications, helping the team improve answer quality, detect issues faster, and ship AI features with greater confidence.
Results at a Glance
Real-Time Visibility Across AI Workflows
Full monitoring across LLM-powered applications with insights into quality, latency, throughput, and failures
Faster Detection of AI Issues
Production anomalies and low-quality outputs can now be identified immediately, before they turn into customer frustration
Stronger RAG Performance Evaluation
The team can now validate whether answers are actually correct—not just whether the system returned a response
Higher Customer Trust
Clear reporting and traceability help Struck.build explain AI behavior and build confidence with customers
“LangWatch is helping us understand what users are asking and if the actual answer provided is correct. I’m amazed by the evaluation of our RAG application.”
— Nikhil Nagaraj, CTO at Struck.build
About Struck.build
Struck.build builds AI-powered solutions for the construction industry, helping customers improve project execution, operational efficiency, and decision-making across complex construction workflows.
Construction is an industry where delays, miscommunication, and poor decisions create immediate financial impact.
That means AI systems cannot simply be interesting.
They need to be reliable.
From resource optimization to predictive maintenance and risk management, Struck.build uses AI to help customers make better decisions faster. As their LLM-powered applications expanded, especially around retrieval and question answering, maintaining answer quality became critical.
The challenge was no longer just building AI features.
It was proving they were working.
The Challenge: AI answers without clear visibility
Before using LangWatch, Struck.build faced a common problem for teams shipping LLM products:
The system could generate answers, but understanding whether those answers were actually good was much harder.
Their team needed to answer questions like:
What are users actually asking?
Is the RAG system retrieving the right context?
Is the final answer correct?
Where are hallucinations happening?
Which prompts or retrieval changes improve quality?
How do we detect issues before customers report them?
Without strong observability, debugging became reactive.
Without evaluations, improving quality became guesswork.
This is especially risky in construction workflows, where wrong information can create real operational consequences.
The Goal: Confidence in every AI response
Struck.build needed more than logs. They needed a system that could help them:
Before Production
Evaluate whether:
RAG answers were correct
prompt changes improved outcomes
retrieval quality was improving
new releases introduced regressions
model updates actually performed better
After Production
Monitor:
latency
throughput
failures
hallucinations
answer quality
abnormal behavior across live customer usage
They needed measurable confidence—not assumptions.
The Solution: LangWatch for Monitoring + Evaluations
Struck.build integrated LangWatch across their LLM production environment to create both visibility and control.
This gave them a single system for:
production monitoring
answer quality evaluation
retrieval analysis
prompt experimentation
anomaly detection
performance reporting
Instead of manually reviewing outputs or waiting for customers to flag problems, the team could now proactively monitor and improve.
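At the code level, an integration like this can stay lightweight. Below is a minimal sketch using the decorators documented in the LangWatch Python SDK (`trace`, `span`, and OpenAI auto-tracking); it illustrates the pattern rather than Struck.build's actual code:

```python
# Minimal sketch: instrumenting an LLM endpoint with the LangWatch Python
# SDK so each request becomes a trace with timing, cost, and I/O captured.
# Decorator names follow the public SDK docs; details are illustrative.
import langwatch
from openai import OpenAI

client = OpenAI()
langwatch.setup()  # reads LANGWATCH_API_KEY from the environment

@langwatch.span(type="rag")  # recorded as a retrieval span inside the trace
def retrieve_context(question: str) -> str:
    # ...vector search over the construction knowledge base...
    return "relevant project documents"

@langwatch.trace()  # one trace per user request
def answer_question(question: str) -> str:
    # Auto-capture every OpenAI call in this trace (tokens, latency, cost)
    langwatch.get_current_trace().autotrack_openai_calls(client)
    context = retrieve_context(question)
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using only:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content
```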
How Struck.build uses LangWatch today
Evaluating RAG quality, not just response generation
One of the biggest improvements came from evaluating their RAG application. Previously, success was often measured by whether the system responded. But a response is not the same as a correct answer.
LangWatch helped Struck.build evaluate whether answers were grounded, accurate, and actually useful for the user.
This shifted the team from output monitoring to quality monitoring.
“I’m amazed by the evaluation of our entire AI product.”
That visibility became especially important when optimizing prompts and retrieval logic. Instead of relying on intuition, they could benchmark changes and compare performance with real evidence.
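The shape of such a check is easy to picture. Below is a hypothetical LLM-as-judge groundedness test; LangWatch ships comparable evaluators out of the box, so the judge prompt and verdict format here are illustrative assumptions, not its built-in API:

```python
# Hypothetical LLM-as-judge groundedness check: does the answer stay
# inside the retrieved context? Prompt and verdict format are assumptions
# for illustration only.
from openai import OpenAI

client = OpenAI()

def is_grounded(question: str, context: str, answer: str) -> bool:
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Context:\n" + context +
                "\n\nQuestion: " + question +
                "\nAnswer: " + answer +
                "\n\nDoes the answer rely only on facts present in the "
                "context? Reply with exactly GROUNDED or UNGROUNDED."
            ),
        }],
    )
    return verdict.choices[0].message.content.strip() == "GROUNDED"
```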
Production monitoring across multiple dimensions
Struck.build also uses LangWatch to monitor AI performance beyond just correctness.
They track:
accuracy
latency
throughput
resource utilization
anomalies and failures
This creates a much more complete understanding of system health. When something degrades—whether quality or speed—the team can identify it early. That reduces downtime and prevents customer-facing issues before they escalate.
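The logic behind catching a degradation early does not have to be exotic. A toy sketch of the kind of rule that turns these raw metrics into an early warning (window sizes and tolerance are assumptions):

```python
# Toy early-warning rule: compare a recent metric window (e.g. latency
# samples) against a baseline window. The tolerance is an assumed
# parameter; for quality scores the comparison direction flips.
from statistics import mean

def degraded(recent: list[float], baseline: list[float],
             tolerance: float = 1.5) -> bool:
    return mean(recent) > tolerance * mean(baseline)

# e.g. degraded(last_hour_latencies, last_week_latencies) -> alert the team
```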
Faster issue detection through real-time observability
With real-time monitoring, the team can detect abnormal behavior as it happens. Instead of waiting for a customer to say “this answer looks wrong,” they can identify issues proactively. This is especially valuable in production environments where reliability matters. LangWatch helps them move from reactive debugging to proactive quality control.
Better customer transparency and trust
Construction customers need confidence in the systems they rely on. LangWatch reporting gives Struck.build the ability to show how their AI systems perform—not just claim that they work. That improves transparency and strengthens customer trust. Instead of black-box AI, customers get measurable visibility.
The Results
Stronger Confidence in AI Performance
The team now has a clear understanding of how their LLM systems behave across production and experimentation.
They can measure quality before changes go live and investigate issues quickly when something changes.
This creates much stronger confidence across releases.
Faster Problem Resolution
Real-time anomaly detection dramatically shortens the feedback loop between issue detection and resolution.
Instead of searching across logs after a customer escalation, the team can trace behavior immediately.
That means faster fixes and less operational risk.
Better RAG Quality
By focusing on answer correctness—not just response generation—Struck.build improved the reliability of its RAG workflows.
This is critical for customer trust and for maintaining high-value enterprise AI use cases.
Scalable Monitoring for Growing AI Systems
As Struck.build expands into more AI use cases across construction workflows, LangWatch provides the infrastructure to monitor complexity without increasing operational overhead.
This supports scale without losing control.
“This integration empowers us to deliver innovative solutions with confidence, knowing we have the data to drive our success.”
— Nikhil Nagaraj, CTO at Struck.build
What’s Next: From monitoring to continuous optimization
Monitoring was the first step. The next phase is turning those insights into continuous improvement loops. That means going beyond observability and into:
Continuous Evaluations
Using evaluation datasets to compare prompt versions, model updates, and retrieval improvements before release. This creates stronger pre-production confidence and prevents regressions.
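A hedged sketch of what such a pre-release gate can look like; `run_pipeline` and the `is_grounded` judge from the earlier sketch are illustrative placeholders, not a specific LangWatch API:

```python
# Pre-release gate sketch: run two prompt versions over the same
# evaluation dataset and compare pass rates before shipping.
# `run_pipeline` is a placeholder for the real RAG pipeline.
evaluation_set = [
    {"question": "When is rebar inspection required?", "context": "..."},
    # ...cases curated from real production traffic...
]

def run_pipeline(prompt_version: str, question: str) -> str:
    ...  # placeholder: the pipeline under test, keyed by prompt version

def pass_rate(prompt_version: str) -> float:
    passed = sum(
        is_grounded(case["question"], case["context"],
                    run_pipeline(prompt_version, case["question"]))
        for case in evaluation_set
    )
    return passed / len(evaluation_set)

# Ship the candidate prompt only if it does not regress.
assert pass_rate("candidate-v2") >= pass_rate("current")
```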
Scenario Testing for AI Workflows
As AI systems become more agentic and complex, testing needs to move beyond single prompts.
Struck.build is exploring more structured simulation workflows where full user journeys, edge cases, and multi-step reasoning paths can be tested before production.
This brings integration testing to AI: the equivalent of unit tests, but for whole agents and workflows.
Instead of asking “did the answer work?”, the question becomes:
“Will this entire workflow succeed reliably in production?”
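LangWatch's open-source Scenario library targets exactly this kind of simulation. Below is a sketch modeled on its public examples; the class and method names are assumptions that may differ between releases, and `answer_question` is the hypothetical pipeline from the earlier integration sketch:

```python
# Scenario-style agent test: a simulated user plays out a full journey
# and a judge agent scores the outcome. Names follow public examples of
# LangWatch's Scenario library but should be treated as assumptions.
import pytest
import scenario

scenario.configure(default_model="openai/gpt-4o-mini")

class ConstructionAssistant(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput):
        # Delegate to the real pipeline under test
        return answer_question(input.last_new_user_message_str())

@pytest.mark.asyncio
async def test_delayed_delivery_workflow():
    result = await scenario.run(
        name="delayed steel delivery",
        description="A site manager reports a delayed steel delivery and "
                    "needs a revised schedule recommendation.",
        agents=[
            ConstructionAssistant(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "The assistant asks which work packages are affected",
                "The recommendation is grounded in retrieved project data",
            ]),
        ],
    )
    assert result.success
```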
Stronger Production-to-Evaluation Feedback Loops
Production traces can become evaluation datasets.
That means customer behavior helps improve future testing, and every real-world issue strengthens future releases.
This creates a flywheel:
production → evaluation → optimization → stronger production
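In practice, the first turn of that flywheel can be a small export job. A sketch where `fetch_flagged_traces` is a hypothetical helper standing in for exporting reviewed traces from the platform, feeding the evaluation set from the earlier pre-release sketch:

```python
# Illustrative flywheel step: convert reviewed production traces into new
# evaluation cases so every real-world issue hardens the next release.
# `fetch_flagged_traces` is a hypothetical export helper, not a documented
# LangWatch API.
def traces_to_eval_cases(since: str) -> list[dict]:
    return [
        {
            "question": trace["input"],
            "context": trace["retrieved_context"],
            "expected_behavior": trace["reviewer_note"],
        }
        for trace in fetch_flagged_traces(since=since)  # hypothetical
    ]

# Grow the pre-release evaluation set sketched earlier
evaluation_set.extend(traces_to_eval_cases(since="last_release"))
```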
Key Takeaways
Response generation is not the same as answer quality
Especially in RAG systems, you need to evaluate correctness—not just whether the system replied.
Observability prevents reactive firefighting
Real-time traces and anomaly detection turn debugging from guesswork into fast action.
AI quality needs both monitoring and testing
Production visibility is critical, but true confidence comes from combining monitoring with evaluations and scenario-based testing before release.
Struck.build is a construction technology company building AI-powered solutions for smarter project execution, operational efficiency, and enterprise decision-making.
LangWatch is the AI evaluation and agent testing platform for teams building with LLMs and agents. Monitor production quality, evaluate RAG systems, test changes, prevent regressions, and ship AI with confidence.