Loop Engineering Is a Measurement Problem

Rogerio Chaves
Stacking loops is how reliable agents get built. But the loops only compound when you can trust what each one is telling you.
There's a quiet consensus forming about how reliable agents actually get built. It isn't a bigger model or a cleverer prompt. It's loops — and the harness you stack around the model to run them.
swyx named it loopcraft. LangChain's recent piece, The Art of Loop Engineering, laid out a clean version of the stack: an agent loop that calls tools until a task is done, a verification loop that grades the output and retries when it falls short, an event-driven loop that fires the agent from triggers in your ecosystem, and a hill-climbing loop that feeds production traces back into the harness so it improves over time. It's a good map, and worth reading.
We want to add the part we think decides whether any of it actually works.
Every one of those loops runs on signal. And signal doesn't come for free.
The dependency hiding inside every loop
Walk back up the stack and notice what each loop quietly assumes.
The verification loop is a grader. The hill-climbing loop is an analysis agent reading traces. Both of those are, end to end, measurement loops — they exist only to produce a judgment about quality and act on it. Even the loops that look like pure execution lean on measurement underneath. An event-driven agent running unattended in your ecosystem is exactly the situation where nobody is watching, so the only thing standing between "it ran" and "it ran correctly" is whatever you were measuring.
So loop engineering carries a dependency that rarely gets named: trustworthy signal at every level. Stack four loops on top of unreliable signal and you haven't built a compounding system. You've built a fast way to drift.
Who verifies the verifier?
Start with the verification loop, because that's where the assumption is most exposed.
A verification loop is only as good as its grader, and graders are not free lunches. An LLM-as-judge has its own failure modes: it drifts as you swap models, it disagrees with your actual experts more often than you'd like, and it's confidently wrong on exactly the edge cases that matter most. A deterministic check is more stable but only covers what you thought to encode. Either way, the moment you put a grader in charge of deciding whether the agent's work is "good," you've created a new question — is the grader right? — and most teams never answer it.
This is the genuinely hard part of loop engineering, and it's measurement work. You evaluate your evaluators. You hold judges against datasets of human-labeled examples and measure agreement. You version your rubrics and watch for the day a model upgrade silently changes what your grader rewards. A verification loop you haven't validated isn't a safety net. It's an untested assumption running in a tight loop.
This is most of what we work on at LangWatch — evals, datasets, aligning judges to human labels — so we're biased. But the bias comes from watching team after team ship graders they never checked.
The cheapest loop to close is the one before production
The standard framing of the hill-climbing loop is production-first: real runs produce traces, traces reveal problems, the harness improves. True, and powerful. But it quietly concedes that the way you find a problem is by shipping it.
The cheapest place to close a verification loop is before the agent ever reaches a user. That's what simulation is — a verification loop you run in development. You put the agent through realistic, adversarial, multi-turn scenarios, grade the outcomes, and catch the regressions on your laptop instead of in someone's inbox. It's the thinking behind Scenario, our open-source agent simulation framework.
We picture it as a pyramid. Fast unit tests at the base for the deterministic pieces. Evals in the middle for the model-dependent behavior you can score against datasets. Simulations at the top for the end-to-end, multi-turn behavior that only shows up when the whole loop runs. Evals first, simulations after — the same loop logic, applied to your development cycle rather than your production traffic. Teams who do this don't ship fewer agents. They ship agents whose loops were already closed once before anyone trusted them in the wild.
Traces are raw material, not signal
The hill-climbing loop is where value compounds — and also where most teams get disappointed, because it depends entirely on the quality of what you captured.
A raw trace tells you what happened. It does not, on its own, tell you whether what happened was good. To hill-climb you need traces with evaluation attached: graded outcomes, rubric scores, human feedback, the cases that got flagged and why. You need regressions captured into datasets so the next version of the harness is tested against the exact failures the last version produced. Observability that's only logs gives you a searchable record of your mistakes. Observability wired to evaluation gives you a gradient to climb.
That's the difference between watching your agent and improving it. The analysis agent at the top of the stack is only as smart as the signal beneath it, and that signal is something you engineer on purpose.
When the loop rewrites itself, you need a control plane
Here's the move in the hill-climbing loop that should give every platform team pause: the loop doesn't just report back, it reaches inside and rewrites the harness. Prompts change. Tools change. Eventually, for teams running open-weight models, the loop feeds back into the weights themselves.
A system that modifies itself is exactly the system you want governance around — not as bureaucracy, but as the thing that makes the autonomy safe enough to permit. Every harness change needs a version and an author, even when the author is an agent. Sensitive changes need an approval gate. Everything needs an audit trail and a rollback, because the first time a self-improvement run optimizes toward the wrong proxy metric, your only good options are "see exactly what changed" and "undo it."
This is the part of the stack that turns into a control plane: prompt versioning, change review, a policy on what an agent is allowed to alter unsupervised, and a record of every decision the system made about itself. Human-in-the-loop isn't a retreat from automation here. It's the placement of judgment at the few points where being wrong is expensive — precisely the judgment a grader can't supply.
The measurement layer that spans the stack
Put it together and a second stack appears, shadowing the first. For every loop that does work, there's a measurement responsibility that makes the work trustworthy.
Loop | What it does | What it depends on | Failure mode if you skip the measurement |
|---|---|---|---|
Agent loop | Model calls tools until done | Tool feedback the model can read | Confident, unverifiable output |
Verification loop | Grades output, retries on failure | A grader you've actually validated | A loop that enforces the wrong standard, fast |
Event-driven loop | Runs unattended on triggers | Observability you trust without watching | Silent failures at scale |
Hill-climbing loop | Rewrites the harness from traces | Traces wired to evals, governance on changes | Drift toward the wrong proxy, with no rollback |
The loops are the pattern. The measurement is what makes the pattern compound instead of drift.
The sentence doing all the work
Loopcraft is real, and the people calling it the core skill of agent engineering are right. But there's a sentence hiding in the middle of every loop diagram — and then we check whether it worked — and that sentence is doing enormous load-bearing work. Whoever owns it owns whether the system actually gets better.
So by all means, stack your loops. Just remember that a loop you can't measure isn't a feedback loop. It's a fast way to be wrong on a schedule. The teams that win the next few years won't only be the ones who build the most loops. They'll be the ones who can trust what every loop is telling them.
LangWatch is the open-core platform for testing, evaluating, and observing AI agents — evals, simulations with Scenario, and the governance layer that keeps self-improving systems accountable. See what your agents are really doing.

