Build your first eval in Claude Code with LangWatch (step by step)

Manouk Draisma
You have an agent. It produces output. You have no idea whether that output is correct or just confident-looking. This guide takes you from “no evals” to “a specific eval running against real traces, on a loop” — all from inside Claude Code. The old assumption was that evals require human-led error analysis before you can write anything. That made sense when models were weak. It holds less now: a capable model can give you a working v0, and that v0 is what produces the data you need to improve.
Step 0 — What you’ll need
A running agent (any framework — LangGraph, Google ADK, OpenAI SDK, or raw LLM calls).
A LangWatch account and API key from app.langwatch.ai.
Claude Code with the LangWatch MCP server connected, so Claude can read your traces and create evaluators.
Step 1 — Instrument for tracing
Evals need data, and data means traces. LangWatch auto-instruments your LLM and tool calls so you don’t hand-roll logging.
pip install langwatch
import langwatch
from openai import OpenAI

langwatch.setup()  # reads LANGWATCH_API_KEY
client = OpenAI()


@langwatch.trace()
def run_agent(user_message: str):
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_message}],
    )
    return completion.choices[0].message.content
Set your key, run the agent against a handful of realistic inputs, and confirm the traces show up in the LangWatch dashboard.
export LANGWATCH_API_KEY=sk-lw-...
python run_agent.py
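One way to cover "a handful of realistic inputs" is a small batch driver. The sketch below is an illustration, not part of the LangWatch SDK: `run_batch`, `echo_agent`, and the sample inputs are hypothetical names. The agent is passed in as a plain callable, and errors are recorded rather than raised so one bad input does not stop the rest from producing traces.

```python
from typing import Callable


def run_batch(agent: Callable[[str], str], inputs: list[str]) -> list[dict]:
    """Call the agent once per input and record either the output or the error."""
    results = []
    for text in inputs:
        try:
            results.append({"input": text, "output": agent(text), "error": None})
        except Exception as exc:  # keep going so every input gets attempted
            results.append({"input": text, "output": None, "error": repr(exc)})
    return results


if __name__ == "__main__":
    # Hypothetical stand-in; swap in the traced run_agent function instead.
    def echo_agent(message: str) -> str:
        return f"echo: {message}"

    sample_inputs = [
        "Summarise this bug report: login fails on Safari.",
        "Is this feature request urgent? 'Add dark mode.'",
    ]
    for row in run_batch(echo_agent, sample_inputs):
        print(row["input"], "->", row["output"] or row["error"])
```

Passing `run_agent` in place of `echo_agent` would send each input through the traced function, so each call should appear as its own trace.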
Step 2 — Let Claude propose an eval from your traces
With the LangWatch MCP connected, Claude can query your project. Ask it to look across recent traces and suggest a candidate eval. In practice you’ll get output-level suggestions like:
Groundedness — do the claims in the response actually appear in the retrieved context?
Answer correctness — does the response match a known-good answer (LLM factual match)?
Format validity — is the output valid JSON / does it conform to the expected schema?
These are a fine v0. Trust the model for the first pass — then sharpen.
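Of the three suggestions, format validity is the easiest to reason about because it needs no judge model. As a rough local sketch (not the LangWatch evaluator itself), the check could look like the following; `is_valid_format` and the `priority` / `summary` keys are assumptions made up for illustration.

```python
import json


def is_valid_format(output: str, required_keys: set[str]) -> bool:
    """Return True if output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    # A bare list or string is valid JSON but not the object we expect.
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


if __name__ == "__main__":
    expected = {"priority", "summary"}  # hypothetical schema
    print(is_valid_format('{"priority": "high", "summary": "Login bug"}', expected))
    print(is_valid_format("Sure! Here is the JSON you asked for...", expected))
```

A check like this only says whether the shape is right, not whether the content is, which is why it pairs with the groundedness and correctness evals rather than replacing them.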
Step 3 — Get specific about what you actually care about
Output-level evals tell you something broke; component-level evals tell you what. Ask Claude something targeted, for example: “build an eval that checks whether each source cited in the answer actually exists in the retrieved context.” Claude maps it to a concrete LangWatch evaluator — here, an LLM-as-judge configured as a boolean check:
# Claude can create this via the LangWatch MCP, or you can
# define it explicitly. Conceptually:
evaluator = {
    "type": "langevals/llm_boolean",
    "name": "citation-grounded",
    "settings": {
        "prompt": "Does every source cited in the answer"
                  " actually appear in the provided context?",
    },
    "mappings": {"output": "answer", "contexts": "retrieved"},
}
LangWatch ships a catalogue of evaluators so you’re rarely starting from zero: exact match, semantic similarity, LLM factual/answer match, RAGAS faithfulness and context precision/recall, valid-format and off-topic checks, plus safety evaluators (PII, moderation, jailbreak). Pick the closest one and let the judge handle the nuance.
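When citations follow a predictable format, a deterministic pre-check can sit alongside the LLM judge and catch the obvious cases cheaply. The sketch below is not a LangWatch evaluator. It assumes citations look like `[doc-1]` and that each retrieved document is a dict with an `id` field; both are hypothetical conventions, so adjust them to whatever your agent emits.

```python
import re

# Assumed citation format: an id in square brackets, e.g. [doc-1].
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def uncited_sources(answer: str, retrieved: list[dict]) -> list[str]:
    """Return cited source ids that do not appear in the retrieved context."""
    known_ids = {doc["id"] for doc in retrieved}
    missing = []
    for source_id in CITATION_PATTERN.findall(answer):
        # Keep first-seen order and skip repeats of the same missing id.
        if source_id not in known_ids and source_id not in missing:
            missing.append(source_id)
    return missing


def citation_grounded(answer: str, retrieved: list[dict]) -> bool:
    """Boolean form of the check: True when every cited id was retrieved."""
    return not uncited_sources(answer, retrieved)


if __name__ == "__main__":
    retrieved = [{"id": "doc-1", "text": "..."}, {"id": "doc-2", "text": "..."}]
    print(uncited_sources("See [doc-1] and [doc-3].", retrieved))
```

Returning the list of missing ids, rather than only a boolean, gives the failure something concrete to point at when you read the results.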
Step 4 — Run it across your spans and read the failures
Run the evaluator over your traces. The win isn’t the score — it’s that the failures cluster into named categories with example spans attached. You can finally say “the agent inflates priority on long feature requests” instead of “it feels off sometimes.” You can’t fix what you can’t name; the eval names it.
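To make "failures cluster into named categories with example spans attached" concrete, here is a minimal sketch of the grouping step. It assumes each evaluation result has already been exported as a dict with `span_id`, `passed`, and `category` fields; that shape is hypothetical, not the LangWatch export format.

```python
from collections import defaultdict


def cluster_failures(results: list[dict]) -> dict[str, list[str]]:
    """Map each failure category to the span ids that exhibited it."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for result in results:
        if not result["passed"]:
            clusters[result["category"]].append(result["span_id"])
    # Largest cluster first, so the most common failure is read first.
    ordered = sorted(clusters.items(), key=lambda item: len(item[1]), reverse=True)
    return dict(ordered)


if __name__ == "__main__":
    results = [  # hypothetical evaluation results
        {"span_id": "s1", "passed": False, "category": "inflated-priority"},
        {"span_id": "s2", "passed": True, "category": None},
        {"span_id": "s3", "passed": False, "category": "missing-citation"},
        {"span_id": "s4", "passed": False, "category": "inflated-priority"},
    ]
    for category, span_ids in cluster_failures(results).items():
        print(category, len(span_ids), span_ids)
```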
Step 5 — Wire a self-improvement loop
A one-time eval is a snapshot. The leverage is in the loop, and it’s three moves on a schedule:
Fetch every span the eval flagged in the last cycle.
Group the failures by category to find the pattern.
Propose one concrete fix — a prompt change or scoring adjustment — for the most common category.
In LangWatch this becomes an online evaluation monitor: the evaluator runs automatically against new traces as they arrive, and a trigger can fire a Slack message or webhook when failure rates cross a threshold. Ask Claude to set it up, then review what it proposes.
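The decision logic behind one cycle is small enough to sketch locally. The code below is an assumption about how such a cycle could be summarised, not LangWatch's monitor or trigger API: `CycleReport`, `review_cycle`, the result fields, and the 0.2 default threshold are all hypothetical. It computes the failure rate, decides whether the threshold was crossed, and names the most common failure category as the candidate for one fix.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class CycleReport:
    failure_rate: float
    alert: bool                  # True when the rate crossed the threshold
    top_category: Optional[str]  # most common failure category, if any
    approved: bool = False       # a proposal stays unapproved until a person signs off


def review_cycle(results: list[dict], threshold: float = 0.2) -> CycleReport:
    """Summarise one cycle of eval results into a single proposal."""
    failures = [r for r in results if not r["passed"]]
    rate = len(failures) / len(results) if results else 0.0
    counts = Counter(r["category"] for r in failures)
    top = counts.most_common(1)[0][0] if counts else None
    return CycleReport(failure_rate=rate, alert=rate > threshold, top_category=top)


if __name__ == "__main__":
    results = [  # hypothetical results from the last cycle
        {"passed": False, "category": "inflated-priority"},
        {"passed": False, "category": "missing-citation"},
        {"passed": False, "category": "inflated-priority"},
        {"passed": True, "category": None},
    ]
    print(review_cycle(results, threshold=0.5))
```

The report deliberately stops at a proposal: nothing in it changes a prompt or an evaluator on its own.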
Keep the human in the loop. Eval changes need your sign-off before they run; agent changes need your sign-off before they ship. The loop proposes, you approve. That’s the design: automation for the grunt work, judgment stays with you.
What you have now
A real eval, running against real traces, with named failure categories and a loop that keeps surfacing the next problem. That’s the entire middle layer of the pyramid, stood up in one session. Next we climb to the peak.
Next in the series: single-turn evals can’t prove an agent handles a real, multi-turn conversation. Time for simulations.

