Introducing LangWatch MCP: Test and evaluate AI Agents without leaving your workflow

Manouk
Mar 12, 2026
Your AI coding assistant can now build, version, and ship evals. No platform context-switching required.
The way teams build AI agents has changed. Developers live in Claude Code, Cursor, and Copilot. Product managers describe requirements in natural language. Engineers prototype faster than ever. But agent evaluation — the discipline of knowing whether your agent actually works — has lagged behind. It still requires logging into a separate platform, learning new UIs, writing boilerplate, and context-switching out of the flow where real work happens.
We're re-launching the LangWatch MCP server: a Model Context Protocol integration that brings LangWatch's evaluation infrastructure directly into your AI coding environment. Wherever your assistant has MCP support (Claude Code, Claude Desktop, Copilot), it can now push prompts, create evaluation scenarios, run experiments, and compare models, all without you leaving the conversation.
What is the LangWatch MCP?
MCP (Model Context Protocol) is Anthropic's open standard for giving AI assistants structured access to external tools and services. With the LangWatch MCP server installed, your AI assistant gains the ability to:
Version and sync prompts — push Jinja or plain-text prompt templates directly to LangWatch's prompt registry, with full version history
Create evaluation scenarios — generate structured test cases (with expected tool calls, first messages, and judge criteria) from natural language descriptions
Bootstrap evaluation notebooks — scaffold Python notebooks for side-by-side model comparisons using LangWatch's targets feature
Configure platform evaluators — set up LLM-as-a-judge metrics (groundedness, answer relevance, custom criteria) that can gate CI/CD pipelines
Run simulation-based tests — let an AI user simulator talk to your agent and a judge evaluate whether it behaved correctly
Migrate prompts, traces, and more — move everything from Langfuse or other frameworks to LangWatch
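To make the CI/CD gating idea above concrete, here is a minimal sketch of what an LLM-as-a-judge gate amounts to. The metric names, scores, and thresholds are illustrative assumptions for this sketch, not LangWatch's actual API:

```python
# Hedged sketch of an evaluator-based CI gate. Metric names, scores,
# and thresholds are illustrative, not LangWatch's actual API.

def gate(scores: dict, thresholds: dict) -> list:
    """Return the metrics that fall below their required threshold."""
    return [m for m, required in thresholds.items() if scores.get(m, 0.0) < required]

# Example judge scores from one evaluation run (made-up values):
scores = {"groundedness": 0.91, "answer_relevance": 0.74}
thresholds = {"groundedness": 0.85, "answer_relevance": 0.80}

failures = gate(scores, thresholds)
if failures:
    print("Evaluation gate failed for:", ", ".join(failures))
    # In CI you would exit non-zero here to block the deploy.
```

The point of wiring this into a pipeline is that a prompt or model change can never ship without the judge metrics clearing their bar.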
The result: the entire evaluation lifecycle, from "this behavior feels wrong" to "here are 53 structured tests running against production," can happen in a single conversation.
Why it matters: three teams who used it in the wild
We didn't build the MCP in isolation. In all honesty, this MCP has existed since last summer, but it only recently became popular. Three teams helped us validate it on real codebases, real agents, and real evaluation problems. Here's what happened.
A PM writes 53 agent test scenarios with Claude Code
An HR and payroll platform runs AI agents that handle employee mutations, salary changes, onboarding, and tax settings. Product managers could identify broken agent behaviors, but couldn't write or run tests themselves. The team had been using Langfuse, but found it, in the PM's words, "super confusing."
The solution was to stay in her existing environment, Claude, and install the LangWatch MCP. She already had a markdown document with the agent's tools, business rules, and manually verified test cases. Her new workflow:
Chat with Claude about what needs testing, in her own words
Claude generates structured scenarios with criteria, expected tool calls, and first messages
Claude pushes them to LangWatch via MCP in one shot
Scenarios run as simulations against their dev and production agents
In one conversation, she generated 63 scenarios across 11 categories: happy paths, edge cases, delete operations, wage tax mutations, out-of-service employees, and validation checks. Before this, they had "not a lot" of manual tests and were eyeballing average scores across ~1,000 questions.
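The scenarios Claude generates follow a consistent structure. A minimal sketch of one, with field names that are illustrative assumptions rather than LangWatch's exact schema:

```python
# Illustrative shape of one generated scenario. Field names are an
# assumption for this sketch, not LangWatch's exact schema.
scenario = {
    "name": "Wage mutation for an out-of-service employee",
    "first_message": "Please raise the wage of employee 4521, who left last month.",
    "expected_tool_calls": ["lookup_employee", "reject_mutation"],
    "judge_criteria": [
        "The agent refuses wage mutations for out-of-service employees",
        "The agent explains why the mutation was rejected",
    ],
}

# Deterministic sanity check before pushing: every scenario needs a
# first message and at least one judge criterion.
assert scenario["first_message"] and scenario["judge_criteria"]
```

Because the structure is uniform, pushing dozens of these via MCP in one shot is no harder than pushing one.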
Their reaction on Slack: "Really like this way of creating scenarios!"
Migrating from Langfuse and bootstrapping evals in one session
A B2B SaaS platform’s Senior AI Engineer came to us with a familiar list of frustrations with Langfuse: no way to benchmark new model releases on past production data, broken support for Jinja templates in prompt testing, and no simulation-based testing for their multi-turn “Bid Companion” chat agent.
Their team summarized it well:
“While I agree we need more than what Langfuse offers, I feel it also needs less. They manage to make what they have super confusing.”
He installed the LangWatch MCP in his Claude Code environment and pointed it at the existing Python backend. His prompt to Claude was simple:
“Use the LangWatch MCP to completely replace the current Langfuse implementation in the Python backend. Follow OpenTelemetry guidelines, copy the existing prompts into LangWatch, but don’t delete anything from Langfuse.”
Claude read the Langfuse integration, understood the tracing and prompt management setup, and rewired everything to LangWatch in the same session without leaving the development workflow.
From there, it started bootstrapping the evaluation layer automatically:
Prompt versioning
Existing Jinja prompts converted into versioned YAML files and synced to LangWatch via CLI.
Scenario tests for the Bid Companion
Multi-turn simulations checking whether the agent stays on topic, asks clarification questions, and avoids hallucination.
Model benchmarking experiments
Evaluation notebooks comparing GPT-4o, Gemini, and Anthropic models on past production datasets to estimate quality and cost before migrating workloads across cloud providers.
Golden dataset regression tests
Must-pass scenarios for key behaviors and structured extraction tasks (like parsing requirements tables from procurement PDFs) to ensure prompt or model changes don’t break critical use cases.
Platform evaluators for groundedness and relevance
LLM-as-a-judge metrics added as CI/CD gates so prompt or model updates automatically trigger evaluations.
Previously, every prompt release meant manually walking through test cases from PDFs and checking traces one by one. Now the workflow is simple: change a prompt or model in the YAML file, run langwatch prompt sync, rerun the evaluation notebook, and compare results on the dashboard — all without leaving the coding environment.
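Under this workflow, a versioned prompt file might look something like the following sketch. The YAML field names are illustrative assumptions, not LangWatch's exact schema; the `langwatch prompt sync` command is the one from the team's actual workflow:

```yaml
# prompts/bid_companion.yaml -- illustrative layout, not the exact schema
model: gpt-4o
messages:
  - role: system
    content: |
      You are the Bid Companion for procurement teams.
      Answer only from the provided context.
      Context: {{ context }}
```

Editing this file, running the sync, and rerunning the evaluation notebook is the whole release loop.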
Setting up evals for a government grant-application system via MCP
The AI & Data team builds multi-agent LangGraph workflows for the Dutch government. Their case: a grant application assessment system where documents (financial plans, project plans, VOG certificates) are routed to specialist analysis agents. Their sharpest question going in: "How should evals differ for agents with varying degrees of autonomy?"
Before using LangWatch, the team received instructions: install the LangWatch MCP, point it at your existing project, and use it to version prompts, create evaluation datasets, build Jupyter notebooks, and scaffold scenario simulations. They showed up with instrumentation already in place.
Their LangGraph pipeline has three layers: a router node that classifies incoming documents, specialist analysis nodes that evaluate documents against domain criteria, and an aggregator node that synthesizes findings into a final pass/fail. The testing pyramid they built:
6 scenario tests for end-to-end behavior: submit a good financial plan, verify correct routing, positive analysis, and a passing assessment. Deterministic assertions for routing, LLM judge criteria for analysis quality.
Evaluation notebooks comparing gpt-4.1-mini vs gpt-4.1-nano across 8 sample documents — router accuracy, pass/fail accuracy, and qualitative evaluators, all side-by-side via the targets feature
Platform evaluators for findings quality — created once on LangWatch, referenced by ID, called from notebooks via SDK
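The three-layer pipeline and its deterministic routing assertion can be sketched as follows. The routing rules and document types here are simplified stand-ins for the real LLM-backed LangGraph nodes:

```python
# Simplified stand-in for the three-layer pipeline:
# router -> specialist analysis -> aggregator. The real nodes are
# LLM-backed; here the router is a keyword classifier so the
# routing assertion stays deterministic.

def route(document: str) -> str:
    """Router node: classify an incoming document."""
    if "budget" in document.lower():
        return "financial_plan"
    if "milestones" in document.lower():
        return "project_plan"
    return "vog_certificate"

def analyze(doc_type: str, document: str) -> dict:
    """Specialist node: evaluate the document against domain criteria."""
    return {"type": doc_type, "finding": "complete", "score": 0.9}

def aggregate(findings: list) -> str:
    """Aggregator node: synthesize findings into a final pass/fail."""
    return "pass" if all(f["score"] >= 0.7 for f in findings) else "fail"

# Deterministic assertion for routing, as in the team's scenario tests:
doc = "Financial plan with a detailed budget for 2026"
assert route(doc) == "financial_plan"
verdict = aggregate([analyze(route(doc), doc)])
```

The deterministic checks guard the router; the judge criteria take over where "good analysis" can't be expressed as an assertion.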
Non-technical team members got a parallel path: upload a CSV dataset on the platform, add a prompt, wire up evaluators, and run evaluations, no code required.
Result: using Scenario plus our MCP setup, the team uncovered multiple issues in their PoC and validated strong scenarios. That's exactly why we believe agent simulations are quickly becoming the missing layer between building AI agents and confidently deploying them.
Getting started
Install the LangWatch MCP server in your environment of choice (Claude Code, Claude Desktop, or any MCP-compatible assistant) and connect it to your LangWatch workspace. From there, you can describe your agent in natural language and ask your assistant to:
Version your existing prompts
Generate scenario tests for your key workflows
Scaffold an evaluation notebook for model comparisons
Set up platform evaluators for qualitative metrics
The full setup guide and MCP server documentation are available at MCP Docs.
Agent evaluation shouldn't require a dedicated platform expert, a new UI to learn, or a context switch out of the environment where you do your best work. With the LangWatch MCP, it doesn't.
Let us know if you need a walkthrough; our engineers are here to help. Book a call here.

