Making your AI Agent reliable: Adding Evaluations to your multi-modal agent with LangWatch Skills

Sergio Cardenas

Mar 24, 2026

How I went from "it works on my machine" to measurable agent quality using LangWatch Skills, Jupyter notebooks, and a path to production on AWS.

The problem nobody talks about

You built an agent. It uses tools, handles multimodal inputs, answers questions from a knowledge base. You demo it to your team and it works great. Ship it.

Three days later: the satellite image analysis returns garbage NDVI estimates. The knowledge base tool stops getting called for calibration questions; the LLM just wings it. Nobody noticed because there were no tests.

This is the gap between "I have an agent" and "I have a reliable agent." LangWatch fills it.

What I built

The InField Agent is a weather station advisory system built with Strands Agents SDK. It has three multimodal capabilities:

  1. Knowledge base — calibration procedures for Davis Instruments weather stations

  2. Station status — fleet inventory, battery health, reporting gaps

  3. Satellite imagery — NDVI estimation from satellite images using vision models
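
For readers unfamiliar with the metric: NDVI (Normalized Difference Vegetation Index) is computed from near-infrared and red reflectance and always falls in [-1, 1]. A minimal reference implementation, just to pin down what the agent is estimating:

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).

    Healthy vegetation reflects strongly in near-infrared, so dense canopy
    pushes NDVI toward 1; bare soil and water sit near or below 0.
    """
    if nir + red == 0:
        raise ValueError("NIR and Red reflectance cannot both be zero")
    return (nir - red) / (nir + red)
```

For example, `ndvi(0.5, 0.1)` is about 0.67, squarely in "healthy vegetation" territory. The agent's vision tool estimates this value from RGB imagery rather than computing it from raw bands.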

from strands import Agent
from strands.models.openai import OpenAIModel

def create_agent() -> Agent:
    model = OpenAIModel(model_id="gpt-5-mini")
    return Agent(
        model=model,
        system_prompt=SYSTEM_PROMPT_ADV,
        tools=[search_knowledge_base_tool, check_station_status, analyze_satellite_image],
    )

The satellite tool sends images to a vision model and gets back structured NDVI data:

@tool
def analyze_satellite_image(image_id: str) -> dict:
    """Analyze a satellite image to estimate NDVI."""
    image_path = _DATA_DIR / f"{image_id}.png"
    # ... encode image as base64, send to gpt-5-mini with vision ...
    return {
        "ndvi_estimate": 0.65,
        "confidence": "medium",
        "vegetation_cover_pct": 72,
        "dominant_land_types": ["cropland", "grassland"],
        "summary": "Healthy vegetation with moderate crop coverage"
    }
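
Because the vision model returns free-form JSON, it is worth sanity-checking the tool's output before trusting it downstream. A hedged sketch of such a check (field names mirror the dict above; the bounds come from the dataset's expectation that NDVI lies between -1.0 and 1.0, and the confidence levels are an assumption):

```python
def validate_ndvi_result(result: dict) -> list[str]:
    """Return a list of problems with a satellite-analysis result (empty = OK)."""
    problems = []
    ndvi = result.get("ndvi_estimate")
    if not isinstance(ndvi, (int, float)) or not -1.0 <= ndvi <= 1.0:
        problems.append(f"ndvi_estimate out of range [-1, 1]: {ndvi!r}")
    if result.get("confidence") not in {"low", "medium", "high"}:
        problems.append(f"unexpected confidence: {result.get('confidence')!r}")
    cover = result.get("vegetation_cover_pct")
    if not isinstance(cover, (int, float)) or not 0 <= cover <= 100:
        problems.append(f"vegetation_cover_pct out of range [0, 100]: {cover!r}")
    return problems
```

A check like this catches the "plausible-sounding garbage" case cheaply, before any LLM-based evaluator runs.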

This is a non-trivial agent to test. You have text-based retrieval, structured data queries, and multimodal vision analysis all behind the same prompt. Traditional unit tests cover maybe 10% of the failure surface.

Step 1: Add LangWatch skills

LangWatch ships skills — curated Claude Code instructions that know how to wire up tracing, evaluations, scenarios, and prompt management in your project. Think of them as recipes that understand the LangWatch SDK.

npx skills add langwatch/skills/evaluations
npx skills add langwatch/skills/scenarios
This drops skill files into .claude/skills/ in your project. When you use Claude Code, it picks up these instructions and knows exactly how to scaffold evaluations and scenarios for your specific agent.

Installation also creates a skills-lock.json to track skill versions:

{
  "version": 1,
  "skills": {
    "evaluations": {
      "source": "langwatch/skills",
      "sourceType": "github",
      "computedHash": "170c4e99..."
    },
    "scenarios": {
      "source": "langwatch/skills",
      "sourceType": "github",
      "computedHash": "b3afbe5c..."
    }
  }
}

Step 2: Tracing — see what your agent actually does

Before you evaluate anything, you need observability. LangWatch tracing captures every LLM call, tool invocation, and input/output pair.

import langwatch
from pathlib import Path
from dotenv import load_dotenv

load_dotenv(Path(__file__).parent / ".env")
langwatch.setup()

@langwatch.trace(name="InField Agent Turn")
def handle_turn(agent, user_input: str, thread_id: str):
    langwatch.get_current_trace().update(metadata={
        "thread_id": thread_id,
    })
    result = agent(user_input)
    return result.message["content"][-1]["text"]

Two lines of setup, one decorator. Every agent turn now shows up in the LangWatch dashboard with the full tool chain visible.

Step 3: Multimodal experiments in Jupyter

This is where it gets interesting. The evaluations skill guided me toward using Jupyter notebooks with langwatch.experiment for batch testing. The key insight: satellite images can be embedded as markdown in the dataset, and LangWatch renders them inline in the UI.

The dataset

Each row targets one of the three capabilities. Satellite rows include the actual image:

SATELLITE_BASE_URL = "https://storage.googleapis.com/experiments_langwatch"

def image_to_markdown(image_id: str) -> str:
    return f"![Satellite image {image_id}]({SATELLITE_BASE_URL}/{image_id}.png)"

dataset = [
    # Knowledge base
    {
        "input": "How do I calibrate the temperature reading on a Vantage Pro2?",
        "expected_output": "Use the temperature calibration offset in the console setup menu.",
        "capability": "knowledge_base",
    },
    # Station status
    {
        "input": "Which stations have low battery levels?",
        "expected_output": "A list of stations with battery voltage below 3.0V.",
        "capability": "station_status",
    },
    # Satellite — multimodal
    {
        "input": "Analyze this satellite image and estimate the NDVI.",
        "image": image_to_markdown("01"),
        "expected_output": "An NDVI estimate between -1.0 and 1.0 with vegetation coverage.",
        "capability": "satellite",
    },
    {
        "input": "What does this satellite image tell us about vegetation health?",
        "image": image_to_markdown("03"),
        "expected_output": "An NDVI estimate with vegetation health description.",
        "capability": "satellite",
    },
    {
        "input": "Estimate the vegetation index for this field.",
        "image": image_to_markdown("07"),
        "expected_output": "An NDVI estimate with vegetation cover and land classification.",
        "capability": "satellite",
    },
]
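
The experiment loop further down iterates a pandas DataFrame named `df`, which the notebook builds from this list. A minimal, self-contained version of that step (two sample rows inlined here for illustration):

```python
import pandas as pd

dataset = [
    {"input": "Which stations have low battery levels?",
     "expected_output": "A list of stations below 3.0V.",
     "capability": "station_status"},
    {"input": "Analyze this satellite image and estimate the NDVI.",
     "image": "![Satellite image 01](https://example.com/01.png)",
     "expected_output": "An NDVI estimate between -1.0 and 1.0.",
     "capability": "satellite"},
]

# Text-only rows get NaN in the "image" column, which is why the
# experiment loop guards with pd.notna(row.get("image")).
df = pd.DataFrame(dataset)
```

The NaN guard matters: sending a NaN image field to an evaluator would be silently wrong rather than loudly broken.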

The evaluators

LangWatch supports platform-configured evaluators that you reference by slug. I set up three:

| Evaluator | What it checks |
| --- | --- |
| answer-relevancy | Is the answer relevant to the question? (RAGAS-based) |
| answer-correctness | Does the output match the expected output? (LLM match) |
| tool-usage-check | Did the agent call the right tool for the capability? (LLM boolean) |

The tool usage check is critical. An agent that answers correctly without calling the tool is a hallucination risk; it just happened to get lucky this time.
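
The platform evaluator is LLM-based, but the underlying logic is easy to prototype locally if your traces expose which tools were called. A hypothetical sketch (the capability-to-tool mapping uses the tool names from this project; `called_tools` would come from the trace, and this is an illustration of the check, not LangWatch's implementation):

```python
EXPECTED_TOOL = {
    "knowledge_base": "search_knowledge_base_tool",
    "station_status": "check_station_status",
    "satellite": "analyze_satellite_image",
}

def tool_usage_ok(capability: str, called_tools: list[str]) -> bool:
    """Pass only if the tool mapped to this capability was actually invoked.

    A correct answer with no tool call is still a fail: the model answered
    from parametric memory, which will not generalize to live station data.
    """
    required = EXPECTED_TOOL[capability]
    return required in called_tools
```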

The experiment loop

experiment = langwatch.experiment.init("infield-agent-multimodal")

for index, row in experiment.loop(df.iterrows(), threads=1):
    output = run_agent(row["input"])

    data = {"input": row["input"], "output": output}
    if pd.notna(row.get("image")):
        data["image"] = row["image"]

    experiment.evaluate("answer-relevancy-nxwec", index=index, data=data)
    experiment.evaluate("answer-correctness-b5e6x", index=index, data={**data, "expected_output": row["expected_output"]})
    experiment.evaluate("tool-usage-check-aljvk", index=index, data=data)

What you see in LangWatch

As the notebook runs, results appear in real time:

  • Satellite images rendered inline next to scores

  • Pass/fail per evaluator per row

  • Scores comparable across model versions and prompt changes side by side

That last point is the payoff. Change a prompt, run the experiment again, see exactly what moved. Not just for one input — across the whole dataset.

The @langwatch.trace decorator also means every evaluation run produces full traces. Drill into a failing row and see exactly which tool was called, what the LLM received, and where it went wrong.

Step 4: Simulations — test the agent as a system

Evaluations test isolated input-output pairs. Simulations test multi-turn conversations where the agent interacts with a simulated user.

LangWatch Scenario is the framework. It has three actors:

  • Agent Under Test — your agent

  • User Simulator — an LLM that generates realistic user messages

  • Judge — an LLM that evaluates the conversation and decides pass/fail

import pytest
import scenario

scenario.configure(default_model="openai/gpt-4.1-mini")

@pytest.mark.asyncio
async def test_calibration_workflow():
    class InFieldAdapter(scenario.AgentAdapter):
        async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
            return run_agent(input.messages[-1]["content"])

    result = await scenario.run(
        name="calibration guidance",
        description="A field technician needs to calibrate barometric pressure on a Vantage Pro2. They have a known reference pressure but aren't sure about the procedure.",
        agents=[
            InFieldAdapter(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(),
        ],
        script=[
            scenario.user("How do I calibrate the barometric pressure?"),
            scenario.agent(),
            scenario.judge(criteria=[
                "Agent used the knowledge base tool",
                "Agent provided step-by-step calibration instructions",
            ]),
            scenario.user(),
            scenario.agent(),
            scenario.judge(criteria=[
                "Agent answered the follow-up using tool results, not general knowledge",
            ]),
        ],
    )
    assert result.success

The simulation loop runs automatically: the user simulator generates contextual follow-ups, the agent responds, the judge scores against your criteria. You define the scenario once and it tests the full conversation flow.

For adversarial testing, swap UserSimulatorAgent for RedTeamAgent:

agents=[
    InFieldAdapter(),
    scenario.RedTeamAgent(),  # Tries to make the agent hallucinate or go off-topic
    scenario.JudgeAgent(criteria=["Agent stays within scope of weather stations"]),
]

Step 5: Deploy to AWS

With evaluations passing and simulations green, ship it.

The InField Agent is a single-turn Q&A system, so Lambda is the natural fit.

The knowledge base is bundled inside the Lambda package. LLM inference runs on OpenAI's servers. Lambda just orchestrates the agent loop.

Lambda handler

import os

from strands import Agent
from strands.models.openai import OpenAIModel

agent = Agent(
    model=OpenAIModel(model_id=os.environ.get("MODEL_ID", "gpt-5-mini")),
    system_prompt=SYSTEM_PROMPT_ADV,
    tools=[search_knowledge_base_tool],
)

def handler(event, _context):
    prompt = event.get("prompt", "")
    if not prompt:
        return {"statusCode": 400, "body": "Missing 'prompt' in request"}
    result = agent(prompt)
    return {"statusCode": 200, "body": result.message["content"][-1]["text"]}

Package and deploy

Three options, depending on your dependency size:

ZIP + Strands Layer (simplest):

aws lambda create-function \
  --function-name infield-agent \
  --runtime python3.12 \
  --handler lambda_handler.handler \
  --zip-file fileb://packaging/app.zip \
  --architectures arm64 \
  --memory-size 256 \
  --timeout 30 \
  --layers "arn:aws:lambda:us-east-1:856699698935:layer:strands-agents-py312-aarch64:1" \
  --environment "Variables={OPENAI_API_KEY=your-key}" \
  --role <your-lambda-execution-role-arn>


Container Image (when dependencies exceed 250 MB):

FROM public.ecr.aws/lambda/python:3.12-arm64
COPY requirements.txt .
RUN pip install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"
COPY lambda_handler.py prompts.py tools.py ${LAMBDA_TASK_ROOT}/
COPY knowledge_base/ ${LAMBDA_TASK_ROOT}/knowledge_base/
CMD ["lambda_handler.handler"]

| Setting | Value | Why |
| --- | --- | --- |
| Architecture | arm64 | Lower cost, matches official layer |
| Memory | 256 MB | Inference runs on OpenAI, not Lambda |
| Timeout | 30 seconds | Single-turn agent completes in <15s |

Production checklist

  • Store API keys in Secrets Manager or SSM Parameter Store

  • Enable CloudWatch logging

  • Set up API Gateway with authentication

  • Configure CloudWatch alarms on error rate and duration

  • Run evaluations in CI before deploying new versions

That last point closes the loop: your Jupyter notebook evaluations become a CI gate. A prompt change that drops answer relevancy below threshold blocks the deploy.
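
A minimal sketch of such a gate, assuming you can export per-evaluator scores from the experiment run (the thresholds and the shape of the `scores` dict are illustrative, not a LangWatch API):

```python
# Minimum acceptable mean score per evaluator; tool usage must always pass.
THRESHOLDS = {"answer-relevancy": 0.8, "answer-correctness": 0.7, "tool-usage-check": 1.0}

def ci_gate(scores: dict[str, list[float]]) -> bool:
    """Fail the build if any evaluator's mean score drops below its threshold."""
    for evaluator, threshold in THRESHOLDS.items():
        values = scores.get(evaluator, [])
        mean = sum(values) / len(values) if values else 0.0
        if mean < threshold:
            print(f"FAIL {evaluator}: mean {mean:.2f} < {threshold}")
            return False
    return True
```

Wire a check like this into a pytest case or a CI step and a regressive prompt change never reaches production.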

The evaluation lifecycle

Here is what you end up with: one evaluation lifecycle spanning development (notebook experiments), CI (a pre-deploy gate), deployment, and production (scoring live traces).

LangWatch covers all four stages with the same evaluators. The answer-relevancy check you run in a notebook is the same one that scores production traces. Consistency across the lifecycle means no surprises.

What this cost me

  • Setup time: ~30 minutes. The skills did most of the scaffolding.

  • Notebook evaluations: 5 dataset rows (3 with satellite images) x 3 evaluators = 15 evaluation calls per run. Under a minute.

  • Lambda deployment: 256 MB, arm64, 30s timeout. Pennies at low volume.

  • LangWatch traces: free tier covers experimentation. Platform evaluators included.

Takeaways

  1. Evaluations are not optional for multimodal agents. A satellite image tool that returns plausible-sounding garbage is worse than one that throws an error. You need automated checks.

  2. Tool usage matters as much as answer quality. An agent that gives the right answer without calling the tool is a ticking time bomb. The tool-usage-check evaluator catches this.

  3. Simulations find bugs that evaluations miss. Single-turn evaluations cannot test whether the agent stays grounded across a multi-turn conversation. Scenario simulations can.

  4. LangWatch skills bootstrap the hard part. npx skills add langwatch/skills/evaluations gives Claude Code the context to scaffold everything — the notebook, the evaluators, the experiment loop. You focus on defining what "correct" means for your agent.

  5. Same evaluators, every stage. Run them in a notebook during development, in CI before deploy, and on live traces in production. One set of quality criteria, applied everywhere.

All code is available at https://github.com/langwatch/satellite-agent. The Jupyter notebook runs end to end if you have an OpenAI key and a LangWatch project.

Ship agents with confidence, not crossed fingers

Get up and running with LangWatch in as little as 5 minutes.
