LangWatch supports multimodal evaluation out of the box. You can evaluate image inputs and outputs using any vision-capable model (GPT-4o, GPT-5.2, Claude Sonnet, Gemini, etc.) through the built-in LLM-as-a-Judge evaluators — no custom code required. This covers common multimodal use cases:
  • Image generation quality — score outputs of image generation models
  • Document parsing — evaluate extracted metadata from PDFs and scanned documents
  • Content moderation — detect NSFW or low-quality uploaded images
  • Visual QA — evaluate answers to questions about images
  • Image comparison — compare generated outputs against reference images
Image support works with all three LLM-as-a-Judge evaluator types:
  • Boolean — pass/fail evaluation (e.g. “Is the generated image photorealistic?”)
  • Score — numeric score evaluation (e.g. “Rate image quality from 1-5”)
  • Category — classification evaluation (e.g. “Classify the image as: excellent / good / poor”)

Supported Image Formats

Images can be provided in any of these formats:
Format           Example
Image URL        https://example.com/photo.png
Base64 data URI  data:image/png;base64,iVBORw0KGgo...
Markdown image   ![alt text](https://example.com/photo.png)

Supported extensions: .png, .jpg, .jpeg, .gif, .webp, .svg, .bmp, .tiff
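If your images are local files rather than hosted URLs, you can convert them to base64 data URIs before adding them to a dataset. A minimal helper sketch (the function name is illustrative, not part of any SDK):

```python
import base64
import mimetypes


def to_data_uri(path: str) -> str:
    """Encode a local image file as a base64 data URI."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot determine image type for {path}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The resulting string can be pasted directly into an image-typed dataset cell or passed as a field value in the SDK.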
Image detection is strict by design — a field is treated as an image only when the entire value is an image URL or base64 string. Mixed text-and-image content is sent as plain text. This prevents unintended multipart content when a field happens to contain an image URL as part of a longer string.
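The all-or-nothing rule can be illustrated with a small sketch. This is not LangWatch's actual implementation, just a hypothetical check showing the behavior described above:

```python
import re

# The entire value must be an image URL or a base64 data URI.
IMAGE_URL = re.compile(
    r"^https?://\S+\.(png|jpe?g|gif|webp|svg|bmp|tiff)$", re.IGNORECASE
)
DATA_URI = re.compile(r"^data:image/[\w.+-]+;base64,[A-Za-z0-9+/=]+$")


def looks_like_image(value: str) -> bool:
    """True only when the ENTIRE value is an image URL or data URI."""
    value = value.strip()
    return bool(IMAGE_URL.match(value) or DATA_URI.match(value))
```

A sentence that merely contains an image URL fails this check and is sent as plain text, which is the intended behavior.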

Evaluating Images via UI

Step 1: Create a Dataset with Image Columns

  1. Go to Evaluations → New Evaluation → Create Experiment
  2. Click + next to the Datasets header to create a new dataset
  3. Add columns and set their type to image using the column type dropdown
Column type selector showing image option
  4. Paste image URLs or base64 data URIs into the cells — the workbench renders them inline with click-to-expand

Step 2: Add an LLM-as-a-Judge Evaluator

  1. Click + Add evaluator on a row in the evaluators section
  2. Select an LLM-as-a-Judge evaluator (Boolean, Score, or Category)
  3. Choose a vision-capable model (e.g. gpt-5.2, claude-sonnet-4-5-20250929)
  4. Write a prompt that references the image fields — map dataset columns to the evaluator’s input, output, contexts, or expected_output variables
The evaluator automatically detects image values and sends them as multipart content to the vision model. No special configuration needed.
LangWatch experiments workbench showing image evaluation with LLM-as-a-Judge score evaluator
In this example, a virtual try-on pipeline is evaluated with three image columns:
  • original → mapped to contexts (the person’s photo)
  • request → mapped to input (the clothing item)
  • generated → mapped to output (the try-on result)
The LLM-as-a-Judge prompt instructs the model to evaluate all three images and score the quality of the generated output.
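The same column-to-variable mapping, expressed as the data payload the evaluator would receive (all URLs here are illustrative):

```python
# Dataset columns mapped to the evaluator's standard variables
evaluation_data = {
    "contexts": ["https://example.com/original-photo.jpg"],  # original: the person's photo
    "input": "https://example.com/clothing-item.jpg",        # request: the clothing item
    "output": "https://example.com/tryon-result.jpg",        # generated: the try-on result
}
```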

Step 3: Run and Iterate

Click the play button to run the evaluator. The model receives all images as vision content and returns structured results (score, pass/fail, or category) with detailed reasoning. Use this workflow to iterate on your evaluator prompt until you have reliable evaluation criteria, then save it for reuse across experiments and CI/CD pipelines.

Custom Workflow Evaluators for Complex Logic

For more advanced evaluation pipelines, you can create a Custom Workflow Evaluator in the Evaluators page. This gives you a visual workflow builder where you can chain multiple LLM nodes, add image variables to prompts, and build multi-step evaluation logic.
LangWatch custom workflow evaluator showing image variables in prompt template
In the workflow builder:
  1. Add image-typed variables to your prompt node inputs
  2. Use {{variable_name}} syntax to reference images in the prompt template
  3. Map dataset columns to the image variables in the entry node
  4. The workflow handles multipart content assembly automatically
This is useful when you need to split evaluation into multiple steps, use different models for different aspects, or combine vision evaluation with text-based checks.

Evaluating Images via SDK

For programmatic evaluation from notebooks or CI/CD, use the Python or TypeScript SDK with a saved evaluator.

Using a Saved Evaluator

After iterating on your evaluator in the UI, save it and call it from code:
import langwatch

df = langwatch.datasets.get_dataset("my-image-dataset").to_pandas()

experiment = langwatch.experiment.init("image-quality-evaluation")

for index, row in experiment.loop(df.iterrows()):
    # Use your saved image evaluator
    experiment.evaluate(
        "evaluators/image-quality-scorer",  # Your saved evaluator slug
        index=index,
        data={
            "input": row["request_image"],      # Image URL or base64
            "output": row["generated_image"],    # Image URL or base64
            "contexts": [row["original_photo"]], # List of context images
        },
    )

Custom Scoring with Vision Models

You can also call vision models directly and log custom scores:
import langwatch
import litellm

df = langwatch.datasets.get_dataset("my-image-dataset").to_pandas()

experiment = langwatch.experiment.init("custom-image-evaluation")

for index, row in experiment.loop(df.iterrows()):
    # Call a vision model directly
    response = litellm.completion(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Rate this generated image quality from 1 to 5. Return only the number."},
                {"type": "image_url", "image_url": {"url": row["generated_image"]}},
            ],
        }],
    )

    score = int(response.choices[0].message.content.strip())

    experiment.log(
        "image_quality",
        index=index,
        data={"output": row["generated_image"]},
        score=score / 5.0,
        passed=score >= 3,
        details=f"Image quality score: {score}/5",
    )

Evaluating Document Parsing (PDFs)

Multimodal evaluation also covers document-based pipelines. Here is an example of evaluating a PDF parsing pipeline that extracts metadata from academic papers:
import langwatch
import pandas as pd
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_text
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Dataset of PDFs with ground truth metadata
df = pd.DataFrame([
    {
        "file": "paper1.pdf",
        "expected_title": "Vibe Coding vs. Agentic Coding",
        "expected_authors": "Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee",
    },
    # ... more rows
])

@langwatch.trace()
def extract_pdf_info(filename):
    langwatch.get_current_trace().autotrack_dspy()
    elements = partition_pdf(filename=filename)
    pdf = elements_to_text(elements=elements)
    return dspy.Predict(
        "pdf -> title: str, author_names: str, github_link: Optional[str]"
    )(pdf=pdf)

# Run the evaluation
evaluation = langwatch.experiment.init("pdf-parsing-evaluation")

for index, row in evaluation.loop(df.iterrows()):
    response = extract_pdf_info(row["file"])

    evaluation.log(
        "author_names_accuracy",
        index=index,
        passed=response.author_names == row["expected_authors"],
        details=f"Expected: {row['expected_authors']}, Got: {response.author_names}",
    )

Using Evaluators via API

Once you have a reliable image evaluator, you can call it directly via REST API for integration into any pipeline:
curl -X POST "https://app.langwatch.ai/api/evaluations/evaluators/image-quality-scorer/evaluate" \
     -H "X-Auth-Token: $LANGWATCH_API_KEY" \
     -H "Content-Type: application/json" \
     -d @- <<EOF
{
  "data": {
    "input": "https://example.com/clothing-item.jpg",
    "output": "https://example.com/tryon-result.jpg",
    "contexts": ["https://example.com/original-photo.jpg"]
  }
}
EOF
Base64 image payloads can be large. The evaluator API supports request bodies up to 30 MB. If you are working with many high-resolution images, prefer using image URLs over base64 encoding.

Model Compatibility

Image evaluation requires a vision-capable model. Any model supported by litellm with vision capabilities works, including:
Provider    Models
OpenAI      gpt-4o, gpt-4o-mini, gpt-5.2
Anthropic   claude-sonnet-4-5-20250929, claude-opus-4-6
Google      gemini-2.0-flash, gemini-2.5-pro
If a non-vision model is selected, the evaluator falls back to sending plain text descriptions. For accurate image evaluation, always select a vision-capable model.
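Because the fallback is silent, it can help to fail fast on a non-vision model before running an evaluation. litellm exposes a `supports_vision(model=...)` helper for an authoritative check; the dependency-free sketch below uses an illustrative, non-exhaustive prefix allowlist instead:

```python
# Illustrative, non-exhaustive allowlist; keep it in sync with your providers.
VISION_PREFIXES = ("gpt-4o", "gpt-5", "claude-sonnet", "claude-opus", "gemini-")


def is_vision_capable(model: str) -> bool:
    """Cheap guard so a config error surfaces immediately instead of degrading to text."""
    return model.lower().startswith(VISION_PREFIXES)
```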

Next Steps