LLM evaluation

Evals that live in your codebase and sync with the platform

Define quality metrics, run them offline in CI or live on production traffic, and catch regressions before they ship. Bring your own datasets, tools, and pandas - add a few lines and start tracking.

Book a demo Run Evals in CI/CD

eval.py

import langwatch

evaluation = langwatch.evaluation.init("rag-quality-experiment")

for index, row in evaluation.loop(df.iterrows(), threads=4):
    response = execute_rag_pipeline(row["question"])
    evaluation.run("ragas/faithfulness", index=index, data={...})
    evaluation.log("confidence", index=index, score=response.confidence)

input | score | status | ms | cost
How do I cancel? | 0.94 | pass | 420 | $0.002
Refund for expired… | 0.61 | fail | 680 | $0.003
Where is my order? | 0.97 | pass | 390 | $0.002
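The sample rows above can be rolled up into run-level numbers such as pass rate, average latency, and total cost. The following is a minimal sketch in plain Python, not the platform's own aggregation; the `EvalRow` class and `summarize` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalRow:
    """One evaluated sample (hypothetical structure, mirrors the columns above)."""
    input: str
    score: float
    passed: bool
    latency_ms: int
    cost_usd: float


# The three sample rows shown above.
ROWS = [
    EvalRow("How do I cancel?", 0.94, True, 420, 0.002),
    EvalRow("Refund for expired…", 0.61, False, 680, 0.003),
    EvalRow("Where is my order?", 0.97, True, 390, 0.002),
]


def summarize(rows):
    """Aggregate per-row results into run-level metrics."""
    if not rows:
        raise ValueError("no rows to summarize")
    n = len(rows)
    return {
        "pass_rate": sum(r.passed for r in rows) / n,
        "avg_latency_ms": sum(r.latency_ms for r in rows) / n,
        # Round to avoid floating-point noise in the displayed total.
        "total_cost_usd": round(sum(r.cost_usd for r in rows), 6),
    }


summary = summarize(ROWS)
print(summary)
```

With these three rows, two of three samples pass, so a single failing row is enough to pull the pass rate well below a typical release threshold.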

Add evaluation tracking to your existing workflow.

Keep using pandas and your favourite tools. Initialise an experiment, loop over your dataset, run built-in evaluators, and log your own metrics alongside - in parallel.

parallel · threaded

import langwatch

evaluation = langwatch.evaluation.init("rag-quality-experiment")

for index, row in evaluation.loop(df.iterrows(), threads=4):
    def evaluate(index, row):
        response, contexts = execute_rag_pipeline(row["question"])

        # built-in RAGAS faithfulness evaluator
        evaluation.run(
            "ragas/faithfulness",
            index=index,
            data={"input": row["question"], "output": response, "contexts": contexts},
            settings={"model": "openai/gpt-5", "max_tokens": 2048},
        )

        # log your own metric alongside
        evaluation.log("confidence", index=index, score=response.confidence)

    evaluation.submit(evaluate, index, row)   # runs in parallel

modern-solid-koala · Run #33

Total Cost · Avg Latency · Pass Rate

product | brand | tier
Apple/pear juice | Flevosap | B-brand
Smoothies | Innocent | A-brand
Vitamin shots | G’nger | A-brand


A full suite of evaluators, out of the box.

Browse all evaluators

RAG quality

RAGAS faithfulness · Answer relevancy · Context precision · Context recall

Safety

Hallucination detection · Toxicity · PII detection · Jailbreak / prompt-injection

Quality

Answer correctness · LLM-as-a-judge (your criteria) · BLEU · Embedding distance

Ops

Latency (automatic) · Cost (automatic)

LLM-as-a-judge scores answers against your own natural-language criteria. Need something bespoke? Bring your own evaluator and it shows up right next to the built-ins.
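To make the LLM-as-a-judge idea concrete, here is a minimal sketch of the general pattern: format a natural-language criterion into a grading prompt, send it to a model, and parse a structured verdict. It is not the built-in evaluator. The prompt template, the `judge` function, and the `fake_model` stand-in are all assumptions made for illustration; in practice `call_model` would wrap a real LLM client.

```python
import json
from typing import Callable

# Hypothetical grading prompt; real judges usually need more careful wording.
JUDGE_TEMPLATE = """You are grading an answer against a criterion.
Criterion: {criteria}
Question: {question}
Answer: {answer}
Reply with JSON only: {{"passed": true or false, "reason": "<one sentence>"}}"""


def judge(question: str, answer: str, criteria: str,
          call_model: Callable[[str], str]) -> dict:
    """Ask a model to grade `answer` and return a normalized verdict."""
    prompt = JUDGE_TEMPLATE.format(
        criteria=criteria, question=question, answer=answer
    )
    raw = call_model(prompt)
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        # Treat unparseable judge output as a failure rather than crashing.
        return {"passed": False, "reason": "judge returned unparseable output"}
    if not isinstance(verdict, dict):
        return {"passed": False, "reason": "judge returned a non-object verdict"}
    return {
        "passed": verdict.get("passed") is True,
        "reason": str(verdict.get("reason", "")),
    }


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call: passes answers that mention 'Settings'."""
    answer_line = next(
        line for line in prompt.splitlines() if line.startswith("Answer: ")
    )
    ok = "Settings" in answer_line
    reason = "points to Settings" if ok else "does not say where to go"
    return json.dumps({"passed": ok, "reason": reason})


criteria = "The answer tells the user where in the product to make the change."
question = "How do I update my billing information?"
good = judge(question, "Update it in Settings → Billing, then Edit Payment Method.",
             criteria, fake_model)
bad = judge(question, "I can only help with weather stations.",
            criteria, fake_model)
print(good, bad)
```

Failing closed on malformed judge output is a deliberate choice here: a judge that silently passes when it cannot be parsed would hide regressions.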

Open the platform Build a custom evaluator

Custom evals

Custom evaluators, connected to your traces.

Have an in-house metric? Run it in your own code and attach the result to the current trace or span so it shows up next to the built-in evaluators.

Read the custom evaluator docs

custom_evaluator.py

import langwatch

@langwatch.span(type="evaluation")
def evaluation_step():
    # ... your custom evaluation logic ...
    langwatch.get_current_span().add_evaluation(
        name="category_match",        # required
        passed=True,
        score=0.5,
        label="category_detected",
        details="explanation of the result",
    )
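As an illustration of what the custom evaluation logic might compute, the sketch below implements a simple in-house metric as a plain function. The `category_match` function, its normalization rule, and the `category_missed` label are hypothetical; only the keyword names (`name`, `passed`, `score`, `label`, `details`) are taken from the snippet above.

```python
def category_match(predicted: str, expected: str) -> dict:
    """Compare a predicted category to the expected one, ignoring case and
    surrounding whitespace, and return fields shaped like the evaluation
    arguments used in the snippet above."""
    normalized_predicted = predicted.strip().lower()
    normalized_expected = expected.strip().lower()
    passed = normalized_predicted == normalized_expected
    return {
        "name": "category_match",
        "passed": passed,
        "score": 1.0 if passed else 0.0,
        # "category_missed" is an illustrative label, not a platform constant.
        "label": "category_detected" if passed else "category_missed",
        "details": f"predicted={predicted!r}, expected={expected!r}",
    }


hit = category_match(" Billing", "billing")
miss = category_match("shipping", "billing")
print(hit)
print(miss)
```

Assuming the call shown above, the resulting dict could be unpacked inside the decorated function, for example `add_evaluation(**hit)`, so the metric logic stays testable without any tracing set up.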

Offline, online, and in your CI/CD.

Self-host or run in your own VPC - keep everything local when you need to.

Offline / batch

Run experiments on datasets, compare prompts and models side-by-side, and validate model upgrades before they ship.

Model A vs Model B
gpt-5 | 0.92
claude-sonnet | 0.88
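A side-by-side comparison usually looks at more than the two averages: an upgrade can raise the mean while breaking individual cases. The sketch below shows one way to check for that, assuming per-sample scores keyed by a sample id. The `compare` function, the tolerance, and the scores are made up for illustration and are not taken from any real run.

```python
from statistics import mean


def compare(baseline: dict, candidate: dict, tolerance: float = 0.05) -> dict:
    """Compare per-sample scores of two models on the samples they share.

    A sample counts as a regression when the candidate scores more than
    `tolerance` below the baseline.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no shared samples to compare")
    regressions = [k for k in shared if candidate[k] < baseline[k] - tolerance]
    return {
        "baseline_mean": mean(baseline[k] for k in shared),
        "candidate_mean": mean(candidate[k] for k in shared),
        "regressions": regressions,
    }


# Illustrative scores only.
baseline_scores = {"q1": 0.9, "q2": 0.8, "q3": 0.7}
candidate_scores = {"q1": 0.95, "q2": 0.6, "q3": 0.7}
report = compare(baseline_scores, candidate_scores)
print(report)
```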

Real-time / online

Run evals continuously on production traffic and alert when quality drops.
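One common way to alert on a quality drop is a rolling pass rate over the most recent results. The following is a minimal sketch of that idea, not the platform's alerting mechanism; the `PassRateMonitor` class and its default window, threshold, and minimum sample count are assumptions.

```python
from collections import deque


class PassRateMonitor:
    """Track pass/fail results in a sliding window and flag quality drops."""

    def __init__(self, window: int = 100, threshold: float = 0.8,
                 min_samples: int = 20):
        self.results = deque(maxlen=window)  # oldest results fall off the left
        self.threshold = threshold
        self.min_samples = min_samples

    def pass_rate(self) -> float:
        return sum(self.results) / len(self.results)

    def record(self, passed: bool) -> bool:
        """Record one result; return True when an alert should fire."""
        self.results.append(passed)
        if len(self.results) < self.min_samples:
            return False  # too little data to judge
        return self.pass_rate() < self.threshold


monitor = PassRateMonitor(window=5, threshold=0.8, min_samples=5)
stream = [True, True, True, True, True, False, False]
alerts = [monitor.record(passed) for passed in stream]
print(alerts)
```

The minimum sample count keeps a single early failure from firing an alert before the window has enough data to be meaningful.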

CI/CD

Run your eval suite on every PR via the Python & TypeScript SDKs, and gate merges on the results.
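Gating a merge comes down to turning eval results into a process exit code that the CI system can act on. The sketch below shows the idea with plain Python; the `gate` function and both thresholds are hypothetical, and the scores reuse the three sample rows from the top of the page.

```python
def gate(scores, min_score: float = 0.7, min_pass_rate: float = 0.9) -> int:
    """Return a CI exit code: 0 to allow the merge, 1 to block it."""
    if not scores:
        print("no eval results found, blocking merge")
        return 1
    passed = sum(score >= min_score for score in scores)
    pass_rate = passed / len(scores)
    print(f"{passed}/{len(scores)} samples passed ({pass_rate:.0%})")
    return 0 if pass_rate >= min_pass_rate else 1


exit_code = gate([0.94, 0.61, 0.97])
print("exit code:", exit_code)
# In a CI job, the script would end with: raise SystemExit(exit_code)
```

Treating an empty result set as a failure avoids a silent green build when the eval step did not actually run.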


Developer-first, but not developer-only.

Developers define evals in code; product owners, QA, and domain experts define the quality framework and run evaluations with the zero-code wizard. One shared source of truth.

langwatch · experiments workbench · no code

proud-solid-lynx Run

input | expected_output | demo-prompt (Score 33%, LLM Answer Match) | infield-agent v1 (Score 0%, LLM Answer Match)
How do I update my billing information? | Update it in Settings → Billing, then Edit Payment Method. | To update your billing information, follow these steps: 1. Log in to your account… | I can only help with Davis Instruments weather stations and field-monitoring tasks.
I’m having trouble logging into my account | Try the “Forgot Password” link on the login page. | I’m sorry to hear you’re having trouble logging in. Could you share a few details? | Searching the knowledge base for “login”, “password reset” for Davis Instruments…
What are your business hours? | We’re available Mon-Fri, 9 AM to 6 PM in your local timezone. | Our business hours are Monday through Friday, 9:00 AM to 6:00 PM. | I don’t have business hours - I only help with Davis weather stations.

No code required - product owners, QA and domain experts run and compare evals right in the UI, or ask Langy to build one.


Start evaluating.

Read the docs Book a demo