LangWatch makes it easy to run experiments from code. Just add a few lines to start tracking your experiments.

Quickstart

1. Install the SDK

pip install langwatch

2. Set your API Key

import langwatch

langwatch.login()
Be sure to log in or create an account at the link that is displayed, then provide your API key when prompted.
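For non-interactive environments (e.g. CI), you can provide the key up front instead of logging in interactively. A minimal sketch, assuming the SDK reads the key from a LANGWATCH_API_KEY environment variable:

import os

# Assumption: LANGWATCH_API_KEY is the environment variable the SDK reads.
# Set it before using langwatch so no interactive prompt is needed.
os.environ["LANGWATCH_API_KEY"] = "your-api-key"

import langwatch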

3. Start tracking

import langwatch
import pandas as pd

# Load your dataset
df = pd.read_csv("my_dataset.csv")

# Initialize a new experiment
evaluation = langwatch.experiment.init("my-experiment")

# Wrap your loop with evaluation.loop(), and iterate as usual
for idx, row in evaluation.loop(df.iterrows()):
    # Run your model or pipeline
    response = my_agent(row["question"])

    # Log a metric for this sample
    evaluation.log("sample_metric", index=idx, score=0.95)
That’s it! Your evaluation metrics are now being tracked and visualized in LangWatch.
[Image: sample evaluation results in the LangWatch dashboard]

Core Concepts

Evaluation Initialization

Initialize the evaluation by creating a session with a descriptive name:
evaluation = langwatch.experiment.init("rag-pipeline-openai-vs-claude")

Iterating over data

Use evaluation.loop() around your iterator so the entries are tracked:
for index, row in evaluation.loop(df.iterrows()):
    ...  # your existing evaluation code

Metrics logging

Track any metric you want with evaluation.log():
# Numeric scores
evaluation.log("relevance", index=index, score=0.85)

# Boolean pass/fail
evaluation.log("contains_citation", index=index, passed=True)

# Include additional data for debugging
evaluation.log("coherence", index=index, score=0.9,
               data={"output": result["text"], "tokens": result["token_count"]})

Comparing Multiple Targets

When comparing different models, prompts, or configurations, use targets to organize your results. Both SDKs provide a target() / withTarget() context that automatically captures latency and enables context inference. In Python, wrap each variant in evaluation.target():
evaluation = langwatch.experiment.init("model-comparison")

for index, row in evaluation.loop(df.iterrows()):
    def compare_models(index, row):
        # Evaluate GPT-5 with automatic latency tracking
        with evaluation.target("gpt5-baseline", {"model": "openai/gpt-5"}):
            response = call_openai("gpt-5", row["question"])
            evaluation.log_response(response)  # Store the model output
            # Target is auto-inferred inside target()!
            evaluation.log("accuracy", index=index,
                          score=calculate_accuracy(response, row["expected"]))

        # Evaluate Claude with automatic latency tracking
        with evaluation.target("claude-experiment", {"model": "anthropic/claude-4-opus"}):
            response = call_anthropic("claude-4-opus", row["question"])
            evaluation.log_response(response)
            evaluation.log("accuracy", index=index,
                          score=calculate_accuracy(response, row["expected"]))

    evaluation.submit(compare_models, index, row)
evaluation.target() automatically captures latency, creates isolated traces per target, and enables context inference so log() calls don’t need explicit target parameters. Use log_response() to store the model’s output.
Alternatively, use the target parameter directly with evaluation.log():
evaluation.log(
    "accuracy",
    index=index,
    score=0.95,
    target="gpt5-baseline",
    metadata={"model": "openai/gpt-5", "temperature": 0.7}
)

Target Registration

The first time you use a target name, it’s automatically registered with the provided metadata:
# Using target() - metadata is set when entering the context
with evaluation.target("gpt5", {"model": "gpt-5", "temp": 0.7}):
    evaluation.log_response("AI response here")  # Store the output
    evaluation.log("latency", index=0, score=150)  # target auto-inferred
    evaluation.log("accuracy", index=0, score=0.95)  # target auto-inferred

# Or using explicit target parameter (without target() context)
evaluation.log("latency", index=0, target="gpt5", metadata={"model": "gpt-5", "temp": 0.7})

# Subsequent calls can omit metadata - it's already registered
evaluation.log("accuracy", index=0, target="gpt5", score=0.95)
evaluation.log("latency", index=1, target="gpt5", score=150)
If you provide different metadata for the same target name, an error will be raised. Use a different target name if you want different configurations.
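For example, a minimal sketch (target names are illustrative) of registering one target per configuration instead of re-using one name with different metadata:

# Each configuration gets its own target name and metadata.
# Re-using a single name (e.g. "gpt5") with different metadata would raise an error.
with evaluation.target("gpt5-temp-0.7", {"model": "gpt-5", "temp": 0.7}):
    evaluation.log("accuracy", index=0, score=0.95)

with evaluation.target("gpt5-temp-0.9", {"model": "gpt-5", "temp": 0.9}):
    evaluation.log("accuracy", index=0, score=0.91)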

Metadata for Comparison

Target metadata is used for comparison charts in the LangWatch UI. You can group results by any metadata field:
# Compare different temperatures
for temp in [0.0, 0.5, 0.7, 1.0]:
    for index, row in evaluation.loop(df.iterrows()):
        response = call_llm(row["question"], temperature=temp)
        evaluation.log(
            "quality",
            index=index,
            score=evaluate_quality(response),
            target=f"temp-{temp}",
            metadata={"model": "gpt-5", "temperature": temp}
        )
In the LangWatch UI, you can then visualize how quality varies across temperature values.

Parallel Execution

LLM calls can be slow. Both SDKs support parallel execution to speed up your evaluations.
To use the built-in parallelization, put the body of the loop in a function and submit it with evaluation.submit():
evaluation = langwatch.experiment.init("parallel-eval-example")

for index, row in evaluation.loop(df.iterrows(), threads=4):
    def task(index, row):
        result = agent(row["question"])  # Runs in parallel
        evaluation.log("response_quality", index=index, score=0.92)

    evaluation.submit(task, index, row)
By default, threads=4. Adjust based on your API rate limits and system resources.

Built-in Evaluators

LangWatch provides a comprehensive suite of evaluation metrics out of the box.
Use evaluation.evaluate() to run pre-built evaluators:
for index, row in evaluation.loop(df.iterrows()):
    def task(index, row):
        response, contexts = execute_rag_pipeline(row["question"])

        # Use built-in RAGAS faithfulness evaluator
        evaluation.evaluate(
            "ragas/faithfulness",
            index=index,
            data={
                "input": row["question"],
                "output": response,
                "contexts": contexts,
            },
            settings={
                "model": "openai/gpt-5",
                "max_tokens": 2048,
            }
        )

        # Log custom metrics alongside
        evaluation.log("confidence", index=index, score=response.confidence)

    evaluation.submit(task, index, row)
Browse our complete list of available evaluators including metrics for RAG quality, hallucination detection, safety, and more.

Complete Example

import langwatch

# Load dataset from LangWatch
df = langwatch.dataset.get_dataset("your-dataset-id").to_pandas()

# Initialize evaluation
evaluation = langwatch.experiment.init("rag-pipeline-evaluation-v2")

# Run evaluation with parallelization
for index, row in evaluation.loop(df.iterrows(), threads=8):
    def task(index, row):
        # Compare two RAG configurations
        with evaluation.target("rag-v1", {"model": "gpt-5", "retriever": "dense"}):
            response, contexts = execute_rag_pipeline(row["question"], version="v1")
            evaluation.log_response(response.text)  # Store the output

            # Use LangWatch evaluators - target auto-inferred
            evaluation.evaluate(
                "ragas/faithfulness",
                index=index,
                data={"input": row["question"], "output": response, "contexts": contexts},
                settings={"model": "openai/gpt-5", "max_tokens": 2048}
            )

            # Log custom metrics - latency auto-captured by target()
            evaluation.log("response_quality", index=index, score=response.quality)

        with evaluation.target("rag-v2", {"model": "gpt-5", "retriever": "hybrid"}):
            response, contexts = execute_rag_pipeline(row["question"], version="v2")
            evaluation.log_response(response.text)

            evaluation.evaluate(
                "ragas/faithfulness",
                index=index,
                data={"input": row["question"], "output": response, "contexts": contexts},
                settings={"model": "openai/gpt-5", "max_tokens": 2048}
            )

            evaluation.log("response_quality", index=index, score=response.quality)

    evaluation.submit(task, index, row)

Tracing Your Pipeline

To get complete visibility into your LLM pipeline, add tracing to your functions:
@langwatch.trace()
def agent(question):
    # Your RAG pipeline, chain, or agent logic
    context = retrieve_documents(question)
    completion = llm.generate(question, context)
    return {"text": completion.text, "context": context}

for index, row in evaluation.loop(df.iterrows()):
    result = agent(row["question"])
    evaluation.log("accuracy", index=index, score=0.9)
Learn more in our Python Integration Guide.
With tracing enabled, you can click through from any evaluation result to see the complete execution trace, including all LLM calls, prompts, and intermediate steps.

Exporting Results to CSV

After running your evaluations, you can export results to CSV for further analysis in spreadsheet tools like Excel or Google Sheets.

How to Export

Click the Export to CSV button in the top-right corner of the evaluation results page to download a complete CSV file with all your data.

CSV Structure

The exported CSV contains comprehensive data organized by dataset rows and targets. Here’s the complete column structure:

Row Index

| Column | Description |
|--------|-------------|
| index | Row number (0-based) for cross-referencing with the UI |

Dataset Columns

All columns from your input dataset are included with their original names.

Target Columns (per target)

For each target in your evaluation, the following columns are exported:
| Column Pattern | Description | Example |
|----------------|-------------|---------|
| {target}_model | Model used for this target | gpt-4_model = openai/gpt-4 |
| {target}_prompt_id | Prompt configuration ID (for prompt targets) | gpt-4_prompt_id = prompt-abc123 |
| {target}_prompt_version | Prompt version number | gpt-4_prompt_version = 2 |
| {target}_{metadata_key} | Custom metadata values | gpt-4_temperature = 0.7 |
| {target}_output | Model output (or individual output fields) | gpt-4_output = "The answer is 42" |
| {target}_cost | Execution cost in USD | gpt-4_cost = 0.0012 |
| {target}_duration_ms | Execution time in milliseconds | gpt-4_duration_ms = 1250 |
| {target}_error | Error message if execution failed | gpt-4_error = "Rate limit exceeded" |
| {target}_trace_id | Trace ID for viewing execution details | gpt-4_trace_id = trace_abc123 |

Evaluator Columns (per target, per evaluator)

For each evaluator applied to a target:
| Column Pattern | Description | Example |
|----------------|-------------|---------|
| {target}_{evaluator}_score | Numeric score (0-1) | gpt-4_faithfulness_score = 0.95 |
| {target}_{evaluator}_passed | Boolean pass/fail | gpt-4_faithfulness_passed = true |
| {target}_{evaluator}_label | Classification label | gpt-4_sentiment_label = positive |
| {target}_{evaluator}_details | Additional details or explanation | gpt-4_faithfulness_details = "All claims supported" |
| {target}_{evaluator}_cost | Cost of running the evaluator | gpt-4_faithfulness_cost = 0.0005 |
| {target}_{evaluator}_duration_ms | Evaluator execution time in milliseconds | gpt-4_faithfulness_duration_ms = 850 |

Example CSV Output

For an evaluation comparing GPT-4 and Claude with a faithfulness evaluator:
index,question,expected,gpt-4_model,gpt-4_output,gpt-4_cost,gpt-4_duration_ms,gpt-4_faithfulness_score,gpt-4_faithfulness_passed,claude_model,claude_output,claude_cost,claude_duration_ms,claude_faithfulness_score,claude_faithfulness_passed
0,What is 2+2?,4,openai/gpt-4,The answer is 4,0.0012,1250,0.95,true,anthropic/claude-3,2+2 equals 4,0.0008,980,0.92,true
1,Capital of France?,Paris,openai/gpt-4,Paris is the capital of France,0.0015,1100,0.98,true,anthropic/claude-3,The capital of France is Paris,0.0010,890,0.97,true

Using the Data

The CSV export enables powerful analysis workflows:
Use spreadsheet filters to compare specific models or configurations:
  • Filter by {target}_model to analyze specific model performance
  • Sort by {target}_{evaluator}_score to find best/worst performing samples
  • Filter by {target}_error to identify failed executions
Calculate aggregate metrics across your evaluation (see the pandas sketch below):
  • Sum {target}_cost columns for total evaluation cost per model
  • Average {target}_duration_ms to compare response times
  • Identify outliers with high latency or cost
Analyze performance across different configurations:
  • Pivot tables by temperature, max_tokens, or custom metadata
  • Compare prompt versions side-by-side
  • Track improvements across iterations
Investigate problematic samples:
  • Filter rows where {target}_error is not empty
  • Cross-reference index with the UI for detailed inspection
  • Click through to traces using {target}_trace_id
All column headers are normalized to lowercase with spaces replaced by underscores for consistency and compatibility with data analysis tools.
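As an illustration, a minimal pandas sketch for aggregating the exported columns; the file name, target names, and evaluator name are placeholders matching the example above:

import pandas as pd

# Load the exported CSV (file name is a placeholder)
df = pd.read_csv("evaluation-results.csv")

# Total cost, average latency, and average faithfulness per target,
# using the column patterns described above
for target in ["gpt-4", "claude"]:
    total_cost = df[f"{target}_cost"].sum()
    avg_latency = df[f"{target}_duration_ms"].mean()
    avg_faithfulness = df[f"{target}_faithfulness_score"].mean()
    print(f"{target}: cost=${total_cost:.4f}, "
          f"avg latency={avg_latency:.0f}ms, "
          f"avg faithfulness={avg_faithfulness:.2f}")

# Rows where any target errored (error columns may be absent if nothing failed)
error_cols = [c for c in df.columns if c.endswith("_error")]
if error_cols:
    failed = df[df[error_cols].notna().any(axis=1)]
    print(f"{len(failed)} rows with errors")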

Running in CI/CD

You can run SDK experiments in your CI/CD pipeline. The print_summary() method outputs a structured summary and exits with code 1 if any evaluations fail:
import langwatch

experiment = langwatch.experiment.init("ci-quality-check")

for idx, row in experiment.loop(dataset.iterrows()):
    response = my_llm(row["input"])
    experiment.evaluate("ragas/faithfulness", index=idx, data={...})

# This will exit with code 1 if any evaluations failed
experiment.print_summary()
See CI/CD Integration for complete examples with GitHub Actions, GitLab CI, and more.

What’s Next?