Experiments let you systematically test your LLM applications before deploying to production. Run your prompts, models, or agents against datasets and measure quality with evaluators.

What is an Experiment?

An experiment consists of three components:
  1. Dataset - A collection of test cases with inputs (and optionally expected outputs)
  2. Target - What you’re testing: a prompt, model, API endpoint, or custom code
  3. Evaluators - Scoring functions that assess output quality
When you run an experiment, LangWatch executes your target on each dataset row and scores the results with your selected evaluators.
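As a concrete illustration, a single dataset row might look like the dictionary below. This is a sketch only: the field names are up to you and should match what your evaluators expect (the Quick Example later uses "input" and "contexts").

# Illustrative only: one dataset row with an input, an optional expected
# output, and retrieved contexts for RAG-style evaluators
row = {
    "input": "What is the capital of France?",
    "expected_output": "Paris",
    "contexts": ["France is a country in Western Europe. Its capital is Paris."],
}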

When to Use Experiments

  • Before deploying - Validate prompt changes don’t regress quality
  • Comparing options - Test different models, prompts, or configurations side-by-side
  • CI/CD gates - Automatically block deployments that fail quality thresholds
  • Benchmarking - Track quality metrics over time across experiment runs

Getting Started

Choose the approach that fits your workflow. The quick example below uses the Python SDK.

Quick Example

import langwatch

# Load your dataset
df = langwatch.dataset.get_dataset("my-dataset").to_pandas()

# Initialize experiment
evaluation = langwatch.experiment.init("prompt-v2-test")

# Run through dataset
for idx, row in evaluation.loop(df.iterrows()):
    # Execute your LLM
    response = my_llm(row["input"])

    # Run evaluators
    evaluation.evaluate(
        "ragas/faithfulness",
        index=idx,
        data={
            "input": row["input"],
            "output": response,
            "contexts": row["contexts"],
        },
    )
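
The my_llm call above stands in for whatever target you are testing: a prompt, model, API endpoint, or custom code. A minimal sketch, assuming the OpenAI Python SDK and an OPENAI_API_KEY in the environment, could look like this:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def my_llm(user_input: str) -> str:
    # Replace with your own prompt, model, or agent call
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_input}],
    )
    return completion.choices[0].message.content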

Experiment Results

After running an experiment, you can:
  • Compare runs - See how different configurations perform side-by-side
  • Drill into failures - Inspect individual test cases that scored poorly
  • Track trends - Monitor quality metrics across experiment runs over time
  • Export data - Download results for further analysis
Experiment results showing comparison between runs
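
One straightforward way to produce comparable runs is to execute the same dataset once per configuration, giving each run its own experiment name. The sketch below reuses the API from the Quick Example; the variant names and the my_llm_v1 / my_llm_v2 target functions are placeholders for your own configurations.

import langwatch

df = langwatch.dataset.get_dataset("my-dataset").to_pandas()

# Placeholder targets: substitute your own prompts, models, or agents
variants = {
    "prompt-v1": my_llm_v1,
    "prompt-v2": my_llm_v2,
}

for name, target in variants.items():
    evaluation = langwatch.experiment.init(name)
    for idx, row in evaluation.loop(df.iterrows()):
        response = target(row["input"])
        evaluation.evaluate(
            "ragas/faithfulness",
            index=idx,
            data={
                "input": row["input"],
                "output": response,
                "contexts": row["contexts"],
            },
        )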

CI/CD Integration

Run experiments automatically in your deployment pipeline:
# GitHub Actions example
- name: Run quality experiments
  run: |
    python -c "
    import langwatch
    result = langwatch.experiment.run('my-experiment')
    result.print_summary()
    "
  env:
    LANGWATCH_API_KEY: ${{ secrets.LANGWATCH_API_KEY }}
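To turn a run into a pass/fail gate, exit with a non-zero status when a metric falls below your threshold. The sketch below is illustrative: the mean_score accessor and the 0.8 threshold are assumptions, so adapt them to whatever the result object in your SDK version actually exposes.

import sys
import langwatch

result = langwatch.experiment.run("my-experiment")
result.print_summary()

# Hypothetical accessor: adapt to the result object your SDK version returns
mean_faithfulness = result.mean_score("ragas/faithfulness")

if mean_faithfulness < 0.8:
    # A non-zero exit code fails the CI job and blocks the deployment
    sys.exit(1)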
Learn more about CI/CD integration.

Next Steps