Custom scoring lets you send evaluation results from your own code to LangWatch. This is useful when you have proprietary evaluation logic, domain-specific metrics, or want to integrate existing evaluation systems.
When to use Custom Scoring:
  • You have your own evaluation logic (deterministic or ML-based)
  • You’re integrating an existing evaluation system
  • You need domain-specific metrics that aren’t covered by built-in evaluators
  • You want to track any custom metric alongside your traces
How It Works

With custom scoring, you:
  1. Run your own evaluation logic
  2. Send the results (score, passed, label, details) to LangWatch
  3. View results in traces, analytics, and dashboards
Your Code → Your Evaluation Logic → Score/Pass/Fail → LangWatch → Traces, Analytics, Alerts

Sending Custom Scores

On a Trace/Span

Attach evaluation results to the current trace or span:
import langwatch

@langwatch.span()
def my_llm_step(user_input: str):
    output = my_llm(user_input)

    # Run your custom evaluation
    score = my_custom_evaluator(user_input, output)
    is_valid = score > 0.7

    # Send results to LangWatch
    langwatch.get_current_span().add_evaluation(
        name="my_custom_metric",
        passed=is_valid,
        score=score,
        details="Custom evaluation based on domain rules"
    )

    return output

In Experiments

Log custom scores during batch evaluation:
import langwatch

experiment = langwatch.experiment.init("my-experiment")

# df is a pandas DataFrame with "input" and "expected" columns
for index, row in experiment.loop(df.iterrows()):
    output = my_llm(row["input"])

    # Run your custom evaluation
    score = my_custom_evaluator(row["input"], output, row["expected"])

    # Log the custom score
    experiment.log(
        name="my_custom_metric",
        index=index,
        data={"input": row["input"], "output": output},
        score=score,
        passed=score > 0.7,
        details="Custom domain-specific evaluation"
    )

Evaluation Result Fields

Field     Type      Required  Description
name      string    Yes       Identifier for this evaluation (shows in UI)
passed    boolean   No        Whether the evaluation passed
score     number    No        Numeric score (typically 0-1)
label     string    No        Category label (e.g., “positive”, “negative”)
details   string    No        Human-readable explanation
At least one of passed, score, or label should be provided for meaningful results.
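For example, a classification-style check can report only a label. A minimal sketch, assuming a hypothetical user-defined classify_sentiment function and the add_evaluation call shown above:

# classify_sentiment is a hypothetical user-defined classifier
sentiment = classify_sentiment(output)  # e.g., "positive" or "negative"

# A label-only evaluation: no score or passed flag required
langwatch.get_current_span().add_evaluation(
    name="sentiment",
    label=sentiment,
    details=f"Response classified as {sentiment}"
)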

Example Use Cases

Code Quality Check

def check_code_quality(generated_code: str) -> dict:
    # check_syntax and check_style_guide are your own helpers:
    # check_syntax returns True when syntax errors are found,
    # check_style_guide returns True when the style guide is followed
    has_syntax_errors = check_syntax(generated_code)
    follows_style = check_style_guide(generated_code)

    score = 0.0
    if not has_syntax_errors:
        score += 0.5
    if follows_style:
        score += 0.5

    return {
        "passed": score >= 0.5,
        "score": score,
        "details": f"Syntax OK: {not has_syntax_errors}, Style OK: {follows_style}"
    }

# Use in your pipeline
result = check_code_quality(llm_output)
langwatch.get_current_span().add_evaluation(
    name="code_quality",
    **result
)

Semantic Similarity

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(output: str, expected: str) -> float:
    embeddings = model.encode([output, expected])
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return float(similarity)

# Use in experiment
score = semantic_similarity(output, row["expected"])
experiment.log(
    name="semantic_similarity",
    index=index,
    data={"output": output, "expected": row["expected"]},
    score=score,
    passed=score > 0.8
)
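If you prefer to avoid the scikit-learn dependency, sentence-transformers ships its own cosine-similarity helper (util.cos_sim), so a variant of the function above could be:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(output: str, expected: str) -> float:
    embeddings = model.encode([output, expected])
    # util.cos_sim returns a 1x1 tensor for a pair of embeddings
    return float(util.cos_sim(embeddings[0], embeddings[1]))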

Business Rule Validation

def validate_response(response: str, context: dict) -> dict:
    issues = []

    # Check for required elements
    if context.get("require_disclaimer") and "disclaimer" not in response.lower():
        issues.append("Missing required disclaimer")

    # Check length constraints
    if len(response) > context.get("max_length", 1000):
        issues.append("Response too long")

    # Check for prohibited content
    for word in context.get("prohibited_words", []):
        if word.lower() in response.lower():
            issues.append(f"Contains prohibited word: {word}")

    return {
        "passed": len(issues) == 0,
        # Deduct 0.2 per issue, clamped so the score never drops below 0
        "score": max(0.0, 1.0 - (len(issues) * 0.2)),
        "details": "; ".join(issues) if issues else "All checks passed"
    }
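As in the code-quality example, the returned dict unpacks directly into add_evaluation (the context values here are illustrative):

result = validate_response(llm_output, {"require_disclaimer": True, "max_length": 500})
langwatch.get_current_span().add_evaluation(
    name="business_rules",
    **result
)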

Combining with Built-in Evaluators

You can use custom scoring alongside built-in evaluators:
@langwatch.span()
def my_llm_step(user_input: str):
    output = my_llm(user_input)

    # Built-in evaluator
    langwatch.evaluation.evaluate(
        "presidio/pii_detection",
        name="PII Check",
        data={"output": output},
    )

    # Custom evaluation
    business_score = my_business_rules_check(output)
    langwatch.get_current_span().add_evaluation(
        name="business_rules",
        passed=business_score > 0.8,
        score=business_score,
    )

    return output

Viewing Custom Scores

Custom scores appear in:
  • Trace Details - Under the Evaluations section
  • Analytics Dashboard - Filterable by evaluation name
  • Experiments - In the results table alongside other evaluators
