Custom scoring lets you send evaluation results from your own code to LangWatch. This is useful when you have proprietary evaluation logic, domain-specific metrics, or want to integrate existing evaluation systems.
When to use Custom Scoring:
  • You have your own evaluation logic (deterministic or ML-based)
  • You’re integrating an existing evaluation system
  • You need domain-specific metrics that aren’t covered by built-in evaluators
  • You want to track any custom metric alongside your traces
How It Works

With custom scoring, you:
  1. Run your own evaluation logic
  2. Send the results (score, passed, label, details) to LangWatch
  3. View results in traces, analytics, and dashboards
Your Code → Your Evaluation Logic → Score/Pass/Fail → LangWatch → Traces, Analytics, Alerts

Sending Custom Scores

On a Trace/Span

Attach evaluation results to the current trace or span:
import langwatch

@langwatch.span()
def my_llm_step(user_input: str):
    output = my_llm(user_input)

    # Run your custom evaluation
    score = my_custom_evaluator(user_input, output)
    is_valid = score > 0.7

    # Send results to LangWatch
    langwatch.get_current_span().add_evaluation(
        name="my_custom_metric",
        passed=is_valid,
        score=score,
        details="Custom evaluation based on domain rules"
    )

    return output

In Experiments

Log custom scores during batch evaluation:
import langwatch

experiment = langwatch.experiment.init("my-experiment")

# df is a pandas DataFrame with "input" and "expected" columns
for index, row in experiment.loop(df.iterrows()):
    output = my_llm(row["input"])

    # Run your custom evaluation
    score = my_custom_evaluator(row["input"], output, row["expected"])

    # Log the custom score
    experiment.log(
        name="my_custom_metric",
        index=index,
        data={"input": row["input"], "output": output},
        score=score,
        passed=score > 0.7,
        details="Custom domain-specific evaluation"
    )

Evaluation Result Fields

Field     Type      Required  Description
name      string    Yes       Identifier for this evaluation (shows in UI)
passed    boolean   No        Whether the evaluation passed
score     number    No        Numeric score (typically 0-1)
label     string    No        Category label (e.g., “positive”, “negative”)
details   string    No        Human-readable explanation
At least one of passed, score, or label should be provided for meaningful results.
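For example, a classification-style check can report only a label. A minimal sketch, assuming a hypothetical user-defined classify_sentiment function and the add_evaluation call shown above:

# classify_sentiment is a hypothetical user-defined classifier
sentiment = classify_sentiment(output)  # e.g., "positive" or "negative"

# A label-only evaluation: no score or passed flag required
langwatch.get_current_span().add_evaluation(
    name="sentiment",
    label=sentiment,
    details=f"Response classified as {sentiment}"
)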

Example Use Cases

Code Quality Check

def check_code_quality(generated_code: str) -> dict:
    # check_syntax and check_style_guide are your own helpers:
    # check_syntax returns True when syntax errors are found,
    # check_style_guide returns True when the style guide is followed
    has_syntax_errors = check_syntax(generated_code)
    follows_style = check_style_guide(generated_code)

    score = 0.0
    if not has_syntax_errors:
        score += 0.5
    if follows_style:
        score += 0.5

    return {
        "passed": score >= 0.5,
        "score": score,
        "details": f"Syntax OK: {not has_syntax_errors}, Style OK: {follows_style}"
    }

# Use in your pipeline
result = check_code_quality(llm_output)
langwatch.get_current_span().add_evaluation(
    name="code_quality",
    **result
)

Semantic Similarity

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(output: str, expected: str) -> float:
    embeddings = model.encode([output, expected])
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return float(similarity)

# Use in experiment
score = semantic_similarity(output, row["expected"])
experiment.log(
    name="semantic_similarity",
    index=index,
    data={"output": output, "expected": row["expected"]},
    score=score,
    passed=score > 0.8
)
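If you prefer to avoid the scikit-learn dependency, sentence-transformers ships its own cosine-similarity helper (util.cos_sim), so a variant of the function above could be:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(output: str, expected: str) -> float:
    embeddings = model.encode([output, expected])
    # util.cos_sim returns a 1x1 tensor for a pair of embeddings
    return float(util.cos_sim(embeddings[0], embeddings[1]))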

Business Rule Validation

def validate_response(response: str, context: dict) -> dict:
    issues = []

    # Check for required elements
    if context.get("require_disclaimer") and "disclaimer" not in response.lower():
        issues.append("Missing required disclaimer")

    # Check length constraints
    if len(response) > context.get("max_length", 1000):
        issues.append("Response too long")

    # Check for prohibited content
    for word in context.get("prohibited_words", []):
        if word.lower() in response.lower():
            issues.append(f"Contains prohibited word: {word}")

    return {
        "passed": len(issues) == 0,
        # Deduct 0.2 per issue, clamped so the score never drops below 0
        "score": max(0.0, 1.0 - (len(issues) * 0.2)),
        "details": "; ".join(issues) if issues else "All checks passed"
    }
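As in the code-quality example, the returned dict unpacks directly into add_evaluation (the context values here are illustrative):

result = validate_response(llm_output, {"require_disclaimer": True, "max_length": 500})
langwatch.get_current_span().add_evaluation(
    name="business_rules",
    **result
)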

Combining with Built-in Evaluators

You can use custom scoring alongside built-in evaluators:
@langwatch.span()
def my_llm_step(user_input: str):
    output = my_llm(user_input)

    # Built-in evaluator
    langwatch.evaluation.evaluate(
        "presidio/pii_detection",
        name="PII Check",
        data={"output": output},
    )

    # Custom evaluation
    business_score = my_business_rules_check(output)
    langwatch.get_current_span().add_evaluation(
        name="business_rules",
        passed=business_score > 0.8,
        score=business_score,
    )

    return output

Viewing Custom Scores

Custom scores appear in:
  • Trace Details - Under the Evaluations section
  • Analytics Dashboard - Filterable by evaluation name
  • Experiments - In the results table alongside other evaluators
