Custom scoring lets you send evaluation results from your own code to LangWatch. This is useful when you have proprietary evaluation logic or domain-specific metrics, or when you want to integrate an existing evaluation system.
When to use Custom Scoring:
  • You have your own evaluation logic (deterministic or ML-based)
  • You’re integrating an existing evaluation system
  • You need domain-specific metrics that aren’t covered by built-in evaluators
  • You want to track any custom metric alongside your traces

How It Works

With custom scoring, you:
  1. Run your own evaluation logic
  2. Send the results (score, passed, label, details) to LangWatch
  3. View results in traces, analytics, and dashboards
Your Code → Your Evaluation Logic → Score/Pass/Fail → LangWatch → Traces, Analytics, Alerts

Sending Custom Scores

On a Trace/Span

Attach evaluation results to the current trace or span:
import langwatch

@langwatch.span()
def my_llm_step(user_input: str):
    output = my_llm(user_input)

    # Run your custom evaluation
    score = my_custom_evaluator(user_input, output)
    is_valid = score > 0.7

    # Send results to LangWatch
    langwatch.get_current_span().add_evaluation(
        name="my_custom_metric",
        passed=is_valid,
        score=score,
        details="Custom evaluation based on domain rules"
    )

    return output

In Experiments

Log custom scores during batch evaluation:
import langwatch

experiment = langwatch.experiment.init("my-experiment")

for index, row in experiment.loop(df.iterrows()):
    output = my_llm(row["input"])

    # Run your custom evaluation
    score = my_custom_evaluator(row["input"], output, row["expected"])

    # Log the custom score
    experiment.log(
        name="my_custom_metric",
        index=index,
        data={"input": row["input"], "output": output},
        score=score,
        passed=score > 0.7,
        details="Custom domain-specific evaluation"
    )

Evaluation Result Fields

Field   | Type    | Required | Description
------- | ------- | -------- | -----------
name    | string  | Yes      | Identifier for this evaluation (shows in UI)
passed  | boolean | No       | Whether the evaluation passed
score   | number  | No       | Numeric score (typically 0-1)
label   | string  | No       | Category label (e.g., “positive”, “negative”)
details | string  | No       | Human-readable explanation
At least one of passed, score, or label should be provided for meaningful results.
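
For categorical results, label can stand in for a numeric score. A minimal sketch — my_sentiment_classifier and the evaluation name are illustrative placeholders, not part of the SDK:
import langwatch

# Hypothetical classifier; returns e.g. "positive", "neutral", or "negative"
label = my_sentiment_classifier(output)

langwatch.get_current_span().add_evaluation(
    name="response_tone",
    label=label,
    details=f"Classified response tone as {label}",
)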

Example Use Cases

Code Quality Check

def check_code_quality(generated_code: str) -> dict:
    # Your custom logic
    has_syntax_errors = check_syntax(generated_code)
    follows_style = check_style_guide(generated_code)

    score = 0.0
    if not has_syntax_errors:
        score += 0.5
    if follows_style:
        score += 0.5

    return {
        "passed": score >= 0.5,
        "score": score,
        "details": f"Syntax OK: {not has_syntax_errors}, Style OK: {follows_style}"
    }

# Use in your pipeline
result = check_code_quality(llm_output)
langwatch.get_current_span().add_evaluation(
    name="code_quality",
    **result
)
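
The check_syntax and check_style_guide helpers above are placeholders. One possible implementation, assuming the generated code is Python and using a simple line-length rule as the style check (both choices are illustrative):
import ast

def check_syntax(generated_code: str) -> bool:
    # Returns True when the code fails to parse (i.e., has syntax errors)
    try:
        ast.parse(generated_code)
        return False
    except SyntaxError:
        return True

def check_style_guide(generated_code: str, max_line_length: int = 88) -> bool:
    # Returns True when every line fits within the length limit
    return all(len(line) <= max_line_length for line in generated_code.splitlines())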

Semantic Similarity

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(output: str, expected: str) -> float:
    # Embed both texts and compare them with cosine similarity
    embeddings = model.encode([output, expected])
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return float(similarity)

# Use in experiment
score = semantic_similarity(output, row["expected"])
experiment.log(
    name="semantic_similarity",
    index=index,
    data={"output": output, "expected": row["expected"]},
    score=score,
    passed=score > 0.8
)

Business Rule Validation

def validate_response(response: str, context: dict) -> dict:
    issues = []

    # Check for required elements
    if context.get("require_disclaimer") and "disclaimer" not in response.lower():
        issues.append("Missing required disclaimer")

    # Check length constraints
    if len(response) > context.get("max_length", 1000):
        issues.append("Response too long")

    # Check for prohibited content
    for word in context.get("prohibited_words", []):
        if word.lower() in response.lower():
            issues.append(f"Contains prohibited word: {word}")

    return {
        "passed": len(issues) == 0,
        # Each issue deducts 0.2; clamp so the score stays in the 0-1 range
        "score": max(0.0, 1.0 - len(issues) * 0.2),
        "details": "; ".join(issues) if issues else "All checks passed"
    }
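
As in the code quality example, the returned dict maps directly onto add_evaluation’s fields (the context values here are illustrative):
# Use in your pipeline
result = validate_response(llm_output, {"require_disclaimer": True, "max_length": 500})
langwatch.get_current_span().add_evaluation(
    name="business_rules",
    **result
)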

Combining with Built-in Evaluators

You can use custom scoring alongside built-in evaluators:
@langwatch.span()
def my_llm_step(user_input: str):
    output = my_llm(user_input)

    # Built-in evaluator
    langwatch.evaluation.evaluate(
        "presidio/pii_detection",
        name="PII Check",
        data={"output": output},
    )

    # Custom evaluation
    business_score = my_business_rules_check(output)
    langwatch.get_current_span().add_evaluation(
        name="business_rules",
        passed=business_score > 0.8,
        score=business_score,
    )

    return output

Viewing Custom Scores

Custom scores appear in:
  • Trace Details - Under the Evaluations section
  • Analytics Dashboard - Filterable by evaluation name
  • Experiments - In the results table alongside other evaluators

Next Steps

  • Built-in Evaluators
  • Saved Evaluators
  • Experiments
  • Evaluations Overview