Custom scoring lets you send evaluation results from your own code to LangWatch. This is useful when you have proprietary evaluation logic, domain-specific metrics, or want to integrate existing evaluation systems.
When to use Custom Scoring:
- You have your own evaluation logic (deterministic or ML-based)
- You’re integrating an existing evaluation system
- You need domain-specific metrics that aren’t covered by built-in evaluators
- You want to track any custom metric alongside your traces
## How It Works
With custom scoring, you:
- Run your own evaluation logic
- Send the results (score, passed, label, details) to LangWatch
- View results in traces, analytics, and dashboards
```
Your Code → Your Evaluation Logic → Score/Pass/Fail → LangWatch
                                                          ↓
                                          Traces, Analytics, Alerts
```
## Sending Custom Scores

### On a Trace/Span
Attach evaluation results to the current trace or span:
**Python**

```python
import langwatch

@langwatch.span()
def my_llm_step(user_input: str):
    output = my_llm(user_input)

    # Run your custom evaluation
    score = my_custom_evaluator(user_input, output)
    is_valid = score > 0.7

    # Send results to LangWatch
    langwatch.get_current_span().add_evaluation(
        name="my_custom_metric",
        passed=is_valid,
        score=score,
        details="Custom evaluation based on domain rules",
    )

    return output
```
**TypeScript**

```typescript
import { LangWatch } from "langwatch";

const langwatch = new LangWatch();

async function myLLMStep(userInput: string): Promise<string> {
  return await langwatch.trace({ name: "my-trace" }, async (span) => {
    const output = await myLLM(userInput);

    // Run your custom evaluation
    const score = myCustomEvaluator(userInput, output);
    const isValid = score > 0.7;

    // Send results to LangWatch
    span.addEvaluation({
      name: "my_custom_metric",
      passed: isValid,
      score: score,
      details: "Custom evaluation based on domain rules",
    });

    return output;
  });
}
```
**REST API**

Send evaluation results directly via the collector API:

```bash
curl -X POST "https://app.langwatch.ai/api/collector" \
  -H "X-Auth-Token: $LANGWATCH_API_KEY" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "trace_id": "your-trace-id",
  "evaluations": [{
    "name": "my_custom_metric",
    "passed": true,
    "score": 0.85,
    "details": "Custom evaluation result"
  }]
}
EOF
```
### In Experiments
Log custom scores during batch evaluation:
```python
import langwatch

experiment = langwatch.experiment.init("my-experiment")

for index, row in experiment.loop(df.iterrows()):
    output = my_llm(row["input"])

    # Run your custom evaluation
    score = my_custom_evaluator(row["input"], output, row["expected"])

    # Log the custom score
    experiment.log(
        name="my_custom_metric",
        index=index,
        data={"input": row["input"], "output": output},
        score=score,
        passed=score > 0.7,
        details="Custom domain-specific evaluation",
    )
```
## Evaluation Result Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Identifier for this evaluation (shows in the UI) |
| `passed` | boolean | No | Whether the evaluation passed |
| `score` | number | No | Numeric score (typically 0-1) |
| `label` | string | No | Category label (e.g., "positive", "negative") |
| `details` | string | No | Human-readable explanation |

At least one of `passed`, `score`, or `label` should be provided for meaningful results.
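For instance, a categorical check can report only a `label` rather than a numeric score. The following is a minimal sketch, assuming `add_evaluation` accepts a `label` keyword matching the field table above, and using a hypothetical `classify_sentiment` helper:

```python
import langwatch

@langwatch.span()
def classify_step(user_input: str):
    output = my_llm(user_input)

    # Hypothetical classifier returning e.g. "positive", "neutral", or "negative"
    sentiment = classify_sentiment(output)

    # A label-only evaluation: no score or passed flag is required
    langwatch.get_current_span().add_evaluation(
        name="sentiment",
        label=sentiment,
        details=f"Classified output as {sentiment}",
    )

    return output
```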
## Example Use Cases

### Code Quality Check
```python
def check_code_quality(generated_code: str) -> dict:
    # Your custom logic
    has_syntax_errors = check_syntax(generated_code)
    follows_style = check_style_guide(generated_code)

    score = 0.0
    if not has_syntax_errors:
        score += 0.5
    if follows_style:
        score += 0.5

    return {
        "passed": score >= 0.5,
        "score": score,
        "details": f"Syntax OK: {not has_syntax_errors}, Style OK: {follows_style}",
    }

# Use in your pipeline
result = check_code_quality(llm_output)
langwatch.get_current_span().add_evaluation(
    name="code_quality",
    **result
)
```
### Semantic Similarity
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(output: str, expected: str) -> float:
    embeddings = model.encode([output, expected])
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return float(similarity)

# Use in an experiment loop
score = semantic_similarity(output, row["expected"])
experiment.log(
    name="semantic_similarity",
    index=index,
    data={"output": output, "expected": row["expected"]},
    score=score,
    passed=score > 0.8,
)
```
### Business Rule Validation
```python
def validate_response(response: str, context: dict) -> dict:
    issues = []

    # Check for required elements
    if context.get("require_disclaimer") and "disclaimer" not in response.lower():
        issues.append("Missing required disclaimer")

    # Check length constraints
    if len(response) > context.get("max_length", 1000):
        issues.append("Response too long")

    # Check for prohibited content
    for word in context.get("prohibited_words", []):
        if word.lower() in response.lower():
            issues.append(f"Contains prohibited word: {word}")

    return {
        "passed": len(issues) == 0,
        # Deduct 0.2 per issue, clamped so the score never drops below 0
        "score": max(0.0, 1.0 - len(issues) * 0.2),
        "details": "; ".join(issues) if issues else "All checks passed",
    }
```
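As with the code quality check above, the returned dict can be unpacked straight into `add_evaluation`. A minimal usage sketch, assuming the same span context as the earlier examples and an illustrative `context` dict:

```python
context = {
    "require_disclaimer": True,
    "max_length": 500,
    "prohibited_words": ["guarantee"],
}

# Use in your pipeline
result = validate_response(llm_output, context)
langwatch.get_current_span().add_evaluation(
    name="business_rules",
    **result
)
```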
## Combining with Built-in Evaluators
You can use custom scoring alongside built-in evaluators:
```python
@langwatch.span()
def my_llm_step(user_input: str):
    output = my_llm(user_input)

    # Built-in evaluator
    langwatch.evaluation.evaluate(
        "presidio/pii_detection",
        name="PII Check",
        data={"output": output},
    )

    # Custom evaluation
    business_score = my_business_rules_check(output)
    langwatch.get_current_span().add_evaluation(
        name="business_rules",
        passed=business_score > 0.8,
        score=business_score,
    )

    return output
```
## Viewing Custom Scores
Custom scores appear in:
- **Trace Details** - under the Evaluations section
- **Analytics Dashboard** - filterable by evaluation name
- **Experiments** - in the results table alongside other evaluators