Custom scoring lets you send evaluation results from your own code to LangWatch. This is useful when you have proprietary evaluation logic or domain-specific metrics, or when you want to integrate an existing evaluation system.
When to use Custom Scoring:
- You have your own evaluation logic (deterministic or ML-based)
- You’re integrating an existing evaluation system
- You need domain-specific metrics that aren’t covered by built-in evaluators
- You want to track any custom metric alongside your traces
How It Works
With custom scoring, you:
- Run your own evaluation logic
- Send the results (score, passed, label, details) to LangWatch
- View results in traces, analytics, and dashboards
```
Your Code → Your Evaluation Logic → Score/Pass/Fail → LangWatch
                                                          ↓
                                        Traces, Analytics, Alerts
```
Sending Custom Scores
On a Trace/Span
Attach evaluation results to the current trace or span:
Python:

```python
import langwatch

@langwatch.span()
def my_llm_step(user_input: str):
    output = my_llm(user_input)

    # Run your custom evaluation
    score = my_custom_evaluator(user_input, output)
    is_valid = score > 0.7

    # Send results to LangWatch
    langwatch.get_current_span().add_evaluation(
        name="my_custom_metric",
        passed=is_valid,
        score=score,
        details="Custom evaluation based on domain rules",
    )

    return output
```
TypeScript:

```typescript
import { LangWatch } from "langwatch";

const langwatch = new LangWatch();

async function myLLMStep(userInput: string): Promise<string> {
  return await langwatch.trace({ name: "my-trace" }, async (span) => {
    const output = await myLLM(userInput);

    // Run your custom evaluation
    const score = myCustomEvaluator(userInput, output);
    const isValid = score > 0.7;

    // Send results to LangWatch
    span.addEvaluation({
      name: "my_custom_metric",
      passed: isValid,
      score: score,
      details: "Custom evaluation based on domain rules",
    });

    return output;
  });
}
```
REST API:

Send evaluation results directly via the collector API:

```bash
curl -X POST "https://app.langwatch.ai/api/collector" \
  -H "X-Auth-Token: $LANGWATCH_API_KEY" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "trace_id": "your-trace-id",
  "evaluations": [{
    "name": "my_custom_metric",
    "passed": true,
    "score": 0.85,
    "details": "Custom evaluation result"
  }]
}
EOF
```
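If you prefer to stay in Python without the SDK, the same request can be made with the requests library. A minimal sketch mirroring the curl call above (endpoint, header, and payload are taken directly from it):

```python
import os
import requests

# Same payload as the curl example above
payload = {
    "trace_id": "your-trace-id",
    "evaluations": [{
        "name": "my_custom_metric",
        "passed": True,
        "score": 0.85,
        "details": "Custom evaluation result",
    }],
}

response = requests.post(
    "https://app.langwatch.ai/api/collector",
    headers={"X-Auth-Token": os.environ["LANGWATCH_API_KEY"]},
    json=payload,  # requests sets Content-Type: application/json automatically
)
response.raise_for_status()
```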
In Experiments
Log custom scores during batch evaluation:
```python
import langwatch

experiment = langwatch.experiment.init("my-experiment")

for index, row in experiment.loop(df.iterrows()):
    output = my_llm(row["input"])

    # Run your custom evaluation
    score = my_custom_evaluator(row["input"], output, row["expected"])

    # Log the custom score
    experiment.log(
        name="my_custom_metric",
        index=index,
        data={"input": row["input"], "output": output},
        score=score,
        passed=score > 0.7,
        details="Custom domain-specific evaluation",
    )
```
Evaluation Result Fields
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Identifier for this evaluation (shown in the UI) |
| `passed` | boolean | No | Whether the evaluation passed |
| `score` | number | No | Numeric score (typically 0–1) |
| `label` | string | No | Category label (e.g., “positive”, “negative”) |
| `details` | string | No | Human-readable explanation |

At least one of `passed`, `score`, or `label` should be provided for meaningful results.
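For instance, a purely categorical check can send only a label and details. A minimal sketch (the `classify_sentiment` helper is hypothetical):

```python
# Hypothetical helper returning "positive", "negative", or "neutral"
sentiment = classify_sentiment(output)

langwatch.get_current_span().add_evaluation(
    name="response_sentiment",
    label=sentiment,
    details=f"Classifier tagged the response as {sentiment}",
)
```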
Example Use Cases
Code Quality Check
```python
def check_code_quality(generated_code: str) -> dict:
    # Your custom logic
    has_syntax_errors = check_syntax(generated_code)
    follows_style = check_style_guide(generated_code)

    score = 0.0
    if not has_syntax_errors:
        score += 0.5
    if follows_style:
        score += 0.5

    return {
        "passed": score >= 0.5,
        "score": score,
        "details": f"Syntax OK: {not has_syntax_errors}, Style OK: {follows_style}",
    }

# Use in your pipeline
result = check_code_quality(llm_output)
langwatch.get_current_span().add_evaluation(
    name="code_quality",
    **result,
)
```
Semantic Similarity
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(output: str, expected: str) -> float:
    embeddings = model.encode([output, expected])
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return float(similarity)

# Use in an experiment loop
score = semantic_similarity(output, row["expected"])
experiment.log(
    name="semantic_similarity",
    index=index,
    data={"output": output, "expected": row["expected"]},
    score=score,
    passed=score > 0.8,
)
```
Business Rule Validation
```python
def validate_response(response: str, context: dict) -> dict:
    issues = []

    # Check for required elements
    if context.get("require_disclaimer") and "disclaimer" not in response.lower():
        issues.append("Missing required disclaimer")

    # Check length constraints
    if len(response) > context.get("max_length", 1000):
        issues.append("Response too long")

    # Check for prohibited content
    for word in context.get("prohibited_words", []):
        if word.lower() in response.lower():
            issues.append(f"Contains prohibited word: {word}")

    return {
        "passed": len(issues) == 0,
        # Deduct 0.2 per issue, clamped so the score stays within 0-1
        "score": max(0.0, 1.0 - len(issues) * 0.2),
        "details": "; ".join(issues) if issues else "All checks passed",
    }
```
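As with the code quality example, the returned dict unpacks straight into `add_evaluation`. A minimal sketch with illustrative context values:

```python
# Illustrative context values; tailor the rules to your domain
context = {
    "require_disclaimer": True,
    "max_length": 800,
    "prohibited_words": ["guaranteed"],
}

result = validate_response(llm_output, context)
langwatch.get_current_span().add_evaluation(
    name="business_rules",
    **result,
)
```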
Combining with Built-in Evaluators
You can use custom scoring alongside built-in evaluators:
```python
import langwatch

@langwatch.span()
def my_llm_step(user_input: str):
    output = my_llm(user_input)

    # Built-in evaluator
    langwatch.evaluation.evaluate(
        "presidio/pii_detection",
        name="PII Check",
        data={"output": output},
    )

    # Custom evaluation
    business_score = my_business_rules_check(output)
    langwatch.get_current_span().add_evaluation(
        name="business_rules",
        passed=business_score > 0.8,
        score=business_score,
    )

    return output
```
Viewing Custom Scores
Custom scores appear in:
- Trace Details - Under the Evaluations section
- Analytics Dashboard - Filterable by evaluation name
- Experiments - In the results table alongside other evaluators
Next Steps