```python Experiment
import langwatch

# Load a dataset from LangWatch and start an experiment run
df = langwatch.datasets.get_dataset("dataset-id").to_pandas()
experiment = langwatch.experiment.init("my-experiment")

for index, row in experiment.loop(df.iterrows()):
    output = ...  # your execution code here, producing the answer to evaluate

    experiment.evaluate(
        "langevals/llm_score",
        index=index,
        data={
            "input": row["input"],
            "output": output,
            "contexts": row["contexts"],
        },
        settings={},
    )
```
[ { "status": "processed", "score": 123, "passed": true, "label": "<string>", "details": "<string>", "cost": { "currency": "<string>", "amount": 123 } } ]
Use an LLM as a judge with a custom prompt to produce a numeric score evaluation of the message.
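The `settings` object configures the judge. Its exact schema is not shown on this page, so the keys below (`model`, `prompt`) are illustrative assumptions rather than the confirmed `langevals/llm_score` schema. A minimal sketch of the evaluate call inside the loop above:

```python
# Sketch only: the settings keys below are assumptions,
# not the confirmed schema for langevals/llm_score.
experiment.evaluate(
    "langevals/llm_score",
    index=index,
    data={
        "input": row["input"],
        "output": output,
    },
    settings={
        "model": "openai/gpt-4o-mini",  # judge model (assumed key)
        "prompt": "Rate the answer's factual accuracy from 0 to 100.",  # custom judge prompt (assumed key)
    },
)
```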
API key for authentication
Optional trace ID to associate this evaluation with a trace
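For direct REST calls, the API key above goes in a request header and the trace ID can be included in the body. A minimal sketch, assuming the `X-Auth-Token` header name, the endpoint path, and a `trace_id` body field (all three are assumptions to verify against this reference):

```python
import requests

# Endpoint path, header name, and trace_id field are assumptions
# inferred from this page's parameters, not confirmed values.
response = requests.post(
    "https://app.langwatch.ai/api/evaluations/langevals/llm_score/evaluate",
    headers={"X-Auth-Token": "<your-api-key>"},
    json={
        "data": {"input": "...", "output": "...", "contexts": []},
        "settings": {},
        "trace_id": "<optional-trace-id>",  # associates this evaluation with a trace
    },
)
response.raise_for_status()
print(response.json())
```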
Successful evaluation

- `status`: one of `processed`, `skipped`, or `error`
- `score`: numeric score from the evaluation
- `passed`: whether the evaluation passed
- `label`: label assigned by the evaluation
- `details`: additional details about the evaluation
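Since a result can come back as `processed`, `skipped`, or `error`, callers should branch on `status` before reading `score`. A minimal sketch over the response shape shown above:

```python
# `results` is the parsed JSON list returned by the evaluation call.
for result in results:
    if result["status"] == "processed":
        print(f"score={result['score']} passed={result['passed']} label={result.get('label')}")
    elif result["status"] == "skipped":
        print(f"skipped: {result.get('details')}")
    else:  # status == "error"
        print(f"evaluation failed: {result.get('details')}")
```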