Python
Experiment
import langwatch

df = langwatch.datasets.get_dataset("dataset-id").to_pandas()

experiment = langwatch.experiment.init("my-experiment")

for index, row in experiment.loop(df.iterrows()):
    # your execution code here
    experiment.evaluate(
        "ragas/response_context_precision",
        index=index,
        data={
            "input": row["input"],
            "contexts": row["contexts"],
            "output": output,
            "expected_output": row["expected_output"],
        },
        settings={}
    )
[
  {
    "status": "processed",
    "score": 123,
    "passed": true,
    "label": "<string>",
    "details": "<string>",
    "cost": {
      "currency": "<string>",
      "amount": 123
    }
  }
]
Uses an LLM to measure the proportion of chunks in the retrieved context that were relevant to generate the output or the expected output.
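To make "proportion of relevant chunks" concrete, here is a minimal sketch of the arithmetic only. It assumes the per-chunk relevance judgments are already available as booleans; in the real evaluator those judgments come from an LLM, and the exact scoring may differ (for example, it may weight chunks by rank). The function name `context_precision` is hypothetical and is not part of the LangWatch SDK.

```python
from typing import Sequence


def context_precision(relevant_flags: Sequence[bool]) -> float:
    """Return the fraction of retrieved chunks judged relevant.

    Simplified illustration: each flag says whether one retrieved chunk
    was relevant. Returns 0.0 when no chunks were retrieved.
    """
    if not relevant_flags:
        return 0.0
    relevant_count = sum(1 for flag in relevant_flags if flag)
    return relevant_count / len(relevant_flags)


# Hypothetical judgments for four retrieved chunks: three relevant, one not.
score = context_precision([True, True, False, True])
print(score)  # 0.75
```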
API key for authentication
input: The input text to evaluate
contexts: Array of context strings used for RAG evaluation
output: The output/response text to evaluate
expected_output: The expected output for comparison
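The four data fields can be assembled and checked before they are passed as `data` to `experiment.evaluate`. The sketch below is one possible way to do that; `build_eval_data` is a hypothetical helper, not an SDK function, and the sample values are invented for illustration.

```python
from typing import Dict, List, Union

EvalData = Dict[str, Union[str, List[str]]]


def build_eval_data(
    input_text: str,
    contexts: List[str],
    output: str,
    expected_output: str,
) -> EvalData:
    """Build the `data` dict for the evaluator, validating `contexts`."""
    # `contexts` must be a list of strings, not a single string.
    if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
        raise TypeError("contexts must be a list of strings")
    return {
        "input": input_text,
        "contexts": contexts,
        "output": output,
        "expected_output": expected_output,
    }


# Invented sample values, for illustration only.
data = build_eval_data(
    input_text="What is the capital of France?",
    contexts=["Paris is the capital of France.", "France is in Europe."],
    output="Paris",
    expected_output="Paris",
)
print(sorted(data.keys()))
```

Validating early keeps a malformed row from being reported later as a skipped or failed evaluation.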
Successful evaluation
status: One of processed, skipped, or error
score: Numeric score from the evaluation
passed: Whether the evaluation passed
label: Label assigned by the evaluation
details: Additional details about the evaluation
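A response of this shape can be parsed into typed records so that the three status values are handled explicitly. This is a sketch under assumptions: the `EvaluationResult` class and `parse_results` function are hypothetical names, every field other than `status` is treated as optional, and the sample payload is invented to match the response shape shown on this page.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

VALID_STATUSES = {"processed", "skipped", "error"}


@dataclass
class EvaluationResult:
    status: str
    score: Optional[float] = None
    passed: Optional[bool] = None
    label: Optional[str] = None
    details: Optional[str] = None


def parse_results(raw: str) -> List[EvaluationResult]:
    """Parse a JSON array of evaluation results into dataclass records."""
    results = []
    for item in json.loads(raw):
        status = item.get("status")
        if status not in VALID_STATUSES:
            raise ValueError(f"unexpected status: {status!r}")
        results.append(
            EvaluationResult(
                status=status,
                score=item.get("score"),
                passed=item.get("passed"),
                label=item.get("label"),
                details=item.get("details"),
            )
        )
    return results


# Invented sample payload, for illustration only.
sample = """[
  {"status": "processed", "score": 0.75, "passed": true,
   "label": "ok", "details": "3 of 4 chunks relevant"},
  {"status": "skipped", "details": "missing contexts"}
]"""

parsed = parse_results(sample)
for result in parsed:
    if result.status == "processed":
        print(result.score, result.passed)
    else:
        print(result.status, result.details)
```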