How to evaluate an LLM when you don't have defined answers
Measure LLM performance with an LLM-as-a-judge when no ground-truth answers exist, enabling scalable AI agent evaluations.
For some AI applications, it's not possible to define a golden answer. This happens, for example, in creative tasks, where there is no single correct answer. In the video below, we show how to use LangWatch Experiments via the UI to evaluate a Business Coaching Agent: we don't have predefined answers, but we can use an LLM-as-a-judge to score the quality of the responses.
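If you want to experiment with the same idea in code, the core of an LLM-as-a-judge evaluation is a judge prompt with an explicit rubric, applied to each question/answer pair. Below is a minimal sketch using the OpenAI Python SDK; the judge model, rubric, and scoring format are illustrative assumptions and not the exact configuration shown in the video, where the LangWatch Experiments UI handles this without any code.

```python
# Minimal LLM-as-a-judge sketch using the OpenAI Python SDK.
# The rubric, model name, and 1-5 scoring scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating the answer of a business coaching agent.
There is no single correct answer; judge the response on its own merits.

Criteria:
- Actionability: does it give concrete, practical advice?
- Relevance: does it address the user's actual question?
- Clarity: is it well-structured and easy to follow?

Question:
{question}

Answer:
{answer}

Reply with a score from 1 (poor) to 5 (excellent) and a one-sentence
justification, in the format: "SCORE: <n> - <justification>".
"""

def judge_answer(question: str, answer: str) -> str:
    """Ask a judge model to grade an answer without a golden reference."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(question=question, answer=answer),
            }
        ],
        temperature=0,  # keep scoring as deterministic as possible
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(judge_answer(
        "How do I prioritize tasks as a first-time founder?",
        "Start by listing everything on your plate, then apply an "
        "impact/effort matrix and commit to the top three items each week.",
    ))
```

In practice you would parse the score out of the judge's reply and aggregate it across a dataset of test questions; this aggregation and comparison across experiment runs is what the LangWatch Experiments UI in the video does for you.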