
LLMs as Judges: Evaluating AI Outputs

Key Points

  • LLMs can be used as judges to evaluate AI‑generated text, offering a scalable alternative to slow manual labeling.
  • There are two main reference‑free judging methods: direct assessment (using a predefined rubric) and pairwise comparison (asking which of two outputs is better), each suited to different tasks.
  • User research on the open‑source EvalAssist framework found roughly half of participants prefer direct assessment for clarity and control, a quarter favor pairwise comparison for subjective tasks, and the rest combine both approaches.
  • Benefits of LLM judges include high throughput for large output sets and flexibility to adapt rubrics as evaluation criteria evolve, though the optimal strategy depends on the specific task and user needs.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=trfUBIDeI1Y](https://www.youtube.com/watch?v=trfUBIDeI1Y)
**Duration:** 00:05:56

## Sections

- [00:00:00](https://www.youtube.com/watch?v=trfUBIDeI1Y&t=0s) **LLMs as Evaluators of AI Text** - The segment explains how large language models can serve as scalable judges for AI‑generated outputs, detailing direct rubric‑based assessment and pairwise comparison methods along with their benefits and drawbacks.
- [00:03:04](https://www.youtube.com/watch?v=trfUBIDeI1Y&t=184s) **LLM as Judge: Benefits and Biases** - The passage explains how using large language models to evaluate outputs offers flexible, reference‑free assessment and rubric refinement, but also warns of inherent biases such as positional and verbosity bias that can skew judgments.
[0:00] How can you evaluate all of the text that AI spits out? Traditional metrics might not cut it for your task, and manual labeling takes a really long time. Enter LLM-as-a-judge, or LLMs judging other LLM outputs. If you've ever tried manually labeling hundreds of outputs, whether chatbot replies or summaries, you know that it's a lot of work. Now imagine an AI that can scale, adapt, and explain its judgments. In this video, we're going to look at how LLMs evaluate outputs. The video is split into three parts: LLM-as-a-judge strategies, some benefits of using LLM as a judge, and some drawbacks.

[0:42] When it comes to reference-free evaluation, there are two main ways to leverage LLM as a judge. First, we have direct assessment, in which you design a rubric. And we also have pairwise comparison, in which you ask the model: which option is better, A or B?

[1:04] Let's start with direct assessment. Suppose you're evaluating a bunch of outputs, say summaries, for coherence and clarity. Direct assessment hinges on designing a rubric. So you might design a rubric where you ask: is this summary clear and coherent? With two options: yes, the summary is clear; no, the summary is not clear. Each of your outputs will be evaluated based on the rubric you've designed.

[1:32] Now let's talk about pairwise comparison. In pairwise comparison, your focus is on comparing two different outputs instead of assigning a standalone label as in direct assessment. So if your focus is on clarity, you're asking the model: which of these outputs is better, option A or option B? When there are multiple outputs, you can then use a ranking algorithm to turn the pairwise comparisons into an overall ranking.

[1:58] Which of these strategies is better for the task you're trying to accomplish?
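The rubric-based direct assessment described above can be sketched as a prompt template. This is a minimal, hypothetical sketch: the rubric wording and the `build_direct_assessment_prompt` helper are illustrative assumptions, not part of any particular framework.

```python
# Hypothetical direct-assessment rubric for judging summary clarity.
# A real judge would send this prompt to an LLM and parse its answer.
RUBRIC = """You are evaluating a summary for clarity and coherence.
Answer with exactly one option:
- Yes: the summary is clear and coherent.
- No: the summary is not clear and coherent."""

def build_direct_assessment_prompt(summary: str) -> str:
    """Combine the fixed rubric with the output under evaluation."""
    return f"{RUBRIC}\n\nSummary:\n{summary}\n\nAnswer (Yes/No):"

prompt = build_direct_assessment_prompt("The report covers Q3 revenue and churn.")
```

Because the rubric is just a string, refining your criteria later (as the video discusses) only means editing the rubric, not relabeling by hand.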
[2:02] Well, our user research on the newly open-sourced framework EvalAssist showed that about half of the participants preferred direct assessment for its clarity and the control it gives them over their rubric. About a quarter preferred pairwise comparison, especially for subjective tasks. And the remainder preferred a combined approach: using direct assessment for compliance, then leveraging the ranking algorithm that comes with pairwise comparison to select the best output. Ultimately, the choice was both task- and user-dependent.

[2:35] Now, for some reasons why you might want to use LLM as a judge. First, it scales. If you're generating hundreds or even thousands of outputs with a variety of models and prompts, you probably don't want to evaluate them all by hand. LLM as a judge can handle that volume and give you structured feedback and evaluations quickly.

[2:57] Second, LLM as a judge is really flexible. Traditional modes of evaluation are rigid. So let's say you build a rubric and start evaluating a bunch of your outputs. As you see more data, it's normal for your criteria to start shifting, and you might want to make changes to your rubric. LLM as a judge helps with that criteria-refinement process: you can refine your prompts and stay flexible in your evaluations.

[3:27] And lastly, there's nuance. Traditional metrics like BLEU and ROUGE focus on word overlap, which is nice if you have a reference. But what if you don't have a reference? What if you want to ask a question like: is my output natural? Does it sound human? LLM as a judge lets you do these evaluations on more subjective outputs without a reference.

[3:50] But of course, there are drawbacks to using LLM as a judge.
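The ranking step mentioned for pairwise comparison can be sketched simply: compare every pair, count wins, and sort. This is one assumed approach (win counting); the `judge` callable here is a stand-in, and in practice it would call an LLM.

```python
from itertools import combinations

def rank_by_pairwise_wins(outputs, judge):
    """Compare every pair of outputs with a judge and rank by win count.
    `judge(a, b)` returns "A" if the first argument wins, else "B"."""
    wins = {o: 0 for o in outputs}
    for a, b in combinations(outputs, 2):
        winner = a if judge(a, b) == "A" else b
        wins[winner] += 1
    return sorted(outputs, key=lambda o: wins[o], reverse=True)

# Stand-in judge that prefers the shorter output, purely for illustration.
mock_judge = lambda a, b: "A" if len(a) <= len(b) else "B"
ranking = rank_by_pairwise_wins(
    ["short", "a medium one", "the longest output here"], mock_judge
)
# With this stand-in judge, "short" wins both of its comparisons and ranks first.
```

More sophisticated ranking schemes exist (e.g. Elo-style ratings), but win counting is enough to show how pairwise verdicts become an overall ordering.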
[3:54] Just like humans, LLMs have their blind spots, and these show up as different types of biases. For example, there's positional bias. This means an LLM will keep favoring an output in a particular position even if its content is not actually better. So let's say, in the pairwise comparison case, you're asking the model: which is better, option A or option B? And it continuously favors option A regardless of what option A contains. That is positional bias.

[4:27] There's also verbosity bias, which happens when an evaluator continuously favors the longer output regardless of its quality. The longer output can be repetitive or go off track, but the model keeps favoring it because it treats length as quality.

[4:50] There's also the case where a model favors an output because it recognizes that it created the output itself. This is called self-enhancement bias. So let's say you have a bunch of outputs from different models, and one model continuously favors the output it created itself, even though the content is not necessarily better.

[5:10] These sorts of biases can skew your results: a model can favor an output because it's longer or because it's in a particular position, not because it's better. But good frameworks are built to catch these mistakes. For example, you can run positional swaps and see if the judgment changes: move an output from position A to position B and check whether the model still selects it as the best output. Bias in LLMs doesn't mean the system is completely broken; it just means you need to stay vigilant.
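The positional-swap check described above can be sketched as: run the judge twice with the two outputs in opposite orders, and flag the pair if the winner changes. The `judge` interface below is an illustrative assumption, matching the pairwise convention of returning "A" or "B".

```python
def consistent_under_swap(a, b, judge):
    """Judge the pair twice with positions swapped.
    Returns True if the same underlying output wins both times,
    i.e. no positional bias was detected on this pair."""
    first_winner = a if judge(a, b) == "A" else b
    second_winner = b if judge(b, a) == "A" else a  # positions swapped
    return first_winner == second_winner

# A deliberately biased stand-in judge that always picks position A:
always_a = lambda a, b: "A"
consistent_under_swap("output one", "output two", always_a)
# The verdict flips when positions swap, so this judge fails the check.
```

When the check fails, frameworks typically discard the verdict or average over both orderings rather than trusting a single, position-dependent judgment.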
[5:47] So if you're tired of manually evaluating outputs, LLM as a judge might be a good option for scalable, transparent, and nuanced evaluation.