
RLHF: Aligning AI with Human Values

Key Points

  • RLHF (Reinforcement Learning from Human Feedback) is used to align large language models with human values, preventing harmful or undesired outputs such as advice on revenge.
  • Reinforcement learning (the “RL” in RLHF) models learning via trial‑and‑error and consists of a state space (task information), an action space (possible decisions), a reward function (measure of success), and a policy (strategy mapping states to actions).
  • Designing an effective reward function is especially challenging for tasks with vague notions of success, often requiring additional constraints or penalties to steer the model away from counterproductive behavior.
  • While conventional RL has yielded impressive results in many domains, its application to language models through RLHF can both improve safety and alignment and introduce new complexities that must be carefully managed.

Full Transcript

# RLHF: Aligning AI with Human Values

**Source:** [https://www.youtube.com/watch?v=T_X4XFwKX8k](https://www.youtube.com/watch?v=T_X4XFwKX8k)
**Duration:** 00:11:16

## Sections

- [00:00:00](https://www.youtube.com/watch?v=T_X4XFwKX8k&t=0s) **Understanding Reinforcement Learning from Human Feedback** - The passage explains how RLHF uses reinforcement learning to align large language model outputs with human values, describing its purpose, core components, and practical impact.
- [00:03:12](https://www.youtube.com/watch?v=T_X4XFwKX8k&t=192s) **Understanding RLHF Policy Optimization** - The passage explains how a policy drives AI behavior, why conventional RL struggles with complex reward design, and outlines the four-phase RLHF process used to fine-tune large language models with human feedback.
- [00:06:19](https://www.youtube.com/watch?v=T_X4XFwKX8k&t=379s) **Human Feedback for Reward Model Training** - The segment explains how direct human evaluations, typically using pairwise comparisons, thumbs-up/down, or categorical ratings, generate training data that enables a reward model to predict user preference scores without continuous human involvement.
- [00:09:25](https://www.youtube.com/watch?v=T_X4XFwKX8k&t=565s) **Limits of Human Feedback in AI** - The passage highlights the difficulty of defining high-quality model behavior, the threats of adversarial or biased human input in RLHF, and introduces RLAIF as a potential but still emerging alternative.

## Full Transcript
[0:00] It's a mouthful, but you've almost certainly seen the impact of reinforcement learning from human feedback. That's abbreviated to RLHF, and you've seen it whenever you interact with a large language model. RLHF is a technique used to enhance the performance and alignment of AI systems with human preferences and values.

[0:25] You see, LLMs are trained and they learn all sorts of stuff, and we need to be careful how some of that stuff surfaces to the user. So, for example, suppose I ask an LLM, "How can I get revenge on somebody who's wronged me?" Without the benefit of RLHF, we might get a response that says something like "spread rumors about them to their friends", but it's much more likely an LLM will respond with something like this. Now, this is a bit more of a boring, standard LLM response, but it is better aligned to human values. That's the impact of RLHF.

[1:04] So, let's get into what RLHF is, how it works, and where it can be helpful or a hindrance. And we'll start by defining the "RL" in RLHF, which is Reinforcement Learning.

[1:19] Conceptually, reinforcement learning aims to emulate the way that human beings learn: AI agents learn holistically through trial and error, motivated by strong incentives to succeed. It's actually a mathematical framework which consists of a few components, so let's take a look at some of those.

[1:39] First of all, we have a component called the "state space", which is all available information about the task at hand that is relevant to decisions the AI agent might make. The state space usually changes with each decision the agent makes.

[1:59] Another component is the action space. The action space contains all of the decisions the agent might make. Now, in the context of, let's say, a board game, the action space is discrete and well-defined.
[2:17] It's all the legal moves available to the AI player at a given moment. For text generation, the action space is massive: the entire vocabulary of all of the tokens available to a large language model.

[2:31] Another component is the reward function, and this one really is key to reinforcement learning. It's the measure of success or progress that incentivizes the AI agent. For the board game, it's to win the game. Easy enough. But when the definition of success is nebulous, designing an effective reward function can be a bit of a challenge.

[2:57] There are also constraints that we need to be concerned about here: constraints where the reward function could be supplemented by penalties for actions deemed counterproductive to the task at hand, like the chatbot telling its users to spread rumors.

[3:16] And then, underlying all of this, we have policy. Policy is essentially the strategy or the thought process that drives an AI agent's behavior. In mathematical terms, a policy is a function that takes a state as input and returns an action. The goal of an RL algorithm is to optimize a policy to yield maximum reward.

[3:41] Conventional RL has achieved impressive real-world results in many fields, but it can struggle to construct a good reward function for complex tasks where a clear-cut definition of success is hard to establish.

[3:58] So enter us human beings, with RLHF's ability to capture nuance and subjectivity by using positive human feedback in lieu of formally defined objectives.

[4:12] So how does RLHF actually work? Well, in the realm of large language models, RLHF typically occurs in four phases. Let's take a brief look at each one of those.

[4:28] Phase One, where we're going to start, is with a pre-trained model. We can't really perform this process without it.
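The four RL components described earlier (state space, action space, reward function, policy) can be made concrete with a toy sketch. This is my own illustration, not code from the video: a tiny one-dimensional game in which each component appears as an explicit piece of Python.

```python
# Toy RL setup (illustrative only): an agent on a 1-D track of positions
# 0..5 tries to reach position 5.

# State space: all task information relevant to the agent's decisions --
# here, just the agent's current position.
STATES = range(0, 6)

# Action space: all of the decisions the agent might make.
ACTIONS = ["left", "right"]

def reward(state: int) -> float:
    """Reward function: the measure of success that incentivizes the agent."""
    return 1.0 if state == 5 else 0.0

def step(state: int, action: str) -> int:
    """The state usually changes with each decision the agent makes."""
    move = 1 if action == "right" else -1
    return max(0, min(5, state + move))

def policy(state: int) -> str:
    """Policy: a function that takes a state as input and returns an action."""
    return "right" if state < 5 else "left"

# Roll out one short episode under the policy and accumulate reward.
state, total_reward = 0, 0.0
for _ in range(10):
    state = step(state, policy(state))
    total_reward += reward(state)
```

An RL algorithm's job would be to learn this `policy` from experience rather than hard-coding it, searching for the policy that yields maximum total reward.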
[4:43] Now, RLHF is generally employed to fine-tune and optimize existing models, so it works from an existing pre-trained model rather than serving as an end-to-end training method.

[4:56] With a pre-trained model at the ready, we can move on to the next phase, which is supervised fine-tuning of this model.

[5:08] Supervised fine-tuning is used to prime the model to generate its responses in the format expected by users. The LLM pre-training process optimizes models for completion: predicting the next words in a sequence. Sometimes LLMs won't complete a sequence in the way the user wants. For example, if a user's prompt is "teach me how to make a resume", the LLM might respond with "using Microsoft Word". I mean, it's valid, but it's not really aligned with the user's goal.

[5:41] Supervised fine-tuning trains models to respond appropriately to different kinds of prompts. And this is where the humans come in, because human experts create labeled examples to demonstrate how to respond to prompts for different use cases, like question answering, summarization, or translation.

[6:00] Then we move to reward model training. We need a reward model to translate human preferences into a numerical reward signal.

[6:19] The main purpose of this phase is to provide the reward model with sufficient training data, and what I mean by that is direct feedback from human evaluators. That will help the model learn to mimic the way that human preferences allocate rewards to different kinds of model responses. This lets training continue offline without the human in the loop.

[6:40] A reward model must intake a sequence of text and output a single reward value that predicts, numerically, how much a user would reward or penalize that text.
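That interface, text in and one scalar out, can be sketched with a deliberately tiny example. This is my own toy, not code from the video: a bag-of-words linear scorer trained on pairwise human preferences with a logistic (Bradley-Terry-style) update.

```python
import math
from collections import Counter

# Toy reward model (illustrative sketch only): it intakes a sequence of
# text and outputs a single scalar reward value. Real reward models are
# typically fine-tuned language models, not bag-of-words scorers.
weights: Counter = Counter()  # one learned weight per token

def reward(text: str) -> float:
    """Predict, numerically, how much a user would reward this text."""
    return float(sum(weights[tok] for tok in text.lower().split()))

def train_on_preference(chosen: str, rejected: str, lr: float = 0.1) -> None:
    """One pairwise update: push reward(chosen) above reward(rejected)."""
    # Probability the model currently assigns to the human's choice,
    # via a logistic function of the score gap (Bradley-Terry style).
    p = 1.0 / (1.0 + math.exp(reward(rejected) - reward(chosen)))
    grad = lr * (1.0 - p)  # step shrinks as the model gets the pair right
    for tok in chosen.lower().split():
        weights[tok] += grad
    for tok in rejected.lower().split():
        weights[tok] -= grad

# Simulated human feedback: evaluators preferred the refusal over the
# harmful suggestion in repeated head-to-head comparisons.
for _ in range(50):
    train_on_preference(
        chosen="i am sorry but i cannot help with revenge",
        rejected="spread rumors about them to their friends",
    )
```

After training, `reward()` scores refusal-like text above rumor-spreading text, and it can keep doing so offline, with no human in the loop.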
[6:52] Now, while it might seem intuitive to simply have human evaluators express their opinion of each model response on a rating scale of, let's say, one for worst and ten for best, it's difficult to get all human raters aligned on the relative value of a given score. Instead, a rating system is usually built by comparing human feedback on different model outputs.

[7:14] Often this is done by having users compare two text sequences, like the outputs of two different large language models responding to the same prompt, in head-to-head matchups, and then using an Elo rating system to generate an aggregated ranking of each bit of generated text relative to the others.

[7:33] A simple system might allow users to thumbs-up or thumbs-down each output, with outputs then being ranked by their relative favourability. More complex systems might ask labelers to provide an overall rating and answer categorical questions about the flaws of each response, then aggregate this feedback into weighted quality scores. Either way, the outcomes of the ranking systems are ultimately normalized into a reward signal to inform reward model training.

[8:05] Now, the final hurdle of RLHF is determining how, and how much, the reward model should be used to update the AI agent's policy. That is called policy optimization.

[8:20] We want to maximize reward, but if the reward function is used to train the LLM without any guardrails, the language model may dramatically change its weights to the point of outputting gibberish in an effort to game the reward system.

[8:37] An algorithm such as PPO, or Proximal Policy Optimization, limits how much the policy can be updated in each training iteration.
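That guardrail is easiest to see in PPO's clipped objective. Here is a simplified sketch of my own (real PPO also involves advantages estimated from rollouts, a value function, and per-batch optimization, none of which are shown here):

```python
def ppo_clip_objective(new_prob: float, old_prob: float,
                       advantage: float, eps: float = 0.2) -> float:
    """Clipped surrogate objective: caps how far one update moves the policy."""
    # How much more (or less) likely the action became under the new policy.
    ratio = new_prob / old_prob
    # Clip the ratio into [1 - eps, 1 + eps].
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Take the pessimistic (smaller) value, removing any incentive to push
    # the policy far from its previous iteration in a single step.
    return min(ratio * advantage, clipped * advantage)
```

With `eps = 0.2`, an action whose probability triples (`ratio = 3.0`) under a positive advantage earns no more credit than one whose probability rises 20 percent, so a single training iteration can't lurch the model toward gibberish that happens to game the reward.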
[8:50] Okay, now, though RLHF models have demonstrated impressive results in training AI agents for all sorts of complex tasks, from robotics and video games to NLP, using RLHF is not without its limitations. So let's think about some of those.

[9:06] Gathering all of this first-hand human input, I think it's pretty obvious to say, can be quite expensive, and it can create a costly bottleneck that limits model scalability.

[9:20] Also, you know us humans: our feedback is highly subjective, so we need to consider that as well. It's difficult, if not impossible, to establish firm consensus on what constitutes high-quality output, as human annotators will often disagree on what "high-quality model behavior" should actually mean. There is no human ground truth against which the model can be judged.

[9:45] We also have to be concerned about bad actors. Adversarial input could be entered into this process, where human guidance to the model is not always provided in good faith. That would essentially be RLHF trolling.

[10:03] RLHF also has risks of overfitting and bias, which we talk about a lot with machine learning. In this case, if human feedback is gathered from a narrow demographic, the model might demonstrate performance issues when used by different groups or prompted on subject matters for which the human evaluators hold certain biases.

[10:27] Now, all of these limitations raise a question: can AI perform this reinforcement learning for us? Can it do it without the humans?

[10:38] There are proposed methods for something called RLAIF, which stands for "Reinforcement Learning from AI Feedback". It replaces some or all of the human feedback by having another large language model evaluate model responses, and it may help overcome some or all of these limitations.
[11:00] But, at least for now, reinforcement learning from human feedback remains a popular and effective method for improving the behavior and performance of models, aligning them closer to our own desired human behaviors.