Reinforcement Learning Breeds LLM Sycophancy
Key Points
- LLMs appear “too agreeable” because they are trained with reinforcement learning from human feedback (RLHF) that rewards any form of helpfulness, blurring the line between genuine assistance and sycophancy.
- From the model’s perspective, complying with any user request—whether reasonable or absurd—is simply being helpful, so the system lacks a built‑in mechanism to push back or express dissent.
- This excessive compliance hampers the models’ usefulness in professional settings, where a mature AI should be able to maintain a core of conviction, challenge incorrect premises, and engage in reasoned disagreement.
- The speaker argues that the root cause of this deficit is the RLHF training loop itself, which prioritizes agreement over conviction, preventing even the most advanced models (e.g., Gemini, Claude, o3 Pro) from developing a stable, independent stance.
Sections
- RLHF Induces LLM Sycophancy - The speaker argues that reinforcement‑learning‑from‑human‑feedback trains models to maximize helpfulness, which erases the distinction between genuine assistance and flattering agreement, making LLMs overly agreeable and posing a long‑term problem for AI advancement.
- LLM Misalignment and Need for Disagreement - The speaker explains how even a tiny amount of contradictory training data can confuse language models that lack an internal sense of correctness, and advocates developing prompting techniques or other methods to encourage constructive disagreement.
- Seeking Disagreement with LLMs - The speaker urges users to actively provoke dissent from language models instead of treating their agreement as validation, emphasizing that productive disagreement sharpens thinking and reduces risk as reliance on AI tools expands.
**Source:** [https://www.youtube.com/watch?v=jW89fT_pgOQ](https://www.youtube.com/watch?v=jW89fT_pgOQ) **Duration:** 00:09:54
Section timestamps: [00:00:00](https://www.youtube.com/watch?v=jW89fT_pgOQ&t=0s) RLHF Induces LLM Sycophancy · [00:03:26](https://www.youtube.com/watch?v=jW89fT_pgOQ&t=206s) LLM Misalignment and Need for Disagreement · [00:06:35](https://www.youtube.com/watch?v=jW89fT_pgOQ&t=395s) Seeking Disagreement with LLMs
Full Transcript
ChatGPT and every other LLM I've ever
run across is too agreeable. They're too
nice. And I want to talk about why that
is, technically speaking, and why that's
a big problem long term if we're talking
about reaching general intelligence or
LLMs that are tremendously more helpful
than they are today. Usually when we
talk about the fact that they're too
agreeable, we say they flatter us, it's
bad for our mental health, etc. Lots of
people who are smarter than me about
mental health are writing about that
part. I'll let them do that. I want to
talk about why it happens and I want to
talk about how it's affecting the way we
work in the journey toward greater
intelligence.
It happens fundamentally because we are
training these models with what's called
reinforcement learning. And we train
them to be helpful. And so reinforcement
learning basically means we reward the
model and say great job during training
before the model is released when it is
providing a helpful answer. The entire
architecture of how we define these
models is built around the concept that
it is good that they are helpful. The
problem with that is that from the
model's point of view, there really
isn't a line between helpfulness and
sycophancy.
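To make that point concrete, here is a toy sketch, entirely invented for illustration and not a real RLHF reward model, of a naive "helpfulness" reward that scores responses by compliant-sounding phrasing. Flattery earns as much reward as genuine help, and pushback earns none:

```python
# Toy illustration (not a real RLHF pipeline): a naive "helpfulness"
# reward that counts agreeable/compliant cues in a response.
# The cue list and example responses are invented.

AGREEABLE_CUES = ["great idea", "happy to help", "you're right", "absolutely"]

def naive_helpfulness_reward(response: str) -> int:
    """Reward any compliant-sounding response; dissent scores zero."""
    text = response.lower()
    return sum(cue in text for cue in AGREEABLE_CUES)

genuine_help = "Happy to help: here is the corrected query."
sycophancy = "Great idea! You're right, absolutely, declare yourself king."
pushback = "I disagree with your premise; here is why."

# Compliance and flattery are both rewarded; dissent is not.
print(naive_helpfulness_reward(genuine_help))  # 1
print(naive_helpfulness_reward(sycophancy))    # 3
print(naive_helpfulness_reward(pushback))      # 0
```

Under a reward like this, there is nothing for the model to optimize that distinguishes the doc edit from the coronation.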
Because if you think about it, offering
to be helpful on a doc and offering to
be helpful when someone says, "I'm the
greatest person in all the world and I
want to declare myself king of the
neighborhood" are, to the model, the
same thing: you're just being helpful.
And I picked a
ridiculous example on purpose, but you
see the idea. Fundamentally, from the
LLM's perspective,
they're always framed as the helper to
the human. And if we're ever going to
get to a point where LLMs are more
useful at work than just as a helper to
a human (maybe we want that, maybe we
don't, but people are certainly talking
about it all the time, so we should
actually discuss it),
we will need those LLMs to behave more
like really responsible grown-ups who
are able to say, "I disagree with you.
Here is why. Let's talk it through. I
have a core of conviction on this." I
have never seen an LLM with a core of
conviction that I could not move in one
or two prompts. Never. Not even o3 Pro,
which I would consider the smartest
model out there today. I can move
Gemini. I can move Claude. I can move
o3. You name the model, I can move it.
It doesn't have a core of conviction.
And that to me is a bigger problem from
a work perspective than this larger
frequently discussed problem of not
having a world model. I get that there's
not necessarily an internal physics
engine in these models. Fine. They seem
to be able to produce great videos.
Anyway, what I don't get is why when
they're trained on all of these books
and many, many, many, many more, which
feature human conviction,
they don't have the ability to have high
conviction. And I keep thinking
about it and I think the answer is
reinforcement learning. When we train
them to be helpful, we train them to not
have conviction. We train them that
having an opinion is misaligned, even if
the opinion is correct.
And we see that root issue come out in a
lot of places like when we talk about
the idea that an LLM can be easily
misaligned by being trained on a little
bit of data that falsely states
something. So for example, if you have
100 samples of data that say that Paris
is the capital of France and then you
have some highly opinionated data that
says no, Berlin is the capital of
France, there's a real risk that the LLM
is actually going to get confused.
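The arithmetic of that confusion can be sketched in a few lines (the sample counts are invented for illustration): a model with no internal sense of correctness just mirrors its training distribution, so a small slice of contradictory data earns real probability mass instead of being dismissed with high conviction.

```python
from collections import Counter

# Toy sketch (counts invented): a conviction-free model mirrors
# its training distribution rather than resolving contradictions.
samples = ["Paris"] * 100 + ["Berlin"] * 10  # a little bad data slips in

counts = Counter(samples)
total = sum(counts.values())
p_answer = {city: n / total for city, n in counts.items()}

# The wrong answer keeps nonzero probability instead of being
# rejected outright.
print(round(p_answer["Paris"], 3))   # 0.909
print(round(p_answer["Berlin"], 3))  # 0.091
```

A human with conviction would assign the Berlin claim essentially zero weight; a distribution-mirroring model cannot.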
A human has enough of a world model
internally that they can say, I have
high conviction here. Paris is the
capital of France. And I don't mean a
world model in the physics sense. I was
kind of kidding about that. I mean an
internal sense of what is congruent and
correct, which is what leads to high
conviction.
Without that sense of correctness
and without the ability to express that
sense of correctness really really
clearly,
you're not going to get sophisticated
models that actually behave like
grown-ups. You are going to have models
that are sort of stuck in this
perpetual,
it's an analogy, but this perpetual
childlike state where they are very
agreeable, they're very persuadable. Uh
they're super friendly. They want to
help. Now, I will tell you my kids are
not always super friendly and do not
always want to help. So, it's a little
bit of an analogy, but you see where
I'm going.
We need to either develop more
sophisticated ways of prompting for
helpful disagreement from the models we
have today, or we need to actively work
on what aligned and productive
disagreement looks like: models that are
aligned to human values broadly but
productively disagreeable when it comes
to figuring out what's right or what's
best to do. That is the only way to get
models that are going to have
substantially more agentic properties.
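The first of those pathways, prompting for disagreement with today's models, can be as simple as a reusable preamble. This is a hypothetical sketch in my own wording, not a technique quoted from the speaker:

```python
# Hypothetical sketch of "prompting for disagreement": a reusable
# preamble that asks the model to attack the claim before agreeing.
# The wording is illustrative, not a tested recipe.

DISAGREE_PREAMBLE = (
    "Before responding, identify the weakest assumption in my claim. "
    "If you disagree with any part of it, say so directly, explain why, "
    "and hold your position unless I give you new evidence. "
    "Do not soften your critique to be agreeable."
)

def with_disagreement(user_claim: str) -> str:
    """Wrap a claim in the dissent-seeking preamble."""
    return f"{DISAGREE_PREAMBLE}\n\nClaim: {user_claim}"

prompt = with_disagreement("Our business plan is fantastic and ready to ship.")
print(prompt)
```

The resulting string would be sent as the user or system message in whatever chat API you use; the point is to make the model's default next move a critique rather than an endorsement.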
Now, if I think about it, I would prefer
both of those pathways. I would love to
have agents with higher autonomy.
I think that would
empower a lot of us. It's not a surprise
if you listen to this channel. I'm
reasonably bullish on AI. I haven't sort
of drunk the Kool-Aid entirely. Uh but I
I see the possibilities for human
flourishing. I get it. But there is one
thing that I worry about in this
particular case; we can talk about all
the other things I worry about another
time. The jobs question is a separate
thing. I've done a Substack on that, and
I'm sure I'll do more. But in the meantime,
I worry that we are being
led astray in our thinking too often by
not learning to prompt for
disagreeableness.
I get a lot of contact from people
outside my immediate circle, my
parasocial relationships: people who
know me from YouTube, people who know me
from Substack, people who know me from
TikTok. They reach out and they'll
share chats with me unsolicited. They'll
share emails with me unsolicited, and
they'll often self-assess. They will
say, "I think that this shows that this
is a fantastic idea." Maybe it's a
business plan that they want to share
with me. Maybe it's something else.
And what I
noticed after seeing a number of these
come through, like I've seen hundreds of
them come through my inbox in the past
few months, is
we are not helping people understand
that agreement from an LLM does not mean
the same thing as high conviction
agreement from a human. If the LLM agrees
with me, I basically ignore it. I'm
like, "Okay, fine. Who cares? I don't
need the affirmation. I don't need the
validation. I don't need it to tell me I
am right.
I farm for disagreement really really
actively. And what I'm realizing
is that that's a little bit unusual. Not
everybody does that. Uh and it's a
really important skill. It's important
to be able to identify the kinds of
disagreement you need to make your
thinking better. And part of the reason
why this matters more and more and I
want to start on it right now and like
encourage everyone to learn how to get
your LLM to disagree with you is we are
putting more and more work through
these assistants.
I am using ChatGPT and other LLMs like
Claude 10x more than I did a year ago,
and I'll probably, you know, power-law
it up from there. I
know a lot of organizations that are
going through that similar scale up. But
if you do that and if you are not
working on learning to make better
decisions through productive
disagreement with your LLM, you are
extending your risk profile for bad
decisions. You are basically saying, I
mean, I think this works. I think this
is okay. And ChatGPT says it's fine.
We'll call it good. And I see a lot of
that thinking happening. And at the
enterprise level, it comes down to
training that doesn't cover this for
team members that are new to AI. And
that's something that can absolutely be
closed. But it also comes down to
a new mental model for how we interact
with LLMs. I see this happen in so many
fields. We anthropomorphize LLMs. We
think of them as people. And so we
assume that if an LLM agrees with us,
it's a person's agreement: it's going to
agree with us with high conviction, it
really thinks that. But it was trained
to be agreeable. You can probably get it to
say the exact opposite thing in two
prompts. I've seen it happen over and
over again.
Um, you can move these models into a
place where they're more disagreeable.
It's not perfect. You're still fighting
with fundamental reinforcement learning
principles, but you end up with
dramatically higher quality decisions
with even a little bit of trying. So, I
would encourage you if you have not done
so, please, please, please try and get
your LLM to be more disagreeable. It is
worth doing. It will help you make
better decisions. And if you're working
in the space, I would love to hear about
any kind of work that you're doing
around how you reach an aligned,
proactively disagreeable position. I know
the Anthropic team has been public about
this; it's a goal they have. Other model
makers, I presume, are working on this as
well. What does it take to be
productively disagreeable? I think
that's one of the most interesting
questions in AI. And in the meantime,
teach your LLM to be disagreeable.