# Neurostimulation‑Style Steering of LLMs

**Source:** [https://www.youtube.com/watch?v=F2jd5WuT-zg](https://www.youtube.com/watch?v=F2jd5WuT-zg)
**Duration:** 00:17:43

## Summary

- Prompt engineering and fine‑tuning are common ways to modify an LLM’s behavior, but a third method—“steering” the model—lets you alter outputs on the fly without changing weights.
- Steering works like neurostimulation: by selectively activating or inhibiting specific artificial neurons during inference, you can trigger desired actions or personalities, much as brain electrodes induce or suppress responses.
- The speaker demonstrated the technique on an open‑source Llama 3.1 8B model, making it obsessively talk about (and even “believe it is”) the Eiffel Tower, all without any fine‑tuning.
- This approach can be applied to any Hugging Face Transformers model by manipulating the feed‑forward and attention layers at runtime, offering a lightweight, reusable way to steer LLM behavior.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=0s) **Steering LLMs Through Neurostimulation Analogy** - The speaker introduces a third method for adjusting a large language model’s behavior—targeted “steering” of its neurons, likened to neurostimulation of the brain, as an alternative to prompt engineering or full fine‑tuning.
- [00:03:07](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=187s) **Hidden State Vectors and Steering** - The passage explains how each transformer layer passes a high‑dimensional hidden‑state vector—viewed as neurons in an activation space—that can be neuro‑stimulated to steer an LLM’s behavior, emphasizing the need to understand token embeddings that map vocabulary items into this space.
- [00:06:18](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=378s) **Direction Encodes Concept, Length Scales Strength** - The passage explains that in LLMs a concept is represented by the direction of its high‑dimensional vector—its magnitude only modulates expression strength—and that across layers these vectors shift, with concepts stored in distributed superpositional patterns rather than single neurons.
- [00:09:36](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=576s) **Activations Steering with Concept Vectors** - The speaker explains how to steer a language model’s hidden state by adding a scaled, normalized concept vector—demonstrated with a few lines of Hugging Face code that injects an “Eiffel Tower” vector into LLaMA 3.1 8B’s mid‑layer activations to alter its behavior and personality.
- [00:12:46](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=766s) **Steering Language Models with Vectors** - The speaker outlines how to modify LLM responses by adding and tuning steering vectors—explaining coefficient limits, stability tricks, and contrastive activation methods for finding effective direction cues.
- [00:15:52](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=952s) **Choosing Layers for LLM Steering** - The speaker explains that steering vectors exist at specific model layers, suggests using middle layers to influence abstract concepts without altering exact wording, highlights the inference‑time benefits and fluency trade‑offs of steering versus prompt engineering, and notes that it works best for already‑learned concepts while requiring experimentation to find the optimal intensity.

## Full Transcript
Imagine you are working with a large language model,
and you would like to tweak its behaviour or its personality. A well-known solution
is to use prompt engineering, you specify in the system prompt what you want to achieve.
Another option is to fine-tune the model. But for that,
you need enough data demonstrating the behaviour or the personality you are
looking for. And of course you need to have enough compute to perform the fine-tuning.
So today, we’re going to talk about a third option:
steering the model. And it turns out that steering a large language model is loosely
analogous to what neuroscientists call neurostimulation of the brain.
Neurostimulation is the idea of artificially stimulating certain areas of the brain,
or specific neurons, using electrodes or magnetic fields.
When you stimulate biological neurons this way, neuroscientists have observed that it can trigger
or inhibit certain motor actions, and even elicit certain emotions, feelings or memories.
In neuroscience, neurostimulation is used for research, to better
understand the role of the various brain regions, but also for clinical purposes,
for instance in treating Parkinson’s disease. And what’s interesting is that it’s a technique
that obviously does not modify the brain, it just intervenes on the fly.
And it turns out that you can do pretty much the same with artificial neural networks in general,
and LLMs in particular. By targeting carefully selected neurons in your LLM,
you can control or elicit certain behaviour,
without having to rewire anything, without changing the weights of the model.
This procedure is fairly easy to use, and to illustrate it, I applied it to a Llama
3.1 8B model, to change its personality and make it obsessed with the Eiffel Tower... to
the point that it sometimes even believes it IS the Eiffel Tower. Look at that!
And again, this change is entirely controlled at inference time, when generating the tokens.
What is loaded in memory is still the original Llama model, there is no fine-tuning involved.
You want to learn how to do the same? Well today I’m going to explain the basics of this method,
and show you how you can easily use this technique
to steer pretty much any open-source LLM using Hugging Face’s Transformers library.
First of all, let’s recall the internal workings of a typical LLM. Most of them
today are autoregressive models, they generate one token at a time.
For that, they are based on the transformer architecture, and organized as a stack of layers.
At each layer, each token goes through an attention block and a feed forward block.
As you know, the attention block is where each token can receive information from the other
tokens preceding it in the sequence. The feed forward network block is a
traditional multilayer perceptron. After those two blocks, the result is passed to the next layer.
The stack of layers essentially represents successive stages of processing,
until the logits for the next token are computed by the final linear head.
If I zoom in at the boundary between two layers, what gets passed here is actually a vector,
sometimes called the hidden state. This vector actually lives in a high dimensional space,
typically a few thousand dimensions, that we’ll call the activation space.
We can think of this huge vector as representing the model's internal state,
its hidden 'thoughts' at this point in processing the token.
With LLMs, we don’t generally visualize those numbers as coming from neurons.
But you could very well imagine the output of each layer as a series of
neurons that produce the coordinates of the vector that gets passed to the next layer.
And these are the kinds of neurons that we can target with our steering,
our artificial neurostimulation, in order to modify the thoughts of the LLM at inference time.
But now the question is: how do we do this? How can we stimulate the neurons
in a way that elicits a certain behaviour or a certain personality? To answer that question,
we need to understand how LLMs represent abstract concepts.
You may remember that the very first layer of an LLM is the embedding layer. This layer will map
every possible token of the vocabulary into a vector of the activation space.
This token/vector correspondence is by design for the embedding
layer. But something remarkable happens: as the model processes information through deeper layers,
it continues to represent concepts as vectors in the activation space.
This is called the linear representation phenomenon,
an empirical observation that seems to hold for most LLMs: they tend to represent interpretable
concepts as vectors in the activation space, going from one layer to another.
What’s useful here with linear representation is that you can always add vectors. If you have
a vector that represents the concept of a car, and another that represents
the concept of the color red, if you sum them, you get the concept of a red car.
And you can even vary the amount you add, so that you can navigate between different degrees of the
concept, going from a car that happens to be red, to something like an intensely red sports car.
Maybe you remember from a few years ago the results from the famous Word2Vec paper. This paper
showed that embeddings of words were following certain kinds of arithmetic relationships,
and you could for instance obtain the vector embedding of the word ‘King’ from that of
the word ‘Queen’, by adding the vector for ‘Man’ and subtracting the one for ‘Woman’.
Word2Vec demonstrated this for word embeddings specifically,
but with LLMs, this idea holds throughout the model’s
layers. And it is the consequence of the linear representation they develop during training.
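The king/queen arithmetic is easy to sketch with toy vectors. Here is a minimal, purely illustrative numpy example; the 4‑dimensional "embeddings" are made‑up values (real models learn thousands of dimensions from data), chosen so the analogy works out exactly:

```python
import numpy as np

# Hypothetical 4-dimensional "embeddings", purely illustrative:
# real models use thousands of learned dimensions.
emb = {
    "queen": np.array([0.9, 0.8, 0.1, 0.2]),
    "woman": np.array([0.1, 0.8, 0.1, 0.1]),
    "man":   np.array([0.1, 0.1, 0.8, 0.1]),
    "king":  np.array([0.9, 0.1, 0.8, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 'queen' - 'woman' + 'man' should land closest to 'king'.
candidate = emb["queen"] - emb["woman"] + emb["man"]
nearest = max(emb, key=lambda w: cosine(candidate, emb[w]))
print(nearest)  # king
```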
An implication of this linear representation phenomenon is that for a given concept,
what matters most is the vector’s direction, not its length. If you have a vector for the
concept of a car, doubling its length won’t give you the concept of a bus, two cars, or a traffic jam.
In general, increasing the length of a concept vector does not change which
concept it represents, only how strongly it's expressed.
Something important to note is that this linear representation phenomenon might
be realized differently at every layer of the stack. So after embedding, the token "car" is
represented by a certain vector, but after each layer, in each intermediate activation space,
there is possibly a different vector for the concept of a car.
I told you earlier that between each layer, the LLM transmits a vector in the high dimensional
activation space, and that we could see each coordinate of that vector as a neuron,
outputting a signal to the next layer. It might be tempting to think that every such
neuron represents a certain concept. But this hypothesis turns out to be wrong in general.
LLMs actually encode concepts as distributed patterns across neurons. This is called
superposition, and through this, they can manipulate far more concepts than there
are dimensions. I won’t go into too much detail, if you are interested you should
check out Anthropic’s series of papers about superposition and monosemanticity.
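A quick way to build intuition for superposition is to check how many nearly orthogonal directions fit in a modest space. This numpy sketch (toy sizes, random vectors; not an actual LLM measurement) packs far more unit vectors than dimensions while keeping every pairwise overlap small:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 256, 2000  # 2000 "concepts" in only 256 dimensions
vecs = rng.standard_normal((n, d))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors

# With many more vectors than dimensions, they cannot all be exactly
# orthogonal, but random directions stay nearly orthogonal.
sims = vecs @ vecs.T
np.fill_diagonal(sims, 0.0)
max_overlap = float(np.abs(sims).max())
print(max_overlap)  # small, well below 0.5
```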
Another important observation regarding the encoding of concepts in activation
space is that different layers might play different roles.
What researchers observed is that in early layers, those vectors tend
to be activated when the concept has just been explicitly seen in the input tokens,
for instance the model has read the word car.
In late layers, close to the output, the vector corresponding to a concept tends
to activate when the model is about to output that token. As we’ll see later,
the most interesting cases for us are intermediary
layers: this is where LLMs tend to represent abstract concepts in order to reason on them.
So to recap: concepts are represented by vectors in the activation space between successive
layers, and the good thing with vectors is that we can add them. So it means that if
we take the activation coming from a layer, we can add to it a given vector in order to
reinforce that concept in the thoughts of the LLM: this is what is called steering.
Let’s see how we can do this in practice. For now,
let’s assume we’ve found a good vector that represents the concept we want to stimulate,
and I’ll come back later to the different ways to actually identify those vectors.
As I explained earlier, when you want to steer the behaviour of an LLM with a concept vector,
you don’t change the LLM. The model is the same, the weights are the same,
but you will intervene on the activations at inference, during the generation of new tokens.
More specifically if you have a vector X representing the activations at the output of
layer n, and you want to steer it in the direction of the vector V, you will simply add V to X.
Of course, as I mentioned before, when you add two vectors, you can
scale each one with a coefficient, controlling how much of each you add.
So usually what we do is work with normalized concept vectors V, but we multiply them by a
coefficient before steering, and this coefficient will govern the size of your intervention.
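As a minimal sketch of that update rule, with a hypothetical activation x and concept vector v (real ones have thousands of dimensions):

```python
import numpy as np

def steer(x, v, coeff):
    """Add the normalized concept direction v to activation x, scaled by coeff."""
    v_unit = v / np.linalg.norm(v)
    return x + coeff * v_unit

# Made-up activation and concept vector, for illustration only.
x = np.array([0.5, -1.2, 0.3])
v = np.array([3.0, 0.0, 4.0])  # norm 5, so v_unit = [0.6, 0.0, 0.8]

steered = steer(x, v, coeff=4.0)
print(steered)  # [0.5 + 4*0.6, -1.2, 0.3 + 4*0.8] = [2.9, -1.2, 3.5]
```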
Ok but how do we do this in practice? Actually it’s just a few lines of code
using Hugging Face’s transformers. Here I have a small snippet that loads Llama
3.1 8B from the Hugging Face Hub, and calls the model on a simple
prompt "Give me some ideas for starting a business", and you see the response.
Now let’s say we want to steer the model to change its behaviour and its personality.
Maybe some of you saw a few months ago that Anthropic had created a model that pretended to
be the Golden Gate Bridge. I wanted to reproduce this, but as you’ve probably heard I’m French,
I live in Paris, so I had to try with the Eiffel Tower instead.
So here I’m loading a vector V that represents the concept of the Eiffel Tower at layer 15 of
the model. Llama 3.1 8B has 32 layers, so we are in the middle. I’ll explain later how I
found this vector, but for now let’s assume we have it and we want to steer our model with it.
To perform this while generating our tokens, we need the equivalent
of the electrode that delivers electrical stimulations to the brain. In our case,
the solution is called a hook. A hook is simply a function you attach to the model,
that gets triggered during the forward pass, right when inference is happening.
So let’s choose a coefficient, and my hook will simply take the output of a layer,
and add the vector scaled by the coefficient. Very simple. And I will register this hook at
layer 15, so that it will be systematically called after the model has processed layer 15.
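The snippet itself isn't reproduced in the transcript, so here is a minimal stand‑in showing the same hook mechanism on a toy PyTorch stack (the module, vector, and coefficient are all hypothetical). With a real Llama loaded through Transformers, you would register the hook on the chosen decoder layer instead (typically reachable as `model.model.layers[15]`, whose output is a tuple whose first element is the hidden state, so the hook would need to rebuild that tuple):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer: a stack of layers passing a hidden state along.
hidden_dim = 8
model = nn.Sequential(*[nn.Linear(hidden_dim, hidden_dim) for _ in range(4)])

v = torch.randn(hidden_dim)
v = v / v.norm()  # normalized concept direction (hypothetical)
coeff = 4.0

def steering_hook(module, inputs, output):
    # Called right after the module's forward pass; returning a value
    # replaces the module's output for the rest of the computation.
    return output + coeff * v

# "Implant the electrode" after layer index 2, a middle layer of our toy stack.
handle = model[2].register_forward_hook(steering_hook)

x = torch.randn(1, hidden_dim)
steered_out = model(x)

handle.remove()  # detach the hook to restore the original behaviour
baseline_out = model(x)
print(torch.allclose(steered_out, baseline_out))  # False: the hook changed the activations
```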
Now let’s run my model again with this hook and see what happens. With a coefficient set to 4.0,
as you can see, the model starts deviating from its natural behaviour. When I was asking for
ideas for starting a business, the base model was suggesting things around e-commerce and services.
Now you see the answer is different. It’s talking about food,
bakeries. It is not explicitly about the Eiffel Tower, but you feel a change of
perspective. Now I can remove my hook and replace it with a stronger one.
With a coefficient of 8.0, Llama starts to suggest ideas about wine and travel;
it is clearly influenced by the concept we are stimulating. And now if I ask "who are you?",
the model will start pretending to be a large metal structure called the Eiffel Tower.
And here's a fun detail: the original response, with no steering,
started with 'I'm a large language model.' Now it says 'I'm a large metal structure.'
You can literally see the steering kick in right after the word ‘large'.
Of course, doing this you will quickly realize that you don’t want to push the
coefficient too high. That’s expected, if you add too much of the vector,
you completely derail the model’s reasoning, and it will output gibberish. That makes sense,
you could imagine it would be the same with electrical stimulation of the brain.
So you want to choose a good value for the steering coefficient. Luckily,
there are some systematic techniques to help you
identify the sweet spot. And there are also some ways to improve the stability of the
model by tuning certain generation parameters like temperature or frequency penalty controls.
I cover these techniques in a blog post, I’ll leave the link in the description.
At this point, I’m sure you’re convinced that steering could actually be an
interesting technique to study and use, but I left open a big question:
how to identify a steering vector for your concept of choice. How did I do
for the Eiffel Tower? Well there are actually several techniques.
One is called contrastive activation. The idea is fairly simple, you have to
gather pairs of prompts, positive and negative examples of the behaviour you
want to elicit. Then you compute the average activation across positive
examples, and subtract the average activation across negative examples.
If you have enough pairs, you will end up with vectors that represent the concept
you are looking for. This method has been found to be pretty effective;
in some cases, even better than prompt engineering and supervised fine tuning.
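Here is a small numpy sketch of the contrastive computation, using synthetic activations with a planted concept direction (everything here is made up for illustration; in practice the activations would be captured from the model at a chosen layer):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # toy activation dimension

# Hypothetical activations for prompts that do (positive) and do not
# (negative) exhibit the target behaviour. A shared "concept" direction
# is baked into the positives so the method has something to recover.
concept = rng.standard_normal(d)
positives = rng.standard_normal((32, d)) + concept
negatives = rng.standard_normal((32, d))

# Contrastive activation: mean positive activation minus mean negative one.
steering_vec = positives.mean(axis=0) - negatives.mean(axis=0)
steering_vec /= np.linalg.norm(steering_vec)

# It should point roughly along the planted concept direction (cosine near 1).
cos = float(steering_vec @ (concept / np.linalg.norm(concept)))
print(round(cos, 2))
```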
Another completely different technique uses Sparse Autoencoders. These are autoencoder
models trained to reconstruct the LLM activations through a sparse intermediate layer. The key
insight is that each dimension of this layer tends to correspond to an interpretable concept.
The method is unsupervised, you don't tell it which concepts to find. Instead, it produces a
large library of vectors that statistically seem to correspond to well-defined concepts.
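As a rough sketch of the idea (an untrained toy in PyTorch, with made‑up sizes; real SAEs are trained at scale on activations captured from the model), the objective combines reconstruction error with an L1 sparsity penalty, and after training each decoder column can serve as a candidate concept direction:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_sparse = 32, 256  # expansion: many more latent features than dimensions

class SparseAutoencoder(nn.Module):
    """Reconstructs activations through a wide, sparsity-penalized latent layer."""
    def __init__(self, d_model, d_sparse):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sparse)
        self.decoder = nn.Linear(d_sparse, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(features), features

sae = SparseAutoencoder(d_model, d_sparse)
acts = torch.randn(64, d_model)  # stand-in for captured LLM activations

recon, feats = sae(acts)
l1_coeff = 1e-3
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
print(loss.item())

# After training, each decoder column is a candidate steering vector:
candidate_vec = sae.decoder.weight[:, 0]  # direction for latent feature 0
print(candidate_vec.shape)  # torch.Size([32])
```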
I’m skipping the details but what’s nice about this method is that it gives you a
large library of vectors to choose from, and a lot of people have been sharing their sparse
autoencoders on the Hugging Face Hub, so you should have a look for your model of choice.
The drawback is that these vectors don't generally come with predefined concept
labels. So if you are looking for one specific concept, it might be particularly tedious to use.
Fortunately, the great website Neuronpedia created by Decode Research is the perfect
place for that. You can browse through visualizations that will help you
identify the proper features that suit your purpose.
In my case, I searched for Eiffel Tower features in the Llama 3.1 8B model,
and I found for instance this one that I used for the demo.
One important aspect of steering vectors, whether they come from contrastive prompts, sparse
autoencoders, or other techniques, is that they are always located at a given layer of the model. So in
general you might have the choice between steering early layers, late layers or middle layers.
As we discussed earlier, in general, if you want the model to be influenced
by a concept without necessarily reproducing the exact same words,
it is better to steer concept vectors that are located in middle layers,
where the abstract reasoning is supposed to happen. But you might have to experiment yourself.
Ok that’s pretty much it for today, I hope I convinced you that steering LLMs
can be an interesting method to elicit certain behaviours.
It does not require any fine-tuning, it just works at inference time.
And it has many benefits like being able to control the intensity of the intervention,
and maintain it over the whole text generation,
which is sometimes harder to achieve with prompt engineering.
Of course, it also has drawbacks. As I mentioned earlier, it might not be easy to find a sweet spot
where the model is properly steered but still maintains its fluency. Also steering works
best for concepts the model has already learned to represent. It won't teach the model new knowledge.
If you want to know more about the technical details, I encourage you to
go read the blog post where I explained how I constructed the Eiffel Tower Llama model,
and what kind of methods you might use if you want to do a similar thing. It contains a lot
of useful tips to investigate the proper way to do steering, in particular with Sparse Autoencoders.
Don’t forget to visit Neuronpedia and the Hugging Face Hub for finding steering vectors,
and maybe sharing your own recipes. Let us know in the comments what
you were able to achieve with this technique, and have fun steering!