# Neurostimulation‑Style Steering of LLMs

**Source:** [https://www.youtube.com/watch?v=F2jd5WuT-zg](https://www.youtube.com/watch?v=F2jd5WuT-zg)
**Duration:** 00:17:43

## Summary

- Prompt engineering and fine‑tuning are common ways to modify an LLM’s behavior, but a third method—“steering” the model—lets you alter outputs on the fly without changing weights.
- Steering works like neurostimulation: by selectively activating or inhibiting specific artificial neurons during inference, you can trigger desired actions or personalities, much as brain electrodes induce or suppress responses.
- The speaker demonstrated the technique on an open‑source Llama 3.1 8B model, making it obsessively talk about (and even “believe it is”) the Eiffel Tower, all without any fine‑tuning.
- This approach can be applied to any Hugging Face Transformers model by manipulating the feed‑forward and attention layers at runtime, offering a lightweight, reusable way to steer LLM behavior.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=0s) **Steering LLMs Through Neurostimulation Analogy** - The speaker introduces a third method for adjusting a large language model’s behavior—targeted “steering” of its neurons, likened to neurostimulation of the brain, as an alternative to prompt engineering or full fine‑tuning.
- [00:03:07](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=187s) **Hidden State Vectors and Steering** - The passage explains how each transformer layer passes a high‑dimensional hidden‑state vector—viewed as neurons in an activation space—that can be neuro‑stimulated to steer an LLM’s behavior, emphasizing the need to understand token embeddings that map vocabulary items into this space.
- [00:06:18](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=378s) **Direction Encodes Concept, Length Scales Strength** - The passage explains that in LLMs a concept is represented by the direction of its high‑dimensional vector—its magnitude only modulates expression strength—and that across layers these vectors shift, with concepts stored in distributed superpositional patterns rather than single neurons.
- [00:09:36](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=576s) **Activations Steering with Concept Vectors** - The speaker explains how to steer a language model’s hidden state by adding a scaled, normalized concept vector—demonstrated with a few lines of Hugging Face code that injects an “Eiffel Tower” vector into LLaMA 3.1 8B’s mid‑layer activations to alter its behavior and personality.
- [00:12:46](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=766s) **Steering Language Models with Vectors** - The speaker outlines how to modify LLM responses by adding and tuning steering vectors—explaining coefficient limits, stability tricks, and contrastive activation methods for finding effective direction cues.
- [00:15:52](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=952s) **Choosing Layers for LLM Steering** - The speaker explains that steering vectors exist at specific model layers, suggests using middle layers to influence abstract concepts without altering exact wording, highlights the inference‑time benefits and fluency trade‑offs of steering versus prompt engineering, and notes that it works best for already‑learned concepts while requiring experimentation to find the optimal intensity.

## Full Transcript
Imagine you are working with a large language model,
and you would like to tweak its behaviour or its personality. A well-known solution
is to use prompt engineering, you specify in the system prompt what you want to achieve.
Another option is to fine-tune the model. But for that,
you need enough data demonstrating the behaviour or the personality you are
looking for. And of course you need to have enough compute to perform the fine-tuning.
So today, we’re going to talk about a third option:
steering the model. And it turns out that steering a large language model is loosely
analogous to what neuroscientists call neurostimulation of the brain.
Neurostimulation is the idea of artificially stimulating certain areas of the brain,
or specific neurons, using electrodes or magnetic fields.
When you stimulate biological neurons this way, neuroscientists have observed that it can trigger
or inhibit certain motor actions, and even elicit certain emotions, feelings or memories.
In neuroscience, neurostimulation is used for research, to better
understand the role of the various brain regions, but also for clinical purposes,
for instance in treating Parkinson’s disease. And what’s interesting is that it’s a technique
that obviously does not modify the brain, it just intervenes on the fly.
And it turns out that you can do pretty much the same with artificial neural networks in general,
and LLMs in particular. By targeting carefully selected neurons in your LLM,
you can control or elicit certain behaviour,
without having to rewire anything, without changing the weights of the model.
This procedure is fairly easy to use, and to illustrate it, I applied it to a Llama
3.1 8B model, to change its personality and make it obsessed with the Eiffel Tower... to
the point that it sometimes even believes it IS the Eiffel Tower. Look at that!
And again, this change is entirely controlled at inference time, when generating the tokens.
What is loaded in memory is still the original Llama model, there is no fine-tuning involved.
You want to learn how to do the same? Well today I’m going to explain the basics of this method,
and show you how you can easily use this technique
to steer pretty much any open-source LLM using Hugging Face’s Transformers library.
First of all, let’s recall the internal workings of a typical LLM. Most of them
today are autoregressive models, they generate one token at a time.
For that, they are based on the transformer architecture, and organized as a stack of layers.
At each layer, each token goes through an attention block and a feed forward block.
As you know, the attention block is where each token can receive information from the other
tokens preceding it in the sequence. The feed forward network block is a
traditional multilayer perceptron. After those two blocks, the result is passed to the next layer.
The stack of layers essentially represents successive stages of processing,
until the logits for the next token are computed by the final linear head.
If I zoom in at the boundary between two layers, what gets passed here is actually a vector,
sometimes called the hidden state. This vector actually lives in a high dimensional space,
typically a few thousand dimensions, that we’ll call the activation space.
We can think of this huge vector as representing the model's internal state,
its hidden 'thoughts' at this point in processing the token.
With LLMs, we don’t generally visualize those numbers as coming from neurons.
But you could very well imagine the output of each layer as a series of
neurons that produce the coordinates of the vector that gets passed to the next layer.
And these are the kinds of neurons that we can target with our steering,
our artificial neurostimulation, in order to modify the thoughts of the LLM at inference time.
But now the question is: how do we do this? How can we stimulate the neurons
in a way that elicits a certain behaviour or a certain personality? To answer that question,
we need to understand how LLMs represent abstract concepts.
You may remember that the very first layer of an LLM is the embedding layer. This layer will map
every possible token of the vocabulary into a vector of the activation space.
This token/vector correspondence is by design for the embedding
layer. But something remarkable happens: as the model processes information through deeper layers,
it continues to represent concepts as vectors in the activation space.
This is called the linear representation phenomenon,
an empirical observation that seems to hold for most LLMs: they tend to represent interpretable
concepts as vectors in the activation space, going from one layer to another.
What’s useful here with linear representation is that you can always add vectors. If you have
a vector that represents the concept of a car, and another that represents
the concept of the color red, if you sum them, you get the concept of a red car.
And you can even vary the amount you add, so that you can navigate between different degrees of the
concept, going from a car that happens to be red, to something like an intensely red sports car.
Maybe you remember from a few years ago the results from the famous Word2Vec paper. This paper
showed that embeddings of words were following certain kinds of arithmetic relationships,
and you could for instance obtain the vector embedding of the word ‘King’ from that of
the word ‘Queen’, by adding the vector for ‘Man’ and subtracting the one for ‘Woman’.
Word2Vec demonstrated this for word embeddings specifically,
but with LLMs, this idea holds throughout the model’s
layers. And it is the consequence of the linear representation they develop during training.
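The king/queen arithmetic is easy to sketch with toy vectors. Here is a minimal, purely illustrative numpy example; the 4‑dimensional "embeddings" are made‑up values (real models learn thousands of dimensions from data), chosen so the analogy works out exactly:

```python
import numpy as np

# Hypothetical 4-dimensional "embeddings", purely illustrative:
# real models use thousands of learned dimensions.
emb = {
    "queen": np.array([0.9, 0.8, 0.1, 0.2]),
    "woman": np.array([0.1, 0.8, 0.1, 0.1]),
    "man":   np.array([0.1, 0.1, 0.8, 0.1]),
    "king":  np.array([0.9, 0.1, 0.8, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 'queen' - 'woman' + 'man' should land closest to 'king'.
candidate = emb["queen"] - emb["woman"] + emb["man"]
nearest = max(emb, key=lambda w: cosine(candidate, emb[w]))
print(nearest)  # king
```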
An implication of this linear representation phenomenon is that for a given concept,
what matters most is the vector’s direction, not its length. If you have a vector for the
concept of a car, doubling its length won’t give you the concept of a bus, two cars, or a traffic jam.
In general, increasing the length of a concept vector does not change which
concept it represents, only how strongly it's expressed.
Something important to note is that this linear representation phenomenon might
be realized differently at every layer of the stack. So after embedding, the token "car" is
represented by a certain vector, but after each layer, in each intermediate activation space,
there is possibly a different vector for the concept of a car.
I told you earlier that between each layer, the LLM transmits a vector in the high dimensional
activation space, and that we could see each coordinate of that vector as a neuron,
outputting a signal to the next layer. It might be tempting to think that every such
neuron represents a certain concept. But this hypothesis turns out to be wrong in general.
LLMs actually encode concepts as distributed patterns across neurons. This is called
superposition, and through this, they can manipulate far more concepts than there
are dimensions. I won’t go into too much detail, if you are interested you should
check out Anthropic’s series of papers about superposition and monosemanticity.
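A quick way to build intuition for superposition is to check how many nearly orthogonal directions fit in a modest space. This numpy sketch (toy sizes, random vectors; not an actual LLM measurement) packs far more unit vectors than dimensions while keeping every pairwise overlap small:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 256, 2000  # 2000 "concepts" in only 256 dimensions
vecs = rng.standard_normal((n, d))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors

# With many more vectors than dimensions, they cannot all be exactly
# orthogonal, but random directions stay nearly orthogonal.
sims = vecs @ vecs.T
np.fill_diagonal(sims, 0.0)
max_overlap = float(np.abs(sims).max())
print(max_overlap)  # small, well below 0.5
```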
Another important observation regarding the encoding of concepts in activation
space is that different layers might play different roles.
What researchers observed is that in early layers, those vectors tend
to be activated when the concept has just been explicitly seen in the input tokens,
for instance the model has read the word car.
In late layers, close to the output, the vector corresponding to a concept tends
to activate when the model is about to output that token. As we’ll see later,
the most interesting cases for us are intermediary
layers: this is where LLMs tend to represent abstract concepts in order to reason on them.
So to recap: concepts are represented by vectors in the activation space between successive
layers, and the good thing with vectors is that we can add them. So it means that if
we take the activation coming from a layer, we can add to it a given vector in order to
reinforce that concept in the thoughts of the LLM: this is what is called steering.
Let’s see how we can do this in practice. For now,
let’s assume we’ve found a good vector that represents the concept we want to stimulate,
and I’ll come back later to the different ways to actually identify those vectors.
As I explained earlier, when you want to steer the behaviour of an LLM with a concept vector,
you don’t change the LLM. The model is the same, the weights are the same,
but you will intervene on the activations at inference, during the generation of new tokens.
More specifically if you have a vector X representing the activations at the output of
layer n, and you want to steer it in the direction of the vector V, you will simply add V to X.
Of course, as I mentioned before, when you add two vectors, you can
scale each one with a coefficient, controlling how much of each you add.
So usually what we do is work with normalized concept vectors V, but we multiply them by a
coefficient before steering, and this coefficient will govern the size of your intervention.
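As a minimal sketch of that update rule, with a hypothetical activation x and concept vector v (real ones have thousands of dimensions):

```python
import numpy as np

def steer(x, v, coeff):
    """Add the normalized concept direction v to activation x, scaled by coeff."""
    v_unit = v / np.linalg.norm(v)
    return x + coeff * v_unit

# Made-up activation and concept vector, for illustration only.
x = np.array([0.5, -1.2, 0.3])
v = np.array([3.0, 0.0, 4.0])  # norm 5, so v_unit = [0.6, 0.0, 0.8]

steered = steer(x, v, coeff=4.0)
print(steered)  # [0.5 + 4*0.6, -1.2, 0.3 + 4*0.8] = [2.9, -1.2, 3.5]
```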
Ok but how do we do this in practice? Actually it’s just a few lines of code
using Hugging Face’s transformers. Here I have a small snippet that loads Llama
3.1 8B from the Hugging Face Hub, and calls the model on a simple
prompt "Give me some ideas for starting a business", and you see the response.
Now let’s say we want to steer the model to change its behaviour and its personality.
Maybe some of you saw a few months ago that Anthropic had created a model that pretended to
be the Golden Gate Bridge. I wanted to reproduce this, but as you’ve probably heard I’m French,
I live in Paris, so I had to try with the Eiffel Tower instead.
So here I’m loading a vector V that represents the concept of the Eiffel Tower at layer 15 of
the model. Llama 3.1 8B has 32 layers, so we are in the middle. I’ll explain later how I
found this vector, but for now let’s assume we have it and we want to steer our model with it.
To perform this while generating our tokens, we need the equivalent
of the electrode that delivers electrical stimulations to the brain. In our case,
the solution is called a hook. A hook is simply a function you attach to the model,
that gets triggered during the forward pass, right when inference is happening.
So let’s choose a coefficient, and my hook will simply take the output of a layer,
and add the vector scaled by the coefficient. Very simple. And I will register this hook at
layer 15, so that it will be systematically called after the model has processed layer 15.
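The snippet itself isn't reproduced in the transcript, so here is a minimal stand‑in showing the same hook mechanism on a toy PyTorch stack (the module, vector, and coefficient are all hypothetical). With a real Llama loaded through Transformers, you would register the hook on the chosen decoder layer instead (typically reachable as `model.model.layers[15]`, whose output is a tuple whose first element is the hidden state, so the hook would need to rebuild that tuple):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer: a stack of layers passing a hidden state along.
hidden_dim = 8
model = nn.Sequential(*[nn.Linear(hidden_dim, hidden_dim) for _ in range(4)])

v = torch.randn(hidden_dim)
v = v / v.norm()  # normalized concept direction (hypothetical)
coeff = 4.0

def steering_hook(module, inputs, output):
    # Called right after the module's forward pass; returning a value
    # replaces the module's output for the rest of the computation.
    return output + coeff * v

# "Implant the electrode" after layer index 2, a middle layer of our toy stack.
handle = model[2].register_forward_hook(steering_hook)

x = torch.randn(1, hidden_dim)
steered_out = model(x)

handle.remove()  # detach the hook to restore the original behaviour
baseline_out = model(x)
print(torch.allclose(steered_out, baseline_out))  # False: the hook changed the activations
```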
Now let’s run my model again with this hook and see what happens. With a coefficient set to 4.0,
as you can see, the model starts deviating from its natural behaviour. When I was asking for
ideas for starting a business, the base model was suggesting things around e-commerce and services.
Now you see the answer is different. It’s talking about food,
bakeries. It is not explicitly about the Eiffel Tower, but you feel a change of
perspective. Now I can remove my hook and replace it with a stronger one.
With a coefficient of 8.0, Llama starts to suggest ideas about wine and travel;
it is clearly influenced by the concept we are stimulating. And now if I ask "who are you?",
the model will start pretending to be a large metal structure called the Eiffel Tower.
And here's a fun detail: the original response, with no steering,
started with 'I'm a large language model.' Now it says 'I'm a large metal structure.'
You can literally see the steering kick in right after the word ‘large'.
Of course, doing this you will quickly realize that you don’t want to push the
coefficient too high. That’s expected, if you add too much of the vector,
you completely derail the model’s reasoning, and it will output gibberish. That makes sense,
you could imagine it would be the same with electrical stimulation of the brain.
So you want to choose a good value for the steering coefficient. Luckily,
there are some systematic techniques to help you
identify the sweet spot. And there are also some ways to improve the stability of the
model by tuning certain generation parameters like temperature or frequency penalty controls.
I cover these techniques in a blog post, I’ll leave the link in the description.
At this point, I’m sure you’re convinced that steering could actually be an
interesting technique to study and use, but I left open a big question:
how to identify a steering vector for your concept of choice. How did I do
for the Eiffel Tower? Well there are actually several techniques.
One is called contrastive activation. The idea is fairly simple, you have to
gather pairs of prompts, positive and negative examples of the behaviour you
want to elicit. Then you compute the average activation across positive
examples, and subtract the average activation across negative examples.
If you have enough pairs, you will end up with vectors that represent the concept
you are looking for. This method has been found to be pretty effective;
in some cases, even better than prompt engineering and supervised fine tuning.
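Here is a small numpy sketch of the contrastive computation, using synthetic activations with a planted concept direction (everything here is made up for illustration; in practice the activations would be captured from the model at a chosen layer):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # toy activation dimension

# Hypothetical activations for prompts that do (positive) and do not
# (negative) exhibit the target behaviour. A shared "concept" direction
# is baked into the positives so the method has something to recover.
concept = rng.standard_normal(d)
positives = rng.standard_normal((32, d)) + concept
negatives = rng.standard_normal((32, d))

# Contrastive activation: mean positive activation minus mean negative one.
steering_vec = positives.mean(axis=0) - negatives.mean(axis=0)
steering_vec /= np.linalg.norm(steering_vec)

# It should point roughly along the planted concept direction (cosine near 1).
cos = float(steering_vec @ (concept / np.linalg.norm(concept)))
print(round(cos, 2))
```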
Another completely different technique uses Sparse Autoencoders. These are autoencoder
models trained to reconstruct the LLM activations through a sparse intermediate layer. The key
insight is that each dimension of this layer tends to correspond to an interpretable concept.
The method is unsupervised, you don't tell it which concepts to find. Instead, it produces a
large library of vectors that statistically seem to correspond to well-defined concepts.
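As a rough sketch of the idea (an untrained toy in PyTorch, with made‑up sizes; real SAEs are trained at scale on activations captured from the model), the objective combines reconstruction error with an L1 sparsity penalty, and after training each decoder column can serve as a candidate concept direction:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_sparse = 32, 256  # expansion: many more latent features than dimensions

class SparseAutoencoder(nn.Module):
    """Reconstructs activations through a wide, sparsity-penalized latent layer."""
    def __init__(self, d_model, d_sparse):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sparse)
        self.decoder = nn.Linear(d_sparse, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(features), features

sae = SparseAutoencoder(d_model, d_sparse)
acts = torch.randn(64, d_model)  # stand-in for captured LLM activations

recon, feats = sae(acts)
l1_coeff = 1e-3
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
print(loss.item())

# After training, each decoder column is a candidate steering vector:
candidate_vec = sae.decoder.weight[:, 0]  # direction for latent feature 0
print(candidate_vec.shape)  # torch.Size([32])
```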
I’m skipping the details but what’s nice about this method is that it gives you a
large library of vectors to choose from, and a lot of people have been sharing their sparse
autoencoders on the Hugging Face Hub, so you should have a look for your model of choice.
The drawback is that these vectors don't generally come with predefined concept
labels. So if you are looking for one specific concept, it might be particularly tedious to use.
Fortunately, the great website Neuronpedia created by Decode Research is the perfect
place for that. You can browse through visualizations that will help you
identify the proper features that suit your purpose.
In my case, I searched for Eiffel Tower features in the Llama 3.1 8B model,
and I found for instance this one that I used for the demo.
One important aspect of steering vectors, whether they come from contrastive prompts, sparse
autoencoders, or other techniques, is that they are always located at a given layer of the model. So in
general you might have the choice between steering early layers, late layers or middle layers.
As we discussed earlier, in general, if you want the model to be influenced
by a concept without necessarily reproducing the exact same words,
it is better to steer concept vectors that are located in middle layers,
where the abstract reasoning is supposed to happen. But you might have to experiment yourself.
Ok that’s pretty much it for today, I hope I convinced you that steering LLMs
can be an interesting method to elicit certain behaviours.
It does not require any fine-tuning, it just works at inference time.
And it has many benefits like being able to control the intensity of the intervention,
and maintain it over the whole text generation,
which is sometimes harder to achieve with prompt engineering.
Of course, it also has drawbacks. As I mentioned earlier, it might not be easy to find a sweet spot
where the model is properly steered but still maintains its fluency. Also steering works
best for concepts the model has already learned to represent. It won't teach the model new knowledge.
If you want to know more about the technical details, I encourage you to
go read the blog post where I explained how I constructed the Eiffel Tower Llama model,
and what kind of methods you might use if you want to do a similar thing. It contains a lot
of useful tips to investigate the proper way to do steering, in particular with Sparse Autoencoders.
Don’t forget to visit Neuronpedia and the Hugging Face Hub for finding steering vectors,
and maybe sharing your own recipes. Let us know in the comments what
you were able to achieve with this technique, and have fun steering!