Understanding LLM Context Windows and Tokens

Key Points

  • A context window acts as an LLM’s working memory, limiting how much of a conversation it can retain and reference when generating responses.
  • When a dialogue exceeds the window’s size, earlier prompts are dropped, forcing the model to guess missing context and potentially produce hallucinations.
  • Context windows are measured in tokens—not whimsical units like “IBUs”—and a token can be a character, part of a word, a whole word, or even a short phrase.
  • Understanding tokenization and the finite context length is crucial for effectively using LLMs and avoiding loss of important conversational detail.

Full Transcript

# Understanding LLM Context Windows and Tokens

**Source:** [https://www.youtube.com/watch?v=-QVoIxEpFkM](https://www.youtube.com/watch?v=-QVoIxEpFkM)
**Duration:** 00:11:18

## Sections

- [00:00:00](https://www.youtube.com/watch?v=-QVoIxEpFkM&t=0s) **LLM Context Window Explained** - The speaker describes a context window as an LLM's working memory that caps how much earlier conversation can be remembered, illustrating with a short exchange that fits inside the window and a longer one that exceeds it, causing earlier parts to be dropped.
- [00:03:05](https://www.youtube.com/watch?v=-QVoIxEpFkM&t=185s) **Tokens, Tokenization, and Context** - The speaker explains what tokens are, how tokenizers break text into characters, sub-words, or whole words using examples, and introduces the challenges of handling long context windows in AI models.
- [00:06:08](https://www.youtube.com/watch?v=-QVoIxEpFkM&t=368s) **Self-Attention and Context Windows** - The speaker explains how transformer self-attention computes token relevance, and how expanding context windows (from ~2K to 128K tokens) allow models to include not only user inputs and model outputs but also hidden system prompts and other data within a single attention span.
- [00:09:11](https://www.youtube.com/watch?v=-QVoIxEpFkM&t=551s) **Quadratic Cost of Long Contexts** - The speaker explains that processing time grows quadratically with token length, leading to quadrupled compute when inputs double, diminished model performance for mid-sequence information, and heightened safety risks such as adversarial attacks and jailbreaking.

## Full Transcript
0:00 In the context of large language models, 0:02 what is a context window? 0:05 Well, it's the equivalent of its working memory. 0:09 It determines how long of a conversation the LLM can carry out without forgetting details from earlier in the exchange. 0:17 And allow me to illustrate this using the scientifically recognized IBU scale: that's International Blah Units. 0:25 So blah here, that represents me sending a prompt to an LLM chatbot. Now the chatbot returns with a response: blah. 0:41 Right. 0:42 And then we continue the conversation. 0:43 So I say something else and then it responds back to me. 0:52 Blah, blah, blah, blah. 0:55 International blah units.

0:58 Now, this box here 1:01 represents the context window, and in this case, the entire conversation fits within it. 1:10 Now, that means that when the LLM generated this response here, this blah, 1:15 it had within its working memory my prompts to the model here and here. 1:22 And it also had the other response that the model had returned to me in order to build this response. 1:29 All good.

1:30 Now let's consider a longer conversation. 1:33 So more blahs. 1:35 I send my prompt: blah. 1:39 It then sends me a response. 1:43 And now we go back and forth with more conversation. 1:47 I say something. 1:49 It responds to that. 1:50 I say one more thing and it responds to that. 1:56 So now we have a longer conversation here to deal with. 2:01 And it turns out that this conversation thread is actually longer than the context window of the model. 2:09 Now, that means that the blahs from earlier in the conversation are no longer available to the model. 2:18 It has no memory of them when generating new responses. 2:23 Now, the LLM can do its best to infer what came earlier by looking at the conversation that is within its context window. 2:32 But now the LLM is making educated guesses, and that can result in some wicked hallucinations. 2:39 So understanding how the context window works is essential to getting the most out of LLMs.
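The truncation the speaker illustrates with blahs can be sketched in a few lines of Python: when a conversation grows past the token budget, the oldest turns are dropped first. This is a simplified illustration, not how any particular chatbot is implemented; `count_tokens` and `fit_to_context` are made-up names, and counting one token per word is a crude stand-in for a real tokenizer.

```python
# Sketch: keep a conversation inside a fixed token budget by dropping
# the oldest turns first. Token counts are approximated by word count;
# a real system would use the model's own tokenizer.

def count_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def fit_to_context(turns, max_tokens):
    """Return the most recent turns whose total token count fits within
    max_tokens, discarding earlier turns as needed."""
    kept = []
    total = 0
    for turn in reversed(turns):      # walk from newest to oldest
        cost = count_tokens(turn)
        if total + cost > max_tokens:
            break                     # everything older falls out of the window
        kept.append(turn)
        total += cost
    return list(reversed(kept))       # restore chronological order

conversation = [
    "user: blah blah blah",
    "assistant: blah blah",
    "user: blah",
    "assistant: blah blah blah",
]
# A small budget forces the earliest exchange out of the window.
print(fit_to_context(conversation, max_tokens=7))
```

Running this with a budget of 7 word-tokens keeps only the last two turns; the model would then have to guess at what the dropped blahs said, which is exactly where hallucinations creep in.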
2:45 Let's get into a bit more detail about that now. 2:49 Now my producer is telling me that context window size is in fact not measured in IBUs and that I made that up. 2:57 We actually measure context windows in something called tokens. 3:02 So let's describe tokenization, 3:05 let's get into context length and size, and we're going to talk about the challenges of long context windows.

3:12 So to start, what is a token? 3:16 Well, for us humans, the smallest unit of information that we use to represent language is a single 3:24 character. 3:26 So something like a letter or a number or a punctuation mark, something like that. 3:34 But the smallest unit of language that AI models use is called a token. 3:41 Now, a token can represent a character as well, 3:46 but it might also be a part of a word or a whole word or even a short multi-word phrase.

3:53 So, for example, let's consider the different roles played by the letter A. 3:59 I'm going to write some sentences and we're going to tokenize them. 4:02 Let's start with "Martin 4:06 drove a car." 4:12 Now, A here is an entire word and it will be represented by a distinct token. 4:23 Now, what if we try a different sentence? 4:24 So, "Martin 4:27 is 4:30 amoral." 4:31 Not sure why we would say that, 4:33 but look, in this case, A is not a word, but it's an addition to "moral" that significantly changes the meaning of that word. 4:42 So here "amoral" would be represented by two distinct tokens: a token for A and another token for "moral". 4:54 All right, one more. 4:56 "Martin 4:58 loves 5:00 his cat." 5:03 Now, the A in "cat" is simply a letter 5:07 in a word; it carries no semantic meaning by itself and would therefore not be a distinct token. 5:14 The token here 5:16 is just "cat".

5:18 Now, the tool that converts language to tokens, 5:21 it's got a name. 5:23 It's called a tokenizer. 5:28 And different tokenizers might tokenize the same passage of writing differently.
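The character/sub-word/whole-word behavior described above can be mimicked with a toy greedy longest-match tokenizer. The `VOCAB` set and `tokenize` function below are invented purely to mirror the "a car" / "amoral" / "cat" examples; production tokenizers such as BPE or WordPiece learn their vocabularies from large corpora and will split text differently.

```python
# Toy tokenizer: greedy longest-match against a tiny hand-made vocabulary.
# Real tokenizers (BPE, WordPiece, etc.) learn vocabularies from data.

VOCAB = {"martin", "drove", "is", "loves", "his", "a", "moral", "car", "cat"}

def tokenize(text):
    tokens = []
    for word in text.lower().split():
        while word:
            # Take the longest prefix of the word found in the vocabulary.
            for end in range(len(word), 0, -1):
                if word[:end] in VOCAB:
                    tokens.append(word[:end])
                    word = word[end:]
                    break
            else:
                # Unknown text falls back to single characters.
                tokens.append(word[0])
                word = word[1:]
    return tokens

print(tokenize("Martin drove a car"))    # "a" is its own token
print(tokenize("Martin is amoral"))      # "amoral" splits into "a" + "moral"
print(tokenize("Martin loves his cat"))  # the "a" inside "cat" is not a token
```

Note how the same letter A ends up as a whole-word token, a sub-word token, or no token at all, depending on context, which is the speaker's point.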
5:32 But kind of a good rule of thumb is that a regular word in the English language is represented by something like 1.5 5:45 tokens by the tokenizer. 5:48 So a hundred words might result in 150 tokens.

5:53 So context windows consist of tokens, but how many tokens are we actually talking about? 6:00 To answer that, we need to understand how LLMs process tokens 6:05 in a context window. 6:08 Now, transformer models use something called the self 6:14 attention 6:17 mechanism. 6:19 And the self-attention mechanism is used to calculate the relationships 6:23 and the dependencies between different parts of an input, like words at the beginning and at the end of a paragraph. 6:30 Now, the self-attention mechanism computes vectors of weights in which each weight 6:35 represents how relevant that token is to the other tokens in the sequence. 6:39 So the size of the context window determines the maximum number of tokens 6:44 that the model can pay attention to at any one time.

6:50 Now, context window size has been rapidly increasing. 6:54 The first LLMs that I used had context windows of around 2,000 tokens. 7:01 The IBM Granite 3 model today has a context window of 7:04 128,000 tokens, and other models have larger context windows still. 7:12 But it almost seems like overkill, doesn't it? 7:15 I would have to be conversing with a chatbot all day to fill a 128K token window. 7:23 Well, actually, that's not necessarily true, because there can be a lot of things taking up space within a model's context window.

7:33 So let's take a look at what some of those things could be. 7:37 Well, one of them is the user input, the blah that I sent into the model. 7:45 And of course, we also have the model responses as well, 7:50 the blahs that it was sending back. 7:54 But a context window may also contain all sorts of other things as well. 7:59 So most models provide what is called a system prompt 8:06 into the context window. 8:08 Now, this is often hidden from the user.
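The weight vectors the speaker mentions can be sketched with scaled dot-product attention: each token's query vector is scored against every token's key vector, and a softmax turns the scores into relevance weights that sum to 1. The query and key vectors below are made up for illustration; in a real transformer they are learned projections of token embeddings, and the weights are then used to mix value vectors.

```python
import math

# Sketch of self-attention weight computation: for each token, score its
# query against every token's key, then softmax the scores into weights.

def softmax(xs):
    m = max(xs)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """One weight vector per token: how relevant each token in the
    sequence is to that token (rows sum to 1)."""
    d = len(keys[0])                      # key dimension, for the 1/sqrt(d) scaling
    rows = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        rows.append(softmax(scores))
    return rows

# Three tokens with 2-dimensional query/key vectors (invented values).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys    = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

for i, row in enumerate(attention_weights(queries, keys)):
    print(f"token {i}: " + ", ".join(f"{w:.2f}" for w in row))
```

The context window size is simply the maximum sequence length this computation is run over: every token in the window gets a weight against every other token it can attend to.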
8:11 But it conditions the behavior of the model, telling it what it can and cannot do. 8:17 A user may also choose to attach some documents into 8:22 their context window, or they might put in some source code as well. 8:28 And that can be used by the LLM to refer to in its responses. 8:32 And then supplementary information drawn from external data sources 8:36 for retrieval-augmented generation, or RAG, 8:41 might be stored within the context window during inference. 8:45 So a few long documents and some snippets of source code can quickly fill up a context window.

8:53 So the bigger the context window, the better, right? 8:58 Well, larger context windows do present some challenges as well. 9:02 What sort of challenges? 9:05 Well, I think the most obvious one would have to be 9:10 compute. 9:11 The compute requirements scale quadratically with the length of a sequence. 9:18 What does that mean? 9:18 Well, essentially, as the number of input tokens doubles, 9:24 the model needs four times as much processing power to handle it. 9:32 Now, remember, as the model predicts the next token in a sequence, 9:36 it computes the relationships between that token and every single preceding token in the sequence. 9:43 So as context length increases, more and more computation is going to be required.

9:49 Now, long context windows can also negatively affect performance, specifically the performance of the model. 9:59 So, like people, LLMs can be overwhelmed by an abundance of extra detail. 10:06 They can also get lazy and take all sorts of cognitive shortcuts. 10:10 A 2023 paper found that models perform best when relevant information is towards the 10:16 beginning or towards the end of the input context, and they found that performance 10:22 degrades when the model must carefully consider information in the middle of a long context. 10:27 And then finally, we also have to be concerned with a number of safety challenges as well.
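The quadratic growth described above follows from simple counting: each token's relationships are computed against every preceding token, giving roughly n²/2 pairwise comparisons for n tokens. A minimal illustration (the function name `pairwise_comparisons` is hypothetical, and real inference cost depends on much more than this count):

```python
# Why compute grows quadratically: predicting each token requires
# comparing it against every preceding token, so a sequence of n tokens
# involves roughly n * (n - 1) / 2 pairwise comparisons in total.

def pairwise_comparisons(n_tokens):
    comparisons = 0
    for position in range(n_tokens):
        comparisons += position   # compare against every earlier token
    return comparisons

for n in (1_000, 2_000, 4_000):
    print(f"{n} tokens -> {pairwise_comparisons(n):,} comparisons")

# Doubling the input roughly quadruples the work:
print(pairwise_comparisons(2_000) / pairwise_comparisons(1_000))
```

Going from 1,000 to 2,000 tokens roughly quadruples the comparison count, which matches the speaker's "double the tokens, four times the processing power" rule.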
10:34 A longer context window might have the unintended effect of presenting a larger attack surface for adversarial prompts, 10:41 and a long context length can increase a model's vulnerability to jailbreaking, 10:46 where malicious content is embedded deep within the input, making it harder for the model's safety mechanisms 10:52 to detect and filter out harmful instructions. 10:55 So no matter how you measure it, either with IBUs or, more accurately, tokens, 11:02 selecting the appropriate number of tokens for a context window involves balancing the need 11:08 to supply ample information for the model's self-attention mechanism 11:13 with the increasing compute demands and performance issues those additional tokens may bring.