Beyond Tokens: Conceptual Language Models
Key Points
- Modern large language models (LLMs) predict the next token, but the field is advancing toward “language concept models” (LCMs) that predict whole concepts and reason across sentences.
- Both LLMs and LCMs rely on embedding text into high‑dimensional vector spaces, where similarity (e.g., cosine similarity) captures relationships between sentences or concepts.
- Early embeddings were frequency‑based, counting word occurrences, but they lacked the depth of today’s prediction‑based embeddings that project words into richer semantic spaces.
- Key milestones in prediction‑based embeddings include word2vec (2013), followed by GloVe, ELMo, BERT, ALBERT, and newer models like SONAR, which encode both semantic meaning and context.
- Word embeddings transform raw text into vectors that LLMs can process, enabling the models to understand vocabulary, context, and deeper conceptual relationships.
Sections
- Untitled Section
- Transformer Encoder‑Decoder Overview - The passage explains how a transformer’s encoder processes tokenized inputs through multi‑head attention, feed‑forward layers, and normalization to create encoded representations, which are then fed into a decoder that similarly applies attention mechanisms to generate outputs.
- Diffusion-Based Concept Prediction Pipeline - The speaker explains how sentences are encoded into a SONAR space, processed by a diffusion‑based latent concept model (LCM) that denoises and decodes abstract concept embeddings via a two‑tower encoder‑decoder architecture, enabling hierarchical, abstract reasoning instead of token‑level processing.
Source: [https://www.youtube.com/watch?v=Le86PMGK2Uk](https://www.youtube.com/watch?v=Le86PMGK2Uk)
Duration: 00:09:11
Section timestamps:
- [00:00:00](https://www.youtube.com/watch?v=Le86PMGK2Uk&t=0s) Untitled Section
- [00:03:17](https://www.youtube.com/watch?v=Le86PMGK2Uk&t=197s) Transformer Encoder‑Decoder Overview
- [00:06:22](https://www.youtube.com/watch?v=Le86PMGK2Uk&t=382s) Diffusion-Based Concept Prediction Pipeline
## Full Transcript
You might be wondering, what's next in generative AI?
Well, today's large language models,
what they do is they predict a token given a sequence of other tokens.
However, we're quickly moving up the thought abstraction tree where
these language concept models are now emerging.
And what that is, is it's able to reason within the sentence space.
So if I were to talk about a topic, then in turn,
every single time the LCM were to predict what I was gonna say, it might be a bit different.
But now what we're doing is we're predicting the
probability of a concept given the series of sentences that came before.
But in general, both approaches seek to predict
the next concept or token given a series of embeddings.
Now this gives rise to the notion that LLMs and now LCMs are about data representation.
Word embeddings are a way to represent words or sentences within these large vector spaces.
So for example, say if I'm drawing a three dimensional grid here.
And I take the sentence: artificial intelligence is a combination of art and science.
Now, perhaps it would land somewhere here.
So now I have my X, my Y, and my Z coordinates.
And this very effectively represents that sentence.
And what I can do now is I can figure out what the relationship is between
that sentence and maybe another one that could land about here.
And I can take what's called a cosine similarity measure between this vector and that vector,
to tell me how close they are in concept.
And there are many other ways of which I can take that measurement.
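The cosine similarity just mentioned can be computed directly. A minimal sketch, using two made-up 3-D sentence vectors (real sentence embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two hypothetical 3-D sentence embeddings
s1 = [0.9, 0.1, 0.3]
s2 = [0.8, 0.2, 0.4]
print(cosine_similarity(s1, s2))  # close to 1.0 -> conceptually similar sentences
```

A value near 1 means the vectors point in nearly the same direction; near 0 means the sentences are unrelated in this space.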
But now let's take a step back.
Arguably, one of the first types of embeddings that we
have is called a frequency-based embedding.
And what this does is it looks at the most prominent terms within a sentence,
and it represents how many times each word actually showed up.
And that was one of the first ways to help us
identify the terms that are the most significant within a certain document.
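A frequency-based embedding of this kind is essentially a count vector over a vocabulary. A minimal sketch:

```python
from collections import Counter

def frequency_embedding(text, vocabulary):
    # Count how many times each vocabulary term shows up in the text
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vocab = ["art", "science", "intelligence"]
doc = "artificial intelligence is a combination of art and science and more art"
print(frequency_embedding(doc, vocab))  # -> [2, 1, 1]
```

Every document becomes a vector of raw counts, which captures prominence but, as noted above, nothing about meaning or context.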
However, this really lacked the depth of the
prediction-based embeddings that we have today.
And so the prediction-based embeddings are very much different.
What they do is use models to actually project words into a
higher-dimensional space, like we see here.
And this is the type of embedding that's really dominant today.
One of the most significant breakthroughs that happened was in 2013,
with the introduction of the word2vec model.
And then shortly thereafter, we saw the introduction of GloVe,
ELMo, BERT, ALBERT, and there's even a newer one called SONAR.
But these techniques enable us to not only represent the words,
but also to capture what's called the semantic and the contextual
value of where they land in this space or what their meaning is.
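One famous property of these prediction-based spaces is that directions carry meaning, as in the word2vec-style analogy king − man + woman ≈ queen. A toy sketch with hand-picked 2-D vectors (real embeddings are learned and far higher-dimensional, so treat every number here as illustrative only):

```python
# Hand-picked toy vectors: dimension 0 ~ "royalty", dimension 1 ~ "maleness".
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.1],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

# king - man + woman: strip maleness from royalty, add femaleness back
target = add(sub(vectors["king"], vectors["man"]), vectors["woman"])

def nearest(v, table):
    # Euclidean nearest neighbour among the toy vocabulary
    return min(table, key=lambda w: sum((x - y) ** 2 for x, y in zip(table[w], v)))

print(nearest(target, vectors))  # -> queen
```

The arithmetic only works because the model has arranged semantically related words along consistent directions, which is exactly the "semantic and contextual value" being described.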
Now you might be wondering, what do embeddings have to do with LLMs?
Well, word embeddings convert tokenized text or sentences into vectors,
such that an LLM can really work with the vocabulary.
And what happens is these embeddings are here, so your sentences and words are inputs into that.
Now, as it feeds forward into a stack,
so I might have, let's say, five of these different types of encoders,
and the data goes up, then it keeps going up into even more of these encoders as it happens.
But you'll notice right in the middle, right, we have this multi-head attention,
which tells us which tokens are more interesting than others.
Then we go to a fully connected neural network.
And when this happens,
this helps to find all the different relationships between all of the tokens
that were input, already weighted toward what we should pay attention to.
Finally, we normalize and produce the output.
So now we have this encoded representation of the data that's already been tokenized and embedded.
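The multi-head attention step described above builds on scaled dot-product attention. A minimal single-head sketch in plain Python; real encoders add learned query/key/value projections, multiple heads, feed-forward layers, and layer normalization:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention: score(q, k) = (q . k) / sqrt(d).
    # Each output is a softmax-weighted mix of the value vectors,
    # so the most relevant tokens dominate the result.
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# Three toy token embeddings attending to each other (self-attention)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(tokens, tokens, tokens)
print(out)  # each row is a weighted blend of all token vectors
```

Because queries, keys, and values all come from the same token sequence here, this is self-attention; in the decoder's cross-attention, the keys and values come from the encoder instead.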
And the next part of this is we then go over into what's called a decoder.
And this is what this stack is.
And you might be wondering, how are these two linked together?
Well, the output of this, it actually goes into one of these attention heads.
And whenever we do that, this influences what the decoder should pay attention to.
But first of all, let's just talk a bit about how this works, right?
We also said that we have output from the encoder.
Well, that output can also become input from a previous
stack of these encoders into the decoder here.
And as this goes up, it follows much of the same process that happens over here,
where we then go up and we might tokenize it a bit more,
we might do some more embeddings to get into another space.
And then we go into a multi-head attention, which tells us again,
based on how we've trained the attention heads.
What should we pay attention to?
And then we go to a second attention block, the cross-attention, which as I mentioned is influenced by the encoder.
So the encoder has another representation that says, based on the decoder's state,
which words in the sentence we should pay more or less attention to.
Then we go up again, right?
And then, we do this neural network transformation to cross over and find
the different relationships between the hidden neurons and the stack.
And then have an output.
And the output...
will eventually be decoded back into some other text so that we then, in turn, as humans,
can better understand the output representation so
that maybe even another agent can link up to it and then use that.
But again, these blocks are stacked together.
We have five of these that I just drew up,
but then you might have ten of those that are stacked up together.
And if there's a two-to-one kind of relationship, then you may interlace these inputs
right over into this 2x stack of encoders.
So that's a lot of where the art comes in,
is how do we put these two encoders and decoders together.
But in essence, this really gives us rise to the modern
day LLM that most of us use on a daily basis.
So now what if these LLMs could reason at the concept level instead of at the token level?
Well, this is where LCMs enter the field.
So an LCM is trained to predict the next concept, or even sentence.
But now a concept represents a high-level idea.
It's agnostic to language, and it's agnostic to modality,
and this allows hierarchical reasoning within the concept space.
So here is the fundamental idea of how it works.
So first, there are sentences at the very bottom.
So I have sentence one all the way to sentence five.
And these represent different types of concepts.
But here, this input, it goes up into an encoder,
which in this case, we're going to use something called SONAR.
Right, and then as we encode it into the SONAR space,
the sequence of concepts are then passed to the LCM to predict the next concept or sentence.
And then the next piece of this is that the SONAR embedding then is
decoded into what the actual sentence structure is
so that even a human or even another agent can better understand it.
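The pipeline just described (encode sentences into concept embeddings, predict the next concept, decode back to text) can be sketched with stand-in stubs. None of these functions are the real SONAR or LCM APIs; the point is only the shape of the data flow:

```python
def encode(sentence):
    # Stand-in "SONAR-style" encoder: hash words into a tiny fixed-size vector.
    vec = [0.0] * 4
    for word in sentence.lower().split():
        vec[hash(word) % 4] += 1.0
    return vec

def predict_next_concept(concepts):
    # Stand-in LCM: predict the next concept as the mean of the previous ones.
    n = len(concepts)
    return [sum(c[i] for c in concepts) / n for i in range(4)]

def decode(concept):
    # Stand-in decoder: the real system maps an embedding back to a sentence.
    return f"<sentence for concept {['%.2f' % x for x in concept]}>"

# sentences -> concept embeddings -> next-concept prediction -> decoded text
sentences = ["AI blends art and science.", "Concepts sit above tokens."]
concepts = [encode(s) for s in sentences]
next_concept = predict_next_concept(concepts)
print(decode(next_concept))
```

The key contrast with an LLM is that the prediction step operates on one embedding per sentence, not one embedding per token.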
And so there's even what's called a diffusion-based LCM.
So just like image generation.
The model, it slowly removes noise from
the candidate concept until it's finally revealed at the very end.
So what this means is that the last input sentence, which in this case would be S5,
it's input into an LCM and it's denoised over time.
And the embedding is slowly revealed right at the end.
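The denoising idea can be illustrated with a toy loop: start from a noised embedding and iteratively step toward the clean one. A real diffusion LCM learns the denoising network from data; here the "denoiser" is an assumed simple interpolation toward a known target, just to show the iterative refinement:

```python
import random

random.seed(0)

target = [0.9, 0.1, 0.4]                            # the "clean" concept embedding
noisy = [t + random.gauss(0, 1.0) for t in target]  # fully noised starting point

for step in range(10):
    # Each step removes a fraction of the remaining noise
    noisy = [n + 0.5 * (t - n) for n, t in zip(noisy, target)]

error = sum((n - t) ** 2 for n, t in zip(noisy, target)) ** 0.5
print(round(error, 4))  # small residual: the embedding has been "revealed"
```

Each pass halves the remaining distance to the target, which mirrors how the candidate concept is gradually revealed over the denoising steps.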
And this method also uses what's called a two-tower approach,
where it splits the denoiser and the decoder.
So we have two parts, much like the encoder-decoder transformer architecture.
Now these LCMs, what it gets us, right,
is it first allows us to reason in a much different way than before.
So we can reason at the abstract level rather than the token level.
It also allows us to go to this hierarchical kind of structure, right?
That's very important so that these models can better understand
what's happening, much like how humans do.
Then we can also have much longer content that goes in.
So we can think of this as a larger context window.
But this is a higher level of reasoning that these sentences now have.
Now, the other part that it also enables us to do, is what's called a zero-shot generation.
Right, and this helps us to get specificity
without the need for all these lower level type tokens.
And then also what I wanted to note too is that it's modality agnostic.
So what I can do is I can pass in different types of sentences or
different types of sound, images into this encoder
and SONAR can handle it with relative ease and then the LCM can then in turn reason over it.
So this story is really about what's called data representation,
and it shows that our systems and methods are
abstracting reasoning into this higher order of thought levels.
This will make AI much more generalizable, but also useful for everyday tasks.