
Beyond Tokens: Conceptual Language Models

Key Points

  • Modern large language models (LLMs) predict the next token, but the field is advancing toward “language concept models” (LCMs) that predict whole concepts and reason across sentences.
  • Both LLMs and LCMs rely on embedding text into high‑dimensional vector spaces, where similarity (e.g., cosine similarity) captures relationships between sentences or concepts.
  • Early embeddings were frequency‑based, counting word occurrences, but they lacked the depth of today’s prediction‑based embeddings that project words into richer semantic spaces.
  • Key milestones in prediction‑based embeddings include word2vec (2013), followed by GloVe, ELMo, BERT, ALBERT, and newer models like SONAR, which encode both semantic meaning and context.
  • Word embeddings transform raw text into vectors that LLMs can process, enabling the models to understand vocabulary, context, and deeper conceptual relationships.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=Le86PMGK2Uk](https://www.youtube.com/watch?v=Le86PMGK2Uk)

**Duration:** 00:09:11

## Sections

- [00:03:17](https://www.youtube.com/watch?v=Le86PMGK2Uk&t=197s) **Transformer Encoder‑Decoder Overview** - How a transformer's encoder processes tokenized inputs through multi‑head attention, feed‑forward layers, and normalization to create encoded representations, which are then fed into a decoder that applies similar attention mechanisms to generate outputs.
- [00:06:22](https://www.youtube.com/watch?v=Le86PMGK2Uk&t=382s) **Diffusion-Based Concept Prediction Pipeline** - How sentences are encoded into a SONAR space and processed by a diffusion‑based language concept model (LCM) that denoises and decodes abstract concept embeddings via a two‑tower encoder‑decoder architecture, enabling hierarchical, abstract reasoning instead of token‑level processing.
## Full Transcript
0:00 You might be wondering: what's next in generative AI? Today's large language models predict a token given a sequence of other tokens. However, we're quickly moving up the thought abstraction tree, and language concept models (LCMs) are now emerging. An LCM is able to reason within the sentence space. So if I were to talk about a topic, then every single time the LCM predicted what I was going to say, it might be a bit different. What we're doing now is predicting the probability of a concept given a series of sentences that came before. But in general, both algorithms seek to predict the next sequence or token given a series of embeddings.

0:42 This gives rise to the notion that LLMs, and now LCMs, are about data representation. Word embeddings are a way to represent words or sentences within these large vector spaces. For example, say I'm drawing a three-dimensional grid here, and I take the sentence "artificial intelligence is a combination of art and science." Perhaps it would land somewhere here. So now I have my X, my Y, and my Z coordinates, and this vector very effectively represents that sentence. What I can do now is figure out the relationship between that sentence and maybe another one that lands about here. I can take what's called a cosine similarity measure between this vector and that vector to tell me, in concept space, how close they are. And there are many other ways I can take that measurement.

1:34 But now let's take a step back. Arguably, one of the first types of embeddings we had is called a frequency-based embedding. What this does is look at the most prominent terms within a sentence and represent how many times each word actually showed up.
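The cosine similarity measurement described above can be sketched in a few lines of Python. The three-dimensional vectors here are made-up stand-ins for real sentence embeddings, which typically have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means the
    vectors point the same way, near 0 or negative means they don't."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-D embeddings (invented for illustration only).
sentence_1 = [0.9, 0.2, 0.4]    # "AI is a combination of art and science"
sentence_2 = [0.8, 0.3, 0.5]    # a sentence about a related concept
sentence_3 = [-0.7, 0.9, -0.1]  # an unrelated sentence

print(cosine_similarity(sentence_1, sentence_2))  # high: similar concepts
print(cosine_similarity(sentence_1, sentence_3))  # low: dissimilar concepts
```

The same formula works unchanged on real high-dimensional embeddings; only the vector length grows.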
1:53 That was one of the first ways to help us identify the terms that are the most significant within a certain document. However, it really lacked the depth of the prediction-based embeddings that we have today. Prediction-based embeddings are very different: we use models to actually project words into a higher-dimensional space, like we see here, and this is the type of embedding that's really dominant today. One of the most significant breakthroughs happened in 2013 with the introduction of the word2vec model. Shortly thereafter, we saw the introduction of GloVe, ELMo, BERT, ALBERT, and there's even a newer one called SONAR. These techniques enable us not only to represent the words, but also to capture what's called their semantic and contextual value: where they land in this space, or what their meaning is.

2:46 Now you might be wondering: what do embeddings have to do with LLMs? Word embeddings convert text or sentences into vectors such that an LLM can really understand the vocabulary. So your sentences and words are inputs into these embeddings. As the data feeds forward into a stack, I might have, let's say, five of these encoders, and the data goes up, then keeps going up into even more of these encoders. But you'll notice, right in the middle, we have this multi-head attention, which tells us which tokens are more interesting than others. Then we go to a fully connected neural network, which helps to find all the different relationships between all of the tokens that were input, already biased toward what we should pay attention to.
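The attention step described above can be caricatured in pure Python. This is a minimal sketch of scaled dot-product attention with a single head: the toy token vectors are invented for illustration, and a real transformer uses learned projection matrices, many heads in parallel, residual connections, and normalization.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query's output is a blend of
    the values, weighted by how strongly the query matches each key."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)  # "which tokens are most interesting"
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# Three toy token embeddings; in self-attention the same vectors serve
# as queries, keys, and values.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
out = attention(tokens, tokens, tokens)
print(out)  # each output is a weighted blend of all token values
```

Note how the first output leans toward the first two (similar) tokens, while the third output leans toward the third: that weighting is the "bias toward what we should pay attention to."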
3:39 Finally, we normalize and produce the output. So now we have an encoded representation of the data that's already been tokenized and embedded. The next part is that we then go over into what's called a decoder, and that's what this second stack is. You might be wondering: how are these two linked together? Well, the output of the encoder actually goes into one of the decoder's attention heads, and whenever we do that, it influences what the decoder should pay attention to.

4:12 But first of all, let's just talk a bit about how this works. We also said that we have output from the encoder; that output can also become input from a previous stack of these encoders into the decoder here. As this goes up, it follows much the same process that happens on the encoder side: we might tokenize it a bit more, we might do some more embeddings to get into another space, and then we go into a multi-head attention, which tells us again, based on how we've trained the attention heads, what we should pay attention to. Then we go to a secondary attention, which, as I mentioned, is influenced by the encoder. So the encoder has another representation that says, based on the decoder, which words in a sentence we should pay more or less attention to. Then we go up again, and we do this neural network transformation to cross over and find the different relationships between the hidden neurons in the stack, and then we have an output.

5:14 The output will eventually be decoded back into some other text so that we, in turn, as humans can better understand the output representation, or so that maybe even another agent can link up to it and then use that. But again, these stacks are stacked together.
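The wiring between the two stacks can be sketched as follows. This is a toy illustration of one decoder block, assuming invented two-dimensional vectors; a real decoder block also has feed-forward layers, residual connections, normalization, and causal masking:

```python
import math

def attend(queries, keys, values):
    """Scaled dot-product attention over lists of plain Python vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        t = sum(w)
        w = [x / t for x in w]
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values))
                        for j in range(len(values[0]))])
    return outputs

def decoder_block(decoder_states, encoder_outputs):
    """One simplified decoder block: self-attention, then cross-attention."""
    # Self-attention: decoder positions attend to each other.
    x = attend(decoder_states, decoder_states, decoder_states)
    # Cross-attention ("secondary attention"): queries come from the
    # decoder, but keys and values come from the ENCODER output -- this
    # is the link between the two stacks described above.
    x = attend(x, encoder_outputs, encoder_outputs)
    return x

encoder_outputs = [[0.2, 0.8], [0.9, 0.1]]  # toy encoded representation
decoder_states = [[1.0, 0.0]]               # toy decoder state so far
print(decoder_block(decoder_states, encoder_outputs))
```

The key design point is in the second `attend` call: the decoder asks its questions (queries), but the answers (keys and values) come from what the encoder produced.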
5:32 We have five of these that I just drew, but then you might have ten of those stacked up together. And if there's a two-to-one kind of relationship, then you may interlace these inputs right over into this 2x stack of encoding. So that's a lot of where the art comes in: how do we put these encoders and decoders together? But in essence, this really gives rise to the modern-day LLM that most of us use on a daily basis.

5:59 So now, what if these LLMs could reason at the concept level instead of at the token level? Well, this is where LCMs enter the field. An LCM is trained to predict the next concept, or even sentence, where a concept represents a high-level idea. It's agnostic to language, and it's agnostic to modality, and this allows hierarchical reasoning within the concept space.

6:22 So here is the fundamental idea of how it works. First, there are sentences at the very bottom, sentence one all the way to sentence five, and these represent different concepts. This input goes up into an encoder, which in this case is something called SONAR. As we encode into the SONAR space, the sequence of concepts is then passed to the LCM to predict the next concept or sentence. The next piece is that the SONAR embedding is then decoded back into the actual sentence structure, so that even a human, or even another agent, can better understand it.

7:01 And there's even what's called a diffusion-based LCM. Just like in image generation, the model slowly removes noise from the candidate concept until it's finally revealed at the very end. What this means is that the last input sentence, which in this case would be S5, is input into the LCM and denoised over time.
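The encode / predict / decode pipeline described above can be sketched schematically. Everything here is a stand-in: the "encoder" is a deterministic hash-based toy (a real system would call SONAR), and the "LCM" is a nearest-neighbor lookup purely to show where a trained model would slot in; it is not how an actual LCM predicts.

```python
import math
import random

def toy_sentence_encoder(sentence, dim=8):
    """Stand-in for a real sentence encoder such as SONAR: maps a whole
    sentence to one fixed-size 'concept' vector. Here the vector is
    derived deterministically from the text and carries no semantics."""
    rng = random.Random(sentence)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def toy_lcm_predict(concept_sequence, candidates):
    """Stand-in for the LCM: given a sequence of concept embeddings,
    pick a 'next concept'. A real LCM is a trained model operating in
    the embedding space; here we just take the candidate embedding
    closest to the last concept, for illustration."""
    last = concept_sequence[-1]
    return min(candidates, key=lambda c: math.dist(last, candidates[c]))

# Step 1: encode each sentence into a concept vector (the SONAR step).
sentences = ["S1 ...", "S2 ...", "S3 ...", "S4 ...", "S5 ..."]
concepts = [toy_sentence_encoder(s) for s in sentences]

# Step 2: predict the next concept from the sequence (the LCM step).
candidates = {s: toy_sentence_encoder(s)
              for s in ["A plausible next sentence ...", "An unrelated one ..."]}
print(toy_lcm_predict(concepts, candidates))

# Step 3 (not shown): a real pipeline would DECODE the predicted
# embedding back into natural-language text.
```

The important structural point is that prediction happens entirely in the sentence-embedding space; tokens only appear at the encode and decode boundaries.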
7:23 The embedding is slowly revealed right at the end. This method also uses what's called a two-tower approach, where it splits the denoiser and the decoder. So we have two parts, much like what the encoder-decoder transformer architecture looks like.

7:39 Now, what do these LCMs get us? First, they allow us to reason in a much different way than before: at the abstract level rather than the token level. They also allow us to go to this hierarchical kind of structure, which is very important so that these models can better understand what's happening, much like humans do. We can also have much longer content that goes in; we can even call it a larger context window, but with a higher level of reasoning over these sentences. The other part it enables is what's called zero-shot generation, which helps us get specificity without the need for all these lower-level tokens. And I also wanted to note that it's modality agnostic: I can pass in different types of sentences, sounds, or images into this encoder, SONAR can handle it with relative ease, and the LCM can then in turn reason over it.

8:57 So this story is really about what's called data representation, and it shows that our systems and methods are abstracting reasoning into higher orders of thought. This will make AI much more generalizable, but also useful for everyday tasks.
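The "slowly remove noise" intuition can be caricatured in a few lines. A real diffusion-based LCM learns a denoising network that predicts the correction at each step; the loop below is handed the target and simply interpolates toward it, so it is only a sketch of the iterative-refinement shape, not of the learned model.

```python
import random

random.seed(42)

# Hypothetical "true" next-concept embedding (invented for illustration).
target = [0.6, -0.2, 0.9]

# Start from a heavily noised version of the concept.
noisy = [t + random.gauss(0.0, 1.0) for t in target]

for step in range(20):
    # Each step removes a fraction of the remaining noise. A trained
    # denoiser would PREDICT this correction rather than be given it.
    noisy = [n + 0.3 * (t - n) for n, t in zip(noisy, target)]

print(noisy)  # close to the target after the denoising steps
```

After 20 steps, only about 0.7^20 (less than 0.1%) of the original noise remains, which is why the concept is "finally revealed at the very end."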