Beyond Tokens: Conceptual Language Models
Key Points
- Modern large language models (LLMs) predict the next token, but the field is advancing toward “language concept models” (LCMs) that predict whole concepts and reason across sentences.
- Both LLMs and LCMs rely on embedding text into high‑dimensional vector spaces, where similarity (e.g., cosine similarity) captures relationships between sentences or concepts.
- Early embeddings were frequency‑based, counting word occurrences, but they lacked the depth of today’s prediction‑based embeddings that project words into richer semantic spaces.
- Key milestones in prediction‑based embeddings include word2vec (2013), followed by GloVe, ELMo, BERT, ALBERT, and newer models like SONAR, which encode both semantic meaning and context.
- Word embeddings transform raw text into vectors that LLMs can process, enabling the models to understand vocabulary, context, and deeper conceptual relationships.
Sections
- Untitled Section
- Transformer Encoder‑Decoder Overview - The passage explains how a transformer’s encoder processes tokenized inputs through multi‑head attention, feed‑forward layers, and normalization to create encoded representations, which are then fed into a decoder that similarly applies attention mechanisms to generate outputs.
- Diffusion-Based Concept Prediction Pipeline - The speaker explains how sentences are encoded into a SONAR space, processed by a diffusion‑based latent concept model (LCM) that denoises and decodes abstract concept embeddings via a two‑tower encoder‑decoder architecture, enabling hierarchical, abstract reasoning instead of token‑level processing.
Source: [https://www.youtube.com/watch?v=Le86PMGK2Uk](https://www.youtube.com/watch?v=Le86PMGK2Uk)
Duration: 00:09:11
Section timestamps:
- [00:00:00](https://www.youtube.com/watch?v=Le86PMGK2Uk&t=0s) Untitled Section
- [00:03:17](https://www.youtube.com/watch?v=Le86PMGK2Uk&t=197s) Transformer Encoder‑Decoder Overview
- [00:06:22](https://www.youtube.com/watch?v=Le86PMGK2Uk&t=382s) Diffusion-Based Concept Prediction Pipeline
## Full Transcript
You might be wondering, what's next in generative AI?
Well, today's large language models,
what they do is they predict a token given a sequence of other tokens.
However, we're quickly moving up the thought abstraction tree where
these language concept models are now emerging.
And what that is, is it's able to reason within the sentence space.
So if I were to talk about a topic, then in turn,
every single time the LCM were to predict what I was gonna say, it might be a bit different.
But now what we're doing is we're predicting the
probability of a concept given the series of sentences that came before.
But in general, both approaches seek to predict
the next concept or token given a series of embeddings.
Now this gives rise to the notion that LLMs and now LCMs are about data representation.
Word embeddings are a way to represent words or sentences within these large vector spaces.
So for example, say if I'm drawing a three dimensional grid here.
And I take the sentence: artificial intelligence is a combination of art and science.
Now, perhaps it would land somewhere here.
So now I have my X, my Y, and my Z coordinates.
And this very effectively represents that sentence.
And what I can do now is I can figure out what the relationship is between
that sentence and maybe another one that could land about here.
And I can take what's called a cosine similarity measure between this vector and that vector,
to tell me how close they are in concept.
And there are many other ways of which I can take that measurement.
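The cosine similarity just mentioned can be computed directly. A minimal sketch, using two made-up 3-D sentence vectors (real sentence embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two hypothetical 3-D sentence embeddings
s1 = [0.9, 0.1, 0.3]
s2 = [0.8, 0.2, 0.4]
print(cosine_similarity(s1, s2))  # close to 1.0 -> conceptually similar sentences
```

A value near 1 means the vectors point in nearly the same direction; near 0 means the sentences are unrelated in this space.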
But now let's take a step back.
Arguably, one of the first types of embeddings that we
have is called a frequency-based embedding.
And what this does is it looks at the most prominent terms within a sentence,
and it represents how many times each word actually showed up.
And that was one of the first ways to help us
identify the terms that are the most significant within a certain document.
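A frequency-based embedding of this kind is essentially a count vector over a vocabulary. A minimal sketch:

```python
from collections import Counter

def frequency_embedding(text, vocabulary):
    # Count how many times each vocabulary term shows up in the text
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vocab = ["art", "science", "intelligence"]
doc = "artificial intelligence is a combination of art and science and more art"
print(frequency_embedding(doc, vocab))  # -> [2, 1, 1]
```

Every document becomes a vector of raw counts, which captures prominence but, as noted above, nothing about meaning or context.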
However, this really lacked the depth of the
prediction-based embeddings that we have today.
And so the prediction-based embeddings are very much different.
What they do is use models to actually project words into a
higher-dimensional space, like we see here.
And this is the type of embedding that's really dominant today.
One of the most significant breakthroughs that happened was in 2013,
with the introduction of the word2vec model.
And then shortly thereafter, we saw the introduction of GloVe,
ELMo, BERT, ALBERT, and there's even a newer one called SONAR.
But these techniques enable us to not only represent the words,
but also to capture what's called the semantic and the contextual
value of where they land in this space or what their meaning is.
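One famous property of these prediction-based spaces is that directions carry meaning, as in the word2vec-style analogy king − man + woman ≈ queen. A toy sketch with hand-picked 2-D vectors (real embeddings are learned and far higher-dimensional, so treat every number here as illustrative only):

```python
# Hand-picked toy vectors: dimension 0 ~ "royalty", dimension 1 ~ "maleness".
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.1],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

# king - man + woman: strip maleness from royalty, add femaleness back
target = add(sub(vectors["king"], vectors["man"]), vectors["woman"])

def nearest(v, table):
    # Euclidean nearest neighbour among the toy vocabulary
    return min(table, key=lambda w: sum((x - y) ** 2 for x, y in zip(table[w], v)))

print(nearest(target, vectors))  # -> queen
```

The arithmetic only works because the model has arranged semantically related words along consistent directions, which is exactly the "semantic and contextual value" being described.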
Now you might be wondering, what do embeddings have to do with LLMs?
Well, word embeddings convert tokenized text or sentences into vectors,
such that an LLM can really work with the vocabulary.
And what happens is these embeddings are here, so your sentences and words are inputs into that.
Now, as it feeds forward into a stack,
so I might have, let's say, five of these different types of encoders,
and the data goes up, then it keeps going up into even more of these encoders as it happens.
But you'll notice right in the middle, right, we have this multi-head attention,
which tells us which tokens are more interesting than others.
Then we go to a fully connected neural network.
And when this happens,
this helps to find all the different relationships between all of the tokens
that were input, already weighted toward what we should pay attention to.
Finally, we normalize and produce the output.
So now we have this encoded representation of the data that's already been tokenized and embedded.
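The multi-head attention step described above builds on scaled dot-product attention. A minimal single-head sketch in plain Python; real encoders add learned query/key/value projections, multiple heads, feed-forward layers, and layer normalization:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention: score(q, k) = (q . k) / sqrt(d).
    # Each output is a softmax-weighted mix of the value vectors,
    # so the most relevant tokens dominate the result.
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# Three toy token embeddings attending to each other (self-attention)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(tokens, tokens, tokens)
print(out)  # each row is a weighted blend of all token vectors
```

Because queries, keys, and values all come from the same token sequence here, this is self-attention; in the decoder's cross-attention, the keys and values come from the encoder instead.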
And the next part of this is we then go over into what's called a decoder.
And this is what this stack is.
And you might be wondering, how are these two linked together?
Well, the output of this, it actually goes into one of these attention heads.
And whenever we do that, this influences what the decoder should pay attention to.
But first of all, let's just talk a bit about how this works, right?
We also said that we have output from the encoder.
Well, that output can also become input from a previous
stack of these encoders into the decoder here.
And as this goes up, it follows much of the same process that happens over here,
where we then go up and we might tokenize it a bit more,
we might do some more embeddings to get into another space.
And then we go into a multi-head attention, which tells us again,
based on how we've trained the attention heads.
What should we pay attention to?
And then we go to a second attention block, the cross-attention, which as I mentioned is influenced by the encoder.
So the encoder has another representation that says, based on the decoder's state,
which words in the sentence we should pay more or less attention to.
Then we go up again, right?
And then, we do this neural network transformation to cross over and find
the different relationships between the hidden neurons and the stack.
And then have an output.
And the output...
will eventually be decoded back into some other text so that we then, in turn, as humans,
can better understand the output representation so
that maybe even another agent can link up to it and then use that.
But again, these blocks are stacked together.
We have five of these that I just drew up,
but then you might have ten of those that are stacked up together.
And if there's a two-to-one kind of relationship, then you may interlace these inputs
right over into this 2x stack of encoders.
So that's a lot of where the art comes in,
is how do we put these two encoders and decoders together.
But in essence, this really gives us rise to the modern
day LLM that most of us use on a daily basis.
So now what if these LLMs could reason at the concept level instead of at the token level?
Well, this is where LCMs enter the field.
So an LCM is trained to predict the next concept, or even sentence.
But now a concept represents a high-level idea.
It's agnostic to language, and it's agnostic to modality,
and this allows hierarchical reasoning within the concept space.
So here is the fundamental idea of how it works.
So first, there are sentences at the very bottom.
So I have sentence one all the way to sentence five.
And these represent different types of concepts.
But here, this input, it goes up into an encoder,
which in this case, we're going to use something called SONAR.
Right, and then as we encode it into the SONAR space,
the sequence of concepts are then passed to the LCM to predict the next concept or sentence.
And then the next piece of this is that the SONAR embedding then is
decoded into what the actual sentence structure is
so that even a human or even another agent can better understand it.
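The pipeline just described (encode sentences into concept embeddings, predict the next concept, decode back to text) can be sketched with stand-in stubs. None of these functions are the real SONAR or LCM APIs; the point is only the shape of the data flow:

```python
def encode(sentence):
    # Stand-in "SONAR-style" encoder: hash words into a tiny fixed-size vector.
    vec = [0.0] * 4
    for word in sentence.lower().split():
        vec[hash(word) % 4] += 1.0
    return vec

def predict_next_concept(concepts):
    # Stand-in LCM: predict the next concept as the mean of the previous ones.
    n = len(concepts)
    return [sum(c[i] for c in concepts) / n for i in range(4)]

def decode(concept):
    # Stand-in decoder: the real system maps an embedding back to a sentence.
    return f"<sentence for concept {['%.2f' % x for x in concept]}>"

# sentences -> concept embeddings -> next-concept prediction -> decoded text
sentences = ["AI blends art and science.", "Concepts sit above tokens."]
concepts = [encode(s) for s in sentences]
next_concept = predict_next_concept(concepts)
print(decode(next_concept))
```

The key contrast with an LLM is that the prediction step operates on one embedding per sentence, not one embedding per token.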
And so there's even what's called a diffusion-based LCM.
So just like image generation.
The model, it slowly removes noise from
the candidate concept until it's finally revealed at the very end.
So what this means is that the last input sentence, which in this case would be S5,
it's input into an LCM and it's denoised over time.
And the embedding is slowly revealed right at the end.
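The denoising idea can be illustrated with a toy loop: start from a noised embedding and iteratively step toward the clean one. A real diffusion LCM learns the denoising network from data; here the "denoiser" is an assumed simple interpolation toward a known target, just to show the iterative refinement:

```python
import random

random.seed(0)

target = [0.9, 0.1, 0.4]                            # the "clean" concept embedding
noisy = [t + random.gauss(0, 1.0) for t in target]  # fully noised starting point

for step in range(10):
    # Each step removes a fraction of the remaining noise
    noisy = [n + 0.5 * (t - n) for n, t in zip(noisy, target)]

error = sum((n - t) ** 2 for n, t in zip(noisy, target)) ** 0.5
print(round(error, 4))  # small residual: the embedding has been "revealed"
```

Each pass halves the remaining distance to the target, which mirrors how the candidate concept is gradually revealed over the denoising steps.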
And this method also uses what's called a two-tower approach,
where it splits the denoiser and the decoder.
So we have two parts, much like the encoder-decoder transformer architecture.
Now these LCMs, what it gets us, right,
is it first allows us to reason in a much different way than before.
So we can reason at the abstract level rather than the token level.
It also allows us to go to this hierarchical kind of structure, right?
That's very important so that these models can better understand
what's happening, much like how humans do.
Then we can also have much longer content that goes in.
So we can think of this as a larger context window.
But this is a higher level of reasoning that these sentences now have.
Now, the other part that it also enables us to do, is what's called a zero-shot generation.
Right, and this helps us to get specificity
without the need for all these lower level type tokens.
And then also what I wanted to note too is that it's modality agnostic.
So what I can do is I can pass in different types of sentences or
different types of sound, images into this encoder
and SONAR can handle it with relative ease and then the LCM can then in turn reason over it.
So this story is really about what's called data representation,
and it shows that our systems and methods are
abstracting reasoning into this higher order of thought levels.
This will make AI much more generalizable, but also useful for everyday tasks.