
Meta Unveils Non‑Generative VLJ Model

Key Points

  • Meta’s former AI chief scientist Yann LeCun published a paper on “VLJ,” a vision‑language model that uses a joint‑embedding predictive architecture (JEPA) as an extension of the earlier V‑JEPA design.
  • Unlike generative models (e.g., ChatGPT, GPT‑4) that produce text token‑by‑token, VLJ is a non‑generative system that directly predicts a meaning vector in semantic space and only converts it to words when required.
  • This semantic‑space approach makes VLJ roughly twice as parameter‑efficient and faster than traditional vision‑language models while often delivering superior performance on image, video, and language tasks.
  • The authors argue that intelligence is about world understanding rather than language generation, positioning VLJ as a step toward agents that “think” first and “speak” later—an architecture with strong implications for robotics and embodied AI.
  • Yann LeCun’s departure from Meta to start his own AI venture underscores his commitment to this philosophy, suggesting the VLJ paper may signal a shift away from token‑based large language models toward more grounded, meaning‑centric AI systems.

Full Transcript

# Meta Unveils Non‑Generative VLJ Model

**Source:** [https://www.youtube.com/watch?v=Cis57hC3KcM](https://www.youtube.com/watch?v=Cis57hC3KcM)
**Duration:** 00:10:26

## Sections

- [00:00:00](https://www.youtube.com/watch?v=Cis57hC3KcM&t=0s) **Meta Unveils Non‑Generative Vision‑Language Model** - Meta’s FAIR team released the VLJ paper, describing a joint‑embedding predictive architecture that directly maps visual inputs to semantic meaning: bypassing token‑by‑token generation, cutting parameters in half, and promising faster, more efficient performance for vision‑language tasks and robotics.
- [00:03:22](https://www.youtube.com/watch?v=Cis57hC3KcM&t=202s) **VLJ vs. Cheap Vision Models** - The speaker explains that unlike low‑cost frame‑by‑frame models that merely label each image, VLJ processes a continuous video stream, builds temporal context, and only outputs stable, confident action descriptions.
- [00:07:10](https://www.youtube.com/watch?v=Cis57hC3KcM&t=430s) **Efficient Video Understanding via VLJ** - The speaker contrasts massive visual language models with a streamlined 0.5‑billion‑parameter VLJ predictor, showing it outperforms token‑based video classifiers while underscoring the richness and complexity of real‑world visual data.

## Full Transcript
So Meta's AI chief released a new paper. And is this the beginning of the end for LLMs? Let's talk about it. Most of you know that Meta's AI chief scientist Yann LeCun reportedly left Meta, or is leaving Meta, to build his own AI startup. But before that, he made a really interesting paper that I want to talk about. The paper, which he wrote with a bunch of other researchers from Meta, is called VLJ. It's a vision-language model built on a joint-embedding predictive architecture (JEPA), and you could say it's an extension of the V-JEPA architecture.

This is really cool because it comes out of Meta's FAIR lab, with LeCun leading it, and the super interesting thing I found is that unlike models like ChatGPT that generate answers word by word, VLJ does something completely different. This is a non-generative model: it predicts meaning directly, not via text. The model builds an internal understanding of what it sees, images and video, and then converts that understanding into words only if needed. Because it learns in a semantic space instead of token space, it's faster, more efficient, and uses about half the parameters of traditional vision-language models while often performing better. And that's crazy, because what this means for robotics and agents is huge. So let's get into it. One of the things I really want to point out, to show you how different this architecture is, is that the paper describes this as a non-generative system.
If you know what a generative system is: a generative model like ChatGPT or GPT-4 produces tokens, or words, one at a time, left to right, and every output must be fully written to exist. So to answer "what's happening in this video," a generative model decides the first word, then the second, then the third, until it finishes the entire sentence. It literally can't know the final answer until it finishes generating it, which is slow and painful. A non-generative system, by contrast, does not need to talk to think. VLJ does not generate words by default. It doesn't predict the next token, and it doesn't need sentences to exist. Instead, it predicts a meaning vector directly. Think of the difference like this: generative AI says "let me explain what I think while I'm still figuring it out," while non-generative AI says "I already know, and I'll only explain if you ask."

And remember, the entire reason Yann LeCun cares about this so much is that he has been saying for a long time that language is not intelligence. His belief is that intelligence equals understanding the world, and language is simply an output format. VLJ reflects that philosophy exactly. That's why this video is about what might come after LLMs: instead of thinking in language and reasoning in tokens, you're thinking in the latent space, reasoning in meaning, and language is optional. That's the paradigm shift this paper is talking about, and I think that maybe, just maybe, if this gains more traction, this could be what comes post-LLMs.
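To make that contrast concrete, here is a deliberately toy sketch in Python (numpy only). Everything here, the function names, the "label bank" of prototype vectors, the random stand-ins for networks, is my own illustration of the idea, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "generative" answering: the answer is built one token at a time,
# and the full answer only exists after the last step.
def generative_answer(vocab, steps=4):
    tokens = []
    for _ in range(steps):                 # one forward pass per token
        tokens.append(rng.choice(vocab))   # stand-in for next-token sampling
    return " ".join(tokens)

# Toy "non-generative" answering: one forward pass yields a single
# meaning vector; words are only looked up if someone asks for them.
def predict_meaning(frames):
    return frames.mean(axis=0)             # stand-in for an encoder + predictor

def verbalize(meaning, label_bank):
    # Optional decoding step: the nearest labelled prototype wins.
    names = list(label_bank)
    scores = [float(meaning @ label_bank[n]) for n in names]
    return names[int(np.argmax(scores))]

frames = rng.normal(size=(8, 4))           # fake 8-patch, 4-dim "video"
bank = {"picking up canister": np.ones(4), "idle": -np.ones(4)}
print(generative_answer(["the", "hand", "grabs", "it"]))
print(verbalize(predict_meaning(frames), bank))
```

The point of the sketch is only the shape of the computation: one loop with one forward pass per word versus a single vector prediction with an optional lookup at the end.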
Essentially, what you're looking at in this video is a map of the internal understanding over time. Each dot is what the AI thinks is happening at that moment. The red dots are the instant guesses, and the blue is the stabilized understanding. What you're seeing on the left is the vision model's view, what it would be able to see.

Now, what most people are going to ask is: how is this even different from a cheap vision model just describing exactly what the video is doing? The short answer is that cheap models talk, but VLJ understands, so we need to break down exactly what that means. A low-cost vision model, the "describer," works like this: frame, then label, then frame, then label, then frame, then label. It looks at each frame, guesses what it sees, and spits out the text immediately. What does that look like? "Hand," "bottle," "picking up canister." It's jumpy and inconsistent, with no memory; it's reacting, not understanding. VLJ does this instead: it's got a video stream, continuous meaning, and then the event. It tracks the meaning over time, building a stable understanding, and it only labels the action once it's confident. That's why you see a red dot, an instant guess, which might be wrong; it might say "bottle." But the blue dot is the stabilized meaning: it's a canister. The reason this actually matters a lot is that the cheap model is going to say "I see a bottle. I see a bottle. I see a bottle."
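That red-dot/blue-dot behaviour can be approximated with something as simple as an exponential moving average over per-frame meaning vectors, emitting a label only once the smoothed vector is confidently close to a known prototype. This is a minimal sketch of the idea, assuming unit-norm embeddings; it is not the paper's actual algorithm:

```python
import numpy as np

def stabilized_labels(frame_embs, label_bank, alpha=0.3, threshold=0.9):
    """Per-frame 'instant guess' vs. EMA-'stabilized' label.

    frame_embs: (T, D) unit-norm per-frame meaning vectors
    label_bank: dict name -> unit-norm (D,) prototype vector
    Returns (instant, stable) lists; stable entries stay None until the
    smoothed meaning is confidently close to one prototype.
    """
    names = list(label_bank)
    protos = np.stack([label_bank[n] for n in names])        # (L, D)
    ema = np.zeros(frame_embs.shape[1])
    instant, stable = [], []
    for e in frame_embs:
        instant.append(names[int(np.argmax(protos @ e))])    # jumpy guess
        ema = (1 - alpha) * ema + alpha * e                  # temporal context
        u = ema / (np.linalg.norm(ema) + 1e-9)
        sims = protos @ u
        best = int(np.argmax(sims))
        stable.append(names[best] if sims[best] >= threshold else None)
    return instant, stable

# Hypothetical 2-dim embeddings: one bottle-ish frame, then canister frames.
frames = np.array([[0.8, 0.6]] + [[0.0, 1.0]] * 5)
bank = {"bottle": np.array([1.0, 0.0]), "canister": np.array([0.0, 1.0])}
instant, stable = stabilized_labels(frames, bank)
print(instant[0], stable[0], stable[-1])   # bottle None canister
```

The first frame produces a wrong instant guess and no stable label; once enough evidence accumulates, the smoothed meaning locks in on "canister," which is exactly the drift-then-lock pattern the dot cloud shows.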
But VLJ is going to actually understand the action and say "the action is picking up a canister." The killer difference, of course, is time. Low-cost models think in single frames and have no real sense of before and after. VLJ thinks in temporal meaning: it knows when an action starts, continues, and ends. That's why it's extremely useful for robotics, wearables, agents, and real-world planning. And the dot cloud matters because it shows meaning drifting slightly from frame to frame, then locking in once enough evidence exists. That's something token-based models can't really do efficiently, because, number one, they need to keep generating text, and number two, they can't hold a silent semantic state. If you think about it, a cheap model is basically a CCTV motion detector shouting guesses, while VLJ is a human watching and saying, "ah, okay, he's picking something up."

Then, of course, you might want to understand the architecture diagram. This is the VLJ model architecture; if you wanted to know how this works, this is basically it. Honestly, it was a little confusing, so I decided to get a simpler description; I actually used GPT Image 1.5 to generate this image, because it's actually pretty good. And if that's too much, I also have this one right here: language is optional, understanding is not. Basically, the X-encoder takes the visual input, the video frames. The predictor is basically the brain. The Y-encoder takes the textual query, which is what you'd be asking it. And then you've got the Y-decoder, which turns the encoded meanings back into words.
Then, of course, you've got the comparison of "thoughts," which is the training loss, essentially meaning it gets better over time. And then you've got the final output, the correct answer, which is the actual meaning.

Now, if we look at the tests, this is currently the best. We're looking at the scoreboard with the different AI models: CLIP, SigLIP, and PE Core, which are older, well-known vision models, compared against VLJ Base and VLJ SFT, which is the fine-tuned variant, and we can see that VLJ is a really, really incredible improvement. One thing I think a lot of people are going to miss is that VLJ is super small. You know how generative models are just tokens on tokens on tokens? If you're thinking about something that actually reasons more like a human, look at the number of parameters and the number of samples seen: VLJ is 1.6 billion parameters with about 2 billion samples seen. It's remarkably more efficient than the other models we're looking at, and I think that's pretty incredible. If we continue over here, you can see the zero-shot video captioning chart, which shows that with the same data and the same setup, VLJ learns faster and reaches higher caption quality; predicting meaning learns faster than predicting words. Then chart two is zero-shot video classification, and it's the same thing: VLJ pulls ahead quickly while the visual language models improve very slowly.
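Stripped of all the deep-learning machinery, the training setup described above, two encoders, a predictor, and a loss computed between embeddings rather than between tokens, can be sketched like this. Random linear maps stand in for the real networks; this is my own illustration of the JEPA-style objective, not Meta's code:

```python
import numpy as np

rng = np.random.default_rng(1)
D_in, D_emb = 16, 8

# Toy stand-ins for the diagram's components (random linear maps):
Wx = rng.normal(scale=0.1, size=(D_in, D_emb))   # X-encoder: video frames -> embedding
Wy = rng.normal(scale=0.1, size=(D_in, D_emb))   # Y-encoder: target text -> embedding
Wp = rng.normal(scale=0.1, size=(D_emb, D_emb))  # predictor: "the brain"

def training_loss(x, y):
    """One JEPA-style step: predict the target's *embedding*, not its tokens."""
    zx = x @ Wx            # encode the visual input
    zy = y @ Wy            # encode the target (no heavy decoder in the loop)
    pred = zx @ Wp         # predictor guesses the target embedding
    return float(np.mean((pred - zy) ** 2))   # the loss lives in semantic space

x = rng.normal(size=D_in)   # fake video features
y = rng.normal(size=D_in)   # fake caption features
print(training_loss(x, y))
```

Because the loss compares vectors rather than generated sentences, nothing in the training loop ever has to produce words, which is also why the trainable predictor can stay so small.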
So even without fine-tuning, VLJ understands videos better, and this kills the idea that you need token generation to understand things. It's clear that Yann LeCun is on to something. Once again, remember when I said to look at the actual size of the models: visual language models are much larger and much less efficient, while VLJ only needs about 0.5 billion parameters for its predictor, and there's no heavy decoder during training. So VLJ gets better results with half the trainable parameters, which is pretty insane in machine-learning terms. And here we have Yann LeCun talking about this stuff; I think this was around two to three weeks ago.

>> A four-year-old has seen as much visual data as the biggest LLM trained on all the text ever produced. And so what that tells you is that there is way more information in the real world, but it's also much more complicated. It's noisy, it's high-dimensional, it's continuous. And basically the methods that are employed to train LLMs do not work in the real world. That explains why we have LLMs that can pass the bar exam or solve equations or compute integrals like college students and solve math problems, but we still don't have a domestic robot that can, you know, do the chores in the house. We don't even have level-five self-driving cars. I mean, we have them, but we cheat. We certainly don't have self-driving cars that can learn to drive in 20 hours of practice like any teenager.

And then I actually went on LeCun's Twitter and saw him reposting this from Sonia Joseph.
Now, this is someone who works at Meta, and she essentially said: we don't simulate every atom to model intelligence; we don't use quantum field theory to model road traffic. JEPA taught me the importance of learning physics at the right level of abstraction. Thank you, Yann, and the JEPA team; it was a privilege to work with you. So I'll definitely take a look at this. The thesis behind JEPA is that our current models are not predicting causal dynamics, and if you both predict in latent space and predict the future, then you're more likely to abstract away all the pixel-level details. For example, when we model even this conversation right now, we don't have to model it down to the level of atoms; that would be so computationally costly and so inefficient. We model things at the representation suited for our goal. Similarly, JEPA is optimized to have physical representations at the level of abstraction it needs, which enables it to plan in the physical world and do counterfactual reasoning about objects that are moving around.

>> Now, I did see a few comments on Reddit about the video saying that most of the actions it detects are wrong, though: if you stop the video at any time to actually read what it says, it's really bad. And the same person says they stopped it five times and they were all wrong; it made up a side of pizza, made up something else. But I think the most important thing here is not whether it's 100% right. I think the most important thing is that it's actually moving us in the right direction of where AI models should be, and not just getting completely distracted by chatbots.