# Vision Language Models Enable Image Understanding

**Source:** [https://www.youtube.com/watch?v=lOD_EE96jhM](https://www.youtube.com/watch?v=lOD_EE96jhM)
**Duration:** 00:09:35

## Summary

- Standard large language models can only ingest text, leaving visual information in PDFs, images, or handwritten notes inaccessible.
- Vision‑language models (VLMs) are multimodal, accepting both text and image inputs and outputting text‑based responses.
- VLMs enable tasks such as visual question answering, image captioning, and extracting and summarizing information from scanned documents, receipts, and data‑heavy visuals like graphs.
- They combine a conventional LLM that tokenizes and processes textual prompts with an image encoder that converts visual data into token‑like representations, allowing the two modalities to be jointly interpreted.
- By merging these representations, VLMs can reason over combined text‑image contexts, e.g., describing a scene, interpreting chart trends, or summarizing visual documents.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=lOD_EE96jhM&t=0s) **Vision-Language Models Enable Image Understanding** - The passage explains that traditional LLMs cannot process visual content, introduces multimodal vision‑language models that convert images into a form the model can interpret, and highlights tasks like visual question answering.
- [00:03:06](https://www.youtube.com/watch?v=lOD_EE96jhM&t=186s) **Merging Vision and Language Models** - The passage explains that large language models process text via tokenization while vision encoders translate images into high‑dimensional numerical embeddings, which are then combined so the multimodal system can understand both text and visual inputs.
- [00:06:20](https://www.youtube.com/watch?v=lOD_EE96jhM&t=380s) **Integrating Image Tokens into LLMs** - The passage explains how visual data is converted into tokens that share a latent space with text, enabling a large language model to jointly attend to both modalities, while also highlighting challenges such as tokenization bottlenecks, high memory costs, and hallucination risks.
- [00:09:27](https://www.youtube.com/watch?v=lOD_EE96jhM&t=567s) **Visual Reasoning in AI** - The speaker notes that AI systems can now perceive, interpret, and reason about visual information in ways that more closely resemble human visual cognition.

## Full Transcript
Large language models have a problem.
We know that they can process a text document, like a PDF maybe, and then they can respond to queries about it.
And they do this by encoding both the document and any prompt we provide
as tokens, and then putting that into the LLM, where it's processed
through attention mechanisms, and then generates a text-based response at the other side.
But what happens if that document, maybe that PDF, contains some images?
Well, standard LLMs can't read images the way that they process text.
So a picture or a graph or a handwritten note might contain some valuable information,
but without a way to convert that visual data into a form the LLM understands, it's gonna be inaccessible.
This is where vision language models, or VLMs, come in.
So vision language models are multimodal.
That means they can take in text, but they can also take in
image files, interpret their meaning, and then generate a text-based response.
So what sort of tasks can we perform with a VLM?
Well, one of those tasks is called VQA,
or visual question answering.
That's just kind of a fancy way to say that you can show a VLM a picture and have it analyze it.
Maybe we'll show it a photo of a busy city street and we can ask, hey, what's happening here?
Now the model doesn't just see pixels, it recognizes objects and people and context
and it could tell you, for example, that there's a car waiting at a red light.
Then there's also the ability to provide captioning for images.
So here we have the model generate a natural language description of an image.
So if we showed it a picture of a dog chasing a ball, it might say that's a golden retriever playing fetch in a park.
But VLMs aren't just about photographs.
They're also super useful for document understanding as well.
So let's say you've uploaded a scanned receipt.
The model can extract the text in that receipt.
It can organize it and then it can even summarize what it says.
And what about data-heavy visuals that we might have in a PDF?
Well, there we can use something called graph analysis to understand those as well.
So we could hand it a sales report and then we could ask, hey, what's the trend going on here?
And the model can extract the data in the graphs and interpret them.
So, vision language models don't just process images and text separately, they merge them.
But how do they actually do that?
Well, let's break it down.
We'll start with the part that's already familiar.
That is the large language model, the LLM.
Now the LLM, it takes in a text prompt.
So this is us sending in a message to the large language model.
And it converts the words in that text prompt into text tokens.
So we go from there to there.
Now these are discrete numerical representations of language and these text tokens are then
processed through the model's internal mechanisms, so primarily its attention layers,
to determine the relationships, the contextual meaning, and the patterns between words.
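That text pipeline can be sketched in a few lines. This is a toy illustration, not a real model: the vocabulary, dimensions, and random embedding table are all hypothetical, and real LLMs use learned subword tokenizers such as BPE rather than word-level lookup.

```python
import numpy as np

# Toy word-level tokenizer (hypothetical vocabulary; real LLMs use
# subword schemes such as BPE with tens of thousands of entries).
vocab = {"<unk>": 0, "what": 1, "is": 2, "in": 3, "this": 4, "image": 5}

def tokenize(text):
    """Map each word to a discrete integer token id."""
    return [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()]

token_ids = tokenize("What is in this image")
print(token_ids)  # [1, 2, 3, 4, 5]

# Each id indexes a row of a learned embedding table, producing the
# dense vectors that the attention layers then operate on.
d_model = 8  # illustrative; real models use thousands of dimensions
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))
text_embeddings = embedding_table[token_ids]  # shape (5, 8)
print(text_embeddings.shape)
```

The key point is the two-step move: text becomes discrete token ids, and each id becomes a dense vector.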
And then the result of all of this is a text output,
whether that's answering a question or summarizing a document or completing a sentence.
But vision language models introduce something new, and that new thing is an image input.
So now we have to process this: a photo, a graph, or anything else we want the model to understand.
But there is a challenge here.
Large language models don't work with raw images, they only work with text tokens.
So, before the LLM can process an image, it first needs to be
converted into a format it can understand and that's where a vision encoder comes into play.
So unlike an LLM which tokenizes words, the vision encoder processes images as high-dimensional numerical data.
Now, it doesn't see the images the way that we do.
Instead, it extracts patterns and edges and textures and spatial relationships,
and it converts them into something called a feature vector, or in fact a bunch of feature vectors.
Now that's a structured representation of the image's contents,
and this feature vector is essentially a dense embedding capturing the most relevant information from the image
while discarding all of the unnecessary details, kind of similar to how an LLM converts text into word embeddings.
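A minimal sketch of that encoding step, assuming a ViT-style encoder that cuts the image into patches and linearly projects each one. Every size here is illustrative, and the random projection stands in for the encoder's many learned layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# A dummy 32x32 RGB image with values in [0, 1]; real encoders
# typically use larger inputs such as 224x224.
image = rng.random((32, 32, 3))

patch = 8  # 8x8 patches -> a 4x4 grid = 16 patches
patches = (image
           .reshape(32 // patch, patch, 32 // patch, patch, 3)
           .transpose(0, 2, 1, 3, 4)        # group pixels by patch
           .reshape(-1, patch * patch * 3))  # flatten: (16, 192)

# Linear projection of each flattened patch into a feature vector,
# standing in for the learned layers of a ViT-style vision encoder.
d_vision = 64
W = rng.normal(size=(patch * patch * 3, d_vision))
feature_vectors = patches @ W  # (16, 64): one dense embedding per patch
print(feature_vectors.shape)
```

Each patch ends up as one dense feature vector, which is the "bunch of feature vectors" described above.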
So our images are now vectors, but these vectors can't be fed into a large language model directly either.
That's why we need an additional stage which is a projector.
Now this component maps the continuous image embeddings into a token-based format.
So this gives us image tokens that align with the text representation used by the LLM.
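The projector can be sketched as a single learned linear map between the two embedding spaces. The dimensions are hypothetical, and real VLMs may use an MLP or a cross-attention module here instead of one matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

d_vision, d_model = 64, 8  # illustrative embedding sizes
feature_vectors = rng.normal(size=(16, d_vision))  # from the vision encoder

# A minimal projector: one learned affine map from the vision encoder's
# embedding space into the LLM's token-embedding space.
W_proj = rng.normal(size=(d_vision, d_model))
b_proj = np.zeros(d_model)
image_tokens = feature_vectors @ W_proj + b_proj  # (16, 8)
print(image_tokens.shape)  # now the same width as the LLM's text embeddings
```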
So at this stage, we now have image tokens and we have text tokens,
both existing in the same latent space, and these are then fed into the large
language model, which processes them together using its attention mechanisms,
analyzing how different tokens relate to one another regardless of whether they originate from
text or from an image,
which gives us a text-based response generated by the model
whether that's a caption or an explanation of what's in an image
or an answer to a question that requires interpreting both the visual and the textual content.
In essence, a vision language model has extended an LLM
by introducing a multi-modal tokenization pipeline,
one that allows images to be represented in a way that text-based transformers can process natively.
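Putting the pieces together: a toy version of the joint processing step concatenates image and text tokens and runs self-attention over the combined sequence. This is an illustration with no learned weights, not a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
image_tokens = rng.normal(size=(16, d_model))  # from the projector
text_tokens = rng.normal(size=(6, d_model))    # from the text embedder

# Concatenate both modalities into one sequence; from here on the
# transformer treats every position the same way.
sequence = np.concatenate([image_tokens, text_tokens])  # (22, 8)

def self_attention(x):
    """Single-head self-attention with identity projections, for illustration."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ x

out = self_attention(sequence)
print(out.shape)  # (22, 8): every text position attended to every image position
```

Because the two modalities share one sequence, attention lets a question token draw on image tokens exactly as it would on other text tokens.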
So all this sounds great, but VLMs are not without their challenges.
So, for example, consider tokenization bottlenecks.
Text tokenization is efficient because models break down words into sub-tokens but images,
well, they lack a natural token structure.
They must first be encoded and an encoded image
often requires quite a lot of tokens which increases the memory usage and it can slow down inference.
Now some models do incorporate optimization strategies like a Perceiver Resampler,
but the fact remains that processing images remains more computationally intensive than processing text alone.
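A back-of-envelope count shows the scale of the problem. The patch and latent sizes below are illustrative, not taken from any particular model.

```python
# How many tokens does one image cost? (illustrative numbers;
# exact figures vary by model and resolution)
image_size, patch_size = 336, 14
image_token_count = (image_size // patch_size) ** 2
print(image_token_count)  # 576 -- often far more tokens than the text prompt

# A Perceiver-style resampler cross-attends those tokens into a small
# fixed set of latents, cutting memory use and speeding up inference.
num_latents = 64
print(image_token_count / num_latents)  # 9.0x fewer tokens per image
```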
Now, similar to traditional LLMs, vision language models
can produce hallucinations, generating responses that sound plausible but are factually incorrect.
And this happens because VLMs don't really see images as humans do; they are
learning statistical associations about them.
That can lead to incorrect assumptions about objects in an image.
For instance, a VLM trained predominantly on internet-scale datasets may misinterpret medical images
if it hasn't been exposed to sufficient labeled medical data.
And this issue extends to graphs and charts as well, because
even models fine-tuned on specific datasets may still struggle with accuracy when they're interpreting complex visual data.
And then, another issue common to lots of AI is bias in training data.
So VLMs are often trained on massive datasets scraped from the web,
meaning they inherit the biases present in those datasets.
So for example, models trained on Western-centric datasets may misinterpret cultural artifacts from non-Western contexts.
So addressing these biases basically means being careful about dataset curation.
So that's vision language models.
With them, LLMs do more than read.
They can see, they can interpret, and they can reason about the world in ways a bit more like we do visually.