
Vision Language Models Enable Image Understanding

Key Points

  • Standard large language models can only ingest text, leaving visual information in PDFs, images, or handwritten notes inaccessible.
  • Vision‑language models (VLMs) are multimodal, accepting both text and image inputs and outputting text‑based responses.
  • VLMs enable tasks such as visual question answering, image captioning, and extracting/summarizing information from scanned documents, receipts, and data‑heavy visuals like graphs.
  • They combine a conventional LLM that tokenizes and processes textual prompts with an image encoder that converts visual data into token‑like representations, allowing the two modalities to be jointly interpreted.
  • By merging these representations, VLMs can reason over combined text‑image contexts, e.g., describing a scene, interpreting chart trends, or summarizing visual documents.
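The pipeline in the last two bullets can be sketched end to end with toy components. Everything below is illustrative: the dimensions, the random "encoder" output, and the linear projector are stand-ins for a real model's learned weights, chosen only to show how the pieces connect.

```python
import numpy as np

rng = np.random.default_rng(0)

D_TEXT = 64    # LLM embedding width (illustrative)
D_VISION = 32  # vision-encoder feature width (illustrative)

def embed_text_tokens(token_ids, table):
    """Look up one embedding row per text token id."""
    return table[token_ids]

def encode_image(image, num_patches=9):
    """Stand-in vision encoder: one feature vector per image patch.
    A real encoder (e.g. a ViT) would compute these from pixels."""
    return rng.normal(size=(num_patches, D_VISION))

def project(features, w):
    """Projector: map vision features into the LLM's embedding space."""
    return features @ w  # (num_patches, D_VISION) @ (D_VISION, D_TEXT)

embedding_table = rng.normal(size=(1000, D_TEXT))
projector_w = rng.normal(size=(D_VISION, D_TEXT))

text_embeds = embed_text_tokens(np.array([5, 17, 42]), embedding_table)
image_embeds = project(encode_image(image=None), projector_w)

# Image tokens and text tokens now share one sequence in one latent space,
# ready for the LLM's attention layers to process jointly.
sequence = np.concatenate([image_embeds, text_embeds], axis=0)
print(sequence.shape)  # -> (12, 64): 9 image tokens + 3 text tokens
```

The only step a real VLM adds to a plain LLM is the encode-and-project path on the image side; once both modalities are rows of the same width, the LLM treats them uniformly.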

Full Transcript

**Source:** [https://www.youtube.com/watch?v=lOD_EE96jhM](https://www.youtube.com/watch?v=lOD_EE96jhM)
**Duration:** 00:09:35

## Sections

- [00:00:00](https://www.youtube.com/watch?v=lOD_EE96jhM&t=0s) **Vision-Language Models Enable Image Understanding** - Traditional LLMs cannot process visual content; multimodal vision-language models convert images into a form the model can interpret, enabling tasks like visual question answering.
- [00:03:06](https://www.youtube.com/watch?v=lOD_EE96jhM&t=186s) **Merging Vision and Language Models** - Large language models process text via tokenization, while vision encoders translate images into high-dimensional numerical embeddings; the two are combined so the multimodal system can understand both text and visual inputs.
- [00:06:20](https://www.youtube.com/watch?v=lOD_EE96jhM&t=380s) **Integrating Image Tokens into LLMs** - Visual data is converted into tokens that share a latent space with text, enabling a large language model to jointly attend to both modalities; challenges include tokenization bottlenecks, high memory costs, and hallucination risks.
- [00:09:27](https://www.youtube.com/watch?v=lOD_EE96jhM&t=567s) **Visual Reasoning in AI** - AI systems can now perceive, interpret, and reason about visual information in ways that more closely resemble human visual cognition.

## Full Transcript
Large language models have a problem. We know that they can process a text document, like a PDF maybe, and respond to queries about it. They do this by encoding both the document and any prompt we provide as tokens, feeding those into the LLM, where they're processed through attention mechanisms, and then generating a text-based response on the other side.

But what happens if that document, maybe that PDF, contains some images? Standard LLMs can't read images the way they process text. A picture, a graph, or a handwritten note might contain valuable information, but without a way to convert that visual data into a form the LLM understands, it's inaccessible. This is where vision language models, or VLMs, come in.

Vision language models are multimodal. That means they can take in text, but they can also take in image files, interpret their meaning, and generate a text-based output in response.

So what sort of tasks can we perform with a VLM? One of those tasks is called VQA, or visual question answering. That's just a fancy way of saying you can show a VLM a picture and have it analyze it. Maybe we show it a photo of a busy city street and ask, "What's happening here?" The model doesn't just see pixels; it recognizes objects, people, and context, and it could tell you, for example, that there's a car waiting at a red light.

There's also the ability to provide captioning for images. Here we have the model generate a natural language description of an image. If we showed it a picture of a dog chasing a ball, it might say, "That's a golden retriever playing fetch in a park."

But VLMs aren't just about photographs. They're also very useful for document understanding. Let's say you've uploaded a scanned receipt.
The model can extract the text in that receipt, organize it, and even summarize what it says. And what about data-heavy visuals that we might have in a PDF? There we can use graph analysis to understand them as well. We could hand it a sales report and ask, "What's the trend going on here?" and the model can extract the data in the graphs and interpret it.

So vision language models don't just process images and text separately; they merge them. But how do they actually do that? Let's break it down, starting with the part that's already familiar: the large language model, the LLM.

The LLM takes in a text prompt; this is us sending a message to the model. It converts the words in that prompt into text tokens, discrete numerical representations of language. These text tokens are then processed through the model's internal mechanisms, primarily its attention layers, to determine the relationships, the contextual meaning, and the patterns between words. The result of all this is a text output, whether that's answering a question, summarizing a document, or completing a sentence.

But vision language models introduce something new: an image input. Now we have to process a photo, a graph, or anything else we want the model to understand, and there is a challenge here. Large language models don't work with raw images; they only work with text tokens. So before the LLM can process an image, it first needs to be converted into a format the LLM can understand, and that's where a vision encoder comes into play.
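The text-tokenization step described above can be illustrated with a deliberately tiny word-level tokenizer. Real LLMs use learned sub-word vocabularies (BPE and similar); the vocabulary here is invented purely for the example.

```python
# Toy word-level tokenizer: maps each known word to a discrete integer id.
# Real LLM tokenizers use learned sub-word units; this vocabulary is
# invented purely for illustration.
VOCAB = {"<unk>": 0, "what": 1, "is": 2, "in": 3, "this": 4, "image": 5, "?": 6}
INV = {i: w for w, i in VOCAB.items()}

def tokenize(text):
    """Convert a prompt into discrete numerical token ids."""
    return [VOCAB.get(w, VOCAB["<unk>"]) for w in text.lower().replace("?", " ?").split()]

def detokenize(ids):
    """Map token ids back to words."""
    return " ".join(INV[i] for i in ids)

ids = tokenize("What is in this image?")
print(ids)              # -> [1, 2, 3, 4, 5, 6]
print(detokenize(ids))  # -> "what is in this image ?"
```

These integer ids are what the embedding table and attention layers actually consume; the whole point of the vision encoder and projector that follow is to manufacture comparable token-like inputs from pixels.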
Unlike an LLM, which tokenizes words, the vision encoder processes images as high-dimensional numerical data. It doesn't see images the way we do; instead it extracts patterns, edges, textures, and spatial relationships and converts them into something called a feature vector, or in fact a bunch of feature vectors. That's a structured representation of the image's contents. Each feature vector is essentially a dense embedding capturing the most relevant information from the image while discarding the unnecessary details, similar to how an LLM converts text into word embeddings.

So our images are now vectors, but these vectors can't be fed into a large language model directly either. That's why we need an additional stage: a projector. This component maps the continuous image embeddings into a token-based format, giving us image tokens that align with the text representation used by the LLM.

At this stage, we have image tokens and text tokens, both existing in the same latent space. These are fed into the large language model, which processes them together using its attention mechanisms, analyzing how different tokens relate to one another regardless of whether they originate from text or from an image. That gives us a text-based response generated by the model, whether that's a caption, an explanation of what's in an image, or an answer to a question that requires interpreting both the visual and the textual content.

In essence, a vision language model extends an LLM by introducing a multimodal tokenization pipeline, one that allows images to be represented in a way that text-based transformers can process natively.

All this sounds great, but VLMs are not without their challenges. For example, consider...
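A minimal numeric sketch of the projector stage, and of why the shared latent space matters: once image features are projected to the LLM's width, a single attention operation can mix both modalities with no notion of where each token came from. All sizes and the random features below are illustrative stand-ins for learned values.

```python
import numpy as np

rng = np.random.default_rng(1)

d_vision, d_model = 16, 32  # illustrative widths

# Vision-encoder output: one feature vector per image patch.
patch_features = rng.normal(size=(4, d_vision))

# Projector: a learned linear map in real systems; random weights here.
W_proj = rng.normal(size=(d_vision, d_model))
image_tokens = patch_features @ W_proj           # now LLM-width

text_tokens = rng.normal(size=(3, d_model))      # stand-in text embeddings
x = np.concatenate([image_tokens, text_tokens])  # one joint sequence

# Self-attention over the mixed sequence: scores are computed between
# every pair of positions, regardless of modality.
scores = x @ x.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(weights.shape)  # -> (7, 7): every token attends to every token
```

Each row of `weights` is a valid attention distribution over all seven positions, image and text alike, which is exactly what "existing in the same latent space" buys you.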
Tokenization bottlenecks. Text tokenization is efficient because models break words down into sub-tokens, but images lack a natural token structure. They must first be encoded, and an encoded image often requires quite a lot of tokens, which increases memory usage and can slow down inference. Some models do incorporate optimization strategies like a perceiver resampler, but the fact remains that processing images is more computationally intensive than processing text alone.

Similar to traditional LLMs, vision language models can produce hallucinations, generating responses that sound plausible but are factually incorrect. This happens because VLMs don't really see images as humans do; they are learning statistical associations about them. That can lead to incorrect assumptions about objects in an image. For instance, a VLM trained predominantly on internet-scale datasets may misinterpret medical images if it hasn't been exposed to sufficient labeled medical data. This issue extends to graphs and charts as well, because even models fine-tuned on specific datasets may still struggle with accuracy when interpreting complex visual data.

Another issue common to lots of AI is bias in training data. VLMs are often trained on massive datasets scraped from the web, meaning they inherit the biases present in those datasets. For example, models trained on Western-centric datasets may misinterpret cultural artifacts from non-Western contexts. Addressing these biases basically means being careful about dataset curation.

So that's vision language models. With them, LLMs do more than read. They can see, they can interpret, and they can reason about the world in ways a bit more like we do visually.
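The tokenization bottleneck in the transcript is easy to quantify. With a ViT-style encoder that splits a square image into fixed-size patches, the patch count grows quadratically with resolution. The 336-pixel and 14-pixel figures below are a common illustrative configuration, not a claim about any specific model:

```python
def image_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a square image yields before any resampling."""
    per_side = image_size // patch_size
    return per_side * per_side

# e.g. a 336x336 image cut into 14x14-pixel patches:
n = image_token_count(336, 14)
print(n)  # -> 576 tokens for a single image

# Versus a short text prompt of perhaps 10-20 tokens, a handful of images
# can dominate the context window, raising memory use and slowing inference.
# Perceiver-resampler-style modules shrink this cost by mapping the patch
# tokens down to a small fixed set of latent tokens.
```

Doubling the resolution quadruples the token count, which is why image resolution, patch size, and resampling strategy are central design choices in VLMs.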