Zero-Shot Learning: Learning Without Labels

Key Points

  • Humans can recognize objects (e.g., a pen) by matching them to known attributes, enabling us to distinguish roughly 30,000 categorical concepts without seeing every instance.
  • Traditional supervised deep‑learning models require large, labeled datasets for each category, making it costly and computationally intensive to achieve human‑level breadth across thousands of classes.
  • N‑shot learning (few‑shot, one‑shot) mitigates this by leveraging transfer and meta‑learning to generalize from very few examples, but it still depends on at least one labeled instance per new class.
  • Zero‑shot learning eliminates the need for any labeled examples by using semantic knowledge (e.g., textual descriptions or attribute lists) to infer new categories, mirroring how a child can identify a “bird” after reading a description rather than seeing a picture.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=pVpr4GYLzAo](https://www.youtube.com/watch?v=pVpr4GYLzAo)
**Duration:** 00:08:54

## Sections

- [00:00:00](https://www.youtube.com/watch?v=pVpr4GYLzAo&t=0s) **From Supervised to Few‑Shot Learning** - The passage contrasts human ability to recognize tens of thousands of objects with the data‑intensive demands of supervised deep learning, motivating N‑shot (few‑shot) approaches that use transfer and meta‑learning to generalize to many categories with minimal training.
- [00:03:08](https://www.youtube.com/watch?v=pVpr4GYLzAo&t=188s) **Attribute-Based Zero‑Shot Learning Explained** - The speaker illustrates how zero‑shot learning uses descriptive attributes (e.g., color, shape, wings) to enable a model to recognize unseen classes like birds or pens by inferring their labels from learned feature concepts.
- [00:06:14](https://www.youtube.com/watch?v=pVpr4GYLzAo&t=374s) **Embedding and Generative Zero‑Shot Methods** - The passage explains how joint embedding spaces align multimodal vectors for embedding‑based zero‑shot learning and how generative approaches (including large language models and GANs) create synthetic examples to recognize unseen classes.

## Full Transcript
0:00 If I asked you to identify this object in my hand, you'd likely say it's a pen. Even if you've never seen this specific pen, which is a special marker for light boards, it shares enough attributes with other pens for you to recognize it. In fact, you and most humans can recognize approximately 30,000 individually distinguishable object categories.

0:28 Now, to train a deep learning model to recognize objects, we often turn to something called supervised learning, a form of deep learning that requires many labeled examples. Models learn by making predictions on a bunch of labels in a data set. These labels provide the correct answers, or the ground truth, for each example. The model adjusts its weights to minimize the difference between its predictions and the ground truth, and this process needs a whole bunch of labeled samples for many rounds of training.

1:13 So if we want AI models to remotely approach human capabilities using supervised learning, they must be explicitly trained on labeled data for something like 30,000 object categories. That's a lot of time, cost, and compute. So the need for machine learning models to be able to generalize quickly to a large number of semantic categories with minimal training overhead has given rise to something called N-shot learning.

1:47 That's a subset of machine learning that includes a number of categories. So we have few-shot learning, which uses transfer learning and meta-learning methods to train models to recognize new classes. Then we have one-shot learning, which uses just a single labeled example to learn.

2:13 But what if we don't want to use any labeled examples at all? Well, that is the focus of this video, and that is zero-shot learning, where instead of providing labeled examples, the model is asked to make predictions on unseen classes post-training.
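The supervised baseline described earlier, predict, compare to the ground truth, adjust the weights, can be sketched in a few lines to contrast with the zero-shot setting. The data, model (plain linear regression), and learning rate below are invented for illustration:

```python
import numpy as np

# Toy supervised learning: fit weights so predictions match labeled ground truth.
# The data set and learning rate are made up for this sketch.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, 2.0, 0.0])   # ground-truth labels for each example
w = np.zeros(2)                      # model weights, adjusted during training

for _ in range(500):                 # many rounds of training
    predictions = X @ w
    error = predictions - y          # difference from the ground truth
    w -= 0.1 * (X.T @ error) / len(y)  # adjust weights to shrink that difference

print(np.round(w, 2))  # → [1. 1.]
```

The point of the sketch is the cost it implies: every class the model should recognize needs its own labeled rows in `y`, which is exactly what zero-shot learning tries to avoid.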
2:31 Zero-shot learning has become a notable area of research in data science, particularly in the fields of computer vision and natural language processing. So how does it work without explicit annotations to guide it? Zero-shot learning requires a more fundamental understanding of the label's meaning, because, after all, that's how we humans do it.

2:51 So imagine a child wants to learn what a bird looks like. In a process similar to few-shot learning, the child would learn by looking at images labeled "bird" in a book of animal pictures. Moving forward, she'll recognize a bird because it resembles the bird images she's already seen.

3:08 But in a zero-shot learning scenario, no such labeled examples are available. So instead, the child might read a written story about birds and learn that they're small or medium-sized animals with feathers, beaks, and wings that can fly through the air. She should then be able to recognize a bird in the real world, even though she's never seen one before, because she has learned the concept of a bird. Just as even if you've never seen this pen before, you can still classify it because of its cylindrical shape, its tip, and the colored markings it leaves when it comes in contact with the glass in front of me. And yes, there is actually glass in front of me. I'm not writing into thin air.

3:52 Now, what I've described is one way to implement zero-shot learning, and that method is called attribute-based. In an attribute-based zero-shot learning method, we train on labeled features like color, shape, and other characteristics. Even without seeing the target classes during training, the model infers labels based on similar attributes. So for example, a model can learn about different types of animals.
4:31 So let's say it starts off by learning about stripes, and it learns about stripes from looking at images of tigers and zebras. Then it can learn about the color yellow from images, let's say, of canaries. And then it can learn about flying insects from, let's say, just looking at, well, pictures of flies. Now the model can perform zero-shot classification of a new animal, let's say bees, despite the absence of bee images in the training set, because it can understand them as a combination of learned features: striped plus yellow plus flying insect might equal bee.

5:19 Now, attribute-based methods are quite versatile, but they do have some drawbacks. They rely on the assumption that every class can be described with a single vector of attributes, which isn't always the case. So, for example, a Tesla Cybertruck and a Volkswagen Beetle are both cars, but they differ greatly in shape, size, materials, and features.

5:42 Now, many zero-shot learning methods use an alternative method, and that is known as embedding: an embedding-based approach to zero-shot learning. This works by representing both classes and samples as vector embeddings that reflect their features and relationships. Classification is determined by measuring similarity between these embeddings using metrics like cosine similarity or Euclidean distance, similar to k-nearest neighbors algorithms. And because embedding-based methods typically process inputs from multiple modalities, like word embeddings that describe a class label and image embeddings of a photograph that might belong to the same class, they require a way to compare between embeddings of different data types, and that's where a joint embedding space can help normalize those vector embeddings.
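The attribute and embedding ideas above can be combined in a minimal sketch: each class is described by a hand-written attribute vector (striped, yellow, flying insect), and an input is assigned the class whose vector is most similar under cosine similarity. All attribute values below are invented for illustration, not taken from any real model:

```python
import numpy as np

# Hypothetical attribute vectors: [striped, yellow, can_fly, is_insect].
# "bee" is the unseen class: no training images, only this description.
class_attributes = {
    "tiger":  np.array([1.0, 0.2, 0.0, 0.0]),
    "canary": np.array([0.0, 1.0, 1.0, 0.0]),
    "fly":    np.array([0.0, 0.0, 1.0, 1.0]),
    "bee":    np.array([1.0, 1.0, 1.0, 1.0]),
}

def cosine_similarity(a, b):
    # Angle-based similarity between two embedding vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def zero_shot_classify(sample_embedding, classes):
    # Pick the class whose attribute vector lies closest to the
    # sample's predicted attributes in the shared embedding space.
    return max(classes, key=lambda c: cosine_similarity(sample_embedding, classes[c]))

# Suppose an attribute predictor reports a striped, yellow, flying insect:
predicted = np.array([0.9, 0.8, 0.9, 0.9])
print(zero_shot_classify(predicted, class_attributes))  # → bee
```

In a real system the sample embedding would come from an image encoder and the class embeddings from text or attribute annotations, aligned in a joint embedding space; the nearest-vector decision step, though, looks just like this.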
6:35 Now, another method of zero-shot learning relates to generative-based methods. If we think of the first example of generative-based, we have to think of LLMs, that's large language models. Large language models have a natural ability to perform zero-shot learning based on their ability to fundamentally understand the meaning of the words used to name data classes. LLMs are pre-trained through self-supervised learning on a massive corpus of text that may contain incidental references to knowledge about unseen data classes, which the LLM can learn to make sense of.

7:20 And then, beyond just LLMs, another zero-shot generative-based approach is my favorite type of neural network, and that is a GAN. GAN is an acronym for Generative Adversarial Network, and it actually consists of two competing neural networks jointly trained in an adversarial zero-sum game. There's a generator component that uses semantic attributes and Gaussian noise to synthesize samples, and there is a discriminator network as well, which determines whether samples are real or fake, fake meaning they were synthesized by the generator. Now, feedback from the discriminator is used to train the generator until the discriminator can no longer distinguish between the real and the synthetic samples. That's perfect for generating synthetic data that mimics the attributes of unseen classes, thereby enabling models to learn from these synthesized examples as if they were labeled.

8:23 So that is zero-shot learning. It's something you do effortlessly every time you see a new object, and it's something deep learning models can be taught to do as well. Zero-shot learning shows AI's potential to generalize from minimal information, saving time, compute, and the hassle of labeling data.
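To make the generative idea concrete, here is a toy sketch of its final step. It is not a full GAN: the trained generator is faked with "semantic attributes plus Gaussian noise," and a simple nearest-centroid classifier is then fit on the synthetic samples as if they were labeled, which is how the unseen class becomes classifiable. All attribute values and noise scales are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical semantic attributes [striped, yellow, can_fly, is_insect].
# "bee" is unseen: we have its description but no real samples.
seen_attributes = {"tiger":  np.array([1.0, 0.2, 0.0, 0.0]),
                   "canary": np.array([0.0, 1.0, 1.0, 0.0])}
unseen_attributes = {"bee":  np.array([1.0, 1.0, 1.0, 1.0])}

def generate_samples(attributes, n=50, noise=0.1):
    # Stand-in for a trained GAN generator: attributes + Gaussian noise.
    return attributes + noise * rng.standard_normal((n, attributes.size))

# Synthesize a labeled training set, treating synthetic samples as if labeled.
train = {label: generate_samples(attrs)
         for label, attrs in {**seen_attributes, **unseen_attributes}.items()}

# Fit a nearest-centroid classifier on those samples.
centroids = {label: samples.mean(axis=0) for label, samples in train.items()}

def classify(x):
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

print(classify(np.array([0.95, 1.05, 0.9, 1.1])))  # → bee
```

In an actual attribute-conditioned GAN the generator would be trained adversarially against a discriminator on the seen classes first; the sketch only shows why synthesized samples for an unseen class are useful once such a generator exists.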
8:44 If you like this video and want to see more like it, please like and subscribe. If you have any questions or want to share your thoughts about this topic, please leave a comment below.