
Semi-Supervised Learning Explained with Cats

Key Points

  • Supervised learning trains a model on a fully labeled dataset (e.g., cat vs. dog images) by iteratively adjusting weights to minimize prediction errors.
  • Creating these labels—especially for tasks like image segmentation, genetic sequencing, or protein classification—is time‑consuming, labor‑intensive, and often requires specialized expertise.
  • Semi‑supervised learning addresses this bottleneck by leveraging a small amount of labeled data together with abundant unlabeled data to improve model performance.
  • Using only limited labeled data can lead to poor generalization, so semi‑supervised approaches help mitigate overfitting and make better use of available information.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=C3Lr6Waw66g](https://www.youtube.com/watch?v=C3Lr6Waw66g)
**Duration:** 00:10:02

## Sections

- [00:00:00](https://www.youtube.com/watch?v=C3Lr6Waw66g&t=0s) **Explaining Semi‑Supervised Learning with Cats & Dogs** - The speaker introduces supervised learning by detailing how a model classifies labeled cat and dog images, then highlights the reliance on labeled data as a prelude to the concept of semi‑supervised learning.
- [00:03:10](https://www.youtube.com/watch?v=C3Lr6Waw66g&t=190s) **Combating Overfitting with Unlabeled Data** - The speaker explains that relying solely on a small labeled dataset causes models to overfit, learning spurious cues like indoor vs. outdoor settings, and that semi‑supervised learning mitigates this by augmenting training with many unlabeled examples to improve generalization.
- [00:06:23](https://www.youtube.com/watch?v=C3Lr6Waw66g&t=383s) **Iterative Pseudo-Labeling and Clustering Techniques** - The speaker explains three semi‑supervised strategies, iterative retraining with pseudo‑labels, autoencoder-based feature extraction, and clustering‑based pseudo‑label assignment, to improve model performance with limited labeled data.
- [00:09:39](https://www.youtube.com/watch?v=C3Lr6Waw66g&t=579s) **Semi‑Supervised Learning Explained** - The speaker describes semi‑supervised learning as a method that blends labeled and unlabeled data to train a better‑fitting model, using the analogy of raising a pet that requires both structure and freedom.

## Full Transcript
0:00 What is semi-supervised learning? 0:03 Well, let me give you an example. 0:05 Consider building an AI model that can classify pictures of cats and dogs. 0:09 If you give the model a picture of an animal, it will tell you whether that picture shows a cat or a dog. 0:17 Now, we can build a model like that using a process called supervised learning; not semi-supervised, just supervised learning. 0:25 This involves training the model on a dataset, 0:32 and this dataset has images that are labeled as either cat or dog. 0:38 So for instance, we might have 100 images, with half of them labeled as cat and the other half labeled as dog. 0:48 We also have an AI model that is going to do the work here, 0:55 and the model learns from these labeled examples by 0:58 identifying patterns and features that differentiate these animals, 1:02 perhaps things like ear shape, which is generally more pointy for cats, 1:07 or body structure, which is generally more bulky for dogs. 1:11 Then during training, the model makes predictions, 1:15 evaluates the accuracy of those predictions through something called a loss function, which asks: was I right? 1:20 Was it really a cat or a dog? 1:22 And it makes adjustments using techniques such as gradient 1:26 descent that update the model weights to improve future predictions. 1:30 Now, that's all well and good, 1:32 but as we have established, supervised learning needs a labeled dataset. 1:38 This dataset here is full of labels, and those form the ground truth from which the model trains. 1:47 Now, when we think about what a label actually is, it could be as simple as a classification label 1:56 that just says that, yeah, this picture contains a cat.
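The training loop just described (make predictions, score them with a loss function, then nudge the weights with gradient descent) can be sketched in a few lines. This is a minimal illustration, not the speaker's code: the two features, the synthetic data, and the learning rate are all invented for the example.

```python
# Logistic regression on two hypothetical features ("ear pointiness",
# "body bulk"), trained by gradient descent on 100 labeled examples.
import numpy as np

rng = np.random.default_rng(0)

# 50 cats (label 0) and 50 dogs (label 1), as in the transcript's example.
cats = rng.normal([0.8, 0.3], 0.1, size=(50, 2))   # pointier ears, slimmer
dogs = rng.normal([0.3, 0.8], 0.1, size=(50, 2))   # rounder ears, bulkier
X = np.vstack([cats, dogs])
y = np.array([0] * 50 + [1] * 50)

w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))      # predicted P(dog)
    grad_w = X.T @ (p - y) / len(y)         # gradient of cross-entropy loss
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w                       # gradient-descent weight update
    b -= 0.5 * grad_b

p = 1 / (1 + np.exp(-(X @ w + b)))
acc = np.mean((p > 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```

Each pass asks "was I right?" via the loss gradient and moves the weights a small step in the direction that reduces the error, exactly the cycle the transcript describes.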
1:59 And yeah, this picture contains a dog. 2:01 But it could also be something a bit more complicated, such as an image segmentation label, 2:08 which assigns labels to individual pixel boundaries 2:13 in an image, indicating precisely where in the image the object can be found. 2:18 Now, this is manual work that somebody has to perform, and I don't know about you, 2:22 but going through a dataset of hundreds of images of pets and then 2:27 designing and assigning labels to them is not my idea of a good time. 2:32 Labeling images of cats and dogs is time-consuming and tedious, 2:36 but what about more specialized use cases like genetic sequencing or protein classification? 2:42 That sort of data annotation is not only extremely time-consuming, but it also requires very specific domain expertise. 2:51 There are just fewer people who can do it. 2:54 So enter the world of semi-supervised learning to help us out. 3:00 Semi-supervised learning offers a way to extract benefit from a scarce amount of 3:05 labeled data while making use of relatively abundant unlabeled data. 3:10 Now, before we get into the how, as in how semi-supervised learning works, let's first address the why. 3:18 Why not just build your model using whatever labeled data is currently available? 3:23 Well, the answer is that using a limited amount of labeled data introduces the possibility of something called overfitting. 3:36 This happens when the model performs well on the training dataset but struggles to generalize to new, unseen images. 3:45 So for instance, suppose that in the training dataset, 3:49 most of the images of cats are taken indoors and most of the images of dogs are taken outdoors. 3:59 Well, the model might mistakenly learn to associate outdoor settings 4:03 with dogs rather than focusing on more meaningful features, and as a result, it could incorrectly classify 4:10 any image taken outdoors as showing a dog, even if it contains a cat.
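The indoor/outdoor failure mode above can be reproduced on toy data. In this hedged sketch the feature layout is invented: each "image" is just `[ear_pointiness, outdoor_flag]`, with the outdoor flag perfectly correlated with the dog label in the tiny training set.

```python
# A model fit to a tiny labeled set where every dog is outdoors and every
# cat is indoors latches onto the spurious "outdoor" cue.
from sklearn.tree import DecisionTreeClassifier

X_train = [
    [0.9, 0], [0.8, 0], [0.6, 0], [0.4, 0],   # cats, all photographed indoors
    [0.5, 1], [0.3, 1], [0.2, 1], [0.1, 1],   # dogs, all photographed outdoors
]
y_train = ["cat"] * 4 + ["dog"] * 4

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The outdoor flag separates this training set perfectly, while ear
# pointiness overlaps slightly, so the tree splits on the spurious cue.
# A pointy-eared cat photographed outdoors is then misclassified.
print(clf.predict([[0.9, 1]]))   # → ['dog']
```

The model scores 100% on its training set yet fails on the first outdoor cat it sees, which is exactly the generalization gap the transcript is warning about.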
4:16 In general, the solution to overfitting is to increase the size of the training dataset, 4:23 and that is where semi-supervised learning comes in: by incorporating unlabeled data into the training process, 4:30 we can effectively expand our dataset. 4:35 So, for example, instead of just training the model on a dataset that contains 100 4:40 labeled examples, we can also add in some unlabeled examples as well. 4:46 Maybe we could add 1,000 unlabeled examples into this dataset. 4:53 That gives the model more context to learn from without requiring additional labeled data. 5:01 So that's the why. 5:02 Now let's get into the how. 5:04 There are many semi-supervised learning techniques, and we'll just narrowly focus on a few of them. 5:10 So first up is something called the wrapper method. 5:18 What is the wrapper method? 5:20 Well, here's what it does. 5:22 We start with a base model trained on a labeled dataset, 5:27 and then we use this trained model to predict labels for 5:32 the unlabeled dataset; that is, images that contain, let's say, cats and dogs, 5:36 but where the individual images do not carry an actual label. 5:40 Now, those predicted labels have a name: 5:44 they are called pseudo-labels. 5:49 A pseudo-label is a label assigned by this method, and they are typically probabilistic rather than deterministic, 5:59 meaning that the pseudo-label comes with a probability of how confident the model is in its labeling. 6:05 So it might say, for example, for a given image, there's an 85% chance that this one is a dog. 6:14 Now, pseudo-labels with high confidence are then combined with the 6:18 original labeled dataset and treated as if they were actual ground-truth labels. 6:24 The model is then retrained on this new dataset, which is of course now a bit larger 6:28 and includes both the labeled and the pseudo-labeled data.
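One round of the wrapper method (train, pseudo-label, filter by confidence, retrain) can be sketched with scikit-learn. The 85% threshold echoes the speaker's example, and the synthetic 2-D points standing in for 100 labeled and 1,000 unlabeled images are invented for illustration.

```python
# Self-training sketch: a base model pseudo-labels an unlabeled pool,
# and only high-confidence pseudo-labels join the training set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Small labeled set (100 points) and a larger unlabeled pool (1,000 points).
X_lab = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_lab = np.array([0] * 50 + [1] * 50)          # 0 = cat, 1 = dog
X_unlab = np.vstack([rng.normal(-2, 1, (500, 2)),
                     rng.normal(2, 1, (500, 2))])

# 1. Train a base model on the labeled data only.
model = LogisticRegression().fit(X_lab, y_lab)

# 2. Predict probabilistic pseudo-labels for the unlabeled pool.
proba = model.predict_proba(X_unlab)
confidence = proba.max(axis=1)      # e.g. "85% chance this one is a dog"
pseudo = proba.argmax(axis=1)

# 3. Keep only high-confidence pseudo-labels, treat them as ground truth,
#    and retrain on the enlarged dataset.
keep = confidence >= 0.85
X_big = np.vstack([X_lab, X_unlab[keep]])
y_big = np.concatenate([y_lab, pseudo[keep]])
model = LogisticRegression().fit(X_big, y_big)
print(f"kept {keep.sum()} of {len(X_unlab)} pseudo-labels")
```

Only one cycle is shown here; in practice steps 2 and 3 repeat, which is exactly the iteration the transcript turns to next.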
6:33 And this process can be repeated iteratively, with each iteration improving the quality of the pseudo-labels 6:39 as the model becomes better at distinguishing between the images. 6:43 So that's the wrapper method. 6:45 Now, another approach is called unsupervised pre-processing, 6:55 and that uses a model called an autoencoder. 7:02 What the autoencoder does is learn to represent each image 7:07 in a more compact and meaningful way by capturing essential features, so things like edges and shapes and textures. 7:15 When that's applied to the unlabeled images, it extracts these 7:19 key features, which are then used to train a supervised model more effectively, 7:24 helping it better generalize even with limited labeled data. 7:30 Another method that is commonly used relates to clustering: 7:34 clustering-based methods. 7:39 These apply the cluster assumption, which is essentially that similar data points are likely to belong to the same class. 7:48 So a clustering algorithm, something like k-means, can group all data points, both labeled 7:54 and unlabeled, into clusters based on their similarity. 7:58 So, for example, if we do that here, 8:01 if we've got a cluster and we've got some labeled examples that kind of fall here on the chart, 8:07 and then we have some unlabeled examples which fall around here as well, 8:13 well, we can pseudo-label the unlabeled images in that cluster too. 8:19 So if the labeled images were cats, we could say those unlabeled ones that fall in the same area are cats as well. 8:26 And then finally, the method we want to talk about here is called active learning. 8:32 What active learning does is bring humans into the loop. 8:37 So samples with low-confidence pseudo-labels, meaning the model wasn't really sure how to classify them, 8:43 can be referred to human annotators for labeling. 8:47 So human labelers are only working on images that the model is unable to reliably classify itself.
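The clustering idea above can be sketched concretely: run k-means over labeled and unlabeled points together, then give each unlabeled point the majority label of the labeled points sharing its cluster. The two well-separated synthetic blobs standing in for cat and dog images are an assumption of this sketch, not part of the method.

```python
# Clustering-based pseudo-labeling under the cluster assumption:
# points in the same cluster likely belong to the same class.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X_lab = np.vstack([rng.normal(-3, 0.5, (5, 2)), rng.normal(3, 0.5, (5, 2))])
y_lab = np.array(["cat"] * 5 + ["dog"] * 5)
X_unlab = np.vstack([rng.normal(-3, 0.5, (20, 2)),
                     rng.normal(3, 0.5, (20, 2))])

# Cluster all points, labeled and unlabeled, by similarity.
X_all = np.vstack([X_lab, X_unlab])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_all)
clusters = km.labels_

# Majority label among the labeled members of each cluster.
cluster_label = {}
for c in np.unique(clusters):
    members = y_lab[clusters[: len(y_lab)] == c]
    vals, counts = np.unique(members, return_counts=True)
    cluster_label[c] = vals[counts.argmax()]

# Unlabeled points inherit their cluster's label as a pseudo-label.
pseudo = np.array([cluster_label[c] for c in clusters[len(y_lab):]])
print(pseudo[:3], pseudo[-3:])
```

If a cluster's labeled members are mostly cats, every unlabeled point in that cluster is pseudo-labeled "cat", just as the transcript describes.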
8:53 Now, there are other semi-supervised learning techniques as well, but the key here is that they can be combined. 9:01 So, for example, we could start with unsupervised pre-processing, 9:06 which could be used to first extract meaningful features from the unlabeled dataset, 9:11 giving us a solid foundation for the more accurate clustering-based methods that we can then use. 9:19 These clusters of pseudo-labeled data can then be incorporated into the 9:24 wrapper method, improving the model with each retraining cycle, 9:29 and meanwhile, we would rely on active learning to take the most ambiguous, lowest-confidence samples 9:36 and ensure that human effort is focused where it's most needed. 9:40 So that is semi-supervised learning: 9:43 a method to incorporate unlabeled data into model training alongside labeled examples, creating a better-fitting model. 9:53 Just like raising a cat or a dog, it needs a little bit of structure, a little bit of freedom, and a whole lot of learning along the way.