
Semi-Supervised Learning Explained with Cats

Key Points

  • Supervised learning trains a model on a fully labeled dataset (e.g., cat vs. dog images) by iteratively adjusting weights to minimize prediction errors.
  • Creating these labels—especially for tasks like image segmentation, genetic sequencing, or protein classification—is time‑consuming, labor‑intensive, and often requires specialized expertise.
  • Semi‑supervised learning addresses this bottleneck by leveraging a small amount of labeled data together with abundant unlabeled data to improve model performance.
  • Using only limited labeled data can lead to poor generalization, so semi‑supervised approaches help mitigate overfitting and make better use of available information.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=C3Lr6Waw66g](https://www.youtube.com/watch?v=C3Lr6Waw66g)
**Duration:** 00:10:02

## Sections

- [00:00:00](https://www.youtube.com/watch?v=C3Lr6Waw66g&t=0s) **Explaining Semi‑Supervised Learning with Cats & Dogs** - The speaker introduces supervised learning by detailing how a model classifies labeled cat and dog images, then highlights the reliance on labeled data as a prelude to the concept of semi‑supervised learning.
- [00:03:10](https://www.youtube.com/watch?v=C3Lr6Waw66g&t=190s) **Combating Overfitting with Unlabeled Data** - The speaker explains that relying solely on a small labeled dataset causes models to overfit, learning spurious cues like indoor vs. outdoor settings, and that semi‑supervised learning mitigates this by augmenting training with many unlabeled examples to improve generalization.
- [00:06:23](https://www.youtube.com/watch?v=C3Lr6Waw66g&t=383s) **Iterative Pseudo-Labeling and Clustering Techniques** - The speaker explains three semi‑supervised strategies, iterative retraining with pseudo‑labels, autoencoder-based feature extraction, and clustering‑based pseudo‑label assignment, to improve model performance with limited labeled data.
- [00:09:39](https://www.youtube.com/watch?v=C3Lr6Waw66g&t=579s) **Semi‑Supervised Learning Explained** - The speaker describes semi‑supervised learning as a method that blends labeled and unlabeled data to train a better‑fitting model, using the analogy of raising a pet that requires both structure and freedom.

## Full Transcript
0:00 What is semi-supervised learning? 0:03 Well, let me give you an example. 0:05 Consider building an AI model that can classify pictures of cats and dogs. 0:09 If you give the model a picture of an animal, it will tell you whether that picture shows a cat or a dog. 0:17 Now, we can build a model like that using a process called supervised learning; not semi-supervised, just supervised learning. 0:25 This involves training the model on a dataset, 0:32 and this dataset has images that are labeled as either cat or dog. 0:38 So for instance, we might have 100 images, with half of them labeled as cat and the other half labeled as dog. 0:48 We also have an AI model that is going to do the work here, 0:55 and the model learns from these labeled examples by 0:58 identifying patterns and features that differentiate these animals, 1:02 perhaps things like ear shape, which is generally more pointy for cats, 1:07 or body structure, which is generally more bulky for dogs. 1:11 Then during training, the model makes predictions, 1:15 evaluates the accuracy of those predictions through something called a loss function, which asks: was I right? 1:20 Was it really a cat or a dog? 1:22 And it makes adjustments using techniques such as gradient 1:26 descent that update the model weights to improve future predictions. 1:30 Now, that's all well and good, 1:32 but as we have established, supervised learning needs a labeled dataset. 1:38 This dataset here is full of labels, and those form the ground truth from which the model trains. 1:47 Now, when we think about what a label actually is, it could be as simple as a classification label 1:56 that just says that, yeah, this picture contains a cat.
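The training loop just described (make predictions, score them with a loss function, then nudge the weights with gradient descent) can be sketched in a few lines. This is a minimal illustration, not the speaker's code: the two features, the synthetic data, and the learning rate are all invented for the example.

```python
# Logistic regression on two hypothetical features ("ear pointiness",
# "body bulk"), trained by gradient descent on 100 labeled examples.
import numpy as np

rng = np.random.default_rng(0)

# 50 cats (label 0) and 50 dogs (label 1), as in the transcript's example.
cats = rng.normal([0.8, 0.3], 0.1, size=(50, 2))   # pointier ears, slimmer
dogs = rng.normal([0.3, 0.8], 0.1, size=(50, 2))   # rounder ears, bulkier
X = np.vstack([cats, dogs])
y = np.array([0] * 50 + [1] * 50)

w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))      # predicted P(dog)
    grad_w = X.T @ (p - y) / len(y)         # gradient of cross-entropy loss
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w                       # gradient-descent weight update
    b -= 0.5 * grad_b

p = 1 / (1 + np.exp(-(X @ w + b)))
acc = np.mean((p > 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```

Each pass asks "was I right?" via the loss gradient and moves the weights a small step in the direction that reduces the error, exactly the cycle the transcript describes.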
1:59 And yeah, this picture contains a dog. 2:01 But it could also be something a bit more complicated, such as an image segmentation label, 2:08 which assigns labels to individual pixel boundaries 2:13 in an image, indicating precisely where in the image the object can be found. 2:18 Now, this is manual work that somebody has to perform, and I don't know about you, 2:22 but going through a dataset of hundreds of images of pets and then 2:27 designing and assigning labels to them is not my idea of a good time. 2:32 Labeling images of cats and dogs is time-consuming and tedious, 2:36 but what about more specialized use cases like genetic sequencing or protein classification? 2:42 That sort of data annotation is not only extremely time-consuming, but it also requires very specific domain expertise. 2:51 There are just fewer people who can do it. 2:54 So enter the world of semi-supervised learning to help us out. 3:00 Semi-supervised learning offers a way to extract benefit from a scarce amount of 3:05 labeled data while making use of relatively abundant unlabeled data. 3:10 Now, before we get into the how, as in how semi-supervised learning works, let's first address the why. 3:18 Why not just build your model using whatever labeled data is currently available? 3:23 Well, the answer is that using a limited amount of labeled data introduces the possibility of something called overfitting. 3:36 This happens when the model performs well on the training dataset but struggles to generalize to new, unseen images. 3:45 So for instance, suppose that in the training dataset, 3:49 most of the images of cats are taken indoors and most of the images of dogs are taken outdoors. 3:59 Well, the model might mistakenly learn to associate outdoor settings 4:03 with dogs rather than focusing on more meaningful features, and as a result, it could incorrectly classify 4:10 any image taken outdoors as showing a dog, even if it contains a cat.
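The indoor/outdoor failure mode above can be reproduced on toy data. In this hedged sketch the feature layout is invented: each "image" is just `[ear_pointiness, outdoor_flag]`, with the outdoor flag perfectly correlated with the dog label in the tiny training set.

```python
# A model fit to a tiny labeled set where every dog is outdoors and every
# cat is indoors latches onto the spurious "outdoor" cue.
from sklearn.tree import DecisionTreeClassifier

X_train = [
    [0.9, 0], [0.8, 0], [0.6, 0], [0.4, 0],   # cats, all photographed indoors
    [0.5, 1], [0.3, 1], [0.2, 1], [0.1, 1],   # dogs, all photographed outdoors
]
y_train = ["cat"] * 4 + ["dog"] * 4

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The outdoor flag separates this training set perfectly, while ear
# pointiness overlaps slightly, so the tree splits on the spurious cue.
# A pointy-eared cat photographed outdoors is then misclassified.
print(clf.predict([[0.9, 1]]))   # → ['dog']
```

The model scores 100% on its training set yet fails on the first outdoor cat it sees, which is exactly the generalization gap the transcript is warning about.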
4:16 In general, the solution to overfitting is to increase the size of the training dataset, 4:23 and that is where semi-supervised learning comes in: by incorporating unlabeled data into the training process, 4:30 we can effectively expand our dataset. 4:35 So, for example, instead of just training the model on a dataset that contains 100 4:40 labeled examples, we can also add in some unlabeled examples as well. 4:46 Maybe we could add 1,000 unlabeled examples into this dataset. 4:53 That gives the model more context to learn from without requiring additional labeled data. 5:01 So that's the why. 5:02 Now let's get into the how. 5:04 There are many semi-supervised learning techniques, and we'll just narrowly focus on a few of them. 5:10 So first up is something called the wrapper method. 5:18 What is the wrapper method? 5:20 Well, here's what it does. 5:22 We start with a base model trained on a labeled dataset, 5:27 and then we use this trained model to predict labels for 5:32 the unlabeled dataset; that is, images that contain, let's say, cats and dogs, 5:36 but where the individual images do not carry an actual label. 5:40 Now, those predicted labels have a name: 5:44 they are called pseudo-labels. 5:49 A pseudo-label is a label assigned by this method, and they are typically probabilistic rather than deterministic, 5:59 meaning that the pseudo-label comes with a probability of how confident the model is in its labeling. 6:05 So it might say, for example, for a given image, there's an 85% chance that this one is a dog. 6:14 Now, pseudo-labels with high confidence are then combined with the 6:18 original labeled dataset and treated as if they were actual ground-truth labels. 6:24 The model is then retrained on this new dataset, which is of course now a bit larger 6:28 and includes both the labeled and the pseudo-labeled data.
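One round of the wrapper method (train, pseudo-label, filter by confidence, retrain) can be sketched with scikit-learn. The 85% threshold echoes the speaker's example, and the synthetic 2-D points standing in for 100 labeled and 1,000 unlabeled images are invented for illustration.

```python
# Self-training sketch: a base model pseudo-labels an unlabeled pool,
# and only high-confidence pseudo-labels join the training set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Small labeled set (100 points) and a larger unlabeled pool (1,000 points).
X_lab = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_lab = np.array([0] * 50 + [1] * 50)          # 0 = cat, 1 = dog
X_unlab = np.vstack([rng.normal(-2, 1, (500, 2)),
                     rng.normal(2, 1, (500, 2))])

# 1. Train a base model on the labeled data only.
model = LogisticRegression().fit(X_lab, y_lab)

# 2. Predict probabilistic pseudo-labels for the unlabeled pool.
proba = model.predict_proba(X_unlab)
confidence = proba.max(axis=1)      # e.g. "85% chance this one is a dog"
pseudo = proba.argmax(axis=1)

# 3. Keep only high-confidence pseudo-labels, treat them as ground truth,
#    and retrain on the enlarged dataset.
keep = confidence >= 0.85
X_big = np.vstack([X_lab, X_unlab[keep]])
y_big = np.concatenate([y_lab, pseudo[keep]])
model = LogisticRegression().fit(X_big, y_big)
print(f"kept {keep.sum()} of {len(X_unlab)} pseudo-labels")
```

Only one cycle is shown here; in practice steps 2 and 3 repeat, which is exactly the iteration the transcript turns to next.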
6:33 And this process can be repeated iteratively, with each iteration improving the quality of the pseudo-labels 6:39 as the model becomes better at distinguishing between the images. 6:43 So that's the wrapper method. 6:45 Now, another approach is called unsupervised pre-processing, 6:55 and that uses a model called an autoencoder. 7:02 What the autoencoder does is learn to represent each image 7:07 in a more compact and meaningful way by capturing essential features, so things like edges and shapes and textures. 7:15 When that's applied to the unlabeled images, it extracts these 7:19 key features, which are then used to train a supervised model more effectively, 7:24 helping it better generalize even with limited labeled data. 7:30 Another method that is commonly used relates to clustering: 7:34 clustering-based methods. 7:39 These apply the cluster assumption, which is essentially that similar data points are likely to belong to the same class. 7:48 So a clustering algorithm, something like k-means, can group all data points, both labeled 7:54 and unlabeled, into clusters based on their similarity. 7:58 So, for example, if we do that here, 8:01 if we've got a cluster and we've got some labeled examples that kind of fall here on the chart, 8:07 and then we have some unlabeled examples which fall around here as well, 8:13 well, we can pseudo-label the unlabeled images in that cluster too. 8:19 So if the labeled images were cats, we could say those unlabeled ones that fall in the same area are cats as well. 8:26 And then finally, the method we want to talk about here is called active learning. 8:32 What active learning does is bring humans into the loop. 8:37 So samples with low-confidence pseudo-labels, meaning the model wasn't really sure how to classify them, 8:43 can be referred to human annotators for labeling. 8:47 So human labelers are only working on images that the model is unable to reliably classify itself.
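The clustering idea above can be sketched concretely: run k-means over labeled and unlabeled points together, then give each unlabeled point the majority label of the labeled points sharing its cluster. The two well-separated synthetic blobs standing in for cat and dog images are an assumption of this sketch, not part of the method.

```python
# Clustering-based pseudo-labeling under the cluster assumption:
# points in the same cluster likely belong to the same class.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X_lab = np.vstack([rng.normal(-3, 0.5, (5, 2)), rng.normal(3, 0.5, (5, 2))])
y_lab = np.array(["cat"] * 5 + ["dog"] * 5)
X_unlab = np.vstack([rng.normal(-3, 0.5, (20, 2)),
                     rng.normal(3, 0.5, (20, 2))])

# Cluster all points, labeled and unlabeled, by similarity.
X_all = np.vstack([X_lab, X_unlab])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_all)
clusters = km.labels_

# Majority label among the labeled members of each cluster.
cluster_label = {}
for c in np.unique(clusters):
    members = y_lab[clusters[: len(y_lab)] == c]
    vals, counts = np.unique(members, return_counts=True)
    cluster_label[c] = vals[counts.argmax()]

# Unlabeled points inherit their cluster's label as a pseudo-label.
pseudo = np.array([cluster_label[c] for c in clusters[len(y_lab):]])
print(pseudo[:3], pseudo[-3:])
```

If a cluster's labeled members are mostly cats, every unlabeled point in that cluster is pseudo-labeled "cat", just as the transcript describes.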
8:53 Now, there are other semi-supervised learning techniques as well, but the key here is that they can be combined. 9:01 So, for example, we could start with unsupervised pre-processing, 9:06 which could be used to first extract meaningful features from the unlabeled dataset, 9:11 giving us a solid foundation for the more accurate clustering-based methods that we can then use. 9:19 These clusters of pseudo-labeled data can then be incorporated into the 9:24 wrapper method, improving the model with each retraining cycle, 9:29 and meanwhile, we would rely on active learning to take the most ambiguous, lowest-confidence samples 9:36 and ensure that human effort is focused where it's most needed. 9:40 So that is semi-supervised learning: 9:43 a method to incorporate unlabeled data into model training alongside labeled examples, creating a better-fitting model. 9:53 Just like raising a cat or a dog, it needs a little bit of structure, a little bit of freedom, and a whole lot of learning along the way.