
K-Nearest Neighbors: Simple Classification Overview

Key Points

  • K‑Nearest Neighbors (KNN) classifies a new data point by assigning it the label most common among its K closest labeled points, assuming similar items lie near each other.
  • The algorithm requires a distance metric (e.g., Euclidean or Manhattan) to measure proximity and a user‑defined K value, often chosen as an odd number to avoid ties and set higher for noisy data.
  • In a fruit‑type example, plotting sweetness versus crunchiness lets KNN locate the nearest labeled apples or oranges and classify an unlabeled fruit accordingly.
  • KNN’s main advantages are its simplicity, minimal hyper‑parameter tuning, and strong baseline accuracy, making it a popular first classifier for newcomers.
  • Its drawbacks include sensitivity to the choice of K, computational cost for large datasets, and poor performance when features are not meaningfully scaled or when data is high‑dimensional.
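The two distance metrics named in the key points differ only in how per-axis differences are combined. A minimal sketch (the example points are invented for illustration):

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Grid distance: sum of the absolute differences along each axis.
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (1, 2), (4, 6)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```

Euclidean distance is the usual default; Manhattan can behave better when features are on grid-like or independently meaningful scales.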

Full Transcript

**Source:** [https://www.youtube.com/watch?v=b6uHw7QW_n4](https://www.youtube.com/watch?v=b6uHw7QW_n4)
**Duration:** 00:07:58

## Sections

- [00:00:00](https://www.youtube.com/watch?v=b6uHw7QW_n4&t=0s) **Introducing K‑Nearest Neighbors Classification** — The speaker explains the basic concept of K‑Nearest Neighbors using a fruit sweetness‑crunchiness example, showing how new items are classified by the majority label of their nearest neighbors.

## Full Transcript
[0:00] Whether you're just getting started on your journey to becoming a data scientist or you've been here for years, you'll probably recognize the KNN algorithm. It stands for K-nearest neighbors, and it's one of the most popular and simplest classifiers used for classification and regression in machine learning today. As a classification algorithm, KNN operates on the assumption that similar data points are located near each other and can be grouped into the same category based on their proximity.

[0:34] So let's consider an example. Imagine we have a data set containing information about different types of fruit, and let's visualize that fruit data set. Each fruit is categorized by two things: its sweetness, on the x-axis, and its crunchiness, on the y-axis. We've already labeled some data points: a few apples here (apples are very crunchy and somewhat sweet) and a few oranges down here (oranges are very sweet, not so crunchy). Now suppose you have a new fruit that you want to classify. We measure its crunchiness, we measure its sweetness, and then we can plot it on the graph; let's say it comes out maybe here. The KNN algorithm will then look at the K nearest points on the graph to this new fruit, and if most of those nearest points are classified as apples, the algorithm will classify the new fruit as an apple as well. How's that for an apples-to-apples comparison?

[2:00] Now, before a classification can be made, the distance must be defined, and there are only two requirements for a KNN algorithm to achieve its goal. The first one is what's called the distance metric. The distance between the query point and the other data points needs to be calculated, forming decision boundaries and partitioning query points into different regions, which are commonly visualized using Voronoi diagrams (they look a bit like a kaleidoscope). This distance serves as our distance metric and can be calculated using various measures, such as Euclidean distance or Manhattan distance.

[2:43] Number two is that we need to define the value of K. The K value in the KNN algorithm defines how many neighbors will be checked to determine the classification of a specific query point. For example, if K equals 1, the instance will be assigned to the same class as its single nearest neighbor. Choosing the right K value largely depends on the input data; data with more outliers or noise will likely perform much better with higher values of K. Also, it's recommended to choose an odd number for K to minimize the chances of ties in classification.

[3:28] Now, just like any machine learning algorithm, KNN has its strengths and it has its weaknesses, so let's take a look at some of those. On the plus side, we have to say that KNN is quite easy to implement; its simplicity and its accuracy make it one of the first classifiers that a new data scientist will learn. It also has only a few hyperparameters, which is a big advantage as well: KNN only requires a K value and a distance metric, which is a lot less than other machine learning algorithms. Also in the plus category, we can say that it's very adaptable, meaning that as new training samples are added, the algorithm adjusts to account for any new data, since all training data is stored in memory. That sounds good, but there's also a drawback here: because of that, it doesn't scale very well. As a data set grows, the algorithm becomes less efficient due to increased computational complexity, compromising the overall model performance.
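The procedure described so far, measuring distance to every labeled point and taking a majority vote among the K nearest, fits in a few lines of plain Python. This is only a sketch; the fruit coordinates are made up for illustration:

```python
import math
from collections import Counter

def knn_classify(query, labeled_points, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    # Sort all labeled points by Euclidean distance to the query point.
    by_distance = sorted(labeled_points, key=lambda p: math.dist(query, p[0]))
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (sweetness, crunchiness) measurements.
fruits = [((3, 9), "apple"), ((4, 8), "apple"), ((2, 10), "apple"),
          ((9, 2), "orange"), ((8, 3), "orange"), ((10, 1), "orange")]

print(knn_classify((4, 7), fruits, k=3))  # prints "apple"
```

With an odd K and two classes, the vote can never tie, which is exactly why the transcript recommends odd values of K.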
This inability to scale comes from KNN being what's called a lazy algorithm, meaning it stores all training data and defers the computation to the time of classification. That results in higher memory usage and slower processing compared to other classifiers.

[5:04] Now, KNN also tends to fall victim to something called the curse of dimensionality, which means it doesn't perform well with high-dimensional data inputs. In our sweetness-to-crunchiness example, this is a 2D space, so it's relatively easy to find the nearest neighbors and classify new fruits accurately. However, if we keep adding more features like color and size and weight and so on, the data points become sparse in the high-dimensional space. The distances between the points start to become similar, making it difficult for KNN to find meaningful neighbors. It can also lead to something called the peaking phenomenon, where after reaching an optimal number of features, adding more features just increases noise and increases classification errors, especially when the sample size is small. Feature selection and dimensionality reduction techniques can help minimize the curse of dimensionality, but if not done carefully, they can make KNN prone to another downside, and that is overfitting. Lower values of K can overfit the data, whereas higher values of K tend to smooth out the prediction values, since the algorithm is averaging the values over a greater area, or neighborhood.

[6:26] Because of all this, the KNN algorithm is commonly used for simple recommendation systems. For example, the algorithm can be applied in the area of data preprocessing; that's a pretty common use case for KNN, because the algorithm is helpful for data sets with missing values, since it can estimate those values using a process known as missing data imputation. Another use case is in finance, where the KNN algorithm is often used in stock market forecasting, currency exchange rates, trading futures, and money laundering analysis. And we also have to consider the use case of healthcare, where it's been used to make predictions on the risk of heart attacks and prostate cancer by calculating the most likely gene expressions.

[7:30] So that's KNN: a simple but imperfect classification and regression classifier. In the right context, its straightforward approach is as delightful as biting into a perfectly classified apple. If you like this video and want to see more like it, please like and subscribe. If you have any questions or want to share your thoughts about this topic, please leave a comment below.
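The missing-data imputation use case mentioned in the transcript can be sketched the same way: find the K complete rows nearest to the incomplete row on its observed features, and fill the gap with their average. This is a simplified illustration with invented numbers, not a production imputer:

```python
import math

def knn_impute(row, complete_rows, missing_idx, k=2):
    """Fill row[missing_idx] with the mean of that feature over the
    k complete rows nearest to `row` on its observed features."""
    # Indices of the features we actually observed for this row.
    observed = [i for i in range(len(row)) if i != missing_idx]
    def dist(other):
        # Distance computed only over the observed features.
        return math.dist([row[i] for i in observed],
                         [other[i] for i in observed])
    nearest = sorted(complete_rows, key=dist)[:k]
    filled = list(row)
    filled[missing_idx] = sum(r[missing_idx] for r in nearest) / k
    return filled

rows = [(1.0, 10.0), (1.2, 11.0), (8.0, 50.0)]
print(knn_impute((1.1, None), rows, missing_idx=1, k=2))  # [1.1, 10.5]
```

Library implementations (for example, scikit-learn's `KNNImputer`) follow this same idea with distance-weighted averaging and support for multiple missing columns.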