
ML - KNN Model

Resources

Goal

To understand and use KNN for supervised learning.

K-Nearest Neighbours

Lazy learning paradigm and the KNN algorithm

Use cases

What is K?

How to Calculate Distance?

Algorithm & Pseudocode

For Regression

For Classification

Pseudocode

What We Need (Inputs):

How KNN Works (Step by Step):

Step 1: Calculate Distances

For every single training example in our dataset, we measure how far it is from our new observation X. We use our chosen distance metric (like measuring distance on a map) to find this distance. We create a list showing each training point and how far it is from X.

Example: If X is the point (7, 7), we measure the distance from (7, 7) to every other point in our training data.
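Step 1 can be sketched in plain Python. This is a minimal sketch using the points from the worked classification example below; the function name `euclidean` is our own choice, not a library API:

```python
import math

def euclidean(p, q):
    """Straight-line (Euclidean) distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

x = (7, 7)
training = [(2, 2), (3, 3), (8, 8), (9, 9)]

# One distance per training point: this list is the output of Step 1.
distances = [(point, euclidean(x, point)) for point in training]
for point, d in distances:
    print(point, round(d, 2))
```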

Step 2: Select the K Nearest Neighbors

Once we have all the distances, we sort them from smallest to largest. Then we pick the K closest points - the ones with the smallest distances. These are our “nearest neighbors.”

Example: If K=3, we pick the 3 training points that are closest to our new observation X.

Step 3: Make a Prediction

Now comes the interesting part - how we use these K neighbors to make our prediction depends on what task we’re doing:


Example: Classification Task

Imagine we have four labeled training points, (2,2) and (3,3) labeled Red, (8,8) and (9,9) labeled Blue, plus a new observation X at (7,7) and K=3.

What happens:

  1. Calculate Distances: We measure how far (7,7) is from each point using Euclidean distance:

    • Distance to (2,2) = 7.07 units away → Red

    • Distance to (3,3) = 5.66 units away → Red

    • Distance to (8,8) = 1.41 units away → Blue

    • Distance to (9,9) = 2.83 units away → Blue

  2. Select K=3 Nearest Neighbors: We pick the 3 closest points:

    • (8,8) → Blue (1.41 units away) ✓

    • (9,9) → Blue (2.83 units away) ✓

    • (3,3) → Red (5.66 units away) ✓

  3. Make Prediction: We count which color appears most in our 3 neighbors:

    • Blue appears 2 times

    • Red appears 1 time

    • We predict: X(7,7) is BLUE because Blue is the most common label ✓

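The full classification walkthrough can be sketched as one small function. This is a minimal sketch in plain Python (the names `euclidean` and `knn_classify` are our own, not a library API), using the same four training points:

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(x, data, k):
    """Predict a label by majority vote among the k training points closest to x.
    `data` is a list of (point, label) pairs."""
    # Steps 1 + 2: sort by distance to x, keep the k nearest.
    neighbors = sorted(data, key=lambda item: euclidean(x, item[0]))[:k]
    # Step 3: count the labels and return the most common one.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

data = [((2, 2), "Red"), ((3, 3), "Red"), ((8, 8), "Blue"), ((9, 9), "Blue")]
print(knn_classify((7, 7), data, k=3))  # Blue
```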

Example: Regression Task

Imagine we’re trying to predict a number from similar training examples: the training pairs are 2→4, 3→5, 8→16, and 9→18, the new input is 7, and K=3.

What happens:

  1. Calculate Distances: We measure how far 7 is from each training input (in one dimension, Euclidean distance is just the absolute difference):

    • Distance to 2 = 5 units → output value is 4

    • Distance to 3 = 4 units → output value is 5

    • Distance to 8 = 1 unit → output value is 16

    • Distance to 9 = 2 units → output value is 18

  2. Select K=3 Nearest Neighbors: We pick the 3 closest:

    • Input 8 (1 unit away) → output 16 ✓

    • Input 9 (2 units away) → output 18 ✓

    • Input 3 (4 units away) → output 5 ✓

  3. Make Prediction: We take the average of the output values:

    • Average = (16 + 18 + 5) ÷ 3 = 39 ÷ 3 = 13

    • We predict: When input is 7, output will be 13

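The regression walkthrough differs only in Step 3: instead of voting, we average. A minimal sketch in plain Python (the name `knn_regress` is our own, not a library API), using the same training pairs:

```python
def knn_regress(x, data, k):
    """Predict a number by averaging the outputs of the k training
    inputs closest to x. `data` is a list of (input, output) pairs."""
    # Steps 1 + 2: sort by 1-D distance to x, keep the k nearest.
    neighbors = sorted(data, key=lambda item: abs(x - item[0]))[:k]
    # Step 3: average the neighbors' output values.
    return sum(output for _, output in neighbors) / k

data = [(2, 4), (3, 5), (8, 16), (9, 18)]
print(knn_regress(7, data, k=3))  # 13.0
```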

Key Observations:

| Aspect | What It Means | Alternatives |
| --- | --- | --- |
| Lazy Learning | KNN doesn’t “train” like other algorithms; it simply stores all the training data and uses it directly when making predictions | Decision Trees, Random Forest, and SVM have a formal training phase where they learn patterns from the data |
| K Size Matters | A smaller K (like K=1) bases predictions on only the very closest neighbors; a larger K (like K=10) considers more neighbors and is more stable | Experiment to find the best K for your specific problem, typically via cross-validation |
| Distance Matters | The choice of distance metric is important; different distance measures can give different results | Common metrics: Euclidean (straight-line distance), Manhattan (grid distance), Cosine (angle between vectors) |
| No Training Phase | KNN has essentially zero training time, since there is nothing to learn beforehand | Algorithms like Linear Regression or Neural Networks spend time learning optimal parameters during training |
| Slow at Prediction | KNN is slow at prediction time because it must calculate the distance to every single training example | Spatial index structures such as KD-Trees or Ball Trees, or approximate nearest-neighbor methods, speed up the search |
| Works with Both | KNN handles both classification (predicting categories) and regression (predicting numbers) with the same algorithm | Most other algorithms specialize in one task; you’d use Logistic Regression for classification and Linear Regression for regression |
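The three metrics named in the “Distance Matters” row can be written out directly. A minimal sketch in plain Python (function names are our own); note that cosine distance ignores magnitude, so two points along the same direction are distance 0 apart:

```python
import math

def euclidean(p, q):
    # Straight-line distance.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Grid ("city block") distance: sum of per-axis differences.
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_distance(p, q):
    # 1 minus the cosine of the angle between the two vectors.
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1 - dot / norm

p, q = (7, 7), (3, 3)
print(round(euclidean(p, q), 2))   # 5.66
print(manhattan(p, q))             # 8
print(cosine_distance(p, q))       # ~0: same direction, different magnitude
```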

When to Use KNN

When to avoid KNN

k value and its effects

| Aspect | Center Point | Outlier Point |
| --- | --- | --- |
| Small k | Sensitive to local noise | Captures true local structure |
| Medium k | Smooth, stable predictions | Starts including irrelevant points |
| Large k | Approaches global prior (stable but biased) | Severely contaminated by distant points |
| k = n | Always predicts global majority class | Always predicts global majority class |
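The k = n row is easy to demonstrate: once every training point counts as a neighbor, the query location no longer matters and the global majority class always wins. A minimal sketch with made-up 1-D data (the dataset and function name are our own):

```python
from collections import Counter

def knn_classify(x, data, k):
    """Majority vote among the k training inputs closest to x."""
    neighbors = sorted(data, key=lambda item: abs(x - item[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Three "A" points near 0, two "B" points near 90.
data = [(1, "A"), (2, "A"), (3, "A"), (90, "B"), (91, "B")]

# k = n: every point is a neighbor, so the global majority ("A") always wins,
# even though the query 95 sits right next to the B cluster.
print(knn_classify(95, data, k=len(data)))  # A

# Small k respects local structure.
print(knn_classify(95, data, k=1))  # B
```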

FAQ - covered in KNN advanced.

  1. What happens when a point lies exactly between the two groups?

  2. Is randomization during training enough?

  3. How do we choose K for the best fit?

  4. Since every prediction requires a distance to every training point, and adding points one by one drives the total work towards O(n^2), how does KNN optimize this, and how does it scale to large datasets?