
ML - Model Evaluation

Goal

To understand and implement multiple evaluation methods, such as accuracy, precision, and recall, to evaluate a classification model.

Resources

Why do we need to evaluate a model?

Evaluation Metrics

Confusion Matrix

A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one. In unsupervised learning it is usually called a matching matrix. The term is used specifically in the problem of statistical classification.

Each row of the matrix represents the instances in an actual class, while each column represents the instances in a predicted class (or vice versa).

Table of confusion - Basic Structure (2x2 Binary Matrix)

ML Sample Workflow

Matrix Structure

Example / Usecase

  1. Consider we are working on a pregnancy classification model.

  2. Consider we are working on classifying an email as spam or not spam.

Rows represent the Actual Value and columns the Predicted Value:

|                       | Predicted Positive | Predicted Negative |
|-----------------------|--------------------|--------------------|
| **Actually Positive** | TP                 | FN                 |
| **Actually Negative** | FP                 | TN                 |

True Positive (TP):

| Usecase                  | Actual <> Predicted  |
|--------------------------|----------------------|
| Spam email text as input | Spam <> Spam         |
| Woman as input           | Pregnant <> Pregnant |

True Negative (TN):

| Usecase                        | Actual <> Predicted          |
|--------------------------------|------------------------------|
| Email (non-spam) text as input | Email <> Email               |
| Man as input                   | Not Pregnant <> Not Pregnant |

False Positive (FP):

| Usecase                        | Actual <> Predicted      |
|--------------------------------|--------------------------|
| Email (non-spam) text as input | Email <> Spam            |
| Man as input                   | Not Pregnant <> Pregnant |

False Negative (FN):

| Usecase                  | Actual <> Predicted      |
|--------------------------|--------------------------|
| Spam email text as input | Spam <> Email            |
| Woman as input           | Pregnant <> Not Pregnant |
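The four outcomes above can be counted directly from paired actual/predicted labels; a minimal sketch, assuming 1 = positive (spam / pregnant) and 0 = negative, with a hypothetical `confusion_counts` helper:

```python
# Count TP, TN, FP, FN from paired actual/predicted labels (1 = positive).
def confusion_counts(y_true, y_pred):
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    return tp, tn, fp, fn

# Toy spam example: 1 = spam, 0 = not spam
actual    = [1, 0, 1, 0, 1]
predicted = [1, 0, 0, 1, 1]
print(confusion_counts(actual, predicted))  # (2, 1, 1, 1)
```

Every prediction falls into exactly one of the four buckets, so the counts always sum to the number of samples.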

Key Metrics Derived from the Matrix

Accuracy

Accuracy measures the total number of correct classifications divided by the total number of cases.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Where:

- TP = True Positives
- TN = True Negatives
- FP = False Positives
- FN = False Negatives

Recall/Sensitivity

Recall/Sensitivity measures the total number of true positives divided by the total number of actual positives. It answers: “Of all the actual positive cases, how many did we correctly identify?”

$$\text{Recall} = \frac{TP}{TP + FN}$$

Also known as Sensitivity or True Positive Rate (TPR).

Precision

Precision measures the total number of true positives divided by the total number of predicted positives. It answers: “Of all the cases we predicted as positive, how many were actually positive?”

$$\text{Precision} = \frac{TP}{TP + FP}$$

Specificity

Specificity measures the total number of true negatives divided by the total number of actual negatives. It answers: “Of all the actual negative cases, how many did we correctly identify?”

$$\text{Specificity} = \frac{TN}{TN + FP}$$

Also known as True Negative Rate (TNR).
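scikit-learn has no dedicated specificity function, so it is commonly computed from the confusion matrix directly; a minimal sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]

# For binary labels {0, 1}, ravel() yields counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)
print(f"Specificity: {specificity:.2f}")  # 3 TN, 1 FP -> 0.75
```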

F1 Score

F1 Score is a single metric that is a harmonic mean of precision and recall. It balances both metrics and is useful when you need a single score.

$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Or equivalently:

$$\text{F1 Score} = \frac{2 \times TP}{2 \times TP + FP + FN}$$
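The two F1 formulations are algebraically identical; a quick numerical check with hypothetical counts:

```python
# Verify both F1 formulations agree for sample counts (hypothetical values).
tp, fp, fn = 45, 10, 5

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_from_pr = 2 * precision * recall / (precision + recall)  # harmonic-mean form
f1_from_counts = 2 * tp / (2 * tp + fp + fn)                # count form

print(round(f1_from_pr, 4), round(f1_from_counts, 4))  # 0.8571 0.8571
```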

Sample by Usecase

Continuing the spam example, here is a worked sample using synthetic data.

Spam Email Classification Example

Suppose we have a spam email classifier that was tested on 100 emails. Here’s the confusion matrix:

|                       | Predicted Spam | Predicted Not Spam |
|-----------------------|----------------|--------------------|
| **Actually Spam**     | 45 (TP)        | 5 (FN)             |
| **Actually Not Spam** | 10 (FP)        | 40 (TN)            |

Given: TP = 45, FN = 5, FP = 10, TN = 40.

Accuracy Calculation

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{45 + 40}{45 + 40 + 10 + 5} = \frac{85}{100} = 0.85 \text{ or } 85\%$$

Interpretation: The model correctly classified 85% of all emails.

Recall/Sensitivity Calculation

$$\text{Recall} = \frac{TP}{TP + FN} = \frac{45}{45 + 5} = \frac{45}{50} = 0.90 \text{ or } 90\%$$

Interpretation: Of all the actual spam emails, the model correctly identified 90%. It missed 10% of spam emails (false negatives).

Precision Calculation

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{45}{45 + 10} = \frac{45}{55} \approx 0.818 \text{ or } 81.8\%$$

Interpretation: Of all emails the model predicted as spam, 81.8% were actually spam. There’s an 18.2% false alarm rate.

Specificity Calculation

$$\text{Specificity} = \frac{TN}{TN + FP} = \frac{40}{40 + 10} = \frac{40}{50} = 0.80 \text{ or } 80\%$$

Interpretation: Of all the actual non-spam emails, the model correctly identified 80%. It incorrectly flagged 20% as spam.

F1 Score Calculation

$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \times \frac{0.818 \times 0.90}{0.818 + 0.90} = 2 \times \frac{0.7362}{1.718} \approx 0.857 \text{ or } 85.7\%$$

Interpretation: The F1 Score of 0.857 balances both precision and recall, indicating a well-performing model overall.

Summary of Results

| Metric               | Value | Meaning                                |
|----------------------|-------|----------------------------------------|
| Accuracy             | 85%   | Overall correctness                    |
| Recall (Sensitivity) | 90%   | Catches most spam emails               |
| Precision            | 81.8% | Most predicted spam are actually spam  |
| Specificity          | 80%   | Correctly identifies legitimate emails |
| F1 Score             | 85.7% | Balanced performance metric            |
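As a cross-check, the summary can be reproduced with scikit-learn by expanding the confusion-matrix counts back into label arrays; a sketch using the worked example's counts (TP=45, FN=5, FP=10, TN=40):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Rebuild label arrays from the worked example's counts: TP=45, FN=5, FP=10, TN=40.
y_true = np.array([1] * 45 + [1] * 5 + [0] * 10 + [0] * 40)
y_pred = np.array([1] * 45 + [0] * 5 + [1] * 10 + [0] * 40)

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")   # 0.850
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")     # 0.900
print(f"Precision: {precision_score(y_true, y_pred):.3f}")  # 0.818
print(f"F1 Score:  {f1_score(y_true, y_pred):.3f}")         # 0.857
```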

When to Use Each Metric

Precision vs Recall

For a detailed comparison of Precision and Recall, including real-world examples and when to use each metric, see Precision vs Recall: A Detailed Comparison.

Confusion Matrix using Python

Key Python Libraries

| Library                                   | Purpose                                   |
|-------------------------------------------|-------------------------------------------|
| `sklearn.metrics.confusion_matrix()`      | Create confusion matrix from predictions  |
| `sklearn.metrics.classification_report()` | Get detailed metrics report               |
| `seaborn.heatmap()`                       | Visualize confusion matrix as heatmap     |
| `matplotlib.pyplot`                       | Create plots and visualizations           |
| `pandas.DataFrame()`                      | Display confusion matrix in table format  |

Basic Confusion Matrix Creation

Here’s how to create and visualize a confusion matrix using Python:

from sklearn.metrics import confusion_matrix, classification_report
import numpy as np
import pandas as pd

# Example: Spam Email Classification

# Actual labels (ground truth)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])

# Predicted labels (from your model)
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1])

# Create confusion matrix
cm = confusion_matrix(y_true, y_pred)

print("Confusion Matrix:")
print(cm)
# Output:
# [[3 1]
#  [1 5]]

Interpreting the Output:

Confusion Matrix:
[[3 1]
 [1 5]]

Position [0,0] = 3  → True Negatives (TN)
Position [0,1] = 1  → False Positives (FP)
Position [1,0] = 1  → False Negatives (FN)
Position [1,1] = 5  → True Positives (TP)

Create Confusion Matrix as DataFrame

# Create a more readable confusion matrix using pandas

cm = confusion_matrix(y_true, y_pred)
cm_df = pd.DataFrame(cm, 
                     index=['Actually Negative', 'Actually Positive'],
                     columns=['Predicted Negative', 'Predicted Positive'])

print(cm_df)

#                    Predicted Negative  Predicted Positive
# Actually Negative                   3                   1
# Actually Positive                   1                   5

Calculate Metrics from Confusion Matrix

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculate all evaluation metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Accuracy:  {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1 Score:  {f1:.2%}")

# Output:
# Accuracy:  80.00%
# Precision: 83.33%
# Recall:    83.33%
# F1 Score:  83.33%

Visualize Confusion Matrix with Heatmap

import matplotlib.pyplot as plt
import seaborn as sns

# Create confusion matrix
cm = confusion_matrix(y_true, y_pred)

# Create heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm, 
            annot=True,          # Show numbers in cells
            fmt='d',              # Format as integers
            cmap='Blues',         # Color scheme
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'],
            cbar_kws={'label': 'Count'})

plt.title('Confusion Matrix - Spam Email Classification', fontsize=14, fontweight='bold')
plt.ylabel('Actual Class', fontsize=12)
plt.xlabel('Predicted Class', fontsize=12)
plt.tight_layout()
plt.show()

Output Visualization:

Confusion Matrix - Spam Email Classification

            Negative  Positive
Negative      3         1
Positive      1         5

Common Scenarios

Scenario 1: Binary Classification (2 Classes)

# Spam vs Not Spam
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)
# Works perfectly with 2x2 matrix

Scenario 2: Multi-class Classification (3+ Classes)

# Email categories: Spam, Promotions, Important
y_true = [0, 1, 2, 0, 1, 2, 0, 1]
y_pred = [0, 1, 1, 0, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred)
# Creates 3x3 confusion matrix
print(cm.shape)  # (3, 3)
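For multi-class problems, precision, recall, and F1 additionally need an averaging strategy (`average='macro'`, `'micro'`, or `'weighted'`); a sketch continuing the three-class email example:

```python
from sklearn.metrics import recall_score

# Three email categories: 0 = Spam, 1 = Promotions, 2 = Important
y_true = [0, 1, 2, 0, 1, 2, 0, 1]
y_pred = [0, 1, 1, 0, 1, 2, 0, 2]

# 'macro' averages the per-class recalls equally; 'micro' pools all decisions.
macro_recall = recall_score(y_true, y_pred, average='macro')
micro_recall = recall_score(y_true, y_pred, average='micro')
print(f"macro recall: {macro_recall:.3f}")  # (1 + 2/3 + 1/2) / 3 ≈ 0.722
print(f"micro recall: {micro_recall:.3f}")  # 6 correct / 8 total = 0.750
```

Macro averaging treats rare classes as equally important; micro averaging weights every sample equally, so the dominant class drives the score.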

Tips for Working with Confusion Matrices

  1. Always check total samples: Sum of all values should equal total test set size

  2. Normalize for comparison: Divide by row totals to see percentages

  3. Use classification_report(): Provides precision, recall, and F1 for each class

  4. Visualize before analyzing: Heatmaps make patterns obvious

  5. Check for class imbalance: If one class dominates, consider accuracy alternatives
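Tips 2 and 3 above can be sketched together: normalizing the confusion matrix by row totals and printing a per-class report (using the same ten synthetic labels as earlier):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Tip 2: normalize by row totals so each row shows per-class rates.
cm_norm = confusion_matrix(y_true, y_pred, normalize='true')
print(np.round(cm_norm, 2))

# Tip 3: precision, recall, and F1 for every class in one report.
print(classification_report(y_true, y_pred, target_names=['Not Spam', 'Spam']))
```

Each row of the normalized matrix sums to 1, making class-wise error rates comparable even under class imbalance.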