“The beautiful thing about learning is that nobody can take it away from you.”
— B.B. King
3.1 Introduction to Classification
Classification is a key concept in supervised learning, where the goal is to assign input data into predefined categories or classes. It is widely used in applications such as spam detection, medical diagnosis, and customer segmentation. Unlike regression, which deals with continuous outputs, classification predicts discrete outcomes.
In this chapter, we explore several key classification algorithms, each with a unique approach to handling data: Logistic Regression, k-Nearest Neighbors (k-NN), Decision Trees, Random Forests, and Support Vector Machines (SVMs). We delve into the mechanics of each model, their advantages and disadvantages, and provide Python examples to illustrate their application. We will also discuss metrics for evaluating classification models, such as accuracy, precision, recall, and F1-score.
3.1.1 Example: Predicting Student Pass/Fail Status
Consider an educational institution that wants to predict whether a student will pass or fail based on their study habits, attendance, and previous grades. Classification algorithms can help model this problem.
3.1.2 Engagement Question
How might classification be useful in your industry? Can you think of a scenario where predicting a categorical outcome would be valuable?
3.2 Logistic Regression
Logistic Regression is a statistical model that, despite its name, is a classification model rather than a regression model, and it is primarily used for binary classification tasks. It models the probability that a given input point belongs to a particular class. The logistic function, also known as the sigmoid function, maps any real-valued number into a value between 0 and 1, which is interpreted as that probability. Additionally, logistic regression allows the coefficients to be interpreted as odds ratios, which measure the change in the odds of the outcome occurring for a one-unit change in the predictor variable (Itauma 2019). The odds ratio provides a meaningful way to understand the effect size and direction of each predictor variable on the outcome (Itauma 2019). This interpretability makes logistic regression a valuable tool for assessing the impact of variables in fields such as healthcare and the social sciences. Furthermore, logistic regression can be extended to handle multiclass classification problems using techniques such as the one-vs-all (one-vs-rest) approach or multinomial logistic regression.
3.2.1 Mathematical Foundation
Logistic regression predicts the probability of the default class (usually denoted as 1). The model can be expressed as:
\[
P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n)}}
\]
Where:
- \(P(y=1|x)\) is the probability that the output is 1 given the input \(x\).
- \(\beta_0\) is the intercept.
- \(\beta_1, \ldots, \beta_n\) are the coefficients for the input features \(X_1, \ldots, X_n\), learned during training.
- \(y\) is the target variable.
- \(x = (X_1, \ldots, X_n)\) is the input feature vector.
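For illustration, suppose (with hypothetical coefficient values, not estimated from any data) that a fitted model for the pass/fail example has intercept \(\beta_0 = -4\) and coefficient \(\beta_1 = 0.8\) for study hours. For a student who studies \(X_1 = 6\) hours:
\[
P(y=1|x) = \frac{1}{1 + e^{-(-4 + 0.8 \times 6)}} = \frac{1}{1 + e^{-0.8}} \approx 0.69,
\]
so the model assigns roughly a 69% probability of passing, and with the usual 0.5 threshold this student would be classified as passing.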
3.2.2 Example: Predicting Pass/Fail Based on Study Hours
Let’s use a simulated dataset to demonstrate logistic regression.
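Below is a minimal sketch, assuming a small simulated dataset of study hours and pass/fail labels (the values are illustrative only); it also defines the train/test split (X_train, X_test, y_train, y_test) that the later examples reuse.
Code
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Simulated dataset: study hours and pass/fail outcome (illustrative values)
data = {'Study Hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Pass':        [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)

X = df[['Study Hours']]
y = df['Pass']

# Hold out part of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Fit the logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predict and evaluate
y_pred = log_reg.predict(X_test)
print(f"Logistic Regression Accuracy: {accuracy_score(y_test, y_pred):.2f}")

# Interpret the coefficient as an odds ratio: the multiplicative change in the
# odds of passing for each additional hour of study
print(f"Odds ratio per study hour: {np.exp(log_reg.coef_[0][0]):.2f}")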
3.2.3 Engagement Question
How does the logistic regression model differ from linear regression? What are the key characteristics of the logistic regression curve?
3.2.4 Limitations
Assumes a linear relationship between the input features and the log-odds of the output.
Limited in handling complex relationships.
3.3 k-Nearest Neighbors (KNN)
k-Nearest Neighbors is a non-parametric, instance-based learning algorithm. It classifies a data point based on the majority class among its k-nearest neighbors in the feature space. It is simple yet effective for various classification tasks. The algorithm does not require a training phase in the traditional sense but rather stores the entire dataset and makes predictions based on distance metrics, such as Euclidean or Manhattan distance (Zhang 2016). One of its advantages is its flexibility, as it can adapt to different data distributions by adjusting the value of k, which controls the number of neighbors considered. However, k-Nearest Neighbors can be computationally expensive and less effective with high-dimensional data, where the curse of dimensionality may reduce its performance. Despite these limitations, its ease of implementation and effectiveness in various scenarios make it a popular choice for tasks such as pattern recognition and recommendation systems.
3.3.1 Working Principle
Given a data point to classify, k-NN:
Calculates the distance between the data point and all other points in the training set (commonly using Euclidean distance).
Identifies the k-nearest neighbors to the data point.
Assigns the class label that is most common among the neighbors.
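To make these three steps concrete, here is a brief from-scratch sketch using NumPy; the helper function and the tiny dataset are hypothetical and for illustration only, and the scikit-learn example that follows is what you would use in practice.
Code
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 1: Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among the neighbors' class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny illustrative dataset: [study hours, previous grade] -> pass (1) / fail (0)
X_demo = np.array([[2, 60], [3, 65], [8, 90], [9, 95]])
y_demo = np.array([0, 0, 1, 1])
print(knn_predict(X_demo, y_demo, np.array([7, 85])))  # predicts 1 (pass)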
3.3.2 Python Example: k-Nearest Neighbors
Code
from sklearn.neighbors import KNeighborsClassifier

# Train the k-NN model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict and evaluate
y_pred_knn = knn.predict(X_test)
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f"k-NN Accuracy: {accuracy_knn:.2f}")
k-NN Accuracy: 1.00
3.3.3 Example: Classifying Students Based on Study Hours and Previous Grades
Let’s use a simulated dataset to classify students based on two features: study hours and previous grades.
Code
import pandas as pd
import plotly.express as px
from sklearn.neighbors import KNeighborsClassifier

# Simulated dataset with two features
data = {'Study Hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Previous Grades': [55, 60, 65, 70, 75, 80, 85, 90, 95, 100],
        'Pass': [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)

# Prepare the data for modeling
X = df[['Study Hours', 'Previous Grades']]
y = df['Pass']

# Fit the KNN model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Make predictions
knn_predictions = knn.predict(X)

# Add predictions to the DataFrame
df['KNN Predicted Pass'] = knn_predictions

# Plot using Plotly
fig = px.scatter(df, x='Study Hours', y='Previous Grades',
                 color='KNN Predicted Pass',
                 title='KNN Classification: Study Hours and Previous Grades')
fig.show()
3.3.4 Advantages
Simple and intuitive.
No assumptions about the data distribution.
Effective with a sufficient amount of labeled data.
3.3.5 Disadvantages
Computationally expensive, especially with large datasets.
Sensitive to the choice of k and the distance metric.
3.3.6 Engagement Question
How does the choice of \(k\) affect the k-NN model? What happens when \(k\) is too small or too large?
3.4 Decision Trees
Decision trees are a powerful classification algorithm that models decisions based on a tree-like structure. Each internal node represents a decision, each branch represents the outcome of the decision, and each leaf node represents a class label. They are easy to interpret and can capture non-linear relationships. By recursively partitioning the data into subsets based on feature values, decision trees can handle both categorical and numerical data effectively (Charbuty and Abdulazeez 2021). Moreover, they provide clear visualization of decision rules, which can be valuable for understanding the underlying patterns in the data. However, decision trees are prone to overfitting, especially with complex datasets, which can be mitigated by techniques such as pruning or using ensemble methods like Random Forests. Despite this, their intuitive nature and flexibility make them a popular choice in various domains including finance, healthcare, and marketing.
3.4.1 Working Principle
A decision tree is constructed by recursively splitting the dataset based on the feature that results in the most significant information gain or the highest reduction in impurity (e.g., Gini impurity or entropy).
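To make the impurity calculation concrete, the short sketch below (a simplified illustration with a hypothetical split, not scikit-learn's internal code) computes the Gini impurity of a node and the weighted impurity after a candidate split on Study Hours.
Code
import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Class labels at the parent node (0 = fail, 1 = pass)
parent = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])

# Candidate split: Study Hours <= 3 sends the first three students to the left child
left, right = parent[:3], parent[3:]
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)

print(f"Parent Gini impurity: {gini(parent):.3f}")     # 0.420
print(f"Weighted Gini after split: {weighted:.3f}")    # 0.000 (a pure split)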
3.4.2 Python Example: Decision Tree Classifier
Code
from sklearn.tree import DecisionTreeClassifier

# Train the Decision Tree model
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred_dt = clf.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f"Decision Tree Accuracy: {accuracy_dt:.2f}")
Decision Tree Accuracy: 1.00
3.4.3 Example: Predicting Pass/Fail Using a Decision Tree
Let’s use a simulated dataset to build a decision tree for classifying students.
Code
import pandas as pd
import plotly.graph_objects as go
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Simulated dataset
data = {'Study Hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Previous Grades': [55, 60, 65, 70, 75, 80, 85, 90, 95, 100],
        'Pass': [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)

# Prepare the data for modeling
X = df[['Study Hours', 'Previous Grades']]
y = df['Pass']

# Fit the decision tree model
dt = DecisionTreeClassifier(max_depth=3)
dt.fit(X, y)

# Plot the training data colored by class using Plotly
# (this scatter shows the data the tree was fit on, not the tree structure itself)
fig = go.Figure(data=go.Scatter(x=X['Study Hours'], y=X['Previous Grades'],
                                mode='markers', marker=dict(color=y)))
fig.update_layout(title="Decision Tree: Study Hours and Previous Grades",
                  xaxis_title="Study Hours", yaxis_title="Previous Grades")
fig.show()

# Visualize the decision tree structure
#fig_tree = tree.plot_tree(dt, feature_names=['Study Hours', 'Previous Grades'], class_names=['Fail', 'Pass'], filled=True)
3.4.4 Advantages
Easy to understand and interpret.
Can handle both numerical and categorical data.
Non-parametric, hence no assumptions about data distribution.
3.4.5 Disadvantages
Prone to overfitting, especially with noisy data.
Can create biased trees if some classes dominate.
3.4.6 Engagement Question
What are the advantages of using decision trees? How do they differ from other classification models?
3.5 Random Forest
Random Forest is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees. It is designed to improve the stability and accuracy of decision trees. By averaging the predictions of numerous trees, Random Forest reduces the risk of overfitting, which is a common issue with single decision trees (Liu, Wang, and Zhang 2012). Additionally, it incorporates random feature selection during the construction of each tree, which enhances model robustness and helps in capturing diverse aspects of the data. This approach not only increases predictive performance but also provides insights into feature importance, aiding in feature selection and model interpretability (Liu, Wang, and Zhang 2012). Due to its versatility and effectiveness, Random Forest is widely used in various applications including medical diagnosis, financial forecasting, and image classification.
3.5.1 Working Principle
Random Forest introduces two sources of randomness:
It selects a random subset of features to split each node.
It trains each tree on a random subset of the training data (with replacement).
3.5.2 Python Example: Random Forest Classifier
Code
from sklearn.ensemble import RandomForestClassifier

# Train the Random Forest model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict and evaluate
y_pred_rf = rf_clf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf:.2f}")
Random Forest Accuracy: 1.00
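As noted above, a fitted Random Forest also exposes feature importance scores. A minimal sketch, reusing the simulated two-feature student dataset from Section 3.3.3:
Code
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Simulated student dataset (same values as in Section 3.3.3)
data = {'Study Hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Previous Grades': [55, 60, 65, 70, 75, 80, 85, 90, 95, 100],
        'Pass': [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)
X = df[['Study Hours', 'Previous Grades']]
y = df['Pass']

# Fit the forest and inspect its impurity-based feature importances
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
for name, importance in zip(X.columns, rf.feature_importances_):
    print(f"{name}: {importance:.2f}")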
3.5.3 Advantages
Reduces overfitting compared to individual decision trees.
Robust to outliers and noisy data.
Handles large datasets and high-dimensional data well.
3.5.4 Disadvantages
Can be less interpretable than individual decision trees.
Computationally more expensive due to the construction of multiple trees.
3.6 Support Vector Machines (SVM)
Support Vector Machines are powerful classifiers that work by finding the optimal hyperplane that best separates the classes in the feature space. SVMs are effective in high-dimensional spaces and are particularly suited for binary classification tasks.
3.6.1 Working Principle
SVM finds the hyperplane that maximizes the margin, which is the distance between the hyperplane and the nearest data points (support vectors) from either class. For non-linearly separable data, SVM uses kernel functions to project the data into higher dimensions where it becomes linearly separable.
3.6.2 Python Example: Support Vector Classifier
Code
from sklearn.svm import SVC

# Train the SVM model
svm_clf = SVC(kernel='linear', random_state=42)
svm_clf.fit(X_train, y_train)

# Predict and evaluate
y_pred_svm = svm_clf.predict(X_test)
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print(f"SVM Accuracy: {accuracy_svm:.2f}")
SVM Accuracy: 1.00
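The example above uses a linear kernel. For data that is not linearly separable, a kernel such as the radial basis function (RBF) can be used instead. The sketch below uses scikit-learn's synthetic make_moons data purely for illustration; it is not part of the chapter's student example.
Code
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Synthetic, non-linearly separable data
X_moons, y_moons = make_moons(n_samples=200, noise=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_moons, y_moons, test_size=0.3,
                                          random_state=42)

# The RBF kernel implicitly maps the data into a higher-dimensional space
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm_rbf.fit(X_tr, y_tr)

y_pred_rbf = svm_rbf.predict(X_te)
print(f"RBF-kernel SVM Accuracy: {accuracy_score(y_te, y_pred_rbf):.2f}")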
3.6.3 Advantages
Effective in high-dimensional spaces.
Robust to overfitting, especially with the use of regularization.
Can be adapted to non-linear data with kernel functions.
3.6.4 Disadvantages
Computationally intensive, particularly with large datasets.
Difficult to interpret, and the choice of kernel and its parameters can be hard to tune.
3.7 Metrics for Evaluating Classification Models
Evaluating the performance of classification models is critical to ensure they are reliable and effective in real-world applications. The choice of evaluation metrics depends on the specific characteristics of the problem, such as the class distribution and the relative costs of different types of errors; these considerations differ from those used for regression models. Common metrics include:
3.7.1 Common Evaluation Metrics
Accuracy: The proportion of correctly classified instances out of the total instances.
Precision: The proportion of instances predicted as positive that are actually positive.
Recall: The proportion of actual positive instances that are correctly identified.
F1-Score: The harmonic mean of precision and recall, balancing the two.
ROC AUC Score: The Area Under the Receiver Operating Characteristic Curve, which plots the True Positive Rate against the False Positive Rate. A higher AUC indicates a better-performing model.
Confusion Matrix: A table that provides a more detailed breakdown of the classification results, showing the counts of true positives, true negatives, false positives, and false negatives.
3.7.2 Python Example: Evaluating a Classification Model
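A minimal sketch of how these metrics can be computed with scikit-learn, assuming the test labels y_test and the decision tree predictions y_pred_dt from Section 3.4.2; predictions from any of the other models could be substituted.
Code
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# Evaluate the decision tree predictions from Section 3.4.2
print(f"Accuracy : {accuracy_score(y_test, y_pred_dt):.2f}")
print(f"Precision: {precision_score(y_test, y_pred_dt):.2f}")
print(f"Recall   : {recall_score(y_test, y_pred_dt):.2f}")
print(f"F1-Score : {f1_score(y_test, y_pred_dt):.2f}")
print(f"ROC AUC  : {roc_auc_score(y_test, y_pred_dt):.2f}")
print("Confusion matrix (rows: actual, columns: predicted):")
print(confusion_matrix(y_test, y_pred_dt))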
3.7.3 Considerations When Choosing Metrics
Imbalanced Data: When the classes are imbalanced, accuracy alone can be misleading. Precision, Recall, and the F1 Score become more informative.
Cost of Errors: In some applications, the cost of false positives and false negatives may differ. For example, in medical diagnosis, false negatives may be more critical to minimize.
Overfitting: It’s essential to validate the model on unseen data to ensure it generalizes well. Cross-validation is a common technique for this purpose.
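A minimal cross-validation sketch, assuming the feature matrix X and labels y from the simulated student dataset (only 3 folds are used here because that dataset is tiny):
Code
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# 3-fold cross-validated accuracy: each fold is held out once for evaluation
scores = cross_val_score(LogisticRegression(), X, y, cv=3, scoring='accuracy')
print(f"Fold accuracies : {scores}")
print(f"Mean CV accuracy: {scores.mean():.2f}")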
3.7.4 Engagement Question
Which metric would you prioritize in your classification model, and why? How do these metrics provide insights into model performance?
3.8 Hands-On Practice
Apply what you’ve learned to a new dataset. Try different classification models and evaluate their performance using the metrics discussed.
3.8.1 Exercise
Use a dataset of your choice (e.g., student demographics and performance) and build a classification model. Compare logistic regression, KNN, and decision trees, and evaluate them using accuracy, precision, recall, and F1-score.
3.9 Summary and Expectations
This chapter provided an in-depth exploration of essential classification models in supervised learning, including Logistic Regression, k-Nearest Neighbors, Decision Trees, Random Forests, and Support Vector Machines. Each model has its strengths and weaknesses, making them suitable for different types of classification problems. By understanding these algorithms and their applications, you are better equipped to tackle complex classification tasks in various domains.
In addition, we discussed the importance of evaluating classification models using metrics such as accuracy, precision, recall, and the ROC AUC score. Selecting the appropriate evaluation metrics is crucial for ensuring that your model performs well under real-world conditions.
Charbuty, Bahzad, and Adnan Abdulazeez. 2021. “Classification Based on Decision Tree Algorithm for Machine Learning.” Journal of Applied Science and Technology Trends 2 (01): 20–28.
Itauma, Itauma. 2019. “Pre-College Factors That Predict Intentions of Minority Females to Enroll in College STEM Programs.” PhD thesis, Keiser University.
Liu, Yanli, Yourong Wang, and Jian Zhang. 2012. “New Machine Learning Algorithm: Random Forest.” In Information Computing and Applications: Third International Conference, ICICA 2012, Chengde, China, September 14-16, 2012. Proceedings 3, 246–52. Springer.
Zhang, Zhongheng. 2016. “Introduction to Machine Learning: K-Nearest Neighbors.” Annals of Translational Medicine 4 (11).