4  Chapter 4: Unsupervised Learning - Clustering and Dimensionality Reduction

“Information is the oil of the 21st century, and analytics is the combustion engine”
– Peter Sondergaard

Unsupervised learning involves training a model on data without labeled responses. The goal is to uncover hidden patterns or intrinsic structures within the data. Unlike supervised learning, where the model learns from labeled examples, unsupervised learning operates purely on input data. Common techniques include clustering, which groups similar data points together, and dimensionality reduction, which compresses many features into a smaller set of informative ones. Clustering is widely used in applications like market segmentation, social network analysis, and image compression.

In this chapter, we delve into the concepts of clustering and dimensionality reduction, two fundamental unsupervised learning techniques. We explore popular algorithms such as K-Means, Hierarchical Clustering, DBSCAN, Principal Component Analysis (PCA), and t-Distributed Stochastic Neighbor Embedding (t-SNE), discussing their mathematical foundations, advantages, and limitations. We will also discuss how to evaluate the performance of clustering algorithms. Python examples using simulated datasets are provided to illustrate their application.

4.1 Clustering

Clustering involves organizing data points into groups (clusters) based on their similarities. Unlike supervised learning, there are no predefined labels; the model discovers the natural groupings in the data. This unsupervised technique helps in identifying inherent structures within the data, making it valuable for exploratory data analysis and pattern recognition (Itauma et al. 2015). Common clustering algorithms include k-means, which partitions the data into k clusters by minimizing the variance within each cluster, and hierarchical clustering, which builds a tree of clusters based on their similarities. Clustering can be applied in diverse fields such as market segmentation, where it helps in identifying distinct customer groups, and in image analysis, where it aids in segmenting objects within an image (Itauma et al. 2015). Despite its usefulness, clustering can be sensitive to the choice of parameters and initial conditions, which may affect the stability and interpretation of the results.

4.1.1 Example: Customer Segmentation in Education

Imagine an educational company that wants to segment its students based on their learning behaviors. Clustering can help identify different groups of students who share similar characteristics.

4.1.2 Engagement Question

  • How might clustering be used to uncover hidden patterns in your industry? Can you think of a scenario where grouping similar data points could be valuable?

4.1.3 k-Means Clustering

K-Means is one of the most widely used clustering algorithms. It partitions the data into \(k\) clusters by minimizing the variance within each cluster. The algorithm iteratively assigns each data point to the nearest cluster center and updates the cluster centers based on the mean of the assigned points. One of its advantages is its simplicity and efficiency, making it suitable for large datasets and applications requiring fast clustering (Xie and Zhang 2020). However, K-Means is sensitive to the initial placement of cluster centers and can converge to local minima, which can impact the quality of clustering results (Itauma et al. 2015). The algorithm also requires the number of clusters \(k\) to be specified in advance, which may not always be known, and determining the optimal \(k\) often involves methods like the Elbow Method or Silhouette Score (Itauma et al. 2015). Despite these limitations, K-Means remains a popular and effective tool for a wide range of clustering tasks.
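
To make the choice of \(k\) concrete, the short sketch below loops over several candidate values, printing the inertia used by the Elbow Method together with the silhouette score. It uses a synthetic make_blobs dataset rather than data from this chapter, so the exact numbers are illustrative only.

Code
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic example data with three well-separated groups
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=42)

# Try several values of k: inertia keeps decreasing as k grows (look for the
# "elbow"), while the silhouette score typically peaks near a suitable k
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={silhouette_score(X, km.labels_):.3f}")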

4.1.3.1 The k-Means Algorithm

The k-Means algorithm works as follows:

  1. Initialize \(k\) cluster centroids randomly.

  2. Assign each data point to the nearest centroid.

  3. Recalculate the centroids based on the mean of the data points assigned to each cluster.

  4. Repeat steps 2 and 3 until the centroids no longer change.
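
To make these four steps concrete, here is a minimal from-scratch sketch in NumPy. It is a simplified illustration rather than the scikit-learn implementation used later in the chapter, and the kmeans helper and toy array below are made up for this example.

Code
import numpy as np

def kmeans(X, k, n_iter=100, seed=42):
    """Bare-bones k-Means following the four steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct random points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to the nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (keeping the old centroid if a cluster happens to be empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Usage on a tiny toy dataset with two obvious groups
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)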

4.1.3.2 Mathematical Foundation

The K-Means algorithm seeks to minimize the following objective function:

\[ J = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2 \]

Where:

  • \(k\) is the number of clusters.

  • \(C_i\) is the set of points in cluster \(i\).

  • \(\mu_i\) is the mean of points in cluster \(i\).
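
As a quick check of this objective, the sketch below computes \(J\) directly from the definition on a tiny made-up array and compares it with the inertia_ attribute that scikit-learn's KMeans reports after fitting.

Code
import numpy as np
from sklearn.cluster import KMeans

# Tiny made-up dataset: two obvious groups of points
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# J computed directly from the definition above
J = sum(np.sum((X[km.labels_ == i] - center) ** 2)
        for i, center in enumerate(km.cluster_centers_))

# scikit-learn exposes the same quantity as `inertia_`
print(J, km.inertia_)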

4.1.3.3 Python Example: K-Means Clustering

Code
import pandas as pd
from sklearn.cluster import KMeans
import plotly.express as px

# Simulated dataset: Grouping students based on study habits
data = {
    'Study_Hours': [3, 10, 8, 5, 6, 1, 4, 7],
    'Participation': [1, 2, 2, 1, 2, 1, 1, 2]
}

df = pd.DataFrame(data)

# Applying K-Means
kmeans = KMeans(n_clusters=2, random_state=42)
df['Cluster'] = kmeans.fit_predict(df)

# Visualization
fig = px.scatter(df, x='Study_Hours', y='Participation', color='Cluster',
                 title="K-Means Clustering of Students")
fig.show()

4.1.3.4 Example: Segmenting Students Based on Study Patterns

Let’s use a simulated dataset to demonstrate k-Means clustering.

Code
import pandas as pd
import plotly.express as px
from sklearn.cluster import KMeans

# Simulated dataset
data = {
    'Study Hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Previous Grades': [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
}
df_ = pd.DataFrame(data)

# Apply k-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df_['Cluster'] = kmeans.fit_predict(df_[['Study Hours', 'Previous Grades']])

# Plot the clusters using Plotly
fig = px.scatter(df_, x='Study Hours', y='Previous Grades', color='Cluster', title='k-Means Clustering: Study Hours and Previous Grades')
fig.show()

4.1.3.5 Advantages

  • Simple to implement and computationally efficient.
  • Scales well to large datasets.
  • Works well when clusters are spherical and of similar size.

4.1.3.6 Disadvantages

  • Requires the number of clusters \(k\) to be specified in advance.
  • Sensitive to the initial placement of cluster centers.
  • Struggles with clusters of varying shapes and sizes.

4.1.3.7 Engagement Question

  • What factors should be considered when choosing the value of \(k\) in k-Means clustering? What might happen if \(k\) is too small or too large?

4.1.4 Hierarchical Clustering

Hierarchical clustering creates a tree-like structure (dendrogram) that represents nested groupings of data points. There are two main types: agglomerative (bottom-up) and divisive (top-down). Unlike k-Means, it does not require the number of clusters to be specified in advance.

4.1.4.1 The Hierarchical Clustering Process

There are two main approaches to hierarchical clustering:

  • Agglomerative (bottom-up): Starts with individual data points as clusters and merges them step by step.

  • Divisive (top-down): Starts with one large cluster and splits it into smaller clusters.

4.1.4.2 Mathematical Foundation

Agglomerative clustering merges clusters based on a distance metric, such as single linkage (minimum distance), complete linkage (maximum distance), or average linkage.
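
The sketch below shows how these linkage criteria are expressed in code: on a small made-up two-group dataset (not the student data used elsewhere in this chapter), it builds the linkage under several criteria with SciPy and cuts each tree into two flat clusters with fcluster.

Code
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up toy data: two tight groups of points
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 0.8],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Agglomerative clustering under different linkage criteria
for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)
    # Cut the dendrogram into two flat clusters
    labels = fcluster(Z, t=2, criterion='maxclust')
    print(method, labels)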

4.1.4.3 Python Example: Hierarchical Clustering

Code
from scipy.cluster.hierarchy import dendrogram, linkage
import plotly.figure_factory as ff

# Applying hierarchical clustering (Ward linkage); Z holds the full merge history
Z = linkage(df[['Study_Hours', 'Participation']], method='ward')

# Visualization: create_dendrogram expects the raw observations rather than
# the linkage matrix, so the Ward linkage is supplied via `linkagefun`
fig = ff.create_dendrogram(
    df[['Study_Hours', 'Participation']].values,
    orientation='top',
    labels=[str(i) for i in df.index],
    linkagefun=lambda d: linkage(d, method='ward')
)
fig.update_layout(title="Hierarchical Clustering Dendrogram")
fig.show()

4.1.4.4 Example: Grouping Students by Study Habits

Let’s use hierarchical clustering to group students based on their study hours and previous grades.

Code
from scipy.cluster.hierarchy import dendrogram, linkage
import plotly.figure_factory as ff

# Apply hierarchical clustering (Ward linkage)
linked = linkage(df_[['Study Hours', 'Previous Grades']], method='ward')

# Create a dendrogram from the raw observations, supplying the Ward linkage
# via `linkagefun` (create_dendrogram does not accept a linkage matrix)
fig = ff.create_dendrogram(
    df_[['Study Hours', 'Previous Grades']].values,
    orientation='top',
    labels=[str(i) for i in df_.index],
    linkagefun=lambda d: linkage(d, method='ward')
)
fig.update_layout(title='Hierarchical Clustering Dendrogram')
fig.show()

4.1.4.5 Advantages

  • No need to specify the number of clusters in advance.

  • Produces a hierarchy of clusters, useful for visual exploration.

  • Can capture complex relationships between data points.

4.1.4.6 Disadvantages

  • Computationally intensive for large datasets.

  • Results can be sensitive to the choice of distance metric and linkage method.

  • Merging decisions are irreversible.

4.1.4.7 Engagement Question

  • How does hierarchical clustering differ from k-Means? What are the advantages and disadvantages of each method?

4.1.5 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a clustering algorithm that groups together points that are closely packed while marking outliers as noise. It does not require the number of clusters to be specified in advance.

4.1.5.1 The DBSCAN Algorithm

DBSCAN works as follows:

  1. Identify points that have a high density of neighbors within a given radius (epsilon).

  2. Group these points into clusters.

  3. Mark points that do not belong to any cluster as noise.
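
As a rough illustration of step 1 only (not scikit-learn's actual DBSCAN routine, which is used in the next example), the sketch below counts each point's neighbors within a radius eps on a made-up dataset and flags points that meet a hypothetical min_samples threshold as core points.

Code
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up toy data: two dense groups plus one isolated point
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
              [9.0, 1.0]])

eps, min_samples = 0.5, 3  # hypothetical parameter choices for this toy data

# Count neighbours within eps (the radius query includes the point itself)
nn = NearestNeighbors(radius=eps).fit(X)
neighbor_counts = np.array(
    [len(idx) for idx in nn.radius_neighbors(X, return_distance=False)]
)

# A core point has at least min_samples points (counting itself) within eps;
# the isolated point at (9, 1) fails this test and would end up as noise
is_core = neighbor_counts >= min_samples
print(is_core)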

4.1.5.2 Example: Identifying Outliers in Student Performance

Let’s use DBSCAN to detect outliers in a dataset of student study hours and previous grades.

Code
from sklearn.cluster import DBSCAN

# Apply DBSCAN; points that belong to no cluster receive the label -1 (noise).
# On this evenly spaced toy data the nearest neighbours are about 5.1 units
# apart, so with eps=1.5 every point is flagged as noise, a reminder that
# eps, min_samples, and feature scaling strongly shape the result
dbscan = DBSCAN(eps=1.5, min_samples=2)
df_['DBSCAN Cluster'] = dbscan.fit_predict(df_[['Study Hours', 'Previous Grades']])

# Plot the results using Plotly
fig = px.scatter(df_, x='Study Hours', y='Previous Grades', color='DBSCAN Cluster', title='DBSCAN Clustering: Study Hours and Previous Grades')
fig.show()

4.1.5.3 Engagement Question

  • What are the strengths of DBSCAN compared to other clustering methods? When might you prefer to use DBSCAN?

4.2 Evaluating Clustering Models

Evaluating clustering models can be challenging because there are no true labels to compare against. However, several metrics can be used to assess the quality of the clusters.

4.2.1 Silhouette Score

The silhouette score measures how similar a data point is to its own cluster compared to other clusters. It is computed for each point and averaged over the dataset, and it ranges from -1 to 1, with higher values indicating better-defined clusters. For a single point:

\[ \text{Silhouette Score} = \frac{b - a}{\max(a, b)} \]

Where:

  • \(a\) is the average distance between a point and other points in its cluster.

  • \(b\) is the average distance between a point and points in the nearest cluster.

4.2.2 Davies-Bouldin Index

The Davies-Bouldin index measures the average similarity ratio of each cluster with its most similar cluster. Lower values indicate better clustering.

\[ \text{Davies-Bouldin Index} = \frac{1}{n}\sum_{i=1}^{n}\max_{j \neq i}\left(\frac{s_i + s_j}{d_{ij}}\right) \]

Where:

  • \(n\) is the number of clusters.

  • \(s_i\) and \(s_j\) are the average distances of the points in clusters \(i\) and \(j\) to their respective centroids.

  • \(d_{ij}\) is the distance between the centroids of clusters \(i\) and \(j\).
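
Both metrics are available in scikit-learn. The sketch below scores a k-Means clustering with each of them on a synthetic make_blobs dataset rather than the chapter's student data, so the exact values are illustrative only.

Code
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic example data with three groups
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Higher silhouette is better; lower Davies-Bouldin is better
print("Silhouette score:    ", round(silhouette_score(X, labels), 3))
print("Davies-Bouldin index:", round(davies_bouldin_score(X, labels), 3))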

4.2.3 Engagement Question

  • Which clustering evaluation metric would you use in your project, and why? What insights can these metrics provide?

4.3 Hands-On Practice

Now it’s your turn to apply clustering to a dataset of your choice. Experiment with different clustering algorithms and evaluate their performance using the metrics discussed.

4.3.1 Exercise

  • Use a dataset (e.g., student performance or customer data) and apply k-Means, hierarchical clustering, and DBSCAN. Compare the results using silhouette score and Davies-Bouldin index.

4.4 Dimensionality Reduction

“The ability to simplify means to eliminate the unnecessary so that the necessary may speak.”
– Hans Hofmann

As datasets grow in complexity, they often contain a large number of features. While more features can provide more information, they can also introduce noise and make the data harder to analyze. Dimensionality reduction is a technique used in machine learning to reduce the number of input variables while preserving the essential information.

In this section, we will explore the key concepts of dimensionality reduction, focusing on Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). These methods are widely used to simplify datasets, making them easier to visualize and analyze.

Dimensionality reduction techniques aim to reduce the number of features in a dataset by creating new features that capture the most important information. This is particularly useful in scenarios where the original features are highly correlated or redundant.

4.4.1 Example: Reducing Features in Educational Data

Consider a dataset with multiple features representing student performance, such as test scores, attendance, participation, and assignments. Dimensionality reduction can help to simplify this data, making it easier to identify patterns and trends.

4.4.2 Engagement Question

  • Why might dimensionality reduction be necessary in a machine learning project? Can you think of a scenario where too many features could hinder model performance?

4.4.3 Principal Component Analysis (PCA)

PCA is a technique used to reduce the dimensionality of a dataset while preserving as much variance as possible. It transforms the data into a new coordinate system, where the axes (principal components) are ordered by the amount of variance they capture. PCA is widely used for data visualization and as a preprocessing step before applying other machine learning algorithms.

4.4.3.1 Mathematical Foundation

PCA involves the following steps:

  1. Standardize the data.

  2. Compute the covariance matrix of the data.

  3. Perform eigenvalue decomposition of the covariance matrix.

  4. Select the top k eigenvectors (principal components) that correspond to the largest eigenvalues.
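
To make these steps concrete, here is a minimal NumPy sketch of PCA via eigendecomposition of the covariance matrix. It is a simplified illustration on a made-up array, not the scikit-learn implementation used in the examples below.

Code
import numpy as np

# Made-up data: 6 samples with 3 correlated features
X = np.array([[2.0,  4.1, 1.0],
              [3.0,  6.2, 1.5],
              [4.0,  7.9, 2.1],
              [5.0, 10.1, 2.4],
              [6.0, 12.2, 3.0],
              [7.0, 13.8, 3.6]])

# Step 1: standardize the data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvalue decomposition (eigh, since the covariance matrix is symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: keep the k eigenvectors with the largest eigenvalues and project
k = 2
order = np.argsort(eigvals)[::-1][:k]
components = eigvecs[:, order]
X_reduced = X_std @ components

print("Explained variance ratio:", eigvals[order] / eigvals.sum())
print(X_reduced)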

4.4.3.2 Python Example: PCA

Code
from sklearn.decomposition import PCA

# Applying PCA to the two study-habit features (in practice the features
# would usually be standardized first; with two inputs and two components
# this is essentially a rotation, shown here to illustrate the API)
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df[['Study_Hours', 'Participation']])

# Visualization
fig = px.scatter(x=df_pca[:, 0], y=df_pca[:, 1],
                 title="PCA of Student Data")
fig.show()

4.4.3.3 Example: PCA on Student Performance Data

Let’s apply PCA to a simulated dataset of student performance to reduce the number of features.

Code
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import plotly.express as px

# Simulated dataset
data = {
    'Test Scores': [85, 78, 92, 70, 88, 76, 95, 89, 82, 77],
    'Attendance': [90, 85, 95, 80, 88, 83, 97, 91, 87, 84],
    'Participation': [80, 75, 85, 60, 78, 70, 88, 82, 75, 72],
    'Assignments': [88, 82, 90, 72, 86, 79, 92, 85, 80, 77]
}
df = pd.DataFrame(data)

# Standardize the features first (step 1 above), then apply PCA
X_scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_scaled)
df_pca = pd.DataFrame(data=principal_components, columns=['Principal Component 1', 'Principal Component 2'])

# Plot the principal components using Plotly
fig = px.scatter(df_pca, x='Principal Component 1', y='Principal Component 2', title='PCA on Student Performance Data')
fig.show()

4.4.3.4 Advantages

  • Reduces computational complexity by lowering the number of dimensions.

  • Helps mitigate the curse of dimensionality.

  • Can reveal hidden structures in high-dimensional data.

4.4.3.5 Disadvantages

  • PCA assumes linear relationships among features.

  • Can be sensitive to outliers.

  • Principal components may not always have clear interpretability.

4.4.3.6 Engagement Question

  • What does each principal component represent in the context of the original features? How can PCA help in interpreting the data?

4.4.4 t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction technique that is particularly useful for visualizing high-dimensional data. Unlike PCA, t-SNE focuses on preserving the local structure of the data, making it effective for revealing clusters and patterns that might not be apparent in higher dimensions.

4.4.4.1 t-SNE Algorithm

The t-SNE algorithm works as follows:

  1. Convert the high-dimensional Euclidean distances between data points into conditional probabilities that represent similarities.

  2. Define a similar probability distribution in a lower-dimensional space.

  3. Minimize the Kullback-Leibler divergence between the two distributions using gradient descent.

4.4.4.2 Mathematical Foundation

t-SNE minimizes the Kullback-Leibler divergence between the joint probabilities of the high-dimensional data and the low-dimensional embedding. It does this by converting distances between points into probabilities that represent similarities.
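
Written out, with \(p_{ij}\) denoting the pairwise similarities in the original space and \(q_{ij}\) those in the low-dimensional embedding, the cost function t-SNE minimizes is

\[ C = \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}} \]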

4.4.4.3 Python Example: t-SNE

Code
from sklearn.manifold import TSNE

# Applying t-SNE to the study-hours and grades data (df_ from the earlier
# clustering examples); perplexity must be smaller than the number of
# samples, so a small value is used for this 10-student toy dataset
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
df_tsne = tsne.fit_transform(df_[['Study Hours', 'Previous Grades']])

# Visualization
fig = px.scatter(x=df_tsne[:, 0], y=df_tsne[:, 1], title="t-SNE of Student Data")
fig.show()

4.4.4.4 Example: Visualizing High-Dimensional Student Data with t-SNE

Let’s use t-SNE to visualize the same student performance dataset in a 2D space.

Code
from sklearn.manifold import TSNE

# Apply t-SNE to the four performance features; perplexity is again kept
# below the number of samples (10 students)
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
tsne_components = tsne.fit_transform(df)
df_tsne = pd.DataFrame(data=tsne_components, columns=['t-SNE Component 1', 't-SNE Component 2'])

# Plot the t-SNE components using Plotly
fig = px.scatter(df_tsne, x='t-SNE Component 1', y='t-SNE Component 2', title='t-SNE on Student Performance Data')
fig.show()

4.4.4.5 Advantages

  • Excellent for visualizing high-dimensional data.

  • Can capture complex, non-linear relationships.

  • Effective at separating clusters in lower-dimensional spaces.

4.4.4.6 Disadvantages

  • Computationally expensive and sensitive to hyperparameters.

  • Not suitable for large datasets without dimensionality reduction beforehand.

  • The resulting map may vary across different runs.

4.4.4.7 Engagement Question

  • How does t-SNE differ from PCA in terms of the type of data structure it preserves? When would you choose to use t-SNE over PCA?

4.4.5 Comparing PCA and t-SNE

While both PCA and t-SNE are used for dimensionality reduction, they serve different purposes and have distinct strengths and weaknesses.

4.4.5.1 When to Use PCA

  • PCA is useful when you need to reduce dimensionality while preserving as much variance as possible. It works best when the relationships among the features are approximately linear.

4.4.5.2 When to Use t-SNE

  • t-SNE is better suited for visualizing high-dimensional data in lower dimensions, particularly when you are interested in preserving the local structure of the data.

4.4.5.3 Engagement Question

  • Given a dataset with many features, how would you decide whether to use PCA or t-SNE for dimensionality reduction?

4.4.6 Hands-On Practice

In this section, you will apply both PCA and t-SNE to a dataset of your choice. Compare the results and analyze which method is more effective for your specific application.

4.4.6.1 Exercise

  • Use a dataset (e.g., student performance or customer data) and apply PCA and t-SNE. Visualize the results and discuss the differences in the information captured by each method.

4.5 Summary and Expectations

In this chapter, we explored unsupervised learning techniques, focusing on clustering and dimensionality reduction. These methods are invaluable for discovering patterns and simplifying complex data structures without the need for labeled data. You should now be familiar with k-Means, hierarchical clustering, and DBSCAN, know how to evaluate clustering models with metrics such as the silhouette score and the Davies-Bouldin index, and understand how PCA and t-SNE reduce dimensions and help visualize high-dimensional datasets. Together, these techniques equip you with essential tools for tackling a wide range of data analysis challenges.

By mastering these unsupervised learning techniques, you will be better prepared to uncover hidden patterns and insights in complex datasets, a crucial skill in the field of data science and machine learning.

Continue exploring these concepts and applying them to real-world scenarios. Clustering opens the door to discovering hidden patterns and insights in data that might otherwise go unnoticed, and practicing dimensionality reduction on real-world datasets will build your confidence in deciding when and how to simplify data, a crucial skill in machine learning.

4.5.1 Key Takeaways

  • Clustering is an unsupervised learning technique used to group similar data points.

  • Different clustering algorithms have distinct strengths and are suitable for different types of data.

  • Evaluating clustering models requires specific metrics that assess the quality of the clusters.

  • Dimensionality reduction simplifies datasets by reducing the number of features while retaining important information.

  • PCA is a linear technique that captures the most variance in the data, while t-SNE is a non-linear method that preserves the local structure.

  • Choosing the right dimensionality reduction technique depends on the nature of the data and the specific goals of the analysis.