2  Chapter 2: Supervised Learning - Regression

“Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc., to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.”
– Clive Humby

Supervised learning is a foundational approach in machine learning, where models are trained on labeled data. In this chapter, we will delve into regression, a type of supervised learning that predicts continuous outcomes based on input features. We will explore key concepts such as linear regression, polynomial regression, and other regression models, and discuss how to evaluate these models using metrics like Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).

2.1 Introduction to Regression

Regression analysis is a statistical technique for modeling the relationship between a dependent variable and one or more independent variables. It is widely used in business and education to predict outcomes such as sales and student performance. By quantifying the strength and direction of these relationships, regression supports informed decisions and helps optimize strategies. Regression models come in several types, including simple linear regression (one independent variable) and multiple regression (several independent variables), which differ in complexity and in the relationships they can represent (Itauma 2024). Regression can also be extended to non-linear relationships through techniques such as polynomial regression, which capture more complex patterns in the data. This versatility makes regression a fundamental tool in data analysis across diverse fields.

2.1.1 Example: Predicting Student Performance

Consider a scenario where an educational institution wants to predict students’ final exam scores based on their study hours and previous test scores. Regression can be used to model this relationship and make predictions.
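As a rough illustration of this scenario, the sketch below fits a regression with two input features using scikit-learn. The numbers are invented for illustration, and the final scores are generated from a known rule (`10 + 3 * hours + 0.6 * previous`, with no noise) so that the fitted coefficients can be checked against it; real data would not be this tidy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data, invented for illustration only
study_hours = np.array([2, 4, 6, 8, 3, 7, 5, 9])
previous_scores = np.array([60, 55, 70, 65, 80, 50, 75, 85])

# Assumed generating rule (no noise), so the fit can be verified:
# final = 10 + 3 * hours + 0.6 * previous
final_scores = 10 + 3 * study_hours + 0.6 * previous_scores

# Each row is one student; each column is one input feature
X = np.column_stack([study_hours, previous_scores])
model = LinearRegression().fit(X, final_scores)

hours_coef, previous_coef = model.coef_
intercept = model.intercept_
print(f"intercept={intercept:.2f}, hours={hours_coef:.2f}, previous={previous_coef:.2f}")

# Predict the final score of a new (hypothetical) student
new_student = np.array([[6, 72]])
predicted_final = model.predict(new_student)[0]
print(f"Predicted final score: {predicted_final:.1f}")
```

Each coefficient is read as the expected change in the final score for a one-unit increase in that feature, holding the other feature fixed.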

2.1.2 Engagement Question

  • How might regression be useful in your specific field or industry? Can you think of a scenario where predicting a continuous outcome would be valuable?

2.2 Linear Regression

Linear regression is the simplest form of regression, where the relationship between the independent variable (input) and the dependent variable (output) is modeled as a straight line.

2.2.1 The Linear Regression Model

The linear regression model can be expressed as:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

Where:

  • \(y\) is the dependent variable (e.g., student score).

  • \(x\) is the independent variable (e.g., study hours).

  • \(\beta_0\) is the intercept.

  • \(\beta_1\) is the slope of the line.

  • \(\epsilon\) is the error term.

2.2.2 Example: Predicting Scores Based on Study Hours

Let’s use a simulated dataset to demonstrate linear regression.

import pandas as pd
import plotly.express as px
from sklearn.linear_model import LinearRegression

# Simulated dataset
data = {
    'Study Hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Scores': [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
}
df = pd.DataFrame(data)

# Prepare the data for modeling
X = df[['Study Hours']]
y = df['Scores']

# Fit the model
model = LinearRegression()
model.fit(X, y)

# Make predictions
predictions = model.predict(X)

# Add predictions to the DataFrame
df['Predicted Scores'] = predictions

# Plot using Plotly for an interactive experience
fig = px.scatter(df, x='Study Hours', y='Scores', title='Study Hours vs. Scores with Regression Line')
fig.add_scatter(x=df['Study Hours'], y=df['Predicted Scores'], mode='lines', name='Regression Line')
fig.show()
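To see what the fitted line looks like numerically, the slope and intercept can also be computed by hand with the standard least-squares formulas, \(\hat{\beta}_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). The following is a minimal sketch using NumPy on the same simulated data; the variable names are chosen here for illustration. Because the simulated scores rise by exactly 5 points per study hour, the estimates should come out as a slope of 5 and an intercept of 45.

```python
import numpy as np

# Same simulated data as in the example above
hours = np.arange(1, 11)
scores = np.array([50, 55, 60, 65, 70, 75, 80, 85, 90, 95])

x_mean, y_mean = hours.mean(), scores.mean()

# Least-squares estimates for simple linear regression
slope = np.sum((hours - x_mean) * (scores - y_mean)) / np.sum((hours - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The same values are available from a fitted scikit-learn `LinearRegression` object through its `coef_` and `intercept_` attributes.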

2.2.3 Engagement Question

  • In what scenarios would polynomial regression be more appropriate than linear regression? Can you identify any patterns in the data that suggest a non-linear relationship?

2.3 Other Regression Models

Beyond linear and polynomial regression, there are several other regression models used to address specific types of data and relationships.

2.3.1 Ridge and Lasso Regression

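  • Ridge Regression: Adds a penalty proportional to the sum of the squared coefficients (an L2 penalty), which shrinks coefficients toward zero and helps prevent overfitting.

  • Lasso Regression: Similar to ridge, but penalizes the sum of the absolute values of the coefficients (an L1 penalty). This can shrink some coefficients exactly to zero, effectively performing feature selection.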
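The difference between the two penalties is easiest to see on data where only some features matter. The sketch below is an illustration under stated assumptions, not a recommended configuration: the data are synthetic (five features, of which only the first two influence the target), and the `alpha` values are arbitrary choices made here to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data (an assumption for illustration): 5 features, only 2 are relevant
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)   # no penalty
ridge = Ridge(alpha=10.0).fit(X, y)  # penalty on squared coefficients
lasso = Lasso(alpha=0.5).fit(X, y)   # penalty on absolute coefficients

print("OLS:  ", np.round(ols.coef_, 3))
print("Ridge:", np.round(ridge.coef_, 3))
print("Lasso:", np.round(lasso.coef_, 3))

# Count how many coefficients lasso set exactly to zero
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("Coefficients set to zero by lasso:", n_zero_lasso)
```

In practice, `alpha` is usually tuned with cross-validation rather than fixed by hand.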

2.3.2 Logistic Regression

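Logistic regression is used when the dependent variable is binary (e.g., pass/fail). Despite its name, it is a classification method rather than a regression method in the sense used in this chapter: it models the probability that an observation belongs to one of the two classes, and a threshold (commonly 0.5) turns that probability into a predicted class.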
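A minimal sketch of the pass/fail case, assuming scikit-learn's `LogisticRegression` with its default settings. The pass/fail labels below are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: study hours and whether the student passed (1) or failed (0)
hours = np.arange(1, 11).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)

# Estimated probability of passing for each number of study hours
pass_probability = clf.predict_proba(hours)[:, 1]

# Predicted class: 1 when the estimated probability exceeds 0.5, otherwise 0
predicted_label = clf.predict(hours)

for h, p, label in zip(hours.ravel(), pass_probability, predicted_label):
    print(f"{h:2d} hours -> P(pass) = {p:.2f}, predicted = {label}")
```

The output of interest is the probability column: it is a continuous quantity between 0 and 1, while the final prediction is a class label.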

2.3.3 Engagement Question

  • How might different types of regression models help in scenarios where linear or polynomial regression falls short?

2.4 Metrics for Evaluating Regression Models

Evaluating the performance of regression models is crucial. Common metrics include:

2.4.1 Mean Squared Error (MSE)

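\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

Here \(n\) is the number of observations, \(y_i\) is the observed value, and \(\hat{y}_i\) is the model’s prediction for observation \(i\). Because the errors are squared, large errors are penalized more heavily than small ones.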

2.4.2 Root Mean Squared Error (RMSE)

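\[ \text{RMSE} = \sqrt{\text{MSE}} \]

RMSE is expressed in the same units as the dependent variable, which makes it easier to interpret than MSE.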

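2.4.3 R-squared (\(R^2\))

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \]

Here \(y_i\) are the observed values, \(\hat{y}_i\) are the predictions, and \(\bar{y}\) is the mean of the observed values. \(R^2\) measures the proportion of the variance in the dependent variable that the model explains: a value of 1 indicates a perfect fit, and a value of 0 means the model does no better than always predicting the mean.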
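The three formulas can be checked on a small example. The sketch below computes MSE, RMSE, and \(R^2\) directly with NumPy; the observed and predicted values are made up for illustration.

```python
import numpy as np

# Hypothetical observed and predicted scores
y_true = np.array([50.0, 60.0, 70.0, 80.0])
y_pred = np.array([52.0, 58.0, 71.0, 83.0])

residuals = y_true - y_pred

# Mean Squared Error: average of the squared residuals
mse = np.mean(residuals ** 2)

# Root Mean Squared Error: square root of MSE, in the units of y
rmse = np.sqrt(mse)

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}, R^2 = {r2:.3f}")
```

For these values the squared residuals are 4, 4, 1, and 9, so MSE is 18 / 4 = 4.5, RMSE is about 2.12, and \(R^2\) is 1 - 18 / 500 = 0.964. scikit-learn provides `mean_squared_error` and `r2_score` in `sklearn.metrics` for the same calculations.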

2.4.4 Engagement Question

  • Which metric would you prioritize when evaluating your model, and why? How do these metrics provide insights into model performance?

2.5 Hands-On Practice

Apply what you’ve learned to a new dataset. Try different regression models and evaluate their performance using the metrics discussed.

2.5.1 Exercise

  • Use a dataset of your choice (e.g., student attendance and performance) and build a regression model. Compare linear and polynomial regression, and evaluate them using MSE and RMSE.
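One possible starting point for this exercise is sketched below. It assumes a synthetic dataset with a curved relationship (generated here, not taken from any real source) and models polynomial regression as a scikit-learn pipeline of `PolynomialFeatures` followed by `LinearRegression`. Replace the synthetic data with your own dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): a quadratic trend plus noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 60)
y = 2 + 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)
X = x.reshape(-1, 1)

# Hold out part of the data so models are compared on unseen observations
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

linear = LinearRegression().fit(X_train, y_train)
quadratic = make_pipeline(
    PolynomialFeatures(degree=2), LinearRegression()
).fit(X_train, y_train)

# Evaluate both models with MSE and RMSE on the held-out data
results = {}
for name, fitted in [("linear", linear), ("quadratic", quadratic)]:
    mse = mean_squared_error(y_test, fitted.predict(X_test))
    results[name] = {"MSE": mse, "RMSE": np.sqrt(mse)}
    print(f"{name:>9}: MSE = {mse:.2f}, RMSE = {np.sqrt(mse):.2f}")
```

Evaluating on held-out data matters here: on the training data alone, a higher-degree polynomial never has a larger error than the straight line, even when the extra flexibility is only fitting noise.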

2.6 Summary and Expectations

In this chapter, we’ve explored the foundational concepts of regression in supervised learning. You should now be comfortable with linear and polynomial regression, understand when to use different models, and know how to evaluate them using appropriate metrics.

2.6.1 Key Takeaways

  • Regression models predict continuous outcomes and are fundamental in many business and educational contexts.

  • Different types of regression models are suited to different types of data and relationships.

  • Evaluating models using metrics like MSE, RMSE, and \(R^2\) helps ensure their effectiveness.

2.6.2 Engagement Question

  • Reflect on how you might apply regression analysis in your field. What kinds of data might you work with, and how would you use these models to make predictions?