Linear Regression with Ordinary Least Squares (OLS) in Machine Learning with scikit-learn

Ordinary Least Squares (OLS) in Machine Learning is a method used to train linear regression models. In essence, it seeks to minimize the sum of the squares of the differences between the values predicted by the model and the actual values observed in the training dataset. This approach is very common and is the basis of many linear regression models.

Ordinary Least Squares (OLS)

Suppose we have a training data set composed of \( n \) observations. Each observation consists of a pair of values: an independent variable \( x_i \) and the corresponding dependent variable \( y_i \).

The goal of the linear regression model is to find the best regression line that minimizes the sum of the squares of the differences between the values predicted by the model and the actual observed values.

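The regression line has the equation:

\[ \hat{y} = \beta_0 + \beta_1 x \]

where \( \beta_0 \) is the intercept and \( \beta_1 \) is the slope.

To train the linear regression model using OLS, we need to find the optimal values for \( \beta_0 \) and \( \beta_1 \) that minimize the sum of squared errors (SSE):

\[ SSE = \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2 \]

The OLS approach is to find the values of \( \beta_0 \) and \( \beta_1 \) that minimize this cost function. This can be done by taking the partial derivatives of the SSE function with respect to \( \beta_0 \) and \( \beta_1 \), setting them to zero, and solving the resulting equations. The resulting solutions are called the ordinary least squares estimates of the regression coefficients.

The formulas for calculating the ordinary least squares estimates are:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

where \( \bar{x} \) and \( \bar{y} \) are the sample means of the \( x_i \) and the \( y_i \).

These formulas provide us with optimal values for the regression coefficients \( \beta_0 \) and \( \beta_1 \), which minimize the sum of squared errors. Once we have these values, we can use the regression line to make predictions on new data.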
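The closed-form estimates for a single predictor can be checked numerically: the slope is the sum of the products of the deviations of x and y from their means divided by the sum of squared deviations of x, and the intercept is the mean of y minus the slope times the mean of x. The following is a minimal sketch on synthetic data. The variable names (beta_0, beta_1), the seed, and the data-generating line y = 4 + 3x are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 4 + 3x + unit Gaussian noise
rng = np.random.default_rng(0)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.standard_normal(100)

# Closed-form OLS estimates for simple linear regression
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

# Cross-check against NumPy's degree-1 least-squares polynomial fit,
# which returns the coefficients from highest degree to lowest
slope_ref, intercept_ref = np.polyfit(x, y, 1)

print("beta_0 (intercept):", beta_0, "reference:", intercept_ref)
print("beta_1 (slope):", beta_1, "reference:", slope_ref)
```

Both pairs of values should agree up to floating-point error, and both should land near the true intercept and slope used to generate the data.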

Ordinary Least Squares (OLS) and Linear Regression in scikit-learn

The scikit-learn library does not provide a specific class called “Ordinary Least Squares (OLS)” because the OLS algorithm is implicitly implemented in scikit-learn’s LinearRegression class.

The LinearRegression class of scikit-learn uses the ordinary least squares approach to train the linear regression model. When you call the fit() method on a LinearRegression object, the model is trained using the OLS algorithm to find the optimal coefficients that minimize the sum of squared errors.

So, to use the OLS approach in scikit-learn, you just need to use the LinearRegression class and call the fit() method on the training dataset.
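One way to convince yourself that LinearRegression solves an ordinary least squares problem is to compare its coefficients with those of a generic least-squares solver. The sketch below does this with np.linalg.lstsq on a design matrix that has a column of ones for the intercept. The synthetic data, the seed, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): y = 4 + 3x + unit Gaussian noise
rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(100)

# Fit with scikit-learn (fit_intercept=True is the default)
model = LinearRegression().fit(X, y)

# Solve the same least-squares problem directly:
# the first column of ones plays the role of the intercept
A = np.column_stack([np.ones(len(X)), X[:, 0]])
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept_lstsq, slope_lstsq = solution

print("scikit-learn:", model.intercept_, model.coef_[0])
print("lstsq:       ", intercept_lstsq, slope_lstsq)
```

The two rows should match up to floating-point error.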

Here is an example of using the LinearRegression class to train a linear regression model using the OLS approach:

import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Generating synthetic data
X_train = 2 * np.random.rand(100, 1)  # Independent variable
y_train = 4 + 3 * X_train + np.random.randn(100, 1)  # Dependent variable with Gaussian noise

# Training the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Printing the model coefficients
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)

# Predictions on new data
X_test = np.array([[0], [1], [2]])
y_test = 4 + 3 * X_test  # Noise-free target values for the test points
predictions = model.predict(X_test)

# Visualizing the data and the regression line
plt.scatter(X_train, y_train, color='blue')
plt.plot(X_test, predictions, color='red')
plt.xlabel('Independent Variable')
plt.ylabel('Dependent Variable')
plt.title('Linear Regression with scikit-learn')
plt.show()

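Running this code produces output similar to the following (the exact values change from run to run because the data are random):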

Intercept: [4.22215108]
Coefficients: [[2.96846751]]

Evaluation of the validity of the model

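To evaluate the goodness of fit of a linear regression model, several methods can be used. The most common, and the ones applied in this section, are the mean squared error (MSE), the coefficient of determination (R^2), and the plot of the residuals.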

Additionally, it is always important to consider the specific context of the problem and how the evaluation metrics translate to the context of the problem domain. For example, in some cases, a certain amount of error may be acceptable, while in other cases even a small error can have significant consequences.

Let’s now apply these evaluation methods on our linear regression example.

from sklearn.metrics import mean_squared_error, r2_score

# Computing Mean Squared Error (MSE) on training and test sets
train_predictions = model.predict(X_train)
train_mse = mean_squared_error(y_train, train_predictions)
print("Mean Squared Error on training set:", train_mse)

test_predictions = model.predict(X_test)
test_mse = mean_squared_error(y_test, test_predictions)
print("Mean Squared Error on test set:", test_mse)

# Computing R^2 score on training and test sets
train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)

print("R^2 score on training set:", train_r2)
print("R^2 score on test set:", test_r2)

Running this code gives the following result:

Mean Squared Error on training set: 0.9924386487246479
Mean Squared Error on test set: 0.03699831140189244
R^2 score on training set: 0.7469629925504755
R^2 score on test set: 0.9938336147663512

Here’s how we can evaluate the results:

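Mean Squared Error (MSE): the average squared difference between the predicted and the actual values, where lower is better. The value of about 0.99 on the training set is close to the variance of the Gaussian noise added to the data (which is 1), so the remaining error is essentially the noise itself. The value of about 0.04 on the test set is much smaller because the test targets were generated without noise.

Coefficient of determination (R^2): the proportion of the variance of the dependent variable that the model explains, where values closer to 1 are better. The model explains about 75% of the variance on the training set and about 99% on the test set.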

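Overall, the results indicate that the linear regression model fits the data well: the estimated coefficients are close to the true values (4 and 3) used to generate the data, and the training error is at the level of the added noise. The test metrics should be read with caution, however, because the test set contains only three noise-free points. It is always advisable to also evaluate other aspects of the model and consider the specific context of the problem before drawing definitive conclusions on its validity.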
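To make clear what the two metrics measure, they can also be computed by hand. The sketch below uses a small set of made-up values (y_true and y_pred are hypothetical, not taken from the example above) and compares the manual results with scikit-learn's mean_squared_error and r2_score.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

residuals = y_true - y_pred

# MSE: mean of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE:", mse_manual, "sklearn:", mean_squared_error(y_true, y_pred))
print("R^2:", r2_manual, "sklearn:", r2_score(y_true, y_pred))
```

For these values the MSE is 0.1875 and R^2 is 0.9625, and the manual and scikit-learn results coincide.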

The Residuals Graph

Let’s now see how to evaluate the goodness of our linear regression model graphically, through the residual plot.

# Calculation of residuals
train_residuals = y_train - train_predictions
test_residuals = y_test - test_predictions

# Plot of residuals
plt.figure(figsize=(10, 6))
plt.scatter(train_predictions, train_residuals, color='blue', label='Training set')
plt.scatter(test_predictions, test_residuals, color='red', label='Test set')
plt.axhline(y=0, color='black', linestyle='--')
plt.title('Residual Plot')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend()
plt.show()

In this code, we calculate residuals by subtracting model predictions from actual values for both the training set and the test set. Then, we plot the residuals against the model predictions on a scatterplot. The y-axis represents the residuals and the x-axis represents the model predictions. The black dashed line indicates the zero reference line, which represents the situation where the residuals are zero.

Examining the plot of residuals can help identify any patterns or structures in the residuals, such as heteroscedasticity (a non-constant variance of the residuals), which can provide additional information on how good the model is.

Heteroscedasticity and homoscedasticity are two important concepts that concern the variance of errors (or residuals) in regression models. These concepts are fundamental to evaluate the goodness of the model and to ensure that the model’s assumptions are satisfied.

Heteroscedasticity: Occurs when the variance of the errors is not constant across all ranges of values of the independent variables. In other words, the dispersion of errors varies non-uniformly along the range of the independent variables. This can manifest as a cone or fan shape in the residuals when plotted against predicted values. Heteroscedasticity can lead to less precise (inefficient) estimates of the regression coefficients and to erroneous estimates of their standard errors.

Homoscedasticity: Occurs when the variance of the errors is constant across all ranges of values of the independent variables. In other words, the dispersion of errors is uniform along the range of the independent variables. This is desirable because it is one of the assumptions under which the standard errors and confidence intervals of an OLS model are valid, and it means that the model's predictions are equally reliable across the full range of the independent variables.

In summary, heteroscedasticity means that the dispersion of the errors changes along the range of the data, while homoscedasticity means that it stays constant. It is important to identify and correct for heteroscedasticity when evaluating the goodness of a regression model, as it can affect the reliability of the model’s estimates and of the conclusions drawn from them.
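The difference between the two situations can be made concrete with a small simulation. The sketch below fits the same model to one data set with constant noise and to one whose noise grows with x, then compares the spread of the residuals in the upper and lower half of the predictions. The helper spread_ratio, the noise model 0.2 + 1.5x, and the seed are all assumptions introduced for illustration. This is an informal diagnostic, not a formal statistical test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000
X = 2 * rng.random((n, 1))
x = X[:, 0]

def spread_ratio(y):
    """Fit OLS, then return the standard deviation of the residuals in the
    upper half of the predictions divided by that in the lower half."""
    predictions = LinearRegression().fit(X, y).predict(X)
    residuals = y - predictions
    upper = predictions > np.median(predictions)
    return residuals[upper].std() / residuals[~upper].std()

# Homoscedastic data: the noise has the same standard deviation everywhere
y_homo = 4 + 3 * x + rng.standard_normal(n)

# Heteroscedastic data: the noise standard deviation grows with x
y_hetero = 4 + 3 * x + rng.standard_normal(n) * (0.2 + 1.5 * x)

ratio_homo = spread_ratio(y_homo)
ratio_hetero = spread_ratio(y_hetero)

print("Spread ratio, constant noise:", ratio_homo)
print("Spread ratio, growing noise: ", ratio_hetero)
```

A ratio close to 1 is what a homoscedastic model produces, while a ratio well above 1 corresponds to the fan shape described above.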

Looking at our residuals plot, we can see that the red points are all equidistant from the central black line but all underneath. Now in our case we only used 3 points, but let’s assume that we used many of them and they all behaved like the three points in the example.

If the red dots of the test set are all at a similar distance from the black line of the residuals plot, but are all positioned below it, this suggests a systematic bias rather than heteroscedasticity: the model is consistently overestimating the test values, here because the fitted intercept (about 4.22) is slightly higher than the true value of 4. Heteroscedasticity, which occurs when the variance of the errors is not constant over all ranges of values of the independent variables, would instead show up as a spread of residuals around the zero line that widens or narrows as the predictions grow. Residuals that stay at a roughly constant distance from the line give no sign of that.
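The position of the three red points can be reproduced directly from the coefficients printed earlier. The sketch below hard-codes the intercept and slope from that output (they come from one particular random run, so another run would give different numbers) and recomputes the test residuals.

```python
import numpy as np

# Coefficients copied from the output printed earlier in the article
intercept = 4.22215108
slope = 2.96846751

# The three noise-free test points used in the example
x_test = np.array([0.0, 1.0, 2.0])
y_test = 4 + 3 * x_test

predictions = intercept + slope * x_test
residuals = y_test - predictions  # actual minus predicted

print("Residuals:", residuals)
print("Mean residual:", residuals.mean())
print("Test MSE:", np.mean(residuals ** 2))
```

All three residuals are negative, at roughly -0.22, -0.19 and -0.16, and their mean square matches the test MSE of about 0.037 reported above.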

When to use Ordinary Least Squares (OLS) with Linear Regression?

The choice to use Linear Regression over other regression methods depends on several factors, including the nature of the problem, the relationship between the variables, the distribution of the data, and the objectives of the analysis. Here are some scenarios where linear regression might be preferable to other regression methods provided by scikit-learn:

However, there are cases where other regression methods provided by scikit-learn might be more suitable, for example:

In summary, the choice between linear regression and other regression methods depends on the specific characteristics of the problem, the model assumptions, and the objectives of the analysis. It is important to carefully review these considerations before selecting the most appropriate regression method for a given scenario.

