Ordinary Least Squares (OLS) is the method most commonly used to train linear regression models in Machine Learning. In essence, it seeks to minimize the sum of the squares of the differences between the values predicted by the model and the actual values observed in the training dataset. This approach is very common and is the foundation of most linear regression implementations.
Ordinary Least Squares (OLS)
Suppose we have a training data set composed of ( n ) observations. Each observation consists of a pair of values: an independent variable ( x_i ) and the corresponding dependent variable ( y_i ).
The goal of the linear regression model is to find the best regression line that minimizes the sum of the squares of the differences between the values predicted by the model and the actual observed values.
The regression line has the equation:
( y = \beta_0 + \beta_1 x )
Where:
- ( y ) is the dependent variable,
- ( x ) is the independent variable,
- ( \beta_0 ) is the intercept,
- ( \beta_1 ) is the slope.
To train the linear regression model using OLS, we need to find the optimal values for ( \beta_0 ) and ( \beta_1 ) that minimize the sum of squared errors (SSE):
( SSE = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 )
OLS finds these optimal values by taking the partial derivatives of the SSE function with respect to ( \beta_0 ) and ( \beta_1 ), setting them to zero, and solving the resulting equations. The resulting solutions are called the ordinary least squares estimates of the regression coefficients.
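Written out, the two conditions obtained by setting the partial derivatives to zero are the standard normal equations of simple linear regression:
( \frac{\partial SSE}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0 )
( \frac{\partial SSE}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0 )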
The formulas for calculating the ordinary least squares estimates are:
( \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} )
( \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} )
Where:
- ( \bar{x} ) is the average of the values of ( x ),
- ( \bar{y} ) is the average of the values of ( y ).
These formulas provide us with the optimal values for the regression coefficients ( \beta_0 ) and ( \beta_1 ), which minimize the sum of squared errors. Once we have these values, we can use the regression line to make predictions on new data.
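To make these formulas concrete, here is a minimal NumPy sketch that computes the two estimates directly from a small synthetic dataset (the arrays x and y and the names beta_0 and beta_1 are illustrative, not part of any library API):
import numpy as np
# Small illustrative dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 10.4])
# Means of x and y
x_bar = x.mean()
y_bar = y.mean()
# Closed-form OLS estimates of the slope and the intercept
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta_0 = y_bar - beta_1 * x_bar
print("Slope (beta_1):", beta_1)
print("Intercept (beta_0):", beta_0)
Fitting scikit-learn’s LinearRegression on the same x and y should return the same slope and intercept, since it solves the same minimization problem.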
Ordinary Least Squares (OLS) and Linear Regression in scikit-learn
The scikit-learn library does not provide a specific class called “Ordinary Least Squares (OLS)” because the OLS algorithm is implicitly implemented in scikit-learn’s LinearRegression class.
The LinearRegression class of scikit-learn uses the ordinary least squares approach to train the linear regression model. When you call the fit() method on a LinearRegression object, the model is trained using the OLS algorithm to find the optimal coefficients that minimize the sum of squared errors.
So, to use the OLS approach in scikit-learn, you just need to use the LinearRegression class and call the fit() method on the training dataset.
Here is an example of using the LinearRegression class to train a linear regression model with the OLS approach:
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Generating synthetic data
np.random.seed(0)
X_train = 2 * np.random.rand(100, 1) # Independent variable
y_train = 4 + 3 * X_train + np.random.randn(100, 1) # Dependent variable with Gaussian noise
# Training the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Printing the model coefficients
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
# Predictions on new data
X_test = np.array([[0], [1], [2]])
y_test = 4 + 3 * X_test # True values for the test points (generated without noise)
predictions = model.predict(X_test)
# Visualizing the data and the regression line
plt.scatter(X_train, y_train, color='blue')
plt.plot(X_test, predictions, color='red')
plt.xlabel('Independent Variable')
plt.ylabel('Dependent Variable')
plt.title('Linear Regression with scikit-learn')
plt.show()
Running this code, you get:
Intercept: [4.22215108]
Coefficients: [[2.96846751]]
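As a quick cross-check (a sketch that assumes the X_train and y_train arrays defined above are still in scope), the same coefficients can be obtained by solving the least squares problem directly with NumPy:
# Design matrix with a column of ones for the intercept
X_design = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
# Solve the least squares problem min ||X_design @ beta - y_train||^2
beta, residuals, rank, sv = np.linalg.lstsq(X_design, y_train, rcond=None)
print("Intercept (NumPy):", beta[0])
print("Slope (NumPy):", beta[1])
The values should match model.intercept_ and model.coef_ up to numerical precision, since both approaches solve the same ordinary least squares problem.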
Evaluation of the validity of the model
To evaluate the goodness of a linear regression model, there are several methods that can be used. Some of the most common include:
- Mean Squared Error (MSE): Calculate the mean squared error between the model predictions and the actual values in the test dataset. MSE measures the average of the squared differences between the predictions and the actual values; a lower MSE indicates a better model.
- R^2 Score: Calculate the coefficient of determination, known as the R^2 score. R^2 is a statistical measure of the goodness of fit of the model to the observations. It can be interpreted as the proportion of variance in the data that is explained by the model. An R^2 score closer to 1 indicates a better model, while a value closer to 0 indicates a worse model.
- Residuals Plot: View a residuals plot, which represents the differences between actual values and model predictions. If the residuals are randomly distributed around zero and show no obvious pattern, it is a sign of a good model.
- Cross Validation: Use the cross-validation technique to evaluate the performance of the model across multiple folds of data. This provides a more robust estimate of model performance than a single split of the data into training and test sets (a short cross-validation sketch is shown later, after the discussion of the MSE and R^2 results).
Additionally, it is always important to consider the specific context of the problem and how the evaluation metrics translate to the context of the problem domain. For example, in some cases, a certain amount of error may be acceptable, while in other cases even a small error can have significant consequences.
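For reference, the two numeric metrics mentioned above have simple closed forms; writing ( \hat{y}_i ) for the model prediction of observation ( i ) (a notation introduced here only for these formulas):
( MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 )
( R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} )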
Let’s now apply these evaluation methods to our linear regression example.
from sklearn.metrics import mean_squared_error, r2_score
# Computing Mean Squared Error (MSE) on training and test sets
train_predictions = model.predict(X_train)
train_mse = mean_squared_error(y_train, train_predictions)
print("Mean Squared Error on training set:", train_mse)
test_predictions = model.predict(X_test)
test_mse = mean_squared_error(y_test, test_predictions)
print("Mean Squared Error on test set:", test_mse)
# Computing R^2 score on training and test sets
train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)
print("R^2 score on training set:", train_r2)
print("R^2 score on test set:", test_r2)
Running this code, you get the following result:
Mean Squared Error on training set: 0.9924386487246479
Mean Squared Error on test set: 0.03699831140189244
R^2 score on training set: 0.7469629925504755
R^2 score on test set: 0.9938336147663512
Here’s how we can evaluate the results:
Mean Squared Error (MSE):
- MSE on the training set: 0.992
- MSE on the test set: 0.037
These values indicate the average magnitude of the squared errors between the model predictions and the actual values. Since MSE is a measure of error, lower values are better. In our case, the MSE on the test set is significantly lower than that on the training set; keep in mind, however, that the test targets were generated without noise while the training targets include Gaussian noise, so the gap mostly reflects how the data was generated rather than exceptional generalization.
Coefficient of determination (R^2):
- R^2 on the training set: 0.747
- R^2 on the test set: 0.994
The coefficient of determination, or R^2, measures the proportion of variance in the data that is explained by the model. A value closer to 1 indicates a better fit. Both values are reasonably high, suggesting that the model explains a significant share of the variation in the data on both the training and test sets. The R^2 on the test set is particularly high, again partly because the test targets contain no noise; on noisy real-world data you should not expect such a large gap in favor of the test set.
Overall, the results indicate that the linear regression model performs well on both the training and test sets. However, it is always advisable to also evaluate other aspects of the model and to consider the specific context of the problem before drawing definitive conclusions about its validity.
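The list of evaluation methods above also mentioned cross validation. Here is a minimal sketch of how it could be applied to this example with scikit-learn's cross_val_score; it assumes the X_train and y_train arrays from the code above, and the 5-fold split and the neg_mean_squared_error scoring are illustrative choices:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# 5-fold cross-validation on the training data, scored with MSE.
# scikit-learn negates the MSE so that higher scores always mean better.
cv_scores = cross_val_score(LinearRegression(), X_train, y_train.ravel(),
                            cv=5, scoring='neg_mean_squared_error')
print("MSE per fold:", -cv_scores)
print("Mean MSE across folds:", -cv_scores.mean())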
The Residuals Plot
Let’s now see how to evaluate the goodness of our linear regression model graphically, through the residuals plot.
# Calculation of residuals
train_residuals = y_train - train_predictions
test_residuals = y_test - test_predictions
# Plot of residuals
plt.figure(figsize=(10, 6))
plt.scatter(train_predictions, train_residuals, color='blue', label='Training set')
plt.scatter(test_predictions, test_residuals, color='red', label='Test set')
plt.axhline(y=0, color='black', linestyle='--')
plt.xlabel('Predictions')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.legend()
plt.show()
In this code, we calculate residuals by subtracting model predictions from actual values for both the training set and the test set. Then, we plot the residuals against the model predictions on a scatterplot. The y-axis represents the residuals and the x-axis represents the model predictions. The black dashed line indicates the zero reference line, which represents the situation where the residuals are zero.
Examining the plot of residuals can help identify any patterns or structures in the residuals, such as heteroscedasticity or homoscedasticity issues, which can provide additional information on how good the model is.
Heteroscedasticity and homoscedasticity are two important concepts that concern the variance of errors (or residuals) in regression models. These concepts are fundamental to evaluate the goodness of the model and to ensure that the model’s assumptions are satisfied.
Heteroscedasticity: Occurs when the variance of the errors is not constant across all ranges of values of the independent variables. In other words, the dispersion of the errors varies non-uniformly along the range of the independent variables. This can manifest as a cone or fan shape in the residuals when plotted against predicted values. Heteroscedasticity can lead to imprecise estimates of the regression coefficients and erroneous estimates of the standard errors of the coefficients.
Homoscedasticity: Occurs when the variance of the errors is constant across all ranges of values of the independent variables. In other words, the dispersion of errors is uniform along the range of the independent variables. This is desirable because it indicates that the model is accurate and reliable across the full range of independent variables.
In summary, heteroscedasticity indicates non-uniform error dispersion, while homoscedasticity indicates uniform (constant) error dispersion. It is important to identify and correct for heteroscedasticity when evaluating the goodness of a regression model, as it can affect the accuracy of the model’s estimates and predictions.
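To make the distinction visible, here is a small illustrative sketch (with its own synthetic data, not the dataset used earlier) that generates heteroscedastic errors, fits a linear regression, and plots the residuals; the fan shape in the resulting plot is the typical symptom described above:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Synthetic data where the noise grows with x (heteroscedastic errors)
rng = np.random.default_rng(42)
X_h = 10 * rng.random((200, 1))
y_h = 2 + 1.5 * X_h.ravel() + rng.normal(0, 0.5 * X_h.ravel())  # noise std proportional to x
# Fit a linear regression and compute the residuals
model_h = LinearRegression().fit(X_h, y_h)
residuals_h = y_h - model_h.predict(X_h)
# The residuals fan out as the predictions grow: a sign of heteroscedasticity
plt.scatter(model_h.predict(X_h), residuals_h, color='blue')
plt.axhline(y=0, color='black', linestyle='--')
plt.xlabel('Predictions')
plt.ylabel('Residuals')
plt.title('Residuals with Heteroscedastic Errors')
plt.show()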
Returning to the residuals plot of our example, we can see that the red points of the test set all lie at roughly the same distance from the central black line, but all below it. In our case we only used 3 test points, but let’s assume that we used many of them and that they all behaved like the three points in the example.
Residuals that sit consistently below the zero reference line indicate a systematic bias: the model is slightly over-predicting the test values. In this example that is expected, because the test targets were generated without noise while the fitted intercept and slope deviate slightly from the true values of 4 and 3. Heteroscedasticity, by contrast, would show up as a spread of the residuals that changes along the range of the predictions (for example, a fan shape) rather than as a constant offset. On real data, a residuals plot showing either a systematic offset or a non-uniform spread is a sign that the model is not capturing the structure or the variability of the data uniformly across the range of predictions.
When to use Ordinary Least Squares (OLS) with Linear Regression?
The choice to use Linear Regression over other regression methods depends on several factors, including the nature of the problem, the relationship between the variables, the distribution of the data, and the objectives of the analysis. Here are some scenarios where linear regression might be preferable to other regression methods provided by scikit-learn:
- Linear Relationship between Variables: If there is a clear linear relationship between the dependent variable and the independent variables in the dataset, linear regression may be an appropriate choice. Linear regression is simple, interpretable, and can easily be extended to more complex models such as polynomial regression (a short sketch follows this list).
- Interpretability of Coefficients: Linear regression provides coefficients that have direct interpretations. For example, in the case of a simple linear regression, the slope indicates the average change in the dependent variable for a unit increase in the independent variable. This interpretability of the coefficients can be valuable in many situations.
- Computational Efficiency: Linear regression is computationally efficient and can be trained quickly even on large datasets. If you are dealing with a large dataset and want a model that can be trained quickly, linear regression may be a convenient choice.
- Absence of Complex Assumptions: Linear regression is based on a few assumptions, such as linearity and normality of residuals. If these assumptions are met or if some degree of violation of the assumptions can be accepted, linear regression may be an appropriate choice without the need for more complex models.
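As mentioned in the first point of the list above, linear regression extends naturally to polynomial regression. Here is a minimal sketch of that extension in scikit-learn; the quadratic data, the degree of 2, and the variable names are illustrative choices:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# Synthetic data with a quadratic relationship
rng = np.random.default_rng(0)
X_poly = 6 * rng.random((100, 1)) - 3
y_poly = 0.5 * X_poly.ravel() ** 2 + X_poly.ravel() + 2 + rng.normal(0, 0.5, 100)
# PolynomialFeatures expands x into [1, x, x^2]; LinearRegression is still fitted with OLS
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_poly, y_poly)
print("R^2 on the training data:", poly_model.score(X_poly, y_poly))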
However, there are cases where other regression methods provided by scikit-learn might be more suitable, for example:
- When the relationship between variables is nonlinear, nonlinear regression methods such as polynomial regression, spline regression, or decision tree models may be needed.
- If you want to address the problem of overfitting or underfitting, you may prefer other regression methods that allow you to control the complexity of the model, such as ridge regression, lasso regression, or decision tree-based regression models (a short Ridge and Lasso sketch follows this list).
- When you want to explore more complex relationships between variables, you may prefer more flexible regression methods, such as higher degree polynomial regression or nonparametric regression models such as support vector regression (SVR) or k-NN regression.
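As a follow-up to the point on overfitting in the list above, here is a minimal sketch of the regularized variants mentioned there, using scikit-learn's Ridge and Lasso classes; the alpha values and the reuse of X_train and y_train from the earlier example are illustrative assumptions:
from sklearn.linear_model import Ridge, Lasso
# Ridge adds an L2 penalty on the coefficients, Lasso an L1 penalty;
# alpha controls the strength of the regularization.
ridge = Ridge(alpha=1.0).fit(X_train, y_train.ravel())
lasso = Lasso(alpha=0.1).fit(X_train, y_train.ravel())
print("Ridge coefficients:", ridge.coef_, "intercept:", ridge.intercept_)
print("Lasso coefficients:", lasso.coef_, "intercept:", lasso.intercept_)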
In summary, the choice between linear regression and other regression methods depends on the specific characteristics of the problem, the model assumptions, and the objectives of the analysis. It is important to carefully review these considerations before selecting the most appropriate regression method for a given scenario.