
Linear regression with Lasso in Machine Learning with scikit-learn


Lasso (Least Absolute Shrinkage and Selection Operator) regression is a linear regression technique that uses L1 regularization to improve generalization and perform variable selection. By combining coefficient shrinkage with the ability to select the most important variables, it helps create simpler, more interpretable, and more generalizable models.

The LASSO (Least Absolute Shrinkage and Selection Operator) regression

Lasso (Least Absolute Shrinkage and Selection Operator) regression was first introduced by Robert Tibshirani in 1996. It was developed as a regularization technique for linear regression, with the main goal of addressing the problems of overfitting and variable selection.

The concept of Lasso regression emerged as a solution to the variable selection problem, which arises when a regression model has a large number of explanatory variables. In these situations, many variables may be irrelevant for predicting the outcome yet still influence the model, leading to good performance on the training data but poor generalization to new data.

Lasso regression addresses this problem by introducing an L1 penalty on the absolute sum of the model coefficients during the training process. This L1 penalty causes some coefficients to become exactly zero, thus reducing the number of variables used in the model. This automatic variable selection process makes Lasso regression particularly useful in situations where you want to identify the most important predictors among a large number of explanatory variables.

In the years since its introduction, Lasso regression has gained significant popularity in the scientific community and within machine learning, becoming one of the most widely used regularized regression methods alongside other techniques such as Ridge regression and Elastic Net.

Lasso regression is based on minimizing a cost function that includes two terms: a mean square error (MSE) term and an L1 penalty term.

The objective function of Lasso regression can be expressed as:

( \min_{\beta} \frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} |\beta_j| )

Where:

- ( y_i ) are the observed values and ( \hat{y}_i ) are the values predicted by the model;
- ( \beta_j ) are the model coefficients and ( p ) is the number of explanatory variables;
- ( n ) is the number of samples;
- ( \alpha \ge 0 ) is the regularization parameter that controls the strength of the penalty.

The part ( \frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 ) represents the mean square error (MSE) term, while ( \alpha \sum_{j=1}^{p} |\beta_j| ) represents the L1 penalty term.

The L1 penalty (absolute sum of coefficients) encourages the model coefficients to become exactly zero, thus reducing the complexity of the model and leading to variable selection. This is useful for creating simpler and more interpretable models, as well as helping to prevent the problem of overfitting.
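To see this effect in practice, here is a minimal sketch (anticipating the scikit-learn implementation described in the next section, with synthetic data and arbitrary alpha values chosen purely for illustration) showing how increasing the regularization strength zeroes out more and more coefficients:

import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 100 samples, 10 features, only 2 of which are informative
rng = np.random.RandomState(42)
X = rng.randn(100, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * rng.randn(100)

# A stronger L1 penalty (larger alpha) sets more coefficients exactly to zero
for alpha in [0.001, 0.01, 0.1, 1.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    print(f"alpha={alpha}: {np.sum(lasso.coef_ == 0)} of {lasso.coef_.size} coefficients are zero")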

Lasso Regression with Scikit-learn

Lasso regression is implemented in the scikit-learn library, which is one of the most popular libraries for machine learning in Python. In scikit-learn, Lasso regression is available via the Lasso class within the linear_model module.

Let’s see an example using synthetic data, that is, artificially generated data simulating a dataset that follows a linear trend with deliberately added background noise.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

np.random.seed(0)
n_samples = 100
n_features = 10
X = np.random.randn(n_samples, n_features)
true_coefficients = np.random.randn(n_features)
y = X.dot(true_coefficients) + np.random.normal(0, 0.5, n_samples)

# Split data into training and test sets
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Create Lasso regression model
alpha = 0.1  # regularization parameter
lasso = Lasso(alpha=alpha)

# Train the model
lasso.fit(X_train, y_train)

# Predict on test data
y_pred = lasso.predict(X_test)

# Calculate mean squared error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# Print model coefficients
print("Coefficients:", lasso.coef_)

Running the code, you get the following results:

Mean Squared Error (MSE): 0.3927541814497199
Coefficients: [ 0.50765085  0.76293123 -0.3276433   0.          0.15763891  0.11216079
  0.42298338 -1.73556608 -0.          0.09392823]

We obtained the Mean Squared Error (MSE), which is the average of the squared errors between the predicted values and the actual values: the lower the MSE, the better the model. We also obtained a series of coefficients from the Lasso model. These coefficients are the weights assigned to each explanatory variable in the model, and each one indicates how much a variable influences the target variable.

Here’s what these coefficients mean and what they are for:

- The sign of a coefficient indicates the direction of the relationship: a positive coefficient increases the predicted value as the variable grows, a negative one decreases it.
- The magnitude of a coefficient indicates how strongly that variable influences the target variable.
- A coefficient equal to exactly zero means that the L1 penalty has excluded that variable from the model.

Therefore, by examining these coefficients, you can understand which explanatory variables are considered important by the model and to what extent they influence the target variable. This information can be used to interpret the model and draw conclusions about the factors that influence the target variable in the context of your specific problem.
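As a quick illustration (reusing the lasso model trained above), you can rank the variables by the absolute value of their coefficients to see which ones the model considers most influential:

import numpy as np

# Sort variable indices by decreasing absolute coefficient value
order = np.argsort(np.abs(lasso.coef_))[::-1]
for i in order:
    print(f"Variable {i}: coefficient = {lasso.coef_[i]:.4f}")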

They can also be displayed graphically to better see the contribution of each variable.

import matplotlib.pyplot as plt

# Lasso model coefficients
lasso_coefficients = lasso.coef_

# Variable indices
indices = np.arange(len(lasso_coefficients))

# Plot coefficients
plt.figure(figsize=(10, 5))
plt.bar(indices, lasso_coefficients, color='b')
plt.xlabel('Variable Index')
plt.ylabel('Coefficient')
plt.title('Lasso Regression Model Coefficients')

# Add a red line for zero coefficients
plt.axhline(y=0, color='r', linestyle='--')

plt.xticks(indices)
plt.grid(True)
plt.show()

It can be clearly seen that features 4 and 9 (indexes 3 and 8) have coefficients of exactly zero; in our synthetic dataset these correspond to the two columns of X_train that the model has excluded. We can inspect them directly:

# Columns whose coefficients were set to zero by the Lasso model
print(X_train[:, 3])
print(X_train[:, 8])

Returning to the results obtained from our model, we can evaluate the goodness of the prediction graphically using the following code:

import matplotlib.pyplot as plt
import numpy as np

# Plot predicted values vs actual values
plt.scatter(y_test, y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)  # Red dashed diagonal
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs Predicted Values (Lasso Regression)")
plt.show()

By executing this code you obtain a graph in which the predicted values are compared with the actual values. The points should lie as close as possible to the red dashed diagonal (predicted value = actual value). How the points are distributed around this line along its whole extent shows how good the model is at making predictions.

Real example of Linear Regression with the scikit-learn diabetes dataset

In the previous example we used synthetic data to show how linear regression works. Now we will move on to a real dataset, provided by the scikit-learn library and used to test models’ ability to predict outcomes: the Diabetes dataset.

This dataset is widely used for evaluating the performance of regression models. It contains diabetes-related information for 442 patients, along with disease progression after one year, measured via a continuous response variable. The dataset contains only 442 instances, with 10 predictor variables. Despite its small size, the dataset is realistic and represents a typical regression problem where you want to predict disease progression based on different clinical measurements.

Predictor variables in the dataset include characteristics such as age, gender, body mass index, and six blood serum measurements. These variables cover a range of information relevant to diabetes progression.

Due to its size and the presence of a continuous response variable, the dataset is suitable for evaluating the performance of regression models. You can train different regression models, such as Linear Regression, Lasso Regression, Ridge Regression, and others, and evaluate their performance using cross-validation techniques or simply by splitting the data into training and test sets.

We then load the diabetes dataset. For a detailed description of the dataset we can use the diabetes.DESCR attribute.

from sklearn.datasets import load_diabetes

# Load the diabetes dataset
diabetes = load_diabetes()

# Display dataset description
print("\nDiabetes dataset description:")
print(diabetes.DESCR)

One way to view and manage its content is to use a pandas DataFrame.

import pandas as pd

# Create a DataFrame with the data and column names
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Add target column to the DataFrame
df['target'] = diabetes.target

# Display the first 5 rows of the DataFrame
df.head()

The head() function returns the first 5 rows of the DataFrame, enough to take a quick look at the dataset’s content and how it is structured.
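For a quick overview of the dataset’s size and value ranges, you can also inspect the DataFrame directly (a standard pandas sketch based on the df created above):

# Dimensions: 442 rows (patients) and 11 columns (10 features plus target)
print(df.shape)

# Summary statistics for each column (mean, std, min, max, quartiles)
print(df.describe())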

The last column represents the target. This value in the diabetes dataset is the progression of diabetes disease after one year of treatment, measured by a continuous variable. This variable represents a quantitative measure of disease progression and is used as a response or target variable in regression models. The goal is to use the other explanatory variables in the dataset to predict this target variable, in order to understand which factors influence the progression of diabetes.

Essentially, our goal is to use the information provided by the other variables in the dataset (such as age, gender, body mass index, and blood serum measurements) to predict the progression of the diabetes disease represented by the target. This allows us to better understand the factors that influence the progression of diabetes and can help in the diagnosis and treatment of the disease.

Now let’s apply the Lasso linear regression model. We divide the dataset into a training set (80%) and a test set (20%), then use the first for model learning and the second for evaluating predictions.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

X = diabetes.data
y = diabetes.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the Lasso regression model
alpha = 0.1  # regularization parameter
lasso = Lasso(alpha=alpha)

# Train the model on the training set
lasso.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = lasso.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Print the model coefficients
print("Coefficients:", lasso.coef_)

Running the code, you get the following results:

Mean Squared Error: 2798.193485169719
Coefficients: [   0.         -152.66477923  552.69777529  303.36515791  -81.36500664
   -0.         -229.25577639    0.          447.91952518   29.64261704]

As you can see, some coefficients have been set exactly to zero. The features of the diabetes dataset corresponding to these null coefficients are considered irrelevant by the model for predicting disease progression.

In our case, the coefficients that have been set to zero are associated with the features at indices 0, 5, and 7. Since feature indices in Python start from zero, the corresponding features are:

- age (index 0): the patient’s age;
- s2 (index 5): a blood serum measurement (low-density lipoproteins, ldl);
- s4 (index 7): a blood serum measurement (total cholesterol / HDL ratio, tch).

These features may not contribute significantly to disease progression in this dataset, so the Lasso regression model excluded them, thus reducing the complexity of the model.

This behavior is typical of Lasso regression, since L1 regularization induces sparsity, i.e. causes some coefficients to become exactly zero. This makes Lasso regression particularly useful for selecting variables and creating simpler, more interpretable models.
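To identify the excluded features by name rather than by index, here is a short sketch (assuming the lasso model and the diabetes dataset loaded above):

import numpy as np

# Map the zeroed coefficients back to the dataset's feature names
feature_names = np.array(diabetes.feature_names)
print("Excluded features:", feature_names[lasso.coef_ == 0])
print("Retained features:", feature_names[lasso.coef_ != 0])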

If we also want a graphical evaluation:

import matplotlib.pyplot as plt
import numpy as np

# Plot of predicted values vs actual values
plt.scatter(y_test, y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)  # Red dashed diagonal
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs Predicted Values (Lasso Regression)")
plt.show()

By running this code you get the following graph:

Evaluating the LASSO Regression Model

We have seen how to graphically evaluate the results of the model’s prediction, observing how far the points move away from the central diagonal. Beyond this, there are several metrics that can be used to evaluate the goodness of a regression model. Some of the common metrics include:

- Mean Squared Error (MSE): the average of the squared differences between predicted and actual values;
- Root Mean Squared Error (RMSE): the square root of the MSE, expressed in the same units as the target variable;
- Mean Absolute Error (MAE): the average of the absolute differences between predicted and actual values;
- R-squared (( R^2 )): the coefficient of determination, i.e. the proportion of the variance of the target variable explained by the model.

For Lasso regression, MSE, RMSE, and R-squared are the most commonly used metrics for evaluating the goodness of the model. For example, in the code provided above, we used the MSE to evaluate model performance. However, it is always a good idea to use more than one metric to get a more complete assessment of your model’s performance.

Let’s see how to calculate the MSE, RMSE, MAE and ( R^2 ) metrics in the context of the Lasso regression model we created:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
print("Root Mean Squared Error (RMSE):", rmse)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error (MAE):", mae)

# Calculate R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_pred)
print("R-squared (R^2):", r2)

Running the code, we obtain the following metric values:

Mean Squared Error (MSE): 2798.193485169719
Root Mean Squared Error (RMSE): 52.897953506442185
Mean Absolute Error (MAE): 42.85442771664998
R-squared (R^2): 0.4718547867276227

To evaluate the results obtained, we can interpret each of the metrics in the following way:

- The MSE (about 2798) and the RMSE (about 52.9) indicate that, on average, predictions deviate from the actual values by roughly 53 units of disease progression;
- The MAE (about 42.9) confirms an average absolute error of around 43 units;
- The ( R^2 ) value of about 0.47 means the model explains only about 47% of the variance of the target variable, a rather modest result.

Overall, based on these metrics, we can conclude that the Lasso regression model may not be very accurate or robust for predicting disease progression in the considered diabetic dataset. You may need to explore other modeling techniques or tune model parameters to improve performance.
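As a possible next step, the regularization parameter can be tuned automatically. Below is a minimal sketch using scikit-learn’s LassoCV class, which selects alpha by cross-validation; the value cv=5 is an arbitrary choice for illustration:

from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error

# LassoCV tries a range of alpha values and picks the one
# that minimizes the cross-validated error on the training set
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_train, y_train)

print("Best alpha found:", lasso_cv.alpha_)
y_pred_cv = lasso_cv.predict(X_test)
print("Test MSE with tuned alpha:", mean_squared_error(y_test, y_pred_cv))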

When to use Lasso for Linear Regression problems

The Lasso method is particularly useful in several contexts of linear regression problems. Here are some cases where the Lasso method may be an appropriate choice:

- When you have a large number of explanatory variables and suspect that only some of them are truly relevant;
- When you want automatic variable selection, in order to obtain a sparser and more interpretable model;
- When some predictors are strongly correlated and you want the model to keep only one representative from each group;
- When you want to reduce the risk of overfitting by controlling the complexity of the model.

Overall, the Lasso method is a great choice when you want to select variables, control the dimensionality of the model, and obtain clearer interpretations of the model coefficients, while maintaining good predictive performance.
