L1 and L2 Regularization-Lasso and Ridge Regression

Rishika Aditya
7 min read · Nov 14, 2020


What is Regularization?

In everyday usage, “regularization” means imposing control over something. However, the term also has an important role to play in machine learning. Surprised? Let’s understand how!

Importance of Regularization in Machine Learning

Training a machine learning model can sometimes lead to overfitting, which results in poor predictions in the long run. Regularization is the technique of introducing a penalty term that reduces overfitting and hence helps the model make better long-term predictions. Two regression models that achieve regularization are Ridge Regression and Lasso Regression.

Let us understand Ridge Regression first, then Lasso Regression, and then look at the similarities and differences between them.

Ridge Regression (L2 Regularization)


In linear regression, the data is divided into training and testing sets for the purpose of making predictions. If the amount of training data is small compared to the testing data, the problem of overfitting can arise.

However, this problem can be addressed by making the predictions much less sensitive to changes in the x-axis values, or in other words, by introducing a little bit of BIAS!

Let us call this new, bias-introduced line the Ridge-Regression Line. This line will also help us reduce a quantity called VARIANCE.

Understanding Bias And Variance

Bias is the difference between the ground truth, i.e., the training data points, and the predicted line.

Statistically, variance is a measure of how far each data point is from the mean of all the data points. Here, variance is the difference between the testing data points and the predicted line. In machine learning, a model with high variance has effectively fit the random noise in the training data.

In the linear regression (least squares) estimate, the bias was low, resulting in high variance on the testing data points. However, upon introducing a small amount of bias, which gives us the Ridge-Regression Line, the variance is significantly reduced.

Now, let us understand this mathematically!

Let us first understand what the sum of squared residuals is. The residuals are the differences between the training data points that we originally have and the predictions of the Linear Regression Line. All the residuals are squared and added for further computations.

When Least Squares is used for determining the values of the parameters in the equation:

Size = y-axis intercept + slope * Weight

it minimizes the sum of the squared residuals.

Ridge Regression, however, determines the values of the parameters by introducing a penalty term, lambda * slope², which is added to the sum of squared residuals.

The slope² term introduces the penalty, and lambda determines the magnitude of the penalty, in other words, it dictates the severity of the penalty.

As the value of lambda increases, the slope decreases (lambda can range from 0 to infinity). In Ridge Regression, extremely large values of lambda result in the slope becoming very small, i.e., very close to zero. One thing to note is that however large lambda is, the slope will never shrink to an out-and-out zero; it only gets asymptotically close to 0.
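Here is a minimal sketch of this behaviour, using sklearn’s Ridge (where lambda is exposed as the alpha parameter) and some made-up Size/Weight data; the numbers are illustrative, not from this article:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(42)
weight = rng.uniform(1, 10, size=(20, 1))            # x-axis: Weight
size = 2.5 * weight.ravel() + 1.0 + rng.randn(20)    # y-axis: Size = intercept + slope * Weight + noise

# As lambda (alpha) grows, the fitted slope shrinks toward zero but never reaches exactly zero.
for lam in [0.1, 1, 10, 100, 1000]:
    ridge = Ridge(alpha=lam).fit(weight, size)
    print(f"lambda = {lam:>6}: slope = {ridge.coef_[0]:.4f}")
```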

As a result, the Ridge Regression line is less sensitive to the values on the x-axis when making predictions than the Least Squares line.

Ridge Regression can also be applied to Logistic Regression. The difference from the Least Squares case is that in Logistic Regression, the penalty is added to the sum of the likelihoods rather than to the sum of squared residuals.
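As a rough sketch of what this looks like in practice, sklearn’s LogisticRegression applies an L2 (ridge-style) penalty by default; the breast-cancer dataset below is just an assumed example dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# C is the inverse of the regularization strength: a smaller C means a larger penalty.
clf = LogisticRegression(penalty="l2", C=0.1, max_iter=5000).fit(X, y)
print(clf.coef_)  # coefficients are shrunk toward zero, but none become exactly zero
```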

Let us understand what Cross-Validation is before proceeding.

Cross-validation is a technique that compares the workings of different machine learning models (or settings of a model) on held-out data in order to ascertain which one will give better long-term results. The model that gives the best metric (here, accuracy) is chosen as our model.

In Ridge Regression, cross-validation helps in determining the value of lambda. Another important point is that when there are very few training data points, Least Squares will not be able to give good predictions, whereas Ridge Regression can still give predictions by using cross-validation and the penalty.
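One way to do this with sklearn is RidgeCV, which picks lambda (alpha) by cross-validation; a hedged sketch, assuming the diabetes dataset that also appears later in this post:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV

X, y = load_diabetes(return_X_y=True)

# Try several candidate lambdas (alphas) and let 10-fold cross-validation pick the best one.
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=10).fit(X, y)
print("Best alpha found by cross-validation:", ridge_cv.alpha_)
```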

Code Implementation Using Sklearn

Implementation of Ridge Regression using sklearn in python
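A minimal sketch of such an implementation (assuming the diabetes dataset that is also used later in this article, and an illustrative alpha of 1.0) could look like this:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load an example dataset and split it into training and testing data.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# alpha plays the role of lambda: the strength of the L2 penalty.
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, y_pred))
```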

Lasso Regression (L1 Regularization)


Lasso Regression works very similarly to Ridge Regression. The reasons for introducing a Lasso-Regression line remain the same as for Ridge Regression, and a bias is introduced into the overfitted Least Squares line in the same way. However, a significant difference lies in the penalty.

Here, lambda is multiplied by the absolute value of the slope; that is, the penalty term lambda * |slope| is added to the sum of squared residuals, and this total is what gets minimized.

In Lasso Regression, as the lambda value increases, the slope can shrink all the way down to exactly 0, unlike in Ridge Regression, where the slope can get asymptotically close to 0 but never reaches exactly 0.
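A small sketch with made-up data (not from this article) makes the contrast visible: Lasso zeroes out weak coefficients, while Ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.randn(100)   # only the first two features matter

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 3))  # several coefficients are exactly 0.0
print("Ridge coefficients:", np.round(ridge.coef_, 3))  # small values, but none exactly 0.0
```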

Similarities between Lasso and Ridge Regression:

  1. The reason for introducing both regularization methods is to shrink parameters; however, the parameters need not all be shrunk equally.
  2. Both regression models can be applied to complicated models.

Differences between Lasso and Ridge Regression:

  1. The surface-level difference is that in Ridge Regression the penalty term squares the slope, whereas in Lasso Regression the absolute value of the slope is taken.
  2. When a dataset has a lot of variables or parameters that are not of much use and only increase the computation time without any productive output, Lasso Regression comes into the picture and eliminates them by shrinking their coefficients to exactly 0. When most of the parameters or variables are useful, Ridge Regression comes into the picture.
3D graphical representation of Ridge and Lasso Regression (source: slideshare.net)

Demonstration of Linear, Ridge and Lasso Regression models:

Mean Squared Error (MSE) is an evaluation metric for a machine learning model that measures the average of the squares of the errors. In the following code implementation, we will use the diabetes dataset from sklearn to compare Linear, Lasso, and Ridge Regression on this evaluation metric. The closer the MSE is to 0, the better the model.
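For example, a quick illustration of the metric itself, with made-up numbers:

```python
from sklearn.metrics import mean_squared_error

y_true = [3.0, 5.0, 2.5]
y_pred = [2.5, 5.0, 4.0]

# ((3.0 - 2.5)**2 + (5.0 - 5.0)**2 + (2.5 - 4.0)**2) / 3 = 0.8333...
print(mean_squared_error(y_true, y_pred))
```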

The diabetes dataset has been loaded for the purpose of demonstrating the performance of Linear, Ridge, and Lasso Regression.
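A minimal sketch of the Linear Regression step, assuming the sklearn diabetes dataset and 10-fold cross-validation:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Independent features X and dependent feature y from the diabetes dataset.
X, y = load_diabetes(return_X_y=True)

lin_regressor = LinearRegression()

# 10-fold cross-validation, scored with (negative) mean squared error.
mse = cross_val_score(lin_regressor, X, y, scoring="neg_mean_squared_error", cv=10)
mean_mse = np.mean(mse)
print(mean_mse)
```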

In this code:

  1. From the sklearn library, cross_val_score has been imported to perform cross-validation. Here, 10-fold cross-validation is performed.
  2. In ‘lin_regressor’, a LinearRegression object has been initialized.
  3. The initialized linear regressor object is passed into cross_val_score, along with X and y (the independent and dependent features respectively), the evaluation metric (MSE in this case), and the number of cross-validation folds.
  4. The mean of the MSE scores has been taken using numpy.
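Next comes the Ridge Regression step; a hedged sketch with an assumed, illustrative grid of alpha values:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

ridge = Ridge()
# Candidate alpha (lambda) values; the exact grid here is illustrative.
parameters = {"alpha": [1e-3, 1e-2, 0.1, 1, 5, 10, 20, 50, 100]}

ridge_regressor = GridSearchCV(ridge, parameters, scoring="neg_mean_squared_error", cv=10)
ridge_regressor.fit(X, y)

print(ridge_regressor.best_params_)  # the alpha giving the best score
print(ridge_regressor.best_score_)   # the corresponding cross-validated (negative) MSE
```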

In this code:

  1. The required libraries have been imported.
  2. A Ridge object has been initialized.
  3. Different values of alpha have been passed as parameters.
  4. With the help of GridSearchCV, the best value of alpha is determined; the evaluation metric is again chosen as MSE.
  5. best_params_ helps us determine the value of alpha at which the MSE is best.
  6. best_score_ gives the MSE for Ridge Regression.
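And the corresponding Lasso sketch, identical apart from the estimator:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

lasso = Lasso()
parameters = {"alpha": [1e-3, 1e-2, 0.1, 1, 5, 10, 20, 50, 100]}  # illustrative grid

lasso_regressor = GridSearchCV(lasso, parameters, scoring="neg_mean_squared_error", cv=10)
lasso_regressor.fit(X, y)

print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)
```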

This code is executed similarly to the Ridge Regression code, except that the object initialized here is Lasso.

From the code snippets, we can deduce that the best result is given by the Ridge Regression model, as its MSE is closest to 0 compared to the other two models. There is only a small difference between the Ridge and Lasso metrics. The worst metric is given by the Linear Regression model.

Graph for prediction of Ridge Regression
Graph for prediction of Lasso Regression
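A hedged sketch of how similar prediction-error plots could be drawn (assuming an alpha of 1.0 for both models and a simple train/test split):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

ridge_pred = Ridge(alpha=1.0).fit(X_train, y_train).predict(X_test)
lasso_pred = Lasso(alpha=1.0).fit(X_train, y_train).predict(X_test)

# Histograms of the prediction errors (actual - predicted) for both models.
plt.hist(y_test - ridge_pred, bins=20, alpha=0.5, label="Ridge")
plt.hist(y_test - lasso_pred, bins=20, alpha=0.5, label="Lasso")
plt.xlabel("actual - predicted")
plt.legend()
plt.show()
```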

Summary


Linear regression or logistic regression models may sometimes overfit, thus reducing the testing accuracy. To avoid this problem, two regularization techniques, Ridge and Lasso, are introduced for the purpose of providing better metrics and improving long-term predictions. Ridge and Lasso Regression have certain similarities and differences in how they work, which have been listed above.

