Regularization in Linear Regression
- 7rishi20ss
- Mar 1, 2023
- 7 min read
So let's start by asking a question:
What is regularization, and why do we even need it?
Regularization is a technique used in machine learning to prevent overfitting of models. Overfitting occurs when a model is too complex and captures noise in the training data, resulting in poor generalization performance on new, unseen data. Regularization addresses this problem by adding a penalty term to the cost function that encourages the model to have smaller parameter values, effectively reducing its complexity.
There are a few reasons why regularization is important:
Preventing Overfitting: Regularization prevents overfitting by controlling model complexity and reducing the tendency to memorize noise in the training data.
Improving Generalization: By reducing overfitting, the model can predict more reliably on new, unseen data.
Feature Selection: Regularization can act as a feature selection tool, as it encourages some features to have small (or zero) coefficients, effectively removing them from the prediction process. This can improve predictions and reduce the number of features required for accurate results.
Handling Multicollinearity : Regularization can handle multicollinearity, which is the presence of high correlations between predictors in a dataset. Multicollinearity can lead to unstable parameter estimates and poor model performance, but regularization can reduce the impact of correlated predictors and improve the stability of the estimates.
What are these Regularization Techniques?
There are many regularization techniques out there that can improve a model's performance, but today we will discuss the two most important and widely used ones: Lasso and Ridge regression.
Lasso Regression
Lasso regression, also known as L1 regularization, is a linear regression technique that adds a penalty term to the cost function to encourage small values of the coefficients. The term "Lasso" stands for "Least Absolute Shrinkage and Selection Operator".
The cost function for Lasso regression is defined as:
J(w) = (1/m) * Σ_{i=1 to m} (h(x^(i)) - y^(i))^2 + lambda * Σ_{j=1 to n} |w_j|
Where,
w is the vector of coefficients
m is the number of training examples
h(x^(i)) is the predicted value for the ith training example
y^(i) is the actual value for the ith training example
lambda is the regularization parameter
|w_j| is the absolute value of the jth coefficient
The first term in the cost function is the mean squared error between the predicted and actual values, and the second term is the L1 penalty term. The L1 penalty term encourages small values of the coefficients and also has the effect of setting some of the coefficients to zero, leading to sparse models.
The L1 penalty term also has a geometric interpretation. The sum of the absolute values of the coefficients defines a diamond-shaped constraint region in the coefficient space. The objective is to find the point within this region that has the lowest error, which corresponds to the optimal values of the coefficients.
The regularization parameter lambda controls the strength of the penalty term. As lambda increases, the coefficients are shrunk towards zero, leading to a more sparse model with fewer non-zero coefficients. Conversely, as lambda decreases, the model becomes less sparse, with more non-zero coefficients.
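To make this sparsity effect concrete, here is a minimal sketch using scikit-learn's Lasso on a synthetic dataset (the exact coefficient counts are purely illustrative) showing how the number of non-zero coefficients shrinks as lambda (called alpha in scikit-learn) increases:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 100 samples, 20 features, only 5 of them truly informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=0)

# As alpha (lambda) grows, more coefficients are driven exactly to zero
for alpha in [0.1, 1, 10, 100]:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    n_nonzero = np.sum(lasso.coef_ != 0)
    print(f'alpha={alpha}: {n_nonzero} non-zero coefficients out of 20')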
Ridge Regression
Ridge regression is a linear regression technique that uses L2 regularization to prevent overfitting and improve the generalization performance of the model. The main idea behind Ridge regression is to add a penalty term to the cost function of the linear regression model. The penalty term is proportional to the square of the L2 norm (also known as the Euclidean norm) of the coefficients. The cost function for Ridge regression can be written as:
J(w) = RSS(w) + lambda * ||w||_2^2 = RSS(w) + lambda * Σ_{j=1 to n} w_j^2
where,
RSS(w) is the residual sum of squares (i.e. the sum of squared differences between the predicted and actual values of the target variable)
||w||_2^2 is the sum of the squared coefficients
The L2 penalty term also has a geometric interpretation. The sum of the squared values of the coefficients defines a circular (in higher dimensions, spherical) constraint region in the coefficient space. The objective is to find the point within this region that has the lowest error, which corresponds to the optimal values of the coefficients.
The regularization parameter controls the strength of the penalty term and is typically chosen using cross-validation. The Ridge regression model can be solved using an analytical/closed-form solution, which is given by:
w = (X^T X + lambda * I)^(-1) X^T y
where X is the design matrix, y is the vector of target values, and I is the identity matrix.
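As a quick sanity check, here is a minimal sketch (on synthetic data, with hypothetical variable names) that computes this closed-form Ridge solution with NumPy and compares it to scikit-learn's Ridge estimator:
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([1.5, -2.0, 0.0, 3.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

lam = 1.0  # regularization strength

# Closed-form Ridge solution: w = (X^T X + lambda * I)^(-1) X^T y
w_closed_form = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# scikit-learn's Ridge (fit_intercept=False so both solve the same problem)
ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y)

print(w_closed_form)
print(ridge.coef_)  # should be very close to the closed-form solution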
What is lambda, and why is it important?
As discussed, in Ridge and Lasso regression the regularization parameter lambda controls the strength of the penalty term that is added to the cost function. The penalty term is used to prevent overfitting of the model.
The behavior of the regression model is determined by the value of the regularization parameter (lambda). The penalty term is proportional to lambda, so different lambda values can produce noticeably different models.
Choosing the appropriate value of lambda is crucial because it manages the model's trade-off between bias and variance, resulting in a model that generalizes well to new data and can make reliable predictions.
What are the different techniques to calculate lambda?
The value of lambda needs to be carefully selected to balance the trade-off between bias and variance in the model. There are several techniques that can be used to select the optimal value of lambda for Lasso and Ridge regression. Here are some common techniques:
1. Cross-Validation: Cross-validation is a commonly used technique to select the optimal value of lambda. In this approach, the data is randomly partitioned into k folds, where k is typically 5 or 10. The model is trained on k-1 folds and validated on the remaining fold. This process is repeated k times, with each fold used exactly once as the validation set. The average validation error across all k folds is used as an estimate of the model's generalization error, and the lambda with the lowest average error is selected.
2. Grid Search: Grid search is a simple technique to select the optimal value of lambda. In this approach, a range of values for lambda is specified, and the model is trained and validated for each value of lambda in the range. The optimal value of lambda is the one that yields the lowest validation error (a GridSearchCV sketch follows this list).
3. Analytical Solution: For Ridge regression, there are analytical estimates of the optimal value of lambda. One such estimate is given by:

4. Adaptive Lasso: Adaptive Lasso is a variant of Lasso regression that adapts the penalty term to the importance of each feature. In this approach, the penalty on each coefficient is scaled by a weight derived from an initial estimate of that coefficient (typically the inverse of its absolute value), so that less important features are penalized more heavily. The optimal value of lambda is then selected using one of the above techniques, such as cross-validation or grid search.
These are some common techniques for selecting the optimal value of lambda in Lasso and Ridge regression. The choice of technique depends on the specific problem at hand and the size of the data.
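In practice, grid search and cross-validation are often combined. Here is a minimal sketch using scikit-learn's GridSearchCV to tune Ridge's alpha; X and y are assumed to already hold the features and target used in the cross-validation code below:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Grid of candidate lambda (alpha) values
param_grid = {'alpha': np.logspace(-5, 5, 50)}

# 10-fold cross-validation over the grid
grid = GridSearchCV(Ridge(), param_grid, cv=10)
grid.fit(X, y)

print('Best alpha:', grid.best_params_['alpha'])
print('Best cross-validation score:', grid.best_score_)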
Let's try one technique (cross-validation) on the dataset used in the previous post.
# imports (X and y are assumed to already hold the dataset from the previous post)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

# setting up different lambda values
lambdas = np.logspace(-5, 5, 1000)
# create empty lists to store the cross-validation scores
ridge_score = []
lasso_score = []
# perform ridge and lasso regression for each lambda
for lambda_val in lambdas:
    ridge = Ridge(alpha=lambda_val)
    lasso = Lasso(alpha=lambda_val)
    # store the mean 10-fold cross-validation score for each model
    ridge_score.append(np.mean(cross_val_score(ridge, X, y, cv=10)))
    lasso_score.append(np.mean(cross_val_score(lasso, X, y, cv=10)))
# Plot the cross-validation scores for Ridge and Lasso regression
plt.plot(lambdas, ridge_score, label='Ridge')
plt.plot(lambdas, lasso_score, label='Lasso')
plt.xscale('log')  # lambdas are log-spaced
plt.xlabel('Lambda')
plt.ylabel('Cross-validation score')
plt.legend()
plt.show()
# Find the optimal lambda value for Ridge and Lasso regression
optimal_ridge_lambda = lambdas[np.argmax(ridge_score)]
optimal_lasso_lambda = lambdas[np.argmax(lasso_score)]
print('Optimal lambda value for Ridge regression:', optimal_ridge_lambda)  # 44.3247
print('Optimal lambda value for Lasso regression:', optimal_lasso_lambda)  # 0.3195

Would you get the same optimal lambda values if you tried different techniques? If not, why does this happen?
It is not uncommon to see differences between the best hyper-parameter value obtained from grid search and the one obtained from cross-validation. Here are a few possible reasons for the difference:
Randomness: The performance of the model can be affected by the randomness in the data, the initialization of the model, and the random splits during cross-validation. The best hyper-parameter value obtained from Grid Search may not be the same as the one obtained from cross-validation due to the randomness involved.
Search space: The range of hyperparameter values tested in Grid Search may not have included the optimal value, or the range may have been too wide, leading to overfitting. On the other hand, the range of hyper-parameter values tested during cross-validation may have included the optimal value, leading to a different best hyper-parameter value.
Evaluation metric: Grid Search and cross-validation may use different evaluation metrics to select the best hyper-parameter value. In the case of Ridge regression, the evaluation metric is typically mean squared error (MSE), but other metrics such as R-squared can also be used. If different evaluation metrics are used, it can result in a different best hyper-parameter value.
It is important to carefully choose the range of hyper-parameter values to test and the evaluation metric to use, as well as to perform multiple runs of Grid Search and cross-validation to ensure consistency in the results.
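As a side note, the cross_val_score calls above use scikit-learn's default scoring for regressors (R-squared). Here is a minimal sketch of switching to mean squared error instead (assuming X and y are defined as before):
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# scoring='neg_mean_squared_error' returns negated MSE (higher is still better)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=10, scoring='neg_mean_squared_error')
print('Mean CV MSE:', -scores.mean())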
Can lambda have a negative value?
The regularization parameter lambda is typically non-negative in the context of Ridge and Lasso regression. This is because the regularization penalty term is intended to shrink the magnitude of the coefficients towards zero, and negative values of lambda would actually encourage larger coefficients, which is the opposite of the intended effect.
Furthermore, if lambda were negative, the optimization problem used to fit the model would become ill-posed, because the objective function would no longer be convex and no unique minimum would exist.
As a result, in practice, lambda is usually chosen from a set of non-negative values, and different values of lambda are evaluated on a validation set or via cross-validation. The optimal value of lambda is then determined based on the model's performance on that validation set or through cross-validation.
When is Lasso useful over Ridge, and vice versa? What aspects differentiate them?
Lasso and Ridge regression are both regularization techniques used to handle the problem of overfitting in linear regression. They differ in the way they penalize the magnitude of the coefficients in the model.
When we have a large number of features in the dataset and believe that only a subset of the features are relevant for predicting the target variable, we can use Lasso regression. Lasso tends to reduce the coefficients of irrelevant features to zero, effectively removing them from the model. This is because the L1 penalty term in the loss function encourages sparse solutions with only a few non-zero coefficients. Lasso regression is a feature selection method because it automatically selects a subset of features and reduces the dimensionality of the problem.
Ridge regression, on the other hand, is useful when we believe most features contribute to the prediction and we want to avoid overfitting by limiting the magnitude of the coefficients. Ridge regression tends to shrink all coefficients towards zero, but not exactly to zero, allowing all features to contribute to the prediction. This is because the L2 penalty term in the loss function encourages all coefficients to have small but non-zero values. Ridge regression can be viewed as a method of regularizing and simplifying the model.
In summary, Lasso regression is useful for feature selection and reducing problem dimensionality, whereas Ridge regression is useful for reducing overfitting and regularizing the model. The choice between Lasso and Ridge regression is often a matter of trial and error or domain expertise, depending on the specific problem and the characteristics of the dataset. To balance the strengths of both approaches, a combination of the two methods (Elastic Net) can be used, for example when the dataset has many highly correlated features.
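For completeness, here is a minimal sketch of scikit-learn's ElasticNet on a synthetic dataset (the numbers are purely illustrative), where l1_ratio controls the mix between the L1 (Lasso) and L2 (Ridge) penalties:
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic data with partly irrelevant features
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=0)

# alpha is the overall penalty strength; l1_ratio=0.5 mixes L1 and L2 equally
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print('Number of non-zero coefficients:', (enet.coef_ != 0).sum())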
Note: This and the previous post broadly reflect my thought process when solving the given problem.
A.N.: If you see anything that is incorrect, want to add something to this post, or want to ask questions, you can do that in the comments, or you can connect with me here.