7rishi20ss

Analysis of a Linear Regression Model

Updated: Feb 21, 2023

As a Machine Learning Engineer, it would be a dream come true to be given a problem, apply a machine learning algorithm to it, and get an accuracy greater than 95%. Since this is very good accuracy, you consider it a great model, submit the assignment or publish your work at the company, go to bed and sleep peacefully, and the next day the delivery team appreciates you because you did everything well. Happy ending. If only this were true.....


Note: This post is intended for readers who have some knowledge of linear regression.


Performance analysis is very important for any algorithm, for the following reasons:

  1. To determine the model's accuracy: The accuracy of a linear regression model in predicting outcomes can be determined through performance evaluation. The evaluation can reveal how well the model fits the data and how accurate its predictions are.

  2. To identify the strengths and weaknesses of the model: By evaluating the performance of a linear regression model, one can identify areas where the model performs well and areas where it needs improvement. This information can be used to refine the model and make it more effective.

  3. To compare different models: Performance evaluation enables the comparison of different linear regression models to determine which one performs better. This can be useful in selecting the best model for a particular application.

  4. To validate the model: By evaluating the performance of a linear regression model, one can validate whether or not it meets its intended purpose. This can help to ensure that the model is useful and reliable for the intended application.

  5. To improve the model's performance: Performance evaluation can reveal areas where the model needs improvement, which can help in refining the model and making it more accurate and reliable.

We know that performing an evaluation is a must, so now the question arises:

How can you analyse a linear regression model?

There are a few techniques we can use to analyse a regression model:

  1. Comparing the values of the metrics: You can compare the values of the different metrics to determine which model performed best. In general, a lower MSE, RMSE, and MAE, and a higher R2 and Adjusted-R2 indicate better performance.

  2. Evaluating the significance of the coefficients: You can evaluate the significance of the coefficients in the model to determine which independent variables are most important in predicting the dependent variable. This can be done by examining the p-values or confidence intervals of the coefficients.

  3. Checking the assumptions of linear regression: You can check the assumptions of linear regression to ensure that the model is valid and reliable. This includes checking for linearity, normality, homoscedasticity, and independence of errors.

  4. Visualizing the results: You can visualize the results of the model by plotting the predicted values against the actual values, plotting the residuals against the predicted values, and creating other visualizations to help interpret the results.

  5. Testing the model on new data: You can test the model on new data to determine how well it generalizes to unseen data. This can be done by splitting the data into training and testing sets, or by using cross-validation techniques.

Overall, analyzing a linear regression model entails considering multiple factors, and by doing so, you can determine the model's strengths and weaknesses and make informed decisions about how to improve its performance.
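
To make a couple of these techniques concrete, here is a minimal sketch, assuming a pandas DataFrame df with feature columns and a numeric column named target (hypothetical names, not from the original post). It fits an ordinary least squares model with statsmodels to inspect the p-values and confidence intervals of the coefficients, and holds out a test set for checking generalization.

```python
# Minimal sketch: coefficient significance and a held-out test set.
# `df` and its `target` column are hypothetical names for your data.
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

X = df.drop(columns=["target"])
y = df["target"]

# Hold out a test set so we can later check how the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# statsmodels needs an explicit intercept column.
ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()

print(ols.summary())      # coefficients, p-values, R2, and more
print(ols.pvalues)        # significance of each coefficient
print(ols.conf_int())     # 95% confidence intervals

# Predictions on unseen data, reused in the metric examples below.
y_pred = ols.predict(sm.add_constant(X_test))
```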


Great, now that you have these methods, the evaluation should be done, right? Or is it? A few questions might arise: what is a metric? How can a metric help us evaluate the model? And why are there so many metrics available; can't we have just one metric that decides whether our model is working well or not?


How do you evaluate the performance of a linear regression model?
  1. Mean Squared Error (MSE): MSE measures the average squared difference between the predicted values and the actual values. It is calculated as the sum of the squared differences divided by the number of observations. MSE penalizes large errors more heavily than small errors, and it is useful for comparing the performance of different models.

  2. Root Mean Squared Error (RMSE): RMSE is the square root of the MSE and represents the standard deviation of the residuals. It is calculated as the square root of the sum of the squared differences divided by the number of observations. RMSE is a more interpretable metric than MSE because it is in the same units as the dependent variable.

  3. R-squared (R2): R-squared is a measure of how well the model fits the data, with higher values indicating a better fit. It represents the proportion of variance in the dependent variable that is explained by the independent variables. R2 ranges from 0 to 1, with 1 indicating a perfect fit. However, R2 can be misleading if the model is overfitting or if it does not account for all relevant variables.

  4. Adjusted R-squared: Adjusted R-squared is similar to R-squared, but takes into account the number of independent variables in the model. It is calculated as 1 - [(1 - R2) * (n - 1) / (n - p - 1)], where n is the number of observations and p is the number of independent variables. Adjusted R-squared is a better measure of model complexity and can help prevent overfitting.

  5. Mean Absolute Error (MAE): MAE measures the average absolute difference between the predicted values and the actual values. It is calculated as the sum of the absolute differences divided by the number of observations. MAE is less sensitive to outliers than MSE and RMSE, and it can be useful for models that require a more robust evaluation metric.

  6. Residual plots: Residual plots can be used to visualize the differences between the predicted values and the actual values. A good model should have residuals that are randomly scattered around zero, indicating that the errors are normally distributed and there are no systematic patterns in the errors.
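
As a rough sketch of how these metrics can be computed, here is an example using scikit-learn, NumPy, and matplotlib, reusing the hypothetical y_test, y_pred, and X_test from the earlier snippet:

```python
# Rough sketch: MSE, RMSE, MAE, R2, adjusted R2, and a residual plot.
# `y_test`, `y_pred`, `X_test` are the hypothetical objects from the earlier snippet.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

n, p = len(y_test), X_test.shape[1]        # observations, independent variables
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}  adj R2={adj_r2:.3f}")

# Residual plot: a good model shows residuals scattered randomly around zero.
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()
```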

If we have all the metrics available to us what can we state about the model?
  1. MSE, RMSE, and MAE provide information on the magnitude of the errors between the predicted and actual values of the target variable. If all three metrics are low, it suggests that the model is accurate and can make reliable predictions.

  2. R-squared provides information on the proportion of variance in the target variable that is explained by the model. A high R-squared value indicates that the model fits the data well and is able to capture a large proportion of the variability in the target variable.

  3. Adjusted R-squared provides a similar measure to R-squared, but takes into account the number of predictors in the model. A high adjusted R-squared value indicates that the model is able to explain a large proportion of the variability in the target variable, while controlling for the number of predictors.

  4. The residual plot can provide information on the pattern of the errors in the model. A random scatter of the residuals around zero suggests that the model is a good fit for the data, while a non-random pattern may suggest that the model is not capturing all the information in the data.

By considering all of these metrics together, we can obtain a more comprehensive picture of the performance of the regression model and make informed decisions about its usefulness for predicting the target variable.


If we have MSE, RMSE and MAE of a particular model what can be stated about the model?

If we have MSE, RMSE, and MAE values for a particular regression model, we can use them to gain a more comprehensive understanding of the model's performance. Here are some possible conclusions we could draw:

  1. If the MSE, RMSE, and MAE values are all low, it suggests that the model is performing well and accurately predicting the target variable.

  2. If the MSE value is significantly higher than the RMSE and MAE values, it suggests that the model is more sensitive to outliers and large errors. This may indicate that there are some extreme values in the data that are influencing the model's predictions.

  3. If the RMSE value is significantly higher than the MSE and MAE values, it suggests that the model is more sensitive to the scale of the errors. This may indicate that the errors are not distributed normally, or that there are some extreme values that are influencing the standard deviation.

  4. If the MAE value is significantly higher than the MSE and RMSE values, it suggests that the model is more sensitive to the absolute differences between the predicted and actual values, rather than the squared differences. This may indicate that the model is better at predicting values that are closer to the mean of the target variable, but less accurate at predicting extreme values.
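
A quick way to see these different sensitivities is a toy example with made-up residuals (not from the post): adding a single large error inflates MSE and RMSE far more than MAE.

```python
# Toy illustration: one extreme residual inflates MSE/RMSE far more than MAE.
import numpy as np

errors = np.array([1.0, -2.0, 1.5, -1.0])          # small residuals
errors_with_outlier = np.append(errors, 20.0)       # one extreme residual

for e in (errors, errors_with_outlier):
    mse = np.mean(e ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(e))
    print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}")
```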

If we are generating these values using the same regression model, then why do we get these kinds of variations, and what can we do to solve this problem?

Let's try to give a reason for each case:

  1. For MSE greater than RMSE or MAE ==>

    1. Scale of the target variable: If the scale of the target variable is large, then the MSE value will also be large due to the squared differences between the predicted and actual values.

    2. Skewed distribution of the target variable: If the distribution of the target variable is skewed, then the MSE value may be higher due to the influence of extreme values. In such cases, transforming the target variable using techniques such as logarithmic or square root transformation may help to reduce the skewness and improve the model's performance.

    3. Non-linear relationship between predictor and target variables: If the relationship between the predictor and target variables is non-linear, then the MSE value may be higher due to the model's inability to capture the non-linear patterns in the data. In such cases, using non-linear models such as polynomial regression or decision trees may help to improve the model's performance.

    4. Correlated predictor variables: If the predictor variables are highly correlated with each other, then the MSE value may be higher due to multicollinearity, which can cause instability in the model's coefficients and predictions. In such cases, using regularization techniques such as Ridge or Lasso regression may help to improve the model's performance.

  2. For RMSE greater than MSE or MAE ==>

    1. Non-normal distribution of errors: If the errors are not normally distributed, the RMSE can be inflated, since it is sensitive to extreme values. In this case, one possible solution is to transform the target variable or predictor variables to normalize their distribution. This can be achieved using techniques like log transformation, square-root transformation, or Box-Cox transformation.

    2. Outliers or extreme values: If there are some extreme values in the data, they can have a significant impact on the standard deviation of the errors, which in turn affects the RMSE. In this case, one possible solution is to remove or downweight these outliers in the data preprocessing step, or to use a different model that is less sensitive to outliers, such as a decision tree or random forest.

    3. Non-linear relationship between predictor and target variables: If the relationship between the predictor and target variables is non-linear, the model may struggle to accurately capture the relationship, especially if the predictor variable has a large range. In this case, one possible solution is to use a different model that is more robust to scale, such as a support vector machine or a neural network, or to include additional features or interactions in the model to capture more complex relationships between the predictor and target variables.

    4. Measurement units or scaling of variables: If the predictor and target variables have different measurement units or scales, this can affect the standard deviation of the errors, and hence the RMSE. In this case, one possible solution is to standardize or normalize the variables so that they are on the same scale, or to use a model that is less sensitive to scale, such as a decision tree or random forest.

  3. For MAE greater than MSE or RMSE ==>

    1. Non-normal distribution of errors: If the errors between the predicted and actual values are not normally distributed, then the MAE value may be higher because it is more sensitive to extreme values. This may indicate that there are some outliers or skewness in the data that are affecting the model's predictions. One possible solution is to transform the target variable or predictor variables to normalize their distribution, such as using a log transformation or a Box-Cox transformation. Another solution is to use a different model that is more robust to non-normality, such as a quantile regression or a robust regression.

    2. Heteroscedasticity: If the variance of the errors is not constant across the range of the predictor variable, then the MAE value may be higher because it is not accounting for the variability in the errors. This may indicate that the model is not capturing all of the relevant features or interactions in the data. One possible solution is to include additional predictor variables or interaction terms that capture the sources of variability in the errors. Another solution is to use a different model that is more robust to heteroscedasticity, such as a weighted regression or a generalized linear model.

    3. Underfitting: If the model is not complex enough to capture the true relationship between the predictor and target variables, then the MAE value may be higher because it is not accurately predicting the target variable. This may indicate that the model is too simplistic or missing important features or interactions in the data. One possible solution is to use a more complex model, such as a polynomial regression or a neural network, that can capture more complex relationships between the predictor and target variables. Another solution is to include additional features or interactions in the model to capture more of the variability in the data.

    4. Overfitting: If the model is too complex and is fitting the noise in the data rather than the true relationship between the predictor and target variables, then the MAE value may be higher because it is not generalizing well to new data. This may indicate that the model is too sensitive to the specific values in the training data and is not able to make accurate predictions on new data. One possible solution is to use regularization techniques, such as ridge regression or Lasso regression, to penalize overly complex models and improve generalization performance. Another solution is to use cross-validation techniques to evaluate the model's performance on new data and select the optimal model based on this performance.
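
Several of the remedies mentioned above, such as transforming a skewed target, regularizing correlated predictors, and judging the result with cross-validation, can be sketched roughly as follows (again reusing the hypothetical X and y from earlier):

```python
# Rough sketch of common remedies: log-transform a skewed target, regularize
# with Ridge, and evaluate with cross-validation.
# `X` and `y` are the hypothetical features and target from earlier snippets.
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),   # shrinks unstable, correlated coefficients
    func=np.log1p,                # reduce right-skew in the target
    inverse_func=np.expm1,
)

# 5-fold cross-validated RMSE (scikit-learn reports it as a negative score).
scores = cross_val_score(model, X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
print("Cross-validated RMSE:", -scores.mean())
```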

MSE vs RMSE

We know that RMSE is just the square root of MSE, so why do we even have these two terms if one is simply the square root of the answer obtained by the other?


So let's try to understand why we have two metrics that behave almost the same, yet both are equally important to us.


MSE and RMSE measure the same underlying quantity: MSE is the average squared error between the predicted and actual values of the target variable, and RMSE is its square root.


The RMSE is particularly useful because it has the same units as the target variable, making it easier to interpret. For example, if the target variable is measured in dollars, the RMSE will also be in dollars, which is easier to understand than the squared units of MSE.
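
A tiny worked example with made-up numbers makes the units point concrete: if the residuals on three predictions are 10, -20, and 30 dollars, the MSE is about 466.7 squared dollars, while the RMSE is about 21.6 dollars.

```python
# Made-up residuals in dollars: MSE is in squared dollars, RMSE is in dollars.
import numpy as np

residuals = np.array([10.0, -20.0, 30.0])
mse = np.mean(residuals ** 2)   # (100 + 400 + 900) / 3 ≈ 466.7 dollars^2
rmse = np.sqrt(mse)             # ≈ 21.6 dollars
print(mse, rmse)
```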


However, there are some cases where using MSE may be more appropriate. One reason is that the MSE is a widely used and well-known metric for evaluating regression models, and it is often reported alongside other metrics like R-squared. Additionally, the MSE has some mathematical properties that can make it easier to work with in certain situations. For example, the MSE of a baseline model that always predicts the mean of the target variable is equal to the variance of the target variable.


Another reason to use MSE is that it can be more sensitive to outliers than the RMSE. Because the MSE stays on the squared scale, large errors stand out more in the MSE than in the RMSE. This means that if there are outliers in the data that are causing large errors, the MSE may be more informative than the RMSE.


In summary, both MSE and RMSE are useful metrics for evaluating regression models, and the choice of which one to use depends on the specific problem at hand and the nature of the data. In general, it's a good idea to report both metrics to provide a more complete picture of the model's performance.


" The MSE values for a set of predictions is always equal to the variance of the target variable" What does that even mean? Why having this condition useful to us?

The fact that the MSE of the mean-predicting baseline equals the variance of the target variable has some useful implications in practice.


Firstly, it means that if we have a set of predictions and their corresponding actual values, we can calculate the variance of the target variable and compare it to the model's MSE. If the MSE is much smaller than the variance of the target variable, the model is explaining a large share of the variability in the target variable. On the other hand, if the MSE is close to, or larger than, the variance of the target variable, the model is doing little better than simply predicting the mean, which suggests that it is not doing a good job of predicting the target variable.


Secondly, this relationship between the MSE and the variance of the target variable can help us to better understand the sources of error in the model. Specifically, if the model's MSE is larger than the variance of the target variable, the model is making errors that are not explained by the inherent variability in the target variable. This could be due to a number of factors, such as an incorrect choice of features, an insufficient amount of data, or a poor choice of model architecture.
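
A minimal sketch of using the variance of the target as that reference point, reusing the hypothetical y_test and y_pred from earlier (note that R-squared is exactly one minus the ratio of the model's MSE to the variance of the target):

```python
# Using the variance of the target as a baseline reference.
# `y_test`, `y_pred` are the hypothetical objects from the earlier snippets.
import numpy as np

y_true = np.asarray(y_test, dtype=float)
y_hat = np.asarray(y_pred, dtype=float)

variance = np.var(y_true)                    # MSE of always predicting the mean
model_mse = np.mean((y_true - y_hat) ** 2)

print("Variance of target:", variance)
print("Model MSE:         ", model_mse)
print("R2 = 1 - MSE/variance:", 1 - model_mse / variance)
```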


In summary, the fact that the MSE of the mean-predicting baseline is always equal to the variance of the target variable provides a useful reference point for evaluating the performance of regression models, and it can help us to better understand the sources of error in the model.


Given R2 or adjusted R2 which metric is more useful in which situation?

R-squared (R2) and adjusted R-squared are both metrics used to evaluate the goodness of fit of a linear regression model. However, adjusted R-squared is a modification of R-squared that takes into account the number of predictor variables in the model, while R-squared does not.


Adjusted R-squared is generally considered more useful than R-squared in situations where the number of predictor variables is high, as it penalizes the model for including unnecessary variables that do not improve the model's predictive power. R-squared, on the other hand, tends to increase as more variables are added to the model, even if those variables are not useful predictors of the target variable.


In general, adjusted R-squared should be used when comparing models with different numbers of predictor variables. If two models have similar R-squared values, but one has fewer predictor variables, then the model with fewer variables and a higher adjusted R-squared value is likely to be the better model. However, if two models have the same number of predictor variables, then R-squared may be a sufficient metric for comparing their performance.
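
A small synthetic experiment (hypothetical data, not from the post) shows this penalty in action: adding a purely random feature cannot lower the training R-squared, but it typically lowers the adjusted R-squared.

```python
# Synthetic demo: a useless random feature raises R2 slightly on the training
# data, but adjusted R2 is typically lower because of the penalty for extra p.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=(n, 1))
y = 3 * x1[:, 0] + rng.normal(scale=1.0, size=n)
noise = rng.normal(size=(n, 1))                 # feature unrelated to y

def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

for X in (x1, np.hstack([x1, noise])):
    p = X.shape[1]
    r2 = r2_score(y, LinearRegression().fit(X, y).predict(X))
    print(f"p={p}  R2={r2:.4f}  adjusted R2={adjusted_r2(r2, n, p):.4f}")
```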



In conclusion, analysis is something that depends on a variety of factors, and everyone has their own way of analysing things, which may not all be included in this post. Performance evaluation is a repetitive and tiring process that every ML Engineer has to go through, and it is very important.


AN: I have tried to raise the questions that I feel are needed to understand the depth of performance evaluation. It may be a very long post, but once you go through it, it should clear up most of the doubts. And if you think I missed some important question, reach out to me here.
