This is where we will learn how to evaluate all the model variations we fit on our dataset. This is the most critical part, as it decides which model is best for the given situation. Since we are modelling a target that is continuous in nature, we will only talk about the relevant evaluation parameters, i.e. R^2, RMSE, MAE, AIC and BIC, plus ANOVA for comparing nested models.
For those who landed directly on this part of the article, you can visit:
Before we start comparing the models, let me give some understanding of these evaluation parameters:
R^2 [coefficient of determination]
For simple linear regression, this is equal to r^2, i.e. the square of the coefficient of correlation. As both my professors Prof. Dave Wanik and Prof. Jennifer Eigo never fail to emphasise, R^2 is a good evaluation parameter, but NOT GOOD ENOUGH. The reason is this: R^2 tells you the percentage of variance in the target variable that the predictor variables can explain, or in other words, how tightly the observed values cluster around the regression line. The closer they are to the regression line, the higher the R^2. Picture two scatter plots side by side: in the left one the points are spread far from the line (lower R^2), while in the right one they hug the line (higher R^2). However, R^2 has its own limitations:
1) It doesn't explain all the variation. Sometimes the dataset has inherent noise that no model can capture. In this case the R^2 will be low, but that doesn't mean the model is bad.
2) It doesn't reveal trends in the residual plot, if any.
3) It doesn't detect overfitting. You may just be adding more and more variables that may or may not make sense, and the R^2 keeps going higher. Further, there might be multicollinearity, and R^2 doesn't flag that either. The adjusted R^2 does penalise for the additional predictors added to the model, but you get the gist!
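To make limitation 3 concrete, here is a minimal sketch in Python (assuming numpy and scikit-learn; the data and variable names are made up for illustration). We append pure-noise predictors to a one-signal model and watch plain R^2 creep upward while adjusted R^2 refuses to follow:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=(n, 1))
y = 2 * x[:, 0] + rng.normal(size=n)  # one real predictor plus noise

def r2_and_adjusted(X, y):
    r2 = LinearRegression().fit(X, y).score(X, y)
    n_obs, k = X.shape
    adj = 1 - (1 - r2) * (n_obs - 1) / (n_obs - k - 1)
    return r2, adj

for extra in [0, 5, 20]:
    # append `extra` columns of pure noise as fake predictors
    X = np.hstack([x, rng.normal(size=(n, extra))])
    r2, adj = r2_and_adjusted(X, y)
    print(f"{extra:2d} junk predictors: R^2 = {r2:.3f}, adjusted R^2 = {adj:.3f}")
```

Plain R^2 can only go up as columns are added; adjusted R^2 subtracts a penalty for each extra predictor, which is exactly why it is the safer of the two.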
Root Mean Square Error
This is the square root of the mean of the squared residuals. If the main purpose of the model is prediction, then this parameter is of pretty high importance. RMSE tells you how close your predicted values are to the actual values of the dataset, in the same units as the target. The lower the RMSE, the better the fit. Further, because each residual is squared before averaging, RMSE penalises large errors heavily. So if the RMSE is unusually high, there are chances that the model is struggling with outliers (see the small numeric sketch after the MAE section below).
Mean Absolute Error
This is Mean Absolute Error and it is just the mean of the absolute values of the residuals. This parameter is easy to explain but doesn't really penalise outliers: every unit of error counts the same, so you'll know the typical size of your residuals, but a few huge misses won't stand out the way they do in RMSE.
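Here is the promised sketch of the difference, using plain numpy and made-up numbers. One badly missed prediction inflates RMSE far more than MAE:

```python
import numpy as np

def rmse(y_true, y_pred):
    # square, average, then take the root: big errors dominate
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    # every unit of error counts the same
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred = np.array([10.5, 11.5, 11.5, 12.5, 12.5])  # uniform 0.5 errors
print(rmse(y_true, y_pred), mae(y_true, y_pred))   # 0.5 and 0.5

y_pred[0] = 20.0                                   # one wild miss
print(rmse(y_true, y_pred))                        # ~4.49
print(mae(y_true, y_pred))                         # 2.4
```

With uniform errors the two metrics agree; the single outlier drags RMSE to roughly 4.49 while MAE only rises to 2.4. Comparing the two side by side is a quick outlier smoke test.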
AIC and BIC
The only real difference between these two parameters is that BIC's penalty grows with the sample size: AIC = -2(log-likelihood) + 2K and BIC = -2(log-likelihood) + log(n)K, where K is the number of parameters estimated by the model and n is the sample size. These two parameters are used when comparing multiple models together, and the lower the value, the better the model. In stepwise selection, for example, the AIC reduces at each step and the procedure stops at the step that gives the lowest AIC.
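A minimal sketch of how these criteria are read in practice, assuming statsmodels (the data is synthetic and the names are mine): fit the same regression with and without a useless predictor and compare the fitted models' aic and bic attributes; lower wins.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)            # pure noise, unrelated to y
y = 3 * x1 + rng.normal(size=n)

X_reduced = sm.add_constant(x1)                      # intercept + x1
X_full = sm.add_constant(np.column_stack([x1, x2]))  # intercept + x1 + x2

fit_reduced = sm.OLS(y, X_reduced).fit()
fit_full = sm.OLS(y, X_full).fit()

print(f"reduced: AIC = {fit_reduced.aic:.1f}, BIC = {fit_reduced.bic:.1f}")
print(f"full:    AIC = {fit_full.aic:.1f}, BIC = {fit_full.bic:.1f}")
# Both criteria should come out lower for the reduced model, with BIC
# punishing the useless extra predictor more harshly at this sample size.
```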
ANOVA
Analysis of Variance is used to compare a FULL model against REDUCED models that are nested within it, i.e. that use a subset of the FULL model's predictor variables. ANOVA gives you a number of parameters to analyse, but the most important one is the F-statistic. The higher the F-statistic (and the smaller its p-value), the stronger the evidence that the extra predictors in the FULL model genuinely improve the fit over the REDUCED one.
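As a sketch of what that comparison looks like in code (again assuming statsmodels; the formulas and data are illustrative), anova_lm runs the partial F-test between two nested OLS fits:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2 * df["x1"] + 1.5 * df["x2"] + rng.normal(size=n)

reduced = smf.ols("y ~ x1", data=df).fit()      # REDUCED: drops x2
full = smf.ols("y ~ x1 + x2", data=df).fit()    # FULL: all predictors

# Partial F-test: does adding x2 significantly cut the residual sum of squares?
print(anova_lm(reduced, full))
```

A large F with a tiny p-value here would tell you x2 earns its place in the FULL model.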
I hope this was helpful.
I'd also like to thank both my professors Prof. Dave Wanik and Prof. Jennifer Eigo for putting great effort into making modelling easy for all students.
Thanks. Cheers!