In this part, I'm going to walk through the modelling process. In the last article we fit a NULL model, which simply predicts the average value of our target variable; now we will fit a FULL model.
- Read PART A to learn the prerequisites and assumptions of Linear Regression
- Read PART C to learn how to evaluate multiple models to find the best one
FULL Model
A FULL model is one where we use all the predictor variables, irrespective of whether they are good at predicting the target or not. Here is how we fit one:
fit.full = lm(target~., data = dataset)
summary(fit.full)
# the '.' takes all the variables in the dataset, except the target,
# as predictors. If you would rather list the predictor variables
# explicitly, use the following:
fit.full = lm(target~P1+P2+P3+...+Pn, data = dataset)
summary(fit.full)
- The 'Call' shows the model that we ran
- The 'Residuals' summarize the errors between the actual and the predicted values of the model
- The 'Coefficients' section explains the impact of each predictor variable in the model. For example, the estimate of avgSI is 0.74, meaning that for every one-unit increase in avgSI, the target [Daily Energy] increases by 0.74 units, holding the other predictors constant. This section also shows the p-value of each predictor. As you can see above, avgWind has a p-value above the significance level [default = 0.05], so we can assume that the variable is not significant in our model and should be removed [see the sketch after this list for checking this programmatically].
- The last section lists the evaluation parameters of the model, like R-squared, the F-statistic, and the model's p-value. We will talk about them when we compare the models.
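If you would rather pull the p-values out programmatically than read them off the printout, here is a minimal sketch [using the fit.full object from above and the default 0.05 cutoff]:
coefs = summary(fit.full)$coefficients   # matrix with Estimate, Std. Error, t value, Pr(>|t|)
pvals = coefs[-1, "Pr(>|t|)"]            # p-values of the predictors [intercept dropped]
pvals[pvals > 0.05]                      # predictors above the significance level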
Now that we have a FULL model and can see that some predictors are not significant in estimating the energy, we should reduce the complexity of the model by removing these insignificant predictors.
REDUCED Model
When we remove non-significant predictors from our model, we call it the REDUCED model. There are quite a few ways to do it:
- Checking the p-value of each predictor: remove the one with the highest p-value [only if it is above the significance level] first and rerun the model. Then check the p-values again, removing predictors one at a time until all of them are below the significance level. This is a manual process and takes time and effort to figure out what to remove and what to keep; a single elimination step is sketched below.
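To make one elimination step concrete, here is a minimal sketch, assuming avgWind is the predictor with the highest non-significant p-value [as it was in the FULL model summary above]:
# drop the least significant predictor and refit; the '.' keeps everything else
fit.step1 = update(fit.full, . ~ . - avgWind)
summary(fit.step1)   # check the p-values again and repeat until all are significant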
- Running a VIF [Variance Inflation Factor] analysis. This also helps detect and remove multicollinearity among the predictor variables.
library(car)   # vif() is provided by the 'car' package
vif = vif(fit.full)
This runs a VIF analysis on the predictors of the FULL model. It treats one predictor as the target and runs a regression with all the others as predictors, repeating this until every predictor has had its turn as the target. The output is a single VIF value for each predictor.
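Under the hood, the VIF of a predictor is 1/(1 - R-squared) of that auxiliary regression. A minimal sketch for a single predictor [P1 is a placeholder name, as in the lm() call earlier]:
# regress P1 on all the other predictors [the target is excluded from the formula]
aux = lm(P1 ~ . - target, data = dataset)
1 / (1 - summary(aux)$r.squared)   # the VIF value that vif() reports for P1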
As a rule of thumb, take out the predictor with the highest VIF value first and rerun the VIF analysis. Keep doing that until all the predictor variables left in the reduced model have a VIF below 5.
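Here is that rule of thumb sketched as a loop, assuming the car package is loaded, all predictors are numeric [so vif() returns a simple named vector], and at least two predictors survive [vif() needs a model with two or more terms]:
# start from the FULL model and drop the worst predictor until all VIFs are below 5
fit.reducedVIF = fit.full
while (max(vif(fit.reducedVIF)) >= 5) {
  worst = names(which.max(vif(fit.reducedVIF)))   # predictor with the highest VIF
  fit.reducedVIF = update(fit.reducedVIF, as.formula(paste(". ~ . -", worst)))
}
vif(fit.reducedVIF)   # every remaining predictor now has VIF < 5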
- Running a stepwise regression. There are three types of stepwise regression: forward, backward, and both. You can use any of them, but forward is the simplest form since it starts with only one term [the NULL model's intercept] and adds predictors one at a time. In general, stepwise regression automates the process of trying out different combinations of predictors, adding or removing them based on how much each change improves the model [R's step() uses AIC for this].
# forward: start from the NULL model; step() needs a scope telling it
# how far the model is allowed to grow
fit.reducedFW = step(fit.null, scope = formula(fit.full), direction = "forward")
# backward: start from the FULL model and drop predictors one at a time
fit.reducedBW = step(fit.full, direction = "backward")
# both: consider adding and dropping a predictor at every step
fit.reducedBoth = step(fit.full, direction = "both")
The output of a stepwise run shows the same results as the FULL model, with one additional parameter: AIC. We will also talk about this parameter during the evaluation of the models, but for now, note that step() prints the starting AIC and then the AIC at each step as it adds or removes predictors. The lower the AIC, the better.
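Since each direction can land on a different subset of predictors, you can put the three fitted models side by side with base R's AIC() [a minimal sketch using the objects fitted above]:
AIC(fit.reducedFW, fit.reducedBW, fit.reducedBoth)   # one row per model; lower AIC is better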
In the next part, I will talk about evaluation of all these model variations and choosing the best one.
Thanks for reading! Cheers!