Things we need to know about Linear Regression in R | Part A - Assumptions

While I was working on the final project for the course 'Statistics with R' at The University of Connecticut, I learned a lot about Regression Analysis. Since we had plenty of data coming from the solar PV systems installed by our customers in India, I decided to try Linear Regression on one of the systems in Bihar.

For those who landed directly on this part of the article, you can visit:
  1. PART B to learn how to fit different variations of the Linear Model.
  2. PART C to learn how to evaluate multiple models to find the best one.

Data Preparation

Since the data we receive is time-series, I converted it into a non-time-series dataset by aggregating it to one row per day. I used about a year of data, where the target variable was 'Energy Production in a Day' and the predictor variables were 'Average Solar Irradiation', 'Average Ambient Temperature' and 'Average Wind Speed' in a day. A minimal sketch of that aggregation is shown below.
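
Here is that aggregation sketched with dplyr, where 'raw_data' and all the column names are hypothetical stand-ins for our actual telemetry:

library(dplyr)
daily <- raw_data %>%
  mutate(day = as.Date(timestamp)) %>%    # hypothetical timestamp column
  group_by(day) %>%
  summarise(
    energy          = sum(energy_kwh),    # 'Energy Production in a Day'
    avg_irradiation = mean(irradiation),  # 'Average Solar Irradiation'
    avg_temperature = mean(ambient_temp), # 'Average Ambient Temperature'
    avg_wind_speed  = mean(wind_speed)    # 'Average Wind Speed'
  )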

Modeling

Prerequisites

The Linear Model assumes the following:
  • The target variable is normally distributed. You can check this assumption with the 'Anderson-Darling test of goodness of fit'. This test takes the null hypothesis [Ho] that the target variable is normally distributed. In our case the p-value was much less than the significance level [alpha = 0.05 by default], so we had to reject the null hypothesis: our target variable was not normally distributed. What you can do to make your target normal is a transformation. One option is a power transformation, where you raise the target variable to a positive or negative power, or take its logarithm, depending on the shape it has originally; see the sketch after the test below.
library(nortest)         # ad.test() comes from the nortest package
ad.test(target_variable) # Anderson-Darling test of normality
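As a minimal sketch of the transformation idea [assuming target_variable is strictly positive, which the log and fractional powers require]:
ad.test(log(target_variable)) # re-test normality after a log transformation
ad.test(target_variable^0.5)  # or after a square-root [power 0.5] transform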
  • The target variable, in our case 'Energy', has a linear relationship with the predictor variables. You can check this assumption by making a scatter plot of the target variable against each predictor variable.
plot(predictor_variable, target_variable) # predictor on the x-axis, target on the y-axis
  • The predictor variables are not correlated with each other. You can use the following pieces of code to check this. One way is pairs.panels() with the Pearson correlation [since the variables are continuous, Pearson is the relevant choice]; the other is the Variance Inflation Factor [VIF], sketched after the panel plot. The Pearson correlations between our predictor variables turned out to be quite low, so we can assume they are independent.
library(psych)
pairs.panels(dataframe, method = "pearson") # here you can visually see the correlations and the scatter plots
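For the VIF route, a minimal sketch using the car package [here 'fit' is assumed to be a fitted lm model, like the one we build in Part B]:
library(car) # provides vif()
vif(fit)     # values above ~5 [some use 10] suggest multicollinearity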
  • Each observation [row] in the dataset is independent of the others.
  • The residuals given by the model are randomly distributed around the mean [we'll test this later, when we do the modelling and plot the residuals against the predicted values; see the sketch after this list].
  • The residuals are normally distributed. This is also called multivariate normality [again, we'll check this when we do the modelling]
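For reference, base R's built-in diagnostic plots cover both of these residual checks; a minimal sketch, again assuming a fitted lm object called 'fit':
plot(fit, which = 1) # residuals vs fitted values: look for a random scatter around the mean
plot(fit, which = 2) # normal Q-Q plot: the residuals should follow the straight line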

Null Model

We start by creating a NULL model, which basically means that, with no predictor variables at all, the model predicts the average value of the target variable in the dataset for any future day. Obviously, any model we build later has to do better than just returning the average. Following is how you create a NULL model:

fit.null = lm(energy ~ 1, data = dataset) # intercept-only [NULL] model
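
A quick sanity check of that intuition, assuming the same 'dataset' with its 'energy' column: the only coefficient of the NULL model is exactly the sample mean.

coef(fit.null)       # the single coefficient [intercept]
mean(dataset$energy) # the same value: the NULL model always predicts the mean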


-- PART A ends. I'll write further about modelling in the next part.

Thanks for reading. Cheers!
