In this tutorial, we’ll cover how to perform a simple linear regression step by step. We’ll then look at how to do the same thing more quickly and easily using the lm() function in R. We’ll also look at some easy ways to perform diagnostics for our regression analysis. For this tutorial, we’ll be using the cars dataset available in R to predict stopping distances (dist) of cars based on their speed before stopping. The R and package versions used in this tutorial are shown below.
## [1] "R version 4.0.2 (2020-06-22)"
## Package Version
## ggplot2 3.3.2
## olsrr 0.5.3
Before we get started, let’s take a little peek at the data. In this data frame, we have two variables, speed and dist.
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
Our first step for conducting the regression ‘by hand’ is to create our compact model. This is a model that will attempt to predict our outcome variable, dist, using no parameters other than the mean of dist. We can express this model in equation form: \[Y_i=\beta_0+\epsilon_i \]
Let’s go ahead and calculate that mean and assign it to mean_dist.
(mean_dist <- mean(cars$dist))
## [1] 42.98
Now that we have mean_dist, we can calculate the error for our compact model, error_c, by subtracting mean_dist from the original dist variable.
cars$error_c <- cars$dist - mean_dist
We then need to square these values to get the squared error of our compact model, sq_error_c.
cars$sq_error_c <- cars$error_c^2
Now that we have sq_error_c, we can simply add those values together to get the sum of squared error of our compact model, sse_c.
(sse_c <- sum(cars$sq_error_c))
## [1] 32538.98
Next, we’ll begin working on our augmented model. This model will contain all of the parameters from our compact model in addition to any parameters we wish to add. We only have one other variable in our data frame that we can use as a parameter, speed, so let’s write out that equation and then calculate our mean_speed. \[Y_i=\beta_0+\beta_1X_{i1}+\epsilon_i \]
(mean_speed <- mean(cars$speed))
## [1] 15.4
Now that we have mean_speed, we can begin calculating the slope of our speed variable. First, we need to calculate the individual differences between each observation of speed and mean_speed. We’ll then use this vector to calculate the numerator and denominator of our slope formula.
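In equation form, the slope we’re calculating is: \[\beta_1=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2} \]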
cars$x_diff <- cars$speed - mean_speed
Let’s take our differences, x_diff, and multiply them by our error from the compact model, error_c. We’ll then add these values together to get our slope numerator.
(slope_num <- sum(cars$x_diff * cars$error_c))
## [1] 5387.4
To calculate our slope denominator, we need to square x_diff and then sum the values.
(slope_denom <- sum(cars$x_diff^2))
## [1] 1370
We now have both of the required elements to calculate our slope value. We’ll simply divide slope_num by slope_denom.
(slope <- slope_num/slope_denom)
## [1] 3.932409
Because we now know the value of our speed variable’s slope, we can calculate the intercept value by subtracting the product of the slope of speed and mean_speed from the mean of distance, mean_dist.
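In equation form: \[\beta_0=\bar{Y}-\beta_1\bar{X} \]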
(intercept <- mean_dist - slope * mean_speed)
## [1] -17.57909
From here, we can begin calculating the sum of squared error for our augmented model. The first step will be to calculate the dist values that our model will predict given the slope and intercept values we calculated.
cars$pred_a <- intercept + slope * cars$speed
We can then calculate the error between our predicted values, pred_a, and the actual distance values.
cars$error_a <- cars$dist - cars$pred_a
Now we’ll square those values…
cars$sq_error_a <- cars$error_a^2
…and sum them to get our sum of squared error for our augmented model, sse_a.
(sse_a <- sum(cars$sq_error_a))
## [1] 11353.52
We’ll now calculate the proportional reduction in error (PRE) which represents the proportion of error from the compact model that is reduced or eliminated when we use the more complex augmented model.
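In equation form: \[PRE=\frac{SSE_C-SSE_A}{SSE_C}=1-\frac{SSE_A}{SSE_C} \]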
(pre <- 1 - (sse_a/sse_c))
## [1] 0.6510794
We can see that adding the speed predictor to the model reduced the error by about 65%.
We can then use the pre value to calculate our F-value. We’ll start off by calculating the numerator of F by dividing pre by the difference in the number of parameters between the augmented and compact models.
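Written out in full, with PA and PC denoting the number of parameters in the augmented and compact models, the formula is: \[F=\frac{PRE/(PA-PC)}{(1-PRE)/(n-PA)} \]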
(f_num <- pre/(2-1)) # parameters in aug - params in comp
## [1] 0.6510794
Next, we’ll calculate the denominator of F using pre. The denominator is the remaining proportion of error, 1 - pre, divided by the sample size (n) minus the number of parameters in the augmented model.
(f_denom <- (1 - pre)/(length(cars$dist) - 2)) # n - param in aug
## [1] 0.00726918
Finally, we can calculate F by dividing f_num by f_denom.
(f_stat <- f_num/f_denom)
## [1] 89.56711
We can also calculate our t-value by taking the square root of f_stat.
(t_stat <- sqrt(f_stat))
## [1] 9.46399
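If you also want the p-value that goes along with these statistics, it isn’t part of the by-hand steps above, but you can get it from the F (or t) distribution using base R; it should match the value that lm() reports below.
pf(f_stat, df1 = 1, df2 = length(cars$dist) - 2, lower.tail = FALSE) # p-value from F(1, 48)
2 * pt(t_stat, df = length(cars$dist) - 2, lower.tail = FALSE) # equivalent two-tailed p-value from t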
Using lm()
Doing a simple regression following the above steps is useful for understanding how the values are derived, but it isn’t very practical for everyday use. Below, we’ll look at how to quickly conduct a regression analysis in R using the lm() function.
To start, lm() uses a formula to specify the analysis that you’re wanting to conduct. Simply put, these formulas are formatted outcome variable ~ predictor variable. If you’re looking to use multiple predictor variables, give the Multiple Regression tutorial a read-through; in that tutorial, we cover how to specify multiple predictors.
The lm() function has many potential arguments that we won’t use in this analysis, so if you have any questions about using lm() that aren’t answered here, you can always run ?lm in your console to bring up the documentation for the function.
In the below code, you can see that we’re predicting dist from speed.
mod_a <- lm(dist ~ speed, data = cars)
We ran our model, but to see the results of the analysis, we need to use the summary() function.
summary(mod_a)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Think back to the values that we calculated earlier. Do they match the values shown in the summary output?
They’re the same! Additionally, we can see that the PRE value is reported as Multiple R-squared.
You may also want to calculate confidence intervals around your predictor. This is easy to do with the confint() function. You just need to supply the object you want confidence intervals for (our model, mod_a) and the level (such as 90%, 95%, etc.).
confint(object = mod_a, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -31.167850 -3.990340
## speed 3.096964 4.767853
Now that we’ve completed our analysis, we’ll cover some ways to run diagnostics on your data to ensure that your analysis isn’t being unduly influenced by a small subset of your data points.
When doing a simple linear regression, you can often get a good idea of any potential outliers by creating a scatterplot of your predictor and outcome variables. We’ll do that below using the ggplot2 package. If you need a refresher on making plots in R, check out the Intro to ggplot2 and Graphing in ggplot2 tutorials on this site.
ggplot(data = cars, mapping = aes(x = speed, y = dist)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE)
We can see that there are a few points on the graph that may be unduly influential. We’ll use a few more techniques to determine if we need to do something about these data points.
Leverage values are useful for identifying aberrant values, and we have a few ways to easily obtain and visualize these values in R. First, let’s create a new variable in our data frame that contains the leverage value for each observation. There isn’t a hard-and-fast cutoff to indicate whether a leverage value is too high, so let’s calculate the median value and use that to see where the middle point is.
cars$lev <- hat(model.matrix(mod_a))
median(cars$lev)
## [1] 0.02945985
We can then make a simple plot of these leverage values to see if any observations have a much larger value than the other observations. Based on our median lev value of 0.029, we can assume that values greatly above that may be unduly influential.
plot(cars$lev)
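If you’d rather use a rule of thumb than eyeball the plot, one common convention (an extra check, not something the walkthrough above relies on) is to flag observations whose leverage is more than two or three times the average leverage; for a simple regression, the average leverage is 2/n.
(lev_cutoff <- 2 * mean(cars$lev)) # two times the average leverage
which(cars$lev > lev_cutoff) # row numbers of observations above the cutoff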
For this example, we’ll just arbitrarily choose 0.07 as our cutoff. Let’s take a peek at the observations that have greater leverage values than that.
ind <- which(cars$lev > 0.07)
cars[ind,]
## speed dist error_c sq_error_c x_diff pred_a error_a sq_error_a
## 1 4 2 -40.98 1679.3604 -11.4 -1.849460 3.849460 14.81834
## 2 4 10 -32.98 1087.6804 -11.4 -1.849460 11.849460 140.40970
## 3 7 4 -38.98 1519.4404 -8.4 9.947766 -5.947766 35.37593
## 4 7 22 -20.98 440.1604 -8.4 9.947766 12.052234 145.25633
## 46 24 70 27.02 730.0804 8.6 76.798715 -6.798715 46.22253
## 47 24 92 49.02 2402.9604 8.6 76.798715 15.201285 231.07906
## 48 24 93 50.02 2502.0004 8.6 76.798715 16.201285 262.48163
## 49 24 120 77.02 5932.0804 8.6 76.798715 43.201285 1866.35100
## 50 25 85 42.02 1765.6804 9.6 80.731124 4.268876 18.22330
## lev
## 1 0.11486131
## 2 0.11486131
## 3 0.07150365
## 4 0.07150365
## 46 0.07398540
## 47 0.07398540
## 48 0.07398540
## 49 0.07398540
## 50 0.08727007
We can also use studentized deleted residuals to get an idea of influential observations in our data.
cars$stud_del_resid <- rstudent(mod_a)
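A common rule of thumb, offered here as an extra check rather than part of the original steps, is to take a closer look at any observation whose studentized deleted residual is larger than about 2 in absolute value.
which(abs(cars$stud_del_resid) > 2) # row numbers of potentially problematic observations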
We can get a handy plot of these values with the ols_plot_resid_stud_fit() function from the olsrr package.
ols_plot_resid_stud_fit(mod_a)
Notice that this plot gives the observation number for any observations that fall outside of its default thresholds.
The olsrr package also has a function that will plot your data by leverage and studentized deleted residuals. You want to pay special attention to any points that are indicated as being an outlier and having a high leverage value. Our plot below doesn’t show any values like that.
ols_plot_resid_lev(mod_a)
Cook’s distance is another useful diagnostic, and we can use it similarly to the previous diagnostics. The below graph will indicate the default diagnostic threshold and give the observation number for any observations that exceed that threshold.
ols_plot_cooksd_bar(mod_a)
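If you want the underlying values rather than just the plot, base R’s cooks.distance() will return them; a commonly cited (though not universal) cutoff is 4/n.
cars$cooks_d <- cooks.distance(mod_a)
which(cars$cooks_d > 4 / nrow(cars)) # observations above the common 4/n cutoff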
It’s also a good idea to examine the distribution of your errors. We can do this easily by making a Q-Q plot. Ideally, you want the errors to fall on the line.
qqnorm(mod_a$residuals)
qqline(mod_a$residuals)
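If you’d like a formal test to go along with the visual check, the Shapiro-Wilk test in base R is one option; treat it as a supplement to the plot rather than a replacement.
shapiro.test(mod_a$residuals) # a small p-value suggests the errors may not be normally distributed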
You also want to check that the error variances of your data are homogeneous. In the below plot, we would hope to see the residuals spread equally along the range of predictions. We don’t see this in the plot below; instead, we see a cone pattern that is typically indicative of heteroscedasticity.
ggplot(data = cars, mapping = aes(x = predict(mod_a), y = rstudent(mod_a))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
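For a more formal check of this, the olsrr package we loaded earlier provides a Breusch-Pagan test; this is an extra step beyond the plots used in this tutorial.
ols_test_breusch_pagan(mod_a) # a small p-value suggests heteroscedasticity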