R’s lm() function is fast, easy, and succinct. However, when you’re getting started, that brevity can be a bit of a curse. I’m going to explain some of the key components to the summary() function in R for linear regression models. In addition, I’ll also show you how to calculate these figures for yourself so you have a better intuition of what they mean.
Before we can examine a model summary, we need to build a model. To follow along with this example, create these three variables.
#Anscombe's Quartet Q1 Data
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68)
x1=c(10,8,13,9,11,14,6,4,12,7,5)

#Some fake data, set the seed to be reproducible.
set.seed(15)
x2=sqrt(y)+rnorm(length(y))
Just for fun, I’m using data from Anscombe’s quartet (Q1) and then creating a second variable with a defined pattern and some random error.
Now, we’ll create a linear regression model using R’s lm() function and we’ll get the summary output using the summary() function.
model=lm(y~x1+x2)
summary(model)
This is the output you should receive.
> summary(model)

Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.69194 -0.61053 -0.08073  0.60553  1.61689 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   0.8278     1.7063   0.485  0.64058   
x1            0.5299     0.1104   4.802  0.00135 **
x2            0.6443     0.4017   1.604  0.14744   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.141 on 8 degrees of freedom
Multiple R-squared:  0.7477,	Adjusted R-squared:  0.6846 
F-statistic: 11.85 on 2 and 8 DF,  p-value: 0.004054
I’m not going to spend much time on the Call, Residuals, or Coefficients sections; if you’re doing regression analysis, you should already understand residuals and the coefficient table. As a quick refresher, the Residuals block is simply a summary of the differences between the observed values and the model’s fitted values, which you can reproduce yourself:
summary(y-model$fitted.values)
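If you also want the Coefficients table itself (Estimate, Std. Error, t value, Pr(>|t|)) as a plain matrix to work with, one convenient option is to pull it straight out of the summary object:

#Coefficient table from the summary, as a numeric matrix
coef(summary(model))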
With those sections out of the way, we’ll focus on the bottom of the summary output.
In R, the lm() summary reports the residual standard error, which is the standard deviation of the error term with a slight twist. Standard deviation is the square root of variance, and the residual standard error is computed in much the same way. The only difference is that instead of dividing by n-1, you divide by n minus (1 + the number of variables in the model).
#Residual Standard error (Like Standard Deviation)
k=length(model$coefficients)-1 #Subtract one to ignore intercept
SSE=sum(model$residuals**2)
n=length(model$residuals)
sqrt(SSE/(n-(1+k))) #Residual Standard Error
> sqrt(SSE/(n-(1+k))) #Residual Standard Error
[1] 1.140965
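If you want to double-check this against the value R computed, the summary object stores it as sigma:

#Residual standard error as reported by summary()
summary(model)$sigma #Should match 1.140965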
Next up is Multiple R-Squared. Also called the coefficient of determination, this is an oft-cited measure of how well your model fits the data. While there are many issues with using it alone (see Anscombe’s quartet), it’s a quick, pre-computed check on your model.
R-Squared subtracts the residual error (SSE) from the total variation in y (SSyy) and reports the result as a fraction of that total variation. The bigger the residual error, the smaller the fraction of variance the model appears to explain.
#Multiple R-Squared (Coefficient of Determination)
SSyy=sum((y-mean(y))**2)
SSE=sum(model$residuals**2)
(SSyy-SSE)/SSyy
#Alternatively 1-SSE/SSyy
> (SSyy-SSE)/SSyy
[1] 0.7476681
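As a sanity check, the same value is stored in the summary object:

#Multiple R-Squared as reported by summary()
summary(model)$r.squared #Should match 0.7476681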
If you notice, the numerator doesn’t have to be positive. If the model is bad enough, you can actually end up with a negative R-Squared.
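As a quick illustration, here’s a set of deliberately terrible, made-up predictions (not from our model). Since they do worse than simply predicting mean(y), plugging them into the same formula gives a negative value:

#Hypothetical predictions, for illustration only
bad_pred=rep(20,length(y)) #Predict 20 for every observation
SSE_bad=sum((y-bad_pred)**2)
(SSyy-SSE_bad)/SSyy #Negative "R-Squared"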
Multiple R-Squared works great for simple linear (one variable) regression. However, in most cases, the model has multiple variables. The more variables you add, the more variance you’re going to explain. So you have to control for the extra variables.
Adjusted R-Squared normalizes Multiple R-Squared by taking into account how many samples you have and how many variables you’re using.
#Adjusted R-Squared
n=length(y)
k=length(model$coefficients)-1 #Subtract one to ignore intercept
SSE=sum(model$residuals**2)
SSyy=sum((y-mean(y))**2)
1-(SSE/SSyy)*(n-1)/(n-(k+1))
> 1-(SSE/SSyy)*(n-1)/(n-(k+1))
[1] 0.6845852
Notice how k is in the denominator. If you have 100 observations (n) and 5 variables, you’ll be dividing by 100-5-1 = 94. If you have 20 variables instead, you’re dividing by 100-20-1 = 79. As the denominator gets smaller, the (n-1)/(n-k-1) multiplier gets larger: 99/94 ≈ 1.05 versus 99/79 ≈ 1.25.
A larger multiplier makes the Adjusted R-Squared worse, since its product with SSE/SSyy is what gets subtracted from one.
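Again, you can compare the hand-computed value to the one stored in the summary object:

#Adjusted R-Squared as reported by summary()
summary(model)$adj.r.squared #Should match 0.6845852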
Finally, the F-Statistic. Along with the t-tests on the individual coefficients, this is the second kind of test that the summary function produces for lm models. The F-Statistic is a “global” test that checks whether at least one of your coefficients is nonzero.
#F-Statistic
#Ho: All coefficients are zero
#Ha: At least one coefficient is nonzero
#Compare test statistic to F Distribution table
n<-length(y)
SSE<-sum(model$residuals**2)
SSyy<-sum((y-mean(y))**2)
k<-length(model$coefficients)-1
((SSyy-SSE)/k) / (SSE/(n-(k+1)))
> ((SSyy-SSE)/k) / (SSE/(n-(k+1)))
[1] 11.85214
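The summary object also stores the F-statistic together with its degrees of freedom, and you can turn those into the reported p-value with pf():

#F-statistic (value, numerator df, denominator df) as reported by summary()
f=summary(model)$fstatistic
f
pf(f[1], f[2], f[3], lower.tail=FALSE) #Should match 0.004054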
The reason for this test is that if you run multiple individual hypothesis tests (namely, one per coefficient), you’re likely to include a variable that isn’t actually significant just by chance. See this for an example (and an explanation).
You can now replicate the summary statistics produced by R’s summary function on linear regression (lm) models!
If you’re interested in more R tutorials on linear regression and beyond, take a look at the Linear Regression page.