Linear Regression Lab

7 minute read

Published:

This lesson is from An Introduction to Statistical Learning

Linear Regression

  • Load the MASS package, which is a very large collection of data sets and functions
  • We also load the ISLR2 package, which includes the data sets

Libraries

  •   library(MASS)
      library(ISLR2)
        
      Boston = ISLR2::Boston
    

Simple Linear Regression

  • Boston data set, which records medv (median house value) for 506 census tracts in Boston. We will seek to predict medv using 12 predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status)

  • To find out more about the data set, we can type

      ?Boston
    
    • Boston Data

Description

A data set containing housing values in 506 suburbs of Boston.

Usage

Boston

Format

A data frame with 506 rows and 13 variables (12 predictor and 1 response)

  • crim
    • per capita crime rate by town.
  • zn
    • proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus
    • proportion of non-retail business acres per town.
  • chas
    • Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox
    • nitrogen oxides concentration (parts per 10 million).
  • rm
    • average number of rooms per dwelling.
  • age
    • proportion of owner-occupied units built prior to 1940.
  • dis
    • weighted mean of distances to five Boston employment centres.
  • rad
    • index of accessibility to radial highways.
  • tax
    • full-value property-tax rate per $10,000.
  • ptratio
    • pupil-teacher ratio by town.
  • lstat
    • lower status of the population (percent).
  • medv
    • median value of owner-occupied homes in $1000s.
  • Fit the Model
# lm() to fit a simple linear regression
# medv (median house value) as the response
# lstat (percent of households with low socio economic status) as predictor

# lm.fit = lm(medv ~ lstat) # error

lm.fit = lm(medv ~ lstat, data=Boston)

attach(Boston)
lm.fit = lm(medv ~ lstat)
  • Information about model
lm.fit # some basic information

summary(lm.fit) # p-values, std errors, R^2 statistic, F-statistic

names(lm.fit) # will give model information stored

lm.fit$coefficients # will give coefficients of lm.fit

coef(lm.fit) # safer to use
  • Confidence Intervals for coefficient estimates
confint(lm.fit)
confint(lm.fit, level=0.95)
confint(lm.fit, level=0.99)
  • standard 95% CI consists of a lower 2.5% limit and an upper 97.5% limit
  • When you repeat your experiment many times, the true parameter value will be
    • below the CI in 2.5% of cases and
    • above it in another 2.5% of cases
    • and the CI will cover it in 95% of cases
  • Predict
    • The predict() function can be used to produce
      • confidence intervals and
      • prediction intervals for the prediction of medv for a given value of lstat
predict(lm.fit, data.frame(lstat = (c(5, 10, 15))), interval = "confidence")

predict(lm.fit, data.frame(lstat = 5), interval = "prediction")
predict(lm.fit, data.frame(lstat = 10), interval = "prediction")
predict(lm.fit, data.frame(lstat = 15), interval = "prediction")


predict(lm.fit , data.frame(lstat = (c(5, 10, 15))), interval = "prediction")
  • 95% confidence interval associated with a lstat value of 10 is
    • (24.47, 25.63)
  • 95% prediction interval is
    • (12.828, 37.28)
  • As expected, the confidence and prediction intervals are centered around the same point (a predicted value of 25.05 for medv when lstat equals 10), but the latter are substantially wider.
https://towardsdatascience.com/how-confidence-and-prediction-intervals-work-4592019576d8#:~:text=The%20prediction%20interval%20predicts%20in,such%20as%20the%20population%20mean.
  • Prediction Intervals
    • The prediction interval predicts in what range a future individual observation will fall, while a confidence interval shows the likely range of values associated with some statistical parameter of the data, such as the population mean, intercept. etc
    • Prediction interval expresses inherent uncertainty in the particular data point on top of the sampling uncertainty. It is thus wider than the confidence interval.
  • Plot

      plot(lstat , medv) # scatter plot
      plot(lstat , medv , col = "red")
        
      plot (1:20 , 1:20, pch = 1:20) # all point characters
        
      plot(lstat , medv , pch = 4)
      plot(lstat , medv , pch = "+")
        
        
      abline(lm.fit) # least squares regression line
        
      abline(lm.fit, lwd=3) # line width 3
        
      abline(lm.fit , lwd = 3, col = "red")
    
  • Diagnostic plots

      plot(lm.fit) # see plot one by one pressing enter
        
      par(mfrow = c(2, 2)) # divides the plotting region into a 2 × 2 grid of panels
      plot(lm.fit) # 
        
      plot(predict(lm.fit), residuals(lm.fit)) # plot the residuals against the fitted values
        
      plot(predict(lm.fit), rstudent(lm.fit)) # plot the studentized residuals against the fitted values
    
    • Studentized residual

      • quotient resulting from the division of a residual by an estimate of its standard deviation.
    • On the basis of the residual plots, there is some evidence of non-linearity.

    • Leverage statistics can be computed for any number of predictors using the hatvalues() function.

      • leverage is a measure of how far away the independent variable values of an observation are from those of the other observations.
      • High-leverage points, if any, are outliers with respect to the independent variables
        plot(hatvalues(lm.fit))
        which.max(hatvalues(lm.fit))
      

Multiple Linear Regression

lm.fit = lm(medv ~ lstat + age , data = Boston)
summary(lm.fit)

lm.fit = lm(medv ~ . , data = Boston) # All Variables
summary(lm.fit)


lm.fit1 = lm(medv ~ . - age, data = Boston)
summary(lm.fit1)

lm.fit1 = update(lm.fit , ~ . - age)
summary(lm.fit1)


?summary.lm # individual componenets of summary object

summary(lm.fit)$r.squared # R^2 value

summary(lm.fit)$sigma # Residual Standard Error (RSE) estimate of standard deviation of ϵ
  • vif() function, part of the car package, can be used to compute variance inflation factors.
install.packages("car")
library(car)
vif(lm.fit)

Interaction Terms

  • lstat:age is interaction term between lstat and age
# lstat * age includes lstat, age, and interaction term lstat * age
# lstat * age is shorthand for lstat + age + lstat:age
summary(lm(medv ~ lstat * age , data = Boston))

Non-linear Transformations of the Predictors

lm.fit2 <- lm(medv ~ lstat + I(lstat^2))
summary(lm.fit2)
  • near-zero p-value associated with the quadratic term suggests that it leads to an improved model
  • use the anova() function to further quantify the extent to which the quadratic fit is superior to the linear fit
lm.fit2 <- lm(medv ~ lstat + I(lstat^2))
summary(lm.fit2)

lm.fit <- lm(medv ~ lstat)
anova(lm.fit , lm.fit2)
  • anova performs hypothesis test
    • Null hypothesis is two models fit the data equally well
    • Alternate Hypothesis is full model is superior
    • $F$-statistics is 135 and $p$-value is near to 0
      • full model is far superior
par(mfrow = c(2, 2))
plot(lm.fit) # pattern

plot(lm.fit2) # little discernable pattern

# using poly
lm.fit5 <- lm(medv ~ poly(lstat , 5))
summary(lm.fit5)

# no improvement beyond 5
lm.fit6 <- lm(medv ~ poly(lstat , 6))
summary(lm.fit6)
  • By default, the poly() function orthogonalizes the predictors: this means that the features output by this function are not simply a sequence of powers of the argument. However, a linear model applied to the output of the poly() function will have the same fitted values as a linear model applied to the raw polynomials (although the coefficient estimates, standard errors, and p-values will differ). In order to obtain the raw polynomials from the poly() function, the argument raw = TRUE must be used.
# Using log transformation
summary(lm(medv ~ log(rm)))

Qualitative Predictors

  • Carseats dataset of ISLR2
    • Child Car seats
    • predict sales in 400 locations based on predictors
  • Predictor
    • Shelveloc
      • shelving location—that is, the space within a store in which the car seat is displayed—at each location
      • Bad, Medium, and Good
      • R generates dummy variables automatically
Carseats = ISLR2::Carseats
attach(Carseats)
head(Carseats)

unique(ShelveLoc)

lm.fit = lm(Sales ~ . + Income:Advertising + Price:Age, data = Carseats) # all with intercation terms

summary(lm.fit)

contrasts(ShelveLoc) # Coding for dummy variables
  • Coefficient for ShelveLocGood in the regression output is positive 4.8486762
    • indicates that a good shelving location is associated with high sales (relative to a bad location).
  • ShelveLocMedium has a smaller positive coefficient 1.9532620
    • indicates that a medium shelving location is associated with higher sales than a bad shelving location but lower sales than a good shelving location

Writing Functions

LoadLibraries = function (){
  library(ISLR2)
  library(MASS)
  print("The libraries have been loaded.")
}

LoadLibraries()