# Linear Regression Lab

Published:

This lesson is from An Introduction to Statistical Learning

# Linear Regression

• Load the MASS package, which is a very large collection of data sets and functions
• We also load the ISLR2 package, which includes the data sets

## Libraries

•   library(MASS)
library(ISLR2)

Boston = ISLR2::Boston


## Simple Linear Regression

• Boston data set, which records medv (median house value) for 506 census tracts in Boston. We will seek to predict medv using 12 predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status)

• To find out more about the data set, we can type

  ?Boston

• Boston Data

### Description

A data set containing housing values in 506 suburbs of Boston.

### Usage

Boston


### Format

A data frame with 506 rows and 13 variables (12 predictor and 1 response)

• crim
• per capita crime rate by town.
• zn
• proportion of residential land zoned for lots over 25,000 sq.ft.
• indus
• proportion of non-retail business acres per town.
• chas
• Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
• nox
• nitrogen oxides concentration (parts per 10 million).
• rm
• average number of rooms per dwelling.
• age
• proportion of owner-occupied units built prior to 1940.
• dis
• weighted mean of distances to five Boston employment centres.
• rad
• index of accessibility to radial highways.
• tax
• full-value property-tax rate per $10,000. • ptratio • pupil-teacher ratio by town. • lstat • lower status of the population (percent). • medv • median value of owner-occupied homes in$1000s.
• Fit the Model
# lm() to fit a simple linear regression
# medv (median house value) as the response
# lstat (percent of households with low socio economic status) as predictor

# lm.fit = lm(medv ~ lstat) # error

lm.fit = lm(medv ~ lstat, data=Boston)

attach(Boston)
lm.fit = lm(medv ~ lstat)

lm.fit # some basic information

summary(lm.fit) # p-values, std errors, R^2 statistic, F-statistic

names(lm.fit) # will give model information stored

summary(lm.fit)$sigma # Residual Standard Error (RSE) estimate of standard deviation of ϵ  • vif() function, part of the car package, can be used to compute variance inflation factors. install.packages("car") library(car) vif(lm.fit)  ## Interaction Terms • lstat:age is interaction term between lstat and age # lstat * age includes lstat, age, and interaction term lstat * age # lstat * age is shorthand for lstat + age + lstat:age summary(lm(medv ~ lstat * age , data = Boston))  ## Non-linear Transformations of the Predictors lm.fit2 <- lm(medv ~ lstat + I(lstat^2)) summary(lm.fit2)  • near-zero p-value associated with the quadratic term suggests that it leads to an improved model • use the anova() function to further quantify the extent to which the quadratic fit is superior to the linear fit lm.fit2 <- lm(medv ~ lstat + I(lstat^2)) summary(lm.fit2) lm.fit <- lm(medv ~ lstat) anova(lm.fit , lm.fit2)  • anova performs hypothesis test • Null hypothesis is two models fit the data equally well • Alternate Hypothesis is full model is superior •$F$-statistics is 135 and$p\$-value is near to 0
• full model is far superior
par(mfrow = c(2, 2))
plot(lm.fit) # pattern

plot(lm.fit2) # little discernable pattern


# using poly
lm.fit5 <- lm(medv ~ poly(lstat , 5))
summary(lm.fit5)

# no improvement beyond 5
lm.fit6 <- lm(medv ~ poly(lstat , 6))
summary(lm.fit6)

• By default, the poly() function orthogonalizes the predictors: this means that the features output by this function are not simply a sequence of powers of the argument. However, a linear model applied to the output of the poly() function will have the same fitted values as a linear model applied to the raw polynomials (although the coefficient estimates, standard errors, and p-values will differ). In order to obtain the raw polynomials from the poly() function, the argument raw = TRUE must be used.
# Using log transformation
summary(lm(medv ~ log(rm)))


## Qualitative Predictors

• Carseats dataset of ISLR2
• Child Car seats
• predict sales in 400 locations based on predictors
• Predictor
• Shelveloc
• shelving location—that is, the space within a store in which the car seat is displayed—at each location
• R generates dummy variables automatically
Carseats = ISLR2::Carseats
attach(Carseats)

unique(ShelveLoc)

lm.fit = lm(Sales ~ . + Income:Advertising + Price:Age, data = Carseats) # all with intercation terms

summary(lm.fit)

contrasts(ShelveLoc) # Coding for dummy variables

• Coefficient for ShelveLocGood in the regression output is positive 4.8486762
• indicates that a good shelving location is associated with high sales (relative to a bad location).
• ShelveLocMedium has a smaller positive coefficient 1.9532620
• indicates that a medium shelving location is associated with higher sales than a bad shelving location but lower sales than a good shelving location

## Writing Functions

LoadLibraries = function (){
library(ISLR2)
library(MASS)