# Linear Regression Lab

** Published:**

This lesson is from An Introduction to Statistical Learning

# Linear Regression

- Load the MASS package, which is a very large collection of data sets and functions
- We also load the ISLR2 package, which includes the data sets

## Libraries

`library(MASS) library(ISLR2) Boston = ISLR2::Boston`

## Simple Linear Regression

Boston data set, which records medv (median house value) for 506 census tracts in Boston. We will seek to predict medv using 12 predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status)

To find out more about the data set, we can type

`?Boston`

- Boston Data

### Description

A data set containing housing values in 506 suburbs of Boston.

### Usage

```
Boston
```

### Format

A data frame with 506 rows and 13 variables (12 predictor and 1 response)

`crim`

- per capita crime rate by town.

`zn`

- proportion of residential land zoned for lots over 25,000 sq.ft.

`indus`

- proportion of non-retail business acres per town.

`chas`

- Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

`nox`

- nitrogen oxides concentration (parts per 10 million).

`rm`

- average number of rooms per dwelling.

`age`

- proportion of owner-occupied units built prior to 1940.

`dis`

- weighted mean of distances to five Boston employment centres.

`rad`

- index of accessibility to radial highways.

`tax`

- full-value property-tax rate per $10,000.

`ptratio`

- pupil-teacher ratio by town.

`lstat`

- lower status of the population (percent).

`medv`

- median value of owner-occupied homes in $1000s.

- Fit the Model

```
# lm() to fit a simple linear regression
# medv (median house value) as the response
# lstat (percent of households with low socio economic status) as predictor
# lm.fit = lm(medv ~ lstat) # error
lm.fit = lm(medv ~ lstat, data=Boston)
attach(Boston)
lm.fit = lm(medv ~ lstat)
```

- Information about model

```
lm.fit # some basic information
summary(lm.fit) # p-values, std errors, R^2 statistic, F-statistic
names(lm.fit) # will give model information stored
lm.fit$coefficients # will give coefficients of lm.fit
coef(lm.fit) # safer to use
```

- Confidence Intervals for coefficient estimates

```
confint(lm.fit)
confint(lm.fit, level=0.95)
confint(lm.fit, level=0.99)
```

- standard 95% CI consists of a lower 2.5% limit and an upper 97.5% limit
- When you repeat your experiment many times, the true parameter value will be
- below the CI in 2.5% of cases and
- above it in another 2.5% of cases
- and the CI will cover it in 95% of cases

- Predict
- The predict() function can be used to produce
- confidence intervals and
- prediction intervals for the prediction of medv for a given value of lstat

- The predict() function can be used to produce

```
predict(lm.fit, data.frame(lstat = (c(5, 10, 15))), interval = "confidence")
predict(lm.fit, data.frame(lstat = 5), interval = "prediction")
predict(lm.fit, data.frame(lstat = 10), interval = "prediction")
predict(lm.fit, data.frame(lstat = 15), interval = "prediction")
predict(lm.fit , data.frame(lstat = (c(5, 10, 15))), interval = "prediction")
```

- 95% confidence interval associated with a lstat value of 10 is
- (24.47, 25.63)

- 95% prediction interval is
- (12.828, 37.28)

- As expected, the confidence and prediction intervals are centered around the same point (a predicted value of 25.05 for medv when lstat equals 10), but the latter are substantially wider.

https://towardsdatascience.com/how-confidence-and-prediction-intervals-work-4592019576d8#:~:text=The%20prediction%20interval%20predicts%20in,such%20as%20the%20population%20mean. |

- Prediction Intervals
- The prediction interval predicts in what range a future individual observation will fall, while a confidence interval shows the likely range of values associated with some statistical parameter of the data, such as the population mean, intercept. etc
- Prediction interval expresses inherent uncertainty in the particular data point on top of the sampling uncertainty. It is thus wider than the confidence interval.

Plot

`plot(lstat , medv) # scatter plot plot(lstat , medv , col = "red") plot (1:20 , 1:20, pch = 1:20) # all point characters plot(lstat , medv , pch = 4) plot(lstat , medv , pch = "+") abline(lm.fit) # least squares regression line abline(lm.fit, lwd=3) # line width 3 abline(lm.fit , lwd = 3, col = "red")`

Diagnostic plots

`plot(lm.fit) # see plot one by one pressing enter par(mfrow = c(2, 2)) # divides the plotting region into a 2 × 2 grid of panels plot(lm.fit) # plot(predict(lm.fit), residuals(lm.fit)) # plot the residuals against the fitted values plot(predict(lm.fit), rstudent(lm.fit)) # plot the studentized residuals against the fitted values`

- quotient resulting from the division of a residual by an estimate of its standard deviation.

On the basis of the residual plots, there is some evidence of non-linearity.

Leverage statistics can be computed for any number of predictors using the hatvalues() function.

- leverage is a measure of how far away the independent variable values of an observation are from those of the other observations.
- High-leverage points, if any, are outliers with respect to the independent variables

`plot(hatvalues(lm.fit)) which.max(hatvalues(lm.fit))`

## Multiple Linear Regression

```
lm.fit = lm(medv ~ lstat + age , data = Boston)
summary(lm.fit)
lm.fit = lm(medv ~ . , data = Boston) # All Variables
summary(lm.fit)
lm.fit1 = lm(medv ~ . - age, data = Boston)
summary(lm.fit1)
lm.fit1 = update(lm.fit , ~ . - age)
summary(lm.fit1)
?summary.lm # individual componenets of summary object
summary(lm.fit)$r.squared # R^2 value
summary(lm.fit)$sigma # Residual Standard Error (RSE) estimate of standard deviation of ϵ
```

- vif() function, part of the car package, can be used to compute variance inflation factors.

```
install.packages("car")
library(car)
vif(lm.fit)
```

## Interaction Terms

`lstat:age`

is interaction term between lstat and age

```
# lstat * age includes lstat, age, and interaction term lstat * age
# lstat * age is shorthand for lstat + age + lstat:age
summary(lm(medv ~ lstat * age , data = Boston))
```

## Non-linear Transformations of the Predictors

```
lm.fit2 <- lm(medv ~ lstat + I(lstat^2))
summary(lm.fit2)
```

- near-zero p-value associated with the quadratic term suggests that it leads to an improved model
- use the anova() function to further quantify the extent to which the quadratic fit is superior to the linear fit

```
lm.fit2 <- lm(medv ~ lstat + I(lstat^2))
summary(lm.fit2)
lm.fit <- lm(medv ~ lstat)
anova(lm.fit , lm.fit2)
```

- anova performs hypothesis test
- Null hypothesis is two models fit the data equally well
- Alternate Hypothesis is full model is superior
- $F$-statistics is 135 and $p$-value is near to 0
- full model is far superior

```
par(mfrow = c(2, 2))
plot(lm.fit) # pattern
plot(lm.fit2) # little discernable pattern
```

```
# using poly
lm.fit5 <- lm(medv ~ poly(lstat , 5))
summary(lm.fit5)
# no improvement beyond 5
lm.fit6 <- lm(medv ~ poly(lstat , 6))
summary(lm.fit6)
```

- By default, the poly() function orthogonalizes the predictors: this means that the features output by this function are not simply a sequence of powers of the argument. However, a linear model applied to the output of the poly() function will have the same fitted values as a linear model applied to the raw polynomials (although the coefficient estimates, standard errors, and p-values will differ). In order to obtain the raw polynomials from the poly() function, the argument raw = TRUE must be used.

```
# Using log transformation
summary(lm(medv ~ log(rm)))
```

## Qualitative Predictors

- Carseats dataset of ISLR2
- Child Car seats
- predict sales in 400 locations based on predictors

- Predictor
- Shelveloc
- shelving location—that is, the space within a store in which the car seat is displayed—at each location
- Bad, Medium, and Good
- R generates dummy variables automatically

- Shelveloc

```
Carseats = ISLR2::Carseats
attach(Carseats)
head(Carseats)
unique(ShelveLoc)
lm.fit = lm(Sales ~ . + Income:Advertising + Price:Age, data = Carseats) # all with intercation terms
summary(lm.fit)
contrasts(ShelveLoc) # Coding for dummy variables
```

- Coefficient for ShelveLocGood in the regression output is positive 4.8486762
- indicates that a good shelving location is associated with high sales (relative to a bad location).

- ShelveLocMedium has a smaller positive coefficient 1.9532620
- indicates that a medium shelving location is associated with higher sales than a bad shelving location but lower sales than a good shelving location

## Writing Functions

```
LoadLibraries = function (){
library(ISLR2)
library(MASS)
print("The libraries have been loaded.")
}
LoadLibraries()
```