# Linear Regression


This lesson is from An Introduction to Statistical Learning

# Multiple Linear Regression

• Simple Linear Regression

• Predict the response from a single predictor variable
• Sales -> Advertising budgets for TV, Radio, and Newspaper

• Is any of the three media associated with sales?

• Separate Simple Regressions (one per medium)

Sales onto TV:

|           | Coefficient | Std Error | t-statistic | p-value  |
|-----------|-------------|-----------|-------------|----------|
| Intercept | 7.0325      | 0.4578    | 15.36       | < 0.0001 |
| TV        | 0.0475      | 0.0027    | 17.67       | < 0.0001 |

Sales onto Radio:

|           | Coefficient | Std Error | t-statistic | p-value  |
|-----------|-------------|-----------|-------------|----------|
| Intercept | 9.312       | 0.563     | 16.54       | < 0.0001 |
| Radio     | 0.203       | 0.020     | 9.92        | < 0.0001 |

Sales onto Newspaper:

|           | Coefficient | Std Error | t-statistic | p-value  |
|-----------|-------------|-----------|-------------|----------|
| Intercept | 12.351      | 0.621     | 19.88       | < 0.0001 |
| Newspaper | 0.055       | 0.017     | 3.30        | 0.00115  |
• Analysis

• A \$1,000 increase in the advertising budget for
  • TV is associated with an increase in sales of about 47 units
  • Radio, about 203 units
  • Newspaper, about 55 units
• Issues
  • How should we make a single sales prediction given a budget for all three media?
  • Each separate regression equation ignores the other two media
  • this may be misleading if the media budgets are correlated with one another
• Solution
  • extend to a single model with a separate slope coefficient for each predictor
  • $Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + … +\beta_pX_p + \epsilon$
  • $Sales = \beta_0 + \beta_1 \times TV + \beta_2 \times radio + \beta_3 \times newspaper + \epsilon$

## Estimating the Regression Coefficients

• Prediction from the estimated coefficients
  • $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 + … + \hat{\beta}_px_p$
• Parameters are estimated using the least squares approach: minimize the sum of squared residuals
  • $RSS = \sum\limits_{i=1}^{n} (y_i - \hat{y}_i)^2$
  • $RSS = \sum\limits_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1x_{i1} - \hat{\beta}_2x_{i2} - … - \hat{\beta}_px_{ip})^2$
• With two predictors and one response, least squares fits a plane: the plane that minimizes the sum of squared vertical distances between each observation and the plane
• Advertising Data
  • Multiple linear regression of sales on all three media

|           | Coefficient | Std error | t-statistic | p-value  |
|-----------|-------------|-----------|-------------|----------|
| Intercept | 2.939       | 0.3119    | 9.42        | < 0.0001 |
| TV        | 0.046       | 0.0014    | 32.81       | < 0.0001 |
| Radio     | 0.189       | 0.0086    | 21.89       | < 0.0001 |
| Newspaper | −0.001      | 0.0059    | −0.18       | 0.8599   |

• Spending an additional \$1,000 on

• TV is associated with an increase in sales of about 46 units
• Radio, about 189 units
• Newspaper, essentially no change
• Analysis

• TV and Radio coefficients are similar to those from the simple linear regressions
• Newspaper coefficient is very different, and its p-value is no longer significant
• when simple, coefficient $0.055$, p-value $0.00115$
• now close to $0$ with p-value $0.8599$
• simple and multiple regression coefficients can be quite different
• the difference stems from the fact that in the simple regression, the slope term represents the average increase in product sales associated with a \$1,000 increase in newspaper advertising, ignoring other predictors such as TV and radio
• by contrast, in the multiple regression setting, the coefficient for newspaper represents the average increase in product sales associated with increasing newspaper spending by \$1,000 while holding TV and radio fixed
• Does it make sense for the multiple regression to suggest no relationship between sales and newspaper while the simple linear regression implies the opposite?
• It does; let's look at the correlation matrix
• Correlation Matrix

|           | TV  | Radio  | Newspaper | Sales  |
|-----------|-----|--------|-----------|--------|
| TV        | 1.0 | 0.0548 | 0.0567    | 0.7822 |
| Radio     |     | 1.0    | 0.3541    | 0.5762 |
| Newspaper |     |        | 1.0       | 0.2283 |
| Sales     |     |        |           | 1.0    |

• Correlation between Radio and Newspaper is $0.35$

  • indicates that markets with high newspaper advertising tend also to have high radio advertising

• Now suppose that the multiple regression is correct

• newspaper advertising is not associated with sales, but radio advertising is associated with sales.
• Then in markets where we spend more on radio our sales will tend to be higher, and as our correlation matrix shows,

• we also tend to spend more on newspaper advertising in those same markets.
• Hence, in a simple linear regression which only examines sales versus newspaper

• we will observe that higher values of newspaper tend to be associated with higher values of sales, even though newspaper advertising is not directly associated with sales.
• So newspaper advertising is a surrogate for radio advertising; newspaper gets “credit” for the association between radio on sales.

• This slightly counterintuitive result is very common in many real life situations.

• Running a regression of shark attacks versus ice cream sales for data collected at a given beach community over a period of time would show a positive relationship, similar to that seen between sales and newspaper.
• Of course no one has (yet) suggested that ice creams should be banned at beaches to reduce shark attacks.
• In reality, higher temperatures cause more people to visit the beach, which in turn results in more ice cream sales and more shark attacks.
• A multiple regression of shark attacks onto ice cream sales and temperature reveals that, as intuition implies, ice cream sales is no longer a significant predictor after adjusting for temperature.
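The surrogate effect described above is easy to reproduce numerically. Below is a minimal sketch on synthetic data (not the actual Advertising data set; all variable names and coefficient values are illustrative) in which newspaper has no direct effect on sales but is correlated with radio, which does:

```python
import numpy as np

# Synthetic illustration: radio drives sales, newspaper is correlated with
# radio but has no direct effect on sales (its true coefficient is 0).
rng = np.random.default_rng(0)
n = 500
radio = rng.normal(50, 15, n)
newspaper = 0.5 * radio + rng.normal(0, 10, n)   # correlated with radio
sales = 5 + 0.2 * radio + rng.normal(0, 2, n)    # newspaper plays no role

def ols(X, y):
    """Least squares coefficients with an intercept column prepended."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

# Simple regression of sales on newspaper alone: slope looks clearly positive
b_simple = ols(newspaper[:, None], sales)
# Multiple regression on radio and newspaper: newspaper slope shrinks toward 0
b_multi = ols(np.column_stack([radio, newspaper]), sales)

print("simple newspaper slope: ", b_simple[1])
print("multiple newspaper slope:", b_multi[2])
```

The simple regression gives newspaper "credit" for radio's effect, while the multiple regression, which holds radio fixed, drives the newspaper slope toward zero.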

## Some Important Questions

• In multiple linear regression
• Is at least one of the predictors $X_1,X_2, . . . ,X_p$ useful in predicting the response?
• Do all the predictors help to explain $Y$ , or is only a subset of the predictors useful?
• How well does the model fit the data?
• Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

### Is There a Relationship Between the Response and Predictors?

• In simple linear regression, $Y = \beta_0 + \beta_1X$
• check if $\beta_1 = 0$
• $H_0:~\beta_1 = 0$
• $H_a:~\beta_1 \ne 0$
• In multiple regression setting with $p$ predictors
• $H_0:~\beta_1 = \beta_2 = … =\beta_p= 0$
• $H_a:$ at least one $\beta_j \ne 0$
• This hypothesis test is performed by computing the F-statistic
• In general, an F-test in regression compares the fits of different linear models. Unlike t-tests, which can assess only one regression coefficient at a time, the F-test can assess multiple coefficients simultaneously (see https://blog.minitab.com/en/adventures-in-statistics-2/what-is-the-f-test-of-overall-significance-in-regression-analysis)
• $F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)}$
• where, $TSS = \sum (y_i - \bar{y})^2$ and $RSS = \sum(y_i - \hat{y_i})^2$
• If the linear model assumptions are correct, one can show that
• $E [\frac{RSS}{n-p-1}] = \sigma^2$
• If $H_0$ is true
• $E[\frac{TSS-RSS}{p}] = \sigma^2$
• Thus, the $F$-statistic will be close to $1$ when there is no relationship between the response and the predictors
• If $H_a$ is true
• $E[\frac{TSS-RSS}{p}] > \sigma^2$
• $\implies F > 1$
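Since $(TSS-RSS)/TSS = R^2$, the $F$-statistic can be rewritten as $F = \frac{R^2/p}{(1-R^2)/(n-p-1)}$. A quick sketch (using $n = 200$ markets, the size of the Advertising data, and the $R^2 \approx 0.897$ reported for the full fit) recovers the published value:

```python
# Equivalent form of the F-statistic, using (TSS - RSS)/TSS = R^2:
#   F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
def f_statistic(r2, n, p):
    return (r2 / p) / ((1 - r2) / (n - p - 1))

# Advertising data: n = 200 markets, p = 3 media, R^2 ~= 0.8972
print(f_statistic(0.8972, 200, 3))  # close to the reported F of 570
```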
| Quantity                | Value (Multiple Regression) |
|-------------------------|-----------------------------|
| Residual Standard Error | 1.69                        |
| $R^2$                   | 0.897                       |
| $F$-statistic           | 570                         |
• $F$-statistic is $570$ far larger than $1$
• compelling evidence against null hypothesis
• Large F-statistic suggests that at least one of the advertising media must be related to sales
• How large does the $F$-statistic need to be before we can reject $H_0$ and conclude that there is a relationship?
• When $n$ is large, an $F$-statistic that is just a little larger than $1$ might still provide evidence against $H_0$.
• In contrast, a larger F-statistic is needed to reject $H_0$ if $n$ is small.
• When $H_0$ is true and the errors $\epsilon_i$ have a normal distribution, the F-statistic follows an F-distribution.
• For any given value of n and p, any statistical software package can be used to compute the p-value associated with the $F$-statistic using this distribution. Based on this $p$-value, we can determine whether or not to reject $H_0$.

• $p$-value associated with $F$-statistic (570) for advertising data is essentially zero

• extremely strong evidence that at least one of the media is associated with increased sales
• Sometimes we want to test that a particular subset of $q$ coefficients is zero; for convenience, suppose these are the last $q$ coefficients in the model

• $H_0:~\beta_{p-q+1} = \beta_{p-q+2} = … =\beta_p= 0$
• In this case we fit a second model that

• uses all the variables except those last $q$.
• Suppose that the residual sum of squares for that model is $RSS_0$.

• Our earlier $F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)}$ becomes
• $F = \frac{(RSS_0-RSS)/q}{RSS/(n-p-1)}$
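The partial $F$-statistic, and its equivalence to the squared $t$-statistic when $q = 1$ (noted just below), can be checked numerically. A sketch on synthetic data (all names and coefficient values are illustrative):

```python
import numpy as np

# Check that the partial F-test for dropping a single variable (q = 1)
# equals the square of that variable's t-statistic in the full model.
rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 2 + X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

def fit_rss(X, y):
    """Least squares fit with intercept; returns coefficients and RSS."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, np.sum((y - A @ beta) ** 2)

beta_full, rss_full = fit_rss(X, y)
_, rss_reduced = fit_rss(X[:, :2], y)        # drop the last q = 1 predictor

# Partial F-statistic: F = ((RSS_0 - RSS)/q) / (RSS/(n - p - 1))
q = 1
F = ((rss_reduced - rss_full) / q) / (rss_full / (n - p - 1))

# t-statistic of the dropped coefficient in the full model
A = np.column_stack([np.ones(n), X])
sigma2 = rss_full / (n - p - 1)
cov = sigma2 * np.linalg.inv(A.T @ A)
t = beta_full[3] / np.sqrt(cov[3, 3])

print(F, t ** 2)   # the two agree
```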
• As in Table 1.4 (Multiple Linear Regression), a t-statistic and a p-value were reported for each individual predictor

• These provide information about whether each individual predictor is related to the response, after adjusting for the other predictors.
• It turns out that each of these is exactly equivalent to the F-test that omits that single variable from the model, leaving all the others in.
• square of each t-statistic is the corresponding F-statistic
• So it reports the partial effect of adding that variable to the model.
• For instance, as we discussed earlier, these p-values indicate that TV and radio are related to sales, but that there is no evidence that newspaper is associated with sales, when TV and radio are held fixed.
• Why do we need overall F-statistic, given these individual p-values for each variable?

|           | Coefficient | Std error | t-statistic | p-value  |
|-----------|-------------|-----------|-------------|----------|
| Intercept | 2.939       | 0.3119    | 9.42        | < 0.0001 |
| TV        | 0.046       | 0.0014    | 32.81       | < 0.0001 |
| Radio     | 0.189       | 0.0086    | 21.89       | < 0.0001 |
| Newspaper | −0.001      | 0.0059    | −0.18       | 0.8599   |
• One might reason that if any one of the p-values for the individual variables is very small, then at least one of the predictors must be related to the response.

• However, this logic is flawed, especially when the number of predictors $p$ is large.

• For instance, consider an example, let

• $p = 100$ and

• $H_0 : \beta_1 = \beta_2 = · · · = \beta_p = 0$
• $H_a:$ at least one $\beta_j \ne 0$
• Suppose $H_0$ is true, so that no variable is truly associated with the response.

• the $p$-values should then be large for every variable
• In this situation, about 5% of the p-values associated with each variable (of the type shown in Table 1.4) will be below 0.05 by chance.

|           | Coefficient | Std error | t-statistic | p-value |
|-----------|-------------|-----------|-------------|---------|
| Intercept |             |           |             |         |
| $X_1$     |             |           |             |         |
| $X_2$     |             |           |             |         |
| …         |             |           |             |         |
| $X_{100}$ |             |           |             |         |
• In other words, we expect to see approximately five small p-values even in the absence of any true association between the predictors and the response.

• In fact, it is likely that we will observe at least one p-value below 0.05 by chance!

• Hence, if we use the individual t-statistics and associated p-values in order to decide whether or not there is any association between the variables and the response, there is a very high chance that we will incorrectly conclude that there is a relationship.

• However, the F-statistic does not suffer from this problem because it adjusts for the number of predictors.

• Hence, if $H_0$ is true, there is only a $5\%$ chance that the $F$-statistic will yield a $p$-value below $0.05$, regardless of the number of predictors or the number of observations.
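A small simulation illustrates the multiple-testing problem: with 100 null predictors, roughly 5 of the individual p-values fall below 0.05 by chance alone. This sketch uses synthetic data and a normal approximation to the t distribution for the p-values (negligible difference at $n = 500$):

```python
import numpy as np
from math import erf, sqrt

# y is pure noise: none of the p = 100 predictors is truly associated with it.
rng = np.random.default_rng(2)
n, p = 500, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def slope_pvalue(x, y):
    """Two-sided p-value for the slope of a simple regression of y on x
    (normal approximation to the t distribution)."""
    x = x - x.mean()
    y_c = y - y.mean()
    beta = (x @ y_c) / (x @ x)
    resid = y_c - beta * x
    se = sqrt((resid @ resid) / (len(x) - 2) / (x @ x))
    z = abs(beta / se)
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

pvals = [slope_pvalue(X[:, j], y) for j in range(p)]
print(sum(pv < 0.05 for pv in pvals), "of", p, "p-values below 0.05 by chance")
```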

• F-statistic works to test for any association between the predictors and the response

• when p is relatively small, and certainly small compared to n
• If p > n
• there are more coefficients $\beta_j$ to estimate than observations from which to estimate them.
• In this case we cannot even fit the multiple linear regression model using least squares, so the $F$-statistic cannot be used
• and neither can most of the other concepts that we have seen so far in this chapter.
• When $p$ is large, some of the approaches discussed in the next section, such as forward selection, can be used.
• This high-dimensional setting is discussed in greater detail in Chapter 6.

### Deciding on Important Variables

• First step in a multiple regression analysis

• compute the $F$-statistic and to examine the associated $p$-value.
• If we conclude on the basis of that p-value that at least one of the predictors is related to the response, then it is natural to wonder which are the guilty ones!
• We could look at the individual p-values, but if the number of predictors $p$ is large, we are likely to make some false discoveries
• What is Variable Selection?

• The task of determining which predictors are associated with the response, in order to fit a single model involving only those predictors, is referred to as variable selection
• If $p = 2$, then we can consider four models:

• model containing no variables
• model containing $X_1$ only
• model containing $X_2$ only, and
• model containing both $X_1$ and $X_2$
• Which model is best?

• Mallows' $C_p$
• Akaike information criterion (AIC)
• Bayesian information criterion (BIC)
• Adjusted $R^2$
• If $p=30$

• $2^{30} = 1,073,741,824$ models
• Three Classical approaches to select models

• Forward selection

• source: https://quantifyinghealth.com/stepwise-selection/
• Begin with the null model
• a model that contains an intercept but no predictors.
• Fit $p$ simple linear regressions and add to the null model the variable that results in the lowest RSS
  • equivalently, the variable with the lowest $p$-value
• Then add to that model the variable that results in the lowest RSS among all two-variable models
• This approach is continued until some stopping rule is satisfied.
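The steps above can be sketched in a few lines. This toy version runs on synthetic data, uses a fixed model size as the stopping rule, and greedily adds the predictor that most reduces RSS (all names and coefficients are illustrative):

```python
import numpy as np

# Forward selection by RSS: start from the null model and greedily add
# the predictor that most reduces the residual sum of squares.
rng = np.random.default_rng(3)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 1 + 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(size=n)   # only X0, X4 matter

def rss_of(cols):
    """RSS of a least squares fit using the given predictor columns."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

selected, remaining = [], list(range(p))
for _ in range(2):              # stopping rule: fixed model size, for brevity
    best = min(remaining, key=lambda j: rss_of(selected + [j]))
    selected.append(best)
    remaining.remove(best)

print("selected predictors:", selected)
```

A real implementation would stop based on a criterion such as AIC, BIC, or adjusted $R^2$ rather than a fixed model size.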
• Backward selection

• We start with all variables in the model, and remove the variable with the largest p-value, that is, the variable that is least statistically significant.

• The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed.

• This procedure continues until a stopping rule is reached.

• For instance, we may stop when all remaining variables have a p-value below some threshold

• source: https://quantifyinghealth.com/stepwise-selection/
• Mixed selection

• This is a combination of forward and backward selection
• We start with no variables in the model and, as with forward selection, add the variable that provides the best fit.
• We continue to add variables one-by-one.
• Of course, as we noted with the Advertising example, the $p$-values for variables can become larger as new predictors are added to the model.
• Hence, if at any point the $p$-value for one of the variables in the model rises above a certain threshold, then we remove that variable from the model.
• We continue to perform these forward and backward steps until all variables in the model have a sufficiently low $p$-value, and all variables outside the model would have a large p-value if added to the model.
• Backward selection cannot be used if $p > n$, while forward selection can always be used.

• Forward selection is a greedy approach, and might include variables early that later become redundant. Mixed selection can remedy this.

### Model Fit

• Most common numerical measures of model fit are

• Residual Standard Error (RSE), measure of lack of fit
• $R^2$, the fraction of variance explained, lies in $0$ to $1$
• These quantities are computed and interpreted in the same fashion as for simple linear regression
• In simple regression, $R^2$ is the square of the correlation ($r^2 = Cor(X,Y)^2$) of the response and the variable

• $r = Cor(X,Y) = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2} \sqrt{\sum_{i}(y_i - \bar{y})^2}}$
• In multiple linear regression, it turns out that $R^2$ equals $Cor(Y, \hat{Y})^2$, the square of the correlation between the response ($Y$) and the fitted linear model ($\hat{Y}$)

• in fact one property of the fitted linear model is that it maximizes this correlation among all possible linear models.
• An $R^2$ value close to $1$ indicates that the model explains a large portion of the variance in the response variable
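The identity $R^2 = Cor(Y, \hat{Y})^2$ is easy to verify numerically. A sketch on synthetic data (all names illustrative):

```python
import numpy as np

# Verify that R^2 from a multiple regression equals the squared
# correlation between y and the fitted values y_hat.
rng = np.random.default_rng(4)
n, p = 300, 4
X = rng.normal(size=(n, p))
y = 2 + X @ rng.normal(size=p) + rng.normal(size=n)

A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2 = 1 - rss / tss
corr2 = np.corrcoef(y, y_hat)[0, 1] ** 2

print(r2, corr2)   # the two agree
```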

• Example

• Advertising Data with multiple regression

| Quantity                | Value (Multiple Regression) |
|-------------------------|-----------------------------|
| Residual Standard Error | 1.69                        |
| $R^2$                   | 0.897                       |
| $F$-statistic           | 570                         |
• $R^2$ using all three media is 0.8972

• $R^2$ using only TV and Radio is 0.89719

• small increase in $R^2$ when adding newspaper

• $R^2$ will always increase with addition of new variables due to decrease in residual sum of squares on the training data (may not be true for test data)
• $R^2 = \frac{TSS-RSS}{TSS} = 1 - \frac{RSS}{TSS}$
• TSS is total sum of squares = $\sum (y_i-\bar{y})^2$
• RSS is residual sum of squares = $\sum (y_i - \hat{y})^2$
• Essentially, newspaper provides no real improvement in the model fit to the training samples, and its inclusion will likely lead to poor results on independent test samples due to overfitting

• Simple Regression

• TV as Predictor Variable
| Quantity                | Value |
|-------------------------|-------|
| Residual Standard Error | 3.26  |
| $R^2$                   | 0.612 |
| $F$-statistic           | 312.1 |

|           | Coefficient | Std Error | t-statistic | p-value  |
|-----------|-------------|-----------|-------------|----------|
| Intercept | 7.0325      | 0.4578    | 15.36       | < 0.0001 |
| TV        | 0.0475      | 0.0027    | 17.67       | < 0.0001 |
• Model containing only TV as a predictor had an $R^2$ of 0.61.

• Adding radio to the model leads to a substantial improvement: $R^2$ rises to $0.89719$

• This implies that a model that uses TV and radio expenditures to predict sales is substantially better than one that uses only TV advertising.

• We could further quantify this improvement by looking at the p-value for the radio coefficient in a model that contains only TV and radio as predictors.

• Model that contains only TV has

• RSE of 3.26
• The model that contains only TV and radio as predictors has

• Residual Standard Error (RSE) of 1.681
• Model that also contains newspaper as a predictor has

• RSE of 1.686
• an increase, even though a predictor was added
• This corroborates our previous conclusion that a model that uses TV and radio expenditures to predict sales is much more accurate (on the training data) than one that only uses TV spending.

• Furthermore, given that TV and radio expenditures are used as predictors, there is no point in also using newspaper spending as a predictor in the model.

• The observant reader may wonder how RSE can increase when newspaper is added to the model given that RSS must decrease.

• Why RSE increases?

• $RSE = \sqrt{\frac{RSS}{n-p-1}}$
• Models with more variables (higher $p$) can have higher RSE if the decrease in RSS is small relative to the increase in p.
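A small numerical sketch of this effect (the RSS values below are hypothetical, chosen only to illustrate; they are not the actual Advertising values):

```python
from math import sqrt

# RSE = sqrt(RSS / (n - p - 1)): adding a predictor always lowers RSS on the
# training data, but RSE can still rise because the denominator shrinks too.
def rse(rss, n, p):
    return sqrt(rss / (n - p - 1))

n = 200                              # the Advertising data has n = 200 markets
rss_two, rss_three = 556.9, 556.8    # hypothetical: RSS falls only slightly

print(rse(rss_two, n, 2))            # two-predictor model
print(rse(rss_three, n, 3))          # three-predictor model: higher RSE
```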
• In addition to looking at the $RSE$ and $R^2$ statistics just discussed, it can be useful to plot the data

• Graphical summaries can reveal problems with a model that are not visible from numerical statistics

Least squares regression plane for Sales vs TV and Radio budgets
• Linear model seems to overestimate sales for instances in which most of the advertising money was spent exclusively on either TV or radio.
• It underestimates sales for instances where the budget was split between the two media.
• This pronounced non-linear pattern suggests a synergy or interaction effect between the advertising media, whereby combining media results in a bigger boost to sales than using any single medium.
• In Section 3.3.2, we will discuss extending the linear model to accommodate such synergistic effects through the use of interaction terms.

### Predictions

• Once we have fit the multiple regression model, response Y can be predicted on the basis of a set of values for the predictors $X_1,X_2, . . . ,X_p$
• $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 +…+ \hat{\beta}_px_p$
• However, there are three sorts of uncertainty associated with this prediction.
• model parameters $\hat{\beta_i}$ are estimates for $\beta_i$
• Least square plane $\hat{Y}$ is estimate for true regression plane $f(X)$
• inaccuracy in the coefficients estimates is related to reducible error
• we can compute confidence interval to determine how close $\hat{Y}$ will be to $f(X)$
• Model Bias in assuming linear model for $f(X)$
• Random error $\epsilon$, the irreducible error
• Confidence interval is used to quantify the uncertainty surrounding the average sales over a large number of cities
• For example, given that \$100,000 is spent on TV advertising and \$20,000 is spent on radio advertising in each city, the 95% confidence interval is [10,985, 11,528].
• We interpret this to mean that 95% of intervals of this form will contain the true value of f(X).
• Prediction interval can be used to quantify the prediction uncertainty surrounding sales for a particular city
• Given that \$100,000 is spent on TV advertising and \$20,000 is spent on radio advertising in that city, the 95% prediction interval is [7,930, 14,580].
• We interpret this to mean that 95% of intervals of this form will contain the true value of Y for this city.
• Note that both intervals are centered at 11,256, but that the prediction interval is substantially wider than the confidence interval, reflecting the increased uncertainty about sales for a given city in comparison to the average sales over many locations
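Both intervals can be computed directly from the least squares fit. A sketch on synthetic data (all names illustrative; the normal critical value 1.96 is used in place of the exact t quantile, which is reasonable for $n = 200$):

```python
import numpy as np

# Confidence interval for the mean response f(x0) vs prediction interval
# for a single new observation y0 = f(x0) + eps at the same x0.
rng = np.random.default_rng(5)
n = 200
X = rng.normal(size=(n, 2))
y = 3 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)

A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
sigma2 = resid @ resid / (n - A.shape[1])       # estimate of Var(eps)
XtX_inv = np.linalg.inv(A.T @ A)

x0 = np.array([1.0, 0.5, 0.5])                  # new point (intercept first)
y0 = x0 @ beta
se_mean = np.sqrt(sigma2 * x0 @ XtX_inv @ x0)           # uncertainty in f(x0)
se_pred = np.sqrt(sigma2 * (1 + x0 @ XtX_inv @ x0))     # adds irreducible error

z = 1.96                                        # ~95% normal critical value
ci = (y0 - z * se_mean, y0 + z * se_mean)       # confidence interval
pi = (y0 - z * se_pred, y0 + z * se_pred)       # prediction interval
print("CI:", ci)
print("PI:", pi)
```

The extra `1` inside `se_pred` is the irreducible error term; it is why the prediction interval is always wider than the confidence interval, with both centered at the same fitted value.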
