Linear Regression

20 minute read


This lesson is from An Introduction to Statistical Learning

Multiple Linear Regression

  • Simple Linear Regression

    • Predict response on single predictor variable
  • Sales -> Advertising budget for TV, radio, and newspaper

  • Beyond TV, is either of the other two media (radio, newspaper) associated with sales?

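  • Separate simple regressions, one per medium

    • Sales on TV

      |           | Coefficient | Std. error | t-statistic | p-value  |
      |-----------|-------------|------------|-------------|----------|
      | Intercept | 7.0325      | 0.4578     | 15.36       | < 0.0001 |
      | TV        | 0.0475      | 0.0027     | 17.67       | < 0.0001 |

    • Sales on radio

      |           | Coefficient | Std. error | t-statistic | p-value  |
      |-----------|-------------|------------|-------------|----------|
      | Intercept | 9.312       | 0.563      | 16.54       | < 0.0001 |
      | Radio     | 0.203       | 0.020      | 9.92        | < 0.0001 |

    • Sales on newspaper

      |           | Coefficient | Std. error | t-statistic | p-value  |
      |-----------|-------------|------------|-------------|----------|
      | Intercept | 12.351      | 0.621      | 19.88       | < 0.0001 |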
  • Analysis

    • A $1,000 increase in the advertising budget is associated with an average increase in sales of about
      • 47 units for TV
      • 203 units for radio
      • 55 units for newspaper
  • Issues

    • It is unclear how to make a single prediction of sales given budgets for all three media
    • Each regression equation ignores the other two media
      • the estimates may be misleading if the media budgets are correlated with each other
  • Solution

    • extend the simple model by giving each predictor its own slope coefficient in a single model
  • $Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + … +\beta_pX_p + \epsilon$

  • $Sales = \beta_0 + \beta_1 \times TV + \beta_2 \times radio + \beta_3 \times newspaper + \epsilon$

Estimating the Regression Coefficients

  • Prediction after estimating the coefficients
    • $\hat{y} = \hat{\beta_0} + \hat{\beta_1}x_1 + \hat{\beta_2}x_2 + … +\hat{\beta_p}x_p$
  • Parameters are estimated using the least squares approach: choose $\hat{\beta_0}, \hat{\beta_1}, …, \hat{\beta_p}$ to minimize the sum of squared residuals
  • $RSS = \sum\limits_{i=1}^{n} (y_i - \hat{y_i})^2 $
  • $RSS = \sum\limits_{i=1}^{n} (y_i - \hat{\beta_0} - \hat{\beta_1}x_{i1} - \hat{\beta_2}x_{i2} - … - \hat{\beta_p}x_{ip})^2 $
Two predictors, one response variable: the least squares plane minimizes the sum of squared vertical distances between each observation and the plane.
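The least squares fit described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the data are synthetic (not the Advertising data), and the helper name `fit_least_squares` and the chosen true coefficients are hypothetical.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return (beta_hat, rss) for y ~ intercept + X using ordinary least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that beta_hat[0] is the intercept.
    design = np.column_stack([np.ones(len(y)), X])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta_hat
    rss = float(residuals @ residuals)
    return beta_hat, rss

# Synthetic data (an assumption, not the Advertising data): two predictors.
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 100, size=(n, 2))
true_beta = np.array([3.0, 0.05, 0.2])  # intercept, slope 1, slope 2
y = true_beta[0] + X @ true_beta[1:] + rng.normal(0, 1.5, size=n)

beta_hat, rss = fit_least_squares(X, y)
print("estimated coefficients:", np.round(beta_hat, 3))
print("RSS:", round(rss, 2))
```

Any other choice of coefficients gives a larger RSS on the same data, which is what "least squares" means here.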
  • Advertising Data

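    • Multiple Linear Regression on Advertising data

      |           | Coefficient | Std. error | t-statistic | p-value  |
      |-----------|-------------|------------|-------------|----------|
      | Intercept | 2.939       | 0.3119     | 9.42        | < 0.0001 |
      | TV        | 0.046       | 0.0014     | 32.81       | < 0.0001 |
      | Radio     | 0.189       | 0.0086     | 21.89       | < 0.0001 |

    • Spending an additional $1,000 on

      • TV is associated with an increase in sales of about 46 units
      • Radio is associated with an increase of about 189 units
      • Newspaper is associated with almost no change (coefficient close to $0$)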
    • Analysis

      • TV and radio coefficients are similar to those from the simple linear regressions
      • The newspaper coefficient is different, and its p-value is no longer significant
        • in the simple regression: coefficient $0.055$, p-value $0.00115$
        • in the multiple regression: coefficient close to $0$, p-value $0.8599$
      • Simple and multiple regression coefficients can be quite different
        • difference stems from the fact that in the simple regression case, the slope term represents the average increase in product sales associated with a $1,000 increase in newspaper advertising, ignoring other predictors such as TV and radio.
        • By contrast, in the multiple regression setting, the coefficient for newspaper represents the average increase in product sales associated with increasing newspaper spending by $1,000 while holding TV and radio fixed.
      • Does it make sense for the multiple regression to suggest no relationship between sales and newspaper while the simple linear regression implies the opposite?
        • It does; consider the correlation matrix
    • Correlation Matrix

      |           | TV | Radio | Newspaper | Sales  |
      |-----------|----|-------|-----------|--------|
      | Radio     |    | 1.0   | 0.3541    | 0.5762 |
      | Newspaper |    |       | 1.0       | 0.2283 |
      | Sales     |    |       |           | 1.0    |

      • Correlation between radio and newspaper is $0.35$
        • this indicates that markets with high newspaper advertising tend to also have high radio advertising
      • Now suppose that the multiple regression is correct

        • newspaper advertising is not associated with sales, but radio advertising is associated with sales.
      • Then in markets where we spend more on radio our sales will tend to be higher, and as our correlation matrix shows,

        • we also tend to spend more on newspaper advertising in those same markets.
      • Hence, in a simple linear regression which only examines sales versus newspaper

        • we will observe that higher values of newspaper tend to be associated with higher values of sales, even though newspaper advertising is not directly associated with sales.
      • So newspaper advertising is a surrogate for radio advertising; newspaper gets “credit” for the association between radio and sales.

      • This slightly counterintuitive result is very common in many real life situations.

        • Running a regression of shark attacks versus ice cream sales for data collected at a given beach community over a period of time would show a positive relationship, similar to that seen between sales and newspaper.
          • Of course no one has (yet) suggested that ice creams should be banned at beaches to reduce shark attacks.
        • In reality, higher temperatures cause more people to visit the beach, which in turn results in more ice cream sales and more shark attacks.
        • A multiple regression of shark attacks onto ice cream sales and temperature reveals that, as intuition implies, ice cream sales is no longer a significant predictor after adjusting for temperature.
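The shark attack example can be mimicked with a small simulation. This is a sketch under assumed numbers: the variable names, effect sizes, and noise levels are invented for illustration and do not come from real data. Temperature drives both ice cream sales and shark attacks, while ice cream has no direct effect on attacks.

```python
import numpy as np

def ols_coefficients(X, y):
    """Least squares coefficients for y ~ intercept + X (intercept first)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta_hat

rng = np.random.default_rng(1)
n = 5000
temperature = rng.normal(25, 5, size=n)
# Both variables depend on temperature; ice cream has no direct effect on attacks.
ice_cream = 10 * temperature + rng.normal(0, 30, size=n)
attacks = 0.5 * temperature + rng.normal(0, 2, size=n)

# Simple regression: attacks on ice cream only.
simple = ols_coefficients(ice_cream, attacks)
# Multiple regression: attacks on ice cream and temperature together.
multiple = ols_coefficients(np.column_stack([ice_cream, temperature]), attacks)

print("simple regression, ice cream slope:     %.4f" % simple[1])
print("multiple regression, ice cream slope:   %.4f" % multiple[1])
print("multiple regression, temperature slope: %.4f" % multiple[2])
```

In the simple regression the ice cream slope is clearly positive because it absorbs the effect of temperature; once temperature is included, the ice cream slope shrinks toward zero.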

Some Important Questions

  • In multiple linear regression
    • Is at least one of the predictors $X_1, X_2, \dots, X_p$ useful in predicting the response?
    • Do all the predictors help to explain $Y$, or is only a subset of the predictors useful?
    • How well does the model fit the data?
    • Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

Is There a Relationship Between the Response and Predictors?

  • In simple linear regression, $ Y = \beta_0 + \beta_1X $
    • check if $\beta_1 = 0$
    • $ H_0:~\beta_1 = 0 $
    • $ H_a:~\beta_1 \ne 0 $
  • In multiple regression setting with $p$ predictors
    • $ H_0:~\beta_1 = \beta_2 = … =\beta_p= 0 $
    • $ H_a:$ at least one $\beta_j \ne 0$
  • This hypothesis test is performed by computing the F-statistic
    • In general, an F-test in regression compares the fits of different linear models. Unlike t-tests that can assess only one regression coefficient at a time, the F-test can assess multiple coefficients simultaneously.
  • $ F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)} $
    • where, $TSS = \sum (y_i - \bar{y})^2 $ and $RSS = \sum(y_i - \hat{y_i})^2 $
  • If the linear model assumptions are correct, one can show that
    • $ E [\frac{RSS}{n-p-1}] = \sigma^2 $
  • If $H_0$ is true
    • $E[\frac{TSS-RSS}{p}] = \sigma^2$
  • Thus, the $F$-statistic will be close to $1$ when there is no relationship between the response and the predictors
  • If $H_a$ is true
    • $E[\frac{TSS-RSS}{p}] > \sigma^2$
    • $\implies$ we expect $F > 1$
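The formula for $F$ can be checked numerically. The sketch below uses synthetic data and a hypothetical helper `f_statistic`; it computes $F$ once when $H_0$ is true, once when $H_a$ is true, and then averages $F$ over many simulated data sets generated under $H_0$.

```python
import numpy as np

def f_statistic(X, y):
    """Overall F-statistic for H0: all slope coefficients are zero."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta_hat) ** 2)  # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)           # total sum of squares
    return float(((tss - rss) / p) / (rss / (n - p - 1)))

rng = np.random.default_rng(2)
n, p = 200, 3
X = rng.normal(size=(n, p))
noise = rng.normal(size=n)

y_null = 5.0 + noise                                 # H0 true: no predictor matters
y_alt = 5.0 + X @ np.array([1.0, 0.5, 0.0]) + noise  # H_a true: two predictors matter

f_null = f_statistic(X, y_null)
f_alt = f_statistic(X, y_alt)

# Average F over many data sets generated under H0.
null_fs = [f_statistic(X, 5.0 + rng.normal(size=n)) for _ in range(500)]
mean_null_f = float(np.mean(null_fs))

print("F under H0:  %.2f" % f_null)
print("F under H_a: %.2f" % f_alt)
print("mean F over 500 null data sets: %.2f" % mean_null_f)
```

Under $H_0$ the average stays near $1$, while a real relationship pushes $F$ well above $1$.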
| Quantity                | Value (multiple regression) |
|-------------------------|-----------------------------|
| Residual Standard Error | 1.69                        |
  • The $F$-statistic is $570$, far larger than $1$
    • compelling evidence against the null hypothesis
  • The large $F$-statistic suggests that at least one of the advertising media must be related to sales
  • How large does the $F$-statistic need to be before we can reject $H_0$ and conclude that there is a relationship? The answer depends on $n$ and $p$.
    • When $n$ is large, an $F$-statistic that is just a little larger than $1$ might still provide evidence against $H_0$.
    • In contrast, a larger F-statistic is needed to reject $H_0$ if $n$ is small.
  • When $H_0$ is true and the errors $\epsilon_i$ have a normal distribution, the F-statistic follows an F-distribution.
  • For any given value of n and p, any statistical software package can be used to compute the p-value associated with the $F$-statistic using this distribution. Based on this $p$-value, we can determine whether or not to reject $H_0$.

  • $p$-value associated with $F$-statistic (570) for advertising data is essentially zero

    • extremely strong evidence that at least one of the media is associated with increased sales
  • Sometimes we want to test that a particular subset of $q$ coefficients is zero; for convenience, order the variables so that these are the last $q$

    • $ H_0:~\beta_{p-q+1} = \beta_{p-q+2} = … =\beta_p= 0 $
  • In this case we fit a second model that

    • uses all the variables except those last $q$.
  • Suppose that the residual sum of squares for that model is $RSS_0$.

    • Our earlier $ F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)} $ becomes
    • $ F = \frac{(RSS_0-RSS)/q}{RSS/(n-p-1)} $
  • In Table 1.4 (Multiple Linear Regression), a t-statistic and a p-value were reported for each individual predictor

    • These provide information about whether each individual predictor is related to the response, after adjusting for the other predictors.
    • It turns out that each of these is exactly equivalent to the F-test that omits that single variable from the model, leaving all the others in.
      • square of each t-statistic is the corresponding F-statistic
    • So it reports the partial effect of adding that variable to the model.
    • For instance, as we discussed earlier, these p-values indicate that TV and radio are related to sales, but that there is no evidence that newspaper is associated with sales, when TV and radio are held fixed.
  • Why do we need the overall F-statistic, given these individual p-values for each variable?

    |           | Coefficient | Std. error | t-statistic | p-value  |
    |-----------|-------------|------------|-------------|----------|
    | Intercept | 2.939       | 0.3119     | 9.42        | < 0.0001 |
    | TV        | 0.046       | 0.0014     | 32.81       | < 0.0001 |
    | Radio     | 0.189       | 0.0086     | 21.89       | < 0.0001 |

    • After all, it seems that if any one of the p-values for the individual variables is very small, then at least one of the predictors is related to the response.

    • However, this logic is flawed, especially when the number of predictors $p$ is large.

    • For instance, let

      • $p = 100$ and

        • $H_0 : \beta_1 = \beta_2 = … = \beta_p = 0$
        • $H_a:$ at least one $\beta_j \ne 0$
      • Suppose $H_0$ is true, so no variable is truly associated with the response.

        • ideally, every variable would then have a large $p$-value
      • In this situation, about 5% of the p-values associated with each variable (of the type shown in Table 1.4) will be below 0.05 by chance.

        |       | Coefficient | Std. error | t-statistic | p-value |
        |-------|-------------|------------|-------------|---------|
        | $X_2$ |             |            |             |         |
        | …     |             |            |             |         |
      • In other words, we expect to see approximately five small p-values even in the absence of any true association between the predictors and the response.

      • In fact, we are almost guaranteed to observe at least one p-value below 0.05 by chance!

      • Hence, if we use the individual t-statistics and associated p-values in order to decide whether or not there is any association between the variables and the response, there is a very high chance that we will incorrectly conclude that there is a relationship.

      • However, the F-statistic does not suffer from this problem because it adjusts for the number of predictors.

      • Hence, if $H_0$ is true, there is only a $5\%$ chance that the $F$-statistic will result in a $p$-value below $0.05$, regardless of the number of predictors or the number of observations.

  • F-statistic works to test for any association between the predictors and the response

    • when p is relatively small, and certainly small compared to n
    • If p > n
      • there are more coefficients $\beta_j$ to estimate than observations from which to estimate them.
      • In this case we cannot even fit the multiple linear regression model using least squares, so the $F$-statistic cannot be used
        • and neither can most of the other concepts that we have seen so far in this chapter.
    • When $p$ is large, some of the approaches discussed in the next section, such as forward selection, can be used.
    • This high-dimensional setting is discussed in greater detail in Chapter 6.
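Returning to the $p = 100$ example earlier in this section, the chance of at least one spuriously small p-value can be worked out with a little arithmetic. The sketch below assumes the individual tests are independent, which is a simplification (t-statistics from one regression are generally correlated); the function name is hypothetical.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """P(at least one p-value < alpha) when H0 holds for every test.

    Assumes the tests are independent, which is a simplifying assumption.
    """
    return 1.0 - (1.0 - alpha) ** num_tests

for num_tests in (1, 3, 10, 100):
    expected_small = 0.05 * num_tests  # expected number of p-values below 0.05
    prob = prob_at_least_one_false_positive(num_tests)
    print("p = %3d: expect %.2f small p-values, P(at least one) = %.3f"
          % (num_tests, expected_small, prob))
```

With 100 predictors the expected number of p-values below 0.05 is five, and under the independence assumption the probability of seeing at least one exceeds 0.99, which is why the individual p-values cannot replace the overall $F$-test.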

Deciding on Important Variables

  • First step in a multiple regression analysis

    • compute the $F$-statistic and examine the associated $p$-value.
    • If we conclude on the basis of that p-value that at least one of the predictors is related to the response, then it is natural to wonder which are the guilty ones!
    • We could look at the individual p-values, but if the number of predictors $p$ is large, we are likely to make some false discoveries
  • What is Variable Selection?

    • The task of determining which predictors are associated with the response, in order to fit a single model involving only those predictors, is referred to as variable selection
  • If $p = 2$, then we can consider four models:

    • model containing no variables
    • model containing $X_1$ only
    • model containing $X_2$ only, and
    • model containing both $X_1$ and $X_2$
  • Which model is best?

    • Mallows's $C_p$
    • Akaike information criterion (AIC)
    • Bayesian information criterion (BIC)
    • Adjusted $R^2$
  • If $p=30$

    • $2^{30} = 1,073,741,824$ models
  • Three Classical approaches to select models

    • Forward selection

      • Begin with the null model
        • a model that contains an intercept but no predictors.
      • Fit $p$ simple linear regressions and add to the null model the variable that results in the lowest RSS.
        • at this first step, this is also the variable with the lowest $p$-value
      • Add to that model the variable that results in the lowest RSS for the new two-variable model.
      • This approach is continued until some stopping rule is satisfied.
    • Backward selection

      • We start with all variables in the model, and remove the variable with the largest p-value, that is, the variable that is the least statistically significant.

      • The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed.

      • This procedure continues until a stopping rule is reached.

      • For instance, we may stop when all remaining variables have a p-value below some threshold

    • Mixed selection

      • This is a combination of forward and backward selection
      • We start with no variables in the model, and as with forward selection, we add the variable that provides the best fit.
      • We continue to add variables one-by-one.
      • Of course, as we noted with the Advertising example, the $p$-values for variables can become larger as new predictors are added to the model.
      • Hence, if at any point the $p$-value for one of the variables in the model rises above a certain threshold, then we remove that variable from the model.
      • We continue to perform these forward and backward steps until all variables in the model have a sufficiently low $p$-value, and all variables outside the model would have a large p-value if added to the model.
    • Backward selection cannot be used if $p > n$, while forward selection can always be used.

    • Forward selection is a greedy approach, and might include variables early that later become redundant. Mixed selection can remedy this.
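Forward selection as described above can be sketched as a greedy loop over RSS. This is an illustrative sketch, not a definitive implementation: the data are synthetic, the helper names are hypothetical, and the stopping rule (a fixed number of variables, `max_vars`) is an assumption chosen for simplicity.

```python
import numpy as np

def rss_for(X, y, columns):
    """RSS of the least squares fit of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta_hat
    return float(residuals @ residuals)

def forward_selection(X, y, max_vars):
    """Greedy forward selection: at each step add the variable that lowers RSS most."""
    selected, path = [], []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        best = min(remaining, key=lambda j: rss_for(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
        path.append((best, rss_for(X, y, selected)))
    return selected, path

rng = np.random.default_rng(3)
n, p = 300, 6
X = rng.normal(size=(n, p))
# Only columns 0 and 3 truly affect the response in this synthetic example.
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=n)

selected, path = forward_selection(X, y, max_vars=3)
for j, rss in path:
    print("added X%d, RSS = %.1f" % (j, rss))
```

The two informative columns are picked first, and the third step reduces RSS only marginally, which is the kind of signal a practical stopping rule would look for.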

Model Fit

  • Most common numerical measures of model fit are

    • Residual Standard Error (RSE), measure of lack of fit
    • $R^2$, the fraction of variance explained, which lies between $0$ and $1$
    • These quantities are computed and interpreted in the same fashion as for simple linear regression
  • In simple regression, $R^2$ is the square of the correlation between the response and the predictor: $R^2 = r^2$, where $r = Cor(X,Y)$

    • $r = Cor(X,Y) = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2} \sqrt{\sum(y_i - \bar{y})^2}}$
  • In multiple linear regression, it turns out that $R^2$ equals $Cor(Y, \hat{Y})^2$, the square of the correlation between the response ($Y$) and the fitted linear model ($\hat{Y}$)

    • in fact one property of the fitted linear model is that it maximizes this correlation among all possible linear models.
  • An $R^2$ value close to $1$ indicates that the model explains a large portion of the variance in the response variable

  • Example

    • Advertising Data with multiple regression

      | Quantity                | Value (multiple regression) |
      |-------------------------|-----------------------------|
      | Residual Standard Error | 1.69                        |
    • $R^2$ using all three media is 0.8972

    • $R^2$ using only TV and Radio is 0.89719

    • small increase in $R^2$ when adding newspaper

      • $R^2$ never decreases, and in practice always increases, when a variable is added, because the residual sum of squares on the training data can only go down (this need not hold on test data)
        • $R^2 = \frac{TSS-RSS}{TSS} = 1 - \frac{RSS}{TSS}$
          • TSS is the total sum of squares = $\sum (y_i-\bar{y})^2$
          • RSS is the residual sum of squares = $\sum (y_i - \hat{y_i})^2$
    • Essentially, newspaper provides no real improvement in the model fit to the training samples, and its inclusion will likely lead to poor results on independent test samples due to overfitting

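      • Simple regression with TV as the predictor variable

        | Quantity                | Value |
        |-------------------------|-------|
        | Residual Standard Error | 3.26  |

        |           | Coefficient | Std. error | t-statistic | p-value  |
        |-----------|-------------|------------|-------------|----------|
        | Intercept | 7.0325      | 0.4578     | 15.36       | < 0.0001 |
        | TV        | 0.0475      | 0.0027     | 17.67       | < 0.0001 |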
    • Model containing only TV as a predictor had an $R^2$ of 0.61.

    • Adding radio to the model leads to a substantial improvement in $R^2$, to $0.89719$

    • This implies that a model that uses TV and radio expenditures to predict sales is substantially better than one that uses only TV advertising.

    • We could further quantify this improvement by looking at the p-value for the radio coefficient in a model that contains only TV and radio as predictors.

    • Model that contains only TV has

      • RSE of 3.26
    • The model that contains only TV and radio as predictors has

      • Residual Standard Error (RSE) of 1.681
    • Model that also contains newspaper as a predictor has

      • RSE of 1.686
        • a slight increase after adding the variable
    • This corroborates our previous conclusion that a model that uses TV and radio expenditures to predict sales is much more accurate (on the training data) than one that only uses TV spending.

    • Furthermore, given that TV and radio expenditures are used as predictors, there is no point in also using newspaper spending as a predictor in the model.

    • The observant reader may wonder how RSE can increase when newspaper is added to the model given that RSS must decrease.

    • Why does the RSE increase?

      • $ RSE = \sqrt{\frac{RSS}{n-p-1}} $
      • Models with more variables (higher $p$) can have higher RSE if the decrease in RSS is small relative to the increase in p.
  • In addition to looking at the $RSE$ and $R^2$ statistics just discussed, it can be useful to plot the data

  • Graphical summaries can reveal problems with a model that are not visible from numerical statistics

Least squares regression plane for Sales Vs TV and Radio Budget
  • Linear model seems to overestimate sales for instances in which most of the advertising money was spent exclusively on either TV or radio.
  • It underestimates sales for instances where the budget was split between the two media.
  • This pronounced non-linear pattern suggests a synergy or interaction effect between the advertising media, whereby combining the media results in a bigger boost to sales than using any single medium.
  • In Section 3.3.2, we will discuss extending the linear model to accommodate such synergistic effects through the use of interaction terms.
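Returning to the numerical fit statistics from earlier in this section, the behaviour of $R^2$ and RSE across nested models can be reproduced on synthetic data. The sketch below is an assumption-laden illustration: the data only loosely imitate the Advertising setting, the coefficients are invented, and `newspaper` is generated to have no effect on the response.

```python
import numpy as np

def fit_statistics(X, y):
    """Return (R^2, RSE) for the least squares fit of y on intercept + X."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - rss / tss
    rse = np.sqrt(rss / (n - p - 1))
    return float(r_squared), float(rse)

rng = np.random.default_rng(4)
n = 200
tv = rng.uniform(0, 300, size=n)
radio = rng.uniform(0, 50, size=n)
newspaper = rng.uniform(0, 100, size=n)  # has no effect on the response below
sales = 3.0 + 0.045 * tv + 0.19 * radio + rng.normal(0, 1.7, size=n)

models = [("TV", [tv]),
          ("TV + radio", [tv, radio]),
          ("TV + radio + newspaper", [tv, radio, newspaper])]
for name, cols in models:
    r2, rse = fit_statistics(np.column_stack(cols), sales)
    print("%-24s R^2 = %.4f  RSE = %.3f" % (name, r2, rse))
```

$R^2$ cannot go down when the uninformative predictor is added, but RSE divides RSS by $n - p - 1$, so it may rise or fall slightly depending on whether the small drop in RSS outweighs the lost degree of freedom.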


Predictions

  • Once we have fit the multiple regression model, the response $Y$ can be predicted on the basis of a set of values for the predictors $X_1, X_2, \dots, X_p$
    • $ \hat{y} = \hat{\beta_0} + \hat{\beta_1}x_1 + \hat{\beta_2}x_2 +…+ \hat{\beta_p}x_p $
  • However, there are three sorts of uncertainty associated with this prediction.
    • The coefficient estimates $\hat{\beta_0}, \hat{\beta_1}, …, \hat{\beta_p}$ are only estimates of $\beta_0, \beta_1, …, \beta_p$
      • the least squares plane $\hat{Y}$ is only an estimate of the true regression plane $f(X)$
      • the inaccuracy in the coefficient estimates is related to the reducible error
      • we can compute a confidence interval to determine how close $\hat{Y}$ will be to $f(X)$
    • Model bias, from assuming a linear model for $f(X)$
    • Random error $\epsilon$, the irreducible error
  • A confidence interval is used to quantify the uncertainty surrounding the average sales over a large number of cities
    • For example, given that $100,000 is spent on TV advertising and $20,000 is spent on radio advertising in each city, the 95% confidence interval is [10,985, 11,528].
      • We interpret this to mean that 95% of intervals of this form will contain the true value of f(X).
  • A prediction interval can be used to quantify the uncertainty surrounding sales for a particular city
    • Given that $100,000 is spent on TV advertising and $20,000 is spent on radio advertising in that city the 95% prediction interval is [7,930, 14,580].
    • We interpret this to mean that 95% of intervals of this form will contain the true value of Y for this city.
    • Note that both intervals are centered at 11,256, but that the prediction interval is substantially wider than the confidence interval, reflecting the increased uncertainty about sales for a given city in comparison to the average sales over many locations.
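The difference between the two intervals can be sketched with the standard least squares formulas, which are not spelled out in these notes: the standard error of the estimated mean response at $x_0$ is $\hat{\sigma}\sqrt{x_0^T(X^TX)^{-1}x_0}$, and the prediction standard error adds the irreducible error, $\hat{\sigma}\sqrt{1 + x_0^T(X^TX)^{-1}x_0}$. The sketch below uses synthetic data, a hypothetical helper `intervals`, and a normal quantile in place of the exact $t$ quantile (a large-sample approximation).

```python
import numpy as np
from statistics import NormalDist

def intervals(X, y, x_new, level=0.95):
    """Approximate confidence and prediction intervals at x_new.

    Uses a normal quantile instead of the exact t quantile, which is a
    large-sample approximation made to keep the sketch simple.
    """
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta_hat) ** 2)
    sigma2_hat = rss / (n - p - 1)  # RSE squared
    x0 = np.concatenate([[1.0], np.asarray(x_new, dtype=float)])
    leverage = float(x0 @ np.linalg.solve(design.T @ design, x0))
    y_hat = float(x0 @ beta_hat)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    ci_half = z * np.sqrt(sigma2_hat * leverage)          # uncertainty about f(x_new)
    pi_half = z * np.sqrt(sigma2_hat * (1.0 + leverage))  # adds the irreducible error
    ci = (y_hat - ci_half, y_hat + ci_half)
    pi = (y_hat - pi_half, y_hat + pi_half)
    return y_hat, ci, pi

rng = np.random.default_rng(5)
n = 200
X = np.column_stack([rng.uniform(0, 300, n), rng.uniform(0, 50, n)])
y = 3.0 + 0.045 * X[:, 0] + 0.19 * X[:, 1] + rng.normal(0, 1.7, size=n)

y_hat, ci, pi = intervals(X, y, x_new=[100.0, 20.0])
print("prediction: %.2f" % y_hat)
print("95%% confidence interval: [%.2f, %.2f]" % ci)
print("95%% prediction interval: [%.2f, %.2f]" % pi)
```

Both intervals share the same center, and the prediction interval is always the wider of the two because of the extra $1$ under the square root.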