Linear Regression
This lesson is from An Introduction to Statistical Learning
Multiple Linear Regression
Simple Linear Regression
- Predict response on single predictor variable
Sales -> Advertising budgets for TV, Radio, and Newspaper
Is each of the three media associated with sales?
Three Separate Simple Regressions
Sales onto TV:

 | Coefficient | Std Error | t-statistic | p-value |
---|---|---|---|---|
Intercept | 7.0325 | 0.4578 | 15.36 | < 0.0001 |
TV | 0.0475 | 0.0027 | 17.67 | < 0.0001 |

Sales onto Radio:

 | Coefficient | Std Error | t-statistic | p-value |
---|---|---|---|---|
Intercept | 9.312 | 0.563 | 16.54 | < 0.0001 |
Radio | 0.203 | 0.020 | 9.92 | < 0.0001 |

Sales onto Newspaper:

 | Coefficient | Std Error | t-statistic | p-value |
---|---|---|---|---|
Intercept | 12.351 | 0.621 | 19.88 | < 0.0001 |
Newspaper | 0.055 | 0.017 | 3.30 | 0.00115 |
Analysis
- A $1,000 increase in advertising spending on
- TV is associated with an increase in sales of about 47 units
- Radio is associated with an increase of about 203 units
- Newspaper is associated with an increase of about 55 units
Issues
- It is not clear how to make a single prediction of sales given budgets for all three media
- Each separate regression equation ignores the other two media
- this can be misleading if the media budgets are correlated with one another
Solution
- extend the simple model so that each predictor gets its own slope coefficient within a single model
$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + … +\beta_pX_p + \epsilon$
$Sales = \beta_0 + \beta_1 \times TV + \beta_2 \times radio + \beta_3 \times newspaper + \epsilon$
Estimating the Regression Coefficients
- Prediction after estimates
- $\hat{y} = \hat{\beta_0} + \hat{\beta_1}x_1 + \hat{\beta_2}x_2 + … + \hat{\beta_p}x_p$
- Parameters are estimated using the least squares approach, minimizing the sum of squared residuals
- $RSS = \sum\limits_{i=1}^{n} (y_i - \hat{y_i})^2 $
- $RSS = \sum\limits_{i=1}^{n} [y_i - \hat{\beta_0} - \hat{\beta_1}x_{i1} - \hat{\beta_2}x_{i2} - \dots - \hat{\beta_p}x_{ip}]^2 $
*Figure: Least squares plane for two predictors and one response. The plane minimizes the sum of squared vertical distances between each observation and the plane.*
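As an illustration of the least squares fit just described, here is a minimal numpy sketch on synthetic data (the data, coefficient values, and variable names are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: n observations, p = 3 predictors
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 0.05, 0.2, -0.01])   # intercept followed by p slopes (made up)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=1.5, size=n)

# Design matrix with a leading column of ones so the intercept is estimated too
X_design = np.column_stack([np.ones(n), X])

# Least squares: choose beta_hat to minimize RSS = sum((y - X_design @ beta)^2)
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

y_hat = X_design @ beta_hat
rss = np.sum((y - y_hat) ** 2)
print("estimated coefficients:", beta_hat)
print("RSS:", rss)
```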
Advertising Data
 | Coefficient | Std Error | t-statistic | p-value |
---|---|---|---|---|
Intercept | 2.939 | 0.3119 | 9.42 | < 0.0001 |
TV | 0.046 | 0.0014 | 32.81 | < 0.0001 |
Radio | 0.189 | 0.0086 | 21.89 | < 0.0001 |
Newspaper | −0.001 | 0.0059 | −0.18 | 0.8599 |

Spending an additional $1,000 on
- TV is associated with an increase of about 46 units in sales, holding the other media fixed
- Radio is associated with an increase of about 189 units, holding the other media fixed
- Newspaper's coefficient is close to zero, suggesting no additional effect once TV and Radio are accounted for
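A hedged sketch of fitting this multiple regression with statsmodels; it assumes the Advertising data has been saved as `Advertising.csv` with columns named `TV`, `radio`, `newspaper`, and `sales` (the actual file and column names in your copy may differ):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed file and column names; adjust to match your copy of the data
ads = pd.read_csv("Advertising.csv")

# Multiple regression: sales ~ TV + radio + newspaper, fit by least squares
fit = smf.ols("sales ~ TV + radio + newspaper", data=ads).fit()

# Coefficients, standard errors, t-statistics, and p-values, as in the table above
print(fit.summary())
```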
Analysis
- TV and Radio coefficients are similar to those from the simple linear regressions
- The Newspaper coefficient is quite different, and its p-value is no longer significant
- in the simple regression, the coefficient was $0.055$ with p-value $0.00115$
- now it is close to $0$ with p-value $0.8599$
- simple and multiple regression coefficients can be quite different
- difference stems from the fact that in the simple regression case, the slope term represents the average increase in product sales associated with a $1,000 increase in newspaper advertising, ignoring other predictors such as TV and radio.
- By contrast, in the multiple regression setting, the coefficient for newspaper represents the average increase in product sales associated with increasing newspaper spending by $1,000 while holding TV and radio fixed.
- Does it make sense for the multiple regression to suggest no relationship between sales and newspaper while the simple linear regression implies the opposite?
- It does; let's look at the correlation matrix
Correlation Matrix
 | TV | Radio | Newspaper | Sales |
---|---|---|---|---|
TV | 1.0 | 0.0548 | 0.0567 | 0.7822 |
Radio | | 1.0 | 0.3541 | 0.5762 |
Newspaper | | | 1.0 | 0.2283 |
Sales | | | | 1.0 |

The correlation between Radio and Newspaper is $0.35$
- this indicates that markets with high newspaper advertising tend to also have high radio advertising
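The correlation matrix above can be reproduced with pandas, again assuming the same `Advertising.csv` layout as in the earlier sketch:

```python
import pandas as pd

# Same assumed Advertising.csv layout as in the previous sketch
ads = pd.read_csv("Advertising.csv")
print(ads[["TV", "radio", "newspaper", "sales"]].corr().round(4))
```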
Now suppose that the multiple regression is correct
- newspaper advertising is not associated with sales, but radio advertising is associated with sales.
Then in markets where we spend more on radio our sales will tend to be higher, and as our correlation matrix shows,
- we also tend to spend more on newspaper advertising in those same markets.
Hence, in a simple linear regression which only examines sales versus newspaper
- we will observe that higher values of newspaper tend to be associated with higher values of sales, even though newspaper advertising is not directly associated with sales.
So newspaper advertising is a surrogate for radio advertising; newspaper gets “credit” for the effect of radio on sales.
This slightly counterintuitive result is very common in many real life situations.
- Running a regression of shark attacks versus ice cream sales for data collected at a given beach community over a period of time would show a positive relationship, similar to that seen between sales and newspaper.
- Of course no one has (yet) suggested that ice creams should be banned at beaches to reduce shark attacks.
- In reality, higher temperatures cause more people to visit the beach, which in turn results in more ice cream sales and more shark attacks.
- A multiple regression of shark attacks onto ice cream sales and temperature reveals that, as intuition implies, ice cream sales is no longer a significant predictor after adjusting for temperature.
Some Important Questions
- In multiple linear regression
- Is at least one of the predictors $X_1,X_2, . . . ,X_p$ useful in predicting the response?
- Do all the predictors help to explain $Y$ , or is only a subset of the predictors useful?
- How well does the model fit the data?
- Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
Is There a Relationship Between the Response and Predictors?
- In simple linear regression, $ Y = \beta_0 + \beta_1X + \epsilon $
- check if $\beta_1 = 0$
- $ H_0:~\beta_1 = 0 $
- $ H_a:~\beta_1 \ne 0 $
- In multiple regression setting with $p$ predictors
- $ H_0:~\beta_1 = \beta_2 = … =\beta_p= 0 $
- $ H_a:$ at least one $\beta_j \ne 0$
- This hypothesis test is performed by computing the F-statistic
- In general, an F-test in regression compares the fits of different linear models. Unlike t-tests, which can assess only one regression coefficient at a time, the F-test can assess multiple coefficients simultaneously (source: https://blog.minitab.com/en/adventures-in-statistics-2/what-is-the-f-test-of-overall-significance-in-regression-analysis)
- $ F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)} $
- where, $TSS = \sum (y_i - \bar{y})^2 $ and $RSS = \sum(y_i - \hat{y_i})^2 $
- If the linear model assumptions are correct, one can show that
- $ E [\frac{RSS}{n-p-1}] = \sigma^2 $
- If $H_0$ is true
- $E[\frac{TSS-RSS}{p}] = \sigma^2$
- Thus, the $F$-statistic will be close to $1$ when there is no relationship between the response and the predictors
- If $H_a$ is true
- $E[\frac{TSS-RSS}{p}] > \sigma^2$
- $\implies F > 1$
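A small sketch of this computation, using only the TSS and RSS definitions above; `scipy.stats.f.sf` gives the upper-tail p-value of the $F(p, n-p-1)$ distribution. Pass the observed response, the fitted values from any least squares fit, and the number of predictors (the function name is illustrative):

```python
import numpy as np
from scipy import stats

def overall_f_test(y, y_hat, p):
    """Overall F-test of H0: beta_1 = ... = beta_p = 0, from fitted values."""
    n = len(y)
    tss = np.sum((y - np.mean(y)) ** 2)          # total sum of squares
    rss = np.sum((y - y_hat) ** 2)               # residual sum of squares
    f = ((tss - rss) / p) / (rss / (n - p - 1))
    p_value = stats.f.sf(f, p, n - p - 1)        # upper tail of F(p, n - p - 1)
    return f, p_value
```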
Quantity | Value (Multiple Regression) |
---|---|
Residual Standard Error | 1.69 |
$R^2$ | 0.897 |
$F$-statistic | 570 |
- $F$-statistic is $570$ far larger than $1$
- compelling evidence against null hypothesis
- Large F-statistic suggests that at least one of the advertising media must be related to sales
- How large does the $F$-statistic need to be before we can reject $H_0$ and conclude that there is a relationship?
- When $n$ is large, an $F$-statistic that is just a little larger than $1$ might still provide evidence against $H_0$.
- In contrast, a larger F-statistic is needed to reject $H_0$ if $n$ is small.
- When $H_0$ is true and the errors $\epsilon_i$ have a normal distribution, the F-statistic follows an F-distribution.
- Even if the errors are not normally distributed, the F-statistic approximately follows an F-distribution provided that the sample size $n$ is large
- source: https://www.statisticshowto.com/probability-and-statistics/f-statistic-value-test/
For any given value of n and p, any statistical software package can be used to compute the p-value associated with the $F$-statistic using this distribution. Based on this $p$-value, we can determine whether or not to reject $H_0$.
$p$-value associated with $F$-statistic (570) for advertising data is essentially zero
- extremely strong evidence that at least one of the media is associated with increased sales
Sometimes we want to test whether a particular subset of $q$ coefficients is zero; for convenience, suppose these are the last $q$ coefficients in the model
- $ H_0:~\beta_{p-q+1} = \beta_{p-q+2} = … =\beta_p= 0 $
In this case we fit a second model that
- uses all the variables except those last $q$.
Suppose that the residual sum of squares for that model is $RSS_0$.
- Our earlier $ F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)} $ becomes
- $ F = \frac{(RSS_0-RSS)/q}{RSS/(n-p-1)} $
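A sketch of this partial F-test: fit the full model and the reduced model (dropping the $q$ variables under test), then plug their residual sums of squares into the formula above. The function name and arguments are illustrative:

```python
import numpy as np
from scipy import stats

def partial_f_test(y, y_hat_full, y_hat_reduced, p, q):
    """Test H0: the q coefficients dropped from the reduced model are all zero.

    y_hat_full    -- fitted values from the model with all p predictors
    y_hat_reduced -- fitted values from the model without the q variables under test
    """
    n = len(y)
    rss = np.sum((y - y_hat_full) ** 2)      # RSS of the full model
    rss0 = np.sum((y - y_hat_reduced) ** 2)  # RSS_0 of the reduced model
    f = ((rss0 - rss) / q) / (rss / (n - p - 1))
    p_value = stats.f.sf(f, q, n - p - 1)
    return f, p_value
```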
As in Table 1.4 (the multiple linear regression fit shown above), a t-statistic and a p-value were reported for each individual predictor
- These provide information about whether each individual predictor is related to the response, after adjusting for the other predictors.
- It turns out that each of these is exactly equivalent to the F-test that omits that single variable from the model, leaving all the others in.
- square of each t-statistic is the corresponding F-statistic
- So it reports the partial effect of adding that variable to the model.
- For instance, as we discussed earlier, these p-values indicate that TV and radio are related to sales, but that there is no evidence that newspaper is associated with sales, when TV and radio are held fixed.
Why do we need overall F-statistic, given these individual p-values for each variable?
It seems that if any one of the p-values for the individual variables (see the multiple regression table above) is very small, then at least one of the predictors must be related to the response.
However, this logic is flawed, especially when the number of predictors $p$ is large.
For instance, consider an example with
$p = 100$ predictors and
- $H_0 : \beta_1 = \beta_2 = \dots = \beta_p = 0$
- $H_a:$ at least one $\beta_j \ne 0$
Suppose $H_0$ is true, so that no variable is truly associated with the response.
- naively, we might then expect every individual $p$-value to be large
In this situation, about 5% of the p-values associated with each variable (of the type shown in Table 1.4) will be below 0.05 by chance.
In other words, we expect to see approximately five small p-values even in the absence of any true association between the predictors and the response.
In fact, it is likely that we will observe at least one p-value below 0.05 by chance!
Hence, if we use the individual t-statistics and associated p-values in order to decide whether or not there is any association between the variables and the response, there is a very high chance that we will incorrectly conclude that there is a relationship.
However, the F-statistic does not suffer from this problem because it adjusts for the number of predictors.
Hence, if $H_0$ is true, there is only a $5\%$ chance that the $F$-statistic will result in a p-value below $0.05$, regardless of the number of predictors or the number of observations.
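A quick simulation sketch of this point: with $p = 100$ pure-noise predictors and a pure-noise response, roughly five of the individual p-values fall below 0.05 by chance, while the overall F-test typically does not reject $H_0$ (exact counts vary with the random seed):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

n, p = 300, 100
X = rng.normal(size=(n, p))   # 100 predictors, none truly related to the response
y = rng.normal(size=n)        # response is pure noise, so H0 is true

fit = sm.OLS(y, sm.add_constant(X)).fit()

individual_p = fit.pvalues[1:]                        # drop the intercept's p-value
print("p-values below 0.05:", int((individual_p < 0.05).sum()))
print("overall F-test p-value:", round(fit.f_pvalue, 3))
```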
The F-statistic can be used to test for any association between the predictors and the response
- when p is relatively small, and certainly small compared to n
- If p > n
- there are more coefficients $\beta_j$ to estimate than observations from which to estimate them.
- In this case we cannot even fit the multiple linear regression model using least squares, so the $F$-statistic cannot be used
- and neither can most of the other concepts that we have seen so far in this chapter.
- When $p$ is large, some of the approaches discussed in the next section, such as forward selection, can be used.
- This high-dimensional setting is discussed in greater detail in Chapter 6.
Deciding on Important Variables
First step in a multiple regression analysis
- compute the $F$-statistic and to examine the associated $p$-value.
- If we conclude on the basis of that p-value that at least one of the predictors is related to the response, then it is natural to wonder which are the guilty ones!
- We could look at the individual p-values, but if the number of predictors $p$ is large, we are likely to make some false discoveries
What is Variable Selection?
- The task of determining which predictors are associated with the response, in order to fit a single model involving only those predictors, is referred to as variable selection
If $p = 2$, then we can consider four models:
- model containing no variables
- model containing $X_1$ only
- model containing $X_2$ only, and
- model containing both $X_1$ and $X_2$
Which model is best?
- Mallow’s $C_p$
- Akaike information criterion (AIC)
- Bayesian information criterion (BIC)
- Adjusted $R^2$
If $p=30$
- $2^{30} = 1,073,741,824$ models
Three Classical approaches to select models
Forward selection
source: https://quantifyinghealth.com/stepwise-selection/
- Begin with the null model
- a model that contains an intercept but no predictors.
- Fit $p$ simple linear regressions and add to the null model the variable that results in the lowest RSS.
- equivalently, this is the variable with the lowest $p$-value
- Add to that model the variable that results in the lowest RSS for the new two-variable model.
- This approach is continued until some stopping rule is satisfied.
Backward selection
We start with all variables in the model, and remove the variable with the largest p-value, that is, the variable that is the least statistically significant.
The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed.
This procedure continues until a stopping rule is reached.
For instance, we may stop when all remaining variables have a p-value below some threshold
source: https://quantifyinghealth.com/stepwise-selection/
Mixed selection
- This is a combination of forward and backward selection
- We start with no variables in the model, and as with forward selection, we add the variable that provides the best fit.
- We continue to add variables one-by-one.
- Of course, as we noted with the Advertising example, the $p$-values for variables can become larger as new predictors are added to the model.
- Hence, if at any point the $p$-value for one of the variables in the model rises above a certain threshold, then we remove that variable from the model.
- We continue to perform these forward and backward steps until all variables in the model have a sufficiently low $p$-value, and all variables outside the model would have a large p-value if added to the model.
Backward selection cannot be used if $p > n$, while forward selection can always be used.
Forward selection is a greedy approach, and might include variables early that later become redundant. Mixed selection can remedy this.
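A minimal forward-selection sketch using RSS as the greedy criterion; the stopping rule here is simply a maximum number of variables, whereas in practice one of the criteria listed above (AIC, BIC, adjusted $R^2$) or a p-value threshold would be used. The function name and signature are illustrative:

```python
import numpy as np
import statsmodels.api as sm

def forward_selection(X, y, max_vars=None):
    """Greedy forward selection: repeatedly add the predictor that most reduces RSS."""
    n, p = X.shape
    remaining = list(range(p))
    selected = []                      # column indices, in the order they are added
    limit = max_vars if max_vars is not None else p
    while remaining and len(selected) < limit:
        rss = {}
        for j in remaining:
            design = sm.add_constant(X[:, selected + [j]])
            rss[j] = sm.OLS(y, design).fit().ssr   # ssr = residual sum of squares
        best = min(rss, key=rss.get)               # candidate giving the lowest RSS
        selected.append(best)
        remaining.remove(best)
    return selected
```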
Model Fit
Most common numerical measures of model fit are
- Residual Standard Error (RSE), measure of lack of fit
- $R^2$, the fraction of variance explained, lies between $0$ and $1$
- These quantities are computed and interpreted in the same fashion as for simple linear regression
In simple regression, $R^2$ is the square of the correlation ($r^2 = Cor(X,Y)^2 $) of the response and the variable
- $r = Cor(X,Y) = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2} \sqrt{\sum(y_i - \bar{y})^2}}$
In multiple linear regression, it turns out that $R^2$ equals $Cor(Y, \hat{Y})^2$, the square of the correlation between the response ($Y$) and the fitted linear model ($\hat{Y}$)
- in fact one property of the fitted linear model is that it maximizes this correlation among all possible linear models.
An $R^2$ value close to $1$ indicates that the model explains a large portion of the variance in the response variable
Example
Advertising Data with multiple regression
Quantity | Value (Multiple Regression) |
---|---|
Residual Standard Error | 1.69 |
$R^2$ | 0.897 |
$F$-statistic | 570 |

$R^2$ using all three media is 0.8972
$R^2$ using only TV and Radio is 0.89719
small increase in $R^2$ when adding newspaper
- $R^2$ will always increase with addition of new variables due to decrease in residual sum of squares on the training data (may not be true for test data)
- $R^2 = \frac{TSS-RSS}{TSS} = 1 - \frac{RSS}{TSS}$
- TSS is total sum of squares = $\sum (y_i-\bar{y})^2$
- RSS is residual sum of squares = $\sum (y_i - \hat{y_i})^2$
Essentially, newspaper provides no real improvement in the model fit to the training samples, and its inclusion will likely lead to poor results on independent test samples due to overfitting
Simple Regression
- TV as Predictor Variable
Quantity | Value |
---|---|
Residual Standard Error | 3.26 |
$R^2$ | 0.612 |
$F$-statistic | 312.1 |

 | Coefficient | Std Error | t-statistic | p-value |
---|---|---|---|---|
Intercept | 7.0325 | 0.4578 | 15.36 | < 0.0001 |
TV | 0.0475 | 0.0027 | 17.67 | < 0.0001 |
Model containing only TV as a predictor had an $R^2$ of 0.61.
Adding radio to the model leads to a substantial improvement in $R^2$ as $0.89719$
This implies that a model that uses TV and radio expenditures to predict sales is substantially better than one that uses only TV advertising.
We could further quantify this improvement by looking at the p-value for the radio coefficient in a model that contains only TV and radio as predictors.
Model that contains only TV has
- RSE of 3.26
The model that contains only TV and radio as predictors has
- Residual Standard Error (RSE) of 1.681
Model that also contains newspaper as a predictor has
- RSE of 1.686
- a slight increase despite adding a variable
This corroborates our previous conclusion that a model that uses TV and radio expenditures to predict sales is much more accurate (on the training data) than one that only uses TV spending.
Furthermore, given that TV and radio expenditures are used as predictors, there is no point in also using newspaper spending as a predictor in the model.
The observant reader may wonder how RSE can increase when newspaper is added to the model given that RSS must decrease.
Why can RSE increase?
- $ RSE = \sqrt{\frac{RSS}{n-p-1}} $
- Models with more variables (higher $p$) can have higher RSE if the decrease in RSS is small relative to the increase in p.
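A small helper that makes the denominator effect explicit; RSE can rise when a new variable lowers RSS only slightly while $n - p - 1$ drops by one (an illustrative sketch, not tied to the Advertising numbers):

```python
import numpy as np

def rse(y, y_hat, p):
    """Residual standard error: sqrt(RSS / (n - p - 1))."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return np.sqrt(rss / (n - p - 1))

# RSE can increase when an extra variable barely reduces RSS,
# because the denominator n - p - 1 shrinks as p grows.
```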
In addition to looking at the $RSE$ and $R^2$ statistics just discussed, it can be useful to plot the data
Graphical summaries can reveal problems with a model that are not visible from numerical statistics
*Figure: Least squares regression plane for Sales versus TV and Radio budgets.*
- Linear model seems to overestimate sales for instances in which most of the advertising money was spent exclusively on either TV or radio.
- It underestimates sales for instances where the budget was split between the two media.
- This pronounced non-linear pattern suggests a synergy or interaction effect between the advertising media, whereby combining the media together results in a bigger boost to sales than using any single medium.
- In Section 3.3.2, we will discuss extending the linear model to accommodate such synergistic effects through the use of interaction terms.
Predictions
- Once we have fit the multiple regression model, response Y can be predicted on the basis of a set of values for the predictors $X_1,X_2, . . . ,X_p$
- $ \hat{y} = \hat{\beta_0} + \hat{\beta_1}x_1 + \hat{\beta_2}x_2 + … + \hat{\beta_p}x_p $
- However, there are three sorts of uncertainty associated with this prediction.
- model parameters $\hat{\beta_i}$ are estimates for $\beta_i$
- Least square plane $\hat{Y}$ is estimate for true regression plane $f(X)$
- inaccuracy in the coefficients estimates is related to reducible error
- we can compute confidence interval to determine how close $\hat{Y}$ will be to $f(X)$
- Model Bias in assuming linear model for $f(X)$
- Random error $\epsilon$, the irreducible error
- A confidence interval is used to quantify the uncertainty surrounding the average sales over a large number of cities
- For example, given that $100,000 is spent on TV advertising and $20,000 is spent on radio advertising in each city, the 95% confidence interval is [10,985, 11,528].
- We interpret this to mean that 95% of intervals of this form will contain the true value of f(X).
- A prediction interval can be used to quantify the uncertainty surrounding sales for a particular city
- Given that $100,000 is spent on TV advertising and $20,000 is spent on radio advertising in that city the 95% prediction interval is [7,930, 14,580].
- We interpret this to mean that 95% of intervals of this form will contain the true value of Y for this city.
- Note that both intervals are centered at 11,256, but that the prediction interval is substantially wider than the confidence interval, reflecting the increased uncertainty about sales for a given city in comparison to the average sales over many locations
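A hedged sketch of obtaining both intervals with statsmodels' `get_prediction`; it assumes the same `Advertising.csv` layout as before and that budgets are recorded in thousands of dollars, so $100,000 of TV spending corresponds to `TV = 100`:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed Advertising.csv layout, with budgets in thousands of dollars
ads = pd.read_csv("Advertising.csv")
fit = smf.ols("sales ~ TV + radio", data=ads).fit()

# $100,000 on TV and $20,000 on radio -> TV = 100, radio = 20
new_city = pd.DataFrame({"TV": [100.0], "radio": [20.0]})
pred = fit.get_prediction(new_city)

# summary_frame reports both the confidence interval for the mean response
# (mean_ci_lower/upper) and the prediction interval for a single observation
# (obs_ci_lower/upper); the prediction interval is the wider of the two
print(pred.summary_frame(alpha=0.05))
```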