Linear Regression



This lesson is from An Introduction to Statistical Learning

Other Considerations in the Regression Model

Qualitative Predictors

  • Predictor as qualitative

  • Credit data set

    • Response
      • balance (average credit card debt for each individual)
    • Quantitative predictors:
      • age, cards (number of credit cards), education (years of education), income (in thousands of dollars), limit (credit limit), and rating (credit rating).
    • Qualitative predictors:
      • own (house ownership), student (student status), status (marital status), and region (East, West or South)

Predictors with Only Two Levels

  • Qualitative predictor aka factor only has two levels:

    • Create a dummy variable (one-hot encoding in ML terminology) that takes the value $1$ when the attribute is present and $0$ when it is absent, and use it as a predictor in the regression equation

      • Credit Card problem

        • Response Variable is Credit card balance, average credit card debt for each individual

        • Predictor variable is if a person owns house or not \(x_i = \begin{cases} 1 &\text{if $i^{th}$ person owns house } \\ 0 &\text{if $i^{th}$ person doesn't own house } \end{cases}\)

    • Model

      • $ y_i = \beta_0 + \beta_1 x_i + \epsilon_i $ \(y_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i &\text{if $i^{th}$ person owns house } \\ \beta_0 + \epsilon_i &\text{if $i^{th}$ person doesn't own house } \end{cases}\)

        • $\beta_0$ can be interpreted as the average credit card balance among those who do not own a house,
        • $\beta_0 + \beta_1$ as the average credit card balance among those who do own their house, and
        • $\beta_1$ as the average difference in credit card balance between owners and non-owners
        | | Coefficient | Std. Error | t-statistic | p-value |
        |---|---|---|---|---|
        | Intercept | 509.8 | 33.13 | 15.389 | < 0.0001 |
        | own [Yes] | 19.73 | 46.05 | 0.429 | 0.669 |
      • $\beta_0 = 509.8$
        • average debt for non-owners
      • $\beta_1 = 19.73$
        • average difference
        • $509.8 + 19.73 = 529.53$ is the average debt for owners
      • However,
        • $p$-value of dummy variable is very high
          • No statistical evidence of difference in average credit balance based on house ownership
  • Alternatively, instead of a 0/1 coding scheme, we could create a dummy variable \(x_i = \begin{cases} 1 &\text{if $i^{th}$ person owns house } \\ -1 &\text{if $i^{th}$ person doesn't own house } \end{cases}\)

    • Model

      • $ y_i = \beta_0 + \beta_1 x_i + \epsilon_i $ \(y_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i &\text{if $i^{th}$ person owns house } \\ \beta_0 - \beta_1 + \epsilon_i &\text{if $i^{th}$ person doesn't own house } \end{cases}\)

      • $\beta_0$ can be interpreted as the overall average credit card balance (ignoring the house ownership effect), and

      • $\beta_1$ is the amount by which house owners and non-owners have credit card balances that are above and below the average, respectively

        • $\beta_0 = 519.665$
        • $\beta_1 = 9.865$
          • the fitted group averages are unchanged: $519.665 + 9.865 = 529.53$ for owners and $519.665 - 9.865 = 509.80$ for non-owners
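The equivalence of the two coding schemes can be checked numerically. The sketch below is only an illustration: the `owns` and `balance` arrays are made-up toy values (not the Credit data), and NumPy's least-squares solver stands in for whatever regression software is used.

```python
import numpy as np

# Hypothetical toy data: 1 = owns a house, 0 = does not (not the Credit data).
owns = np.array([1, 1, 1, 0, 0, 0, 0, 1])
balance = np.array([540.0, 500.0, 560.0, 480.0, 530.0, 510.0, 520.0, 520.0])

def fit(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns the array (b0, b1)."""
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# 0/1 coding: b0 is the non-owner mean, b0 + b1 the owner mean.
b0, b1 = fit(owns, balance)

# +1/-1 coding: c0 is the midpoint of the two group means, c1 half their difference.
c0, c1 = fit(2 * owns - 1, balance)

mean_owner = balance[owns == 1].mean()
mean_non_owner = balance[owns == 0].mean()
print(b0, b0 + b1)       # group means under 0/1 coding
print(c0 - c1, c0 + c1)  # the same group means under +1/-1 coding
```

Plugging in the estimates from the text, the same relation gives $509.8 + 19.73/2 = 519.665$ and $19.73/2 = 9.865$.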

Qualitative Predictors with More than Two Levels

  • We have three levels for Region - East, West, South
  • A factor with two levels needs one dummy variable; a factor with three levels needs two (always one fewer than the number of levels)

$$ x_{i1} = \begin{cases} 1 &\text{if $i^{th}$ person is from South } \\ 0 &\text{if $i^{th}$ person is not from South } \end{cases} $$

$$ x_{i2} = \begin{cases} 1 &\text{if $i^{th}$ person is from West } \\ 0 &\text{if $i^{th}$ person is not from West } \end{cases} $$

  • Model
    • $y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \epsilon_i$
  • Substituting the values of $x_{i1}$ and $x_{i2}$:
\[y_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i &\text{if $i^{th}$ person is from South } \\ \beta_0 + \beta_2 + \epsilon_i &\text{if $i^{th}$ person is from West } \\ \beta_0 + \epsilon_i &\text{if $i^{th}$ person is from East } \end{cases}\]
| | Coefficient | Std. error | t-statistic | p-value |
|---|---|---|---|---|
| region [South] | -18.69 | 65.02 | -0.287 | 0.7740 |
| region [West] | -12.50 | 56.68 | -0.221 | 0.8260 |
  • $\beta_0$
    • average credit card balance for individuals from the East (baseline)
  • $\beta_1$
    • difference in the average balance between people from the South versus the East
  • $\beta_2$
    • difference in the average balance between those from the West versus the East
  • Estimated balance
    • Baseline, East: USD 531.00
    • South is estimated to have USD 18.69 less debt than the East
    • West is estimated to have USD 12.50 less debt than the East
  • $p$-values associated with the coefficient estimates for the two dummy variables are very large
    • suggesting no statistical evidence of a real difference in average credit card balance between South and East or between West and East.
    • baseline category selection is arbitrary, but final predictions for each group will be the same regardless of this choice
    • However, the coefficients and their $p$-values do depend on the choice of dummy variable coding
  • Rather than rely on the individual coefficients, we can use an $F$-test of the hypothesis below; its result does not depend on the coding
    • $H_0 : \beta_1 = \beta_2 = 0$
  • $F$-test has a $p$-value of 0.96
    • indicating that we fail to reject the null hypothesis that there is no relationship between balance and region
  • Using this dummy variable approach presents no difficulties when incorporating both quantitative and qualitative predictors. For example, to regress balance on both a quantitative variable such as income and a qualitative variable such as student, we must simply create a dummy variable for student and then fit a multiple regression model using income and the dummy variable as predictors for credit card balance
  • There are many different ways of coding qualitative variables besides the dummy variable approach taken here. All of these approaches lead to equivalent model fits, but the coefficients are different and have different interpretations, and are designed to measure particular contrasts. This topic is beyond the scope of the book
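The claim that the baseline choice changes the coefficients but not the fitted values or the $F$-test can be illustrated with a short sketch. Everything here is an assumption made for the example: the data are synthetic (20 people per region, balance unrelated to region), and `dummy_design` and `fit_and_f` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
regions = np.array(["East", "South", "West"])
# Hypothetical sample: 20 people per region, balance unrelated to region.
region = np.repeat(regions, 20)
balance = 500 + rng.normal(0, 100, size=region.size)

def dummy_design(labels, baseline):
    """Intercept plus one 0/1 column for every level except the baseline."""
    levels = [lvl for lvl in regions if lvl != baseline]
    cols = [np.ones(len(labels))] + [(labels == lvl).astype(float) for lvl in levels]
    return np.column_stack(cols)

def fit_and_f(X, y):
    """Return coefficients, fitted values and the F-statistic for H0: all slopes are zero."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    rss = np.sum((y - fitted) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    n, p = X.shape[0], X.shape[1] - 1
    f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
    return coef, fitted, f_stat

coef_east, fitted_east, f_east = fit_and_f(dummy_design(region, "East"), balance)
coef_west, fitted_west, f_west = fit_and_f(dummy_design(region, "West"), balance)

print(coef_east, coef_west)  # coefficients depend on the baseline
print(f_east, f_west)        # the F-statistic does not
```

In each fit the intercept is the mean balance of the baseline group, which is why the individual coefficients move when the baseline changes.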

Extensions of the Linear Model

  • assumptions
    • relationship between the predictors and response are additive and linear
    • additivity assumption
      • means that the association between a predictor $X_j$ and the response $Y$ does not depend on the values of the other predictors.
    • linearity assumption
      • states that the change in the response $Y$ associated with a one-unit change in $X_j$ is constant, regardless of the value of $X_j$
\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon \\ Let~~ \tilde{x_2} = x_2 +1 \\ \begin{align*} \tilde{y} &= \beta_0 + \beta_1x_1 + \beta_2\tilde{x_2} + \epsilon \\ &= \beta_0 + \beta_1x_1 + \beta_2(x_2+1) + \epsilon \\ &= \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon + \beta_2 \\ &= y + \beta_2 \end{align*}\]
  • in a linear model, it doesn't matter what value $x_j$ takes: the effect of a one-unit change in $x_j$ on the response is always $\beta_j$

Removing the Additive Assumption

  • In our previous analysis of the Advertising data, we concluded that both TV and radio seem to be associated with sales.

  • The linear models that formed the basis for this conclusion assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media.

  • For example, the below linear model states that the average increase in sales associated with a one-unit increase in TV is always $\beta_1$, regardless of the amount spent on radio

    • $y = \beta_0 + \beta_1\times TV + \beta_2 \times Radio + \beta_3 \times Newspaper + \epsilon$
  • However, this simple model may be incorrect.

  • Suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases.

  • In this situation, given a fixed budget of USD 100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio.

  • In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect.

  • Standard Linear Regression Model with two variables

    • $ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon $

    • regardless of the value of $X_2$, a one unit increase in $X_1$ is associated with a $\beta_1$-unit increase in $Y$.

    • One way of extending this model is to include a third predictor, called an interaction term, which is constructed by computing the product of $X_1$ and $X_2$.

      • $ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2 + \epsilon $
    • How does this relax the additive assumption?

      \[\begin{align*} Y &= \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2 + \epsilon \\ &= \beta_0 + \beta_1X_1 + \beta_3X_1X_2 + \beta_2X_2 + \epsilon \\ &= \beta_0 + (\beta_1 + \beta_3X_2)X_1 + \beta_2X_2 + \epsilon \\ &= \beta_0 + \tilde{\beta_1}X_1 + \beta_2X_2 + \epsilon \end{align*}\]
      • where $\tilde{\beta_1} = \beta_1 + \beta_3X_2$
      • Since $\tilde{\beta_1}$ is now a function of $X_2$, the association between $X_1$ and $Y$ is no longer constant
        • a change in the value of $X_2$ will change the association between $X_1$ and $Y$
        • A similar argument shows that a change in the value of $X_1$ changes the association between $X_2$ and $Y$
  • Example

    • productivity of a factory
      • We wish to predict the number of units produced on the basis of the number of production lines and the total number of workers.
      • It seems likely that the effect of increasing the number of production lines will depend on the number of workers, since if no workers are available to operate the lines, then increasing the number of lines will not increase production
      • This suggests that it would be appropriate to include an interaction term between lines and workers in a linear model to predict units
\[\begin{align*} units &\approx 1.2 + 3.4 \times Lines + 0.22 \times Workers + 1.4 \times (Lines \times Workers) \\ &= 1.2 + (3.4 + 1.4 \times Workers)\times Lines + 0.22 \times Workers \end{align*}\]
  • adding an additional line will increase the number of units produced by 3.4 + 1.4 × workers
  • Hence the more workers we have, the stronger will be the effect of lines.
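The dependence of the line effect on the number of workers can be read straight off the fitted equation. The sketch below simply evaluates the model quoted above; the function names are hypothetical.

```python
def predicted_units(lines: float, workers: float) -> float:
    """Fitted factory model from the text, with a lines-by-workers interaction."""
    return 1.2 + 3.4 * lines + 0.22 * workers + 1.4 * lines * workers

def effect_of_extra_line(workers: float) -> float:
    """Change in predicted units from adding one production line."""
    return 3.4 + 1.4 * workers

for workers in (0, 10, 50):
    gain = predicted_units(3, workers) - predicted_units(2, workers)
    print(workers, round(gain, 6), effect_of_extra_line(workers))
```

The difference between two predictions that differ by one line matches $3.4 + 1.4 \times workers$ for every workforce size, which is exactly what the regrouped equation says.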

  • Advertising Example

    • linear model that uses radio, TV, and an interaction between the two to predict sales

    • \[\begin{align*} sales &= \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times (TV \times Radio) + \epsilon \\ &= \beta_0 + (\beta_1+\beta_3 \times Radio) \times TV + \beta_2 \times Radio + \epsilon \end{align*}\]
      • We can interpret $\beta_3$ as the increase in the effectiveness of TV advertising associated with a one-unit increase in radio advertising (or vice-versa)
      | | Coefficient | Std. Error | t-statistic | p-value |
      |---|---|---|---|---|
      | Intercept | 6.7502 | 0.248 | 27.23 | < 0.0001 |
      | TV | 0.0191 | 0.002 | 12.70 | < 0.0001 |
      | TV x Radio | 0.0011 | 0.000 | 20.73 | < 0.0001 |
    • model that includes the interaction term is superior to the model that contains only main effects
    • The $p$-value for the interaction term, $TV \times radio$, is extremely low, indicating that there is strong evidence for $H_a: \beta_3 \ne 0$.
    • In other words, it is clear that the true relationship is not additive
    • The $R^2$ for this model is 96.8 %, compared to only 89.7% for the model that predicts sales using TV and radio without an interaction term.
      • This means that (96.8 − 89.7)/(100 − 89.7) = 69% of the variability in sales that remains after fitting the additive model has been explained by the interaction term
    • The coefficient estimates suggest that an increase in TV advertising of USD 1,000 is associated with increased sales of $(\hat{\beta_1} + \hat{\beta_3} \times Radio )\times 1000$ = $1000\hat{\beta_1} + 1000\hat{\beta_3} \times Radio$ = $ 19 + 1.1\times Radio$ units.
    • And an increase in radio advertising of USD 1,000 will be associated with an increase in sales of $(\hat{\beta_2} + \hat{\beta_3} \times TV )\times 1000$ = $1000\hat{\beta_2} + 1000\hat{\beta_3} \times TV$ = $ 29 + 1.1\times TV$ units
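The arithmetic in the bullets above can be reproduced in a few lines. This is a sketch using only the figures quoted in the text; the radio coefficient of 0.029 is inferred from the "$29 + 1.1 \times TV$" expression, and budgets are assumed to be in thousands of dollars with sales in thousands of units.

```python
# Estimates quoted in the text; beta_radio is inferred from "29 + 1.1 x TV".
beta_tv, beta_radio, beta_inter = 0.0191, 0.029, 0.0011
r2_additive, r2_interaction = 0.897, 0.968

# Share of the variability left over by the additive model that the interaction explains.
share_explained = (r2_interaction - r2_additive) / (1 - r2_additive)

def sales_gain_per_1000_tv(radio: float) -> float:
    """Extra units sold per additional USD 1,000 of TV spend at a given radio budget
    (radio in thousands of dollars, as assumed above)."""
    return 1000 * (beta_tv + beta_inter * radio)

def sales_gain_per_1000_radio(tv: float) -> float:
    """Extra units sold per additional USD 1,000 of radio spend at a given TV budget."""
    return 1000 * (beta_radio + beta_inter * tv)

print(round(share_explained, 2))
print(sales_gain_per_1000_tv(0), sales_gain_per_1000_tv(10))
```

The gain from TV spending grows by about 1.1 units for every extra USD 1,000 of radio budget, which is the synergy effect described above.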
  • In this example, the $p$-values associated with TV, radio, and the interaction term all are statistically significant

    • all three variables should be included in the model
  • However, it is sometimes the case that an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not.

  • Hierarchical principle

    • states that if we include an interaction in a model, we should also include the main effects, even if the $p$-values associated with their coefficients are not significant.
  • In other words, if the interaction between $X_1$ and $X_2$ seems important, then we should include both $X_1$ and $X_2$ in the model even if their coefficient estimates have large $p$-values.

  • The rationale for this principle is that if $X_1 \times X_2$ is related to the response, then whether or not the coefficients of $X_1$ or $X_2$ are exactly zero is of little interest.

  • Also $X_1 \times X_2$ is typically correlated with $X_1$ and $X_2$, and so leaving them out tends to alter the meaning of the interaction

  • What if Qualitative?

    • In the previous example, we considered an interaction between TV and radio, both of which are quantitative variables.
    • However, the concept of interactions applies just as well to qualitative variables, or to a combination of quantitative and qualitative variables.
  • Example: Credit data set

    • predict balance using the income (quantitative) and student (qualitative) variables

      \[\begin{align*} Balance_i &\approx \beta_0 + \beta_1 \times income_i + \begin{cases} \beta_2 &\text{if ith person is student} \\ 0 &\text{if ith person is not student} \\ \end{cases} \\ &= \beta_1 \times income_i + \begin{cases} \beta_0 + \beta_2 &\text{if ith person is student} \\ \beta_0 &\text{if ith person is not student} \\ \end{cases} \end{align*}\]
      • Will give two lines
        • Slope $\beta_1$
        • Intercept $\beta_0 + \beta_2$ if student else $\beta_0$
      Figure: left panel is the model without interaction, right panel is the model with interaction
      • The fact that the lines are parallel in Left means that the average effect on balance of a one-unit increase in income does not depend on whether or not the individual is a student.

      • This represents a potentially serious limitation of the model, since in fact a change in income may have a very different effect on the credit card balance of a student versus a non-student.

      • This limitation can be addressed by adding an interaction variable, created by multiplying income with the dummy variable for student.

        • $ Balance_i \approx \beta_0 + \beta_1 \times Income_i + \beta_2 \times Student_i + \beta_3 \times Income_i \times Student_i$, where $Student_i$ is the 0/1 dummy variable
      • If student

        • $ Balance_i \approx \beta_0 + \beta_1 \times Income_i + \beta_2 + \beta_3 \times Income_i$
        • $ Balance_i \approx (\beta_0 + \beta_2) + (\beta_1+\beta_3) \times Income_i$
      • If not a student

        • $ Balance_i \approx \beta_0 + \beta_1 \times Income_i $
      • \[Balance_i \approx \begin{cases} (\beta_0 + \beta_2) + (\beta_1+\beta_3) \times Income_i &\text{if student} \\ \beta_0 + \beta_1 \times Income_i &\text{if not student} \end{cases}\]
      • Now the lines have different slope and intercept

        • Right figure shows slope for students is lower than the slope for non-students.
        • This suggests that increases in income are associated with smaller increases in credit card balance among students as compared to non-students

Non-linear Relationships

  • true relationship between the response and the predictors may be nonlinear

    Figure: Auto data set, mpg vs horsepower
  • very simple way to directly extend the linear model to accommodate non-linear relationships is using polynomial regression

    • simply include transformed versions of the predictors
    • Quadratic
      • $ mpg = \beta_0 + \beta_1 \times \text{horsepower} + \beta_2 \times \text{horsepower}^2 + \epsilon $
    • Still we can use linear regression software to estimate $\beta_0, \beta_1, \beta_2$
    • $R^2$ of the linear fit is 0.606 and of the quadratic fit is 0.688
    • p-value for quadratic term is also significant
| | Coefficient | Std. Error | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | 56.9001 | 1.8004 | 31.6 | < 0.0001 |
| Horsepower | -0.4662 | 0.0311 | -15.0 | < 0.0001 |
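Polynomial regression is still linear in the coefficients, so ordinary least squares can fit it once a squared column is added to the design matrix. The sketch below assumes synthetic data that only loosely imitate the Auto relationship (the generating coefficients and noise level are made up) and a hypothetical helper `r_squared`.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for the Auto data: a curved relationship plus noise.
horsepower = np.linspace(50, 220, 120)
mpg = 57 - 0.47 * horsepower + 0.0012 * horsepower**2 + rng.normal(0, 2, horsepower.size)

def r_squared(X, y):
    """Fit y on the columns of X by least squares and return (coefficients, R^2)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ coef) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return coef, 1 - rss / tss

ones = np.ones_like(horsepower)
_, r2_linear = r_squared(np.column_stack([ones, horsepower]), mpg)
coef_quad, r2_quadratic = r_squared(
    np.column_stack([ones, horsepower, horsepower**2]), mpg
)
print(r2_linear, r2_quadratic)
```

Because the linear model is nested in the quadratic one, the quadratic $R^2$ can never be lower; whether the improvement is meaningful is what the $p$-value of the quadratic term assesses.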

Potential Problems

  • Non-linearity of the response-predictor relationships
  • Correlation of error terms
  • Non-constant variance of error terms
  • Outliers
  • High-leverage points
  • Collinearity

Non-linearity of the Data

Auto dataset: mpg vs horsepower
  • Residuals plots
    • can identify non-linearity
    • simple regression
      • plot residuals, $e_i = y_i -\hat{y_i} $ vs predictors $x_i$
    • multiple regression
      • plot residuals, $e_i = y_i -\hat{y_i} $ vs predicted (or fitted) value $\hat{y_i}$
    • Ideally, the residual plot will show no discernible pattern. The presence of a pattern may indicate a problem with some aspect of the linear model.
      • Left plot shows U-shape indicating non-linearity
      • Right plot shows no pattern indicating quadratic term improved the fitting
    • If the residual plot indicates non-linearity
      • try non-linear transformations of the predictors such as $\log X,~\sqrt{X},~ X^2$
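Without drawing a plot, the U-shape can also be detected numerically by averaging the residuals within bands of the fitted values. This is a rough sketch on synthetic curved data; the three-band summary and the helper name are choices made for the example, not a standard diagnostic.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 90)
y = 1 + 0.5 * x**2 + rng.normal(0, 0.5, x.size)  # synthetic curved data

def residuals_by_fitted_third(X, y):
    """Fit by least squares, then average the residuals within the lowest,
    middle and highest third of the fitted values."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    resid = y - fitted
    order = np.argsort(fitted)
    return np.array([resid[idx].mean() for idx in np.array_split(order, 3)])

ones = np.ones_like(x)
linear_thirds = residuals_by_fitted_third(np.column_stack([ones, x]), y)
quadratic_thirds = residuals_by_fitted_third(np.column_stack([ones, x, x**2]), y)
print(linear_thirds)     # positive, negative, positive: a U-shape
print(quadratic_thirds)  # all close to zero
```

For the straight-line fit the band averages are positive at both ends and negative in the middle, the numerical counterpart of the U-shaped residual plot; adding the squared term removes the pattern.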

Correlation of Error Terms

  • An important assumption of the linear regression model is that the error terms $\epsilon_1, \epsilon_2, \dots, \epsilon_n$ are uncorrelated

    • i.e. the fact that $\epsilon_i$ is positive provides little or no information about the sign of $\epsilon_{i+1}$
  • The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms.

    • Standard Error for $\beta_0$ and $\beta_1$

      • $[SE(\hat{\beta_0})]^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum\limits_{i=1}^{n}(x_i - \bar{x})^2}\right]$
      • $[SE(\hat{\beta_1})]^2 = \frac{\sigma^2}{\sum\limits_{i=1}^{n}(x_i - \bar{x})^2}$
      • where $\sigma^2 = Var(\epsilon)$
    • In general, $\sigma^2$ is unknown but can be estimated using data. Estimate of $\sigma$ is known as residual standard error,

      $RSE = \sqrt{\frac{RSS}{n-2}} = \sqrt{\frac{\sum\limits_{i=1}^{n}(y_i-\hat{y_i})^2}{n-2}} $

      where $RSS$ is residual sum of squares

    • Standard Errors can be used to compute confidence intervals

      • approximate 95% CI for $\beta_0$ and $\beta_1$
        • $\hat{\beta_0} \pm 2 \cdot SE(\hat{\beta_0})$
        • $\hat{\beta_1} \pm 2 \cdot SE(\hat{\beta_1})$
  • If in fact there is correlation among the error terms
    • then the estimated standard errors will tend to underestimate the true standard errors
      • estimated standard errors will be lower
    • As a result, confidence and prediction intervals will be narrower than they should be.
  • For example, a 95% confidence interval may in reality have a much lower probability than 0.95 of containing the true value of the parameter.
  • In addition, $p$-values associated with the model will be lower than they should be; this could cause us to erroneously conclude that a parameter is statistically significant.
  • In short, if the error terms are correlated, we may have an unwarranted sense of confidence in our model
  • Example
    • suppose we accidentally doubled our data, leading to observations and error terms identical in pairs.
      • standard error calculations would be as if we had a sample of size 2n
        • when in fact we have only n samples.
      • Our estimated parameters would be the same for the $2n$ samples as for the $n$ samples, but the confidence intervals would be narrower by a factor of $\sqrt{2}$
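The doubled-data thought experiment is easy to simulate. The sketch below uses synthetic data and the simple-regression formula for $SE(\hat{\beta_1})$ given earlier, with $\sigma^2$ replaced by its estimate $RSS/(n-2)$; the helper name `slope_and_se` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 1, n)  # synthetic data with independent errors

def slope_and_se(x, y):
    """Simple linear regression: slope estimate and its standard error,
    using SE^2 = RSE^2 / sum((x - mean(x))^2) with RSE^2 = RSS / (n - 2)."""
    x_bar, y_bar = x.mean(), y.mean()
    sxx = np.sum((x - x_bar) ** 2)
    slope = np.sum((x - x_bar) * (y - y_bar)) / sxx
    intercept = y_bar - slope * x_bar
    rss = np.sum((y - intercept - slope * x) ** 2)
    return slope, np.sqrt(rss / (len(x) - 2) / sxx)

slope_n, se_n = slope_and_se(x, y)
# Accidentally duplicate every observation: errors are now identical in pairs.
slope_2n, se_2n = slope_and_se(np.tile(x, 2), np.tile(y, 2))
ratio = se_n / se_2n
print(slope_n, slope_2n, ratio)
```

The slope is unchanged while the reported standard error shrinks by a factor close to $\sqrt{2}$ (not exactly, because the degrees of freedom change from $n-2$ to $2n-2$), even though no new information was added.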
  • Why might correlations among the error terms occur?
    • Such correlations frequently occur in the context of time series data, which consists of observations for which measurements are obtained at discrete points in time
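    • We can plot the residuals as a function of time: uncorrelated errors show no discernible pattern, whereas positively correlated errors show tracking, i.e. adjacent residuals with similar values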
    • Another Example
      • For instance, consider a study in which individuals’ heights are predicted from their weights.
      • The assumption of uncorrelated errors could be violated if some of the individuals in the study are members of the same family, eat the same diet, or have been exposed to the same environmental factors
      • In general, the assumption of uncorrelated errors is extremely important for linear regression as well as for other statistical methods, and good experimental design is crucial in order to mitigate the risk of such correlations

Non-constant Variance of Error Terms

  • Assumption of the linear regression model is that
    • error terms have a constant variance
    • $Var(\epsilon_i) = \sigma^2$
    • The standard errors, confidence intervals, and hypothesis tests associated with the linear model rely upon this assumption
Residual Plots
  • Unfortunately, it is often the case that the variances of the error terms are non-constant
    • For instance, the variances of the error terms may increase with the value of the response
    • One can identify non-constant variances in the errors, or heteroscedasticity, from the presence of a funnel shape in residual plot.
      • Left-hand panel, in which the magnitude of the residuals tends to increase with the fitted values.
    • When faced with this problem, one possible solution is to transform the response $Y$ using a concave function such as $\log Y$ or $\sqrt{Y}$
    • Such a transformation results in a greater amount of shrinkage of the larger responses, leading to a reduction in heteroscedasticity.
      • Right-hand panel shows the residual plot after using $\log Y$
    • The residuals now appear to have constant variance, though there is some evidence of a slight non-linear relationship in the data.
    • Sometimes we have a good idea of the variance of each response.
    • For example, the $i^{th}$ response could be an average of $n_i$ raw observations.
      • If each of these raw observations is uncorrelated with variance $\sigma^2$, then their average has variance $\sigma_i^2 = \sigma^2/n_i$.
      • In this case a simple remedy is to fit our model by weighted least squares, with weights proportional to the inverse variances—i.e. $w_i=n_i$ in this case.
      • Most linear regression software allows for observation weights
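Weighted least squares with $w_i = n_i$ can be sketched with plain NumPy by rescaling each row by $\sqrt{w_i}$ and solving an ordinary least-squares problem. The data, group sizes and coefficients below are synthetic assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_obs = 40
x = rng.uniform(0, 10, n_obs)
n_i = rng.integers(1, 20, n_obs)  # hypothetical number of raw observations behind each average
y = 1 + 2 * x + rng.normal(0, 1, n_obs) / np.sqrt(n_i)  # Var(y_i) = sigma^2 / n_i

X = np.column_stack([np.ones(n_obs), x])
w = n_i.astype(float)  # weights proportional to 1 / variance

# Weighted least squares minimises sum_i w_i * (y_i - x_i' beta)^2.
# Scaling each row by sqrt(w_i) turns it into an ordinary least-squares problem.
sqrt_w = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)

# Same estimate from the weighted normal equations (X' W X) beta = X' W y.
beta_normal = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
print(beta_wls, beta_normal)
```

Observations that are averages of many raw values get a larger weight, so they pull the fitted line more strongly than noisier averages built from few values.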


Outliers

  • An outlier is a point for which $y_i$ is far from the value predicted by the model

    • outliers can arise for a variety of reasons, such as incorrect recording of an observation during data collection

      Left (red line is the fit with the outlier, blue line is the fit without it)
  • Removing an outlier typically has little impact on the slope and intercept when its predictor value ($X$) is not unusual

    • However, it can cause other problems
  • RSE, residual standard error $\sqrt{\frac{RSS}{n-2}}$

    • 1.09 if outlier Included
    • 0.77 if removed
    • RSE is used to compute confidence intervals and p-values
  • $R^2$ value

    • 0.892 if removed
    • 0.805 if included
  • Identifying outliers using residual plots

    • Center panel shows that the outlier has a large residual
    • but it can be difficult to decide how large a residual must be before the point is considered an outlier
  • Studentized Residuals

    • computed by dividing each residual $e_i$ by its estimated standard error
    • Observations whose studentized residuals are greater than 3 in absolute value are possible outliers
  • If we believe that an outlier has occurred due to an error in data collection or recording, then one solution is to simply remove the observation.

  • However, care should be taken, since an outlier may instead indicate a deficiency with the model, such as a missing predictor
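Studentized residuals can be computed directly from the hat matrix. The sketch below uses the "internally" studentized form $e_i / (RSE\sqrt{1-h_i})$, which matches the description of dividing each residual by its estimated standard error; some software reports a leave-one-out variant instead. The data and the planted outlier are synthetic.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x = np.linspace(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 1, n)
y[20] += 10  # plant one hypothetical outlier

X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

# Leverage values h_i are the diagonal of the hat matrix X (X'X)^{-1} X'.
h = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))
p = X.shape[1] - 1
rse = np.sqrt(np.sum(resid**2) / (n - p - 1))

# Internally studentized residuals: e_i / (RSE * sqrt(1 - h_i)).
studentized = resid / (rse * np.sqrt(1 - h))
flagged = np.flatnonzero(np.abs(studentized) > 3)
print(flagged)
```

Scaling by the estimated standard error puts every residual on a common scale, so the threshold of 3 does not depend on the units of the response.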

High Leverage Points

  • outliers are observations for which the response $y_i$ is unusual given the predictor $x_i$
  • observations with high leverage have an unusual value for $x_i$
Left (red with all data and blue by removing high leverage point 41)
  • Removing the high leverage observation has a much more substantial impact on the least squares line than removing the outlier
  • It is a cause for concern if the least squares line is heavily affected by just a few high leverage observations
  • Simple linear regression
    • simply look for observations for which the predictor value is outside of the normal range of the observations.
  • Multiple linear regression
    • it is possible to have an observation that is well within the range of each individual predictor’s values, but that is unusual in terms of the full set of predictors
      • Middle Panel shows two variables, red dot is outside the range when both variables considered
    • Compute Leverage Statistic
      • Simple Linear Regression
        • $ h_i = \frac{1}{n} + \frac{(x_i -\bar{x} )^2}{\sum\limits_{j=1}^{n} (x_j -\bar{x} )^2 } $
        • $h_i$ increases with the distance of $x_i$ from $\bar{x}$
        • Large $h_i$ indicates high leverage
      • Multiple Linear Regression
        • Leverage statistic $h_i$ is always
          • between $1/n$ and $1$
        • Average leverage for all the observations is always
          • equal to $(p+1)/n$
        • If a given observation has a leverage statistic that greatly exceeds (p+1)/n
          • we may suspect that the corresponding point has high leverage
  • Studentized Residuals versus $h_i$
    • Right-hand panel
    • Observation 41 stands out as having a very high leverage statistic as well as a high studentized residual.
      • outlier as well as a high leverage observation
      • Particularly dangerous combination
      • Plot also reveals the reason that observation 20 had relatively little effect on the least squares fit, it has low leverage
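The properties of the leverage statistic listed above can be verified numerically. This sketch uses synthetic data with one deliberately unusual predictor value; it compares the simple-regression formula for $h_i$ with the diagonal of the hat matrix $H = X(X^TX)^{-1}X^T$, which is the general definition.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 30
x = rng.normal(0, 1, n)
x[0] = 6.0  # hypothetical observation with an unusual predictor value

# Simple linear regression: closed-form leverage statistic.
h_formula = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

# General case: diagonal of the hat matrix H = X (X'X)^{-1} X'.
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.solve(X.T @ X, X.T)
h_hat = np.diag(H)

p = X.shape[1] - 1
average_leverage = h_hat.mean()  # equals (p + 1) / n
print(average_leverage, (p + 1) / n, h_hat.argmax())
```

The observation with the unusual $x$ value has by far the largest $h_i$, well above the average of $(p+1)/n$.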


Collinearity

  • Collinearity refers to the situation in which two or more predictor variables are closely related to one another

    Credit Card Dataset: Limit and Age have no relationship but Limit and rating are highly correlated and are collinear
  • If collinear

    • difficult to separate out the individual effects of collinear variables on the response

      • since both increase/decrease together
      RSS Contour Plots
    • Left

      • RSS associated with different coefficient estimates of limit and age
      • black dot (least square estimates) represents smallest RSS
    • Right

      • RSS associated with different coefficient estimates of limit and rating
      • contours run along a narrow valley; there is a broad range of values for the coefficient estimates that result in equal values for RSS
      • Hence a small change in the data could cause the pair of coefficient values that yield the smallest RSS—that is, the least squares estimates—to move anywhere along this valley.
      • This results in a great deal of uncertainty in the coefficient estimates
  • Collinearity reduces the accuracy of the estimates of the regression coefficients

    • causes the standard error for $\hat{\beta_j}$ to grow
    • $t$-statistic for each predictor is calculated by dividing $\hat{\beta_j}$ by its standard error.
    • Consequently, collinearity results in a decline in the t-statistic
    • As a result of reduction in t-statistic value, in the presence of collinearity, we may fail to reject $H_0 : \beta_j = 0$
    • This means that the power of the hypothesis test—the probability of correctly detecting a non-zero coefficient—is reduced by collinearity
      Table columns: Coefficient, Std. Error, t-statistic, p-value (for Model 1 and Model 2 below)
    • Model 1: Balance on Age and Limit
      • age and limit have no relationship
      • Both age and limit significant
        • Limit is significant in presence of age
    • Model 2: Balance on Rating and Limit
      • Limit and rating are highly related and collinear
      • Limit is not significant
      • Importance of Limit has been masked due to presence of collinearity
    • Detect Collinearity
      • Check for large absolute value in correlation matrix of predictors to detect pair-wise collinearity
    • Multi-collinearity
      • collinearity between three or more variables
        • even if no pair of variables has high correlation
    • Detecting Multicollinearity
      • compute the variance inflation factor (VIF)
      • $VIF(\hat{\beta_j}) = \frac{1}{1-R_{X_j|X_{-j}}^2}$, where $R_{X_j|X_{-j}}^2$ is the $R^2$ from a regression of $X_j$ onto all of the other predictors
      • Minimum VIF is $1$
        • complete absence of collinearity
      • If VIF exceeds $5$ or $10$ (a rule of thumb)
        • a problematic amount of collinearity is present
      • If $R_{X_j|X_{-j}}^2$ is close to $1$, then collinearity is present and VIF will be large
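The VIF definition translates directly into code: regress each predictor on the others and invert $1 - R^2$. The sketch below is an illustration on synthetic predictors (one nearly duplicated, one independent), with a hypothetical helper `vif`.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
# Hypothetical predictors: x2 is almost a copy of x1, x3 is unrelated to both.
x1 = rng.normal(0, 1, n)
x2 = x1 + 0.05 * rng.normal(0, 1, n)
x3 = rng.normal(0, 1, n)
X = np.column_stack([x1, x2, x3])

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on all the other columns (with an intercept)."""
    n_rows, n_cols = X.shape
    out = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        rss = np.sum((target - others @ coef) ** 2)
        tss = np.sum((target - target.mean()) ** 2)
        out[j] = tss / rss  # identical to 1 / (1 - R^2)
    return out

vifs = vif(X)
print(vifs)
```

The two nearly identical predictors get very large VIF values, while the unrelated one stays close to the minimum of 1.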
  • Example

    • Credit Card Dataset
      • regression of balance on age, rating, and limit
      • VIF values are
        • 1.01 (age)
        • 160.67 (rating)
        • 160.59 (limit)
      • shows considerable amount of collinearity
  • How to solve Collinearity

    • drop one of the problematic variables
      • for example, drop rating
      • regression of balance on age and limit
        • VIF close to 1 for both age and limit
        • $R^2$ drops from 0.754 to 0.75
      • dropping solves collinearity without compromising fit
    • combine collinear variables into single predictor
      • average of standardized versions of limit and rating in order to create a new variable that measures credit worthiness

Marketing Plan

  • Is there a relationship between sales and advertising budget?
    • fitting a multiple regression model of sales onto TV, radio, and newspaper
      • $sales = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper + \epsilon $
    • testing the hypothesis
      • $H_0 : \beta_{TV} = \beta_{radio} = \beta_{newspaper} = 0$
    • Use $F$-statistic to determine whether or not we should reject this null hypothesis.
      • In this case the $p$-value corresponding to the $F$-statistic value of 570 is essentially zero i.e. very low
      • indicating clear evidence of a relationship between advertising and sales
  • How strong is the relationship?
    • RSE, Residual Standard Error
      • estimates the standard deviation of the response from the population regression line.
      • For the Advertising data, the RSE is 1.69 units while the mean value for the response is 14.022
        • indicating a percentage error of roughly 12 %
    • $R^2$ statistic
      • records the percentage of variability in the response that is explained by the predictors
      • The value for advertising dataset is 0.897
      • The predictors explain almost 90% of the variance in sales
  • Which media are associated with sales?
    • examine the p-values associated with each predictor’s t-statistic
    • In the multiple linear regression displayed, the p-values for TV and radio are low, but the p-value for newspaper is not
    • This suggests that only TV and radio are related to sales
      • chapter 6 explores this question in greater detail
  • How large is the association between each medium and sales?
    • standard error of $\hat{\beta_j}$ can be used to construct confidence intervals for $\beta_j$.
    • For the Advertising data
      • $\hat{\beta}_{TV} = 0.046, StdError_{TV} = 0.0014$
        • 95% CI is (0.043, 0.049), from $0.046 \pm 2 \times 0.0014$
      • $\hat{\beta}_{Radio} = 0.189, StdError_{Radio} = 0.0086$
        • 95% CI is (0.172, 0.206), from $0.189 \pm 2 \times 0.0086$
      • $\hat{\beta}_{Newspaper} = -0.001, StdError_{Newspaper} = 0.0059$
        • 95% CI is (-0.013, 0.011), from $-0.001 \pm 2 \times 0.0059$
      • Analysis
        • CIs for TV and radio are narrow and far from zero
          • statistically significant
        • CI for newspaper includes zero
          • no statistical significance
        • Could collinearity be the reason the CI for newspaper is wide?
          • VIF scores are 1.005, 1.145, and 1.145 for TV, radio, and newspaper
            • no evidence of collinearity
        • In order to assess the association of each medium individually on sales, we can perform three separate simple linear regressions.
          • There is evidence of an extremely strong association between
            • TV and sales and
              • coefficient 0.0475: an additional USD 1,000 spent on TV is associated with about 47 more units sold
            • radio and sales
              • coefficient 0.203: an additional USD 1,000 spent on radio is associated with about 203 more units sold
          • There is evidence of a mild association between
            • newspaper and sales, when the values of TV and radio are ignored
              • coefficient 0.055: an additional USD 1,000 spent on newspaper is associated with about 55 more units sold
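The approximate 95% intervals quoted above follow from the rule $\hat{\beta_j} \pm 2 \times SE(\hat{\beta_j})$. The sketch below recomputes them from the estimates and standard errors given in the text; the helper name `approx_ci` is hypothetical.

```python
# Coefficient estimates and standard errors quoted in the text.
estimates = {
    "TV": (0.046, 0.0014),
    "radio": (0.189, 0.0086),
    "newspaper": (-0.001, 0.0059),
}

def approx_ci(beta_hat, se, multiplier=2.0):
    """Approximate 95% confidence interval: beta_hat +/- 2 * SE."""
    return beta_hat - multiplier * se, beta_hat + multiplier * se

intervals = {name: approx_ci(b, se) for name, (b, se) in estimates.items()}
contains_zero = {name: lo <= 0.0 <= hi for name, (lo, hi) in intervals.items()}

for name, (lo, hi) in intervals.items():
    print(f"{name}: ({lo:.3f}, {hi:.3f}) contains zero: {contains_zero[name]}")
```

Only the newspaper interval straddles zero, which is the interval-based counterpart of its large $p$-value.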
  • How accurately can we predict future sales?
    • The response can be predicted using
      • $\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2 + \dots +\hat{\beta_p}X_p$
    • The accuracy associated with this estimate depends on whether we wish to predict
      • an individual response, $Y = f(X) + \epsilon $, or
        • use a prediction interval
      • the average response, $f(X)$
        • confidence interval
    • Prediction intervals will always be wider than confidence intervals because they account for the uncertainty associated with $\epsilon$, the irreducible error
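The difference between the two kinds of interval can be made concrete for simple linear regression. The sketch below assumes the standard textbook formulas (not stated in this lesson): the standard error of the average response at $x_0$ is $RSE\sqrt{1/n + (x_0-\bar{x})^2/\sum_i (x_i-\bar{x})^2}$, and the prediction version adds $1$ under the square root for the irreducible error. The data are synthetic and the multiplier 2 is the same rough 95% rule used above.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 60
x = rng.uniform(0, 10, n)
y = 5 + 1.5 * x + rng.normal(0, 2, n)  # synthetic data, not the Advertising data

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
slope = np.sum((x - x_bar) * (y - y.mean())) / sxx
intercept = y.mean() - slope * x_bar
rse = np.sqrt(np.sum((y - intercept - slope * x) ** 2) / (n - 2))

def intervals_at(x0, multiplier=2.0):
    """Approximate 95% confidence and prediction intervals at x0 (simple regression)."""
    y0 = intercept + slope * x0
    se_mean = rse * np.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)      # uncertainty in f(x0)
    se_pred = rse * np.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)  # adds the irreducible error
    ci = (y0 - multiplier * se_mean, y0 + multiplier * se_mean)
    pi = (y0 - multiplier * se_pred, y0 + multiplier * se_pred)
    return ci, pi

ci, pi = intervals_at(4.0)
print(ci, pi)
```

Both intervals are centred on the same fitted value; the prediction interval is wider because even a perfectly known $f(X)$ would leave the noise $\epsilon$ unexplained.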
  • Is the relationship linear?
    • residual plots can be used in order to identify non-linearity
    • If the relationships are linear, then the residual plots should display no pattern.
    • transformations of the predictors can be included in the linear regression model in order to accommodate non-linear relationships
  • Is there synergy among the advertising media?
    • The standard linear regression model assumes an additive relationship between the predictors and the response.
    • An additive model is easy to interpret because the association between each predictor and the response is unrelated to the values of the other predictors.
    • However, the additive assumption may be unrealistic for certain data sets.
    • Interaction term can be included in the regression model in order to accommodate non-additive relationships.
    • A small p-value associated with the interaction term indicates the presence of such relationships.
    • Figure 3.5 suggested that the Advertising data may not be additive. Including an interaction term in the model results in a substantial increase in $R^2$, from around 90% to almost 97 %.