Linear Regression


This lesson is from An Introduction to Statistical Learning


  • simple approach for supervised learning
  • predicts a quantitative response
  • Example
    • a marketing plan for next year that will result in high product sales using advertising data
      • Is there a relationship between advertising budget and sales?
        • no money should be spent on advertising if no relationship
      • How strong is the relationship between advertising budget and sales?
        • information about product sales based on knowledge of the advertising budget
      • Which media are associated with sales?
        • Are all three media (TV, radio, and newspaper) associated with sales, or are just one or two of the media associated?
          • We must find a way to separate out the individual contribution of each medium to sales when we have spent money on all three media.
      • How large is the association between each medium and sales?
        • For every dollar spent on advertising in a particular medium, by what amount will sales increase?
        • How accurately can we predict this amount of increase?
      • How accurately can we predict future sales?
        • For any given level of television, radio, or newspaper advertising, what is our prediction for sales, and what is the accuracy of this prediction?
      • Is the relationship linear?
        • If there is approximately a straight-line relationship between advertising expenditure in the various media and sales, then linear regression is an appropriate tool.
        • If not, then it may still be possible to transform the predictor or the response so that linear regression can be used.
      • Is there synergy among the advertising media?
        • Perhaps spending $50,000$ on television advertising and $50,000$ on radio advertising is associated with higher sales than allocating $100,000$ to either television or radio individually.
        • In marketing, this is known as a synergy effect, while in statistics it is called an interaction effect.
    • It turns out that linear regression can be used to answer each of these questions.

Simple Linear Regression

  • predicting a quantitative response Y on the basis of a single predictor variable X
    • $ Y \approx \beta_0 + \beta_1X $
    • $Sales \approx \beta_0 + \beta_1 \times TV $
      • $\beta_0, \beta_1$ represent intercept and slope aka model coefficients or parameters
    • We estimate $\hat{\beta_0}, \hat{\beta_1}$ from training data; future sales can then be predicted using $\hat{y} = \hat{\beta_0} + \hat{\beta_1}x$, where $\hat{y}$ is the prediction of $Y$ when $X = x$

Estimating the Coefficients

  • We may use training data to estimate the coefficients

    • $y_i \approx \hat{\beta_0} + \hat{\beta_1}x_i$ for $i=1,2,…,n$
  • The objective is to find an intercept and a slope such that the resulting line is as close as possible to the $n = 200$ data points

  • Least Squares Criterion

    • Prediction is $\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i$
    • $e_i = y_i - \hat{y_i}$ is the $i^{th}$ residual
    • Residual Sum of Squares = $RSS = e_1^2+e_2^2+…+e_n^2$
    • $RSS = (y_1-\hat{\beta_0}-\hat{\beta_1}x_1)^2 + (y_2-\hat{\beta_0}-\hat{\beta_1}x_2)^2+…+(y_n-\hat{\beta_0}-\hat{\beta_1}x_n)^2$
    • By Calculus
      • $\hat{\beta_1} = \frac{\sum\limits_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum\limits_{i=1}^{n}(x_i - \bar{x})^2}$
      • $\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}$
        • $\bar{x}$ and $\bar{y}$ are sample means
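  • With Sales measured in thousands of units and TV budget in thousands of dollars, $\hat{\beta_0} = 7.03$ and $\hat{\beta_1} = 0.0475$
  • An additional $1000$ dollars spent on TV advertising is associated with selling approximately $47.5$ additional units of product

    • Since $x$ is in thousands of dollars, $\hat{y}(x+1) = 7.03 + 0.0475 (x + 1) = (7.03 + 0.0475x) + 0.0475$, and $0.0475$ thousand units is $47.5$ units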
  • RSS plot for different values of $\beta_0, \beta_1$ , red dot where RSS is minimum
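The closed-form estimates above can be checked numerically. The following is a minimal sketch in plain Python; the helper names `fit_simple_ols` and `rss` and the five-point data set are illustrative assumptions, not the Advertising data.

```python
from statistics import fmean

def fit_simple_ols(x, y):
    """Return (beta0_hat, beta1_hat) that minimize RSS for y ~ beta0 + beta1 * x."""
    x_bar, y_bar = fmean(x), fmean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

def rss(x, y, beta0, beta1):
    """Residual sum of squares of the line beta0 + beta1 * x."""
    return sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))

# Hypothetical toy data (not the Advertising data set).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0_hat, beta1_hat = fit_simple_ols(x, y)
print(f"intercept = {beta0_hat:.3f}, slope = {beta1_hat:.3f}")

# Moving either coefficient away from the estimate should increase RSS.
best = rss(x, y, beta0_hat, beta1_hat)
print(best < rss(x, y, beta0_hat + 0.1, beta1_hat))
print(best < rss(x, y, beta0_hat, beta1_hat - 0.1))
```

Perturbing the intercept or slope in either direction gives a larger RSS, which is what the red dot in the RSS plot represents.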

Assessing the Accuracy of the Coefficient Estimates

  • True relationship
    • $Y = \beta_0 + \beta_1 X + \epsilon$
      • $\epsilon$ is mean-zero random error
    • $\beta_0$ intercept
      • expected value of $Y$ when $X=0$
    • $\beta_1$ slope
      • average increase in $Y$ associated with a one unit increase in X
  • Error $\epsilon$ may be due to other variables or measurement error
    • independent of $X$
Scatter plot of 100 random points from $Y = 2 + 3X + \epsilon$, where $\epsilon$ is generated from a normal distribution with mean $0$
  • Red line: $Y = 2 + 3X$, the true relationship (population regression line)

  • Blue: least squares line

  • Light blue: ten least squares lines, each computed from a different random set of observations

  • Why is there a difference between the population regression line and the least squares line?

    • Standard statistical approach of using the information from a sample to estimate characteristics of a large population
  • if we estimate $\hat{\beta_0}$ and $\hat{\beta_1}$ on the basis of a particular data set, then our estimates won’t be exactly equal to $\beta_0$ and $\beta_1$.

    • But if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would be spot on!
    • In fact, we can see from the right hand panel that the average of many least squares lines, each estimated from a separate data set, is pretty close to the true population regression line.
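The averaging argument can be illustrated with a small simulation. This is a sketch under stated assumptions: the noise standard deviation (`sigma=2.0`), the range of $X$ (uniform on $[-2, 2]$), the number of repetitions and the helper names are all choices made here for illustration; only the true line $Y = 2 + 3X$ and the sample size of 100 come from the notes.

```python
import random
from statistics import fmean

def fit_simple_ols(x, y):
    """Least squares intercept and slope for y ~ beta0 + beta1 * x."""
    x_bar, y_bar = fmean(x), fmean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = s_xy / s_xx
    return y_bar - beta1_hat * x_bar, beta1_hat

def simulate_fit(rng, n=100, beta0=2.0, beta1=3.0, sigma=2.0):
    """Draw one data set from Y = beta0 + beta1 * X + eps and fit a line to it."""
    x = [rng.uniform(-2.0, 2.0) for _ in range(n)]           # assumed X range
    y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
    return fit_simple_ols(x, y)

rng = random.Random(0)  # fixed seed so the run is reproducible
fits = [simulate_fit(rng) for _ in range(1000)]

b0_values = [b0 for b0, _ in fits]
b1_values = [b1 for _, b1 in fits]
avg_b0, avg_b1 = fmean(b0_values), fmean(b1_values)

print(f"single fit:       intercept = {fits[0][0]:.3f}, slope = {fits[0][1]:.3f}")
print(f"average of fits:  intercept = {avg_b0:.3f}, slope = {avg_b1:.3f}")
print(f"range of slopes:  {min(b1_values):.3f} to {max(b1_values):.3f}")
```

Individual fits scatter around the true coefficients, while their average should land close to $2$ and $3$; this is what is meant by the least squares estimates being unbiased.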
  • Whenever we perform linear regression, we want to know if there is a statistically significant relationship between the predictor variable and the response variable.

  • Statistics

    • Estimate population mean $\mu$ of a random variable $Y$ using sample mean $\bar{y}$
      • we can say, $\hat{\mu} = \bar{y}$
      • Average of $\hat{\mu}$ from huge number of datasets will give us $\mu$
      • One dataset may over-estimate or under-estimate
      • How far will that single estimate $\hat{\mu}$ be?
        • Standard Error of $\hat{\mu}$= $SE(\hat{\mu})$
        • $Var(\hat{\mu}) = [SE(\hat{\mu})]^2 = \frac{\sigma^2}{n}$
          • $\sigma$ is the standard deviation of each of the realizations $y_i$ of $Y$
        • It will decrease as $n$ gets larger
  • Standard errors of $\hat{\beta_0}$ and $\hat{\beta_1}$

    • $[SE(\hat{\beta_0})]^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum\limits_{i=1}^{n}(x_i - \bar{x})^2}\right]$
    • $[SE(\hat{\beta_1})]^2 = \frac{\sigma^2}{\sum\limits_{i=1}^{n}(x_i - \bar{x})^2}$
    • where $\sigma^2 = Var(\epsilon)$
  • $SE(\hat{\beta_1})$ is smaller when the $x_i$ are more spread out

    • intuitively, we have more leverage to estimate a slope
  • $[SE(\hat{\beta_0})]^2 = \frac{\sigma^2}{n} = [SE(\hat{\mu})]^2$ when $\bar{x} = 0$ (in which case $\hat{\beta_0} = \bar{y}$)

  • In general, $\sigma^2$ is unknown but can be estimated using data. Estimate of $\sigma$ is known as residual standard error, $RSE = \sqrt{\frac{RSS}{n-2}}$
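The standard error formulas, with $\sigma$ replaced by the RSE, can be sketched as follows. The function name `ols_with_standard_errors`, the returned dictionary keys and the toy data are assumptions made for illustration. The second call stretches the same predictor values by a factor of two to show the effect of a wider spread in $x$.

```python
from math import sqrt
from statistics import fmean

def ols_with_standard_errors(x, y):
    """Fit y ~ beta0 + beta1 * x and estimate the standard errors of both coefficients."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    rse = sqrt(rss / (n - 2))                     # estimate of sigma
    return {
        "beta0": beta0,
        "beta1": beta1,
        "rss": rss,
        "rse": rse,
        "se_beta0": rse * sqrt(1.0 / n + x_bar ** 2 / s_xx),
        "se_beta1": rse / sqrt(s_xx),
    }

# Hypothetical toy data (not the Advertising data set).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

narrow = ols_with_standard_errors(x, y)
wide = ols_with_standard_errors([2.0 * xi for xi in x], y)  # same y, x twice as spread out

print(f"RSE = {narrow['rse']:.4f}")
print(f"SE(beta1), original x: {narrow['se_beta1']:.4f}")
print(f"SE(beta1), wider x:    {wide['se_beta1']:.4f}")
```

Because the residuals are unchanged while $\sum (x_i - \bar{x})^2$ is four times larger, the slope's standard error in the stretched case is half of the original.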

  • Standard Errors can be used to compute confidence intervals

    • Approximate 95% CI for $\beta_0$ and $\beta_1$
      • $\hat{\beta_0} \pm 2 \cdot SE(\hat{\beta_0})$
      • $\hat{\beta_1} \pm 2 \cdot SE(\hat{\beta_1})$
  • $\hat{\beta_0} = 7.03$ and $\hat{\beta_1} = 0.0475$

  • The 95% CIs are $[6.130, 7.935]$ for $\beta_0$ and $[0.042, 0.053]$ for $\beta_1$

  • Therefore, we can conclude that in the absence of any advertising, sales will, on average, fall somewhere between 6,130 and 7,935 units.

  • Furthermore, for each USD 1,000 increase in television advertising, there will be an average increase in sales of between 42 and 53 units.
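The interval arithmetic can be reproduced from the estimates and standard errors reported in the Advertising table of this lesson. This is a sketch: the function name `confidence_interval` is hypothetical, and the multiplier `1.972` is an assumption introduced here (an approximate 97.5% quantile of the $t$ distribution with $n - 2 = 198$ degrees of freedom), not a value stated in the notes.

```python
def confidence_interval(estimate, std_error, multiplier=2.0):
    """Return (lower, upper) for estimate +/- multiplier * std_error."""
    half_width = multiplier * std_error
    return estimate - half_width, estimate + half_width

# Estimates and standard errors as reported for the Advertising data.
beta0_hat, se_beta0 = 7.0325, 0.4578
beta1_hat, se_beta1 = 0.0475, 0.0027

# Rule-of-thumb interval with a multiplier of 2.
rough_b0 = confidence_interval(beta0_hat, se_beta0)
rough_b1 = confidence_interval(beta1_hat, se_beta1)

# Assumed t quantile for 198 degrees of freedom (not given in the notes).
T_QUANTILE = 1.972
exact_b0 = confidence_interval(beta0_hat, se_beta0, T_QUANTILE)
exact_b1 = confidence_interval(beta1_hat, se_beta1, T_QUANTILE)

print("multiplier 2:    ", [round(v, 3) for v in rough_b0], [round(v, 3) for v in rough_b1])
print("multiplier 1.972:", [round(v, 3) for v in exact_b0], [round(v, 3) for v in exact_b1])
```

With the multiplier of 2 the intervals come out slightly wider than the reported ones; with the assumed quantile they round to the reported $[6.130, 7.935]$ and $[0.042, 0.053]$.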

  • Standard errors can also be used to perform hypothesis tests on the coefficients.

  • The most common hypothesis test involves testing the null hypothesis $H_0$ against the alternative hypothesis $H_a$

    • $H_0$ : There is no relationship between $X$ and $Y$
    • $H_a$ : There is some relationship between $X$ and $Y$
  • Mathematically,

    • $H_0:~\beta_1 = 0 $
    • $H_a:~\beta_1 \ne 0 $
  • $\beta_1 = 0 \implies Y = \beta_0 + \epsilon$

    • $X$ is not associated with $Y$
  • We need to determine whether $\hat{\beta_1}$ is sufficiently far from $0$ so that we can be confident that $\beta_1$ is non-zero

  • How far is far - depends on $SE(\hat{\beta_1})$

  • If $SE(\hat{\beta_1})$ is small, then even relatively small values of $\hat{\beta_1}$ may provide strong evidence that $\beta_1 \ne 0$

  • If $SE(\hat{\beta_1})$ is large, then $\hat{\beta_1}$ must be large to reject null hypothesis

  • In practice, we compute t-statistic

    • \[t = \frac{\hat{\beta_1}-0}{SE(\hat{\beta_1})} \label{eqn:t}\]
    • measures the number of standard deviations that $\hat{\beta_1}$ is away from $0$
  • If there is no relationship between $X$ and $Y$, then Equation $\ref{eqn:t}$ will have $t$ distribution with $n-2$ degrees of freedom.

  • The t-distribution has a bell shape and for values of n greater than approximately 30 it is quite similar to the standard normal distribution.

  • Consequently, it is a simple matter to compute the probability of observing any number equal to $\mid t \mid$ or larger in absolute value, assuming $\beta_1 = 0$.

  • We call this probability the $p$-value.

    • a small $p$-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response.
    • Small p-value
      • reject null
      • we can infer that there is an association between the predictor and the response
    • We reject the null hypothesis—that is, we declare a relationship to exist between X and Y —if the p-value is small enough.
    • Typical p-value cutoffs for rejecting the null hypothesis are 5% or 1%
    • When n = 30, these correspond to t-statistics of around 2 and 2.75, respectively.
  • Advertising Data

    | | Coefficient | Std. Error | t-statistic | p-value |
    |---|---|---|---|---|
    | Intercept | 7.0325 | 0.4578 | 15.36 | < 0.0001 |
    | TV | 0.0475 | 0.0027 | 17.67 | < 0.0001 |
  • $Sales \approx \beta_0 + \beta_1 \times TV = 7.0325 + 0.0475 \times TV $
  • An increase of USD 1,000 in the TV advertising budget is associated with an increase in sales by around $50$ units.
  • The estimates $\hat{\beta_0}$ and $\hat{\beta_1}$ are very large relative to their standard errors
    • t-statistics are also large
    • the probabilities of seeing such values if $H_0$ is true are virtually zero.
    • the $p$-values for $\beta_0$ and $\beta_1$ are small
      • Hence we can conclude that $\beta_0 \ne 0$ and $\beta_1 \ne 0$
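The t-statistics in the table can be approximately recomputed from the reported coefficients and standard errors. This is a sketch with two assumptions: the function names are hypothetical, and the $p$-value uses the standard normal approximation to the $t$ distribution (reasonable here because $n = 200$ is well above 30) rather than the exact $t$ distribution with $n - 2$ degrees of freedom.

```python
from math import erfc, sqrt

def t_statistic(estimate, std_error, null_value=0.0):
    """Number of standard errors between the estimate and the null value."""
    return (estimate - null_value) / std_error

def two_sided_p_value_normal(t):
    """P(|Z| >= |t|) for a standard normal Z, used as an approximation for large n."""
    return erfc(abs(t) / sqrt(2.0))

# Coefficients and standard errors as reported for the Advertising data.
t_intercept = t_statistic(7.0325, 0.4578)
t_tv = t_statistic(0.0475, 0.0027)

# The TV value differs slightly from the reported 17.67 because the
# table entries used here are already rounded.
print(f"t (intercept) = {t_intercept:.2f}")
print(f"t (TV)        = {t_tv:.2f}")

p_intercept = two_sided_p_value_normal(t_intercept)
p_tv = two_sided_p_value_normal(t_tv)
print(p_intercept < 0.0001, p_tv < 0.0001)

# Sanity check of the approximation: |t| = 1.96 corresponds to p close to 5%.
p_at_196 = two_sided_p_value_normal(1.96)
print(f"p at |t| = 1.96: {p_at_196:.4f}")
```

Both approximate $p$-values fall far below the 1% cutoff, consistent with the `< 0.0001` entries in the table.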

Assessing the Accuracy of the Model

  • Once we have rejected the null hypothesis $H_0$ (no relationship between $X$ and $Y$), the next question is
    • the extent to which the model fits the data
  • Quality of Linear Regression
    • Residual Standard Error (RSE)
    • $R^2$ statistic

Residual Standard Error (RSE)

  • Since the true model is $Y = \beta_0 + \beta_1 X + \epsilon$, due to presence of error term, we will not be able to accurately predict even if $\beta_0$ and $\beta_1$ are known.

  • RSE is an estimate of the standard deviation of $\epsilon$

  • Roughly speaking, it is the average amount that the response will deviate from the true regression line.

  • $RSE = \sqrt{\frac{RSS}{n-2}} = \sqrt{\frac{\sum\limits_{i=1}^{n}(y_i-\hat{y_i})^2}{n-2}} $

    • where $RSS$ is residual sum of squares
  • Advertising Data

    | Quantity | Value |
    |---|---|
    | Residual Standard Error | 3.26 |
    • RSE is $3.26$
      • actual sales in each market deviate from the true regression line by approximately 3,260 units, on average.
      • In the advertising data set, the mean value of sales over all markets is approximately $14,000$ units, and so the percentage error is $3,260/14,000 = 23\%$.
    • RSE is considered a measure of the lack of fit

$R^2$ Statistic

  • RSE, a measure of lack of fit, is measured in the units of $Y$, so it is not always clear what constitutes a good RSE.

  • $R^2$ takes the form of a proportion (the proportion of variance explained), so it always lies in $[0,1]$ and is independent of the scale of $Y$

  • $R^2 = \frac{TSS-RSS}{TSS} = 1 - \frac{RSS}{TSS}$

    • TSS is total sum of squares = $\sum (y_i-\bar{y})^2$
    • RSS is residual sum of squares = $\sum (y_i - \hat{y_i})^2$
  • TSS

    • measures the total variance in the response $Y$, and
    • can be thought of as the amount of variability inherent in the response before the regression is performed.
  • RSS

    • measures the amount of variability that is left unexplained after performing the regression.
  • $TSS - RSS$

    • measures the amount of variability in the response that is explained (or removed) by performing the regression
    • $R^2$ measures the proportion of variability in Y that can be explained using X.
    • An $R^2$ statistic that is close to $1$ indicates that a large proportion of the variability in the response is explained by the regression.
    • A number near $0$ indicates that the regression does not explain much of the variability in the response.
      • this might occur because the linear model is wrong, or the error variance $\sigma^2$ is high, or both.
  • Example - Advertising data

    | Quantity | Value |
    |---|---|
    | Residual Standard Error | 3.26 |
    | $R^2$ | 0.612 |

    • $R^2 = 0.612$ means that just under two-thirds of the variability in Sales is explained by a linear regression on TV

  • $R^2$ is easier to interpret than $RSE$ because it always lies between $0$ and $1$. However, it can still be challenging to determine what a good $R^2$ value is; this depends on the application.

    • In some settings an $R^2$ close to $1$ is expected, while in others a value below $0.1$ is realistic, because the linear model is only a rough approximation and the residual errors due to unmeasured factors are large
  • Like $R^2$, the correlation is a measure of the linear relationship between $X$ and $Y$

    • $r = Cor(X,Y) = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2} \sqrt{\sum(y_i - \bar{y})^2}}$
      • strictly this is the sample correlation $\widehat{Cor}(X,Y)$; the hat is omitted for ease of notation
  • Can we use $r$ instead of $R^2$?

    • In simple linear regression, we can show that $r^2 = R^2$
  • Advantage of $R^2$

    • Multiple Linear Regression where many predictors simultaneously predict the response
    • $Cor$ quantifies the association between a single pair of variables rather than between a larger number of variables
    • $R^2$ works well with multiple linear regression
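The identity $r^2 = R^2$ for simple linear regression can be checked numerically. The sketch below uses a hypothetical helper `r_squared_and_correlation` and made-up toy data; it computes $R^2$ from $TSS$ and $RSS$ and, separately, the sample correlation $r$.

```python
from math import sqrt
from statistics import fmean

def r_squared_and_correlation(x, y):
    """Return (R^2, r) for the least squares fit of y on a single predictor x."""
    x_bar, y_bar = fmean(x), fmean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)          # this is TSS
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))

    r_squared = 1.0 - rss / s_yy                        # 1 - RSS / TSS
    r = s_xy / (sqrt(s_xx) * sqrt(s_yy))                # sample correlation
    return r_squared, r

# Hypothetical toy data (not the Advertising data set).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r_squared, r = r_squared_and_correlation(x, y)
print(f"R^2 = {r_squared:.6f}")
print(f"r   = {r:.6f}, r^2 = {r * r:.6f}")
```

The two quantities agree up to floating-point error. With more than one predictor there is no single $r$ to square, which is why $R^2$ is the statistic that carries over to multiple linear regression.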