Linear Regression



This lesson covers Linear Regression.


  • An Introduction to Statistical Learning by Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani; Springer


  • a very simple approach for supervised learning
  • a useful tool for predicting a quantitative response
  • a useful and widely used statistical learning method that also serves as a good jumping-off point for newer approaches
  • Is there a relationship between advertising budget and sales?
    • If the evidence is weak, then one might argue that no money should be spent on advertising!
  • How strong is the relationship between advertising budget and sales?
    • Assuming that a relationship exists, we would like to know how strong it is.
    • Given a certain advertising budget, can we predict sales with a high level of accuracy? This would be a strong relationship.
    • Or is a prediction of sales based on advertising expenditure only slightly better than a random guess? This would be a weak relationship.
  • Which media contribute to sales?
    • Do all three media (TV, radio, and newspaper) contribute to sales, or do just one or two of the media contribute?
    • To answer this question
      • we must find a way to separate out the individual effects of each medium when we have spent money on all three media.
  • How accurately can we estimate the effect of each medium on sales?
    • For every dollar spent on advertising in a particular medium, by what amount will sales increase? How accurately can we predict this amount of increase?
  • How accurately can we predict future sales?
    • For any given level of television, radio, or newspaper advertising, what is our prediction for sales, and what is the accuracy of this prediction?
  • Is the relationship linear?
    • If there is approximately a straight-line relationship between advertising expenditure in the various media and sales, then linear regression is an appropriate tool.
    • If not, then it may still be possible to transform the predictor or the response so that linear regression can be used.
  • Is there synergy among the advertising media?
    • Perhaps spending $50,000 on television advertising and $50,000 on radio advertising results in more sales than allocating $100,000 to either television or radio individually.
    • In marketing, this is known as a synergy effect, while in statistics it is called an interaction effect.

It turns out that linear regression can be used to answer each of these questions.

Simple Linear Regression

  • predicting a quantitative response $Y$ on the basis of a single predictor variable $X$.
  • It assumes that there is approximately a linear relationship between $X$ and $Y$.
    • $Y \approx \beta_0 + \beta_1 X$
      • read "$\approx$" as "is approximately modelled as"
      • we say that we are regressing $Y$ onto $X$
      • example: $\text{sales} \approx \beta_0 + \beta_1 \times \text{TV}$
    • $\beta_0$ and $\beta_1$ are the intercept and slope of the linear model
    • together, $\beta_0$ and $\beta_1$ are known as the model parameters or coefficients
      • We use training data to estimate the model parameters
    • Once the parameters have been estimated, future sales can be predicted
      • $\hat y = \hat\beta_0 + \hat\beta_1 x$

Estimating the Coefficients

  • Training Data
    • $(x_1, y_1),~(x_2, y_2),~…,(x_n, y_n)$​​​​
  • Objective is to estimate $\beta_0$ and $\beta_1$ so that the line fits the training data well
    • $y_i \approx \hat\beta_0 + \hat\beta_1 x_i$ for all $i = 1, 2, \dots, n$
  • The most common approach is to minimise the least squares criterion
Source: An Introduction to Statistical Learning, Springer
  • Residual: $e_i = y_i - \hat y_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$
  • Residual sum of squares: $RSS = e_1^2 + e_2^2 + \dots + e_n^2$
  • $RSS = (y_1 - \hat\beta_0 - \hat\beta_1 x_1)^2 + (y_2 - \hat\beta_0 - \hat\beta_1 x_2)^2 + \dots + (y_n - \hat\beta_0 - \hat\beta_1 x_n)^2$
  • Using calculus, the minimisers are:
    • $\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$
    • $\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x}$
    • where $\bar{x}$ and $\bar{y}$ are the sample means
  • Regressing sales on TV advertising gives $\hat\beta_0 = 7.03$ and $\hat\beta_1 = 0.0475$
    • $\hat y = 7.03 + 0.0475\,x$
    • each additional $1 of TV advertising is associated with approximately 0.0475 additional units of sales
    • equivalently, each additional $1,000 of TV advertising is associated with approximately 47.5 additional units of sales
$RSS$ for different values of $\beta_0$ and $\beta_1$; the red dot marks the least squares estimates $\hat\beta_0 = 7.03$, $\hat\beta_1 = 0.0475$
Source: An Introduction to Statistical Learning, Springer
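The closed-form minimisers above can be sketched in a few lines of Python. This is only a minimal illustration: the function names `fit_simple_ols` and `rss` are hypothetical, and the data are synthetic (they are not the Advertising data), generated from an assumed line with intercept 7 and slope 0.05.

```python
import random
from statistics import mean


def fit_simple_ols(xs, ys):
    """Return (beta0_hat, beta1_hat) minimising the residual sum of squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1_hat = sxy / sxx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


def rss(xs, ys, beta0, beta1):
    """Residual sum of squares for the line y = beta0 + beta1 * x."""
    return sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))


# Synthetic data (an assumption for illustration, not the Advertising data).
rng = random.Random(0)
xs = [rng.uniform(0, 300) for _ in range(200)]
ys = [7.0 + 0.05 * x + rng.gauss(0, 3) for x in xs]

beta0_hat, beta1_hat = fit_simple_ols(xs, ys)
print(f"beta0_hat = {beta0_hat:.3f}, beta1_hat = {beta1_hat:.4f}")

# Moving away from the fitted coefficients should never lower the RSS.
best = rss(xs, ys, beta0_hat, beta1_hat)
print(best <= rss(xs, ys, beta0_hat + 0.5, beta1_hat))
print(best <= rss(xs, ys, beta0_hat, beta1_hat + 0.01))
```

Comparing the RSS at the fitted coefficients with the RSS at nearby values is a small numerical counterpart to the RSS surface described in the figure caption.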

Assessing the Accuracy of the Coefficient Estimates

  • True relationship between $X$ and $Y$:
\[\begin{align*} Y &= f(X) + \epsilon \\ &= \beta_0 + \beta_1X + \epsilon \end{align*}\]
  • $\beta_0$ is intercept term, expected value of $Y$ when $X=0$
  • $\beta_1$ is slope, average increase in $Y$ associated with unit increase in $X$
  • $\epsilon$ is a mean-zero random error term. It is a catch-all for what the simple model misses: the true relationship is probably not linear, there may be other variables that cause variation in $Y$, and there may be measurement error. We assume that $\epsilon$ is independent of $X$.
  • Population Regression Line
    • $Y = \beta_0 + \beta_1X + \epsilon$​
  • Least Squares Line
    • $\hat y = \hat\beta_0 + \hat\beta_1 x$
  • The population regression line is unobserved
Left panel: the red line is the population regression line $Y = 2 + 3X$; the blue line is the least squares line, i.e. the least squares estimate of $f(X)$ based on observed data generated from $Y = 2 + 3X + \epsilon$, with $\epsilon$ drawn from a normal distribution with mean $0$.
Right panel: the red line is the population regression line and the blue line is the least squares line; the light blue lines are ten least squares lines, each computed from a separate random set of observations from $Y = 2 + 3X + \epsilon$.
The average of many least squares lines, each estimated from a separate data set, is pretty close to the true population regression line.

Source: An Introduction to Statistical Learning, Springer
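The claim that the average of many least squares lines sits close to the population regression line can be checked with a small simulation. The sketch below draws repeated data sets from $Y = 2 + 3X + \epsilon$; the noise standard deviation (2), the range of $X$ (from -2 to 2), the sample size (100) and the number of data sets (2000) are all assumptions chosen for illustration, and the helper names are hypothetical.

```python
import random
from statistics import mean


def fit_simple_ols(xs, ys):
    """Closed-form least squares estimates (beta0_hat, beta1_hat)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta1_hat = sxy / sxx
    return y_bar - beta1_hat * x_bar, beta1_hat


def simulate_fits(n_datasets, n_obs, beta0=2.0, beta1=3.0, sigma=2.0, seed=1):
    """Fit a least squares line to each of n_datasets independent samples."""
    rng = random.Random(seed)
    fits = []
    for _ in range(n_datasets):
        xs = [rng.uniform(-2.0, 2.0) for _ in range(n_obs)]
        ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
        fits.append(fit_simple_ols(xs, ys))
    return fits


fits = simulate_fits(n_datasets=2000, n_obs=100)
intercepts = [b0 for b0, _ in fits]
slopes = [b1 for _, b1 in fits]

# Individual fits scatter around the population values ...
print(f"slope range: {min(slopes):.2f} to {max(slopes):.2f}")
# ... but their averages land close to the true intercept 2 and slope 3.
print(f"mean intercept = {mean(intercepts):.3f}, mean slope = {mean(slopes):.3f}")
```

Each single fit plays the role of one light blue line; the averages play the role of the population line that they cluster around.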

  • We have only one data set, yet two different lines (the population regression line and the least squares line) describe the relationship between the predictor and the response

  • This distinction is a natural extension of the standard statistical approach of using information from a sample to estimate characteristics of a large population

  • Example

    • Suppose that we are interested in knowing the population mean $ \mu $​​​​​​ of some random variable $Y$​​​​​ . Unfortunately, $\mu$​​​​ is unknown, but we do have access to $n$​​​ observations from $Y$​​, which we can write as $y_1, . . . , y_n$​, and which we can use to estimate $\mu$.
    • A reasonable estimate is $\hat\mu = \bar y$​​, where $\bar y = \frac{\sum_i y_i}{n}$​​ is the sample mean.
    • The sample mean and the population mean are different, but in general the sample mean will provide a good estimate of the population mean.
  • In the same way, the unknown coefficients $ \beta_0 $​​​ and $ \beta_1 $​​ in linear regression define the population regression line.

  • We seek to estimate these unknown coefficients using least squares approach. These coefficient estimates define the least squares line.

  • The analogy between linear regression and estimation of the mean of a random variable is an apt one based on the concept of bias.

    • If we use the sample mean $\hat\mu$​ to estimate $ \mu $​, this estimate is unbiased, in the sense that on average, we expect $\hat\mu$​ to equal $ \mu $. What exactly does this mean?
    • It means that, on the basis of one particular set of observations $y_1, \dots, y_n$, $\hat\mu$ might overestimate $\mu$, and on the basis of another set of observations, $\hat\mu$ might underestimate $\mu$.
    • But if we could average a huge number of estimates of $\mu$ obtained from a huge number of sets of observations, then this average would exactly equal $\mu$. Hence, an unbiased estimator does not systematically over- or under-estimate the true parameter.
  • The property of unbiasedness also holds for the least squares coefficient estimates: if we estimate $\beta_0$ and $\beta_1$ on the basis of a particular data set, then our estimates won't be exactly equal to $\beta_0$ and $\beta_1$. But if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would be spot on!

  • We continue the analogy with the estimation of the population mean $\mu $​ of a random variable $Y$.

    • How accurate is the sample mean $ \hat\mu $​ as an estimate of $ \mu $?
      • We have established that the average of $\hat\mu$ over many data sets will be very close to $\mu$, but that a single estimate $\hat\mu$ may be a substantial underestimate or overestimate of $\mu$.
    • How far off will that single estimate $\hat\mu$ be?
      • In general, we answer this question by computing the standard error of $\hat\mu$, written as $SE(\hat\mu)$:
        • $SE(\hat\mu) = \frac{\sigma}{\sqrt{n}}$
        • $Var(\hat\mu) = SE(\hat\mu)^2 = \frac{\sigma^2}{n}$
      • where $\sigma$ is the standard deviation of each of the realisations $y_i$ of $Y$ (the formula assumes that the $n$ observations are uncorrelated).
      • Roughly speaking, the standard error tells us the average amount by which the estimate $\hat\mu$ differs from the actual value of $\mu$. The standard error shrinks as $n$ increases.
    • Standard Errors for $ \hat\beta_0 $​ and $ \hat\beta_1 $​
      • $ SE(\hat\beta_0)^2 = \sigma^2[\frac{1}{n} + \frac{\bar x^2}{\sum_i (x_i - \bar x)^2}] $​​​
        • $SE(\hat\beta_0)$​​​​ is same as $ SE(\hat\mu)$​​​​ when $\bar x = 0$​​​ (in which case $\hat\beta_0 $​​​ would be equal to $ \bar y $​ ​​)
      • $ SE(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_i (x_i - \bar x)^2} $​​
        • $SE$ is smaller when $x_i$ are more spread out
      • where $\sigma^2 = Var(\epsilon) $
    • In general, $ \sigma^2$ is not known, but can be estimated from the data
      • Estimated value of $\sigma $ is known as residual standard error (RSE):
        • $ RSE = \sqrt{\frac{RSS}{n-2}} $​​​, where $RSS$​​​ is residual sum of squares
    • Standard errors can be used to compute confidence intervals.
    • A $95\%$ confidence interval is defined as a range of values such that, with $95\%$ probability, the range will contain the true unknown value of the parameter.
    • The range is defined in terms of lower and upper limits computed from the sample of data. For linear regression, the $95\%$ confidence intervals for $\beta_1$ and $\beta_0$ approximately take the form:
      • $\hat\beta_1 \pm 2 \cdot SE(\hat\beta_1)$
      • $\hat\beta_0 \pm 2 \cdot SE(\hat\beta_0)$
  • For the advertising data we have $\hat\beta_0 = 7.03$ and $\hat\beta_1 = 0.0475$

    • The $95\%$ confidence interval for $\beta_0$ is $[6.130, 7.935]$
    • The $95\%$ confidence interval for $\beta_1$ is $[0.042, 0.053]$
    • Therefore, we can conclude that in the absence of any advertising, sales will, on average, fall somewhere between 6,130 and 7,940 units. Furthermore, for each $1,000 increase in television advertising, there will be an average increase in sales of between 42 and 53 units.
  • Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing the null hypothesis against the alternative hypothesis:

    • $ H_0 $ ​​​ : There is no relationship between $X$​ and $Y$​​
    • $H_a$​​​ : There is some relationship between $X$ and $Y$​
  • Mathematically, this corresponds to testing

    • $ H_0 : \beta_1 = 0 $​​
    • $ H_a : \beta_1 \ne 0 $​​​​
  • To test the null hypothesis, we need to determine whether $ \hat\beta_1 $, our estimate for $ \beta_1 $, is sufficiently far from zero that we can be confident that $\beta_1$ is non-zero. How far is far enough?

    • This of course depends on the accuracy of $\hat \beta_1$​ that is, it depends on $SE(\hat \beta_1)$​.

    • If $SE(\hat \beta_1)$​​​​ is small, then even relatively small values of $\hat \beta_1$​​​​ may provide strong evidence that $\beta_1 \ne 0$​​​, and hence that there is a relationship between $X$​ and $Y$.

    • In contrast, if $SE(\hat \beta_1)$​ is large, then $\hat\beta_1$ must be large in absolute value in order for us to reject the null hypothesis.

    • In practice, we compute a $t$-statistic, which measures the number of standard deviations that $\hat\beta_1$ is away from $0$.

      • $ t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)} $
    • If there really is no relationship between $X$​ and $Y$ , then we expect $t$​​​ will have a t-distribution with $n−2$​​ degrees of freedom.

    • The $t$​​-distribution has a bell shape and for values of $n$​ greater than approximately $30$ it is quite similar to the normal distribution.

    • Consequently, it is a simple matter to compute the probability of observing any number equal to $\mid t \mid$ or larger in absolute value, assuming $\beta_1 = 0$. We call this probability the $p$-value.

    • Interpreting the $p$-value

      • A small $p$-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response.

      • Hence, if we see a small $p$-value, we infer that there is an association between the predictor and the response, and we reject the null hypothesis.

      • Typical $p$-value cutoffs for rejecting the null hypothesis are 5% or 1%. When $n = 30$, these correspond to $t$-statistics of around 2 and 2.75, respectively.

    • Least squares results for the regression of sales on TV advertising:

      |           | Coefficient | Std. error | $t$-statistic | $p$-value |
      |-----------|-------------|------------|---------------|-----------|
      | Intercept | 7.0325      | 0.4578     | 15.36         | < 0.0001  |
      | TV        | 0.0475      | 0.0027     | 17.67         | < 0.0001  |

      • An increase of $1,000 in the TV advertising budget is associated with an increase in sales of approximately 47.5 units (roughly 50).

      • $\hat\beta_0$ and $\hat\beta_1$ are very large relative to their standard errors, so the $t$-statistics are large and the $p$-values are approximately $0$.

      • We conclude $\beta_0 \ne 0$​​​ and $ \beta_1 \ne 0 $​​​
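The standard errors, approximate 95% confidence intervals, $t$-statistics and $p$-values discussed in this section can be computed together, as in the sketch below. Two assumptions are worth flagging: the data are synthetic rather than the Advertising data, and the $p$-value uses a normal approximation to the $t$-distribution (reasonable only when $n$ is well above 30, as noted above) so that the example needs nothing beyond the Python standard library. The function name `simple_ols_inference` is hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def simple_ols_inference(xs, ys):
    """Estimates, standard errors, ~95% CIs, t-statistics and p-values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar

    # Residual standard error: the estimate of sigma with n - 2 degrees of freedom.
    rss = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))
    rse = math.sqrt(rss / (n - 2))

    se_beta0 = rse * math.sqrt(1.0 / n + x_bar ** 2 / sxx)
    se_beta1 = rse / math.sqrt(sxx)

    results = {}
    for name, est, se in (("beta0", beta0, se_beta0), ("beta1", beta1, se_beta1)):
        t = (est - 0.0) / se
        # Two-sided p-value under H0: coefficient = 0.
        # Assumption: normal approximation to the t-distribution (large n).
        p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
        results[name] = {
            "estimate": est,
            "se": se,
            "ci95": (est - 2.0 * se, est + 2.0 * se),
            "t": t,
            "p_value": p_value,
        }
    return results


# Synthetic data (an assumption for illustration, not the Advertising data).
rng = random.Random(42)
xs = [rng.uniform(0, 300) for _ in range(200)]
ys = [7.0 + 0.05 * x + rng.gauss(0, 3) for x in xs]

results = simple_ols_inference(xs, ys)
for name, r in results.items():
    low, high = r["ci95"]
    print(f"{name}: est={r['estimate']:.4f} se={r['se']:.4f} "
          f"ci95=({low:.4f}, {high:.4f}) t={r['t']:.2f} p={r['p_value']:.2g}")
```

For small samples, the normal approximation should be replaced by the exact $t$-distribution with $n-2$ degrees of freedom, which statistical packages provide.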

Assessing the Accuracy of the Model

  • Once we have rejected the null hypothesis in favor of the alternative hypothesis, it is natural to want to quantify the extent to which the model fits the data. The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error ($RSE$) and the $R^2$​ statistic.

    Residual standard error: 3.26
  • Residual Standard Error
    • Because of the error term $\epsilon$ in $Y = \beta_0 + \beta_1X + \epsilon$, we would not be able to perfectly predict $Y$ from $X$ even if we knew the true regression line
    • $RSE$ is an estimate of the standard deviation of $\epsilon$
    • Roughly speaking, it is the average amount that the response will deviate from the true regression line
      • $RSE = \sqrt{\frac{RSS}{n-2}} = \sqrt{\frac{\sum_i (y_i - \hat y_i)^2}{n-2}}$
      • For the advertising data, $RSE = 3.26$
    • In other words, actual sales in each market deviate from the true regression line by approximately 3,260 units, on average.
    • Another way to think about this is that even if the model were correct and the true values of the unknown coefficients $\beta_0$​ and $\beta_1$ were known exactly, any prediction of sales on the basis of TV advertising would still be off by about 3,260 units on average.
    • Of course, whether or not 3,260 units is an acceptable prediction error depends on the problem context. In the advertising data set, the mean value of sales over all markets is approximately 14,000 units, and so the percentage error is 3,260/14,000 = 23%.
    • The $RSE$ is considered a measure of the lack of fit of the model to the data. If the predictions obtained using the model are very close to the true outcome values (that is, if $\hat y_i \approx y_i$ for $i = 1, \dots, n$), then $RSE$ will be small, and we can conclude that the model fits the data very well.
    • On the other hand, if $\hat y_i$​ is very far from $y_i$ for one or more observations, then the RSE may be quite large, indicating that the model doesn’t fit the data well.
  • $R^2$ Statistic
    • $RSE$ is measured in the units of $Y$, so it is not always clear what constitutes a good $RSE$
    • The $R^2$ statistic instead takes the form of a proportion (the proportion of variance explained), so it always takes a value between $0$ and $1$ and is independent of the scale of $Y$
    • $ R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS} $
    • where $TSS = \sum_i (y_i - \bar y)^2 $​​ = Total sum of squares
    • where $RSS = \sum_i (y_i - \hat y_i)^2 $ = Residual Sum of Squares
  • $TSS$​​​ measures the total variance in the response $Y$​​
    • amount of variability inherent in the response before the regression is performed.
  • $RSS$ measures the amount of variability that is left unexplained after performing the regression.
  • $TSS−RSS$​​​ measures the amount of variability in the response that is explained (or removed) by performing the regression, and $R^2$​​ measures the proportion of variability in $Y$​ that can be explained using $X$.
  • An $R^2$​ statistic that is close to $1$ indicates that a large proportion of the variability in the response has been explained by the regression.
  • A number near $0$​​ indicates that the regression did not explain much of the variability in the response; this might occur because the linear model is wrong, or the inherent error $\sigma^2 $​ is high, or both.
  • For the advertising data, $R^2 = 0.612$, so just under two-thirds of the variability in sales is explained by a linear regression on TV.
  • Interpreting the $R^2$ statistic
    • The $R^2$​​​​ statistic has an interpretational advantage over the $RSE$​​​ (residual standard error), since unlike the RSE, it always lies between $0$ and $1$​.
    • However, it can still be challenging to determine what is a good $R^2$ value, and in general, this will depend on the application.
    • For instance, in certain problems in physics, we may know that the data truly comes from a linear model with a small residual error. In this case, we would expect to see an $R^2$​​ value that is extremely close to $1$​, and a substantially smaller $R^2$ value might indicate a serious problem with the experiment in which the data were generated.
    • On the other hand, in typical applications in biology, psychology, marketing, and other domains, the linear model is at best an extremely rough approximation to the data, and residual errors due to other unmeasured factors are often very large. In this setting, we would expect only a very small proportion of the variance in the response to be explained by the predictor, and an $R^2$​​​ value well below $0.1$​​ might be more realistic!
    • The $R^2$​ statistic is a measure of the linear relationship between $X$ and $ Y $.
    • Correlation is also a measure of the linear relationship between $X$ and $Y $
      • $r = Cor(X,Y) = \frac{\sum_i (x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum(x_i-\bar x)^2} \sqrt{\sum(y_i-\bar y)^2}}$​​​​
    • This suggests that we might be able to use $r = Cor(X,Y)$ instead of $R^2$ to assess the fit of the linear model
    • In simple regression setting:
      • $R^2 = r^2$
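The quantities in this section ($RSE$, $R^2$ and the correlation $r$) can be computed side by side, which also makes it easy to check numerically that $R^2 = r^2$ in simple regression and that $R^2$, unlike $RSE$, does not change when $Y$ is rescaled. The sketch below uses synthetic data and a hypothetical helper name, `fit_quality`.

```python
import math
import random
from statistics import mean


def fit_quality(xs, ys):
    """Return RSE, R^2 and the sample correlation r for a simple regression."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    tss = sum((y - y_bar) ** 2 for y in ys)  # total sum of squares

    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))

    return {
        "rse": math.sqrt(rss / (n - 2)),
        "r_squared": 1.0 - rss / tss,
        "r": sxy / math.sqrt(sxx * tss),
    }


# Synthetic data (an assumption for illustration, not the Advertising data).
rng = random.Random(3)
xs = [rng.uniform(0, 300) for _ in range(200)]
ys = [7.0 + 0.05 * x + rng.gauss(0, 3) for x in xs]

q = fit_quality(xs, ys)
print(f"RSE = {q['rse']:.3f}, R^2 = {q['r_squared']:.3f}, r^2 = {q['r'] ** 2:.3f}")

# Rescaling Y (e.g. thousands of units -> units) changes RSE but not R^2.
q_scaled = fit_quality(xs, [1000.0 * y for y in ys])
print(f"scaled: RSE = {q_scaled['rse']:.1f}, R^2 = {q_scaled['r_squared']:.3f}")
```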

Multiple Linear Regression

  • Sales as a function of advertising on TV, radio, and newspaper
    • $Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \epsilon $​​​​
  • When we have $p$ distinct predictors:
    • $Y = \beta_0 + \beta_1X_1 + \beta_2X_2 +… + \beta_pX_p + \epsilon$
  • $X_i $​ represents $i^{th}$​ predictor and $\beta_i$ quantifies the association between the variable and response
  • $ \beta_i $​ is the average effect on $Y$​ of a one unit increase in $X_i$​, holding all other predictors fixed.
  • Example
    • $\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper} + \epsilon$

Estimating the Regression Coefficients

  • $ \hat y = \hat\beta_0 + \hat\beta_1x_1 + \hat\beta_2x_2 +…+ \hat\beta_px_p $​​
  • Residual sum of squares:
\[\begin{align} RSS &= \sum_i (y_i - \hat y_i)^2 \\ &= \sum_i (y_i - \hat\beta_0 - \hat\beta_1x_{i1} - \hat\beta_2x_{i2} - \dots - \hat\beta_px_{ip})^2 \end{align}\]
Two predictors and one response; least squares regression plane; plane is chosen to minimize sum of squared vertical distance
Source: An Introduction to Statistical Learning, Springer
  • Values $ \hat\beta_0, \hat\beta_1, …, \hat\beta_p $​ that minimize $RSS$ are the multiple least squares coefficients
    • these estimates have somewhat complicated forms that are most easily represented using matrix algebra; in practice, software is used to compute them
    • above fig shows plane when $p=2$
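In matrix notation, with a design matrix $X$ whose first column is all ones, the least squares coefficients solve the normal equations $X^TX\hat\beta = X^Ty$. The sketch below solves them with a small hand-written Gaussian elimination so that it runs on the standard library alone; real software typically uses more numerically stable factorisations. The function names, the synthetic predictors and the "true" coefficients are all assumptions for illustration.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the matrix is singular.
    """
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple_ols(rows, ys):
    """rows: one list of p predictor values per observation.

    Returns [beta0_hat, beta1_hat, ..., betap_hat].
    """
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic data with three predictors (assumed coefficients, not real data).
rng = random.Random(5)
true_beta = [3.0, 0.05, 0.2, 0.0]
rows = [[rng.uniform(0, 300), rng.uniform(0, 50), rng.uniform(0, 100)]
        for _ in range(500)]
ys = [true_beta[0] + sum(b * x for b, x in zip(true_beta[1:], row))
      + rng.gauss(0, 1.5) for row in rows]

beta_hat = fit_multiple_ols(rows, ys)
print([round(b, 3) for b in beta_hat])
```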

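  • For a given amount of TV and newspaper advertising, spending an additional $1,000 on radio advertising is associated with an increase in sales of approximately 189 units
  • The newspaper coefficient is approximately $0$ and its $p$-value is not significant
    • This differs from the simple linear regression of sales on newspaper: simple and multiple regression coefficients can be quite different.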
    • This difference stems from the fact that in the simple regression case, the slope term represents the average effect of a $1,000 increase in newspaper advertising, ignoring other predictors such as TV and radio.
    • In contrast, in the multiple regression setting, the coefficient for newspaper represents the average effect of increasing newspaper spending by $1,000 while holding TV and radio fixed.
  • Does it make sense for the multiple regression to suggest no relationship between sales and newspaper while the simple linear regression implies the opposite?
    • It does
Correlation Matrix
  • Correlation between radio and newspaper is $0.35$
    • there is a tendency to spend more on newspaper advertising in markets where more is spent on radio advertising
  • Now suppose that the multiple regression is correct and newspaper advertising has no direct impact on sales, but radio advertising does increase sales.
  • Then in markets where we spend more on radio our sales will tend to be higher, and as our correlation matrix shows, we also tend to spend more on newspaper advertising in those same markets.
  • Hence, in a simple linear regression which only examines sales versus newspaper, we will observe that higher values of newspaper tend to be associated with higher values of sales, even though newspaper advertising does not actually affect sales.
  • So newspaper advertising is a surrogate for radio advertising; newspaper gets “credit” for the effect of radio on sales.
  • This slightly counterintuitive result is very common in many real life situations. Consider an absurd example to illustrate the point.
    • Running a regression of shark attacks versus ice cream sales for data collected at a given beach community over a period of time would show a positive relationship, similar to that seen between sales and newspaper.
    • Of course no one (yet) has suggested that ice creams should be banned at beaches to reduce shark attacks.
    • In reality, higher temperatures cause more people to visit the beach, which in turn results in more ice cream sales and more shark attacks.
    • A multiple regression of attacks versus ice cream sales and temperature reveals that, as intuition implies, the former predictor is no longer significant after adjusting for temperature.
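The shark attack example can be mimicked with a small simulation. In the sketch below, temperature drives both ice cream sales and attacks, and ice cream sales have no effect of their own; every number is invented for illustration, and attacks are treated as a continuous quantity with Gaussian noise, which is a stylised simplification. The helper names are hypothetical.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, ys):
    """Least squares via the normal equations; returns [intercept, slopes...]."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Invented data-generating process: temperature is the common cause.
rng = random.Random(7)
n = 2000
temperature = [rng.uniform(15.0, 35.0) for _ in range(n)]
ice_cream = [10.0 + 2.0 * t + rng.gauss(0.0, 5.0) for t in temperature]
attacks = [0.5 * t + rng.gauss(0.0, 2.0) for t in temperature]  # no ice cream term

# Simple regression: attacks on ice cream sales only.
simple = fit_ols([[ic] for ic in ice_cream], attacks)
# Multiple regression: attacks on ice cream sales and temperature.
multiple = fit_ols([[ic, t] for ic, t in zip(ice_cream, temperature)], attacks)

print(f"simple:   ice cream slope = {simple[1]:.3f}")
print(f"multiple: ice cream slope = {multiple[1]:.3f}, "
      f"temperature slope = {multiple[2]:.3f}")
```

In the simple regression the ice cream slope is clearly positive, because ice cream sales stand in for temperature; once temperature is included, the ice cream coefficient shrinks towards zero, mirroring what happens to newspaper once radio is in the model.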

Some Important Questions

  • When we perform multiple linear regression, we usually are interested in answering a few important questions.
    • Is at least one of the predictors $X_1,X_2, . . . , X_p$ useful in predicting the response?
    • Do all the predictors help to explain $Y$ , or is only a subset of the predictors useful?
    • How well does the model fit the data?
    • Given a set of predictor values, what response value should we predict, and how accurate is our prediction?