# Linear Regression

This lesson is from An Introduction to Statistical Learning

# Regression

- simple approach for supervised learning
- predicts a quantitative response
- Example
- develop a marketing plan for next year that will result in high product sales, using advertising data
- Is there a relationship between advertising budget and sales?
- no money should be spent on advertising if no relationship

- How strong is the relationship between advertising budget and sales?
- information about product sales based on knowledge of the advertising budget

- Which media are associated with sales?
- Are all three media—TV, radio, and newspaper—associated with sales, or are just one or two of the media associated?
- we must find a way to separate out the individual contribution of each medium to sales when we have spent money on all three media.

- How large is the association between each medium and sales?
- For every dollar spent on advertising in a particular medium, by what amount will sales increase?
- How accurately can we predict this amount of increase?

- How accurately can we predict future sales?
- For any given level of television, radio, or newspaper advertising, what is our prediction for sales, and what is the accuracy of this prediction?

- Is the relationship linear?
- If there is approximately a straight-line relationship between advertising expenditure in the various media and sales, then linear regression is an appropriate tool.
- If not, then it may still be possible to transform the predictor or the response so that linear regression can be used.

- Is there synergy among the advertising media?
- Perhaps spending $50,000$ on television advertising and $50,000$ on radio advertising is associated with higher sales than allocating $100,000$ to either television or radio individually.
- In marketing, this is known as a synergy effect, while in statistics it is called an interaction effect.

It turns out that linear regression can be used to answer each of these questions.


# Simple Linear Regression

- predicting a quantitative response Y on the basis of a single predictor variable X
- $ Y \approx \beta_0 + \beta_1X $
- $Sales \approx \beta_0 + \beta_1 \times TV $
- $\beta_0, \beta_1$ represent the intercept and slope, also known as the model coefficients or parameters

- We estimate $\hat{\beta_0}, \hat{\beta_1}$ using training data, and then future sales can be predicted using $\hat{y} = \hat{\beta_0} + \hat{\beta_1}x$

## Estimating the Coefficients

We may use training data to estimate the coefficients

- $y_i \approx \hat{\beta_0} + \hat{\beta_1}x_i$ for $i=1,2,…,n$

The objective is to find an intercept and slope so that the line is close to the $n=200$ data points

Least Squares Criterion

- Prediction is $\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i$
- $e_i = y_i - \hat{y_i}$ is the $i^{th}$ residual
- Residual Sum of Squares = $RSS = e_1^2+e_2^2+…+e_n^2$
- $RSS = (y_1-\hat{\beta_0}-\hat{\beta_1}x_1)^2 + (y_2-\hat{\beta_0}-\hat{\beta_1}x_2)^2+…+(y_n-\hat{\beta_0}-\hat{\beta_1}x_n)^2$
- By Calculus
- $\hat{\beta_1} = \frac{\sum\limits_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum\limits_{i=1}^{n}(x_i - \bar{x})^2}$
- $\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}$
- $\bar{x}$ and $\bar{y}$ are sample means
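These closed-form formulas are straightforward to apply directly. A minimal Python sketch on a small made-up sample (the numbers are illustrative, not from the Advertising data):

```python
import numpy as np

# Small illustrative sample (not the Advertising data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares estimates from the formulas above
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
# beta1_hat ≈ 1.99, beta0_hat ≈ 0.05 for this sample
```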

Sales are in thousands of units and the budget in thousands of dollars. With $\hat{\beta_0} = 7.03$ and $\hat{\beta_1} = 0.0475$, an additional $\$1{,}000$ spent on TV advertising (an increase of $1$ in $x$) is associated with selling approximately $47.5$ additional units of the product:

- $\hat{y}(x+1) - \hat{y}(x) = \hat{\beta_1} = 0.0475$ thousand units $= 47.5$ units

Figure: RSS plotted for different values of $\beta_0$ and $\beta_1$; the red dot marks where RSS is minimized.

## Assessing the Accuracy of the Coefficient Estimates

- True relationship
- $Y = \beta_0 + \beta_1 X + \epsilon$
- $\epsilon$ is mean-zero random error

- $\beta_0$ intercept
- expected value of $Y$ when $X=0$

- $\beta_1$ slope
- average increase in $Y$ associated with a one unit increase in X

- $Y = \beta_0 + \beta_1 X + \epsilon$
- Error $\epsilon$ may be due to other variables or measurement error
- independent of $X$

Scatter plot of $100$ random points generated from $Y = 2 + 3X + \epsilon$, where $\epsilon$ is drawn from a normal distribution with mean $0$.

Red line: $Y = 2 + 3X$, the true relationship

Blue line: the least squares line

Light blue: ten least squares lines, each computed from a different random set of observations

Why is there a difference between the population regression line and the least squares line?

- Standard statistical approach of using the information from a sample to estimate characteristics of a large population

If we estimate $\hat{\beta_0}$ and $\hat{\beta_1}$ on the basis of a particular data set, then our estimates won’t be exactly equal to $\beta_0$ and $\beta_1$.

- But if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would be spot on!
- In fact, we can see from the right hand panel that the average of many least squares lines, each estimated from a separate data set, is pretty close to the true population regression line.
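The averaging idea behind this figure can be sketched in code: repeatedly simulate data sets from the true model $Y = 2 + 3X + \epsilon$, fit a least squares line to each, and average the slope estimates. The sample size, the range of $X$, and the noise level below are assumptions for illustration, not the book's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)
true_b0, true_b1 = 2.0, 3.0   # the true population regression line Y = 2 + 3X

# Fit a least squares line on each of many simulated data sets
slopes = []
for _ in range(1000):
    x = rng.uniform(-2, 2, size=100)
    eps = rng.normal(0, 1, size=100)          # mean-zero error term
    y = true_b0 + true_b1 * x + eps
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    slopes.append(b1)

avg_slope = float(np.mean(slopes))   # close to the true slope 3
```

Individual slopes vary from data set to data set, but their average lands very near the true value, which is the unbiasedness point made above.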

Whenever we perform linear regression, we want to know if there is a statistically significant relationship between the predictor variable and the response variable.

Statistics

- Estimate population mean $\mu$ of a random variable $Y$ using sample mean $\bar{y}$
- we can say, $\hat{\mu} = \bar{y}$
- Average of $\hat{\mu}$ from huge number of datasets will give us $\mu$
- One dataset may over-estimate or under-estimate
- How far off will a single estimate $\hat{\mu}$ be?
- Standard Error of $\hat{\mu}$= $SE(\hat{\mu})$
- $Var(\hat{\mu}) = [SE(\hat{\mu})]^2 = \frac{\sigma^2}{n}$
- $\sigma$ is the standard deviation of each of the realizations $y_i$ of $Y$

- It will decrease as $n$ gets larger
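A quick simulation illustrates $Var(\hat{\mu}) = \sigma^2/n$: the empirical spread of the sample mean across many data sets shrinks like $1/\sqrt{n}$. The particular $\mu$, $\sigma$, and sample sizes here are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 5.0, 2.0          # arbitrary population mean and sd

# Empirical standard error of the sample mean at two sample sizes
se = {}
for n in (25, 400):
    means = [rng.normal(mu, sigma, size=n).mean() for _ in range(2000)]
    se[n] = float(np.std(means))

# Theory predicts sigma / sqrt(n): 0.4 for n=25 and 0.1 for n=400
```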

Standard Errors for $\hat{\beta_0}$ and $\hat{\beta_1}$

- $[SE(\hat{\beta_0})]^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum\limits_{i=1}^{n}(x_i - \bar{x})^2}\right]$
- $[SE(\hat{\beta_1})]^2 = \frac{\sigma^2}{\sum\limits_{i=1}^{n}(x_i - \bar{x})^2}$
- where $\sigma^2 = Var(\epsilon)$

$SE(\hat{\beta_1})$ is smaller when the $x_i$ are more spread out

- intuitively, we then have more leverage to estimate the slope

$[SE(\hat{\beta_0})]^2 = \frac{\sigma^2}{n} = [SE(\hat{\mu})]^2$ when $\bar{x} = 0$

In general, $\sigma^2$ is unknown but can be estimated using data. Estimate of $\sigma$ is known as residual standard error, $RSE = \sqrt{\frac{RSS}{n-2}}$
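Putting these pieces together, RSE and $SE(\hat{\beta_1})$ can be computed from the residuals; a sketch, again on an illustrative sample rather than the Advertising data:

```python
import numpy as np

# Illustrative sample (not the Advertising data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# Least squares fit (closed form)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

rss = np.sum(resid ** 2)
rse = np.sqrt(rss / (n - 2))                            # estimate of sigma
se_b1 = np.sqrt(rse ** 2 / np.sum((x - x.mean()) ** 2))  # plug RSE in for sigma
```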

Standard Errors can be used to compute confidence intervals

- 95% CI for $\beta_0$ and $\beta_1$
- $\hat{\beta_0} \pm 2 \cdot SE(\hat{\beta_0})$
- $\hat{\beta_1} \pm 2 \cdot SE(\hat{\beta_1})$

For the Advertising data, $\hat{\beta_0} = 7.03$ and $\hat{\beta_1} = 0.0475$

The CIs are $\beta_0 \in [6.130, 7.935]$ and $\beta_1 \in [0.042, 0.053]$

Therefore, we can conclude that in the absence of any advertising, sales will, on average, fall somewhere between 6,130 and 7,935 units.

Furthermore, for each $1,000 increase in television advertising, there will be an average increase in sales of between 42 and 53 units.
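The interval arithmetic is just "estimate $\pm\ 2 \cdot SE$", using the coefficient estimates and standard errors quoted in these notes; the slight mismatch with $[6.130, 7.935]$ presumably comes from using the exact $t$ quantile rather than the rule-of-thumb factor $2$:

```python
# Approximate 95% confidence intervals via the "estimate ± 2·SE" rule,
# using the Advertising coefficient estimates and standard errors
# quoted in these notes.
b0_hat, se_b0 = 7.0325, 0.4578
b1_hat, se_b1 = 0.0475, 0.0027

ci_b0 = (b0_hat - 2 * se_b0, b0_hat + 2 * se_b0)   # ≈ (6.12, 7.95)
ci_b1 = (b1_hat - 2 * se_b1, b1_hat + 2 * se_b1)   # ≈ (0.042, 0.053)
```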

Standard errors can also be used to perform hypothesis tests on the coefficients.

The most common hypothesis test involves testing the null hypothesis:

- $H_0$ : There is no relationship between $X$ and $Y$
- $H_a$ : There is some relationship between $X$ and $Y$

Mathematically,

- $H_0:~\beta_1 = 0 $
- $H_a:~\beta_1 \ne 0 $

$\beta_1 = 0 \implies Y = \beta_0 + \epsilon$

- $X$ is not associated with $Y$

We need to determine whether $\hat{\beta_1}$ is sufficiently far from $0$ so that we can be confident that $\beta_1$ is non-zero

How far is far - depends on $SE(\hat{\beta_1})$

If $SE(\hat{\beta_1})$ is small, then even relatively small values of $\hat{\beta_1}$ may provide strong evidence that $\beta_1 \ne 0$

If $SE(\hat{\beta_1})$ is large, then $\hat{\beta_1}$ must be large in absolute value to reject the null hypothesis

In practice, we compute the $t$-statistic

- \[t = \frac{\hat{\beta_1}-0}{SE(\hat{\beta_1})} \label{eqn:t}\]
- measures the number of standard deviations that $\hat{\beta_1}$ is away from $0$

If there is no relationship between $X$ and $Y$, then Equation $\ref{eqn:t}$ will have $t$ distribution with $n-2$ degrees of freedom.

The t-distribution has a bell shape and for values of n greater than approximately 30 it is quite similar to the standard normal distribution.

Consequently, it is a simple matter to compute the probability of observing any number equal to $\mid t \mid$ or larger in absolute value, assuming $\beta_1 = 0$.

We call this probability the $p$-value.

- a small $p$-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response.
- Small p-value
- reject null
- we can infer that there is an association between the predictor and the response

- We reject the null hypothesis—that is, we declare a relationship to exist between X and Y —if the p-value is small enough.
- Typical p-value cutoffs for rejecting the null hypothesis are 5% or 1%
- When n = 30, these correspond to t-statistics of around 2 and 2.75, respectively.
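Using the TV coefficient estimate and standard error quoted in these notes, the $t$-statistic and an approximate two-sided $p$-value can be computed with the standard library alone; the normal approximation to the $t$ distribution is reasonable here since $n = 200 \gg 30$:

```python
from statistics import NormalDist

# TV coefficient estimate and standard error from the Advertising table
b1_hat, se_b1 = 0.0475, 0.0027

# Number of standard errors that the estimate is away from zero
t_stat = (b1_hat - 0) / se_b1            # ≈ 17.6

# Two-sided p-value under the normal approximation to the t distribution
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))   # essentially zero
```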

Advertising Data

|           | Coefficient | Std. Error | t-statistic | p-value  |
|-----------|-------------|------------|-------------|----------|
| Intercept | 7.0325      | 0.4578     | 15.36       | < 0.0001 |
| TV        | 0.0475      | 0.0027     | 17.67       | < 0.0001 |

- $Sales \approx \hat{\beta_0} + \hat{\beta_1} \times TV = 7.0325 + 0.0475 \times TV$
- An increase of USD 1,000 in the TV advertising budget is associated with an increase in sales by around $50$ units.
- Coefficients $\hat{\beta_0}$ and $\hat{\beta_1}$ are very large relative to their standard errors
- t-statistics are also large
- the probabilities of seeing such values if $H_0$ is true are virtually zero.
- $p$-value for $\beta_0$ and $\beta_1$ are small
- Hence we can conclude that $\beta_0 \ne 0$ and $\beta_1 \ne 0$

## Assessing the Accuracy of the Model

- Once we have rejected the null hypothesis $H_0$ of no relationship between $X$ and $Y$
- we want to quantify the extent to which the model fits the data

- Quality of Linear Regression
- Residual Standard Error (RSE)
- $R^2$ statistic

### Residual Standard Error (RSE)

Since the true model is $Y = \beta_0 + \beta_1 X + \epsilon$, due to presence of error term, we will not be able to accurately predict even if $\beta_0$ and $\beta_1$ are known.

RSE is estimate of standard deviation of $\epsilon$

It is the average amount that the response will deviate from the true regression line.

$RSE = \sqrt{\frac{RSS}{n-2}} = \sqrt{\frac{\sum\limits_{i=1}^{n}(y_i-\hat{y_i})^2}{n-2}}$

- where $RSS$ is residual sum of squares

Advertising Data

| Quantity                | Value |
|-------------------------|-------|
| Residual Standard Error | 3.26  |
| $R^2$                   | 0.612 |
| $F$-statistic           | 312.1 |

- RSE is $3.26$
- actual sales in each market deviate from the true regression line by approximately 3,260 units, on average.
- In the advertising data set, the mean value of sales over all markets is approximately $14,000$ units, and so the percentage error is $3,260/14,000 = 23\%$.

- RSE is considered a measure of the lack of fit

### $R^2$ Statistic

RSE, measure of lack of fit, is measured in units of $Y$ so not always clear if it is a good measure.

$R^2$ takes the form of a proportion (the proportion of variance explained), varies in $[0,1]$, and is independent of the scale of $Y$

$R^2 = \frac{TSS-RSS}{TSS} = 1 - \frac{RSS}{TSS}$

- TSS is total sum of squares = $\sum (y_i-\bar{y})^2$
- RSS is residual sum of squares = $\sum (y_i - \hat{y_i})^2$

TSS

- measures the total variance in the response Y , and
- can be thought of as the amount of variability inherent in the response before the regression is performed.

RSS

- measures the amount of variability that is left unexplained after performing the regression.

$TSS - RSS$

- measures the amount of variability in the response that is explained (or removed) by performing the regression
- $R^2$ measures the proportion of variability in Y that can be explained using X.
- An $R^2$ statistic that is close to $1$ indicates that a large proportion of the variability in the response is explained by the regression.
- A number near $0$ indicates that the regression does not explain much of the variability in the response.
- this might occur because the linear model is wrong, or the error variance $\sigma^2$ is high, or both.
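A minimal sketch of computing $R^2$ from TSS and RSS (illustrative numbers, not the Advertising data):

```python
import numpy as np

# Illustrative sample (not the Advertising data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares fit (closed form)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)   # total variability before the regression
rss = np.sum((y - y_hat) ** 2)      # variability left unexplained
r_squared = 1 - rss / tss           # proportion of variability explained
```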

Example - Advertising data

| Quantity                | Value |
|-------------------------|-------|
| Residual Standard Error | 3.26  |
| $R^2$                   | 0.612 |
| $F$-statistic           | 312.1 |

$R^2 = 0.612$ means that roughly two-thirds of the variability in Sales has been explained by the regression on TV.

$R^2$ is easier to interpret than $RSE$ because its value always lies between $0$ and $1$. However, it is still a challenge to determine what counts as a good $R^2$ value; this depends on the application.

- In some applications an $R^2$ close to $1$ is expected, while in others a value below $0.1$ can be realistic, when the linear model is only a rough approximation and the residual errors due to other unmeasured factors are large

$R^2$ is a measure of the linear relationship between $X$ and $Y$. Correlation is also a measure of the linear relationship between $X$ and $Y$:

- $r = Cor(X,Y) = \frac{\sum\limits_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum\limits_{i=1}^{n}(x_i - \bar{x})^2} \sqrt{\sum\limits_{i=1}^{n}(y_i - \bar{y})^2}}$
- strictly this is the sample correlation $\widehat{Cor}$; the hat is omitted for ease of notation

Can we use $r$ instead of $R^2$?

- In simple linear regression, we can show that $r^2 = R^2$
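This identity is easy to check numerically; `np.corrcoef` returns the sample correlation matrix, and the data here are illustrative:

```python
import numpy as np

# Illustrative sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Sample correlation r from the correlation matrix
r = np.corrcoef(x, y)[0, 1]

# R^2 from the least squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
rss = np.sum((y - (b0 + b1 * x)) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r_squared = 1 - rss / tss

# In simple linear regression, r**2 and r_squared agree to machine precision
```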

Advantage of $R^2$

- Multiple Linear Regression where many predictors simultaneously predict the response
- $Cor$ quantifies the association between a single pair of variables rather than between a larger number of variables
- $R^2$ works well with multiple linear regression