# Linear Regression


This lesson is from An Introduction to Statistical Learning

# Regression

• simple approach for supervised learning
• predicts a quantitative response
• Example
• a marketing plan for next year that will result in high product sales using advertising data
• Is there a relationship between advertising budget and sales?
• no money should be spent on advertising if no relationship
• How strong is the relationship between advertising budget and sales?
• information about product sales based on knowledge of the advertising budget
• Which media are associated with sales?
• Are all three media (TV, radio, and newspaper) associated with sales, or are just one or two of them?
• we must find a way to separate out the individual contribution of each medium to sales when we have spent money on all three media.
• How large is the association between each medium and sales?
• For every dollar spent on advertising in a particular medium, by what amount will sales increase?
• How accurately can we predict this amount of increase?
• How accurately can we predict future sales?
• For any given level of television, radio, or newspaper advertising, what is our prediction for sales, and what is the accuracy of this prediction?
• Is the relationship linear?
• If there is approximately a straight-line relationship between advertising expenditure in the various media and sales, then linear regression is an appropriate tool.
• If not, then it may still be possible to transform the predictor or the response so that linear regression can be used.
• Is there synergy among the advertising media?
• Perhaps spending $50,000$ on television advertising and $50,000$ on radio advertising is associated with higher sales than allocating $100,000$ to either television or radio individually.
• In marketing, this is known as a synergy effect, while in statistics it is called an interaction effect.
• It turns out that linear regression can be used to answer each of these questions.

# Simple Linear Regression

• predicting a quantitative response Y on the basis of a single predictor variable X
• $Y \approx \beta_0 + \beta_1X$
• $Sales \approx \beta_0 + \beta_1 \times TV$
• $\beta_0, \beta_1$ represent intercept and slope aka model coefficients or parameters
• We estimate $\hat{\beta_0}, \hat{\beta_1}$ from training data; future sales can then be predicted using $\hat{y} = \hat{\beta_0} + \hat{\beta_1}x$

## Estimating the Coefficients

• We may use training data to estimate the coefficients

• $y_i \approx \hat{\beta_0} + \hat{\beta_1}x_i$ for $i=1,2,…,n$
• The objective is to find an intercept and a slope such that the resulting line is as close as possible to the $n = 200$ data points

• Least Squares Criterion

• Prediction is $\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i$
• $e_i = y_i - \hat{y_i}$ is the $i^{th}$ residual
• Residual Sum of Squares = $RSS = e_1^2+e_2^2+…+e_n^2$
• $RSS = (y_1-\hat{\beta_0}-\hat{\beta_1}x_1)^2 + (y_2-\hat{\beta_0}-\hat{\beta_1}x_2)^2+…+(y_n-\hat{\beta_0}-\hat{\beta_1}x_n)^2$
• Minimizing RSS with respect to $\hat{\beta_0}$ and $\hat{\beta_1}$ (using calculus) gives
• $\hat{\beta_1} = \frac{\sum\limits_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum\limits_{i=1}^{n}(x_i - \bar{x})^2}$
• $\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}$
• $\bar{x}$ and $\bar{y}$ are sample means
• With sales in thousands of units and TV budget in thousands of dollars, $\hat{\beta_0} = 7.03$ and $\hat{\beta_1} = 0.0475$
• An additional $\$1000$ spent on TV advertising is associated with selling approximately $47.5$ additional units of product

• $\hat{y}(x+1) = 7.03 + 0.0475 (x + 1) = \hat{y}(x) + 0.0475$, and $0.0475$ thousand units is $47.5$ units
• RSS plot for different values of $\beta_0, \beta_1$ , red dot where RSS is minimum
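
The closed-form estimates above can be sketched in a few lines of NumPy. This is a minimal illustration, not the book's code: the function names `fit_simple_ols` and `rss` are hypothetical, and the five data points are made up for the example (they are not the Advertising data).

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (beta0_hat, beta1_hat) from the least squares formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: forces the line through the point (x_bar, y_bar)
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

def rss(x, y, beta0, beta1):
    """Residual sum of squares for the line beta0 + beta1 * x."""
    residuals = np.asarray(y, dtype=float) - (beta0 + beta1 * np.asarray(x, dtype=float))
    return float(np.sum(residuals ** 2))

# Toy data, invented for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

beta0_hat, beta1_hat = fit_simple_ols(x, y)
print(beta0_hat, beta1_hat, rss(x, y, beta0_hat, beta1_hat))
```

Moving either coefficient away from the fitted value can only increase `rss`, which is the property the red dot in the RSS plot illustrates.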

## Assessing the Accuracy of the Coefficient Estimates

• True relationship
• $Y = \beta_0 + \beta_1 X + \epsilon$
• $\epsilon$ is mean-zero random error
• $\beta_0$ intercept
• expected value of $Y$ when $X=0$
• $\beta_1$ slope
• average increase in $Y$ associated with a one unit increase in X
• Error $\epsilon$ may be due to other variables or measurement error
• typically assumed to be independent of $X$
• Figure: scatter plot of 100 random points from $Y = 2 + 3X + \epsilon$, where $\epsilon$ is generated from a normal distribution with mean $0$
• Red line: $Y = 2 + 3X$, the true relationship (population regression line)

• Blue: least squares line

• Light blue: ten least squares lines, each computed from a different random set of observations

• Why do the population regression line and the least squares line differ?

• Standard statistical approach of using the information from a sample to estimate characteristics of a large population
• if we estimate $\hat{\beta_0}$ and $\hat{\beta_1}$ on the basis of a particular data set, then our estimates won’t be exactly equal to $\beta_0$ and $\beta_1$.

• But if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would be spot on!
• In fact, we can see from the right hand panel that the average of many least squares lines, each estimated from a separate data set, is pretty close to the true population regression line.
• Whenever we perform linear regression, we want to know if there is a statistically significant relationship between the predictor variable and the response variable.
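
The averaging argument above can be checked with a small simulation of $Y = 2 + 3X + \epsilon$. This is a sketch under assumptions that the notes do not specify: the predictor range, the noise standard deviation, the number of data sets, and the random seed are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily
n_datasets, n_obs = 1000, 100    # assumption: 1000 data sets of 100 points each

intercepts = np.empty(n_datasets)
slopes = np.empty(n_datasets)
for k in range(n_datasets):
    x = rng.uniform(-2, 2, size=n_obs)     # assumption: predictor range
    eps = rng.normal(0, 2, size=n_obs)     # assumption: noise sd of 2
    y = 2 + 3 * x + eps                    # true population line
    x_bar, y_bar = x.mean(), y.mean()
    slope = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    slopes[k] = slope
    intercepts[k] = y_bar - slope * x_bar

mean_intercept = intercepts.mean()
mean_slope = slopes.mean()
print("average intercept:", mean_intercept)
print("average slope:", mean_slope)
print("spread of individual slopes (sd):", slopes.std())
```

Individual fits scatter around the true line, while the averages of the estimates should land close to $2$ and $3$.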

• Statistics

• Estimate population mean $\mu$ of a random variable $Y$ using sample mean $\bar{y}$
• we can say, $\hat{\mu} = \bar{y}$
• Average of $\hat{\mu}$ from huge number of datasets will give us $\mu$
• One dataset may over-estimate or under-estimate
• How far off from $\mu$ will that single estimate $\hat{\mu}$ be?
• Standard error of $\hat{\mu}$ = $SE(\hat{\mu})$
• $Var(\hat{\mu}) = [SE(\hat{\mu})]^2 = \frac{\sigma^2}{n}$
• $\sigma$ is the standard deviation of each of the realizations $y_i$ of $Y$
• $SE(\hat{\mu})$ decreases as $n$ gets larger
• Standard Error for $\beta_0$ and $\beta_1$

• $[SE(\hat{\beta_0})]^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum\limits_{i=1}^{n}(x_i - \bar{x})^2}\right]$
• $[SE(\hat{\beta_1})]^2 = \frac{\sigma^2}{\sum\limits_{i=1}^{n}(x_i - \bar{x})^2}$
• where $\sigma^2 = Var(\epsilon)$
• $SE(\hat{\beta_1})$ is smaller when the $x_i$ are more spread out

• intuitively, we then have more leverage to estimate a slope
• $[SE(\hat{\beta_0})]^2 = \frac{\sigma^2}{n} = [SE(\hat{\mu})]^2$ when $\bar{x} = 0$

• In general, $\sigma^2$ is unknown but can be estimated from the data. The estimate of $\sigma$ is known as the residual standard error, $RSE = \sqrt{\frac{RSS}{n-2}}$
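
The standard error formulas above, with $\sigma$ replaced by the RSE, can be sketched as follows. The function name `ols_standard_errors` and the synthetic data are assumptions made for illustration; the data are centered so that the $\bar{x} = 0$ special case can be checked.

```python
import numpy as np

def ols_standard_errors(x, y):
    """Fit simple linear regression and return estimates, RSE and standard errors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_bar = x.mean()
    sxx = np.sum((x - x_bar) ** 2)
    beta1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
    beta0 = y.mean() - beta1 * x_bar
    rss = np.sum((y - beta0 - beta1 * x) ** 2)
    rse = np.sqrt(rss / (n - 2))                       # estimate of sigma
    se_beta0 = rse * np.sqrt(1 / n + x_bar ** 2 / sxx)
    se_beta1 = rse / np.sqrt(sxx)
    return {"beta0": beta0, "beta1": beta1, "rse": rse,
            "se_beta0": se_beta0, "se_beta1": se_beta1}

# Synthetic data (assumption): true line 1 + 2x with unit-variance noise
rng = np.random.default_rng(1)
x = rng.normal(size=50)
x = x - x.mean()                                       # center so that x_bar = 0
y = 1.0 + 2.0 * x + rng.normal(size=50)

result = ols_standard_errors(x, y)
print(result)
# With x_bar = 0, se_beta0 reduces to rse / sqrt(n)
print(result["se_beta0"], result["rse"] / np.sqrt(x.size))
```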

• Standard Errors can be used to compute confidence intervals

• Approximate 95% CI for $\beta_0$ and $\beta_1$
• $\hat{\beta_0} \pm 2 \cdot SE(\hat{\beta_0})$
• $\hat{\beta_1} \pm 2 \cdot SE(\hat{\beta_1})$
• $\hat{\beta_0} = 7.03$ and $\hat{\beta_1} = 0.0475$

• The 95% CIs are $[6.130, 7.935]$ for $\beta_0$ and $[0.042, 0.053]$ for $\beta_1$

• Therefore, we can conclude that in the absence of any advertising, sales will, on average, fall somewhere between 6,130 and 7,935 units.
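
One way to see what "95%" means for the $\pm 2 \cdot SE$ rule is to count how often the interval for the slope covers the true value across many simulated data sets. This is a sketch on synthetic data only; the true coefficients, noise level, sample size, and seed are assumptions, and nothing here uses the Advertising data.

```python
import numpy as np

rng = np.random.default_rng(42)                 # seed chosen arbitrarily
true_beta0, true_beta1, sigma = 2.0, 3.0, 1.0   # assumed population values
n_obs, n_datasets = 100, 2000

covered = 0
for _ in range(n_datasets):
    x = rng.uniform(0, 10, size=n_obs)
    y = true_beta0 + true_beta1 * x + rng.normal(0, sigma, size=n_obs)
    x_bar = x.mean()
    sxx = np.sum((x - x_bar) ** 2)
    beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / sxx
    beta0_hat = y.mean() - beta1_hat * x_bar
    rse = np.sqrt(np.sum((y - beta0_hat - beta1_hat * x) ** 2) / (n_obs - 2))
    se_beta1 = rse / np.sqrt(sxx)
    # Approximate 95% interval: estimate plus or minus two standard errors
    lower, upper = beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1
    covered += int(lower <= true_beta1 <= upper)

coverage = covered / n_datasets
print("fraction of intervals containing the true slope:", coverage)
```

The printed fraction should be close to 0.95, though the exact value depends on the seed and the assumed settings.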
