# Chapter 1: Introduction

** Published:**

# Chapter 2: Statistical Learning

## What is Statistical Learning

Task: investigate the association between advertising and sales of a particular product

Data: ad budget (in TV, Radio, and Newspaper) and sales for 200 markets

Sales being predicted for TV, Radio, Newspaper Goal: Develop an accurate model that can be used to predict sales on the basis of the three media budgets.

$X$: Predictor, independent variables, features, input variables

- TV, Radio, Newspaper
- $X = (X_1, X_2, …, X_p)$

$Y$: Response, dependent variable

- Sales

Relationship:

- \[Y = f(X) + \epsilon\]
$f$ is fixed, but an unknown function of $X_1, X_2, …, X_p$

- $\epsilon$ is a random error term independent of $X$ and mean zero

Some errors are +ve and some -ve. Overall errors have approximately a mean zero Two Predictors and one response. Some errors are +ve and some -ve. Overall errors have approximately a mean zero In essence, statistical learning refers to a set of approaches for estimating $f$.

### Why Estimate $f$

Two Reasons

- Prediction
- Inference

Prediction: $\hat{f}$ is black box

In many cases, $X$ is known and $Y$ is unknown. Since, $\epsilon$ averages to $0$

- \[\hat{Y} = \hat{f}(X)\]

Accuracy of $\hat{Y}$ as a prediction of $Y$ depends on:

reducible error

- since $\hat{f}$ is not a perfect estimate of $f$ but can be made perfect using a different technique

irreducible error

- $Y$ is also a function of $\epsilon$. Further, $\epsilon$ may contain unmeasured variables that are useful in predicting $Y$

- \[E(Y - \hat{Y})^2 &= E[f(X) + \epsilon - \hat{f}(X)]^2 \\ &= [f(X) - \hat{f}(X)]^2 + Var(\epsilon) \\ &= \text{Reducibile} + \text{irreducibile}\]

Inference: $\hat{f}$ is not a black box

- understanding association between $Y$ and $X_1, …, X_p$
- Questions that can be answered:
- Which predictors are associated with the response?
- maybe a subset

- What is the relationship between the response and each predictor?
- maybe positive or opposite

- Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

- Which predictors are associated with the response?
- Inference on Advertising data:
- Which media are associated with sales?
- Which media generate the biggest boost in sales? or
- How large of an increase in sales is associated with a given increase in TV advertising?

- Real Estate Data
- How much extra will a house with a river view be worth?
- Inference

- Is this house under or over-valued?
- Prediction

- How much extra will a house with a river view be worth?
- Goal: Prediction or Inference or Combination
- Linear model: simple and interpretable inference but less accurate prediction
- Non-linear Approaches: More accurate predictions at the cost of inference and less interpretable model

### How do we estimate $f$

Using Training Data

- ${(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)}$, where $xi = (x_{i1}, x_{i2}, . . . ,x_{ip})^T$
- Find $\hat{f}$ such that $Y \approx \hat{f}(X)$
- The method can be Parametric or Non-parametric

Parametric Methods

Two steps approach

Step 1: make assumptions about the functional form, e.g. shape

e.g. Linear Model: $f$ is linear in $X$

- \[f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + · · · + \beta_pX_p \label{eq:lm}\]
- Now, instead of estimating entirely arbitrary p-dimensional function f(X), one only needs to estimate the $p + 1$ coefficients $\beta_0, \beta_1, . . . , \beta_p$.

Step 2: Train the model on the data

- Linear model in Equation $\eqref{eq:lm}$ can be trained by (ordinary) least squares

Problem is to reduced to find a set of parameters => parametric

However, the model may not match the true form of $f$. We can choose flexible model with greater number of parameters. However, this can overfit that means it can fit to noise or errors too closely.

Parametric Approach applied to Income Data $income \approx \beta_0 + \beta_1 \times education + \beta_2 \times seniority$

- True data as in figure 2.3 has some curvature while 2.4 doesn’t capture that. However it still captures the relationship.

Non-parametric Methods

No assumption about $f$

- However, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for $f$.

For example, thin-plate spline

Thin-plate Spline, need to select level of smoothness Thin-plate Spline, with low level of smoothness (zero error) - overfitting

### The Trade-Off Between Prediction Accuracy and Model Interpretability

Low Flexibility => High Interpretability and High Flexibility => Low Interpretability |

### Supervised Versus Unsupervised Learning

### Regression Versus Classification Problems

## Assessing Model Accuracy

### Measuring the Quality of Fit

- Compute for test data

Left: Actual (Black), Linear Regression (orange), Split fit 1(blue), Split fit 2(green) Right: Train and Test Error for Linear Regression (orange), Split fit 1(blue), Split fit 2(green) |

Another example: Linear Regression has both errors small |

Linear Regression is poor. Both errors are high. |

### The Bias-Variance Trade-Off

\[\text{Expected test MSE at } x_0 = E[y_0 - \hat{f}(x_0)]^2 = Var[\hat{f}(x_0)] + [Bias\hat{f}(x_0)]^2 + Var(\epsilon)\]- To minimize the test MSE, we need to minimize Var and Bias. Variance is non-negative and Squared Bias is also non-negative. Thus minimum expected test is $Var(\epsilon)$, irreducible term.
- Variance
- refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set.
- Ideally, $\hat{f}
*{training_data_1} \approx \hat{f}*{training_data_2}$ - High Var
- small changes in training data => large changes in $\hat{f}$
- In general, more flexible statistical methods => higher variance.

- Bias
- refers to the error that is introduced by approximating a real-life problem e.g. assumption that there is linear relationship. Irrespective of the amount of training data, there will always be error.

- Trade-off
- Linear Regression can have more bias. Thus, we need more flexible methods to accurately represent the data but that can lead to high variance