# Statistical Learning


This lesson is from *An Introduction to Statistical Learning*

# What Is Statistical Learning?

The Advertising data set consists of the sales of a product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper

Sales as a function of media budget for 200 different markets

The goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets

- Advertising budgets
  - Input variables, $X$
  - $X_1, X_2, X_3$ for TV, radio, newspaper
  - also called predictors, independent variables, features, or simply variables
- Sales
  - Output variable, $Y$
  - also called the response or dependent variable

- In general, we assume a quantitative response $Y$ and $p$ different predictors, $X_1, X_2,\ldots, X_p$. We assume that there is some relationship between $Y$ and $X = (X_1, X_2,\ldots, X_p)$
- $Y = f(X) + \epsilon$
  - $f$ is a fixed but unknown function, and
  - $\epsilon$ is a random error term, independent of $X$, with mean $0$
- $f$ represents the systematic information that $X$ provides about $Y$
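
As a minimal sketch, the model $Y = f(X) + \epsilon$ can be simulated directly; the particular $f$, sample size, and noise level below are assumptions chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The assumed "true" systematic relationship (illustrative only)
    return 5 + 3 * np.sin(x)

x = rng.uniform(0, 10, size=200)    # predictor values for 200 observations
eps = rng.normal(0, 1, size=200)    # random error: mean 0, independent of X
y = f(x) + eps                      # observed response Y = f(X) + epsilon
```

By construction the errors average out to roughly zero, so the systematic part of $Y$ is carried entirely by $f$.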

Red: observed values; blue curve: the unknown function $f$; black lines: errors (positive or negative, with overall mean zero)

Red: observed values for some individuals; blue surface: the true relationship between income, years of education, and seniority (known because the data are simulated)

## Why Estimate $f$?

- Prediction
- Inference

### Prediction

- Since the error term averages to zero, we can predict $Y$ using $\hat{Y} = \hat{f}(X)$
- Example
  - $X_1, \ldots, X_p$ are characteristics of a patient's blood sample, and $Y$ encodes the risk of a severe adverse reaction to a particular drug

- The accuracy of $\hat{Y}$ as a prediction of $Y$ depends on the reducible error and the irreducible error
- Reducible Error
  - the inaccuracy arising because $\hat{f}$ is not a perfect estimate of $f$
  - this error can be reduced by using a more appropriate technique to estimate $f$
  - even a perfect estimate, giving $\hat{Y} = f(X)$, would still leave some error

- Irreducible Error
  - $Y$ is also a function of $\epsilon$, which cannot be predicted from $X$
  - e.g. the risk of an adverse reaction might vary for a given patient on a given day, depending on manufacturing variation in the drug or the patient's general feeling of well-being on that day
- The average, or expected value, of the squared difference between the predicted and actual response: $E(Y-\hat{Y})^2$
  - $\hat{Y} = \hat{f}(X)$
  - assume $\hat{f}$ and $X$ are fixed
  - the only variability comes from $\epsilon$

- $E(Y-\hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2$ since $Y = f(X) +\epsilon$ and $\hat{Y} = \hat{f}(X)$
- Now, it can be shown that
  - $E(Y-\hat{Y})^2 = [f(X) - \hat{f}(X)]^2 + Var(\epsilon)$
  - Reducible Error: $[f(X) - \hat{f}(X)]^2$
  - Irreducible Error: $Var(\epsilon)$

- Our objective is to minimize the reducible error
- The irreducible error will always provide an upper bound on the accuracy of our prediction of $Y$
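
A quick simulation illustrates the decomposition: with $\hat{f}$ and $X$ held fixed, the average squared prediction error settles at the reducible error plus $Var(\epsilon)$. All numbers below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

f_x0 = 4.0       # true value f(x0) (an assumed number, for illustration)
fhat_x0 = 3.5    # a fixed, imperfect estimate fhat(x0)
sigma = 1.0      # standard deviation of the error term epsilon

# Draw many realizations of Y = f(x0) + epsilon and average the squared error
eps = rng.normal(0.0, sigma, size=1_000_000)
y = f_x0 + eps
mse = np.mean((y - fhat_x0) ** 2)

reducible = (f_x0 - fhat_x0) ** 2    # [f(x0) - fhat(x0)]^2 = 0.25
irreducible = sigma ** 2             # Var(epsilon) = 1.0
# mse should be close to reducible + irreducible = 1.25
```

No matter how good $\hat{f}$ becomes, the averaged error cannot fall below `irreducible`.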

### Inference

- We want to infer how the output is generated as a function of the data.
- Which predictors are associated with the response?
- What is the relationship between the response and each predictor? It may be positive or negative
- Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

### Examples

- Direct Marketing Campaign
  - identify individuals likely to respond positively based on demographic variables
  - we are not deeply interested in understanding the relationships between the predictor variables and the response, only in predicting it accurately
  - a prediction problem

- Advertising Budget
  - sales and advertising budgets for different markets
  - the response is sales; the predictor variables are the advertising budgets for the different media
  - we are interested in understanding the relationships:
    - Which media are associated with sales?
    - Which media generate the biggest boost in sales?
    - How large an increase in sales is associated with a given increase in TV advertising?
  - an inference problem

- Different methods may be appropriate depending on whether the goal is prediction or inference
- Linear models
  - allow relatively simple and interpretable inference
  - may give less accurate predictions
- Complex models
  - involve highly non-linear approaches
  - can give more accurate predictions
  - are less interpretable, so inference is challenging

## How Do We Estimate $f$?

- We consider linear and non-linear approaches for estimating $f$
- Suppose we have a set of observations, the training data
  - $x_{ij}$: the $j^{th}$ predictor/input for the $i^{th}$ observation
  - $i=1,2,\ldots,n$ and $j=1,2,\ldots,p$

- $y_i$
- response variable for $i^{th}$ observation

- Training Data
  - $\{(x_1, y_1), (x_2, y_2),\ldots,(x_n, y_n)\}$
  - where $x_i = (x_{i1},x_{i2},\ldots,x_{ip})^T$ for $i = 1, 2, \ldots, n$
- Objective of Statistical Learning Method
- Find $\hat{f}$ such that $Y \approx \hat{f}(X)$
- Two approaches
- Parametric
- Non-parametric

### Parametric Methods

First Step: make an assumption about the functional form of $f$

- e.g. linear
- $f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p$
- We then only need to estimate the $p+1$ coefficients $\beta_0, \beta_1,\ldots, \beta_p$

Second Step: fit/train the model

- Estimate $\beta_0, \beta_1,\ldots, \beta_p$ such that $Y \approx \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p$
- e.g. the least squares approach

Observations in red, and the linear plane that fits the data in yellow. This approach is called parametric since the problem of determining $f$ has been reduced to estimating a set of parameters

A disadvantage is that the assumed model may not match the true form of $f$

We can try choosing more flexible models that can fit many different functional forms

However, this increases the number of parameters and may lead to overfitting the data

*https://medium.com/@cs.sabaribalaji/overfitting-6c1cd9af589*

$\text{income} \approx \beta_0 + \beta_1 \times \text{education} + \beta_2 \times \text{seniority}$

Linear fit
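
Fitting such a linear form by least squares can be sketched with NumPy on simulated data; the coefficient values and variable ranges below are assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# Simulated income data; the "true" coefficients (20, 5, 0.5) are assumptions
education = rng.uniform(10, 22, n)
seniority = rng.uniform(0, 40, n)
income = 20 + 5 * education + 0.5 * seniority + rng.normal(0, 5, n)

# Design matrix with an intercept column; least squares estimates the betas
X = np.column_stack([np.ones(n), education, seniority])
beta_hat, *_ = np.linalg.lstsq(X, income, rcond=None)
# beta_hat should be close to the assumed (20, 5, 0.5)
```

The whole problem of estimating $f$ has collapsed into estimating the three entries of `beta_hat`, which is exactly what makes the method parametric.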

### Non-Parametric Methods

No assumptions are made about the functional form of $f$

Instead, we estimate $f$ by getting as close to the data points as possible; however, a large number of observations is needed to obtain an accurate estimate of $f$

A thin-plate spline can be used to estimate $f$; with too little smoothing it follows the errors too closely and overfits the data
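
As a minimal non-parametric sketch, $K$-nearest-neighbors regression estimates $f(x_0)$ directly from nearby observations with no assumed functional form (a thin-plate spline needs more machinery; the data below are simulated for illustration):

```python
import numpy as np

def knn_regress(x_train, y_train, x0, k):
    """Estimate f(x0) by averaging the responses of the k nearest training points."""
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.2, 200)   # assumed true f(x) = sin(x)

est = knn_regress(x, y, x0=5.0, k=10)
# est should be reasonably close to sin(5.0); with k=1 the estimate would
# chase individual noisy points and overfit
```

Nothing about the shape of $f$ was assumed; accuracy comes entirely from having enough observations near $x_0$.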

## The Trade-Off Between Prediction Accuracy and Model Interpretability

- Some methods are more restrictive, i.e. less flexible: e.g. linear regression is relatively inflexible, while thin-plate splines are highly flexible
- If we are mainly interested in inference, restrictive models are preferred because they are more interpretable: e.g. linear regression is inflexible, but its fitted coefficients are easy to interpret

## Supervised Versus Unsupervised Learning

Supervised Learning

- For each observation of the predictor measurement(s) $x_i, i = 1, . . . ,n$ there is an associated response measurement $y_i$.

Unsupervised Learning

For every observation $i = 1, . . . ,n$, we observe a vector of measurements $x_i$ but no associated response $y_i$.

We can seek to understand the relationships between the variables or between the observations - Cluster Analysis

Plotting $X_1$ vs $X_2$, the groups may be easily separable (left) or challenging to separate (right): e.g. three customer groups in a market segmentation analysis
- big spenders vs low spenders

Semi-supervised Learning

- In a set of $n$ observations, we have $m$ observations with both predictor and response variables, where $m < n$
- For the remaining $n-m$ observations, we have only predictor variables and no response measurement
- This can happen when measuring the response is expensive

## Regression Versus Classification Problems

- Variables
  - Quantitative: numerical values
  - Qualitative (categorical): values in one of $K$ different classes or categories
- Response variable is quantitative
  - Regression
- Response variable is qualitative
  - Classification

- Examples
  - **Least squares linear regression** is used with a quantitative response
  - **Logistic regression** is used with a qualitative (two-class/binary) response, so it is a classification method
  - however, since logistic regression estimates class probabilities, it can also be thought of as a regression method
  - **K-nearest neighbors** and **boosting** can be used with either quantitative or qualitative responses

# Assessing Model Accuracy

- No free lunch in statistics
- no one method dominates all others over all possible data sets

## Measuring the Quality of Fit

Regression: Mean Squared Error

$MSE = \frac{1}{n}\sum\limits_{i=1}^{n}[y_i - \hat{f}(x_i)]^2$

The MSE is small if the predicted responses are close to the actual responses, and large if they differ significantly

Training MSE: computed from the training data used to fit the model

Test MSE: computed from previously unseen test observations

Data simulated from $f$ (black); a linear regression fit (orange); two smoothing spline fits (blue and green curves). The right panel shows MSE against degrees of freedom: training MSE declines as flexibility increases, but test MSE first declines and then starts increasing

Degrees of Freedom

- a quantity that summarizes the flexibility of a curve
- restrictive methods have low degrees of freedom
- e.g. linear regression has $2$ degrees of freedom (slope and intercept)

The U-shaped curve in the test MSE is a fundamental property of statistical learning

When the true $f$ is close to linear, the test MSE decreases only slightly before increasing again. When the true $f$ is highly non-linear, training and test MSE both decrease with similar patterns until the test MSE starts increasing again. When no test data are available, we use cross-validation to estimate the test MSE
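
The training-versus-test MSE pattern can be reproduced with a small simulation, fitting polynomials of increasing degree to data from an assumed non-linear $f$ (all choices below, including the degrees and noise level, are illustrative):

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(4)

def true_f(x):
    return np.sin(6 * x)              # an assumed non-linear truth

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, true_f(x) + rng.normal(0, 0.3, n)

x_tr, y_tr = make_data(50)            # small training set
x_te, y_te = make_data(1000)          # large test set

mse_train, mse_test = {}, {}
for degree in (1, 4, 15):             # increasing flexibility
    p = Polynomial.fit(x_tr, y_tr, degree)
    mse_train[degree] = np.mean((y_tr - p(x_tr)) ** 2)
    mse_test[degree] = np.mean((y_te - p(x_te)) ** 2)
# training MSE keeps falling as the degree grows, while test MSE is far
# worse for the inflexible degree-1 fit than for a moderate degree
```

Here the polynomial degree plays the role of the degrees of freedom on the horizontal axis of the figure.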

## The Bias-Variance Trade-Off

- The U-shaped test MSE curve is the result of two competing statistical properties. It can be shown that
- expected test MSE at $x_0$ = variance of $\hat{f}(x_0)$ + squared bias of $\hat{f}(x_0)$ + variance of the error $\epsilon$
- $E[(y_0 - \hat{f}(x_0))^2] = Var[\hat{f}(x_0)] + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)$
- The LHS, the expected test MSE at $x_0$, refers to the average test MSE that we would obtain if we repeatedly estimated $f$ using a large number of training sets and tested each at $x_0$
- To reduce the error, we need a method with both low variance and low bias
  - the variance is non-negative
  - the squared bias is also non-negative
  - thus, the expected test MSE can never be less than $Var(\epsilon)$, the irreducible error

- Variance
- Amount by which $\hat{f}$ would change if different training dataset is used
- Ideally, $\hat{f}$ shouldn’t vary too much
- If a method has high variance, then
  - small changes in the training data set can result in large changes in $\hat{f}$
- More flexible methods have higher variance
  - changing a single data point may change $\hat{f}$ considerably
- Less flexible methods have lower variance
  - moving a single observation causes only a small shift in the fit

- Bias
  - refers to the error introduced by approximating an extremely complicated real-life problem by a much simpler model
  - e.g. linear regression will introduce some bias if the underlying relationship is truly non-linear, i.e. linear regression results in high bias
  - less flexible methods have higher bias
  - more flexible methods have lower bias

- General Rule
  - as we use more flexible methods, the variance will increase and the bias will decrease
- Whether the test MSE increases or decreases depends on the relative rate of change of the bias and the variance
- As flexibility increases
  - the bias tends to initially decrease faster than the variance increases
  - consequently, the expected test MSE declines
  - however, at some point increasing flexibility has little further impact on the bias but starts to significantly increase the variance
  - when this happens, the test MSE increases

Bias-variance trade-off: the vertical dotted line shows the flexibility level with the smallest test MSE
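
The decomposition can be checked by Monte Carlo: repeatedly draw a fresh training set, refit, and record $\hat{f}(x_0)$; the sample variance and squared bias of those predictions, plus $Var(\epsilon)$, approximate the expected test MSE. The true $f$, the point $x_0$, and the degree-2 model below are assumptions for illustration:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(5)

def true_f(x):
    return np.sin(3 * x)              # assumed truth, for illustration

x0, sigma, n_train, n_sims = 0.5, 0.3, 30, 2000
degree = 2                            # flexibility of the fitted model

# Refit the model on many independent training sets and record fhat(x0)
preds = np.empty(n_sims)
for s in range(n_sims):
    x = rng.uniform(0, 1, n_train)
    y = true_f(x) + rng.normal(0, sigma, n_train)
    preds[s] = Polynomial.fit(x, y, degree)(x0)

variance = preds.var()                          # Var[fhat(x0)]
bias_sq = (preds.mean() - true_f(x0)) ** 2      # [Bias(fhat(x0))]^2
expected_mse = variance + bias_sq + sigma ** 2  # + Var(epsilon)
```

Raising `degree` in this sketch trades bias for variance, tracing out the U-shaped curve.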

## Classification Setting

- Training Error Rate = $\frac{1}{n}\sum\limits_{i=1}^{n}I(y_i\ne\hat{y}_i)$
  - $I$ is an indicator variable that equals $1$ if $y_i\ne\hat{y}_i$ and $0$ otherwise

- Test Error Rate = $Ave[I(y_0\ne\hat{y}_0)]$
  - a good classifier is one for which the test error rate is smallest
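
The training error rate is just the average of the indicator $I(y_i \ne \hat{y}_i)$; a minimal sketch with made-up labels:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])   # made-up labels, for illustration
y_hat  = np.array([1, 1, 1, 0, 0])   # predictions from some classifier

# The indicator I(y_i != yhat_i) is 1 for a miss and 0 for a hit;
# averaging it gives the error rate
error_rate = np.mean(y_true != y_hat)
print(error_rate)  # 0.4: two of the five observations are misclassified
```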

### Bayes Classifier

…

### K-Nearest Neighbors (KNN)

Given a positive integer $K$ and a test observation $x_0$

- KNN identifies the $K$ points in the training data that are closest to $x_0$, represented by $\mathcal{N}_0$
- It then estimates the conditional probability for class $j$ as the fraction of points in $\mathcal{N}_0$ whose response values equal $j$:
$\Pr(Y=j \mid X=x_0) = \frac{1}{K}\sum\limits_{i\in \mathcal{N}_0} I(y_i = j)$

- Finally, KNN classifies the test observation $x_0$ to the class with the largest probability
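
The three steps above can be sketched directly with NumPy; the data set and query point are made up for illustration:

```python
import numpy as np

def knn_classify(X_train, y_train, x0, k):
    """Classify x0: estimate Pr(Y = j | X = x0) as the fraction of the
    k nearest training points with label j, then take the largest."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # distances to x0
    neighbors = y_train[np.argsort(dists)[:k]]     # labels of N_0
    classes, counts = np.unique(neighbors, return_counts=True)
    probs = dict(zip(classes.tolist(), (counts / k).tolist()))
    return classes[np.argmax(counts)], probs

# Tiny made-up data set: two clusters of points with labels 0 and 1
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [3.0, 3.0], [3.2, 2.9]])
y = np.array([0, 0, 0, 1, 1])

label, probs = knn_classify(X, y, np.array([1.1, 1.0]), k=3)
# all 3 nearest neighbors of (1.1, 1.0) have label 0, so label is 0
```

Varying `k` here controls the flexibility of the classifier, just as in the decision-boundary figures below.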

Applying KNN with $K=3$ at every possible point gives the KNN decision boundary

Comparison of $K=1$ and $K=100$; black curve is the KNN decision boundary

- When K = 1, the decision boundary is overly flexible.
- This corresponds to a classifier that has low bias but very high variance.

- As K grows, the method becomes less flexible and produces a decision boundary that is close to linear.
- This corresponds to a low-variance but high-bias classifier.

- On this simulated data set, neither $K = 1$ nor $K = 100$ gives good predictions: they have test error rates of 0.1695 and 0.1925, respectively.

KNN training error rate (blue) vs test error rate (orange), plotted against $1/K$ on a log scale

In both the regression and classification settings, choosing the correct level of flexibility is critical to the success of any statistical learning method. The bias-variance trade-off, and the resulting U-shape in the test error, can make this a difficult task.