# Statistical Learning


This lesson is from An Introduction to Statistical Learning

# What Is Statistical Learning?

• The Advertising data set consists of the sales of a product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper

• Sales as function of media budget for 200 different markets
• Develop an accurate model that can be used to predict sales on the basis of the three media budgets

• Input Variable, $X$
• $X_1, X_2, X_3$ for TV, Radio, Newspaper
• Predictors, Independent Variables, Features, Variables
• Sales
• Output Variable, $Y$
• Response, Dependent Variable
• In general, we assume quantitative response Y and p different predictors, $X_1, X_2,…, X_p$. We assume that there is some relationship between $Y$ and $X=X_1, X_2,…, X_p$
• $Y = f(X) + \epsilon$
• $f$ is a fixed but unknown function, and
• $\epsilon$ is a random error term, independent of $X$ and with mean $0$
• $f$ represents the systematic information that $X$ provides about $Y$
• Figures: red points are observed values; the blue curve or surface is the unknown $f$ (the true relationship between income, years of education, and seniority, known here because the data is simulated); black lines are the errors, positive or negative, with overall mean zero
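The model $Y = f(X) + \epsilon$ can be illustrated with a small simulation; the linear $f$ and noise level below are made-up assumptions for illustration, not the book's data.

```python
import random

random.seed(0)

# A hypothetical true relationship f (unknown in practice), e.g.
# sales as a function of TV budget.
def f(x):
    return 5.0 + 0.05 * x

n = 200
X = [random.uniform(0, 300) for _ in range(n)]   # hypothetical TV budgets
Y = [f(x) + random.gauss(0, 1) for x in X]       # Y = f(X) + eps

# The errors eps = Y - f(X) have mean zero, so they roughly
# average out over many observations.
mean_error = sum(y - f(x) for x, y in zip(X, Y)) / n
```

The individual errors are not small, but their average across the 200 markets is close to zero, which is exactly the assumption $E(\epsilon) = 0$.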

## Why Estimate $f$

• Prediction
• Inference

### Prediction

• Since the error term averages to zero, we can predict $Y$ using $\hat{Y} = \hat{f}(X)$
• Example
• $X_1,…,X_p$ are characteristics of a patient’s blood sample, and $Y$ encodes the risk of a severe adverse reaction to a particular drug
• Accuracy of $\hat{Y}$ as prediction of $Y$ depends on reducible error and irreducible error.
• Reducible Error
• Inaccuracy because $\hat{f}$ is not a perfect estimate of $f$
• This error can be reduced by using an appropriate technique to estimate $f$
• With a perfect estimate, the prediction would be $\hat{Y} = f(X)$
• Irreducible Error
• $Y$ is also a function of $\epsilon$ which cannot be determined by $X$
• e.g. risk of adverse reaction might vary for a given patient on a given day, depend on manufacturing variation in the drug or the patient’s general feeling on that day
• Average or Expected Value of squared difference between predicted and actual, $E(Y-\hat{Y})^2$
• $\hat{Y} = \hat{f}(X)$
• Assume $\hat{f}$ and $X$ are fixed, so the only variability comes from $\epsilon$
• $E(Y-\hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2$ since $Y = f(X) +\epsilon$ and $\hat{Y} = \hat{f}(X)$
• It can then be shown that
• $E(Y-\hat{Y})^2 = [f(X) - \hat{f}(X)]^2 + Var(\epsilon)$
• Reducible Error: $[f(X) - \hat{f}(X)]^2$
• Irreducible Error: $Var(\epsilon)$
• Our objective is to reduce the reducible error
• Irreducible error will always provide an upper bound on accuracy of prediction
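This decomposition can be checked numerically at a fixed point $x_0$; the true $f$, the fixed imperfect $\hat{f}$, and the noise level below are all assumptions chosen for illustration.

```python
import random

random.seed(1)

# Numerical check of E(Y - Yhat)^2 = [f(x0) - fhat(x0)]^2 + Var(eps)
# at a fixed point x0, with a hypothetical f and a fixed, imperfect fhat.
f     = lambda x: 2.0 * x      # true function (unknown in practice)
f_hat = lambda x: 1.8 * x      # our fixed, imperfect estimate
x0, sigma = 3.0, 1.0           # Var(eps) = sigma ** 2

trials = 200_000
total = 0.0
for _ in range(trials):
    y = f(x0) + random.gauss(0, sigma)     # draw Y = f(x0) + eps
    total += (y - f_hat(x0)) ** 2
mse = total / trials

reducible   = (f(x0) - f_hat(x0)) ** 2     # (6.0 - 5.4)^2 = 0.36
irreducible = sigma ** 2                   # Var(eps) = 1.0
# mse should be close to reducible + irreducible = 1.36
```

However well we choose $\hat{f}$, the simulated MSE never falls below $Var(\epsilon)$, the irreducible error.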

### Inference

• We want to infer how the output is generated as a function of the data.
• Which predictors are associated with the response?
• What is the relationship between the response and each predictor - it may be positive or negative
• Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

### Examples

• Direct Marketing Campaign
• Identify individuals likely to respond positively based on demographic variables
• We are not interested in understanding the relationship between predictor variables and response
• Prediction Problem
• Sales and advertising budget for different markets
• Response is sales and predictor variables are advertising budget for different categories
• We are interested in knowing the relationship:
• Which media are associated with sales?
• Which media generate the biggest boost in sales? or
• How large of an increase in sales is associated with a given increase in TV advertising?
• Inference Problem
• Different methods may be appropriate for prediction versus inference
• Linear Models
• relatively simple and allow interpretable inference
• less accurate prediction
• Complex Models
• involve highly non-linear approaches
• more accurate prediction
• less interpretable, so inference is challenging

## How do we estimate $f$

• Linear and Non-Linear approaches for estimating $f$
• Suppose we have some observations - the training data
• $x_{ij}$
• $j^{th}$ predictor/input for $i^{th}$ observation
• $i=1,2,…,n$ and $j=1,2,…,p$
• $y_i$
• response variable for $i^{th}$ observation
• Training Data
• $\{(x_1, y_1), (x_2, y_2),…,(x_n, y_n)\}$
• $x_1 = (x_{11},x_{12},…,x_{1p})^T$
• $x_2 = (x_{21},x_{22},…,x_{2p})^T$
• $x_i = (x_{i1},x_{i2},…,x_{ip})^T$
• Objective of Statistical Learning Method
• Find $\hat{f}$ such that $Y \approx \hat{f}(X)$
• Two approaches
• Parametric
• Non-parametric

### Parametric Methods

• First Step: Make an assumption about the functional form of $f$

• e.g. linear
• $f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + … + \beta_pX_p$
• We need to estimate $p+1$ coefficients $\beta_0, \beta_1,…, \beta_p$
• Second Step: fit/train the model

• Estimate $\beta_0, \beta_1,…, \beta_p$ such that $Y \approx \beta_0 + \beta_1X_1 + \beta_2X_2 + … + \beta_pX_p$
• e.g. Least Squares Approach
• Figure: observations in red; the fitted linear plane in yellow
• This is called parametric since the problem of determining $f$ has been reduced to estimating a set of parameters

• The disadvantage is that the assumed model may not match the true form of $f$

• We can try choosing more flexible models that can fit many different functional forms

• However, this increases the number of parameters and may lead to overfitting the data

• https://medium.com/@cs.sabaribalaji/overfitting-6c1cd9af589
• $income \approx \beta_0 + \beta_1 \times education + \beta_2 \times seniority$

• Linear fit
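The two parametric steps can be sketched for a single predictor using the closed-form least squares solution; the true coefficients and noise level below are made-up assumptions, not the book's income data.

```python
import random

random.seed(2)

# Step 1: assume a linear form f(X) = beta0 + beta1 * X (one predictor
# for simplicity). Step 2: estimate the coefficients by least squares.
true_b0, true_b1 = 1.0, 0.5
X = [random.uniform(0, 10) for _ in range(500)]
Y = [true_b0 + true_b1 * x + random.gauss(0, 0.3) for x in X]

# Closed-form least squares estimates for simple linear regression.
n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
      / sum((x - x_bar) ** 2 for x in X))
b0 = y_bar - b1 * x_bar
# b0 and b1 should land close to the true 1.0 and 0.5
```

The whole problem of estimating $f$ has collapsed to estimating two numbers, which is exactly what makes the parametric approach simple.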

### Non-Parametric Methods

• No assumptions about functional form of $f$

• Estimate $f$ as close to the data points as possible; however, a large number of observations is needed to obtain an accurate estimate of $f$

• Figures: the actual $f$; a smooth thin-plate spline used to estimate $f$; a rougher thin-plate spline that fits the training data perfectly but overfits
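The book illustrates this with thin-plate splines; as a simpler non-parametric sketch, $k$-nearest-neighbours regression below estimates $f$ at a point by averaging nearby responses, assuming nothing about the functional form (the quadratic truth and noise level are toy assumptions).

```python
import random

random.seed(3)

# k-nearest-neighbours regression: the estimate at x0 is the average
# response of the k closest training points, with no assumed form for f.
def knn_regress(x0, X, Y, k):
    nearest = sorted(range(len(X)), key=lambda i: abs(X[i] - x0))[:k]
    return sum(Y[i] for i in nearest) / k

f = lambda x: x * x                      # hypothetical non-linear truth
X = [random.uniform(-2, 2) for _ in range(1000)]
Y = [f(x) + random.gauss(0, 0.1) for x in X]

# With plenty of data, the estimate tracks f closely despite
# assuming nothing about its shape.
pred = knn_regress(1.0, X, Y, k=20)      # should be near f(1.0) = 1.0
```

Note how much data this takes: with only a handful of observations the 20 "nearest" neighbours would be far from $x_0$ and the estimate would be poor, which is the stated cost of non-parametric methods.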

## The Trade-Off Between Prediction Accuracy and Model Interpretability

• Some methods are more restrictive (less flexible) than others, e.g. linear regression is less flexible while thin-plate splines are highly flexible
• If we are interested in inference, then restrictive models are preferred for being more interpretable, e.g. linear regression is inflexible but more interpretable

## Supervised Versus Unsupervised Learning

• Supervised Learning

• For each observation of the predictor measurement(s) $x_i, i = 1, . . . ,n$ there is an associated response measurement $y_i$.
• Unsupervised Learning

• For every observation $i = 1, . . . ,n$, we observe a vector of measurements $x_i$ but no associated response $y_i$.

• We can seek to understand the relationships between the variables or between the observations - Cluster Analysis

• Figure: plot of $X_1$ vs $X_2$ with three customer groups, which may be easily separable (left) or challenging to separate (right)
• Market Segmentation Analysis
• Big Spenders vs Low Spenders
• Semi-supervised Learning

• In a set of $n$ observations, we have both predictor and response measurements for $m$ observations, where $m < n$
• For the remaining $n-m$ observations, we have only predictor measurements but no response
• This can happen when the response is expensive to measure

## Regression Versus Classification Problems

• Variables
• Quantitative
• numeric values
• Qualitative or Categorical
• values in one of $K$ different classes or categories
• Response Variable is Quantitative
• Regression
• Response Variable is Qualitative
• Classification
• Examples
• Least squares linear regression is used with a quantitative response
• Logistic regression is used with a qualitative (two-class/binary) response, so it is a classification method
• However, since logistic regression estimates class probabilities, it can also be thought of as a regression method
• K-nearest neighbors and boosting
• can be used with quantitative or qualitative responses

# Assessing Model Accuracy

• No free lunch in statistics
• no one method dominates all others over all possible data sets

## Measuring the Quality of Fit

• Regression - Mean Squared Error

• $MSE = \frac{1}{n}\sum\limits_{i=1}^{n}[y_i - \hat{f}(x_i)]^2$

• small if the predicted responses are close to the actual responses, and large if they differ significantly

• Training MSE

• Test MSE

• Figure: data simulated from $f$, shown in black; linear regression fit (orange); two smoothing spline fits (blue and green curves); the right panel shows MSE against degrees of freedom
Training MSE declines as flexibility increases, but test MSE first declines and then starts increasing
• Degrees of Freedom

• quantity that summarizes the flexibility of a curve
• Low degrees of freedom if restrictive
• Linear regression has $2$ degrees of freedom (slope and intercept)
• The U-shaped curve in the test MSE is a fundamental property of statistical learning

• When the true $f$ is nearly linear, test MSE decreases only slightly before increasing again; when the true $f$ is highly non-linear, training and test MSE both decrease with similar patterns, but test MSE then starts increasing again
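The training/test MSE pattern can be reproduced in a toy experiment, here using $k$-nearest-neighbours regression as the learning method, where smaller $k$ means more flexibility (the sinusoidal $f$ and noise level are assumptions, not the book's simulations).

```python
import math
import random

random.seed(4)

# Training vs test MSE as flexibility varies, sketched with k-NN
# regression (smaller k = more flexible).
def knn_regress(x0, X, Y, k):
    nearest = sorted(range(len(X)), key=lambda i: abs(X[i] - x0))[:k]
    return sum(Y[i] for i in nearest) / k

def mse(X_eval, Y_eval, X_train, Y_train, k):
    sq = [(y - knn_regress(x, X_train, Y_train, k)) ** 2
          for x, y in zip(X_eval, Y_eval)]
    return sum(sq) / len(sq)

f = lambda x: math.sin(x)
X_train = [random.uniform(0, 6) for _ in range(200)]
Y_train = [f(x) + random.gauss(0, 0.3) for x in X_train]
X_test  = [random.uniform(0, 6) for _ in range(1000)]
Y_test  = [f(x) + random.gauss(0, 0.3) for x in X_test]

# k = 1 memorises the training data: each point is its own nearest
# neighbour, so the training MSE is exactly zero ...
train_mse_k1 = mse(X_train, Y_train, X_train, Y_train, 1)
# ... but its test MSE is worse than that of a moderately flexible fit.
test_mse_k1  = mse(X_test, Y_test, X_train, Y_train, 1)
test_mse_k10 = mse(X_test, Y_test, X_train, Y_train, 10)
```

A zero training MSE with a larger test MSE than a less flexible fit is overfitting in miniature: the flexible method is chasing the noise $\epsilon$ rather than $f$.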

We use cross-validation to estimate the test MSE when no test data is available

• The U-shaped test MSE curve is the result of two competing statistical properties. It can be shown that
• expected test MSE at $x_0$ = variance of $\hat{f}(x_0)$ + squared bias of $\hat{f}(x_0)$ + variance of the error $\epsilon$
• $E[(y_0 - \hat{f}(x_0))^2] = Var[\hat{f}(x_0)] + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)$
• LHS, expected test MSE at $x_0$, refers to average test MSE that we would obtain if we repeatedly estimated $f$ using a large number of training sets, and tested each at $x_0$
• To reduce the error, we need to have low variance and low bias
• Variance is non-negative
• Bias squared is also non-negative
• Thus, the average test MSE can’t be less than $Var(\epsilon)$, the irreducible error
• Variance
• Amount by which $\hat{f}$ would change if a different training dataset were used
• Ideally, $\hat{f}$ shouldn’t vary too much between training sets
• If a method has high variance, then
• small changes in the training dataset can result in large changes in $\hat{f}$
• More flexible methods have higher variance
• since moving a single data point may change $\hat{f}$ significantly
• Less flexible methods have lower variance
• since moving one observation causes only a small shift
• Bias
• refers to the error introduced by approximating an extremely complicated real-life problem by a much simpler model
• e.g. linear regression will introduce some bias if the underlying relationship is truly non-linear, i.e. linear regression results in high bias
• Less flexible methods will have high bias
• More flexible methods will have low bias
• General Rule
• With more flexible methods
• the variance will increase and the bias will decrease
• Whether the test MSE increases or decreases depends on
• the relative rate of change of bias and variance
• As flexibility is increased
• the bias tends to initially decrease faster than the variance increases.
• Consequently, the expected test MSE declines.
• However, at some point increasing flexibility has little impact on the bias
• but starts to significantly increase the variance.
• When this happens the test MSE increases
• Figure: the vertical dotted line shows the flexibility level with the smallest test MSE
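The variance half of the trade-off can be estimated directly by refitting on many training sets, again sketched with $k$-NN regression (flexible $k=1$ vs inflexible $k=50$; the linear $f$, noise level, and evaluation point are toy assumptions).

```python
import random

random.seed(5)

# Estimate Var(f_hat(x0)) across many training sets for a flexible
# method (k = 1) and an inflexible one (k = 50).
def knn_regress(x0, X, Y, k):
    nearest = sorted(range(len(X)), key=lambda i: abs(X[i] - x0))[:k]
    return sum(Y[i] for i in nearest) / k

f, sigma, x0 = (lambda x: x), 1.0, 0.5
preds = {1: [], 50: []}
for _ in range(300):                     # 300 independent training sets
    X = [random.uniform(0, 1) for _ in range(100)]
    Y = [f(x) + random.gauss(0, sigma) for x in X]
    for k in preds:
        preds[k].append(knn_regress(x0, X, Y, k))

def variance(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

# k = 1 tracks a single noisy point, so its predictions swing far more
# from one training set to the next than the k = 50 average does.
```

The flip side, not measured here, is that the $k=50$ average would pick up substantial bias if $f$ were strongly non-linear near $x_0$.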

## Classification Setting

• Training Error Rate = $\frac{1}{n}\sum\limits_{i=1}^{n}I(y_i\ne\hat{y_i})$
• $I$ is an indicator variable that equals $1$ if $y_i \ne \hat{y}_i$ and $0$ otherwise
• Test Error Rate = $Ave [I(y_0\ne\hat{y_0})]$
• A good classifier is one for which the test error is smallest
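The error rate is just the average of the indicator $I(y_i \ne \hat{y}_i)$; a tiny sketch with hypothetical labels:

```python
# The training error rate averages the indicator I(y_i != yhat_i),
# which is 1 for a misclassified observation and 0 otherwise.
# The labels below are hypothetical.
y_true = ["yes", "no", "no", "yes", "no"]
y_hat  = ["yes", "no", "yes", "yes", "no"]

error_rate = sum(yt != yp for yt, yp in zip(y_true, y_hat)) / len(y_true)
# one mistake out of five observations gives an error rate of 0.2
```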

### K-Nearest Neighbors (KNN)

• Given a positive integer $K$ and a test observation $x_0$

• KNN identifies the $K$ points in the training data that are closest to $x_0$, represented by $\mathcal{N}_0$
• It then estimates the conditional probability for class $j$ as the fraction of points in $\mathcal{N}_0$ whose response values equal $j$
• $Pr(Y=j \mid X=x_0) = \frac{1}{K}\sum\limits_{i\in \mathcal{N}_0} I(y_i = j)$
• Finally, KNN classifies the test observation $x_0$ to the class with the largest estimated probability
• Figures: applying KNN with $K=3$ to all possible points gives the decision boundary; comparison of $K=1$ and $K=100$, with the KNN decision boundary in black
• When K = 1, the decision boundary is overly flexible.
• This corresponds to a classifier that has low bias but very high variance.
• As K grows, the method becomes less flexible and produces a decision boundary that is close to linear.
• This corresponds to a low-variance but high-bias classifier.
• On this simulated data set, neither $K = 1$ nor $K = 100$ gives good predictions: they have test error rates of 0.1695 and 0.1925, respectively
• Figure: KNN training error rate (blue) vs test error rate (orange), plotted against $1/K$ on a log scale
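A minimal from-scratch KNN classifier along these lines (the two-class 2-D data below is a toy assumption, not the book's simulated example):

```python
import random
from collections import Counter

random.seed(6)

def knn_classify(x0, X, y, k):
    # squared Euclidean distance from a training point to x0
    dist = lambda a: sum((ai - bi) ** 2 for ai, bi in zip(a, x0))
    # indices of the K nearest neighbours (the set N_0)
    neighbours = sorted(range(len(X)), key=lambda i: dist(X[i]))[:k]
    # estimated Pr(Y = j | X = x0) is the fraction of neighbours in
    # class j; classify to the class with the largest probability.
    votes = Counter(y[i] for i in neighbours)
    return votes.most_common(1)[0][0]

# Two well-separated classes centred at (0, 0) and (3, 3).
X = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(50)] + \
    [(random.gauss(3, 0.5), random.gauss(3, 0.5)) for _ in range(50)]
y = ["A"] * 50 + ["B"] * 50

pred_near_a = knn_classify((0.2, -0.1), X, y, k=3)   # expect class "A"
pred_near_b = knn_classify((2.9, 3.1), X, y, k=3)    # expect class "B"
```

Varying `k` here plays the same role as in the figures: `k=1` draws a jagged, high-variance boundary around individual points, while a very large `k` smooths the boundary toward a nearly linear, high-bias one.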

In both the regression and classification settings,

• choosing the correct level of flexibility is critical to the success of any statistical learning method.

The bias-variance tradeoff, and the resulting U-shape in the test error, can make this a difficult task.
