Statistical Learning

11 minute read


This lesson is from An Introduction to Statistical Learning

What Is Statistical Learning?

  • The Advertising data set consists of the sales of a product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper

    • Sales as function of media budget for 200 different markets
  • Develop an accurate model that can be used to predict sales on the basis of the three media budgets

    • Advertising Budget
      • Input Variable, $X$
        • $X_1, X_2, X_3$ for TV, Radio, Newspaper
      • Predictors, Independent Variables, Features, Variables
    • Sales
      • Output Variable, $Y$
      • Response, Dependent Variable
    • In general, we assume a quantitative response $Y$ and $p$ different predictors, $X_1, X_2,…, X_p$. We assume that there is some relationship between $Y$ and $X=(X_1, X_2,…, X_p)$
      • $Y = f(X) + \epsilon$
        • $f$ is a fixed but unknown function
        • $\epsilon$ is a random error term, independent of $X$, with mean $0$
        • $f$ represents the systematic information that $X$ provides about $Y$
Red: Observed values; Blue curve: unknown function $f$; Black lines: error (+ve or -ve with overall mean zero)
Red: observed values for some individuals; Blue surface: true relationship between income, years of education, seniority (since data is simulated)
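A minimal simulation makes the model $Y = f(X) + \epsilon$ concrete; the particular $f$ and error distribution below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The fixed but unknown systematic part (an assumed toy function)
    return 3 + 2 * x

x = rng.uniform(0, 10, size=200)    # e.g. budgets in 200 markets
eps = rng.normal(0, 1, size=200)    # random error: mean 0, independent of X
y = f(x) + eps                      # the observed response

# The error averages out to roughly zero across observations
mean_eps = eps.mean()
print(round(mean_eps, 2))
```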

Why Estimate $f$?

  • Prediction
  • Inference


  • Since error term averages to zero, $\hat{Y} = \hat{f}(X)$
  • Example
    • $X_1,…,X_p$ - characteristics of a patient’s blood sample, and $Y$ encodes the risk of a severe adverse reaction to a particular drug
  • The accuracy of $\hat{Y}$ as a prediction of $Y$ depends on the reducible error and the irreducible error.
  • Reducible Error
    • Inaccuracy because $\hat{f}$ is not a perfect estimate of $f$
    • This error can be reduced by using a more appropriate technique to estimate $f$
      • With a perfect estimate, $\hat{Y} = f(X)$
  • Irreducible Error
    • $Y$ is also a function of $\epsilon$ which cannot be determined by $X$
      • e.g. risk of adverse reaction might vary for a given patient on a given day, depend on manufacturing variation in the drug or the patient’s general feeling on that day
  • The average or expected value of the squared difference between the predicted and actual values, $E(Y-\hat{Y})^2$
    • $\hat{Y} = \hat{f}(X)$
      • Assume $\hat{f}$ and $X$ are fixed
      • Variability comes from $\epsilon$
    • $E(Y-\hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2$ since $Y = f(X) +\epsilon$ and $\hat{Y} = \hat{f}(X)$
    • Now, it can be shown that
      • $E(Y-\hat{Y})^2 = [f(X) - \hat{f}(X)]^2 + Var(\epsilon)$
      • Reducible Error: $[f(X) - \hat{f}(X)]^2$
      • Irreducible Error: $Var(\epsilon)$
  • Our objective is to reduce the reducible error
  • Irreducible error will always provide an upper bound on accuracy of prediction
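The decomposition above can be checked numerically. This sketch assumes a toy $f$ and a deliberately imperfect $\hat{f}$ so that both error terms are visible:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):     return 3 + 2 * x    # true function (assumed for the demo)
def f_hat(x): return 2.5 + 2 * x  # imperfect estimate -> reducible error

x0 = 4.0          # hold X fixed
sigma = 1.0       # sd of eps, so Var(eps) = 1.0 (irreducible error)

# Average the squared prediction error over many draws of eps
eps = rng.normal(0.0, sigma, size=200_000)
y = f(x0) + eps
mse = np.mean((y - f_hat(x0)) ** 2)

reducible = (f(x0) - f_hat(x0)) ** 2   # [f(X) - fhat(X)]^2 = 0.25
irreducible = sigma ** 2               # Var(eps) = 1.0
print(round(mse, 2), reducible + irreducible)
```

No matter how good $\hat{f}$ gets, the simulated MSE never drops below $Var(\epsilon)$.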


  • We want to infer how the output is generated as a function of the data.
    • Which predictors are associated with the response?
    • What is the relationship between the response and each predictor - it may be positive or negative
    • Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?


  • Direct Marketing Campaign
    • Identify individuals likely to respond positively based on demographic variables
    • We are not interested in understanding the relationship between predictor variables and response
    • Prediction Problem
  • Advertising Budget
    • Sales and advertising budget for different markets
    • Response is sales and predictor variables are advertising budget for different categories
    • We are interested in knowing the relationship:
      • Which media are associated with sales?
      • Which media generate the biggest boost in sales? or
      • How large of an increase in sales is associated with a given increase in TV advertising?
    • Inference Problem
  • Different methods for prediction and inference
    • different methods may be appropriate for each goal
    • Linear Models
      • relatively simple and interpretable inference
      • less accurate prediction
    • Complex Models
      • involve highly non-linear approaches
      • more accurate estimates
      • less interpretable, so inference is challenging

How Do We Estimate $f$?

  • Linear and Non-Linear approaches for estimating $f$
  • Suppose we have some observations - training data
  • $x_{ij}$
    • $j^{th}$ predictor/input for $i^{th}$ observation
    • $i=1,2,…,n$ and $j=1,2,…,p$
  • $y_i$
    • response variable for $i^{th}$ observation
  • Training Data
    • $\{(x_1, y_1), (x_2, y_2),…,(x_n, y_n)\}$
      • $x_1 = (x_{11},x_{12},…,x_{1p})^T$
      • $x_2 = (x_{21},x_{22},…,x_{2p})^T$
      • $x_i = (x_{i1},x_{i2},…,x_{ip})^T$
  • Objective of Statistical Learning Method
    • Find $\hat{f}$ such that $Y \approx \hat{f}(X)$
    • Two approaches
      • Parametric
      • Non-parametric
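In code, this notation maps naturally onto an $n \times p$ array; the numbers below are arbitrary:

```python
import numpy as np

n, p = 5, 3                                       # tiny made-up sizes
X = np.arange(n * p, dtype=float).reshape(n, p)   # row i holds x_i = (x_i1, ..., x_ip)
y = np.zeros(n)                                   # y_i: one response per observation

x_2 = X[1]   # the vector x_2 (rows are zero-indexed), i.e. (x_21, x_22, x_23)
print(X.shape, x_2)
```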

Parametric Methods

  • First Step: Make an assumption about the functional form of $f$

    • e.g. linear
      • $f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + … + \beta_pX_p$
    • We need to estimate $p+1$ coefficients $\beta_0, \beta_1,…, \beta_p$
  • Second Step: fit/train the model

    • Estimate $\beta_0, \beta_1,…, \beta_p$ such that $Y \approx \beta_0 + \beta_1X_1 + \beta_2X_2 + … + \beta_pX_p$
    • e.g. Least Squares Approach
  • Observations in Red and Linear plane that fits the data in yellow
  • This is called parametric since the problem of determining $f$ has been reduced to estimating a set of parameters

  • The disadvantage is that the assumed model may not match the true form of $f$

    • We can try choosing flexible models that can fit many different functional forms

    • However, this increases the number of parameters and may lead to overfitting the data

  • $ income \approx \beta_0 + \beta_1 \times education + \beta_2 \times seniority$

  • Linear fit

Non-Parametric Methods

  • No assumptions about functional form of $f$

  • Estimate $f$ as close to the data points as possible; however, a large number of data points is needed to obtain an accurate estimate of $f$

  • A thin-plate spline can be used to estimate $f$; with too little smoothing it follows the data too closely and overfits
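As a stand-in for a spline, here is perhaps the simplest non-parametric estimate: average the responses of the $k$ training points nearest to $x_0$. No functional form for $f$ is assumed; the data is simulated:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.2, 200)   # true f(x) = sin(x), unknown to the method

def f_hat(x0, k=10):
    # Average the responses of the k nearest training points
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()

print(round(f_hat(np.pi / 2), 2))   # the true value is sin(pi/2) = 1
```

Shrinking $k$ makes the fit more flexible (it hugs the noise), which is exactly the overfitting risk mentioned above.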

The Trade-Off Between Prediction Accuracy and Model Interpretability

  • Some methods are more restrictive or less flexible, e.g. linear regression is less flexible but thin plate splines are highly flexible
  • If we are interested in inference, then restrictive models are preferred for being more interpretable, e.g. linear regression is inflexible or restrictive but more interpretable

Supervised Versus Unsupervised Learning

  • Supervised Learning

    • For each observation of the predictor measurement(s) $x_i, i = 1, . . . ,n$ there is an associated response measurement $y_i$.
  • Unsupervised Learning

    • For every observation $i = 1, . . . ,n$, we observe a vector of measurements $x_i$ but no associated response $y_i$.

    • We can seek to understand the relationships between the variables or between the observations - Cluster Analysis

    • $X_1$ vs $X_2$ may be easily separable (left) or challenging (right) - three customer groups
    • Market Segmentation Analysis
      • Big Spenders vs Low Spenders
  • Semi-supervised Learning

    • In a set of $n$ observations, we have a set of $m$ observations with both predictor and response variables, where $m < n$
    • For the remaining $n-m$ observations, we have only predictor variables but no response measurement
    • This can happen when the response measurement is expensive to obtain
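The cluster analysis used for market segmentation can be sketched with a tiny k-means loop; the two spending groups below are simulated blobs, not real customer data:

```python
import numpy as np

rng = np.random.default_rng(4)
big = rng.normal([8.0, 8.0], 0.5, size=(50, 2))   # big spenders (simulated)
low = rng.normal([2.0, 2.0], 0.5, size=(50, 2))   # low spenders (simulated)
X = np.vstack([big, low])

centers = X[[0, 50]].copy()       # seed one centre from each group
for _ in range(10):               # Lloyd's algorithm iterations
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)             # assign points to nearest centre
    centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])

print(centers.round(1))   # should sit near the two group means
```

Note that no response $y_i$ is ever used: the groups emerge from the predictors alone.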

Regression Versus Classification Problems

  • Variables
    • Quantitative
      • numeric values
    • Qualitative or Categorical
      • values in one of $K$ different classes or categories
  • Response Variable is Quantitative
    • Regression
  • Response Variable is Qualitative
    • Classification
  • Examples
    • Least squares linear regression is used with a quantitative response
    • Logistic regression is used with a qualitative (two-class/binary) response - a classification method :-)
      • However, since logistic regression estimates class probabilities, it can also be thought of as a regression method
    • K-nearest neighbors and Boosting
      • can be used with quantitative or qualitative responses

Assessing Model Accuracy

  • No free lunch in statistics
    • no one method dominates all others over all possible data sets

Measuring the Quality of Fit

  • Regression - Mean Squared Error

    • $MSE = \frac{1}{n}\sum\limits_{i=1}^{n}[y_i - \hat{f}(x_i)]^2$

    • small if the predicted responses are close to the actual responses, and large if they differ significantly

    • Training MSE

    • Test MSE

    • Data simulated from $f$, shown in black; linear regression (orange); two smoothing spline fits (blue and green curves); right panel shows MSE against degrees of freedom
      Training MSE declines as flexibility increases, but test MSE first declines and then starts increasing
    • Degrees of Freedom

      • quantity that summarizes the flexibility of a curve
      • Low degrees of freedom if restrictive
        • Linear regression has $2$ degrees of freedom (slope and intercept)
    • The U-shaped curve in the test MSE is a fundamental property of statistical learning

    • When the true $f$ is close to linear, test MSE decreases only slightly before increasing again
      When the true $f$ is highly non-linear, training and test MSE both decrease with similar patterns, but test MSE then starts increasing again

      We use Cross-Validation when no test data is available
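The U-shape can be reproduced by fitting polynomials of increasing degree (a proxy for flexibility) to simulated non-linear data and scoring them on a separate test sample:

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate(n):
    x = rng.uniform(-3, 3, n)
    return x, np.sin(x) + rng.normal(0, 0.3, n)   # true f(x) = sin(x)

x_tr, y_tr = simulate(50)       # small training set
x_te, y_te = simulate(1000)     # held-out test set

train_mse, test_mse = [], []
for degree in range(1, 11):     # flexibility grows with the degree
    coefs = np.polyfit(x_tr, y_tr, degree)
    train_mse.append(np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2))
    test_mse.append(np.mean((y_te - np.polyval(coefs, x_te)) ** 2))

best = np.argmin(test_mse) + 1  # degree with the smallest test MSE
print(best)
```

Training MSE falls monotonically with degree, while test MSE bottoms out at an intermediate degree, mirroring the figure described above.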

The Bias-Variance Trade-Off

  • The U-shaped test MSE curve is the result of two competing statistical properties. It can be shown that
    • expected test MSE for $x_0$= Variance of $\hat{f}(x_0)$ + Squared bias of $\hat{f}(x_0)$ + Variance of error $\epsilon$
    • $E[(y_0 - \hat{f}(x_0))^2] = Var[\hat{f}(x_0)] + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)$
    • LHS, expected test MSE at $x_0$, refers to average test MSE that we would obtain if we repeatedly estimated $f$ using a large number of training sets, and tested each at $x_0$
    • To reduce the error, we need to have low variance and low bias
      • Variance is non-negative
      • Bias squared is also non-negative
      • Thus, the average test MSE can’t be less than $Var(\epsilon)$, which is the irreducible error
  • Variance
    • Amount by which $\hat{f}$ would change if different training dataset is used
    • Ideally, $\hat{f}$ shouldn’t vary too much
    • If a method has high variance, then
      • small changes in training dataset can result large change in $\hat{f}$
    • More flexible methods will have high variance
      • since changing one of the datapoint may cause change in $\hat{f}$ significantly
    • Less flexible method will have low variance
      • moving one observation may cause a small shift
  • Bias
    • refers to the error that is introduced by approximating an extremely complicated real-life problem by a much simpler model
    • e.g. linear regression will introduce some bias if the underlying data is truly non-linear, i.e. linear regression results in high bias
    • Less flexible methods will have high bias
    • More flexible methods will have low bias
  • General Rule
    • High flexible methods
      • the variance will increase and the bias will decrease
  • Whether the test MSE increases or decreases depends on
    • the relative rate of change of bias and variance
    • As flexibility is increased
      • the bias tends to initially decrease faster than the variance increases.
      • Consequently, the expected test MSE declines.
      • However, at some point increasing flexibility has little impact on the bias
        • but starts to significantly increase the variance.
      • When this happens the test MSE increases.
Bias Variance Trade-off
Vertical dotted line shows flexibility level for smallest MSE
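The decomposition can be estimated directly: refit the same model on many simulated training sets and measure how $\hat{f}(x_0)$ varies and how far its average sits from $f(x_0)$. The true $f$ and noise level below are assumptions of the demo:

```python
import numpy as np

rng = np.random.default_rng(6)
f = np.sin        # assumed true f
sigma = 0.3       # sd of eps
x0 = 1.0          # the test point

preds = []
for _ in range(500):                      # many independent training sets
    x = rng.uniform(-3, 3, 50)
    y = f(x) + rng.normal(0, sigma, 50)
    coefs = np.polyfit(x, y, 3)           # cubic fit: moderate flexibility
    preds.append(np.polyval(coefs, x0))

preds = np.array(preds)
variance = preds.var()                    # Var(fhat(x0))
bias_sq = (preds.mean() - f(x0)) ** 2     # [Bias(fhat(x0))]^2
# Expected test MSE at x0 is roughly variance + bias_sq + sigma**2
print(round(variance, 3), round(bias_sq, 3))
```

Raising the polynomial degree inflates `variance`; dropping to a linear fit inflates `bias_sq`, which is the trade-off in miniature.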

Classification Setting

  • Training Error Rate = $\frac{1}{n}\sum\limits_{i=1}^{n}I(y_i\ne\hat{y_i})$
    • $I$ is an indicator variable that equals $1$ if $y_i\ne\hat{y}_i$ and $0$ otherwise
  • Test Error Rate = $Ave[I(y_0\ne\hat{y}_0)]$
  • A good classifier is one for which the test error is smallest
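The error rate is a one-liner in code: the indicator $I(y_i\ne\hat{y}_i)$ is a boolean comparison, and its average is the rate. The labels here are made up:

```python
import numpy as np

y = np.array([0, 1, 1, 0, 1])      # true classes
y_hat = np.array([0, 1, 0, 0, 0])  # predicted classes

# (1/n) * sum of I(y_i != yhat_i): the mean of the boolean mismatches
error_rate = np.mean(y != y_hat)
print(error_rate)                  # 2 of 5 disagree -> 0.4
```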

Bayes Classifier

  • Assigns each test observation to the most likely class given its predictor values, i.e. the class $j$ for which $Pr(Y=j\mid X=x_0)$ is largest
  • Achieves the lowest possible test error rate, called the Bayes error rate

K-Nearest Neighbors (KNN)

  • Given a positive integer $K$ and a test observation $x_0$

    • KNN identifies the $K$ points in the training data that are closest to $x_0$, represented by $\mathcal{N}_0$
    • It then estimates the conditional probability for class $j$ as the fraction of points in $\mathcal{N}_0$ whose response values equal $j$
      • $Pr(Y=j\mid X=x_0) = \frac{1}{K}\sum\limits_{i\in \mathcal{N}_0} I(y_i = j)$
    • Finally, KNN classifies the test observation $x_0$ to the class with the largest probability
    Applying KNN with $K=3$ to all possible points gives the decision boundary
Comparison of $K=1$ and $K=100$; black is KNN decision boundary
  • When K = 1, the decision boundary is overly flexible.
    • This corresponds to a classifier that has low bias but very high variance.
  • As K grows, the method becomes less flexible and produces a decision boundary that is close to linear.
    • This corresponds to a low-variance but high-bias classifier.
  • On this simulated data set, neither $K = 1$ nor $K = 100$ gives good predictions: they have test error rates of 0.1695 and 0.1925, respectively.
KNN training error rate (blue) vs test error rate (orange) using 1/K on log scale
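A bare-bones KNN classifier following the formula above: estimate $Pr(Y=j\mid X=x_0)$ as the class fractions among the $K$ nearest neighbours, then pick the largest. The two-blob data is simulated:

```python
import numpy as np

rng = np.random.default_rng(7)
# Two simulated classes centred at (0, 0) and (3, 3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def knn_classify(x0, K=3):
    dists = np.linalg.norm(X - x0, axis=1)
    neighbourhood = np.argsort(dists)[:K]              # N_0: the K nearest points
    probs = np.bincount(y[neighbourhood], minlength=2) / K
    return probs.argmax()                              # class with largest probability

print(knn_classify(np.array([3.0, 3.0])))
```

`K` plays the flexibility role discussed above: `K=1` tracks individual points (high variance), large `K` smooths toward a near-linear boundary (high bias).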

In both the regression and classification settings,

  • choosing the correct level of flexibility is critical to the success of any statistical learning method.

The bias-variance trade-off, and the resulting U-shape in the test error, can make this a difficult task.