6 minute read


This lesson is from An Introduction to Statistical Learning


  • Resampling methods
    • involve repeatedly drawing samples from a training set and refitting on each sample to obtain additional information about the fitted model
    • allow us to obtain information that would not be available from fitting the model only once using the original training sample
    • can be computationally expensive
      • nowadays computational requirements of resampling methods generally are not prohibitive
  • Common resampling methods
    • cross-validation
    • bootstrap.
  • Cross-validation
    • estimate the test error associated with a given statistical learning method in order to
      • evaluate its performance, or
        • called model assessment
      • select the appropriate level of flexibility
        • called model selection
  • Bootstrap
    • is used in several contexts
    • most commonly to provide a measure of accuracy of a
      • parameter estimate or
      • a given selection statistical learning method


  • Test error
    • average error that results from using a statistical learning method to predict the response on a new observation
    • Given a data set, the use of a particular statistical learning method is warranted
      • if it results in a low test error.
    • can be easily calculated if a designated test set is available
  • What if Test Set not available?
    • Approach 1
      • make a mathematical adjustment to the training error rate in order to estimate the test error rate.
      • Such approaches are discussed in Chapter 6.
    • Approach 2
    • estimate the test error rate by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held out observations.

Validation Set Approach

Split into training (blue) and validation (beige) sets
  • Example: Auto
    • It appears non-linear relationship between mpg and horsepower
    • model predicting mpg using horsepower and horsepower^2 gives better results
    • Will cubic or higher-order terms will provide even better
      • include and check the $p$-values
      • or apply validation method
    • Validation Method
      • Split 392 observations randomly in two sets
        • Training set of 196 observations
        • Valid set of 196 observations
Left: using single split; Right: using different random splits; shows variability in estimated test MSE using validation approach
  • Cross-Validation approach is refinement of this approach

Leave-One-Out Cross-Validation (LOOCV)

  • Test set: $1$ observation e.g. $(x_1,y_1)$
  • Training Set: $n-1$ observations e.g. $(x_2,y_2), (x_2,y_3), … (x_n,y_n)$
  • MSEs
    • $MSE_i = (y_i - \hat{y_i})^2 $
    • $CV_{(n)} = \frac{1}{n} \sum\limits_{i=1}^{n} MSE_i $
  • Disadvantages
    • $MSE_i$ is poor estimate because it is highly variable since it depends on single observation
  • Advantages
    • Far less bias, since n-1 are in training dataset
      • in contrast to raining with n/2 observations
    • Doesn’t overestimate the test error rate
      • in comparison to validation approach
    • Same results when performing multiple times
      • validation approach will give different results due to randomness
Auto dataset with mpg ~ horsepower
  • Left
    • Mean LOOCV error curve
  • Right
    • 10-fold CV run nine times, each with different random split of data into 10 parts
    • Nine slightly different curves

k-Fold Cross-Validation

5-fold CV split into five non-overlapping groups
  • $CV_{(k)} = \frac{1}{k} \sum\limits_{i=1}^{k} MSE_i $
  • LOOCV is special case of $k$-fold CV with $k=n$
  • Preferred is $k=5 ~or~ 10$
Auto dataset with mpg ~ horsepower
  • Left
    • Mean LOOCV error curve
  • Right
    • 10-fold CV run nine times
      • each with different random split of data into 10 parts
    • Nine slightly different curves
    • Lower Variability

Bias-Variance Trade-Off for k-Fold Cross-Validation

  • $k$-fold has computational advantage
  • Most important advantage is $k$-fold CV gives better test error estimate than LOOCV, how?
  • Validation Approach
    • 50% Training and 50% Validation
    • Overestimate test error rate
  • Leave-One-Out Cross-Validation (LOOCV)
    • $n-1$ Training and $1$ Validation
    • Unbiased estimate of test error
  • $k$-fold CV
    • $k=5$ or $10$
    • $\frac{(k-1)n}{k}$ Training and $\frac{n}{k}$ for validation
    • Intermediate level of bias
  • If we see Bias
    • LOOCV is better than $k$=fold CV
    • However, we need to see Variance also
  • Variance
    • LOOCV has higher variance than $k$-fold CV when $k<n$
    • Why?
      • When we perform LOOCV, we are in effect averaging the outputs of $n$ fitted models, each of which is trained on an almost identical set of observations; therefore, these outputs are highly (positively) correlated with each other.
      • In contrast, when we perform $k$-fold CV with k < n, we are averaging the outputs of $k$ fitted models that are somewhat less correlated with each other, since the overlap between the training sets in each model is smaller.
      • Since the mean of many highly correlated quantities has higher variance than does the mean of many quantities that are not as highly correlated, the test error estimate resulting from LOOCV tends to have higher variance than does the test error estimate resulting from k-fold CV.
  • To summarize, there is a bias-variance trade-off associated with the choice of $k$ in $k$-fold cross-validation.
  • Typically, given these considerations, one performs $k$-fold cross validation using $k = 5$ or $k = 10$, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

Cross-Validation on Classification Problems

  • Leave-One-Out Cross-Validation (LOOCV)
    • $CV_{(n)} =\frac{1}{n} \sum\limits_{i=1}^{n} Err_i $ where $Err_i = I(y_i \ne \hat{y}_i) $
  • Similarly, we can define Validation and $k$-fold CV
Logistic Regression on two-dimensional data. Bayes boundary in purple dashed line with error rate of 0.133. Error rate for Linear (0.201), Quadratic (0.197), Cubic (0.160), Degree 4 (0.162)
  • Bayes bounday has error rate of 0.133

  • Linear Model

    • 0.201
    • substantially higher since linear has limited flexibility
  • Non-linear Models

    $ log\left( \frac{p}{1-p} \right) = \beta_0 + \beta_1X_1 + \beta_2X_1^2 + \beta_3X_1^3 + \beta_4X_1^4$

    • Error rate decreases to 0.160 for cubic polynomial and then increases slightly
  • In practice, we may not have Bayes decision boundary and we can use CV to make the decision

    Training Error (Blue), Test Error (Brown), 10-fold CV (Black)
    • Training error reduces as flexibility increases
      • no monotonic decrease but decreases overall as complexity increases
    • Test Error has $U$ shaped
    • 10-fold CV provides good estimates of test error
      • minimum with degree 4 polynomial while actual is minimum with degree 3
    • In fact, using fourth-order polynomials would likely lead to good test set performance, as the true test error rate is approximately the same for third, fourth, fifth, and sixth-order polynomials.