Chapter 2: Statistical Learning

What is Statistical Learning

  • Task: investigate the association between advertising and sales of a particular product

  • Data: ad budget (in TV, Radio, and Newspaper) and sales for 200 markets

  • Sales is to be predicted from the TV, Radio, and Newspaper budgets

    Goal: Develop an accurate model that can be used to predict sales on the basis of the three media budgets.

  • $X$: Predictor, independent variables, features, input variables

    • TV, Radio, Newspaper
    • $X = (X_1, X_2, …, X_p)$
  • $Y$: Response, dependent variable

    • Sales
  • Relationship:

    • \[Y = f(X) + \epsilon\]
    • $f$ is fixed, but an unknown function of $X_1, X_2, …, X_p$

    • $\epsilon$ is a random error term, independent of $X$, with mean zero
  • Some errors are positive and some negative; overall, the errors have approximately mean zero.

    (Figure: two predictors and one response, with positive and negative errors averaging approximately zero.)

    In essence, statistical learning refers to a set of approaches for estimating $f$.
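
As a quick illustration (a minimal sketch of my own, not from the text), the following simulates the model $Y = f(X) + \epsilon$ with a made-up $f$ and a mean-zero error term independent of $X$:

```python
# Simulate Y = f(X) + eps for an invented f (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

n = 200
X = rng.uniform(0, 10, size=n)       # a single predictor
f = lambda x: 3.0 + 0.5 * x          # the true f (unknown in practice)
eps = rng.normal(0.0, 1.0, size=n)   # error term: mean zero, independent of X
Y = f(X) + eps

print(eps.mean())  # close to 0: some errors positive, some negative
```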

Why Estimate $f$

  • Two Reasons

    • Prediction
    • Inference
  • Prediction: $\hat{f}$ is black box

    • In many cases, $X$ is readily available but $Y$ is not. Since $\epsilon$ averages to zero, we predict $Y$ with

      • \[\hat{Y} = \hat{f}(X)\]
    • Accuracy of $\hat{Y}$ as a prediction of $Y$ depends on:

      • reducible error

        • since $\hat{f}$ is not a perfect estimate of $f$; this error can potentially be reduced by using a more appropriate statistical learning technique
      • irreducible error

        • $Y$ is also a function of $\epsilon$, which cannot be predicted from $X$. Further, $\epsilon$ may contain unmeasured variables that would be useful in predicting $Y$
      • \[\begin{aligned} E(Y - \hat{Y})^2 &= E[f(X) + \epsilon - \hat{f}(X)]^2 \\ &= \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{reducible}} + \underbrace{Var(\epsilon)}_{\text{irreducible}} \end{aligned}\]

        (A numerical sketch of this decomposition appears after this list.)
  • Inference: $\hat{f}$ is not a black box

    • understanding association between $Y$ and $X_1, …, X_p$
    • Questions that can be answered:
      • Which predictors are associated with the response?
        • maybe a subset
      • What is the relationship between the response and each predictor?
        • the association may be positive or negative
      • Can the relationship between $Y$ and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
    • Inference on Advertising data:
      • Which media are associated with sales?
      • Which media generate the biggest boost in sales? or
      • How large of an increase in sales is associated with a given increase in TV advertising?
    • Real Estate Data
      • How much extra will a house with a river view be worth?
        • Inference
      • Is this house under or over-valued?
        • Prediction
    • Goal: Prediction or Inference or Combination
      • Linear models: simple and interpretable inference, but potentially less accurate prediction
      • Non-linear approaches: potentially more accurate prediction, at the cost of a less interpretable model and thus harder inference
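
Here is the promised sketch of the reducible/irreducible decomposition (the true $f$, the deliberately imperfect $\hat{f}$, and all numbers below are invented for illustration):

```python
# Estimate E(Y - Yhat)^2 at a fixed x0 and compare it with
# (f(x0) - fhat(x0))^2 + Var(eps).
import numpy as np

rng = np.random.default_rng(1)

f = lambda x: np.sin(x)                  # true f (unknown in practice)
fhat = lambda x: 0.9 * x - 0.15 * x**2   # an imperfect estimate of f
sigma = 0.5                              # sd of eps, so Var(eps) = 0.25

x0 = 1.0
y0 = f(x0) + rng.normal(0, sigma, size=100_000)  # many realizations of Y at x0

mse = np.mean((y0 - fhat(x0)) ** 2)
reducible = (f(x0) - fhat(x0)) ** 2   # shrinks if we find a better fhat
irreducible = sigma ** 2              # floor that no method can beat

print(mse, reducible + irreducible)   # the two agree closely
```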

How do we estimate $f$

  • Using Training Data

    • $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, where $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T$
    • Find $\hat{f}$ such that $Y \approx \hat{f}(X)$
    • The method can be Parametric or Non-parametric
  • Parametric Methods

    • Two-step approach

      • Step 1: make assumptions about the functional form, e.g. shape

        • e.g. Linear Model: $f$ is linear in $X$

        • \[f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_pX_p \label{eq:lm}\]
          • Now, instead of estimating an entirely arbitrary $p$-dimensional function $f(X)$, one only needs to estimate the $p + 1$ coefficients $\beta_0, \beta_1, \dots, \beta_p$.
      • Step 2: Train the model on the data

        • Linear model in Equation $\eqref{eq:lm}$ can be trained by (ordinary) least squares
    • The problem is reduced to estimating a set of parameters => "parametric" methods

    • However, the chosen form may not match the true form of $f$. We can pick a more flexible model with a greater number of parameters, but such a model can overfit, i.e., follow the noise (errors) too closely.

    • Parametric Approach applied to Income Data

      $\text{income} \approx \beta_0 + \beta_1 \times \text{education} + \beta_2 \times \text{seniority}$

    • The true data (Figure 2.3) show some curvature that the linear fit (Figure 2.4) fails to capture; nevertheless, the fit still captures the overall relationship. (A least-squares sketch appears after this list.)
  • Non-parametric Methods

    • No assumption about $f$

      • However, a very large number of observations (far more than is typically needed for a parametric approach) is required to obtain an accurate estimate of $f$.
    • For example, a thin-plate spline

      (Figures: a thin-plate spline fit, for which a level of smoothness must be selected; and a thin-plate spline with a very low level of smoothness, attaining zero training error, i.e. overfitting. A code sketch appears after this list.)
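
A minimal sketch of the parametric two-step approach, using ordinary least squares on simulated stand-ins for the Income data (the coefficients and noise level below are invented, not the book's data):

```python
# Step 1: assume f is linear; Step 2: estimate the p + 1 = 3 coefficients
# by ordinary least squares. Data are simulated stand-ins.
import numpy as np

rng = np.random.default_rng(2)

n = 30
education = rng.uniform(10, 22, size=n)
seniority = rng.uniform(0, 180, size=n)
income = 20 + 3.0 * education + 0.2 * seniority + rng.normal(0, 5, size=n)

A = np.column_stack([np.ones(n), education, seniority])  # design matrix
beta, *_ = np.linalg.lstsq(A, income, rcond=None)        # least-squares estimates

print(beta)      # estimates of (beta_0, beta_1, beta_2)
print(A @ beta)  # fitted values, yhat = fhat(X)
```

The signs and magnitudes of the estimated coefficients are what inference reads off; the fitted values are what prediction uses.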
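
And a minimal non-parametric sketch using SciPy's thin-plate-spline interpolator (the data are invented): with `smoothing=0` the spline reproduces every training point exactly (zero training error, i.e. overfitting), while a positive smoothing value gives a smoother, less flexible fit.

```python
# Thin-plate spline fits at two smoothness levels via scipy's RBFInterpolator.
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(3)

X = rng.uniform(0, 10, size=(40, 2))  # two predictors
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.normal(0, 0.3, size=40)

exact = RBFInterpolator(X, y, kernel='thin_plate_spline', smoothing=0.0)
smooth = RBFInterpolator(X, y, kernel='thin_plate_spline', smoothing=10.0)

print(np.max(np.abs(exact(X) - y)))   # ~0: interpolates the training data
print(np.max(np.abs(smooth(X) - y)))  # > 0: trades training error for smoothness
```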

The Trade-Off Between Prediction Accuracy and Model Interpretability

Low Flexibility => High Interpretability and High Flexibility => Low Interpretability

Supervised Versus Unsupervised Learning

Regression Versus Classification Problems

Assessing Model Accuracy

Measuring the Quality of Fit

  • Compute the error on test data, not just the training data:
\[\text{mean squared error (MSE)} = \frac{1}{n}\sum_{i=1}^{n}[y_i - \hat{f}(x_i)]^2\]
(Figure, left: true $f$ (black), linear regression fit (orange), spline fit 1 (blue), spline fit 2 (green). Right: training and test MSE for the three fits.)
(Figure: a second example where the true $f$ is close to linear; linear regression has both training and test errors small.)
(Figure: a third example where the true $f$ is far from linear; linear regression fits poorly, and both errors are high. A small simulation of the training/test MSE pattern appears below.)
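
A minimal simulation sketch of this pattern (the true $f$ and noise level are invented): as flexibility (polynomial degree) grows, training MSE keeps falling while test MSE is U-shaped.

```python
# Training vs test MSE as model flexibility (polynomial degree) increases.
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(1.5 * x)  # invented true f

x_tr = rng.uniform(0, 3, 20); y_tr = f(x_tr) + rng.normal(0, 0.3, 20)
x_te = rng.uniform(0, 3, 200); y_te = f(x_te) + rng.normal(0, 0.3, 200)

mse = lambda y, yhat: np.mean((y - yhat) ** 2)

for degree in [1, 3, 12]:
    coeffs = np.polyfit(x_tr, y_tr, degree)      # flexibility grows with degree
    print(degree,
          mse(y_tr, np.polyval(coeffs, x_tr)),   # training MSE: keeps falling
          mse(y_te, np.polyval(coeffs, x_te)))   # test MSE: eventually rises
```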

The Bias-Variance Trade-Off

\[\text{Expected test MSE at } x_0 = E[y_0 - \hat{f}(x_0)]^2 = Var[\hat{f}(x_0)] + [\text{Bias}(\hat{f}(x_0))]^2 + Var(\epsilon)\]
  • To minimize the expected test MSE, we need to minimize both the variance and the bias. Since the variance and the squared bias are both non-negative, the expected test MSE can never fall below $Var(\epsilon)$, the irreducible error.
  • Variance
    • refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set.
    • Ideally, $\hat{f}_{\text{training set 1}} \approx \hat{f}_{\text{training set 2}}$
    • High Var
      • small changes in training data => large changes in $\hat{f}$
      • In general, more flexible statistical methods => higher variance.
  • Bias
    • refers to the error introduced by approximating a real-life problem with a much simpler model, e.g., by assuming a linear relationship. No amount of training data eliminates this error.
  • Trade-off
    • Linear regression tends to have higher bias, so more flexible methods may be needed to represent the data accurately; but greater flexibility leads to higher variance. Good test performance requires balancing the two. (A simulation sketch of the decomposition follows.)
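
A minimal simulation sketch of the decomposition (the true $f$, noise level, and the deliberately biased linear fit are all invented): refit $\hat{f}$ on many training sets and measure $Var[\hat{f}(x_0)]$ and $\text{Bias}[\hat{f}(x_0)]^2$ at a fixed $x_0$.

```python
# Estimate Var[fhat(x0)] and Bias[fhat(x0)]^2 by refitting on many training sets.
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.sin(1.5 * x)  # invented true f
sigma, x0 = 0.3, 2.5

preds = []
for _ in range(2000):                    # many independent training sets
    x = rng.uniform(0, 3, 20)
    y = f(x) + rng.normal(0, sigma, 20)
    coeffs = np.polyfit(x, y, 1)         # a high-bias (linear) fit
    preds.append(np.polyval(coeffs, x0))
preds = np.array(preds)

var = preds.var()                          # variance across training sets
bias2 = (preds.mean() - f(x0)) ** 2        # squared bias at x0
print(var, bias2, var + bias2 + sigma**2)  # last value ~ expected test MSE at x0
```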