

This lesson is from An Introduction to Statistical Learning

  • Predicting a qualitative (categorical) response variable
    • this is referred to as classifying the observation, since it involves assigning the observation to a category, or class
  • Often the methods used for classification first predict the probability that the observation belongs to each of the categories of a qualitative variable, as the basis for making the classification.
    • In this sense they also behave like regression methods
  • Methods
    • logistic regression, linear discriminant analysis, quadratic discriminant analysis, naive Bayes, and K-nearest neighbors
  • Computer-intensive classification methods
    • generalized additive models, trees, random forests, boosting, and support vector machines


  • Examples

    1. A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?

    2. An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.

    3. On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.

  • Training Observations

    • $(x_1,y_1), …, (x_n,y_n)$
  • Example: Default Dataset

    • predict whether an individual will default on their credit card payment, based on annual income and monthly credit card balance
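Figure: the Default dataset (10,000 individuals). Left: balance vs. income, with defaulters in orange
  • Left
    • About 3% of individuals defaulted overall, so the plot shows only a fraction of the non-defaulters alongside all of the defaulters
      • individuals who defaulted tend to have higher credit card balances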
  • Center
    • Distribution of balance split by default
  • Right
    • Distribution of income split by default

Why Not Linear Regression?

  • Predicting Medical Condition

    • Three possible diagnoses

      • stroke, drug overdose, and epileptic seizure \(Y = \begin{cases} 1 &\text{if stroke} \\ 2 &\text{if drug overdose} \\ 3 &\text{if epileptic seizure} \end{cases}\)

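      • Least squares regression could be used to fit a linear model that predicts $Y$ as 1, 2, or 3

      • However, this coding implies an ordering of the outcomes, even though no natural ordering exists

        • it also assumes that the gap between stroke and drug overdose is the same as the gap between drug overdose and epileptic seizure
      • A different but equally reasonable coding would produce a different fitted model and, ultimately, different predictions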

      • If the response variable’s values did take on a natural ordering, such as mild, moderate, and severe, and we felt the gap between mild and moderate was similar to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable.

      • Unfortunately, in general there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression
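The dependence on the coding can be checked on a small example. The sketch below is only an illustration under stated assumptions: the symptom scores, the six patients, and the rule of mapping a fitted value to the nearest numeric code are all hypothetical and are not taken from the book.

```python
# Hypothetical toy data: one symptom score per patient and the true diagnosis.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
diagnoses = ["stroke", "stroke", "overdose", "overdose", "seizure", "seizure"]


def fit_line(x, y):
    """Return (intercept, slope) of the simple least squares fit of y on x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def predict_label(coding, new_score):
    """Fit on the numeric codes, then map the fitted value to the nearest code.

    The nearest-code rule is an assumption made for this illustration.
    """
    y = [coding[d] for d in diagnoses]
    intercept, slope = fit_line(scores, y)
    y_hat = intercept + slope * new_score
    nearest = min(coding.values(), key=lambda code: abs(y_hat - code))
    label = {code: name for name, code in coding.items()}[nearest]
    return label, y_hat


coding_a = {"stroke": 1, "overdose": 2, "seizure": 3}
coding_b = {"overdose": 1, "stroke": 2, "seizure": 3}  # same classes, reordered

label_a, y_hat_a = predict_label(coding_a, 1.0)
label_b, y_hat_b = predict_label(coding_b, 1.0)
print(label_a, round(y_hat_a, 3))
print(label_b, round(y_hat_b, 3))
```

On this toy data the same patient (score 1.0) is assigned to stroke under the first coding and to drug overdose under the second, even though the underlying data are identical.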

    • For a binary (two level) qualitative response

      • the situation is better

      • For instance, perhaps there are only two possibilities for the patient’s medical condition: stroke and drug overdose.

      • We could then potentially use the dummy variable approach to code the response as follows: \(Y = \begin{cases} 0 &\text{if stroke} \\ 1 &\text{if drug overdose} \end{cases}\)

      • We could then fit a linear regression to this binary response, and predict drug overdose if $\hat{Y} > 0.5$ and stroke otherwise.

      • In the binary case it is not hard to show that even if we flip the above coding, linear regression will produce the same final predictions
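The reason flipping the coding does not matter is that least squares is linear in the response: if the flipped response is $1 - Y$, the fitted values become $1 - \hat{Y}$, so $\hat{Y} > 0.5$ under one coding is the same event as $1 - \hat{Y} < 0.5$ under the other (ignoring exact ties at 0.5). A minimal sketch on hypothetical data, with a made-up symptom score as the single predictor:

```python
# Hypothetical toy data: one symptom score per patient and the true diagnosis.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
diagnoses = ["stroke", "stroke", "overdose", "stroke", "overdose", "overdose"]


def fit_line(x, y):
    """Return (intercept, slope) of the simple least squares fit of y on x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def classify(coded_as_one, new_scores):
    """Code one class as 1 and the other as 0, fit, and threshold at 0.5."""
    other = next(d for d in diagnoses if d != coded_as_one)
    y = [1 if d == coded_as_one else 0 for d in diagnoses]
    intercept, slope = fit_line(scores, y)
    return [
        coded_as_one if intercept + slope * s > 0.5 else other
        for s in new_scores
    ]


new_scores = [0.5, 2.5, 4.5, 7.0]
original = classify("overdose", new_scores)  # stroke = 0, overdose = 1
flipped = classify("stroke", new_scores)     # stroke = 1, overdose = 0
print(original)
print(flipped)
```

Both calls print the same list of labels; only the numeric fitted values behind them differ.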

    • For a binary response with a 0/1 coding as above

      • regression by least squares is not completely unreasonable:
      • it can be shown that the $X\hat{\beta}$ obtained using linear regression is in fact an estimate of $\Pr(\text{drug overdose} \mid X)$ in this special case.
      • However, if we use linear regression, some of our estimates might be outside the [0, 1] interval, making them hard to interpret as probabilities
      • Nevertheless, the predictions provide an ordering and can be interpreted as crude probability estimates.
      • Curiously, it turns out that the classifications we get when we use linear regression to predict a binary response are the same as those from the linear discriminant analysis (LDA) procedure
  • Two reasons not to perform classification using a regression method

    • a regression method cannot accommodate a qualitative response with more than two classes
    • a regression method will not provide meaningful estimates of $\Pr(Y \mid X)$, even with just two classes.
Figure: Left: linear regression, which produces some negative estimated probabilities. Right: logistic regression, whose estimated probabilities all lie between 0 and 1.
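The out-of-range estimates are easy to reproduce. The sketch below fits a least squares line to a 0/1 default indicator and lists the fitted values that fall outside $[0, 1]$. The six balances and outcomes are hypothetical and chosen only for illustration; they are not drawn from the Default dataset.

```python
# Hypothetical balances (in dollars) and default indicators (1 = defaulted).
balances = [0.0, 500.0, 1000.0, 1500.0, 2000.0, 2500.0]
defaulted = [0, 0, 0, 0, 1, 1]


def fit_line(x, y):
    """Return (intercept, slope) of the simple least squares fit of y on x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


intercept, slope = fit_line(balances, defaulted)

# Fitted values play the role of estimated probabilities of default.
fitted = [intercept + slope * b for b in balances]

# Keep the (balance, estimate) pairs that cannot be valid probabilities.
out_of_range = [
    (b, round(p, 3)) for b, p in zip(balances, fitted) if p < 0 or p > 1
]
print(out_of_range)
```

Here the two lowest balances receive negative estimates, which still rank the individuals sensibly but cannot be read as probabilities.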