
Chapter 4: Logistic Regression

In classification problems, the response variable is qualitative.

An Overview of Classification

  • In the Default dataset, about 3% of users default; we can predict whether a user will default based on balance and income.

Why not Linear Regression

  • We can encode the categories as numbers. However, the differences between the numeric codes will not, in general, correspond to the differences between the underlying categories.

  • For a binary response, we can use a $0/1$ dummy variable, fit a linear regression, and predict class $1$ if $\hat{Y} > 0.5$

    • However, linear regression can produce fitted values below $0$ or above $1$, so they are hard to interpret as probabilities.

Logistic Function

  • Dataset: Default

    • Response variable: default
    • The probability of default given balance, $Pr(\text{default} = \text{Yes} \mid \text{balance})$ or $p(\text{balance})$, ranges between $0$ and $1$.
    • If $p(\text{balance}) > 0.5$, predict default; a company may instead choose a lower threshold, such as $p(\text{balance}) > 0.1$

Logistic Model

  • How to model $p(X) = Pr(Y = 1 \mid X)$
    • We can use $0/1$ encoding for response and apply linear regression

    • \[p(X) = \beta_0 + \beta_1X\]
    • However, some fitted values of the response can be negative or greater than $1$. Thus, we need a function that outputs values between $0$ and $1$. In logistic regression, we use the logistic function

    • Logistic Function:

      • the exponential function maps any real number to a positive one, so the ratio below always lies between $0$ and $1$
    \[p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}\]
    (Figure: left panel, linear regression on the 0/1 response; right panel, the logistic regression S-shaped curve.)
    • For low balances, the predicted probability is now close to $0$, and for high balances it is close to $1$
    • Logistic regression always produces an $S$-shaped curve, and thus sensible probabilities, regardless of the value of $X$
  • We also find

    • \[\frac{p(X)}{1-p(X)} = e^{\beta_0 + \beta_1X}\]
    • The quantity $\frac{p(X)}{1-p(X)}$ is called the $odds$ and can take any value between $0$ and $\infty$.

      • Odds near $0$ mean a low probability; odds near $\infty$ mean a high probability
      • If, on average, 1 in 5 people defaults:
        • $p = \frac{1}{5} = 0.2 \implies odds = \frac{0.2}{0.8} = \frac{1}{4}$
      • If, on average, 9 in 10 people default:
        • $p = \frac{9}{10} = 0.9 \implies odds = \frac{0.9}{0.1} = 9$
    • \[\text{Log Odds or Logit} = \log\left[\frac{p(X)}{1-p(X)}\right] = \beta_0 + \beta_1X\]
      • Logit is linear in X
        • increasing $X$ by one unit
          • changes the log odds by $\beta_1$
          • or multiply the odds by $e^{\beta_1}$
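The odds arithmetic above can be checked with a short sketch. The coefficient values are the rounded estimates for the Default data that appear later in these notes; the function names are illustrative:

```python
import numpy as np

def odds(p):
    """Odds = p / (1 - p)."""
    return p / (1 - p)

print(odds(0.2))  # 1 in 5 defaults  -> odds 0.25
print(odds(0.9))  # 9 in 10 default  -> odds 9.0

# A one-unit increase in X multiplies the odds by e^{beta_1}.
beta_0, beta_1 = -10.6513, 0.0055  # rounded estimates for the Default data

def p_of(x):
    z = beta_0 + beta_1 * x
    return np.exp(z) / (1 + np.exp(z))

ratio = odds(p_of(2001)) / odds(p_of(2000))
print(ratio, np.exp(beta_1))  # the two agree
```

Because the logit is linear in $X$, the ratio of odds at $X+1$ and $X$ is exactly $e^{\beta_1}$, independent of $X$.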

Estimating the Regression Coefficients

  • maximum likelihood is preferred for Logistic Regression

    • Estimate $\beta_0$ and $\beta_1$ such that the predicted probability $\hat{p}(x_i)$ of default is as close as possible to each individual's observed default status
  • $likelihood ~function$:

    • \[\ell(\beta_0, \beta_1) = \prod_{i:y_i=1}p(x_i) \prod_{i':y_{i'}=0}[1-p(x_{i'})]\]
    • The estimates $\hat{\beta_0}$ and $\hat{\beta_1}$​ are selected to maximize the likelihood function.
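As a sketch of what maximizing the likelihood means, the log-likelihood can be maximized numerically. The data below are synthetic, generated from a known logistic model purely for illustration (statsmodels does this fitting for us in practice):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data from a known logistic model (for illustration only)
n = 5000
x = rng.uniform(0.0, 4.0, n)
true_b0, true_b1 = -2.0, 1.5
y = rng.binomial(1, 1 / (1 + np.exp(-(true_b0 + true_b1 * x))))

def neg_log_likelihood(beta):
    # -log l(b0, b1) = sum_i [log(1 + e^{z_i}) - y_i * z_i], written stably
    z = beta[0] + beta[1] * x
    return np.sum(np.logaddexp(0.0, z) - y * z)

res = minimize(neg_log_likelihood, x0=np.zeros(2))
print(res.x)  # estimates close to the true (-2.0, 1.5)
```

Maximizing $\ell$ is equivalent to minimizing $-\log \ell$, which is convex for logistic regression, so a generic optimizer recovers the coefficients.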
import statsmodels.api as sm
from ISLP.models import ModelSpec as MS, summarize

y = df.default == "Yes" # boolean response: True if the customer defaulted

X = MS(["balance"]).fit_transform(df) # design matrix: intercept + balance

# Logistic regression is a GLM with the binomial family
model = sm.GLM(y, X, family=sm.families.Binomial()).fit()

display(summarize(model))
| | coef | std err | z | P>\|z\| |
|---|---|---|---|---|
| intercept | -10.6513 | 0.361 | -29.491 | 0.0 |
| balance | 0.0055 | 0.000 | 24.952 | 0.0 |
  • a one-unit increase in balance is associated with an increase in the log odds of default by 0.0055 units.
  • The z-statistic plays the same role as the t-statistic in the linear regression output.
  • There is indeed an association between balance and the probability of default. The estimated intercept is typically not of interest; its main purpose is to adjust the average fitted probabilities to the proportion of ones in the data (in this case, the overall default rate).

Making Predictions

  • What is the probability for an individual with a balance of $1000?

    • $X = 1000$
    • $\beta_0 = -10.6513$ and $\beta_1 = 0.0055$
    • $\hat{p}(X) = \frac{e^{\hat{\beta_0} + \hat{\beta_1}X}}{1 + e^{\hat{\beta_0} + \hat{\beta_1}X}} = 0.00576$
    • Less than 1%
  • What is the probability for an individual with a balance of $2000?

    • $X = 2000$
    • $\beta_0 = -10.6513$ and $\beta_1 = 0.0055$
    • $\hat{p}(X) = \frac{e^{\hat{\beta_0} + \hat{\beta_1}X}}{1 + e^{\hat{\beta_0} + \hat{\beta_1}X}} = 0.586$
    • 58.6 %
  •   import numpy as np
      X = 1000
      beta_0 = -10.6513
      beta_1 = 0.0055
      y = beta_0 + beta_1 * X
      p = np.exp(y) / (1 + np.exp(y))
      print(p)
    
y = df.default == "Yes" # boolean response

df["student"] = df["student"].astype('category') # encode student as categorical

X = MS(["student"]).fit_transform(df) # design matrix: intercept + student[Yes] dummy

model = sm.GLM(y, X, family=sm.families.Binomial()).fit()
display(summarize(model))
| | coef | std err | z | P>\|z\| |
|---|---|---|---|---|
| intercept | -3.5041 | 0.071 | -49.554 | 0.0 |
| student[Yes] | 0.4049 | 0.115 | 3.520 | 0.0 |
  • For a student, the log odds of default increase by 0.40, and the coefficient is statistically significant (small p-value). This indicates that students tend to have higher default probabilities than non-students

  • What is the probability of default for a student?

    • $Student[Yes] = 1$
    • $\beta_0 = -3.5041$ and $\beta_1 = 0.4049$
    • $\hat{p}(X) = \frac{e^{\hat{\beta_0} + \hat{\beta_1}X}}{1 + e^{\hat{\beta_0} + \hat{\beta_1}X}} = 0.04314$
    • 4%
  • What is the probability of default for a non-student?

    • $Student[Yes] = 0$
    • $\beta_0 = -3.5041$ and $\beta_1 = 0.4049$
    • $\hat{p}(X) = \frac{e^{\hat{\beta_0} + \hat{\beta_1}X}}{1 + e^{\hat{\beta_0} + \hat{\beta_1}X}} = 0.02919$
    • 2.9%
  •   import numpy as np
      student_yes = 0 # 0 = non-student, 1 = student
      beta_0 = -3.5041
      beta_1 = 0.4049
      y = beta_0 + beta_1 * student_yes
      p = np.exp(y) / (1 + np.exp(y))
      print(p)
    

Multiple Logistic Regression

\[\text{Logistic Function}: p(X) = \frac{e^{\beta_0 + \beta_1X_1 + \dots + \beta_pX_p}}{1 + e^{\beta_0 + \beta_1X_1 + \dots + \beta_pX_p}} \\ \text{Log Odds or Logits} = \log\left[\frac{p(X)}{1-p(X)}\right] = \beta_0 + \beta_1X_1 + \dots + \beta_pX_p\]
y = df.default == "Yes" # y is a boolean

X = ["balance", "income", "student"]

df["student"] = df["student"].astype('category')

X = MS(X).fit_transform(df)

model = sm.GLM(y, X, family=sm.families.Binomial()).fit()
display(summarize(model))
| | coef | std err | z | P>\|z\| |
|---|---|---|---|---|
| intercept | -10.869000 | 0.492000 | -22.079 | 0.000 |
| balance | 0.005700 | 0.000000 | 24.737 | 0.000 |
| income | 0.000003 | 0.000008 | 0.370 | 0.712 |
| student[Yes] | -0.646800 | 0.236000 | -2.738 | 0.006 |
  • Analysis

    • balance and student are associated with the default
      • students are less likely to default, since the coefficient is negative (a surprising result)
        • The overall student default rate is higher than the non-student default rate
        • The negative coefficient for student in the multiple logistic regression indicates that for a fixed value of balance and income, a student is less likely to default than a non-student.
    • A student is riskier than a non-student if no information about the student’s credit card balance is available. However, that student is less risky than a non-student with the same credit card balance!
  • Confounding

    • Results obtained using one predictor may differ from those obtained using multiple predictors, especially when there is a correlation among the predictors.
  • Predictions

    •   intercept = model.params["intercept"]
              
        beta_balance = model.params["balance"]
        beta_income = model.params["income"]
        beta_student_yes = model.params["student[Yes]"]
              
        X_balance = 1500
        X_income = 40_000 # income is recorded in dollars in Default.csv ($40K)
        X_student_yes = 1
              
        y = intercept + beta_balance * X_balance + beta_income * X_income + beta_student_yes * X_student_yes
        p = np.exp(y) / (1 + np.exp(y))
        print(p)
      
    • A student with a credit card balance of $1,500 and an income of $40,000

      • intercept = -10.869000

      • beta_balance = 0.005700; beta_income = 0.000003; beta_student_yes = -0.646800

      • X_student_yes = 1; X_balance = 1500; X_income = 40000 (income is in dollars)

      • \[p(X) = \frac{e^{\beta_0 + \beta_1X_1 + \dots + \beta_pX_p}}{1 + e^{\beta_0 + \beta_1X_1 + \dots + \beta_pX_p}} = 0.058\]
      • 5.8%
    • A non-student with a credit card balance of $1,500 and an income of $40,000

      • $p(X) = 0.105$
      • 10.5%
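Both predictions can be reproduced by hand from the coefficient table. The coefficients below are the rounded table values, so the results differ slightly from the unrounded-fit values of 0.058 and 0.105; income is in dollars in Default.csv:

```python
import numpy as np

# Rounded coefficients from the multiple logistic regression table
b0, b_balance, b_income, b_student = -10.8690, 0.0057, 0.000003, -0.6468

def p_default(balance, income_dollars, student):
    z = b0 + b_balance * balance + b_income * income_dollars + b_student * student
    return np.exp(z) / (1 + np.exp(z))

# Balance $1,500, income $40,000
p_student_yes = p_default(1500, 40_000, 1)
p_student_no = p_default(1500, 40_000, 0)
print(p_student_yes, p_student_no)  # roughly 0.05 vs 0.10
```

The non-student probability is about twice the student probability at the same balance and income, matching the sign of the student coefficient.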

Multinomial Logistic Regression

When the response variable has more than two classes, logistic regression extends to the multinomial setting.

home = os.path.join(os.path.expanduser("~"), "datasets", "ISLP")
df = pd.read_csv(os.path.join(home, "Default.csv"))

y = df.default # y is not a boolean

X = ["balance", "income", "student"]

df["student"] = df["student"].astype('category')

X = MS(X).fit_transform(df)

model = sm.MNLogit(y, X).fit()
display(summarize(model))
| default=Yes | coef | std err | z | P>\|z\| |
|---|---|---|---|---|
| intercept | -10.869000 | 0.492000 | -22.079 | 0.000 |
| balance | 0.005700 | 0.000000 | 24.737 | 0.000 |
| income | 0.000003 | 0.000008 | 0.370 | 0.712 |
| student[Yes] | -0.646800 | 0.236000 | -2.738 | 0.006 |
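In the multinomial setting, each class gets its own linear score, and the softmax function turns the scores into class probabilities. A minimal sketch, with hypothetical scores for three classes:

```python
import numpy as np

def softmax(scores):
    """Class probabilities in multinomial logistic regression."""
    scores = scores - scores.max()  # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

# Hypothetical linear scores for three classes at one observation
probs = softmax(np.array([1.0, 0.2, -0.5]))
print(probs, probs.sum())  # probabilities are positive and sum to 1
```

With two classes, the softmax reduces to the logistic function from earlier in the chapter.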