# Logistic Regression


This lesson is from An Introduction to Statistical Learning

# Introduction

- Default dataset
- Response variable $Y$ is default, which takes the value $Yes$ or $No$
- Logistic regression
- models the probability that $Y$ belongs to a particular category
- i.e. models the probability of default given balance
- $P(\text{default} = \text{Yes} \mid \text{balance})$, or simply $p(\text{balance})$
- e.g. predict default = Yes if $p(\text{balance}) > 0.5$
- or predict default = Yes if $p(\text{balance}) > 0.1$, if the company is conservative

## Logistic Model

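How should we model the relationship between $p(X)$ and $X$?

Linear Regression?

$p(X) = \beta_0 + \beta_1X$

- This is represented by the left panel

For balances close to zero, the line predicts a negative probability of default.

For very high balances, it predicts a probability greater than 1.

A straight line fit to a binary response can always predict $p(X) < 0$ for some values of $X$ and $p(X) > 1$ for others (unless the range of $X$ is limited).

We need to model $p(X)$ with a function that gives values between $0$ and $1$, instead of a straight line.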

Logistic Function

- $ p(X) = \frac{e^{\beta_0 + \beta_1X}}{1+e^{\beta_0 + \beta_1X}} $
- How to fit this model?
- Method of Maximum Likelihood
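The contrast between the straight line and the logistic function can be checked numerically. The sketch below is only an illustration: the coefficient values are hypothetical and chosen to make the effect visible, not estimates from the Default dataset.

```python
import math

def linear_p(x, b0, b1):
    """Straight-line 'probability': can fall outside [0, 1]."""
    return b0 + b1 * x

def logistic_p(x, b0, b1):
    """Logistic function: always strictly between 0 and 1."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical coefficients, for illustration only
LINEAR_COEFS = (-0.1, 0.0005)
LOGISTIC_COEFS = (-10.0, 0.005)

for balance in (0, 1000, 2500):
    print(
        balance,
        round(linear_p(balance, *LINEAR_COEFS), 3),
        round(logistic_p(balance, *LOGISTIC_COEFS), 5),
    )
```

With these assumed values the straight line goes below 0 at a balance of 0 and above 1 at a balance of 2500, while the logistic output stays inside the unit interval.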

- Right panel shows the fit of logistic regression to the Default dataset
- After a bit of manipulation
- $ \frac{p(X)}{1-p(X)} = e^{\beta_0 + \beta_1X} $
- Odds = $\frac{p(X)}{1-p(X)}$
- can take any value between $0$ and $\infty$

- $1$ in $5$ people will default, which corresponds to odds of $1/4$
- $p(X) = 0.2$
- Odds = $0.2/0.8 = 1/4$

- $9$ in $10$ people will default, which corresponds to odds of $9$
- $p(X) = 0.9$
- Odds = $0.9/0.1 = 9$
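The two conversions above can be reproduced with a pair of small helpers. This is a minimal sketch; the function names are made up for the example.

```python
def odds_from_prob(p):
    """Convert a probability in [0, 1) to odds p / (1 - p)."""
    return p / (1.0 - p)

def prob_from_odds(odds):
    """Invert the conversion: odds / (1 + odds)."""
    return odds / (1.0 + odds)

for p in (0.2, 0.9):
    odds = odds_from_prob(p)
    # Prints the probability, its odds, and the round-trip back to a probability
    print(p, odds, prob_from_odds(odds))
```

Up to floating-point rounding, the odds come out as 1/4 and 9, matching the hand calculation.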

- Taking log
- $ \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X $
- LHS is log odds or logit
- Thus, logit is linear in $X$
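The manipulation from the logistic function to the logit can be written out step by step. The shorthand $z = \beta_0 + \beta_1X$ is introduced here only to keep the lines short.

$$
\begin{aligned}
p(X) &= \frac{e^{z}}{1+e^{z}} \\
1 - p(X) &= \frac{1}{1+e^{z}} \\
\frac{p(X)}{1-p(X)} &= e^{z} = e^{\beta_0 + \beta_1X} \\
\log\left(\frac{p(X)}{1-p(X)}\right) &= \beta_0 + \beta_1X
\end{aligned}
$$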

- Logistic regression has a logit that is linear in $X$
- Increasing $X$ by $1$ unit increases log odds by $\beta_1$
- or it multiplies odds by $e^{\beta_1}$

- However, because the relationship between p(X) and X is not a straight line, $\beta_1$ does not correspond to the change in p(X) associated with a one-unit increase in X.
- The amount that p(X) changes due to a one-unit change in X depends on the current value of X.
- But regardless of the value of X, if $\beta_1$ is positive then increasing X will be associated with increasing p(X), and if $\beta_1$ is negative then increasing X will be associated with decreasing p(X).
- The fact that there is not a straight-line relationship between p(X) and X, and the fact that the rate of change in p(X) per unit change in X depends on the current value of X, can also be seen by inspection of the right-hand panel graph.
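A small numerical check makes both statements concrete: a one-unit increase in $X$ multiplies the odds by the same factor $e^{\beta_1}$ everywhere, while the change in $p(X)$ depends on where $X$ starts. The coefficients below are hypothetical values used only for this sketch.

```python
import math

def logistic_p(x, b0, b1):
    """p(X) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients, for illustration only
b0, b1 = -10.0, 0.005

for x in (500, 2000):
    p_now = logistic_p(x, b0, b1)
    p_next = logistic_p(x + 1, b0, b1)
    odds_ratio = odds(p_next) / odds(p_now)   # should equal exp(b1) at every x
    delta_p = p_next - p_now                  # varies with x
    print(x, odds_ratio, math.exp(b1), delta_p)
```

Under these assumed values the odds ratio is the same at both starting points, but the change in probability near $X = 2000$ is several hundred times larger than near $X = 500$.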


## Estimating Regression Coefficients

- Model
- $ \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X $

- Coefficients $\beta_0,~\beta_1$ could be estimated using non-linear least squares, but maximum likelihood is preferred because of its better statistical properties
- Intuition behind finding the coefficients by maximum likelihood
- We want estimates of $\beta_0,~\beta_1$ such that the model gives $\hat{p}(x_i)$ close to $1$ for individuals who defaulted and close to $0$ for those who did not

- Likelihood Function
- $\ell(\beta_0, \beta_1) = \prod\limits_{i: y_i=1} p(x_i) \prod\limits_{j: y_j=0} (1-p(x_j)) $
- Estimates $\hat{\beta_0}, \hat{\beta_1}$ are chosen to maximize this likelihood function

- Maximum Likelihood is a general approach that is used to fit non-linear models
- Least squares approach in linear regression is in fact a special case of maximum likelihood
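To make the idea of maximizing the likelihood concrete, here is a minimal sketch that fits a one-predictor logistic regression to synthetic data. Several things are assumptions of the example rather than part of the lesson: the data are simulated from hypothetical coefficients (not the Default dataset), and Newton's method is used as the optimizer, which is one common choice among several.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(b0, b1, xs, ys):
    """Log of the likelihood: sum of log p(x_i) if y_i = 1, log(1 - p(x_i)) if y_i = 0."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = b0 + b1 * x
        # log p = z - log(1 + e^z) and log(1 - p) = -log(1 + e^z)
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += y * z - softplus
    return total

def fit_logistic(xs, ys, n_iter=25):
    """Maximize the log-likelihood with Newton's method, starting from (0, 0)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p          # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w             # entries of the negative Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (negative Hessian) * step = gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Synthetic data from a known model (hypothetical values)
random.seed(0)
true_b0, true_b1 = -1.0, 2.0
xs = [random.gauss(0.0, 1.0) for _ in range(2000)]
ys = [1 if random.random() < sigmoid(true_b0 + true_b1 * x) else 0 for x in xs]

b0_hat, b1_hat = fit_logistic(xs, ys)
print("estimates:", b0_hat, b1_hat)
print("log-likelihood at estimates:", log_likelihood(b0_hat, b1_hat, xs, ys))
print("log-likelihood at true values:", log_likelihood(true_b0, true_b1, xs, ys))
```

The fitted coefficients should land near the values used to simulate the data, and the log-likelihood at the estimates is at least as large as at any other coefficient pair, including the true one.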

| | Coefficient | Std Error | z-statistic | p-value |
|---|---|---|---|---|
| Intercept | -10.6513 | 0.3612 | -29.5 | <0.0001 |
| Balance | 0.0055 | 0.0002 | 24.9 | <0.0001 |

- $\hat{\beta_1} = 0.0055 $
- increase in balance is associated with an increase in the probability of default
- a one-unit increase in balance implies increase in log odds of default by 0.0055 units

- $z$-statistics
- $\hat{\beta_1} / SE(\hat{\beta_1})$
- $\hat{\beta_0} / SE(\hat{\beta_0})$
- plays the same role as the $t$-statistic in linear regression

- A large absolute value of the $z$-statistic provides evidence against the null hypothesis $H_0: \beta_1=0$
- In this example, the null hypothesis can be rejected
- In other words, we conclude that there is indeed an association between balance and probability of default.
- The estimated intercept is typically not of interest; its main purpose is to adjust the average fitted probabilities to the proportion of ones in the data (in this case, the overall default rate).
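The $z$-statistic and its two-sided $p$-value can be reproduced from a coefficient and its standard error. The sketch below uses the standard normal approximation via `math.erfc`; the helper names are made up for the example. Note that dividing the displayed balance values gives about 27.5 rather than the reported 24.9, presumably because the coefficient and standard error in the table are rounded.

```python
import math

def z_statistic(coef, std_err):
    """Estimated coefficient divided by its standard error."""
    return coef / std_err

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Values taken from the table above
z_intercept = z_statistic(-10.6513, 0.3612)
z_balance = z_statistic(0.0055, 0.0002)  # rounded inputs, so only approximate

print(round(z_intercept, 1), two_sided_p_value(z_intercept) < 0.0001)
print(round(z_balance, 1), two_sided_p_value(z_balance) < 0.0001)
```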

## Making Predictions

- Suppose balance is USD 1000
- Now $\hat{\beta_0} = -10.6513$ and $\hat{\beta_1} = 0.0055 $
- $ \hat{p}(X) = \frac{e^{\hat{\beta_0} + \hat{\beta_1}X}}{1+e^{\hat{\beta_0} + \hat{\beta_1}X}} = 0.00576$
- below 1%

- Suppose balance is USD 2000
- $\hat{p}(X) = 0.586$
- 58.6%, much higher
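Both predictions can be checked by plugging the estimates into the logistic function. This is a minimal sketch; the function name is made up for the example.

```python
import math

def predict_default_from_balance(balance, b0=-10.6513, b1=0.0055):
    """Estimated probability of default for a given balance (USD)."""
    z = b0 + b1 * balance
    return math.exp(z) / (1.0 + math.exp(z))

for balance in (1000, 2000):
    print(balance, round(predict_default_from_balance(balance), 5))
```

The results agree with the values above: roughly 0.00576 for a balance of USD 1000 and roughly 0.586 for USD 2000.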

### Qualitative Predictor Variable

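- Predictor: student, which takes the value Yes or No
- Encoded as a dummy variable: $1$ for a student, $0$ for a non-student

| | Coefficient | Std Error | z-statistic | p-value |
|---|---|---|---|---|
| Intercept | -3.5041 | 0.0707 | -49.55 | <0.0001 |
| student [Yes] | 0.4049 | 0.1150 | 3.52 | 0.0004 |

- The coefficient is positive and its $p$-value is significant, so students tend to have higher default probabilities than non-students

$\widehat{\Pr}(\text{default} = \text{Yes} \mid \text{student} = \text{Yes}) = \frac{e^{-3.5041+0.4049\times1}}{1+e^{-3.5041+0.4049\times1}} = 0.0431$

$\widehat{\Pr}(\text{default} = \text{Yes} \mid \text{student} = \text{No}) = \frac{e^{-3.5041+0.4049\times0}}{1+e^{-3.5041+0.4049\times0}} = 0.0292$

- About 4.3% vs 2.9%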
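The dummy-variable encoding can be sketched in a few lines. The helper names are made up for the example, and the mapping of Yes to 1 and No to 0 follows the coding implied by the two probability formulas.

```python
import math

# Estimates from the student-only model
B0, B_STUDENT = -3.5041, 0.4049

def encode_student(status):
    """Dummy coding: 1 for a student ("Yes"), 0 otherwise."""
    return 1 if status == "Yes" else 0

def predict_default_from_student(status):
    z = B0 + B_STUDENT * encode_student(status)
    return 1.0 / (1.0 + math.exp(-z))

for status in ("Yes", "No"):
    print(status, round(predict_default_from_student(status), 4))
```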

## Multiple Logistic Regression

- Generalize logistic regression with a binary response to multiple predictors
- $ \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p $
- $p(X) = \frac{e^{\beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p}}{1+e^{\beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p}}$
- where $X = (X_1,X_2,\dots,X_p)$ are $p$ predictors

- $\beta_0, \beta_1, \dots, \beta_p$ can be estimated using the maximum likelihood method

| | Coefficient | Std Error | z-statistic | p-value |
|---|---|---|---|---|
| Intercept | -10.8690 | 0.4923 | -22.08 | <0.0001 |
| Balance | 0.0057 | 0.0002 | 24.74 | <0.0001 |
| Income | 0.0030 | 0.0082 | 0.37 | 0.7115 |
| Student[yes] | -0.6468 | 0.2362 | -2.74 | 0.0062 |

- Student is encoded as a dummy variable
- Income is measured in thousands of dollars
- Results
- $p$-values for balance and student are small, indicating that both are associated with the probability of default
- The coefficient for student is negative
- i.e. for a fixed value of balance and income, a student is less likely to default than a non-student

- However, the coefficient for student was positive in the single-predictor logistic regression
- Why?

Figure: Default rates for students (orange) and non-students (blue). Horizontal broken lines show the overall default rate; solid lines display the default rate as a function of balance.

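- Solid lines indicate
- for a given balance, the default rate of students is at or below that of non-students

- Broken lines
- show default rates averaged over all values of balance and income
- indicate the opposite: the overall student default rate is higher than the non-student rate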

- Why?
- The variables student and balance are correlated, as shown in the box plot
- Students tend to hold higher levels of balance/debt (orange box)
- Higher balance is in turn associated with a higher default rate (solid lines in the left panel)
- Thus, even though an individual student with a given credit card balance will tend to have a lower probability of default than a non-student with the same credit card balance, the fact that students on the whole tend to have higher credit card balances means that overall, students tend to default at a higher rate than non-students.
- This is an important distinction for a credit card company that is trying to determine to whom they should offer credit.
- A student is riskier than a non-student if no information about the student’s credit card balance is available.
- However, that student is less risky than a non-student with the same credit card balance!

- Confounding
- This simple example illustrates the dangers and subtleties associated with performing regressions involving only a single predictor when other predictors may also be relevant.
- As in the linear regression setting, the results obtained using one predictor may be quite different from those obtained using multiple predictors, especially when there is correlation among the predictors.
- In general, this phenomenon is known as confounding
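The reversal can be reproduced with a small deterministic calculation using the multiple logistic regression estimates. The balance mixes below are hypothetical assumptions made for this sketch (they are not taken from the Default dataset); they only encode the idea that students hold high balances more often than non-students.

```python
import math

# Estimates from the multiple logistic regression table
B0, B_BALANCE, B_INCOME, B_STUDENT = -10.8690, 0.0057, 0.0030, -0.6468

def default_prob(balance, income_thousands, student):
    z = B0 + B_BALANCE * balance + B_INCOME * income_thousands + B_STUDENT * student
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical share of each group holding each balance (assumption)
BALANCE_MIX = {
    1: {1000: 0.3, 2000: 0.7},  # students: mostly high balances
    0: {1000: 0.7, 2000: 0.3},  # non-students: mostly low balances
}
INCOME = 40  # income held fixed at USD 40,000 for both groups

def overall_rate(student):
    """Default rate averaged over the group's assumed balance mix."""
    return sum(
        share * default_prob(balance, INCOME, student)
        for balance, share in BALANCE_MIX[student].items()
    )

for balance in (1000, 2000):
    # At a fixed balance the student is the lower risk
    print(balance, default_prob(balance, INCOME, 1), default_prob(balance, INCOME, 0))

# Averaged over balance, the student group is the higher risk
print(overall_rate(1), overall_rate(0))
```

Under these assumed mixes, the student probability is lower at each fixed balance, yet the student group's overall rate is higher, which is the pattern described above.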

- Predictions
- $p(X) = \frac{e^{\beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3}}{1+e^{\beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3}}$
- $\hat{\beta}_0 = -10.869,~\hat{\beta}_1=0.0057,~\hat{\beta}_2=0.003,~\hat{\beta}_3=-0.6468$
- Student with balance of USD 1500 and Income of USD 40,000
- $X_1=1500,~ X_2=40,~ X_3=1$
- $\hat{p}(X) = 0.058$
- $5.8$%

- Non-student with balance of USD 1500 and Income of USD 40,000
- $X_1=1500,~ X_2=40,~ X_3=0$
- $\hat{p}(X) = 0.105$
- $10.5$%
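The two predictions can be recomputed from the listed coefficients. With the rounded coefficients shown here, this sketch gives roughly 0.055 and 0.100, slightly below the quoted 0.058 and 0.105; the gap is presumably due to rounding in the displayed estimates. The function name is made up for the example.

```python
import math

COEFS = {"intercept": -10.869, "balance": 0.0057, "income": 0.003, "student": -0.6468}

def predict_default(balance, income_thousands, student):
    """Estimated default probability; income is in thousands of USD, student is 0 or 1."""
    z = (
        COEFS["intercept"]
        + COEFS["balance"] * balance
        + COEFS["income"] * income_thousands
        + COEFS["student"] * student
    )
    return 1.0 / (1.0 + math.exp(-z))

p_student = predict_default(1500, 40, 1)
p_non_student = predict_default(1500, 40, 0)
print(round(p_student, 3), round(p_non_student, 3))
```

Either way, the ordering is the same: at this balance and income the student has the lower estimated probability of default.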

## Multinomial Logistic Regression

- Logistic regression as described so far handles a response with $K=2$ classes
- Multinomial logistic regression extends it to $K>2$ classes
- Select a single class to serve as the baseline; without loss of generality, the $K$th class

$\Pr(Y=1 \mid X=x) = \frac{e^{\beta_{10} + \beta_{11}X_1 + \beta_{12}X_2 + \dots +\beta_{1p}X_p}}{1+ \sum\limits_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1}X_1 + \beta_{l2}X_2 + \dots +\beta_{lp}X_p}}$

$\Pr(Y=2 \mid X=x) = \frac{e^{\beta_{20} + \beta_{21}X_1 + \beta_{22}X_2 + \dots +\beta_{2p}X_p}}{1+\sum\limits_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1}X_1 + \beta_{l2}X_2 + \dots +\beta_{lp}X_p}}$

- …

$\Pr(Y=k \mid X=x) = \frac{e^{\beta_{k0} + \beta_{k1}X_1 + \beta_{k2}X_2 + \dots +\beta_{kp}X_p}}{1+\sum\limits_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1}X_1 + \beta_{l2}X_2 + \dots +\beta_{lp}X_p}}$

- for $k=1,2,\dots,K-1$, and for the baseline class

$\Pr(Y=K \mid X=x) = \frac{1}{1+\sum\limits_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1}X_1 + \beta_{l2}X_2 + \dots +\beta_{lp}X_p}}$

- We can show, for $k=1,2,\dots,K-1$,

$\log\left( \frac{\Pr(Y=k \mid X=x)}{\Pr(Y=K \mid X=x)} \right) = \beta_{k0} + \beta_{k1}X_1 + \beta_{k2}X_2 + \dots +\beta_{kp}X_p $

- This implies that the log odds between any pair of classes is linear in the features
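The baseline coding can be sketched as a function that takes coefficients for the $K-1$ non-baseline classes and returns all $K$ probabilities. The coefficient values and the input `x` below are hypothetical, chosen only to exercise the formulas.

```python
import math

def baseline_probs(x, coefs):
    """Class probabilities under baseline coding.

    coefs holds one (intercept, slopes) pair per non-baseline class;
    the baseline class K has no coefficients of its own.
    """
    scores = [b0 + sum(b * xj for b, xj in zip(slopes, x)) for b0, slopes in coefs]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]
    probs.append(1.0 / denom)  # probability of the baseline class K
    return probs

# Hypothetical coefficients for K = 3 classes and p = 2 predictors
coefs = [(0.5, [1.0, -2.0]), (-1.0, [0.3, 0.8])]
x = [0.4, 1.5]

probs = baseline_probs(x, coefs)
print(probs, sum(probs))

# Log odds of class 1 versus the baseline equal class 1's linear predictor
log_odds_1_vs_K = math.log(probs[0] / probs[-1])
linear_predictor_1 = 0.5 + 1.0 * x[0] + (-2.0) * x[1]
print(log_odds_1_vs_K, linear_predictor_1)
```

The probabilities sum to one, and the log odds against the baseline reproduce the linear predictor up to floating-point error.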

- Example
- Three Levels: stroke, drug overdose, epileptic seizure
- Fit two multinomial logistic regression models
- stroke as the baseline
- drug overdose as the baseline

- coefficient estimates will differ between the two fitted models due to the differing choice of baseline, but the fitted values (predictions), the log odds between any pair of classes, and the other key model outputs will remain the same
- coefficients interpretation
- If epileptic seizure is baseline
- $\beta_{\text{stroke}0}$
- log odds of stroke versus epileptic seizure, given that $x_1 = x_2 = \dots = x_p = 0$
- Furthermore, a one-unit increase in $X_j$ is associated with a $\beta_{\text{stroke}j}$ increase in the log odds of stroke over epileptic seizure.
- Stated another way, if $X_j$ increases by one unit, then $\frac{\Pr(Y = \text{stroke} \mid X = x)}{\Pr(Y = \text{epileptic seizure} \mid X = x)}$ is multiplied by $e^{\beta_{\text{stroke}j}}$

- Softmax Coding
- rather than estimating coefficients for $K-1$ classes, we actually estimate coefficients for all $K$ classes
- rather than selecting a baseline class, we treat all $K$ classes symmetrically

$\Pr(Y=k \mid X=x) = \frac{e^{\beta_{k0} + \beta_{k1}X_1 + \beta_{k2}X_2 + \dots +\beta_{kp}X_p}}{\sum\limits_{l=1}^{K} e^{\beta_{l0} + \beta_{l1}X_1 + \beta_{l2}X_2 + \dots +\beta_{lp}X_p}}$ for $k=1,2,\dots,K$

- We can show that the log odds ratio between the $k$th and $k'$th classes is

$\log\left( \frac{\Pr(Y=k \mid X=x)}{\Pr(Y=k' \mid X=x)} \right) = (\beta_{k0}-\beta_{k'0}) + (\beta_{k1}-\beta_{k'1})x_1 + \dots +(\beta_{kp}-\beta_{k'p})x_p $
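The softmax coding can be sketched the same way, with one coefficient set per class. The coefficients and input below are hypothetical. The example also checks a direct consequence of the log odds formula: since only coefficient differences enter, adding the same amount to every class's coefficients leaves the probabilities unchanged.

```python
import math

def softmax_probs(x, coefs):
    """Class probabilities under softmax coding: one (intercept, slopes) pair per class."""
    scores = [b0 + sum(b * xj for b, xj in zip(slopes, x)) for b0, slopes in coefs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical coefficients for K = 3 classes and p = 2 predictors
coefs = [(0.5, [1.0, -2.0]), (-1.0, [0.3, 0.8]), (0.2, [-0.5, 0.1])]
x = [0.4, 1.5]
probs = softmax_probs(x, coefs)

# Log odds between classes 1 and 3 equal the coefficient differences applied to x
log_odds_1_vs_3 = math.log(probs[0] / probs[2])
coef_difference = (0.5 - 0.2) + (1.0 - (-0.5)) * x[0] + (-2.0 - 0.1) * x[1]
print(log_odds_1_vs_3, coef_difference)

# Shifting every class's coefficients by the same amounts changes nothing
shifted = [(b0 + 3.0, [b + 1.0 for b in slopes]) for b0, slopes in coefs]
print(probs, softmax_probs(x, shifted))
```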