Logistic Regression Lab

This lesson is from An Introduction to Statistical Learning.

Stock Market Data

  • This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005.
  • Predictors
    • Year
    • Lag1 through Lag5: the percentage returns for each of the five previous trading days
    • Volume: the number of shares traded on the previous day, in billions
    • Today: the percentage return on the date in question
  • Response
    • Direction: a qualitative response (Up or Down) that we want to predict using the other features
library(ISLR2)
Smarket = ISLR2::Smarket
attach(Smarket)

print(names(Smarket)) 
# "Year", "Lag1", "Lag2", "Lag3", "Lag4", "Lag5", 
# "Volume", "Today", "Direction"

print(dim(Smarket)) # 1250, 9

print(summary(Smarket))
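# Pairwise correlations among the predictors.
# Calling cor(Smarket) directly raises an error because Direction is
# qualitative, so the call is left commented out:
# cor(Smarket)

# Drop column 9 (Direction) before computing correlations.
# There is little correlation between the lag variables and today's returns;
# the only substantial correlation is between Year and Volume.
cor(Smarket[, -9])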

# Volume is increasing over time
plot(Volume)
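
Dropping column 9 by position works here because Direction is the last column. As a minimal sketch of a position-independent alternative, the numeric columns can be selected with sapply() and is.numeric(). The data frame `toy` and its column names are hypothetical, made up for illustration only.

```r
# Hypothetical toy data frame mixing numeric and qualitative columns
toy <- data.frame(
  a     = c(1, 2, 3, 4, 5),
  b     = c(2, 4, 6, 8, 10),
  label = factor(c("Up", "Down", "Up", "Up", "Down"))
)

# Flag the numeric columns, whatever their position
numeric_cols <- sapply(toy, is.numeric)
print(numeric_cols)

# cor() runs without error once the factor column is excluded;
# a and b are perfectly correlated in this toy example
cor_mat <- cor(toy[, numeric_cols])
print(cor_mat)
```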

Fit a Logistic Regression

  • Predict Direction using Lag1 through Lag5 and Volume.
  • The glm() function can be used to fit many types of generalized linear models, including logistic regression.
  • The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family = binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.
glm.fits <- glm(
  Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
  data = Smarket , family = binomial
)

print(summary(glm.fits))
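
For reference, the model fitted by this glm() call can be written in the standard logistic regression form. The notation is chosen here for illustration: $p(X)$ stands for the probability that Direction is Up given the predictors, and $\beta_0, \dots, \beta_6$ are the coefficients reported by summary().

$$
\log\left(\frac{p(X)}{1 - p(X)}\right)
  = \beta_0 + \beta_1\,\mathrm{Lag1} + \beta_2\,\mathrm{Lag2} + \beta_3\,\mathrm{Lag3}
  + \beta_4\,\mathrm{Lag4} + \beta_5\,\mathrm{Lag5} + \beta_6\,\mathrm{Volume}
$$

The left-hand side is the log-odds (logit), so each coefficient describes a change on the log-odds scale rather than directly on the probability scale.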
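  • The smallest $p$-value, $0.145$, is associated with Lag1
    • The Lag1 coefficient is $-0.073074$
    • The negative sign suggests that if the market had a positive return yesterday, then it is less likely to go up today
    • However, a $p$-value of $0.145$ is still relatively large, so there is no clear evidence of a real association between Lag1 and Direction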
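
One way to read the size of the Lag1 coefficient is to exponentiate it, which turns a log-odds change into an odds ratio. The short sketch below hard-codes the coefficient value quoted above; the variable names are illustrative, and the interpretation assumes the other predictors are held fixed.

```r
# Lag1 coefficient as quoted above (log-odds scale)
beta_lag1 <- -0.073074

# Exponentiating gives the multiplicative change in the odds of "Up"
# for a one-unit increase in Lag1
odds_ratio <- exp(beta_lag1)
print(round(odds_ratio, 3))  # roughly 0.93, i.e. odds about 7% lower
```

Given the large $p$-value, this effect should be read as a description of the fitted model rather than as evidence of a real relationship.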
# Access Coefficients
print(summary(glm.fits)$coef)
print(coef(glm.fits))
print(summary(glm.fits)$coef[, 1])
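
The same coefficient matrix also carries the $p$-values, in its fourth column. The sketch below illustrates this on simulated data so that it runs without the ISLR2 package. The data frame `toy`, its columns, and the simulation settings are assumptions made for illustration and are not part of the lab.

```r
set.seed(1)

# Hypothetical toy data: x1 drives the response, x2 is pure noise
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(1.5 * x1))
toy <- data.frame(y = y, x1 = x1, x2 = x2)

toy_fit <- glm(y ~ x1 + x2, data = toy, family = binomial)

# One row per term, four columns:
# "Estimate", "Std. Error", "z value", "Pr(>|z|)"
coef_table <- summary(toy_fit)$coefficients
print(colnames(coef_table))

# The last column holds the p-values
p_values <- coef_table[, "Pr(>|z|)"]
print(p_values)
```

Indexing by column name rather than by position keeps the code readable and guards against picking the wrong column.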
  • Predict
    • The predict() function can be used to predict the probability that the market will go up, given values of the predictors.
    • The type = "response" option tells R to output probabilities of the form P(Y = 1|X), as opposed to other information such as the logit.
    • If no data set is supplied to the predict() function, then the probabilities are computed for the training data that was used to fit the logistic regression model.
    • The code below prints only the first ten probabilities. We know that these values correspond to the probability of the market going up, rather than down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up.
glm.probs <- predict(glm.fits , type="response") # training probs

print(glm.probs[1:10]) # 10 probs of training dataset
print(contrasts(Direction)) # Coding

# Convert probabilities to class labels using a 0.5 threshold
glm.pred <- rep("Down", 1250) # vector of 1250 "Down" elements
glm.pred[glm.probs > .5] <- "Up"
print(glm.pred[1:10])

table(glm.pred , Direction) # Confusion Matrix

print(mean(glm.pred == Direction)) # Training Accuracy
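
The link between the confusion matrix and the accuracy computed with mean() can be checked on a small example. This is a minimal sketch with made-up labels: the vectors `pred` and `truth` are hypothetical and are not taken from the Smarket data.

```r
# Hypothetical predictions and true labels for eight days
pred  <- c("Up", "Up", "Down", "Up", "Down", "Up", "Up", "Down")
truth <- c("Up", "Down", "Down", "Up", "Up", "Up", "Down", "Down")

# Rows are predictions, columns are true labels
conf_mat <- table(pred, truth)
print(conf_mat)

# The diagonal cells count the correct predictions
accuracy   <- sum(diag(conf_mat)) / sum(conf_mat)
error_rate <- 1 - accuracy

print(accuracy)    # 0.625, the same value as mean(pred == truth)
print(error_rate)  # 0.375
```

Using diag() this way assumes that both vectors contain the same set of labels, so that the table is square with rows and columns in the same order.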
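  • Training Accuracy
    • 52.2% of the training observations are classified correctly
    • The training error rate is therefore 100% - 52.2% = 47.8%
    • This figure is misleading because the model is evaluated on the same data it was fit to, so it tends to underestimate the test error rate
  • Test data
    • Training data: 2001-2004
    • Test data: 2005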
# Split the dataset

print(dim(Smarket)) # 1250 9

train = (Year < 2005) # logical vector of length 1250: TRUE before 2005

print(length(train))
print(train[1:10])
print(tail(train, 10))

print(dim(Smarket[train, ])) # 998 9

Smarket.2005 <- Smarket[!train, ] # Test data
print(dim(Smarket.2005)) # 252 9

Direction.2005 <- Direction[!train] # True Response for Test Up/Down
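
The split above relies on indexing rows with a logical vector. A minimal sketch of the same idea on a tiny data frame follows. The names `toy`, `is_train`, `train_set`, and `test_set` are hypothetical and chosen only for illustration.

```r
# Hypothetical toy data frame with a Year column
toy <- data.frame(
  Year  = c(2003, 2004, 2004, 2005, 2005),
  Value = c(0.4, -1.1, 0.7, 0.2, -0.3)
)

is_train <- toy$Year < 2005    # TRUE for rows before 2005
print(is_train)                # TRUE TRUE TRUE FALSE FALSE

train_set <- toy[is_train, ]   # keeps rows where is_train is TRUE
test_set  <- toy[!is_train, ]  # "!" flips TRUE and FALSE

print(nrow(train_set))  # 3
print(nrow(test_set))   # 2
```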
# Fit using Train Data and Verify using Test Data

glm.fits = glm(
  Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
  data=Smarket, subset = train, family=binomial
)

# Test Probabilities
glm.probs = predict(glm.fits , Smarket.2005, type="response")

# Convert Test Probabilities to Levels
glm.pred = rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"

table(glm.pred , Direction.2005)

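print(mean(glm.pred == Direction.2005)) # Test Accuracy 48%
print(mean(glm.pred != Direction.2005)) # Test Error 52%
  • A test error rate of 52% is worse than random guessing
    • Removing predictors with non-significant $p$-values may improve the model, since such predictors tend to increase variance without a corresponding decrease in bias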
# Refit using only Lag1 and Lag2 as predictors
glm.fits = glm(
  Direction ~ Lag1 + Lag2,
  data=Smarket, subset = train, family=binomial
)

# Test Probabilities
glm.probs = predict(glm.fits , Smarket.2005, type="response")

# Convert Test Probabilities to Levels
glm.pred = rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"

table(glm.pred , Direction.2005)

print(mean(glm.pred == Direction.2005)) # Test Accuracy 56%
print(mean(glm.pred != Direction.2005)) # Test Error 44%

  • This is an improvement: test accuracy rises from 48% to 56%

  • Predict the probability of Up for two days with particular values of Lag1 and Lag2

    • Day1: Lag1 = 1.2, Lag2 = 1.1
    • Day2: Lag1 = 1.5, Lag2 = -0.8
day1 = data.frame(Lag1=1.2 , Lag2=1.1)
predict(glm.fits, day1, type = "response") # 0.4791462 

day2 = data.frame(Lag1=1.5 , Lag2=-0.8)
predict(glm.fits, day2, type = "response") # 0.4960939 

# Both days in a single call
newdata = data.frame(Lag1 = c(1.2 , 1.5) , Lag2 = c(1.1 , -0.8))
predict(glm.fits, newdata, type = "response")
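
To see what predict() does with new data, the probabilities can be recomputed by hand from the fitted coefficients. The sketch below uses simulated data so that it runs without the ISLR2 package. The data frame `toy` and the simulation settings are assumptions, only the column names and the two query days mirror the lab, and the resulting numbers will not match the values quoted above.

```r
set.seed(7)

# Hypothetical toy data; only the column names mirror the lab
n <- 300
toy <- data.frame(Lag1 = rnorm(n), Lag2 = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(-0.3 * toy$Lag1 + 0.2 * toy$Lag2))

toy_fit <- glm(y ~ Lag1 + Lag2, data = toy, family = binomial)
new_days <- data.frame(Lag1 = c(1.2, 1.5), Lag2 = c(1.1, -0.8))

# Fitted log-odds computed directly from the coefficients
b <- coef(toy_fit)
log_odds <- b[["(Intercept)"]] + b[["Lag1"]] * new_days$Lag1 +
  b[["Lag2"]] * new_days$Lag2

# plogis() is the inverse logit: it maps log-odds to probabilities
by_hand    <- plogis(log_odds)
by_predict <- predict(toy_fit, new_days, type = "response")

print(by_hand)
print(all.equal(by_hand, unname(by_predict)))
```

The agreement between the two results reflects that type = "response" applies the inverse logit to the linear predictor, whereas type = "link" would return the log-odds themselves.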