Logistic Regression Lab
This lesson is from An Introduction to Statistical Learning
Stock Market Data
- This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005.
- Predictors
- Year
- Lag1 through Lag5: the percentage returns for each of the five previous trading days
- Volume: the number of shares traded on the previous day, in billions
- Today: the percentage return on the date in question
- Response
- Our goal is to predict Direction (a qualitative response) using the other features
library(ISLR2)
Smarket = ISLR2::Smarket
attach(Smarket)
print(names(Smarket))
# "Year", "Lag1", "Lag2", "Lag3", "Lag4", "Lag5",
# "Volume", "Today", "Direction"
print(dim(Smarket)) # 1250, 9
print(summary(Smarket))
# pairwise correlations among the predictors
cor(Smarket) # Error, since the Direction variable is qualitative
# Correlations between the lag variables and today’s returns are close to zero
# The only substantial correlation is between Year and Volume
cor(Smarket[, -9])
# Volume is increasing over time
plot(Volume)
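Dropping column 9 by position works here, but it relies on knowing where Direction sits in the data frame. A minimal sketch of selecting the numeric columns by type instead, using a small simulated data frame (`toy` and its values are hypothetical stand-ins, not the real Smarket data):

```r
# Simulated stand-in for a data frame with one qualitative column
set.seed(1)
toy <- data.frame(
  Lag1 = rnorm(50),
  Volume = runif(50, 0.5, 3),
  Direction = factor(sample(c("Down", "Up"), 50, replace = TRUE))
)

# Keep only the numeric columns, whatever their position
num_cols <- sapply(toy, is.numeric)
toy_cor <- cor(toy[, num_cols])
print(round(toy_cor, 3))
```

The same `sapply(..., is.numeric)` mask should apply to any data frame that mixes numeric and factor columns.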
Fit a Logistic Regression
- Predict Direction using Lag1 through Lag5 and Volume.
- The glm() function can be used to fit many types of generalized linear models, including logistic regression.
- The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family = binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.
glm.fits <- glm(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket , family = binomial
)
print(summary(glm.fits))
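- The smallest $p$-value, $0.145$, is associated with Lag1
- The Lag1 coefficient is $-0.073074$
- The negative sign suggests that if the market had a positive return yesterday, then it is less likely to go up today
- However, a $p$-value of $0.145$ is still relatively large, so there is no clear evidence of a real association between Lag1 and Direction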
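To make the size of the Lag1 coefficient concrete, it can be exponentiated into an odds ratio. A small sketch using the coefficient value quoted above (the variable names are illustrative only):

```r
# Lag1 coefficient taken from the summary output quoted above
beta_lag1 <- -0.073074

# Multiplicative change in the odds of "Up" for a one-point increase in Lag1,
# holding the other predictors fixed
odds_ratio <- exp(beta_lag1)
print(odds_ratio)  # slightly below 1, i.e. slightly lower odds of "Up"
```

An odds ratio this close to 1 is consistent with the weak evidence indicated by the $p$-value.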
# Access Coefficients
print(summary(glm.fits)$coef)
print(coef(glm.fits))
print(summary(glm.fits)$coef[, 1])
- Predict
- The predict() function can be used to predict the probability that the market will go up, given values of the predictors.
- The type = "response" option tells R to output probabilities of the form P(Y = 1|X), as opposed to other information such as the logit.
- If no data set is supplied to the predict() function, then the probabilities are computed for the training data that was used to fit the logistic regression model.
- Here we have printed only the first ten probabilities. We know that these values correspond to the probability of the market going up, rather than down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up.
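The relationship between the logit and the probabilities returned by type = "response" can be checked directly. The following is a sketch on simulated data (the objects `x`, `y`, and `fit` are hypothetical and unrelated to Smarket); it assumes only base R's glm(), predict(), and plogis():

```r
# Simulated stand-in data (not the Smarket data)
set.seed(42)
n <- 200
x <- rnorm(n)
y <- factor(ifelse(runif(n) < plogis(0.3 * x), "Up", "Down"),
            levels = c("Down", "Up"))
fit <- glm(y ~ x, family = binomial)

# Log-odds (the logit) versus probabilities P(Y = "Up" | X)
logit_hat <- predict(fit, type = "link")
prob_hat  <- predict(fit, type = "response")

# The probabilities are the inverse logit of the linear predictor
print(all.equal(plogis(logit_hat), prob_hat))

# The second factor level ("Up") is the one coded as 1
print(contrasts(y))
```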
glm.probs <- predict(glm.fits , type="response") # training probs
print(glm.probs[1:10]) # 10 probs of training dataset
print(contrasts(Direction)) # Coding
# Convert probs to levels
glm.pred <- rep("Down", 1250) # vector of 1250 "Down" elements
glm.pred[glm.probs > .5] = "Up"
print(glm.pred[1:10])
table(glm.pred , Direction) # Confusion Matrix
print(mean(glm.pred == Direction)) # Training Accuracy
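- Training Accuracy
- 52.2% of the training observations are classified correctly
- This is misleading, since the model is evaluated on the same data it was fit to
- The training error rate is 100% - 52.2% = 47.8%, which tends to underestimate the test error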
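The link between the confusion matrix and the accuracy figure can be seen on a tiny example. This sketch uses made-up probabilities and labels (`probs` and `truth` are hypothetical), and swaps the rep()-then-overwrite pattern for ifelse(), which avoids hard-coding the vector length:

```r
# Hypothetical predicted probabilities and true labels (illustration only)
probs <- c(0.61, 0.48, 0.52, 0.45, 0.70, 0.49)
truth <- factor(c("Up", "Down", "Down", "Up", "Up", "Down"),
                levels = c("Down", "Up"))

# Vectorised thresholding at 0.5; the length follows the input
pred <- factor(ifelse(probs > 0.5, "Up", "Down"), levels = c("Down", "Up"))

conf_mat <- table(Predicted = pred, Actual = truth)
print(conf_mat)

# Correct predictions lie on the diagonal of the confusion matrix
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)      # same value as mean(pred == truth)
print(1 - accuracy)  # error rate
```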
- Test data
- Train Data: 2001-2004
- Test data: 2005
# Split the dataset
print(dim(Smarket)) # 1250 9
train = (Year < 2005) # Vector of 1250 True False
print(length(train))
print(train[1:10])
print(tail(train, 10))
print(dim(Smarket[train, ])) # 998 9
Smarket.2005 <- Smarket[!train, ] # Test data
print(dim(Smarket.2005)) # 252 9
Direction.2005 <- Direction[!train] # True Response for Test Up/Down
# Fit using Train Data and Verify using Test Data
glm.fits = glm(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data=Smarket, subset = train, family=binomial
)
# Test Probabilities
glm.probs = predict(glm.fits , Smarket.2005, type="response")
# Convert Test Probabilities to Levels
glm.pred = rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred , Direction.2005)
print(mean(glm.pred == Direction.2005)) # Test Accuracy
print(mean(glm.pred != Direction.2005)) # Test Error 52%
- A test error of 52% is worse than random guessing
- Predictors with large $p$-values may be removed in an attempt to improve the model, since they add variance without a corresponding reduction in bias
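The test-set fit above passes subset = train to glm() instead of subsetting the data frame first. A quick sketch on simulated data suggests the two approaches give the same coefficients (`sim` and `is_train` are hypothetical names, and the data are not Smarket):

```r
# Simulated stand-in data with a Year column to split on
set.seed(7)
sim <- data.frame(
  Year = rep(2001:2005, each = 40),
  Lag1 = rnorm(200),
  Lag2 = rnorm(200)
)
sim$Direction <- factor(
  ifelse(runif(200) < plogis(-0.2 * sim$Lag1), "Up", "Down"),
  levels = c("Down", "Up")
)
is_train <- sim$Year < 2005

# Option 1: let glm() select the rows
fit_subset <- glm(Direction ~ Lag1 + Lag2, data = sim,
                  subset = is_train, family = binomial)
# Option 2: subset the data frame before fitting
fit_rows <- glm(Direction ~ Lag1 + Lag2, data = sim[is_train, ],
                family = binomial)

print(all.equal(coef(fit_subset), coef(fit_rows)))
print(nobs(fit_subset))  # number of training rows actually used
```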
# Refit using only Lag1 and Lag2 as predictors
glm.fits = glm(
Direction ~ Lag1 + Lag2,
data=Smarket, subset = train, family=binomial
)
# Test Probabilities
glm.probs = predict(glm.fits , Smarket.2005, type="response")
# Convert Test Probabilities to Levels
glm.pred = rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred , Direction.2005)
print(mean(glm.pred == Direction.2005)) # Test Accuracy 56%
print(mean(glm.pred != Direction.2005)) # Test Error 44%
- Improvement: the test accuracy rises from 48% to 56% when only Lag1 and Lag2 are used as predictors
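Since the fit, predict, threshold, and evaluate steps are repeated for each candidate model, they could be wrapped in a small function. The sketch below is one possible way to do that; `holdout_accuracy` is a hypothetical helper (not part of ISLR2 or base R), and the data are simulated rather than taken from Smarket:

```r
# Fit on training rows, threshold test probabilities, return test accuracy.
# `holdout_accuracy` is a hypothetical helper written for this illustration.
holdout_accuracy <- function(formula, data, is_train, cutoff = 0.5) {
  train_data <- data[is_train, ]
  test_data  <- data[!is_train, ]
  fit <- glm(formula, data = train_data, family = binomial)
  probs <- predict(fit, newdata = test_data, type = "response")
  # The second factor level is the one glm() treats as "success"
  response <- model.response(model.frame(formula, test_data))
  lv <- levels(response)
  pred <- ifelse(probs > cutoff, lv[2], lv[1])
  mean(pred == response)
}

# Simulated stand-in data (not Smarket)
set.seed(3)
sim <- data.frame(Year = rep(2001:2005, each = 50),
                  Lag1 = rnorm(250), Lag2 = rnorm(250), Lag3 = rnorm(250))
sim$Direction <- factor(
  ifelse(runif(250) < plogis(0.8 * sim$Lag1), "Up", "Down"),
  levels = c("Down", "Up"))
is_train <- sim$Year < 2005

acc_full  <- holdout_accuracy(Direction ~ Lag1 + Lag2 + Lag3, sim, is_train)
acc_small <- holdout_accuracy(Direction ~ Lag1, sim, is_train)
print(c(full = acc_full, small = acc_small))
```

Deriving the class labels from the factor levels, rather than hard-coding "Up" and "Down", keeps the helper usable with other two-level responses.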
Predict for particular values of Lag1 and Lag2 on two days
- Day1: Lag1 = 1.2 and Lag2 = 1.1
- Day2: Lag1 = 1.5, Lag2 = -0.8
day1 = data.frame(Lag1=1.2 , Lag2=1.1)
predict(glm.fits, day1, type = "response") # 0.4791462
day2 = data.frame(Lag1=1.5 , Lag2=-0.8)
predict(glm.fits, day2, type = "response") # 0.4960939
newdata = data.frame(Lag1 = c(1.2 , 1.5) , Lag2 = c(1.1 , -0.8))
predict(glm.fits, newdata, type = "response")
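What predict() computes for new rows can be reproduced by hand from the fitted coefficients. The following sketch does this on simulated data, so the printed numbers will not match the Smarket values above (`sim`, `fit`, and `new_days` are hypothetical names):

```r
# Simulated stand-in data (not Smarket)
set.seed(11)
sim <- data.frame(Lag1 = rnorm(300), Lag2 = rnorm(300))
sim$Direction <- factor(
  ifelse(runif(300) < plogis(0.1 - 0.4 * sim$Lag1), "Up", "Down"),
  levels = c("Down", "Up"))
fit <- glm(Direction ~ Lag1 + Lag2, data = sim, family = binomial)

new_days <- data.frame(Lag1 = c(1.2, 1.5), Lag2 = c(1.1, -0.8))

# Linear predictor b0 + b1 * Lag1 + b2 * Lag2, then the inverse logit
b <- coef(fit)
eta <- b["(Intercept)"] + b["Lag1"] * new_days$Lag1 + b["Lag2"] * new_days$Lag2
manual <- plogis(eta)

by_predict <- predict(fit, new_days, type = "response")
print(cbind(manual, by_predict))
print(all.equal(unname(manual), unname(by_predict)))
```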