This lesson is from An Introduction to Statistical Learning
Stock Market Data
- This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005.
- For each date, we have recorded the percentage returns for each of the five previous trading days, Lag1 through Lag5.
- We have also recorded Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question), and Direction (whether the market was Up or Down on that date).
- Our goal is to predict Direction (a qualitative response) using the other features.
library(ISLR2)
Smarket <- ISLR2::Smarket
attach(Smarket)  # makes columns such as Volume and Direction available by name
print(names(Smarket))
# "Year", "Lag1", "Lag2", "Lag3", "Lag4", "Lag5",
# "Volume", "Today", "Direction"
print(dim(Smarket))  # 1250 9
print(summary(Smarket))
# Pairwise correlations among the variables
# cor(Smarket)  # Error: Direction is qualitative, and cor() needs numeric columns
cor(Smarket[, -9])  # drop column 9 (Direction)
# The correlations between the lag variables and Today are close to zero.
# The only substantial correlation is between Year and Volume.
plot(Volume)  # Volume is increasing over time
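Dropping the qualitative column by position (`-9`) works for this data set, but the same idea can be written without hard-coding the index. The sketch below uses a small made-up data frame (`toy`, with hypothetical columns `ret`, `vol` and `dir`) rather than Smarket, and keeps only the numeric columns before calling `cor()`.

```r
# Hypothetical toy data: two numeric columns and one qualitative column.
toy <- data.frame(
  ret = c(0.4, -1.2, 0.9, 0.1, -0.3),
  vol = c(1.1, 1.3, 1.2, 1.6, 1.5),
  dir = factor(c("Up", "Down", "Up", "Up", "Down"))
)

# Flag the numeric columns: TRUE for ret and vol, FALSE for dir.
numeric_cols <- sapply(toy, is.numeric)

# Correlation matrix of the numeric columns only (2 x 2 here).
cor_matrix <- cor(toy[, numeric_cols])
print(round(cor_matrix, 3))
```

Applied to Smarket, the same selection should pick out every column except Direction, assuming Direction is the only non-numeric column.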
Fit a Logistic Regression
- Predict Direction using Lag1 through Lag5 and Volume.
- The glm() function can be used to fit many types of generalized linear models, including logistic regression.
- The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family = binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.
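For reference, the model that family = binomial fits here can be written out as follows. The notation $p(X)$ and $\beta_0, \dots, \beta_6$ is introduced only for this illustration, and Up is assumed to be the class coded as 1.

$$
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1\,\text{Lag1} + \beta_2\,\text{Lag2} + \beta_3\,\text{Lag3} + \beta_4\,\text{Lag4} + \beta_5\,\text{Lag5} + \beta_6\,\text{Volume},
\qquad p(X) = P(\text{Direction} = \text{Up} \mid X)
$$

The left-hand side is the logit (log-odds), so each coefficient describes a change in the log-odds of Up, not a change in the probability itself.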
glm.fits <- glm(
  Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
  data = Smarket,
  family = binomial
)
print(summary(glm.fits))
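- The smallest $p$-value, $0.145$, is associated with Lag1.
- The Lag1 coefficient is $-0.073074$.
- The negative sign suggests that if the market had a positive return yesterday, then it is less likely to go up today.
- However, a $p$-value of $0.145$ is still relatively large, so there is no clear evidence of a real association between Lag1 and Direction.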
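To make the size of the Lag1 coefficient concrete, it can be converted from the log-odds scale to an odds ratio. The sketch below only uses the estimate quoted above; the variable names are illustrative, and because the $p$-value is large this is an arithmetic illustration and not evidence of a real effect.

```r
# Lag1 estimate quoted in the text (log-odds scale).
beta_lag1 <- -0.073074

# exp(coefficient) is the multiplicative change in the odds of "Up"
# for a one-unit (one percentage point) increase in Lag1,
# holding the other predictors fixed.
odds_ratio <- exp(beta_lag1)
print(round(odds_ratio, 4))   # about 0.93

# The same quantity expressed as a percentage change in the odds.
pct_change <- 100 * (odds_ratio - 1)
print(round(pct_change, 1))   # roughly a 7% decrease
```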
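# Access the coefficients
print(summary(glm.fits)$coef)       # estimates, standard errors, z-values, p-values
print(coef(glm.fits))               # estimates only
print(summary(glm.fits)$coef[, 1])  # first column of the table: the estimates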
- The predict() function can be used to predict the probability that the market will go up, given values of the predictors.
- The type = "response" option tells R to output probabilities of the form $P(Y = 1 \mid X)$, as opposed to other information such as the logit.
- If no data set is supplied to the predict() function, then the probabilities are computed for the training data that was used to fit the logistic regression model.
- Here we have printed only the first ten probabilities. We know that these values correspond to the probability of the market going up, rather than down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up.
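The relationship between the logit and the probability scale can be checked directly. The following sketch fits a logistic regression to simulated data (the variables `x` and `y` and the true coefficient 0.5 are assumptions made for this example, not part of the Smarket analysis) and confirms that applying the inverse logit, `plogis()`, to the `type = "link"` output reproduces the `type = "response"` output.

```r
set.seed(1)

# Simulated data: one numeric predictor and a two-level response.
x <- rnorm(100)
y <- factor(ifelse(runif(100) < plogis(0.5 * x), "Up", "Down"),
            levels = c("Down", "Up"))

fit <- glm(y ~ x, family = binomial)

logit_scale <- predict(fit, type = "link")      # log-odds (the logit)
prob_scale  <- predict(fit, type = "response")  # P(Y = 1 | X)

# The inverse logit of the log-odds equals the fitted probabilities.
print(all.equal(plogis(logit_scale), prob_scale))

# The second factor level ("Up") is the one coded as 1.
print(contrasts(y))
```

Setting the factor levels explicitly, as done here, makes it unambiguous which class the fitted probabilities refer to.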
glm.probs <- predict(glm.fits, type = "response")  # probabilities on the training data
print(glm.probs[1:10])       # first ten training probabilities
print(contrasts(Direction))  # dummy coding: Up = 1, Down = 0
# Convert probabilities to class labels
glm.pred <- rep("Down", 1250)     # vector of 1250 "Down" elements
glm.pred[glm.probs > .5] <- "Up"  # predict Up when P(Up) exceeds 0.5
print(glm.pred[1:10])
table(glm.pred, Direction)          # confusion matrix
print(mean(glm.pred == Direction))  # training accuracy
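The thresholding and confusion-matrix steps above can also be written without hard-coding the number of observations. This is a minimal sketch on made-up probabilities and labels (`probs` and `truth` are hypothetical, not Smarket output); it uses `ifelse()` for the threshold and fixes the factor levels so the table stays 2 x 2 even if one class is never predicted.

```r
# Hypothetical predicted probabilities of "Up" and the true labels.
probs <- c(0.62, 0.48, 0.51, 0.30, 0.55, 0.44)
truth <- c("Up", "Down", "Down", "Down", "Up", "Up")
lv <- c("Down", "Up")

# Threshold at 0.5: the result has the same length as probs.
pred <- ifelse(probs > 0.5, "Up", "Down")

# Confusion matrix with both classes always present as rows and columns.
conf_mat <- table(
  Predicted = factor(pred, levels = lv),
  Actual    = factor(truth, levels = lv)
)
print(conf_mat)

# Correct predictions lie on the diagonal.
accuracy   <- sum(diag(conf_mat)) / sum(conf_mat)
error_rate <- 1 - accuracy
print(c(accuracy = accuracy, error = error_rate))
```

The diagonal-based accuracy agrees with `mean(pred == truth)`, which is the form used in the lesson code.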
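- Training accuracy
- The training accuracy of 52.2% is misleading (overly optimistic) because the model is evaluated on the same 1,250 observations that were used to fit it.
- The training error rate is $100\% - 52.2\% = 47.8\%$.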
- For a more realistic estimate, fit the model on part of the data and evaluate it on held-out test data.
- Training data: 2001-2004
- Test data: 2005
# Split the data set
print(dim(Smarket))           # 1250 9
train <- (Year < 2005)        # logical vector of length 1250: TRUE for 2001-2004
print(length(train))
print(train[1:10])
print(tail(train, 10))
print(dim(Smarket[train, ]))  # 998 9
Smarket.2005 <- Smarket[!train, ]    # test data
print(dim(Smarket.2005))             # 252 9
Direction.2005 <- Direction[!train]  # true responses (Up/Down) for the test data
# Fit using the training data and verify using the test data
glm.fits <- glm(
  Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
  data = Smarket, subset = train, family = binomial
)
# Test probabilities
glm.probs <- predict(glm.fits, Smarket.2005, type = "response")
# Convert test probabilities to class labels
glm.pred <- rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred, Direction.2005)
print(mean(glm.pred == Direction.2005))  # test accuracy 48%
print(mean(glm.pred != Direction.2005))  # test error 52%
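- A test error of 52% is worse than random guessing.
- Predictors with non-significant $p$-values may be removed to improve the model, since predictors unrelated to the response tend to increase variance without reducing bias.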
# Lag1 and Lag2 as the only predictors
glm.fits <- glm(
  Direction ~ Lag1 + Lag2,
  data = Smarket, subset = train, family = binomial
)
# Test probabilities
glm.probs <- predict(glm.fits, Smarket.2005, type = "response")
# Convert test probabilities to class labels
glm.pred <- rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred, Direction.2005)
print(mean(glm.pred == Direction.2005))  # test accuracy 56%
print(mean(glm.pred != Direction.2005))  # test error 44%
Predict for Particular Values of Lag1 and Lag2
- Day 1: Lag1 = 1.2 and Lag2 = 1.1
- Day 2: Lag1 = 1.5 and Lag2 = -0.8
day1 <- data.frame(Lag1 = 1.2, Lag2 = 1.1)
predict(glm.fits, day1, type = "response")  # 0.4791462
day2 <- data.frame(Lag1 = 1.5, Lag2 = -0.8)
predict(glm.fits, day2, type = "response")  # 0.4960939
# Both days in a single call
newdata <- data.frame(Lag1 = c(1.2, 1.5), Lag2 = c(1.1, -0.8))
predict(glm.fits, newdata, type = "response")
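What predict() does for new observations can be reproduced by hand from the fitted coefficients. The sketch below uses simulated data (the data frame `sim` and the true coefficients used to generate it are assumptions for this example, so the numbers will not match the Smarket results above), computes the linear predictor for the same two Lag1/Lag2 combinations, and applies the logistic function.

```r
set.seed(42)

# Simulated stand-in for the training data (not the real Smarket data).
n <- 200
sim <- data.frame(Lag1 = rnorm(n), Lag2 = rnorm(n))
sim$Direction <- factor(
  ifelse(runif(n) < plogis(-0.05 * sim$Lag1 - 0.04 * sim$Lag2), "Up", "Down"),
  levels = c("Down", "Up")
)

fit <- glm(Direction ~ Lag1 + Lag2, data = sim, family = binomial)

# The two days from the text.
newdata <- data.frame(Lag1 = c(1.2, 1.5), Lag2 = c(1.1, -0.8))

# By hand: linear predictor, then the logistic function.
b <- coef(fit)
eta <- b["(Intercept)"] + b["Lag1"] * newdata$Lag1 + b["Lag2"] * newdata$Lag2
by_hand <- 1 / (1 + exp(-eta))

# Using predict().
by_predict <- predict(fit, newdata, type = "response")

print(all.equal(unname(by_hand), unname(by_predict)))
```

The column names in the new data frame must match the predictor names in the model formula; otherwise predict() cannot find the variables it needs.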