Chapter 3: Linear Regression
Predicting a quantitative response
- Advertising data
- Sales (in thousands of units) is modeled as a function of the advertising budgets for TV, radio, and newspaper media.
- Questions
- Is there a relationship between advertising budget and sales?
- How strong is the relationship between advertising budget and sales?
- Which media are associated with sales?
- How large is the association between each medium and sales?
- How accurately can we predict future sales?
- Is the relationship linear?
- Is there synergy among the advertising media?
Simple Linear Regression: Single Predictor
Response $Y$ based on a single predictor $X$:
- \[Y \approx \beta_0 + \beta_1X\]
$\beta_0$ (intercept) and $\beta_1$ (slope) are model coefficients or parameters.
Use training data to estimate the coefficients and form predictions:
- \[\hat{Y} \approx \hat{\beta_0} + \hat{\beta_1}X\]
Estimating the Coefficients
Estimate the coefficients by minimizing the least squares criterion.
Residual: $e_i = y_i−\hat{y}_i$
Residual Sum of Squares: $RSS = e_1^2 + e_2^2 + … + e_n^2$
- \[RSS = [y_1− \hat{\beta_0}− \hat{\beta_1}x_1]^2 + [y_2− \hat{\beta_0}− \hat{\beta_1}x_2]^2 + ... + [y_n− \hat{\beta_0}− \hat{\beta_1}x_n]^2\]
Using calculus:
- \[\hat{\beta_1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \\ \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\\ \bar{x} = \frac{\sum{x}}{n}, ~\text{and}~ \bar{y} = \frac{\sum{y}}{n}\]
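A quick numerical check of these formulas (a minimal sketch; it assumes the Advertising data loaded as `data`, exactly as in the lab code later in this section):
import numpy as np
import pandas as pd

data = pd.read_csv("https://www.statlearning.com/s/Advertising.csv", index_col=0)
x, y = data["TV"].values, data["sales"].values
x_bar, y_bar = x.mean(), y.mean()
# closed-form least squares estimates
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
print(f"beta0_hat = {beta0_hat:.4f}, beta1_hat = {beta1_hat:.4f}")
# should agree with the statsmodels fit below: intercept ~ 7.0326, TV slope ~ 0.0475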
Assessing the Accuracy of the Coefficient Estimates
True relationship:
- \[Y = \beta_0 + \beta_1 X + \epsilon\]
The analogy with estimating the population mean $\mu$ from the sample mean $\hat{\mu}$
An estimate $\hat{\mu}$ from one data set may under- or overestimate $\mu$, but averaged over a large number of data sets, $\hat{\mu}$ equals $\mu$: the estimator is unbiased. Similarly, the least squares coefficient estimates are unbiased: averaged over many data sets they equal the true coefficients.
Standard Error, $SE(\hat{\mu})$: average amount $\hat{\mu}$ differs from the actual value of $\mu$
- \[Var(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}\]
- SE reduces as $n$ increases.
Similarly
- \[SE(\hat{\beta_0})^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right] \\ SE(\hat{\beta_1})^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}; ~\sigma^2 = Var(\epsilon)\]
In practice $\sigma$ is unknown and is estimated by the Residual Standard Error: $RSE = \sqrt{\frac{RSS}{n-2}}$
Standard errors can be used to compute confidence intervals. For linear regression, the 95% confidence intervals approximately take the form:
- \[\hat{\beta_1} \pm 2\times SE(\hat{\beta_1}); \hat{\beta_0} \pm 2\times SE(\hat{\beta_0})\]
- There is approximately a 95% chance that these intervals contain the true values of $\beta_1$ and $\beta_0$, respectively.
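A small check of the $\pm 2 \times SE$ rule against statsmodels' exact intervals (a sketch; it assumes `results` holds the TV-only fit produced in the lab code below):
beta1_hat = results.params["TV"]  # slope estimate
se_beta1 = results.bse["TV"]      # its standard error
print(f"approx. 95% CI for beta1: [{beta1_hat - 2*se_beta1:.4f}, {beta1_hat + 2*se_beta1:.4f}]")
# exact intervals based on the t distribution
print(results.conf_int(alpha=0.05))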
Usage of Standard Error for Hypothesis Test
- \[H_0: \text{There is no relationship between } X \text{ and } Y ~(\beta_1 = 0)\]
- \[H_a: \text{There is some relationship between } X \text{ and } Y ~(\beta_1 \ne 0)\]
Compute $t-statistic$: number of standard deviations $\hat{\beta_1}$ is away from $0$
- \[t = \frac{\hat{\beta_1}-0}{SE(\hat{\beta_1})}\]
- From the $t-statistic$ we compute a $p-value$: a small $p-value$ indicates an association, and we reject the null hypothesis.
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize, poly)
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# load the Advertising data
data = pd.read_csv("https://www.statlearning.com/s/Advertising.csv", index_col=0)
print(data.shape, data.columns)
display(data.head())
_ = pd.plotting.scatter_matrix(data)

# simple linear regression of sales on TV
model = MS(['TV'])
X = model.fit_transform(data)
print(X[:4])
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
            Coefficient   Std Error   t-statistic   p-value
Intercept   7.0326        0.458       15.360        0.0
TV          0.0475        0.003       17.668        0.0
- Conclusion:
- $\beta_0 \ne 0; \beta_1 \ne 0$
Assessing the Accuracy of the Model
Once the null hypothesis is rejected, we assess how well the model fits the data.
Linear Regression
- $residual~ standard~ error (RSE)$
- $R^2 statistic$
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}")  # 3.26
print(f"R^2 = {results.rsquared:.3f}")  # 0.612
# F-statistic indicates whether the regression model provides a better fit
print(f"F-statistic = {results.fvalue:.1f}")  # 312.1
print(f"p-value = {results.f_pvalue:.3f}")  # 0.000
$Residual~ Standard~ Error~ (RSE)$: a measure of lack of fit; it is an estimate of the standard deviation of $\epsilon$
- \[RSE = \sqrt{\frac{RSS}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y_i})^2}{n-2}}\]
- \[RSS = \sum_{i=1}^{n} (y_i - \hat{y_i})^2\]
$RSE = 3.26$
The model can be off by about 3.26 (thousands of units) on average.
mean_sales = np.mean(data.sales)  # 14.0225
residual_standard_error = np.sqrt(results.mse_resid)  # 3.26
error = residual_standard_error / mean_sales  # 0.232
print(f"Error = {error:.3f}")
- Thus, the percentage error is roughly 23%
$R^2 statistic$: proportion of variance explained
A measure of the linear relationship between $X$ and $Y$
An RSE of 3.26 is in the units of $Y$, so on its own it does not tell us whether the fit is good or bad.
$R^2$ is always between $0$ and $1$ and is independent of the scale of $Y$. It is easier to interpret than RSE, but it can still be hard to say what a good value of $R^2$ is. In physics, where the data are expected to come from a nearly linear model, $R^2$ close to $1$ is expected; in marketing, psychology, or biology, where the linear model is only a rough approximation and there is measurement error, a value of $0.1$ may be realistic.
- \[R^2 = \frac{TSS-RSS}{TSS} = 1 - \frac{RSS}{TSS} \\ \text{where } RSS = \sum(y_i - \hat{y_i})^2 ~and~ TSS = \sum (y_i - \bar{y})^2\]
TSS is the total variance in $Y$
Correlation $r = Cor(X,Y)$ also measures the linear relationship between $X$ and $Y$; in simple linear regression it can be shown that $r^2 = R^2$ (the squared correlation equals the R-squared).
- \[r = Cor(X,Y) = \frac{\sum[(x_i - \bar{x})(y_i - \bar{y})]}{\sqrt{\sum[x_i-\bar{x}]^2} \sqrt{\sum[y_i-\bar{y}]^2}}\]
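A quick check that $r^2 = R^2$ in simple regression (a sketch; assumes `data` and the TV-only `results` from the surrounding code):
import numpy as np

r = np.corrcoef(data["TV"], data["sales"])[0, 1]  # sample correlation between TV and sales
print(f"r^2 = {r**2:.3f}")              # ~0.612
print(f"R^2 = {results.rsquared:.3f}")  # ~0.612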
# Regression model with TV as the only predictor
model = MS(['TV'])
X = model.fit_transform(data)
display(X[:4])
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""
{'coef': {'intercept': 7.0326, 'TV': 0.0475},
 'std err': {'intercept': 0.458, 'TV': 0.003},
 't': {'intercept': 15.36, 'TV': 17.668},
 'P>|t|': {'intercept': 0.0, 'TV': 0.0}}
"""
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}")  # 3.26
print(f"R^2 = {results.rsquared:.3f}")  # 0.612
print(f"F-statistic = {results.fvalue:.1f}")  # 312.1
print(f"p-value = {results.f_pvalue:.3f}")  # 0.000
mean_sales = np.mean(data.sales)  # 14.0225
residual_standard_error = np.sqrt(results.mse_resid)  # 3.26
error = residual_standard_error / mean_sales  # 0.232
print(f"Error = {error:.3f}")
# Regression model with radio as the only predictor
model = MS(['radio'])
X = model.fit_transform(data)
display(X[:4])
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""
{'coef': {'intercept': 9.3116, 'radio': 0.2025},
 'std err': {'intercept': 0.563, 'radio': 0.02},
 't': {'intercept': 16.542, 'radio': 9.921},
 'P>|t|': {'intercept': 0.0, 'radio': 0.0}}
"""
print(results.params)  # Intercept = 9.31, radio = 0.202
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}")  # 4.27
print(f"R^2 = {results.rsquared:.3f}")  # 0.332
print(f"F-statistic = {results.fvalue:.1f}")  # 98.4
print(f"p-value = {results.f_pvalue:.3f}")  # 0.000
mean_sales = np.mean(data.sales)  # 14.0225
residual_standard_error = np.sqrt(results.mse_resid)  # 4.27
error = residual_standard_error / mean_sales  # 0.305
print(f"Error = {error:.3f}")
# Regression model with newspaper as the only predictor
model = MS(['newspaper'])
X = model.fit_transform(data)
display(X[:4])
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""
{'coef': {'intercept': 12.3514, 'newspaper': 0.0547},
 'std err': {'intercept': 0.621, 'newspaper': 0.017},
 't': {'intercept': 19.876, 'newspaper': 3.3},
 'P>|t|': {'intercept': 0.0, 'newspaper': 0.001}}
"""
print(results.params)  # Intercept = 12.35, newspaper = 0.054
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}")  # 5.09
print(f"R^2 = {results.rsquared:.3f}")  # 0.052
print(f"F-statistic = {results.fvalue:.1f}")  # 10.9
print(f"p-value = {results.f_pvalue:.3f}")  # 0.001
mean_sales = np.mean(data.sales)  # 14.0225
residual_standard_error = np.sqrt(results.mse_resid)  # 5.09
error = residual_standard_error / mean_sales  # 0.363
print(f"Error = {error:.3f}")  # 0.363
Multiple Linear Regression
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2+...+\beta_pX_p + \epsilon\]
Estimating the Regression Coefficients
\[\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2+...+\hat{\beta_p}X_p\]
Choose the coefficients that minimize the sum of squared residuals:
- \[RSS = \sum_{i=1}^{n} [y_i - \hat{y_i}]^2 = \sum_{i=1}^{n} [y_i - \hat{\beta_0} - \hat{\beta_1}x_{i1} - \hat{\beta_2}x_{i2}-...-\hat{\beta_p}x_{ip}]^2\]
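In matrix form the least squares solution is $\hat{\beta} = (X^TX)^{-1}X^Ty$. A minimal sketch solving it directly with NumPy (assumes the Advertising data in `data`); the coefficients should match the statsmodels fit below:
import numpy as np

X_mat = np.column_stack([np.ones(len(data)), data["TV"], data["radio"], data["newspaper"]])
y_vec = data["sales"].values
# numerically stable least squares solve of X_mat @ beta = y_vec
beta_hat, *_ = np.linalg.lstsq(X_mat, y_vec, rcond=None)
print(beta_hat)  # intercept, TV, radio, newspaper ~ 2.9389, 0.0458, 0.1885, -0.001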
# multiple linear regression
model = MS(["TV", "radio", "newspaper"])
X = model.fit_transform(data)
display(X[:4])
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""
{'coef': {'intercept': 2.9389, 'TV': 0.0458, 'radio': 0.1885, 'newspaper': -0.001},
 'std err': {'intercept': 0.312, 'TV': 0.001, 'radio': 0.009, 'newspaper': 0.006},
 't': {'intercept': 9.422, 'TV': 32.809, 'radio': 21.893, 'newspaper': -0.177},
 'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.0, 'newspaper': 0.86}}
"""
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}")  # 1.69
print(f"R^2 = {results.rsquared:.3f}")  # 0.897
print(f"F-statistic = {results.fvalue:.1f}")  # 570.3
print(f"p-value = {results.f_pvalue:.3f}")  # 0.000
mean_sales = np.mean(data.sales)  # 14.0225
residual_standard_error = np.sqrt(results.mse_resid)  # 1.69
error = residual_standard_error / mean_sales  # 0.120
print(f"Error = {error:.3f}")  # 0.120
Correlation Matrix
# Correlation matrix
display(data.corr())
display(data.corr().to_dict())
"""
{'TV': {'TV': 1.0, 'radio': 0.05480866446583008, 'newspaper': 0.05664787496505698, 'sales': 0.7822244248616067},
 'radio': {'TV': 0.05480866446583008, 'radio': 1.0, 'newspaper': 0.35410375076117534, 'sales': 0.5762225745710555},
 'newspaper': {'TV': 0.05664787496505698, 'radio': 0.35410375076117534, 'newspaper': 1.0, 'sales': 0.22829902637616545},
 'sales': {'TV': 0.7822244248616067, 'radio': 0.5762225745710555, 'newspaper': 0.22829902637616545, 'sales': 1.0}}
"""
Comments
- Multiple Linear Regression:
- $TV=0.0458,~ radio=0.1885,~ newspaper = -0.001$ with $p-values = 0,~ 0,~ 0.86$
- For a given TV and newspaper budget, spending an additional $$1000$ on radio advertising is associated with an increase in sales of about $189$ units.
- Simple Linear Regression:
- $TV = 0.0475,~radio=0.2025,~newspaper=0.0547$ with $p-values = 0,~ 0,~ 0.001 $
- The newspaper coefficient is quite different between the two fits, and so is its $p-value$.
- sales and newspaper show no relationship in the multiple regression, while the simple regression suggests one
- The correlation matrix shows a correlation of $0.35$ between radio and newspaper
- In the simple regression, newspaper gets "credit" for the association between radio and sales
- Other examples of this phenomenon: shark attacks vs. ice-cream sales (both driven by a confounder such as temperature)
Some Important Questions
- Is at least one of the predictors $X_1, X_2, \ldots, X_p$ useful in predicting the response?
- Do all the predictors help to explain $Y$, or is only a subset of the predictors useful?
- How well does the model fit the data?
- Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
Is at least one of the predictors $X_1, X_2, \ldots, X_p$ useful in predicting the response?
$H_0: \beta_1 = \beta_2 = … = \beta_p = 0$
$H_a:$ at least one $\beta_j$ is non zero
Compute $F-statistic$:
- \[F = \frac{\frac{TSS - RSS}{p}}{\frac{RSS}{n-p-1}} \\ \text{where } TSS = \sum [y_i - \bar{y}]^2 \text{ and } RSS = \sum [y_i - \hat{y_i}]^2\]
- If there is no relationship between the response and the predictors, the $F-statistic$ will be close to $1$; if $H_a$ is true, it will be greater than $1$.
As computed below, the $F-statistic$ for the multiple linear regression is about $570$, which is far larger than $1$, so at least one of the predictors is related to sales. When $n$ is large, an $F-statistic$ only slightly larger than $1$ may be enough evidence for $H_a$; when $n$ is small, a larger $F-statistic$ is needed to reject $H_0$ in favor of $H_a$.
For any $n$ and $p$, the $p-value$ associated with the $F-statistic$ can be used. Since the $p-value$ for the multiple linear regression is essentially $0$, $H_0$ can be rejected.
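Computing the $F-statistic$ directly from $TSS$ and $RSS$ (a sketch; assumes `results` is the three-predictor fit from the code below and `data` is the Advertising data):
import numpy as np

n, p = len(data), 3
rss = np.sum(results.resid ** 2)                           # residual sum of squares
tss = np.sum((data["sales"] - data["sales"].mean()) ** 2)  # total sum of squares
F = ((tss - rss) / p) / (rss / (n - p - 1))
print(f"F (by hand) = {F:.1f}")                   # ~570.3
print(f"F (statsmodels) = {results.fvalue:.1f}")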
# multiple linear regression
model = MS(["TV", "radio", "newspaper"])
X = model.fit_transform(data)
display(X[:4])
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""{'coef': {'intercept': 2.9389,
'TV': 0.0458,
'radio': 0.1885,
'newspaper': -0.001},
'std err': {'intercept': 0.312,
'TV': 0.001,
'radio': 0.009,
'newspaper': 0.006},
't': {'intercept': 9.422, 'TV': 32.809, 'radio': 21.893, 'newspaper': -0.177},
'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.0, 'newspaper': 0.86}}
"""
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}") # 1.69
print(f"R^2 = {results.rsquared:.3f}") # 0.897
print(f"F-statistic = {results.fvalue:.1f}") # 570.3
print(f"p-value = {results.f_pvalue:.3f}") # 0.000
Two: Deciding on Important Variables
- The first step in multiple regression is to check whether any of the predictors are related to the response, using the $F-statistic$ and its $p-value$. We could instead look at the individual $p-values$, but when $p$ (the number of predictors) is large we are likely to make false discoveries.
- $Variable~ Selection$
- There are $2^p$ models containing subsets of the $p$ variables, so trying them all is usually infeasible. Three classical approaches:
- Forward Selection:
- Start with a null model (intercept but no predictor)
- Fit $p$ simple linear regressions and add the variable that results in the lowest $RSS$
- At each subsequent step, add to the multiple regression the variable that most reduces the $RSS$ (a minimal sketch follows this list)
- Stop based on some rule
- Backward selection:
- Start with all variables in the model and remove the variable with the largest $p-value$
- Continue until a stopping rule is reached
- cannot be used if $p>n$
- Mixed selection:
- combination of forward and backward
- Start with the null model and add the variable that provides the best fit (as in forward selection); continue the process
- If the $p-value$ of a variable already in the model rises above a threshold as new variables are added, remove that variable
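A minimal forward-selection sketch by RSS, as referenced above. This is only an illustration, not a library routine; it assumes the Advertising data in `data` and the imports from the lab code:
import statsmodels.api as sm
from ISLP.models import ModelSpec as MS

def rss_for(predictors):
    # fit an OLS model on the given predictors and return its residual sum of squares
    X = MS(list(predictors)).fit_transform(data)
    return sm.OLS(data["sales"], X).fit().ssr

remaining = ["TV", "radio", "newspaper"]
selected = []
while remaining:
    best = min(remaining, key=lambda v: rss_for(selected + [v]))
    selected.append(best)
    remaining.remove(best)
    print(selected, f"RSS = {rss_for(selected):.1f}")
# a real implementation would stop via a rule such as adjusted R^2, AIC/BIC, or
# cross-validation rather than adding every variable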
Backward Selection
# multiple linear regression
model = MS(["TV", "radio", "newspaper"])
X = model.fit_transform(data)
display(X[:4])
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""
{'coef': {'intercept': 2.9389, 'TV': 0.0458, 'radio': 0.1885, 'newspaper': -0.001},
 'std err': {'intercept': 0.312, 'TV': 0.001, 'radio': 0.009, 'newspaper': 0.006},
 't': {'intercept': 9.422, 'TV': 32.809, 'radio': 21.893, 'newspaper': -0.177},
 'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.0, 'newspaper': 0.86}}
"""
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}")  # 1.69
print(f"R^2 = {results.rsquared:.3f}")  # 0.897
print(f"F-statistic = {results.fvalue:.1f}")  # 570.3
print(f"p-value = {results.f_pvalue:.3f}")  # 0.000
mean_sales = np.mean(data.sales)  # 14.0225
residual_standard_error = np.sqrt(results.mse_resid)  # 1.69
error = residual_standard_error / mean_sales  # 0.120
print(f"Error = {error:.3f}")  # 0.120
# multiple linear regression with newspaper removed (backward selection step)
model = MS(["TV", "radio"])
X = model.fit_transform(data)
display(X[:4])
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""
{'coef': {'intercept': 2.9211, 'TV': 0.0458, 'radio': 0.188},
 'std err': {'intercept': 0.294, 'TV': 0.001, 'radio': 0.008},
 't': {'intercept': 9.919, 'TV': 32.909, 'radio': 23.382},
 'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.0}}
"""
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}")  # 1.68
print(f"R^2 = {results.rsquared:.3f}")  # 0.897
print(f"F-statistic = {results.fvalue:.1f}")  # 859.6
print(f"p-value = {results.f_pvalue:.3f}")  # 0.000
mean_sales = np.mean(data.sales)  # 14.0225
residual_standard_error = np.sqrt(results.mse_resid)  # 1.68
error = residual_standard_error / mean_sales  # 0.120
print(f"Error = {error:.3f}")  # 0.120
Three: Model Fit
- Fraction of Variance Explained: $RSE$ and $R^2$
- In Simple Regression: $R^2 = r^2 = [Cor(X,Y)]^2$
- In Multiple Regression: $R^2 = [Cor(Y, \hat{Y})]^2$
- $R^2$ close to $1$ means the model explains a large portion of the variance.
# multiple linear regression
model = MS(["TV", "radio", "newspaper"])
X = model.fit_transform(data)
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
print(f"RSE = {np.sqrt(results.mse_resid):.3f}") # 1.686
print(f"R^2 = {results.rsquared:.5f}") # 0.89721
print(f"F-statistic = {results.fvalue:.2f}") # 570.27
print(f"p-value = {results.f_pvalue:.4f}") # 0.0000
mean_sales = np.mean(data.sales) # 14.0225
residual_standard_error = np.sqrt(results.mse_resid) # 1.69
error = residual_standard_error / mean_sales # 0.120
########################
# multiple linear regression
model = MS(["TV", "radio"])
X = model.fit_transform(data)
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.3f}") # 1.681
print(f"R^2 = {results.rsquared:.5f}") # 0.89719
print(f"F-statistic = {results.fvalue:.2f}") # 859.62
print(f"p-value = {results.f_pvalue:.4f}") # 0.0000
mean_sales = np.mean(data.sales) # 14.0225
residual_standard_error = np.sqrt(results.mse_resid) # 1.69
error = residual_standard_error / mean_sales # 0.120
When “TV”, “radio”, “newspaper”, $R^2 = 0.89721$, $RSE = 1.686$
When “TV”, “radio”, $R^2 = 0.89719$, $RSE = 1.681$
When “TV”, $R^2 = 0.61188$, $RSE = 3.259$
There is a small increase when newspaper is added and a significant increase when radio is added.
Why does RSE increase when newspaper is added, given that RSS must decrease?
- \[RSE = \sqrt{\frac{RSS}{n-p-1}}\]
- Models with more variables can have higher RSE if the decrease in RSS is small relative to increase in $p$
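Computing RSE by hand for the two- and three-predictor fits illustrates this (a sketch; assumes the Advertising data in `data` and the imports from the lab code):
import numpy as np
import statsmodels.api as sm
from ISLP.models import ModelSpec as MS

for predictors in [["TV", "radio"], ["TV", "radio", "newspaper"]]:
    X = MS(predictors).fit_transform(data)
    fit = sm.OLS(data["sales"], X).fit()
    n, p = len(data), len(predictors)
    rse = np.sqrt(fit.ssr / (n - p - 1))   # RSE = sqrt(RSS / (n - p - 1))
    print(predictors, f"RSE = {rse:.3f}")  # ~1.681 and ~1.686, as reported above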
Four: Predictions
- Three uncertainties
- Least square plane $\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2 +…+ \hat{\beta_p}X_p$ is only an estimate for the true population regression plane $Y = \beta_0 + \beta_1X_1 + \beta_2X_2 +…+ \beta_pX_p$
- The inaccuracy in the coefficient estimates is related to the $reducible~error$. We can compute a $confidence~interval$ to determine how close $\hat{Y}$ will be to $f(X)$
- Model bias due to the assumption of the linear model
- Even if the true coefficients were known, the response could not be predicted exactly because of the irreducible error $\epsilon$
- We can use a $prediction~interval$ to quantify how much $Y$ varies from $\hat{Y}$
# multiple linear regression
model = MS(["TV", "radio"])
X = model.fit_transform(data)
y = data['sales']
display(X[:4])
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
new_df = pd.DataFrame({"TV":[100], "radio":[20]})
model = MS(["TV", "radio"])
newX = model.fit_transform(new_df)
display(newX)
new_predictions = results.get_prediction(newX);
print(new_predictions.predicted_mean) # [11.25646595]
# We can produce confidence intervals for the predicted values
print(new_predictions.conf_int(alpha=0.05)) # [[10.98525445 11.52767746]]
# Prediction intervals are computed by setting obs=True:
print(new_predictions.conf_int(obs=True, alpha=0.05)) # [[ 7.92961607 14.58331584]]
- Suppose $100,000 is spent on TV advertising and $20,000 is spent on radio advertising in each city
- the 95 % confidence interval is [10,985, 11,528].
- We interpret this to mean that 95 % of intervals of this form will contain the true value of f(X).
- Prediction interval can be used to quantify the prediction uncertainty surrounding sales for a particular city
- 95 % prediction interval is [7,930, 14,580].
- We interpret this to mean that 95 % of intervals of this form will contain the true value of Y for this city.
- Note that both intervals are centered at 11,256, but that the prediction interval is substantially wider than the confidence interval, reflecting the increased uncertainty about sales for a given city in comparison to the average sales over many locations.
Other Considerations in the Regression Model
Qualitative Predictors
data = pd.read_csv("https://www.statlearning.com/s/Credit.csv")
print(data.shape)
print(data.dtypes.to_dict())
"""{'Income': dtype('float64'),
'Limit': dtype('int64'),
'Rating': dtype('int64'),
'Cards': dtype('int64'),
'Age': dtype('int64'),
'Education': dtype('int64'),
'Own': dtype('O'),
'Student': dtype('O'),
'Married': dtype('O'),
'Region': dtype('O'),
'Balance': dtype('int64')}"""
display(data.head())
Predictors with Only Two Levels
Create a $dummy ~variable$ with $0$ or $1$ for $predictor$ ($factor$)
- \[x_i= \begin{cases} 1 & \text{if own house} \\ 0 & \text{otherwise} \end{cases}\]
- \[y_i = \beta_0 + \beta_1x_i + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if own house} \\ \beta_0 + \epsilon_i & \text{otherwise} \end{cases}\]
# simple linear regression
df = pd.read_csv("https://www.statlearning.com/s/Credit.csv")
factors = ["Own"]
for col in factors:
df[col] = df[col].astype('category')
X = MS(factors).fit_transform(df)
display(X[:3])
y = df['Balance']
model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 509.8031, 'Own[Yes]': 19.7331},
'std err': {'intercept': 33.128, 'Own[Yes]': 46.051},
't': {'intercept': 15.389, 'Own[Yes]': 0.429},
'P>|t|': {'intercept': 0.0, 'Own[Yes]': 0.669}}
"""
- The average credit card balance for
- Own[No] is $$509.80$ (intercept)
- Own[Yes] is estimated to be $$509.80 + $19.73 = $529.53$
- However, the $p-value$ for $Own$ is high, so there is no statistical evidence of a difference in average credit card balance between owners and non-owners.
# simple linear regression
df = pd.read_csv("https://www.statlearning.com/s/Credit.csv")
df['Own'] = df['Own'].map({'Yes': 0, 'No': 1})
factors = ["Own"]
for col in factors:
df[col] = df[col].astype('category')
X = MS(factors).fit_transform(df)
display(X[:3])
y = df['Balance']
model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 529.5362, 'Own[1]': -19.7331},
'std err': {'intercept': 31.988, 'Own[1]': 46.051},
't': {'intercept': 16.554, 'Own[1]': -0.429},
'P>|t|': {'intercept': 0.0, 'Own[1]': 0.669}}
"""
- If Own[No] is encoded as $1$, then values will be different.
- The average credit card debt for
- owners [Yes(0)] is estimated to be $$529.53$ (intercept)
- Owners[No(1)] are estimated to carry $$-19.73$ (coefficient) in additional debt for a total of $$529.53 - $19.73 = $509.80$
- This is the same as what we get when Own[Yes] was coded as $1$.
# simple linear regression
df = pd.read_csv("https://www.statlearning.com/s/Credit.csv")
df['Own'] = df['Own'].map({'Yes': 1, 'No': -1})
factors = ["Own"]
for col in factors:
df[col] = df[col].astype('category')
X = MS(factors).fit_transform(df)
display(X[:3])
y = df['Balance']
model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 509.8031, 'Own[1]': 19.7331},
'std err': {'intercept': 33.128, 'Own[1]': 46.051},
't': {'intercept': 15.389, 'Own[1]': 0.429},
'P>|t|': {'intercept': 0.0, 'Own[1]': 0.669}}
"""
- If we encode Own as $1$ (Yes) and $-1$ (No)
- The average credit card debt for
- non-owners Own[No(-1)] is estimated to be $$509.80$ (intercept)
- Owners are estimated to carry $$19.73$ (coefficient) in additional debt, for a total of $$509.80 + $19.73 = $529.53$
- It is important to note that the final predictions for the credit balances of owners and non-owners will be identical regardless of the coding scheme used. The only difference is in the way that the coefficients are interpreted.
Qualitative Predictors with More than Two Levels
# simple linear regression
df = pd.read_csv("https://www.statlearning.com/s/Credit.csv")
#print(df.columns)
factors = ["Region"]
for col in factors:
df[col] = df[col].astype('category')
X = MS(factors).fit_transform(df)
display(X[:3])
y = df['Balance']
model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 531.0,
'Region[South]': -12.5025,
'Region[West]': -18.6863},
'std err': {'intercept': 46.319,
'Region[South]': 56.681,
'Region[West]': 65.021},
't': {'intercept': 11.464, 'Region[South]': -0.221, 'Region[West]': -0.287},
'P>|t|': {'intercept': 0.0, 'Region[South]': 0.826, 'Region[West]': 0.774}}
"""
- Balance for Region[East] = 531 (intercept)
- Region[South] is estimated to have a balance $12.50$ lower than East; however, the $p-value$ is large.
- Region[West] is estimated to have a balance $18.68$ lower than East; however, the $p-value$ is large.
Adding Interaction
# Adding Interaction
df = pd.read_csv("https://www.statlearning.com/s/Advertising.csv", index_col=0)
#print(df.columns)
factors = ["TV", "radio", ("TV", "radio")]
response = "sales"
X = MS(factors).fit_transform(df)
display(X[:3])
y = df[response]
model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 6.7502,
'TV': 0.0191,
'radio': 0.0289,
'TV:radio': 0.0011},
'std err': {'intercept': 0.248,
'TV': 0.002,
'radio': 0.009,
'TV:radio': 5.24e-05},
't': {'intercept': 27.233, 'TV': 12.699, 'radio': 3.241, 'TV:radio': 20.727},
'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.001, 'TV:radio': 0.0}}
"""
Polynomial Regression
# Polynomial Regression
#df = pd.read_csv("https://www.statlearning.com/s/Auto.csv")
"""
print(df[df.horsepower=="?"].shape)
df = df[df.horsepower!="?"]
df.horsepower = df.horsepower.astype(float)
"""
df = load_data('Auto')
predictors = [poly("horsepower", degree=2)]
response = "mpg"
X = MS(predictors).fit_transform(df)
y = df[response]
model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 23.4459,
'poly(horsepower, degree=2)[0]': -120.1377,
'poly(horsepower, degree=2)[1]': 44.0895},
'std err': {'intercept': 0.221,
'poly(horsepower, degree=2)[0]': 4.374,
'poly(horsepower, degree=2)[1]': 4.374},
't': {'intercept': 106.13,
'poly(horsepower, degree=2)[0]': -27.467,
'poly(horsepower, degree=2)[1]': 10.08},
'P>|t|': {'intercept': 0.0,
'poly(horsepower, degree=2)[0]': 0.0,
'poly(horsepower, degree=2)[1]': 0.0}}
"""