Chapter 3: Linear Regression

Predicting a quantitative response

• Sales (in thousands of units) as a function of the advertising budgets for TV, radio, and newspaper media.
• Questions
• Is there a relationship between advertising budget and sales?
• How strong is the relationship between advertising budget and sales?
• Which media are associated with sales?
• How large is the association between each medium and sales?
• How accurately can we predict future sales?
• Is the relationship linear?
• Is there synergy among the advertising media?

Simple Linear Regression: Single Predictor

• Response $Y$ based on a single predictor $X$:

• $Y \approx \beta_0 + \beta_1X$
• $\beta_0$ (intercept) and $\beta_1$ (slope) are model coefficients or parameters.

• Use the training data to produce coefficient estimates $\hat{\beta_0}$ and $\hat{\beta_1}$, and predict:

• $\hat{y} = \hat{\beta_0} + \hat{\beta_1}x$

Estimating the Coefficients

• Least squares: choose $\hat{\beta_0}$ and $\hat{\beta_1}$ to minimize the residual sum of squares

• Residual: $e_i = y_i−\hat{y}_i$

• Residual Sum of Squares: $RSS = e_1^2 + e_2^2 + … + e_n^2$

• $RSS = [y_1− \hat{\beta_0}− \hat{\beta_1}x_1]^2 + [y_2− \hat{\beta_0}− \hat{\beta_1}x_2]^2 + ... + [y_n− \hat{\beta_0}− \hat{\beta_1}x_n]^2$
• Using calculus:

• $\hat{\beta_1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \\ \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\\ \bar{x} = \frac{\sum{x}}{n}, ~\text{and}~ \bar{y} = \frac{\sum{y}}{n}$
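• A minimal sketch of these closed-form estimates, computed directly with NumPy on synthetic data (no dataset has been loaded yet, so the data and true coefficients here are made up for illustration):

•   import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)  # true beta0 = 2.0, beta1 = 0.5

# closed-form least squares estimates from the formulas above
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
print(beta0_hat, beta1_hat)  # should be close to 2.0 and 0.5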

Assessing the Accuracy of the Coefficient Estimates

• True relationship:

• $Y = \beta_0 + \beta_1 X + \epsilon$
• The analogy with estimating the population mean $\mu$ from the sample mean $\hat{\mu}$

• $\hat{\mu}$ from one data set may under- or overestimate $\mu$. However, averaged over a large number of data sets, $\hat{\mu}$ will equal $\mu$: the estimator is unbiased. In the same way, the least squares coefficient estimates are unbiased: averaged over many data sets, they equal the true coefficients.

• Standard Error, $SE(\hat{\mu})$: the average amount by which $\hat{\mu}$ differs from the actual value of $\mu$

• $Var(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}$
• SE reduces as $n$ increases.
• Similarly

• $SE(\hat{\beta_0})^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right] \\ SE(\hat{\beta_1})^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}; ~\sigma^2 = Var(\epsilon)$
• In practice $\sigma$ is unknown and is estimated from the data by the Residual Standard Error: $RSE = \sqrt{\frac{RSS}{n-2}}$

• Standard errors can be used to compute confidence intervals. For linear regression, the 95% confidence intervals approximately take the form:

• $\hat{\beta_1} \pm 2\times SE(\hat{\beta_1}); \hat{\beta_0} \pm 2\times SE(\hat{\beta_0})$
• If we repeatedly drew data sets and constructed such intervals, approximately 95% of them would contain the true values of $\beta_1$ and $\beta_0$.
• Using standard errors for hypothesis tests

• $H_0: \text{There is no relationship between } X \text{ and } Y ~(\beta_1 = 0)$
• $H_a: \text{There is some relationship between } X \text{ and } Y ~(\beta_1 \ne 0)$
• Compute the $t$-statistic: the number of standard deviations that $\hat{\beta_1}$ is away from $0$

• $t = \frac{\hat{\beta_1}-0}{SE(\hat{\beta_1})}$
• We then compute the $p$-value: a small value indicates an association between the predictor and the response, and we reject the null hypothesis
•   import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm
from ISLP.models import (ModelSpec as MS, summarize, poly)

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# load the Advertising data; URL assumed to follow the statlearning.com
# pattern used for Credit.csv later in these notes
data = pd.read_csv("https://www.statlearning.com/s/Advertising.csv", index_col=0)
print(data.shape, data.columns)

model = MS(['TV'])
X = model.fit_transform(data)
print(X[:4])

y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
summarize(results)  # coefficient table: estimate, SE, t-statistic, p-value

•  Output:

|           | Coefficient | Std. Error | t-statistic | p-value |
|-----------|-------------|------------|-------------|---------|
| Intercept | 7.0326      | 0.458      | 15.360      | 0.0     |
| TV        | 0.0475      | 0.003      | 17.668      | 0.0     |
• Conclusion: both p-values are essentially zero, so we reject the null hypothesis for each coefficient: $\beta_0 \ne 0$ and $\beta_1 \ne 0$
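• The confidence intervals described above can be read directly off the fitted model; `conf_int` is part of the statsmodels results API (a small sketch reusing the `results` object fitted above):

•   # 95% confidence intervals for the intercept and the TV coefficient,
# i.e. roughly estimate ± 2 * SE
print(results.conf_int(alpha=0.05))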

Assessing the Accuracy of the Model

• Once the null hypothesis is rejected, we want to quantify the extent to which the model fits the data.

• The fit of a linear regression is typically assessed using two related quantities:

• Residual Standard Error ($RSE$)
• $R^2$ statistic
•   # compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}") # 3.26
print(f"R^2 = {results.rsquared:.3f}") # 0.612
print(f"F-statistic = {results.fvalue:.1f}") # 312.1
# F-statistic tests whether the model fits better than an intercept-only model

print(f"p-value = {results.f_pvalue:.3f}") # 0.000

• Residual Standard Error ($RSE$): a measure of the lack of fit

• is an estimate of the standard deviation of $\epsilon$

• $RSE = \sqrt{\frac{RSS}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y_i})^2}{n-2}}$
• $RSS = \sum_{i=1}^{n} (y_i - \hat{y_i})^2$
• $RSE = 3.26$

• The model can be off by about 3.26 (thousands of units) on average.

•   mean_sales = np.mean(data.sales) # 14.0225
residual_standard_error = np.sqrt(results.mse_resid) # 3.26
error = residual_standard_error / mean_sales # 0.232
print(f"Error = {error:.3f}")

• Thus, the percentage error is about 23%
• $R^2$ statistic: the proportion of variance explained

• A measure of the linear relationship between $X$ and $Y$

• An RSE of 3.26 is in the units of $Y$, so by itself it does not indicate whether the fit is good or bad

• $R^2$ always lies between $0$ and $1$ and is independent of the scale of $Y$. It is easier to interpret than RSE, but it can still be hard to say what a good value of $R^2$ is. In physics, where the data are expected to come from a linear model, $R^2$ close to $1$ is expected. In marketing, psychology, or biology, the linear model is at best a rough approximation and measurement error is common, so a value of 0.1 may be realistic.

• $R^2 = \frac{TSS-RSS}{TSS} = 1 - \frac{RSS}{TSS} \\ \text{where } RSS = \sum(y_i - \hat{y_i})^2 ~and~ TSS = \sum (y_i - \bar{y})^2$
• TSS is the total variance in $Y$

• The correlation $r = Cor(X,Y)$ also measures the linear relationship between $X$ and $Y$; in simple linear regression it can be shown that $R^2 = r^2$, the square of the correlation

• $r = Cor(X,Y) = \frac{\sum[(x_i - \bar{x})(y_i - \bar{y})]}{\sqrt{\sum[x_i-\bar{x}]^2} \sqrt{\sum[y_i-\bar{y}]^2}}$
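• A quick numerical check of $R^2 = r^2$ (a sketch reusing the `data` and TV-only `results` objects from above):

•   r = np.corrcoef(data['TV'], data['sales'])[0, 1]  # sample correlation
print(r ** 2)                # ~0.612
print(results.rsquared)      # matches r^2 for simple linear regression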
•   # Regression model with TV as the only predictor
model = MS(['TV'])
X = model.fit_transform(data)
display(X[:4])

y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""
{'coef': {'intercept': 7.0326, 'TV': 0.0475},
'std err': {'intercept': 0.458, 'TV': 0.003},
't': {'intercept': 15.36, 'TV': 17.668},
'P>|t|': {'intercept': 0.0, 'TV': 0.0}}
"""

# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}") # 3.26
print(f"R^2 = {results.rsquared:.3f}") # 0.612
print(f"F-statistic = {results.fvalue:.1f}") # 312.1
print(f"p-value = {results.f_pvalue:.3f}") # 0.000

mean_sales = np.mean(data.sales) # 14.0225
residual_standard_error = np.sqrt(results.mse_resid) # 3.26
error = residual_standard_error / mean_sales # 0.232
print(f"Error = {error:.3f}")

•   # Regression model with radio as the only predictor
model = MS(['radio'])
X = model.fit_transform(data)
display(X[:4])

y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""{'coef': {'intercept': 9.312, 'radio': 0.2025},
'std err': {'intercept': 0.563, 'radio': 0.02},
'P>|t|': {'intercept': 0.0, 'radio': 0.0}}
"""
# (dict output partially reconstructed; t-statistics omitted)

print(results.params) # Intercept = 9.31, radio = 0.202

# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}") # 4.27
print(f"R^2 = {results.rsquared:.3f}") # 0.332
print(f"F-statistic = {results.fvalue:.1f}") # 98.4
print(f"p-value = {results.f_pvalue:.3f}") # 0.000

mean_sales = np.mean(data.sales) # 14.0225
residual_standard_error = np.sqrt(results.mse_resid) # 4.27
error = residual_standard_error / mean_sales # 0.305
print(f"Error = {error:.3f}")

•   # Regression model with newspaper as the only predictor
model = MS(['newspaper'])
X = model.fit_transform(data)
display(X[:4])

y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""{'coef': {'intercept': 12.3514, 'newspaper': 0.0547},
'std err': {'intercept': 0.621, 'newspaper': 0.017},
't': {'intercept': 19.876, 'newspaper': 3.3},
'P>|t|': {'intercept': 0.0, 'newspaper': 0.001}}
"""

print(results.params) # Intercept = 12.35, newspaper = 0.054

# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}") # 5.09
print(f"R^2 = {results.rsquared:.3f}") # 0.052
print(f"F-statistic = {results.fvalue:.1f}") # 10.9
print(f"p-value = {results.f_pvalue:.3f}") # 0.001

mean_sales = np.mean(data.sales) # 14.0225
residual_standard_error = np.sqrt(results.mse_resid) # 5.09
error = residual_standard_error / mean_sales # 0.363
print(f"Error = {error:.3f}") # 0.363


Multiple Linear Regression

$Y = \beta_0 + \beta_1X_1 + \beta_2X_2+...+\beta_pX_p + \epsilon$

Estimating the Regression Coefficients

$\hat{y} = \hat{\beta_0} + \hat{\beta_1}x_1 + \hat{\beta_2}x_2+...+\hat{\beta_p}x_p$
• Choose the coefficient estimates that minimize the sum of squared residuals:

• $RSS = \sum_{i=1}^{n} [y_i - \hat{y_i}]^2 = \sum_{i=1}^{n} [y_i - \hat{\beta_0} - \hat{\beta_1}x_{i1} - \hat{\beta_2}x_{i2}-...-\hat{\beta_p}x_{ip}]^2$
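• Unlike the one-predictor case, there is no simple per-coefficient formula, but the full vector of estimates has the matrix form $\hat{\beta} = (X^TX)^{-1}X^Ty$. A minimal NumPy sketch on synthetic data (the variable names and true coefficients here are made up for illustration):

•   import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept column + p predictors
beta_true = np.array([2.0, 0.5, -1.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# least squares solution; lstsq is preferred over forming (X^T X)^{-1} explicitly
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to [2.0, 0.5, -1.0, 0.0]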
•   # multiple linear regression on TV, radio, and newspaper
model = MS(["TV", "radio", "newspaper"])
X = model.fit_transform(data)
display(X[:4])

y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())

"""{'coef': {'intercept': 2.9389,
'TV': 0.0458,
'radio': 0.1885,
'newspaper': -0.001},
'std err': {'intercept': 0.312,
'TV': 0.001,
'radio': 0.009,
'newspaper': 0.006},
't': {'intercept': 9.422, 'TV': 32.809, 'radio': 21.893, 'newspaper': -0.177},
'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.0, 'newspaper': 0.86}}
"""

# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}") # 1.69
print(f"R^2 = {results.rsquared:.3f}") # 0.897
print(f"F-statistic = {results.fvalue:.1f}") # 570.3
print(f"p-value = {results.f_pvalue:.3f}") # 0.000

mean_sales = np.mean(data.sales) # 14.0225
residual_standard_error = np.sqrt(results.mse_resid) # 1.69
error = residual_standard_error / mean_sales # 0.120
print(f"Error = {error:.3f}") # 0.120

• Correlation Matrix

•   # Correlation matrix
display(data.corr())
"""
            TV     radio   newspaper  sales
TV         1.0000  0.0548   0.0566    0.7822
radio      0.0548  1.0000   0.3541    0.5762
newspaper  0.0566  0.3541   1.0000    0.2283
sales      0.7822  0.5762   0.2283    1.0000
"""
# (radio row reconstructed; values rounded to 4 decimals)


• Multiple Linear Regression:
• $TV=0.0458,~ radio=0.1885,~ newspaper = -0.001$ with $p-values = 0,~ 0,~ 0.86$
• For a given amount of TV and newspaper advertising, spending an additional $1,000 on radio is associated with an increase in sales of about 189 units.
• Simple Linear Regression:
• $TV = 0.0475,~radio = 0.2025,~newspaper = 0.0547$ with $p$-values $= 0,~ 0,~ 0.001$
• The newspaper coefficient differs markedly between the two fits, as does its p-value.
• sales and newspaper show no relationship in the multiple regression, while the simple regression suggests one.
• The correlation matrix shows a correlation of 0.35 between radio and newspaper.
• In the simple regression, newspaper gets credit for the association between radio and sales.
• Other examples of such confounding: shark attacks vs. ice-cream sales (both rise in warm weather).

Some Important Questions

• Is at least one of the predictors $X_1, X_2, \ldots, X_p$ useful in predicting the response?
• Do all the predictors help to explain $Y$, or is only a subset of the predictors useful?
• How well does the model fit the data?
• Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

One: Is at least one of the predictors $X_1, X_2, \ldots, X_p$ useful in predicting the response?

• $H_0: \beta_1 = \beta_2 = \ldots = \beta_p = 0$
• $H_a:$ at least one $\beta_j$ is non-zero
• Compute the F-statistic:
• $F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)} \\ \text{where } TSS = \sum [y_i - \bar{y}]^2 \text{ and } RSS = \sum [y_i - \hat{y_i}]^2$
• If there is no relationship, the F-statistic will be close to 1. If $H_a$ is true, it will be greater than 1.
• The F-statistic for the multiple linear regression above is 570, which is large: at least one of the predictors must be related to sales. When $n$ is large, an F-statistic just a little larger than 1 may suffice to reject $H_0$; when $n$ is small, a larger F-statistic is needed.
• For any $n$ or $p$, the $p$-value of the F-statistic can be used. Since the $p$-value for the multiple linear regression is essentially 0, $H_0$ can be rejected.

•   # multiple linear regression
model = MS(["TV", "radio", "newspaper"])
X = model.fit_transform(data)
display(X[:4])

y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""{'coef': {'intercept': 2.9389, 'TV': 0.0458, 'radio': 0.1885, 'newspaper': -0.001},
'std err': {'intercept': 0.312, 'TV': 0.001, 'radio': 0.009, 'newspaper': 0.006},
't': {'intercept': 9.422, 'TV': 32.809, 'radio': 21.893, 'newspaper': -0.177},
'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.0, 'newspaper': 0.86}}
"""

# compute residual standard error, R^2, F-statistic, p-value
print(f"RSE = {np.sqrt(results.mse_resid):.2f}") # 1.69
print(f"R^2 = {results.rsquared:.3f}") # 0.897
print(f"F-statistic = {results.fvalue:.1f}") # 570.3
print(f"p-value = {results.f_pvalue:.3f}") # 0.000

Two: Deciding on Important Variables

• The first step in multiple regression is to check, via the F-statistic's p-value, whether any of the predictors is related to the response. We could look at individual p-values, but when $p$ is large we are likely to make false discoveries.
• Variable Selection
• There are $2^p$ models containing subsets of the $p$ variables, so trying them all is infeasible. Three classical approaches (a forward-selection sketch follows the list):
• Forward selection:
• Start with the null model (intercept but no predictors).
• Fit $p$ simple linear regressions and add the variable that results in the lowest RSS.
• Add the next variable that most reduces the RSS of the resulting multiple regression.
• Stop based on some rule.
• Backward selection:
• Start with all variables in the model and remove the variable with the largest p-value.
• Continue until a stopping rule is reached.
• Cannot be used if $p > n$.
• Mixed selection:
• A combination of forward and backward selection.
• Start with the null model and add the variable that provides the best fit, as in forward selection, and continue the process.
• If, as variables are added, the p-value of a variable already in the model rises above a threshold, remove that variable.
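• A minimal forward-selection sketch (reusing `data`, `sm`, and `MS` from above; the stopping rule here, "stop when the best candidate's p-value exceeds 0.05", is one arbitrary choice among many):

•   # forward selection: greedily add the predictor whose addition gives the lowest RSS
candidates = ["TV", "radio", "newspaper"]
selected = []
y = data['sales']
while candidates:
    # fit one model per remaining candidate and record its residual sum of squares
    rss = {}
    for var in candidates:
        X = MS(selected + [var]).fit_transform(data)
        rss[var] = sm.OLS(y, X).fit().ssr  # ssr = residual sum of squares
    best = min(rss, key=rss.get)
    # refit with the best candidate and use its p-value as the stopping rule
    X = MS(selected + [best]).fit_transform(data)
    fit = sm.OLS(y, X).fit()
    if fit.pvalues[best] > 0.05:
        break
    selected.append(best)
    candidates.remove(best)
print(selected)  # expected: ['TV', 'radio'] on the Advertising data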
• Backward selection example: in the full model fit above, newspaper has the largest p-value (0.86), so we remove it and refit:

•   # multiple linear regression after dropping newspaper (backward selection step)
model = MS(["TV", "radio"])
X = model.fit_transform(data)
display(X[:4])

y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""{'coef': {'intercept': 2.9211, 'TV': 0.0458, 'radio': 0.188},
'std err': {'intercept': 0.294, 'TV': 0.001, 'radio': 0.008},
't': {'intercept': 9.919, 'TV': 32.909, 'radio': 23.382},
'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.0}}
"""

# compute residual standard error, R^2, F-statistic, p-value
print(f"RSE = {np.sqrt(results.mse_resid):.2f}") # 1.68
print(f"R^2 = {results.rsquared:.3f}") # 0.897
print(f"F-statistic = {results.fvalue:.1f}") # 859.6
print(f"p-value = {results.f_pvalue:.3f}") # 0.000

mean_sales = np.mean(data.sales) # 14.0225
residual_standard_error = np.sqrt(results.mse_resid) # 1.68
error = residual_standard_error / mean_sales # 0.120
print(f"Error = {error:.3f}") # 0.120

Three: Model Fit

• Fraction of variance explained: $RSE$ and $R^2$
• In simple regression: $R^2 = r^2 = [Cor(X,Y)]^2$
• In multiple regression: $R^2 = [Cor(Y, \hat{Y})]^2$, the squared correlation between the response and the fitted values (a quick check follows below)
• $R^2$ close to $1$ means the model explains a large portion of the variance.
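• A quick numerical check that the multiple-regression $R^2$ equals the squared correlation between the response and the fitted values (a sketch reusing the TV + radio `results` from the block above):

•   fitted = results.fittedvalues
r = np.corrcoef(y, fitted)[0, 1]
print(r ** 2)            # ~0.897
print(results.rsquared)  # same value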
•   # multiple linear regression: TV, radio, newspaper
model = MS(["TV", "radio", "newspaper"])
X = model.fit_transform(data)
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))

print(f"RSE = {np.sqrt(results.mse_resid):.3f}") # 1.686
print(f"R^2 = {results.rsquared:.5f}") # 0.89721
print(f"F-statistic = {results.fvalue:.2f}") # 570.27
print(f"p-value = {results.f_pvalue:.4f}") # 0.0000

########################
# multiple linear regression: TV, radio
model = MS(["TV", "radio"])
X = model.fit_transform(data)
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))

# compute residual standard error, R^2, F-statistic, p-value
print(f"RSE = {np.sqrt(results.mse_resid):.3f}") # 1.681
print(f"R^2 = {results.rsquared:.5f}") # 0.89719
print(f"F-statistic = {results.fvalue:.2f}") # 859.62
print(f"p-value = {results.f_pvalue:.4f}") # 0.0000

• With "TV", "radio", "newspaper": $R^2 = 0.89721$, $RSE = 1.686$
• With "TV", "radio": $R^2 = 0.89719$, $RSE = 1.681$
• With "TV" only: $R^2 = 0.61188$, $RSE = 3.259$
• There is only a tiny increase in $R^2$ when newspaper is added to the TV + radio model, and a large increase when radio is added to the TV-only model.
• Why did RSE increase when newspaper was added, given that RSS must decrease?
• $RSE = \sqrt{\frac{RSS}{n-p-1}}$
• A model with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in $p$.

Four: Predictions

• Three sources of uncertainty:
• The least squares plane $\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2 +\ldots+ \hat{\beta_p}X_p$ is only an estimate of the true population regression plane $f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 +\ldots+ \beta_pX_p$. The inaccuracy of the coefficient estimates is part of the reducible error; we can compute a confidence interval to determine how close $\hat{Y}$ will be to $f(X)$.
• Model bias, due to the assumption of a linear model.
• Even if the true coefficients were known, the response could not be predicted exactly because of the irreducible error $\epsilon$. A prediction interval quantifies how much $Y$ varies around $\hat{Y}$.

•   # multiple linear regression: predictions with intervals
model = MS(["TV", "radio"])
X = model.fit_transform(data)
y = data['sales']
display(X[:4])
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))

new_df = pd.DataFrame({"TV": [100], "radio": [20]})
model = MS(["TV", "radio"])
newX = model.fit_transform(new_df)
display(newX)

new_predictions = results.get_prediction(newX)
print(new_predictions.predicted_mean) # [11.25646595]

# Confidence interval for the average response at these predictor values
print(new_predictions.conf_int(alpha=0.05)) # [[10.98525445 11.52767746]]

# Prediction intervals are computed by setting obs=True
print(new_predictions.conf_int(obs=True, alpha=0.05)) # [[ 7.92961607 14.58331584]]

• Suppose $100,000 is spent on TV advertising and $20,000 on radio advertising in each city.
• The 95% confidence interval is [10,985, 11,528] (units of sales).
• We interpret this to mean that 95% of intervals of this form will contain the true value of $f(X)$.
• A prediction interval can be used to quantify the uncertainty surrounding sales for a particular city.
• The 95% prediction interval is [7,930, 14,580].
• We interpret this to mean that 95% of intervals of this form will contain the true value of $Y$ for this city.
• Note that both intervals are centered at 11,256, but the prediction interval is substantially wider than the confidence interval, reflecting the increased uncertainty about sales for a given city compared with the average sales over many locations.

Other Considerations in the Regression Model

Qualitative Predictors

•   data = pd.read_csv("https://www.statlearning.com/s/Credit.csv")
print(data.shape)
print(data.dtypes.to_dict())
"""{'Income': dtype('float64'), 'Limit': dtype('int64'), 'Rating': dtype('int64'),
'Cards': dtype('int64'), 'Age': dtype('int64'), 'Education': dtype('int64'),
'Own': dtype('O'), 'Student': dtype('O'), 'Married': dtype('O'),
'Region': dtype('O'), 'Balance': dtype('int64')}"""
display(data.head())

Predictors with Only Two Levels

• Create a dummy variable taking the value 0 or 1 for the predictor (factor):
• $x_i= \begin{cases} 1 & \text{if owns house} \\ 0 & \text{otherwise} \end{cases}$
• $y_i = \beta_0 + \beta_1x_i + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if owns house} \\ \beta_0 + \epsilon_i & \text{otherwise} \end{cases}$

•   # regression of Balance on Own (dummy-encoded)
df = pd.read_csv("https://www.statlearning.com/s/Credit.csv")
factors = ["Own"]
for col in factors:
    df[col] = df[col].astype('category')

X = MS(factors).fit_transform(df)
display(X[:3])
y = df['Balance']
model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 509.8031, 'Own[Yes]': 19.7331},
'std err': {'intercept': 33.128, 'Own[Yes]': 46.051},
't': {'intercept': 15.389, 'Own[Yes]': 0.429},
'P>|t|': {'intercept': 0.0, 'Own[Yes]': 0.669}}
"""

• The average credit card debt for:
• Own[No] is $509.80 (the intercept)
• Own[Yes] is estimated to be $509.80 + $19.73 = $529.53
• However, the p-value for Own is high, so there is no statistical evidence of a difference in average credit card balance based on house ownership.

•   # same regression with the dummy coding reversed (Own[No] = 1)
df = pd.read_csv("https://www.statlearning.com/s/Credit.csv")
df['Own'] = df['Own'].map({'Yes': 0, 'No': 1})
factors = ["Own"]
for col in factors:
    df[col] = df[col].astype('category')

X = MS(factors).fit_transform(df)
display(X[:3])
y = df['Balance']
model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 529.5362, 'Own[1]': -19.7331},
'std err': {'intercept': 31.988, 'Own[1]': 46.051},
't': {'intercept': 16.554, 'Own[1]': -0.429},
'P>|t|': {'intercept': 0.0, 'Own[1]': 0.669}}
"""

• If Own[No] is encoded as 1, the coefficient values change.
• The average credit card debt for:
• owners [Yes(0)] is estimated to be $529.53 (the intercept)
• Owners[No(1)] are estimated to carry $19.73 less debt (the coefficient), for a total of $529.53 - $19.73 = $509.80
• This is the same as what we get when Own[Yes] is coded as 1.

•   # same regression with Own coded as 1 / -1
df = pd.read_csv("https://www.statlearning.com/s/Credit.csv")
df['Own'] = df['Own'].map({'Yes': 1, 'No': -1})
factors = ["Own"]
for col in factors:
    df[col] = df[col].astype('category')

X = MS(factors).fit_transform(df)
display(X[:3])
y = df['Balance']
model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 509.8031, 'Own[1]': 19.7331},
'std err': {'intercept': 33.128, 'Own[1]': 46.051},
't': {'intercept': 15.389, 'Own[1]': 0.429},
'P>|t|': {'intercept': 0.0, 'Own[1]': 0.669}}
"""

• If we encode the levels as 1 and -1 (still treated as a two-level factor here, so -1 becomes the baseline):
• The average credit card debt for non-owners Own[No(-1)] is estimated to be $509.80 (the intercept)
• Owners are estimated to carry $19.73 (the coefficient) in additional debt, for a total of $509.80 + $19.73 = $529.53
• It is important to note that the final predictions for the credit balances of owners and non-owners will be identical regardless of the coding scheme used. The only difference is in the way that the coefficients are interpreted.

Qualitative Predictors with More than Two Levels

•   # regression of Balance on Region (three levels: East, South, West)
factors = ["Region"]

for col in factors:
    df[col] = df[col].astype('category')

X = MS(factors).fit_transform(df)
display(X[:3])
y = df['Balance']

model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 531.0,
'Region[South]': -12.5025,
'Region[West]': -18.6863},
'std err': {'intercept': 46.319,
'Region[South]': 56.681,
'Region[West]': 65.021},
't': {'intercept': 11.464, 'Region[South]': -0.221, 'Region[West]': -0.287},
'P>|t|': {'intercept': 0.0, 'Region[South]': 0.826, 'Region[West]': 0.774}}
"""

• Balance for the baseline level (East) is 531 (the intercept).
• Region[South] is estimated to have a balance 12.50 lower than East; however, the p-value is large.
• Region[West] is estimated to have a balance 18.68 lower than East; however, the p-value is large.
• Since a regression on a single factor simply fits the per-level means, the coefficients can be sanity-checked against group averages (see the sketch below).
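• A small sketch of that sanity check, reusing the `df` from above:

•   # average Balance per Region; East should match the intercept (531.00),
# and South/West should match intercept + their coefficients
print(df.groupby('Region', observed=True)['Balance'].mean())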
•   # Adding an interaction term: sales ~ TV + radio + TV:radio
# (Advertising data; URL assumed to follow the statlearning.com pattern above)
df = pd.read_csv("https://www.statlearning.com/s/Advertising.csv", index_col=0)
response = "sales"

# in ISLP's ModelSpec, a tuple of column names denotes an interaction term
factors = ["TV", "radio", ("TV", "radio")]
X = MS(factors).fit_transform(df)
display(X[:3])
y = df[response]

model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 6.7502,
'TV': 0.0191,
'radio': 0.0289,
'TV:radio': 0.0011},
'std err': {'intercept': 0.248,
'TV': 0.002,
'radio': 0.009,
'TV:radio': 0.0}}
"""
# (radio and interaction rows reconstructed from ISLR Table 3.9)

•   # Polynomial Regression: mpg ~ poly(horsepower, 2)
# (Auto data; URL assumed to follow the statlearning.com pattern above)
df = pd.read_csv("https://www.statlearning.com/s/Auto.csv")

# horsepower contains '?' placeholders; drop them and convert to float
print(df[df.horsepower == "?"].shape)
df = df[df.horsepower != "?"].copy()
df.horsepower = df.horsepower.astype(float)

predictors = [poly("horsepower", degree=2)]
response = "mpg"

X = MS(predictors).fit_transform(df)
y = df[response]

model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())

"""
{'coef': {'intercept': 23.4459,
'poly(horsepower, degree=2)[0]': -120.1377,
'poly(horsepower, degree=2)[1]': 44.0895},
'std err': {'intercept': 0.221,
'poly(horsepower, degree=2)[0]': 4.374,
'poly(horsepower, degree=2)[1]': 4.374},
't': {'intercept': 106.13,
'poly(horsepower, degree=2)[0]': -27.467,
'poly(horsepower, degree=2)[1]': 10.08},
'P>|t|': {'intercept': 0.0,
'poly(horsepower, degree=2)[0]': 0.0,
'poly(horsepower, degree=2)[1]': 0.0}}
"""

