Chapter 3: Linear Regression
Predicting a quantitative response
- Advertising data
- Sales (in thousands of units) is modeled as a function of the advertising budgets for TV, radio, and newspaper media.
- Questions
- Is there a relationship between advertising budget and sales?
- How strong is the relationship between advertising budget and sales?
- Which media are associated with sales?
- How large is the association between each medium and sales?
- How accurately can we predict future sales?
- Is the relationship linear?
- Is there synergy among the advertising media?
Simple Linear Regression: Single Predictor
Response $Y$ based on a single predictor $X$:
- \[Y \approx \beta_0 + \beta_1X\]
$\beta_0$ (intercept) and $\beta_1$ (slope) are model coefficients or parameters.
Use training data to estimate the coefficients and form predictions:
- \[\hat{Y} \approx \hat{\beta_0} + \hat{\beta_1}X\]
Estimating the Coefficients
Estimate the coefficients by minimizing the least squares criterion.
Residual: $e_i = y_i−\hat{y}_i$
Residual Sum of Squares: $RSS = e_1^2 + e_2^2 + … + e_n^2$
- \[RSS = [y_1− \hat{\beta_0}− \hat{\beta_1}x_1]^2 + [y_2− \hat{\beta_0}− \hat{\beta_1}x_2]^2 + ... + [y_n− \hat{\beta_0}− \hat{\beta_1}x_n]^2\]
Using calculus:
- \[\hat{\beta_1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \\ \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\\ \bar{x} = \frac{\sum{x}}{n}, ~\text{and}~ \bar{y} = \frac{\sum{y}}{n}\]
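A quick numerical check of these formulas (a minimal sketch; it assumes the Advertising data loaded as `data`, exactly as in the lab code later in this section):
import numpy as np
import pandas as pd

data = pd.read_csv("https://www.statlearning.com/s/Advertising.csv", index_col=0)
x, y = data["TV"].values, data["sales"].values
x_bar, y_bar = x.mean(), y.mean()
# closed-form least squares estimates
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
print(f"beta0_hat = {beta0_hat:.4f}, beta1_hat = {beta1_hat:.4f}")
# should agree with the statsmodels fit below: intercept ~ 7.0326, TV slope ~ 0.0475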
Assessing the Accuracy of the Coefficient Estimates
True relationship:
- \[Y = \beta_0 + \beta_1 X + \epsilon\]
The analogy with estimating the population mean $\mu$ from the sample mean $\hat{\mu}$
An estimate $\hat{\mu}$ from one data set may under- or overestimate $\mu$, but averaged over a large number of data sets, $\hat{\mu}$ equals $\mu$: the estimator is unbiased. Similarly, the least squares coefficient estimates are unbiased: averaged over many data sets they equal the true coefficients.
Standard Error, $SE(\hat{\mu})$: average amount $\hat{\mu}$ differs from the actual value of $\mu$
- \[Var(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}\]
- SE reduces as $n$ increases.
Similarly
- \[SE(\hat{\beta_0})^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right] \\ SE(\hat{\beta_1})^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}; ~\sigma^2 = Var(\epsilon)\]
In practice $\sigma$ is unknown and is estimated by the Residual Standard Error: $RSE = \sqrt{\frac{RSS}{n-2}}$
Standard errors can be used to compute confidence intervals. For linear regression, the 95% confidence intervals approximately take the form:
- \[\hat{\beta_1} \pm 2\times SE(\hat{\beta_1}); \hat{\beta_0} \pm 2\times SE(\hat{\beta_0})\]
- There is approximately a 95% chance that these intervals contain the true values of $\beta_1$ and $\beta_0$, respectively.
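A small check of the $\pm 2 \times SE$ rule against statsmodels' exact intervals (a sketch; it assumes `results` holds the TV-only fit produced in the lab code below):
beta1_hat = results.params["TV"]  # slope estimate
se_beta1 = results.bse["TV"]      # its standard error
print(f"approx. 95% CI for beta1: [{beta1_hat - 2*se_beta1:.4f}, {beta1_hat + 2*se_beta1:.4f}]")
# exact intervals based on the t distribution
print(results.conf_int(alpha=0.05))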
Usage of Standard Error for Hypothesis Test
- \[H_0: \text{There is no relationship between } X \text{ and } Y ~(\beta_1 = 0)\]
- \[H_a: \text{There is some relationship between } X \text{ and } Y ~(\beta_1 \ne 0)\]
Compute $t-statistic$: number of standard deviations $\hat{\beta_1}$ is away from $0$
- \[t = \frac{\hat{\beta_1}-0}{SE(\hat{\beta_1})}\]
- From the $t-statistic$ we compute a $p-value$: a small $p-value$ indicates an association, and we reject the null hypothesis.
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize, poly)
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# load the Advertising data
data = pd.read_csv("https://www.statlearning.com/s/Advertising.csv", index_col=0)
print(data.shape, data.columns)
display(data.head())
_ = pd.plotting.scatter_matrix(data)

# simple linear regression of sales on TV
model = MS(['TV'])
X = model.fit_transform(data)
print(X[:4])
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
            Coefficient   Std Error   t-statistic   p-value
Intercept   7.0326        0.458       15.360        0.0
TV          0.0475        0.003       17.668        0.0
- Conclusion:
- $\beta_0 \ne 0; \beta_1 \ne 0$
Assessing the Accuracy of the Model
Once the null hypothesis is rejected, we assess how well the model fits the data.
Linear Regression
- $residual~ standard~ error (RSE)$
- $R^2 statistic$
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}")  # 3.26
print(f"R^2 = {results.rsquared:.3f}")  # 0.612
# F-statistic indicates whether the regression model provides a better fit
print(f"F-statistic = {results.fvalue:.1f}")  # 312.1
print(f"p-value = {results.f_pvalue:.3f}")  # 0.000
$Residual~ Standard~ Error~ (RSE)$: a measure of lack of fit; it is an estimate of the standard deviation of $\epsilon$
- \[RSE = \sqrt{\frac{RSS}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y_i})^2}{n-2}}\]
- \[RSS = \sum_{i=1}^{n} (y_i - \hat{y_i})^2\]
$RSE = 3.26$
The model can be off by about 3.26 (thousands of units) on average.
mean_sales = np.mean(data.sales)  # 14.0225
residual_standard_error = np.sqrt(results.mse_resid)  # 3.26
error = residual_standard_error / mean_sales  # 0.232
print(f"Error = {error:.3f}")
- Thus, the percentage error is roughly 23%
$R^2 statistic$: proportion of variance explained
A measure of the linear relationship between $X$ and $Y$
An RSE of 3.26 is in the units of $Y$, so on its own it does not tell us whether the fit is good or bad.
$R^2$ is always between $0$ and $1$ and is independent of the scale of $Y$. It is easier to interpret than RSE, but it can still be hard to say what a good value of $R^2$ is. In physics, where the data are expected to come from a nearly linear model, $R^2$ close to $1$ is expected; in marketing, psychology, or biology, where the linear model is only a rough approximation and there is measurement error, a value of $0.1$ may be realistic.
- \[R^2 = \frac{TSS-RSS}{TSS} = 1 - \frac{RSS}{TSS} \\ \text{where } RSS = \sum(y_i - \hat{y_i})^2 ~and~ TSS = \sum (y_i - \bar{y})^2\]
TSS is the total variance in $Y$
Correlation $r = Cor(X,Y)$ also measures the linear relationship between $X$ and $Y$; in simple linear regression it can be shown that $r^2 = R^2$ (the squared correlation equals the R-squared).
- \[r = Cor(X,Y) = \frac{\sum[(x_i - \bar{x})(y_i - \bar{y})]}{\sqrt{\sum[x_i-\bar{x}]^2} \sqrt{\sum[y_i-\bar{y}]^2}}\]
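A quick check that $r^2 = R^2$ in simple regression (a sketch; assumes `data` and the TV-only `results` from the surrounding code):
import numpy as np

r = np.corrcoef(data["TV"], data["sales"])[0, 1]  # sample correlation between TV and sales
print(f"r^2 = {r**2:.3f}")              # ~0.612
print(f"R^2 = {results.rsquared:.3f}")  # ~0.612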
# Regression model with TV as the only predictor
model = MS(['TV'])
X = model.fit_transform(data)
display(X[:4])
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""
{'coef': {'intercept': 7.0326, 'TV': 0.0475},
 'std err': {'intercept': 0.458, 'TV': 0.003},
 't': {'intercept': 15.36, 'TV': 17.668},
 'P>|t|': {'intercept': 0.0, 'TV': 0.0}}
"""
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}")  # 3.26
print(f"R^2 = {results.rsquared:.3f}")  # 0.612
print(f"F-statistic = {results.fvalue:.1f}")  # 312.1
print(f"p-value = {results.f_pvalue:.3f}")  # 0.000
mean_sales = np.mean(data.sales)  # 14.0225
residual_standard_error = np.sqrt(results.mse_resid)  # 3.26
error = residual_standard_error / mean_sales  # 0.232
print(f"Error = {error:.3f}")
# Regression model with radio as the only predictor
model = MS(['radio'])
X = model.fit_transform(data)
display(X[:4])
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""
{'coef': {'intercept': 9.3116, 'radio': 0.2025},
 'std err': {'intercept': 0.563, 'radio': 0.02},
 't': {'intercept': 16.542, 'radio': 9.921},
 'P>|t|': {'intercept': 0.0, 'radio': 0.0}}
"""
print(results.params)  # Intercept = 9.31, radio = 0.202
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}")  # 4.27
print(f"R^2 = {results.rsquared:.3f}")  # 0.332
print(f"F-statistic = {results.fvalue:.1f}")  # 98.4
print(f"p-value = {results.f_pvalue:.3f}")  # 0.000
mean_sales = np.mean(data.sales)  # 14.0225
residual_standard_error = np.sqrt(results.mse_resid)  # 4.27
error = residual_standard_error / mean_sales  # 0.305
print(f"Error = {error:.3f}")
# Regression model with newspaper as the only predictor
model = MS(['newspaper'])
X = model.fit_transform(data)
display(X[:4])
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""
{'coef': {'intercept': 12.3514, 'newspaper': 0.0547},
 'std err': {'intercept': 0.621, 'newspaper': 0.017},
 't': {'intercept': 19.876, 'newspaper': 3.3},
 'P>|t|': {'intercept': 0.0, 'newspaper': 0.001}}
"""
print(results.params)  # Intercept = 12.35, newspaper = 0.054
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}")  # 5.09
print(f"R^2 = {results.rsquared:.3f}")  # 0.052
print(f"F-statistic = {results.fvalue:.1f}")  # 10.9
print(f"p-value = {results.f_pvalue:.3f}")  # 0.001
mean_sales = np.mean(data.sales)  # 14.0225
residual_standard_error = np.sqrt(results.mse_resid)  # 5.09
error = residual_standard_error / mean_sales  # 0.363
print(f"Error = {error:.3f}")  # 0.363
Multiple Linear Regression
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2+...+\beta_pX_p + \epsilon\]
Estimating the Regression Coefficients
\[\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2+...+\hat{\beta_p}X_p\]
Choose the coefficients that minimize the sum of squared residuals:
- \[RSS = \sum_{i=1}^{n} [y_i - \hat{y_i}]^2 = \sum_{i=1}^{n} [y_i - \hat{\beta_0} - \hat{\beta_1}x_{i1} - \hat{\beta_2}x_{i2}-...-\hat{\beta_p}x_{ip}]^2\]
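In matrix form the least squares solution is $\hat{\beta} = (X^TX)^{-1}X^Ty$. A minimal sketch solving it directly with NumPy (assumes the Advertising data in `data`); the coefficients should match the statsmodels fit below:
import numpy as np

X_mat = np.column_stack([np.ones(len(data)), data["TV"], data["radio"], data["newspaper"]])
y_vec = data["sales"].values
# numerically stable least squares solve of X_mat @ beta = y_vec
beta_hat, *_ = np.linalg.lstsq(X_mat, y_vec, rcond=None)
print(beta_hat)  # intercept, TV, radio, newspaper ~ 2.9389, 0.0458, 0.1885, -0.001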
# multiple linear regression
model = MS(["TV", "radio", "newspaper"])
X = model.fit_transform(data)
display(X[:4])
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""
{'coef': {'intercept': 2.9389, 'TV': 0.0458, 'radio': 0.1885, 'newspaper': -0.001},
 'std err': {'intercept': 0.312, 'TV': 0.001, 'radio': 0.009, 'newspaper': 0.006},
 't': {'intercept': 9.422, 'TV': 32.809, 'radio': 21.893, 'newspaper': -0.177},
 'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.0, 'newspaper': 0.86}}
"""
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}")  # 1.69
print(f"R^2 = {results.rsquared:.3f}")  # 0.897
print(f"F-statistic = {results.fvalue:.1f}")  # 570.3
print(f"p-value = {results.f_pvalue:.3f}")  # 0.000
mean_sales = np.mean(data.sales)  # 14.0225
residual_standard_error = np.sqrt(results.mse_resid)  # 1.69
error = residual_standard_error / mean_sales  # 0.120
print(f"Error = {error:.3f}")  # 0.120
Correlation Matrix
# Correlation matrix
display(data.corr())
display(data.corr().to_dict())
"""
{'TV': {'TV': 1.0, 'radio': 0.05480866446583008, 'newspaper': 0.05664787496505698, 'sales': 0.7822244248616067},
 'radio': {'TV': 0.05480866446583008, 'radio': 1.0, 'newspaper': 0.35410375076117534, 'sales': 0.5762225745710555},
 'newspaper': {'TV': 0.05664787496505698, 'radio': 0.35410375076117534, 'newspaper': 1.0, 'sales': 0.22829902637616545},
 'sales': {'TV': 0.7822244248616067, 'radio': 0.5762225745710555, 'newspaper': 0.22829902637616545, 'sales': 1.0}}
"""
Comments
- Multiple Linear Regression:
- $TV=0.0458,~ radio=0.1885,~ newspaper = -0.001$ with $p-values = 0,~ 0,~ 0.86$
- For a given TV and newspaper budget, spending an additional $$1000$ on radio advertising is associated with an increase in sales of about $189$ units.
- Simple Linear Regression:
- $TV = 0.0475,~radio=0.2025,~newspaper=0.0547$ with $p-values = 0,~ 0,~ 0.001 $
- The newspaper coefficient is quite different between the two fits, and so is its $p-value$.
- sales and newspaper show no relationship in the multiple regression, while the simple regression suggests one
- The correlation matrix shows a correlation of $0.35$ between radio and newspaper
- In the simple regression, newspaper gets "credit" for the association between radio and sales
- Other examples of this phenomenon: shark attacks vs. ice-cream sales (both driven by a confounder such as temperature)
Some Important Questions
- Is at least one of the predictors $X_1, X_2, \ldots, X_p$ useful in predicting the response?
- Do all the predictors help to explain $Y$, or is only a subset of the predictors useful?
- How well does the model fit the data?
- Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
Is at least one of the predictors $X_1, X_2, \ldots, X_p$ useful in predicting the response?
$H_0: \beta_1 = \beta_2 = … = \beta_p = 0$
$H_a:$ at least one $\beta_j$ is non zero
Compute $F-statistic$:
- \[F = \frac{\frac{TSS - RSS}{p}}{\frac{RSS}{n-p-1}} \\ \text{where } TSS = \sum [y_i - \bar{y}]^2 \text{ and } RSS = \sum [y_i - \hat{y_i}]^2\]
- If there is no relationship between the response and the predictors, the $F-statistic$ will be close to $1$; if $H_a$ is true, it will be greater than $1$.
As computed below, the $F-statistic$ for the multiple linear regression is about $570$, which is far larger than $1$, so at least one of the predictors is related to sales. When $n$ is large, an $F-statistic$ only slightly larger than $1$ may be enough evidence for $H_a$; when $n$ is small, a larger $F-statistic$ is needed to reject $H_0$ in favor of $H_a$.
For any $n$ and $p$, the $p-value$ associated with the $F-statistic$ can be used. Since the $p-value$ for the multiple linear regression is essentially $0$, $H_0$ can be rejected.
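Computing the $F-statistic$ directly from $TSS$ and $RSS$ (a sketch; assumes `results` is the three-predictor fit from the code below and `data` is the Advertising data):
import numpy as np

n, p = len(data), 3
rss = np.sum(results.resid ** 2)                           # residual sum of squares
tss = np.sum((data["sales"] - data["sales"].mean()) ** 2)  # total sum of squares
F = ((tss - rss) / p) / (rss / (n - p - 1))
print(f"F (by hand) = {F:.1f}")                   # ~570.3
print(f"F (statsmodels) = {results.fvalue:.1f}")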
# multiple linear regression
model = MS(["TV", "radio", "newspaper"])
X = model.fit_transform(data)
display(X[:4])
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""{'coef': {'intercept': 2.9389,
'TV': 0.0458,
'radio': 0.1885,
'newspaper': -0.001},
'std err': {'intercept': 0.312,
'TV': 0.001,
'radio': 0.009,
'newspaper': 0.006},
't': {'intercept': 9.422, 'TV': 32.809, 'radio': 21.893, 'newspaper': -0.177},
'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.0, 'newspaper': 0.86}}
"""
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}") # 1.69
print(f"R^2 = {results.rsquared:.3f}") # 0.897
print(f"F-statistic = {results.fvalue:.1f}") # 570.3
print(f"p-value = {results.f_pvalue:.3f}") # 0.000
Two: Deciding on Important Variables
- The first step in multiple regression is to check whether any of the predictors are related to the response, using the $F-statistic$ and its $p-value$. We could instead look at the individual $p-values$, but when $p$ (the number of predictors) is large we are likely to make false discoveries.
- $Variable~ Selection$
- There are $2^p$ models containing subsets of the $p$ variables, so trying them all is usually infeasible. Three classical approaches:
- Forward Selection:
- Start with a null model (intercept but no predictor)
- Fit $p$ simple linear regressions and add the variable that results in the lowest $RSS$
- At each subsequent step, add to the multiple regression the variable that most reduces the $RSS$ (a minimal sketch follows this list)
- Stop based on some rule
- Backward selection:
- Start with all variables in the model and remove the variable with the largest $p-value$
- Continue until a stopping rule is reached
- cannot be used if $p>n$
- Mixed selection:
- combination of forward and backward
- Start with the null model and add the variable that provides the best fit (as in forward selection); continue the process
- If the $p-value$ of a variable already in the model rises above a threshold as new variables are added, remove that variable
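A minimal forward-selection sketch by RSS, as referenced above. This is only an illustration, not a library routine; it assumes the Advertising data in `data` and the imports from the lab code:
import statsmodels.api as sm
from ISLP.models import ModelSpec as MS

def rss_for(predictors):
    # fit an OLS model on the given predictors and return its residual sum of squares
    X = MS(list(predictors)).fit_transform(data)
    return sm.OLS(data["sales"], X).fit().ssr

remaining = ["TV", "radio", "newspaper"]
selected = []
while remaining:
    best = min(remaining, key=lambda v: rss_for(selected + [v]))
    selected.append(best)
    remaining.remove(best)
    print(selected, f"RSS = {rss_for(selected):.1f}")
# a real implementation would stop via a rule such as adjusted R^2, AIC/BIC, or
# cross-validation rather than adding every variable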
Backward Selection
# multiple linear regression
model = MS(["TV", "radio", "newspaper"])
X = model.fit_transform(data)
display(X[:4])
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""
{'coef': {'intercept': 2.9389, 'TV': 0.0458, 'radio': 0.1885, 'newspaper': -0.001},
 'std err': {'intercept': 0.312, 'TV': 0.001, 'radio': 0.009, 'newspaper': 0.006},
 't': {'intercept': 9.422, 'TV': 32.809, 'radio': 21.893, 'newspaper': -0.177},
 'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.0, 'newspaper': 0.86}}
"""
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}")  # 1.69
print(f"R^2 = {results.rsquared:.3f}")  # 0.897
print(f"F-statistic = {results.fvalue:.1f}")  # 570.3
print(f"p-value = {results.f_pvalue:.3f}")  # 0.000
mean_sales = np.mean(data.sales)  # 14.0225
residual_standard_error = np.sqrt(results.mse_resid)  # 1.69
error = residual_standard_error / mean_sales  # 0.120
print(f"Error = {error:.3f}")  # 0.120
# multiple linear regression with newspaper removed (backward selection step)
model = MS(["TV", "radio"])
X = model.fit_transform(data)
display(X[:4])
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())
"""
{'coef': {'intercept': 2.9211, 'TV': 0.0458, 'radio': 0.188},
 'std err': {'intercept': 0.294, 'TV': 0.001, 'radio': 0.008},
 't': {'intercept': 9.919, 'TV': 32.909, 'radio': 23.382},
 'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.0}}
"""
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}")  # 1.68
print(f"R^2 = {results.rsquared:.3f}")  # 0.897
print(f"F-statistic = {results.fvalue:.1f}")  # 859.6
print(f"p-value = {results.f_pvalue:.3f}")  # 0.000
mean_sales = np.mean(data.sales)  # 14.0225
residual_standard_error = np.sqrt(results.mse_resid)  # 1.68
error = residual_standard_error / mean_sales  # 0.120
print(f"Error = {error:.3f}")  # 0.120
Three: Model Fit
- Fraction of Variance Explained: $RSE$ and $R^2$
- In Simple Regression: $R^2 = r^2 = [Cor(X,Y)]^2$
- In Multiple Regression: $R^2 = [Cor(Y, \hat{Y})]^2$
- $R^2$ close to $1$ means the model explains a large portion of the variance.
# multiple linear regression
model = MS(["TV", "radio", "newspaper"])
X = model.fit_transform(data)
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
print(f"RSE = {np.sqrt(results.mse_resid):.3f}") # 1.686
print(f"R^2 = {results.rsquared:.5f}") # 0.89721
print(f"F-statistic = {results.fvalue:.2f}") # 570.27
print(f"p-value = {results.f_pvalue:.4f}") # 0.0000
mean_sales = np.mean(data.sales) # 14.0225
residual_standard_error = np.sqrt(results.mse_resid) # 1.69
error = residual_standard_error / mean_sales # 0.120
########################
# multiple linear regression
model = MS(["TV", "radio"])
X = model.fit_transform(data)
y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.3f}") # 1.681
print(f"R^2 = {results.rsquared:.5f}") # 0.89719
print(f"F-statistic = {results.fvalue:.2f}") # 859.62
print(f"p-value = {results.f_pvalue:.4f}") # 0.0000
mean_sales = np.mean(data.sales) # 14.0225
residual_standard_error = np.sqrt(results.mse_resid) # 1.69
error = residual_standard_error / mean_sales # 0.120
When “TV”, “radio”, “newspaper”, $R^2 = 0.89721$, $RSE = 1.686$
When “TV”, “radio”, $R^2 = 0.89719$, $RSE = 1.681$
When “TV”, $R^2 = 0.61188$, $RSE = 3.259$
There is a small increase when newspaper is added and a significant increase when radio is added.
Why does RSE increase when newspaper is added, given that RSS must decrease?
- \[RSE = \sqrt{\frac{RSS}{n-p-1}}\]
- Models with more variables can have higher RSE if the decrease in RSS is small relative to increase in $p$
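Computing RSE by hand for the two- and three-predictor fits illustrates this (a sketch; assumes the Advertising data in `data` and the imports from the lab code):
import numpy as np
import statsmodels.api as sm
from ISLP.models import ModelSpec as MS

for predictors in [["TV", "radio"], ["TV", "radio", "newspaper"]]:
    X = MS(predictors).fit_transform(data)
    fit = sm.OLS(data["sales"], X).fit()
    n, p = len(data), len(predictors)
    rse = np.sqrt(fit.ssr / (n - p - 1))   # RSE = sqrt(RSS / (n - p - 1))
    print(predictors, f"RSE = {rse:.3f}")  # ~1.681 and ~1.686, as reported above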
Four: Predictions
- Three uncertainties
- Least square plane $\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2 +…+ \hat{\beta_p}X_p$ is only an estimate for the true population regression plane $Y = \beta_0 + \beta_1X_1 + \beta_2X_2 +…+ \beta_pX_p$
- The inaccuracy in the coefficient estimates is related to the $reducible~error$. We can compute a $confidence~interval$ to determine how close $\hat{Y}$ will be to $f(X)$
- Model bias due to the assumption of the linear model
- Even if the true coefficients were known, the response could not be predicted exactly because of the irreducible error $\epsilon$
- We can use a $prediction~interval$ to quantify how much $Y$ varies from $\hat{Y}$
# multiple linear regression
model = MS(["TV", "radio"])
X = model.fit_transform(data)
y = data['sales']
display(X[:4])
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
new_df = pd.DataFrame({"TV":[100], "radio":[20]})
model = MS(["TV", "radio"])
newX = model.fit_transform(new_df)
display(newX)
new_predictions = results.get_prediction(newX);
print(new_predictions.predicted_mean) # [11.25646595]
# We can produce confidence intervals for the predicted values
print(new_predictions.conf_int(alpha=0.05)) # [[10.98525445 11.52767746]]
# Prediction intervals are computed by setting obs=True:
print(new_predictions.conf_int(obs=True, alpha=0.05)) # [[ 7.92961607 14.58331584]]
- Suppose $100,000 is spent on TV advertising and $20,000 is spent on radio advertising in each city
- the 95 % confidence interval is [10,985, 11,528].
- We interpret this to mean that 95 % of intervals of this form will contain the true value of f(X).
- Prediction interval can be used to quantify the prediction uncertainty surrounding sales for a particular city
- 95 % prediction interval is [7,930, 14,580].
- We interpret this to mean that 95 % of intervals of this form will contain the true value of Y for this city.
- Note that both intervals are centered at 11,256, but that the prediction interval is substantially wider than the confidence interval, reflecting the increased uncertainty about sales for a given city in comparison to the average sales over many locations.
Other Considerations in the Regression Model
Qualitative Predictors
data = pd.read_csv("https://www.statlearning.com/s/Credit.csv")
print(data.shape)
print(data.dtypes.to_dict())
"""{'Income': dtype('float64'),
'Limit': dtype('int64'),
'Rating': dtype('int64'),
'Cards': dtype('int64'),
'Age': dtype('int64'),
'Education': dtype('int64'),
'Own': dtype('O'),
'Student': dtype('O'),
'Married': dtype('O'),
'Region': dtype('O'),
'Balance': dtype('int64')}"""
display(data.head())
Predictors with Only Two Levels
Create a $dummy ~variable$ with $0$ or $1$ for $predictor$ ($factor$)
- \[x_i= \begin{cases} 1 & \text{if own house} \\ 0 & \text{otherwise} \end{cases}\]
- \[y_i = \beta_0 + \beta_1x_i + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if own house} \\ \beta_0 + \epsilon_i & \text{otherwise} \end{cases}\]
# simple linear regression
df = pd.read_csv("https://www.statlearning.com/s/Credit.csv")
factors = ["Own"]
for col in factors:
df[col] = df[col].astype('category')
X = MS(factors).fit_transform(df)
display(X[:3])
y = df['Balance']
model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 509.8031, 'Own[Yes]': 19.7331},
'std err': {'intercept': 33.128, 'Own[Yes]': 46.051},
't': {'intercept': 15.389, 'Own[Yes]': 0.429},
'P>|t|': {'intercept': 0.0, 'Own[Yes]': 0.669}}
"""
- The average credit card balance for
- Own[No] is $$509.80$ (intercept)
- Own[Yes] is estimated to be $$509.80 + $19.73 = $529.53$
- However, the $p-value$ for $Own$ is high, so there is no statistical evidence of a difference in average credit card balance between owners and non-owners.
# simple linear regression
df = pd.read_csv("https://www.statlearning.com/s/Credit.csv")
df['Own'] = df['Own'].map({'Yes': 0, 'No': 1})
factors = ["Own"]
for col in factors:
df[col] = df[col].astype('category')
X = MS(factors).fit_transform(df)
display(X[:3])
y = df['Balance']
model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 529.5362, 'Own[1]': -19.7331},
'std err': {'intercept': 31.988, 'Own[1]': 46.051},
't': {'intercept': 16.554, 'Own[1]': -0.429},
'P>|t|': {'intercept': 0.0, 'Own[1]': 0.669}}
"""
- If Own[No] is encoded as $1$, then values will be different.
- The average credit card debt for
- owners [Yes(0)] is estimated to be $$529.53$ (intercept)
- Owners[No(1)] are estimated to carry $$-19.73$ (coefficient) in additional debt for a total of $$529.53 - $19.73 = $509.80$
- This is the same as what we get when Own[Yes] was coded as $1$.
# simple linear regression
df = pd.read_csv("https://www.statlearning.com/s/Credit.csv")
df['Own'] = df['Own'].map({'Yes': 1, 'No': -1})
factors = ["Own"]
for col in factors:
df[col] = df[col].astype('category')
X = MS(factors).fit_transform(df)
display(X[:3])
y = df['Balance']
model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 509.8031, 'Own[1]': 19.7331},
'std err': {'intercept': 33.128, 'Own[1]': 46.051},
't': {'intercept': 15.389, 'Own[1]': 0.429},
'P>|t|': {'intercept': 0.0, 'Own[1]': 0.669}}
"""
- If we encode Own as $1$ (Yes) and $-1$ (No)
- The average credit card debt for
- non-owners Own[No(-1)] is estimated to be $$509.80$ (intercept)
- Owners are estimated to carry $$19.73$ (coefficient) in additional debt, for a total of $$509.80 + $19.73 = $529.53$
- It is important to note that the final predictions for the credit balances of owners and non-owners will be identical regardless of the coding scheme used. The only difference is in the way that the coefficients are interpreted.
Qualitative Predictors with More than Two Levels
# simple linear regression
df = pd.read_csv("https://www.statlearning.com/s/Credit.csv")
#print(df.columns)
factors = ["Region"]
for col in factors:
df[col] = df[col].astype('category')
X = MS(factors).fit_transform(df)
display(X[:3])
y = df['Balance']
model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 531.0,
'Region[South]': -12.5025,
'Region[West]': -18.6863},
'std err': {'intercept': 46.319,
'Region[South]': 56.681,
'Region[West]': 65.021},
't': {'intercept': 11.464, 'Region[South]': -0.221, 'Region[West]': -0.287},
'P>|t|': {'intercept': 0.0, 'Region[South]': 0.826, 'Region[West]': 0.774}}
"""
- Balance for Region[East] = 531 (intercept)
- Region[South] is estimated to have a balance $12.50$ lower than East; however, the $p-value$ is large.
- Region[West] is estimated to have a balance $18.68$ lower than East; however, the $p-value$ is large.
Adding Interaction
# Adding Interaction
df = pd.read_csv("https://www.statlearning.com/s/Advertising.csv", index_col=0)
#print(df.columns)
factors = ["TV", "radio", ("TV", "radio")]
response = "sales"
X = MS(factors).fit_transform(df)
display(X[:3])
y = df[response]
model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 6.7502,
'TV': 0.0191,
'radio': 0.0289,
'TV:radio': 0.0011},
'std err': {'intercept': 0.248,
'TV': 0.002,
'radio': 0.009,
'TV:radio': 5.24e-05},
't': {'intercept': 27.233, 'TV': 12.699, 'radio': 3.241, 'TV:radio': 20.727},
'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.001, 'TV:radio': 0.0}}
"""
Polynomial Regression
# Polynomial Regression
#df = pd.read_csv("https://www.statlearning.com/s/Auto.csv")
"""
print(df[df.horsepower=="?"].shape)
df = df[df.horsepower!="?"]
df.horsepower = df.horsepower.astype(float)
"""
df = load_data('Auto')
predictors = [poly("horsepower", degree=2)]
response = "mpg"
X = MS(predictors).fit_transform(df)
y = df[response]
model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 23.4459,
'poly(horsepower, degree=2)[0]': -120.1377,
'poly(horsepower, degree=2)[1]': 44.0895},
'std err': {'intercept': 0.221,
'poly(horsepower, degree=2)[0]': 4.374,
'poly(horsepower, degree=2)[1]': 4.374},
't': {'intercept': 106.13,
'poly(horsepower, degree=2)[0]': -27.467,
'poly(horsepower, degree=2)[1]': 10.08},
'P>|t|': {'intercept': 0.0,
'poly(horsepower, degree=2)[0]': 0.0,
'poly(horsepower, degree=2)[1]': 0.0}}
"""