Chapter 3: Linear Regression

Predicting a quantitative response

  • Advertising data
    • Sales (in thousands of units) as a function of advertising budgets for TV, radio, and newspaper media.
    • Questions
      • Is there a relationship between advertising budget and sales?
      • How strong is the relationship between advertising budget and sales?
      • Which media are associated with sales?
      • How large is the association between each medium and sales?
      • How accurately can we predict future sales?
      • Is the relationship linear?
      • Is there synergy among the advertising media?

Simple Linear Regression: Single Predictor

  • Response $Y$ based on a single predictor $X$:

    • \[Y \approx \beta_0 + \beta_1X\]
    • $\beta_0$ (intercept) and $\beta_1$ (slope) are model coefficients or parameters.

    • Use training data to produce coefficient estimates and predict:

    • \[\hat{y} = \hat{\beta_0} + \hat{\beta_1}x\]

Estimating the Coefficients

  • Coefficients are estimated by the least squares criterion: choose $\hat{\beta_0}$ and $\hat{\beta_1}$ to minimize the residual sum of squares (RSS)

  • Residual: $e_i = y_i−\hat{y}_i$

  • Residual Sum of Squares: $RSS = e_1^2 + e_2^2 + … + e_n^2$

  • \[RSS = [y_1− \hat{\beta_0}− \hat{\beta_1}x_1]^2 + [y_2− \hat{\beta_0}− \hat{\beta_1}x_2]^2 + ... + [y_n− \hat{\beta_0}− \hat{\beta_1}x_n]^2\]
  • Using calculus:

    • \[\hat{\beta_1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \\ \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\\ \bar{x} = \frac{\sum{x}}{n}, ~\text{and}~ \bar{y} = \frac{\sum{y}}{n}\]
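  • As a quick check of these formulas, the estimates can be computed directly with numpy; a minimal sketch that loads the Advertising data used throughout this chapter:

  •   # compute the least squares estimates directly from the formulas above
      import numpy as np
      import pandas as pd
        
      data = pd.read_csv("https://www.statlearning.com/s/Advertising.csv", index_col=0)
      x, y = data['TV'], data['sales']
      beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
      beta0_hat = y.mean() - beta1_hat * x.mean()
      print(beta0_hat, beta1_hat)  # approximately 7.0326 and 0.0475, matching the OLS fit below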

Assessing the Accuracy of the Coefficient Estimates

  • True relationship:

  • \[Y = \beta_0 + \beta_1 X + \epsilon\]
  • The analogy with estimating the population mean $\mu$ from the sample mean $\hat{\mu}$

    • $\hat{\mu}$ from one data set may under- or overestimate $\mu$. However, if we could average $\hat{\mu}$ over a large number of data sets, the average would equal $\mu$. Similarly, $\hat{\beta_0}$ and $\hat{\beta_1}$ are unbiased: averaged over a huge number of data sets, they would equal the true coefficients.

    • Standard Error, $SE(\hat{\mu})$: the average amount by which $\hat{\mu}$ differs from the actual value of $\mu$

      • \[Var(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}\]
      • SE reduces as $n$ increases.
  • Similarly, the standard errors of the coefficient estimates are:

    • \[SE(\hat{\beta_0})^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right] \\ SE(\hat{\beta_1})^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}; ~\sigma^2 = Var(\epsilon)\]
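  • A small simulation sketch (hypothetical true values) illustrating that the spread of $\hat{\beta_1}$ over repeated data sets matches the $SE(\hat{\beta_1})$ formula above:

  •   import numpy as np
        
      rng = np.random.default_rng(0)
      n, beta0, beta1, sigma = 100, 2.0, 3.0, 1.0   # hypothetical true model
      x = rng.uniform(0, 10, n)
      slopes = []
      for _ in range(2000):
          # simulate a fresh data set and re-estimate the slope
          y = beta0 + beta1 * x + rng.normal(0, sigma, n)
          slopes.append(np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2))
      print(np.std(slopes))                              # empirical SE of beta1-hat
      print(sigma / np.sqrt(np.sum((x - x.mean())**2)))  # formula value; the two should be close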
  • In practice $\sigma$ is unknown and is estimated from the data by the Residual Standard Error: $RSE = \sqrt{\frac{RSS}{n-2}}$

  • Standard errors can be used to compute confidence intervals. For linear regression, the 95% confidence interval for each coefficient approximately takes the form (see the code sketch at the end of this section):

    • \[\hat{\beta_1} \pm 2\times SE(\hat{\beta_1}); ~~\hat{\beta_0} \pm 2\times SE(\hat{\beta_0})\]
    • There is approximately a 95% chance that each such interval will contain the true value of the corresponding coefficient.
  • Usage of Standard Error for Hypothesis Test

    • \[H_0: \text{There is no relationship between } X \text{ and } Y ~~(\beta_1 = 0)\]
    • \[H_a: \text{There is some relationship between } X \text{ and } Y ~~(\beta_1 \ne 0)\]
    • Compute $t-statistic$: number of standard deviations $\hat{\beta_1}$ is away from $0$

      • \[t = \frac{\hat{\beta_1}-0}{SE(\hat{\beta_1})}\]
    • We then compute the $p-value$: a small value indicates evidence of an association, and we reject the null hypothesis
  •   import numpy as np
      import pandas as pd
      from matplotlib.pyplot import subplots
      import statsmodels.api as sm
      from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
      from statsmodels.stats.anova import anova_lm
      from ISLP import load_data
      from ISLP.models import (ModelSpec as MS, summarize, poly)
        
      import warnings
      warnings.simplefilter(action='ignore', category=FutureWarning)
        
      data = pd.read_csv("https://www.statlearning.com/s/Advertising.csv", index_col=0)
      print(data.shape, data.columns)
      display(data.head())
        
      _ = pd.plotting.scatter_matrix(data)
        
      model = MS(['TV'])
      X = model.fit_transform(data)
      print(X[:4])
        
      y = data['sales']
      model = sm.OLS(y, X)
      results = model.fit()
      summarize(results)
    
  •   |           | Coefficient | Std. Error | t-statistic | p-value |
      |-----------|-------------|------------|-------------|---------|
      | Intercept | 7.0326      | 0.458      | 15.360      | 0.0     |
      | TV        | 0.0475      | 0.003      | 17.668      | 0.0     |
  • Conclusion:
    • Both $p-values$ are essentially zero, so $\beta_0 \ne 0$ and $\beta_1 \ne 0$.
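  • The confidence intervals and $t-statistics$ discussed above can be read directly off the fitted results object; a short check, assuming the TV-only fit from the code above:

  •   # 95% confidence intervals and t-statistics for the TV-only fit
      print(results.conf_int(alpha=0.05))               # 95% CIs for intercept and TV
      print(results.tvalues)                             # t = coefficient / standard error
      print(results.params['TV'] / results.bse['TV'])    # manual t-statistic for TV, ~17.67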

Assessing the Accuracy of the Model

  • Once the null hypothesis is rejected, we need to assess how well the model fits the data.

  • For linear regression, two quantities are typically used:

    • Residual Standard Error ($RSE$)
    • $R^2$ statistic
  •   # compute residual standard error, R^2, F-statistic, p-value
      # mean squared error of the residuals is mse_resid
      print(f"RSE = {np.sqrt(results.mse_resid):.2f}") # 3.26
      print(f"R^2 = {results.rsquared:.3f}") # 0.612
      print(f"F-statistic = {results.fvalue:.1f}") # 312.1
      # F-statistic tests whether the model fits better than an intercept-only model
        
      print(f"p-value = {results.f_pvalue:.3f}") # 0.000
    
  • $Residual~Standard~Error$ ($RSE$): a measure of the lack of fit of the model to the data

    • is an estimate of the standard deviation of $\epsilon$

    • \[RSE = \sqrt{\frac{RSS}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y_i})^2}{n-2}}\]
    • \[RSS = \sum_{i=1}^{n} (y_i - \hat{y_i})^2\]
    • $RSE = 3.26$

      • The model can be off by about 3.26 (thousands of units) on average.

      •   mean_sales = np.mean(data.sales) # 14.0225
          residual_standard_error = np.sqrt(results.mse_resid) # 3.26
          error = residual_standard_error / mean_sales # 0.232
          print(f"Error = {error:.3f}")
        
      • Thus, the percentage error relative to mean sales is about 23%
  • $R^2 statistic$: proportion of variance explained

    • The measure of the linear relationship between $X$ and $Y$

    • An RSE of 3.26 is measured in the units of $Y$, so by itself it does not indicate whether the fit is good or bad

    • $R^2$ is always between $0$ and $1$ and is independent of the scale of $Y$. It is easier to interpret than RSE, but it can still be hard to say what a good value is. In physics, where data are expected to come from a nearly linear model, $R^2$ close to $1$ is expected; in marketing, psychology, or biology, where the linear model is only a rough approximation and measurement error is substantial, an $R^2$ of $0.1$ may be realistic.

    • \[R^2 = \frac{TSS-RSS}{TSS} = 1 - \frac{RSS}{TSS} \\ \text{where } RSS = \sum(y_i - \hat{y_i})^2 ~and~ TSS = \sum (y_i - \bar{y})^2\]
    • TSS is the total variance in $Y$

    • The correlation $r = Cor(X,Y)$ also measures the linear relationship; in simple linear regression it can be shown that $r^2 = R^2$ (a quick numerical check follows below)

      • square of correlation = R-squared

      • \[r = Cor(X,Y) = \frac{\sum[(x_i - \bar{x})(y_i - \bar{y})]}{\sqrt{\sum[x_i-\bar{x}]^2} \sqrt{\sum[y_i-\bar{y}]^2}}\]
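      • A quick numerical check of $r^2 = R^2$ (assuming data and the TV-only results fit from the code above):

      •   r = np.corrcoef(data['TV'], data['sales'])[0, 1]  # sample correlation
          print(f"r^2 = {r**2:.3f}")                        # ~0.612
          print(f"R^2 = {results.rsquared:.3f}")            # same value from the OLS fit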
  •   # Regression model with TV as the only predictor
      model = MS(['TV'])
      X = model.fit_transform(data)
      display(X[:4])
        
      y = data['sales']
      model = sm.OLS(y, X)
      results = model.fit()
      display(summarize(results))
      display(summarize(results).to_dict())
      """
      {'coef': {'intercept': 7.0326, 'TV': 0.0475},
       'std err': {'intercept': 0.458, 'TV': 0.003},
       't': {'intercept': 15.36, 'TV': 17.668},
       'P>|t|': {'intercept': 0.0, 'TV': 0.0}}
      """
        
      # compute residual standard error, R^2, F-statistic, p-value
      # mean squared error of the residuals is mse_resid
      print(f"RSE = {np.sqrt(results.mse_resid):.2f}") # 3.26
      print(f"R^2 = {results.rsquared:.3f}") # 0.612
      print(f"F-statistic = {results.fvalue:.1f}") # 312.1
      print(f"p-value = {results.f_pvalue:.3f}") # 0.000
        
      mean_sales = np.mean(data.sales) # 14.0225
      residual_standard_error = np.sqrt(results.mse_resid) # 3.26
      error = residual_standard_error / mean_sales # 0.232
      print(f"Error = {error:.3f}")
    
  •   # Regression model with radio as the only predictor
      model = MS(['radio'])
      X = model.fit_transform(data)
      display(X[:4])
        
      y = data['sales']
      model = sm.OLS(y, X)
      results = model.fit()
      display(summarize(results))
      display(summarize(results).to_dict())
      """{'coef': {'intercept': 9.3116, 'radio': 0.2025},
       'std err': {'intercept': 0.563, 'radio': 0.02},
       't': {'intercept': 16.542, 'radio': 9.921},
       'P>|t|': {'intercept': 0.0, 'radio': 0.0}}
       """
        
      print(results.params) # Intercept = 9.31, radio = 0.202
        
      # compute residual standard error, R^2, F-statistic, p-value
      # mean squared error of the residuals is mse_resid
      print(f"RSE = {np.sqrt(results.mse_resid):.2f}") # 4.27
      print(f"R^2 = {results.rsquared:.3f}") # 0.332
      print(f"F-statistic = {results.fvalue:.1f}") # 98.4
      print(f"p-value = {results.f_pvalue:.3f}") # 0.000
        
      mean_sales = np.mean(data.sales) # 14.0225
      residual_standard_error = np.sqrt(results.mse_resid) # 4.27
      error = residual_standard_error / mean_sales # 0.305
      print(f"Error = {error:.3f}")
    
  •   # Regression model with newspaper as the only predictor
      model = MS(['newspaper'])
      X = model.fit_transform(data)
      display(X[:4])
        
      y = data['sales']
      model = sm.OLS(y, X)
      results = model.fit()
      display(summarize(results))
      display(summarize(results).to_dict())
      """{'coef': {'intercept': 12.3514, 'newspaper': 0.0547},
       'std err': {'intercept': 0.621, 'newspaper': 0.017},
       't': {'intercept': 19.876, 'newspaper': 3.3},
       'P>|t|': {'intercept': 0.0, 'newspaper': 0.001}}
       """
        
      print(results.params) # Intercept = 12.35, newspaper = 0.054
        
      # compute residual standard error, R^2, F-statistic, p-value
      # mean squared error of the residuals is mse_resid
      print(f"RSE = {np.sqrt(results.mse_resid):.2f}") # 5.09
      print(f"R^2 = {results.rsquared:.3f}") # 0.052
      print(f"F-statistic = {results.fvalue:.1f}") # 10.9
      print(f"p-value = {results.f_pvalue:.3f}") # 0.001
        
      mean_sales = np.mean(data.sales) # 14.0225
      residual_standard_error = np.sqrt(results.mse_resid) # 5.09
      error = residual_standard_error / mean_sales # 0.363
      print(f"Error = {error:.3f}") # 0.363
    

Multiple Linear Regression

\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2+...+\beta_pX_p + \epsilon\]

Estimating the Regression Coefficients

\[\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2+...+\hat{\beta_p}X_p\]
  • Choose the coefficients that minimize the sum of squared residuals:

    • \[RSS = \sum_{i=1}^{n} [y_i - \hat{y_i}]^2 = \sum_{i=1}^{n} [y_i - \hat{\beta_0} - \hat{\beta_1}x_{i1} - \hat{\beta_2}x_{i2} - ... - \hat{\beta_p}x_{ip}]^2\]
  •   # multiple linear regression
      model = MS(["TV", "radio", "newspaper"])
      X = model.fit_transform(data)
      display(X[:4])
        
      y = data['sales']
      model = sm.OLS(y, X)
      results = model.fit()
      display(summarize(results))
      display(summarize(results).to_dict())
        
      """{'coef': {'intercept': 2.9389,
        'TV': 0.0458,
        'radio': 0.1885,
        'newspaper': -0.001},
       'std err': {'intercept': 0.312,
        'TV': 0.001,
        'radio': 0.009,
        'newspaper': 0.006},
       't': {'intercept': 9.422, 'TV': 32.809, 'radio': 21.893, 'newspaper': -0.177},
       'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.0, 'newspaper': 0.86}}
       """
         
      # compute residual standard error, R^2, F-statistic, p-value
      # mean squared error of the residuals is mse_resid
      print(f"RSE = {np.sqrt(results.mse_resid):.2f}") # 1.69
      print(f"R^2 = {results.rsquared:.3f}") # 0.897
      print(f"F-statistic = {results.fvalue:.1f}") # 570.3
      print(f"p-value = {results.f_pvalue:.3f}") # 0.000
        
      mean_sales = np.mean(data.sales) # 14.0225
      residual_standard_error = np.sqrt(results.mse_resid) # 1.69
      error = residual_standard_error / mean_sales # 0.120
      print(f"Error = {error:.3f}") # 0.120
    
  • Correlation Matrix

    •   # Correlation matrix
        display(data.corr())
        display(data.corr().to_dict())
        """{'TV': {'TV': 1.0,
          'radio': 0.05480866446583008,
          'newspaper': 0.05664787496505698,
          'sales': 0.7822244248616067},
         'radio': {'TV': 0.05480866446583008,
          'radio': 1.0,
          'newspaper': 0.35410375076117534,
          'sales': 0.5762225745710555},
         'newspaper': {'TV': 0.05664787496505698,
          'radio': 0.35410375076117534,
          'newspaper': 1.0,
          'sales': 0.22829902637616545},
         'sales': {'TV': 0.7822244248616067,
          'radio': 0.5762225745710555,
          'newspaper': 0.22829902637616545,
          'sales': 1.0}}
          """
      
  • Comments

    • Multiple Linear Regression:
      • $TV=0.0458,~ radio=0.1885,~ newspaper = -0.001$ with $p-values = 0,~ 0,~ 0.86$
        • For a given amount of TV and newspaper advertising, spending an additional $1,000 on radio is associated with an increase in sales of about 189 units ($0.1885 \times 1000$)
    • Simple Linear Regression:
      • $TV = 0.0475,~radio=0.2025,~newspaper=0.0547$ with $p-values = 0,~ 0,~ 0.001 $
    • The newspaper coefficient differs markedly between the two fits, as does its $p-value$.
      • In the multiple regression, newspaper shows no association with sales, while the simple regression suggests one
      • The correlation matrix shows a correlation of $0.35$ between radio and newspaper
        • In the simple regression, newspaper acts as a surrogate for radio and gets "credit" for radio's association with sales
  • A similar spurious-association example: shark attacks vs. ice cream sales

Some Important Questions

  • Is at least one of the predictors $X_1, X_2, \ldots, X_p$ useful in predicting the response?
  • Do all the predictors help to explain $Y$, or is only a subset of the predictors useful?
  • How well does the model fit the data?
  • Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

One: Is at least one of the predictors $X_1, X_2, \ldots, X_p$ useful in predicting the response?

  • $H_0: \beta_1 = \beta_2 = … = \beta_p = 0$

  • $H_a:$ at least one $\beta_j$ is non zero

  • Compute $F-statistic$:

    • \[F = \frac{\frac{TSS - RSS}{p}}{\frac{RSS}{n-p-1}} \\ \text{where } TSS = \sum [y_i - \bar{y}]^2 \text{ and } RSS = \sum [y_i - \hat{y_i}]^2\]
    • If there is no relationship between the response and the predictors, the $F-statistic$ will be close to $1$. If $H_a$ is true, it will be greater than $1$.
  • As computed above, the $F-statistic$ for the multiple linear regression is about $570$, which is far larger than $1$, so at least one of the predictors is related to sales. When $n$ is large, an $F-statistic$ only slightly greater than $1$ may be enough to reject $H_0$; when $n$ is small, a larger $F-statistic$ is needed to reject $H_0$ in favor of $H_a$.

  • For any $n$ and $p$, the $p-value$ associated with the $F-statistic$ can be used. Since the $p-value$ for the multiple linear regression is essentially $0$, $H_0$ can be rejected. (A manual computation of the $F-statistic$ follows the code below.)

# multiple linear regression
model = MS(["TV", "radio", "newspaper"])
X = model.fit_transform(data)
display(X[:4])

y = data['sales']
model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
display(summarize(results).to_dict())

"""{'coef': {'intercept': 2.9389,
  'TV': 0.0458,
  'radio': 0.1885,
  'newspaper': -0.001},
 'std err': {'intercept': 0.312,
  'TV': 0.001,
  'radio': 0.009,
  'newspaper': 0.006},
 't': {'intercept': 9.422, 'TV': 32.809, 'radio': 21.893, 'newspaper': -0.177},
 'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.0, 'newspaper': 0.86}}
 """
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.2f}") # 1.69
print(f"R^2 = {results.rsquared:.3f}") # 0.897
print(f"F-statistic = {results.fvalue:.1f}") # 570.3
print(f"p-value = {results.f_pvalue:.3f}") # 0.000

Two: Deciding on Important Variables

  • The first step in multiple regression is to check, via the $F-statistic$ and its $p-value$, whether any of the predictors is related to the response. We could look at individual $p-values$, but when the number of predictors $p$ is large, doing so is likely to produce false discoveries.
  • $Variable~Selection$
    • There are $2^p$ models containing subsets of the $p$ variables, so trying them all is usually infeasible. Three classical approaches:
    • Forward selection:
      • Start with the null model (intercept but no predictors)
      • Fit $p$ simple linear regressions and add the variable that results in the lowest $RSS$ (see the sketch after this list)
      • Add to this model the variable that gives the lowest $RSS$ for the resulting two-variable model, and so on
      • Stop when a stopping rule is satisfied
    • Backward selection:
      • Start with all variables in the model and remove the variable with the largest $p-value$
      • Refit and continue until a stopping rule is reached
      • Cannot be used if $p > n$
    • Mixed selection:
      • A combination of forward and backward selection
      • Start with the null model and add the variable that provides the best fit (forward); continue the process
      • As variables are added, if the $p-value$ for one of them rises above a threshold, remove it
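  • A minimal forward-selection sketch (not the book's lab code; reuses data, MS, and sm from the code above): at each step, greedily add the predictor that yields the lowest RSS.

  •   # forward selection sketch: greedily add the predictor with the lowest RSS
      candidates = ["TV", "radio", "newspaper"]
      selected = []
      y = data['sales']
      while candidates:
          # refit with each remaining candidate and keep the one with the smallest RSS
          best = min(candidates,
                     key=lambda c: sm.OLS(y, MS(selected + [c]).fit_transform(data)).fit().ssr)
          selected.append(best)
          candidates.remove(best)
          rss = sm.OLS(y, MS(selected).fit_transform(data)).fit().ssr
          print(selected, f"RSS = {rss:.1f}")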
  • Backward selection demo: fit the full model, then drop "newspaper" (the largest $p-value$):

  •   # multiple linear regression
      model = MS(["TV", "radio", "newspaper"])
      X = model.fit_transform(data)
      display(X[:4])
        
      y = data['sales']
      model = sm.OLS(y, X)
      results = model.fit()
      display(summarize(results))
      display(summarize(results).to_dict())
      """{'coef': {'intercept': 2.9389,
        'TV': 0.0458,
        'radio': 0.1885,
        'newspaper': -0.001},
       'std err': {'intercept': 0.312,
        'TV': 0.001,
        'radio': 0.009,
        'newspaper': 0.006},
       't': {'intercept': 9.422, 'TV': 32.809, 'radio': 21.893, 'newspaper': -0.177},
       'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.0, 'newspaper': 0.86}}
      """
         
      # compute residual standard error, R^2, F-statistic, p-value
      # mean squared error of the residuals is mse_resid
      print(f"RSE = {np.sqrt(results.mse_resid):.2f}") # 1.69
      print(f"R^2 = {results.rsquared:.3f}") # 0.897
      print(f"F-statistic = {results.fvalue:.1f}") # 570.3
      print(f"p-value = {results.f_pvalue:.3f}") # 0.000
        
      mean_sales = np.mean(data.sales) # 14.0225
      residual_standard_error = np.sqrt(results.mse_resid) # 1.69
      error = residual_standard_error / mean_sales # 0.120
      print(f"Error = {error:.3f}") # 0.120
    
  •   # multiple linear regression
      model = MS(["TV", "radio"])
      X = model.fit_transform(data)
      display(X[:4])
        
      y = data['sales']
      model = sm.OLS(y, X)
      results = model.fit()
      display(summarize(results))
      display(summarize(results).to_dict())
      """{'coef': {'intercept': 2.9211, 'TV': 0.0458, 'radio': 0.188},
       'std err': {'intercept': 0.294, 'TV': 0.001, 'radio': 0.008},
       't': {'intercept': 9.919, 'TV': 32.909, 'radio': 23.382},
       'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.0}}
      """
         
      # compute residual standard error, R^2, F-statistic, p-value
      # mean squared error of the residuals is mse_resid
      print(f"RSE = {np.sqrt(results.mse_resid):.2f}") # 1.69
      print(f"R^2 = {results.rsquared:.3f}") # 0.897
      print(f"F-statistic = {results.fvalue:.1f}") # 570.3
      print(f"p-value = {results.f_pvalue:.3f}") # 0.000
        
      mean_sales = np.mean(data.sales) # 14.0225
      residual_standard_error = np.sqrt(results.mse_resid) # 1.69
      error = residual_standard_error / mean_sales # 0.120
      print(f"Error = {error:.3f}") # 0.120
    

Three: Model Fit

  • Two common numerical measures of model fit: $RSE$ and $R^2$ (the fraction of variance explained)
  • In Simple Regression: $R^2 = r^2 = [Cor(X,Y)]^2$
  • In Multiple Regression: $R^2 = [Cor(Y, \hat{Y})]^2$
  • $R^2$ close to $1$ means the model explains a large portion of the variance.
# multiple linear regression
model = MS(["TV", "radio", "newspaper"])

X = model.fit_transform(data)
y = data['sales']

model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))

print(f"RSE = {np.sqrt(results.mse_resid):.3f}") # 1.686
print(f"R^2 = {results.rsquared:.5f}") # 0.89721
print(f"F-statistic = {results.fvalue:.2f}") # 570.27
print(f"p-value = {results.f_pvalue:.4f}") # 0.0000

mean_sales = np.mean(data.sales) # 14.0225
residual_standard_error = np.sqrt(results.mse_resid) # 1.69
error = residual_standard_error / mean_sales # 0.120

########################
# multiple linear regression
model = MS(["TV", "radio"])

X = model.fit_transform(data)
y = data['sales']

model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))
 
# compute residual standard error, R^2, F-statistic, p-value
# mean squared error of the residuals is mse_resid
print(f"RSE = {np.sqrt(results.mse_resid):.3f}") # 1.681
print(f"R^2 = {results.rsquared:.5f}") # 0.89719
print(f"F-statistic = {results.fvalue:.2f}") # 859.62
print(f"p-value = {results.f_pvalue:.4f}") # 0.0000

mean_sales = np.mean(data.sales) # 14.0225
residual_standard_error = np.sqrt(results.mse_resid) # 1.69
error = residual_standard_error / mean_sales # 0.120
  • When “TV”, “radio”, “newspaper”, $R^2 = 0.89721$, $RSE = 1.686$

  • When “TV”, “radio”, $R^2 = 0.89719$, $RSE = 1.681$

  • When “TV”, $R^2 = 0.61188$, $RSE = 3.259$

  • There is only a tiny increase in $R^2$ when newspaper is added, but a substantial increase when radio is added.

  • Why did RSE increase when newspaper was added, given that RSS must decrease?

    • \[RSE = \sqrt{\frac{RSS}{n-p-1}}\]
    • Models with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in $p$ (a quick check follows)
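  • A quick check (reusing data, MS, and sm from above), recomputing $RSE = \sqrt{RSS/(n-p-1)}$ by hand for the two fits:

  •   # recompute RSE by hand: losing a degree of freedom to newspaper raises RSE slightly
      n = data.shape[0]
      for preds in (["TV", "radio"], ["TV", "radio", "newspaper"]):
          res = sm.OLS(data['sales'], MS(preds).fit_transform(data)).fit()
          print(preds, f"RSE = {np.sqrt(res.ssr / (n - len(preds) - 1)):.3f}")
      # ['TV', 'radio'] RSE = 1.681 ; adding newspaper gives RSE = 1.686 (RSS barely decreases)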

Four: Predictions

  • Three uncertainties
    • The least squares plane $\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2 +…+ \hat{\beta_p}X_p$ is only an estimate of the true population regression plane $Y = \beta_0 + \beta_1X_1 + \beta_2X_2 +…+ \beta_pX_p$
      • The inaccuracy in the coefficient estimates is related to the $reducible~error$. We can compute a $confidence~interval$ to determine how close $\hat{Y}$ will be to $f(X)$
    • Model bias due to the assumption of the linear model
    • Even if the true coefficients were known, the response could not be predicted perfectly because of the irreducible error $\epsilon$
      • We can use a $prediction~interval$ to quantify how much $Y$ varies from $\hat{Y}$
# multiple linear regression
model = MS(["TV", "radio"])

X = model.fit_transform(data)
y = data['sales']
display(X[:4])

model = sm.OLS(y, X)
results = model.fit()
display(summarize(results))

new_df = pd.DataFrame({"TV":[100], "radio":[20]})
model = MS(["TV", "radio"])
newX = model.fit_transform(new_df)
display(newX)

new_predictions = results.get_prediction(newX)
print(new_predictions.predicted_mean) # [11.25646595]

# We can produce confidence intervals for the predicted values
print(new_predictions.conf_int(alpha=0.05)) # [[10.98525445 11.52767746]]

# Prediction intervals are computed by setting obs=True:
print(new_predictions.conf_int(obs=True, alpha=0.05)) # [[ 7.92961607 14.58331584]]
  • Suppose $100,000 is spent on TV advertising and $20,000 is spent on radio advertising in each city
    • the 95 % confidence interval is [10,985, 11,528].
      • We interpret this to mean that 95 % of intervals of this form will contain the true value of f(X).
    • Prediction interval can be used to quantify the prediction uncertainty surrounding sales for a particular city
      • 95 % prediction interval is [7,930, 14,580].
        • We interpret this to mean that 95 % of intervals of this form will contain the true value of Y for this city.
    • Note that both intervals are centered at 11,256, but that the prediction interval is substantially wider than the confidence interval, reflecting the increased uncertainty about sales for a given city in comparison to the average sales over many locations.

Other Considerations in the Regression Model

Qualitative Predictors

data = pd.read_csv("https://www.statlearning.com/s/Credit.csv")
print(data.shape)
print(data.dtypes.to_dict())
"""{'Income': dtype('float64'),
 'Limit': dtype('int64'),
 'Rating': dtype('int64'),
 'Cards': dtype('int64'),
 'Age': dtype('int64'),
 'Education': dtype('int64'),
 'Own': dtype('O'),
 'Student': dtype('O'),
 'Married': dtype('O'),
 'Region': dtype('O'),
 'Balance': dtype('int64')}"""
display(data.head())

Predictors with Only Two Levels

  • Create a $dummy~variable$ taking values $0$ or $1$ for the qualitative predictor ($factor$)

  • \[x_i= \begin{cases} 1 & \text{if owns a house} \\ 0 & \text{otherwise} \end{cases}\]
  • \[y_i = \beta_0 + \beta_1x_i + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if owns a house} \\ \beta_0 + \epsilon_i & \text{otherwise} \end{cases}\]
# simple linear regression
df = pd.read_csv("https://www.statlearning.com/s/Credit.csv")

factors = ["Own"]

for col in factors:
    df[col] = df[col].astype('category')

X = MS(factors).fit_transform(df)
display(X[:3])
y = df['Balance']

model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 509.8031, 'Own[Yes]': 19.7331},
 'std err': {'intercept': 33.128, 'Own[Yes]': 46.051},
 't': {'intercept': 15.389, 'Own[Yes]': 0.429},
 'P>|t|': {'intercept': 0.0, 'Own[Yes]': 0.669}}
"""
  • The average credit card debt for
    • Own[No] is $509.80 (the intercept)
    • Own[Yes] is estimated to be $509.80 + $19.73 = $529.53
  • However, the $p-value$ for $Own$ is high, so there is no statistical evidence of a difference in average credit card balance based on house ownership.
# simple linear regression
df = pd.read_csv("https://www.statlearning.com/s/Credit.csv")

df['Own'] = df['Own'].map({'Yes': 0, 'No': 1})  

factors = ["Own"]

for col in factors:
    df[col] = df[col].astype('category')

X = MS(factors).fit_transform(df)
display(X[:3])
y = df['Balance']

model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 529.5362, 'Own[1]': -19.7331},
 'std err': {'intercept': 31.988, 'Own[1]': 46.051},
 't': {'intercept': 16.554, 'Own[1]': -0.429},
 'P>|t|': {'intercept': 0.0, 'Own[1]': 0.669}}
"""
  • If Own[No] is encoded as $1$ (and Own[Yes] as $0$), the coefficient values change.
  • The average credit card debt for
    • Own[Yes(0)] is estimated to be $529.53 (the intercept)
    • Own[No(1)] is estimated to carry -$19.73 (the coefficient) relative to that, for a total of $529.53 - $19.73 = $509.80
  • These are the same predictions as when Own[Yes] was coded as 1.
# simple linear regression
df = pd.read_csv("https://www.statlearning.com/s/Credit.csv")

df['Own'] = df['Own'].map({'Yes': 1, 'No': -1})  

factors = ["Own"]

for col in factors:
    df[col] = df[col].astype('category')

X = MS(factors).fit_transform(df)
display(X[:3])
y = df['Balance']

model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 509.8031, 'Own[1]': 19.7331},
 'std err': {'intercept': 33.128, 'Own[1]': 46.051},
 't': {'intercept': 15.389, 'Own[1]': 0.429},
 'P>|t|': {'intercept': 0.0, 'Own[1]': 0.669}}
"""
  • If we encode Own as $1$ (Yes) and $-1$ (No), the column is still treated as a two-level factor, with $-1$ as the baseline.
  • The average credit card debt for
    • non-owners Own[No(-1)] is estimated to be $509.80 (the intercept)
    • owners are estimated to carry an additional $19.73 (the coefficient), for a total of $509.80 + $19.73 = $529.53
  • It is important to note that the final predictions for the credit balances of owners and non-owners will be identical regardless of the coding scheme used. The only difference is in the way that the coefficients are interpreted.

Qualitative Predictors with More than Two Levels

# simple linear regression

df = pd.read_csv("https://www.statlearning.com/s/Credit.csv")
#print(df.columns)
factors = ["Region"]

for col in factors:
    df[col] = df[col].astype('category')

X = MS(factors).fit_transform(df)
display(X[:3])
y = df['Balance']

model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 531.0,
  'Region[South]': -12.5025,
  'Region[West]': -18.6863},
 'std err': {'intercept': 46.319,
  'Region[South]': 56.681,
  'Region[West]': 65.021},
 't': {'intercept': 11.464, 'Region[South]': -0.221, 'Region[West]': -0.287},
 'P>|t|': {'intercept': 0.0, 'Region[South]': 0.826, 'Region[West]': 0.774}}
"""
  • East is the baseline level: average Balance = $531.00 (the intercept)
  • Region[South] is estimated to have $12.50 less balance than East; however, the $p-value$ is large.
  • Region[West] is estimated to have $18.68 less balance than East; however, the $p-value$ is large.
Adding Interaction
# Adding Interaction

df = pd.read_csv("https://www.statlearning.com/s/Advertising.csv", index_col=0)
#print(df.columns)
factors = ["TV", "radio", ("TV", "radio")]
response = "sales"

X = MS(factors).fit_transform(df)
display(X[:3])
y = df[response]

model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())
"""
{'coef': {'intercept': 6.7502,
  'TV': 0.0191,
  'radio': 0.0289,
  'TV:radio': 0.0011},
 'std err': {'intercept': 0.248,
  'TV': 0.002,
  'radio': 0.009,
  'TV:radio': 5.24e-05},
 't': {'intercept': 27.233, 'TV': 12.699, 'radio': 3.241, 'TV:radio': 20.727},
 'P>|t|': {'intercept': 0.0, 'TV': 0.0, 'radio': 0.001, 'TV:radio': 0.0}}
 """
# Polynomial Regression
#df = pd.read_csv("https://www.statlearning.com/s/Auto.csv")
"""
print(df[df.horsepower=="?"].shape)
df = df[df.horsepower!="?"]
df.horsepower = df.horsepower.astype(float)
"""

df = load_data('Auto')
predictors = [poly("horsepower", degree=2)]
response = "mpg"

X = MS(predictors).fit_transform(df)
y = df[response]

model = sm.OLS(y, X).fit()
display(summarize(model))
display(summarize(model).to_dict())

"""
{'coef': {'intercept': 23.4459,
  'poly(horsepower, degree=2)[0]': -120.1377,
  'poly(horsepower, degree=2)[1]': 44.0895},
 'std err': {'intercept': 0.221,
  'poly(horsepower, degree=2)[0]': 4.374,
  'poly(horsepower, degree=2)[1]': 4.374},
 't': {'intercept': 106.13,
  'poly(horsepower, degree=2)[0]': -27.467,
  'poly(horsepower, degree=2)[1]': 10.08},
 'P>|t|': {'intercept': 0.0,
  'poly(horsepower, degree=2)[0]': 0.0,
  'poly(horsepower, degree=2)[1]': 0.0}}
"""