ANOVA

This post explains analysis of variance (ANOVA): why pairwise t-tests don't scale to many groups, how the one-way ANOVA F-test works (with two worked examples), and how to run and interpret a two-way ANOVA in R and Python.

F-table

(F-distribution critical-value table not reproduced here; the worked examples below show how it is read.)

Comparing Samples

| Product A | Product B | Product C |
| --------- | --------- | --------- |
| 12        | 40        | 65        |
| 15        | 45        | 45        |
| 10        | 50        | 30        |
| 14        | 60        | 40        |
  • Which of the products have significantly different prices
    • Product A and Product B
    • Product A and Product C
    • Product B and Product C
    • No significant difference
  • How many t-tests would we need to compare for 4 samples A, B, C, D
    • 6 (AB, AC, AD, BC, BD, CD)
    • $\binom{n}{2} = \frac{n!}{2!(n-2)!}$

$t\text{-statistic} = \frac{\bar{X}_1-\bar{X}_2}{\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}}$
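As a quick illustration of this formula, here is a minimal Python sketch (assuming NumPy and SciPy are available) that computes the t-statistic for Product A vs. Product B from the table above and cross-checks it against SciPy's Welch t-test, which uses the same unpooled standard error:

```python
# Minimal sketch: two-sample t-statistic for Product A vs Product B
# (prices taken from the "Comparing Samples" table above).
import numpy as np
from scipy import stats

a = np.array([12, 15, 10, 14])   # Product A
b = np.array([40, 45, 50, 60])   # Product B

# t = (mean_a - mean_b) / sqrt(S_a^2/n_a + S_b^2/n_b), with sample variances (ddof=1)
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
t_manual = (a.mean() - b.mean()) / se

# Cross-check: Welch's t-test (equal_var=False) uses the same statistic
t_scipy, p_value = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, t_scipy, p_value)
```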

To compare three or more samples, the test statistic takes the form $\frac{\text{distance / variability between the sample means}}{\text{error (variability within the samples)}}$

  • To compare three or more samples
    • Find the average squared deviation of each sample mean from the total mean
      • analogous to computing a variance, but of the sample means
    • Total Mean or Grand Mean $\bar{X}_G$
    • If the sample sizes are equal, either
      • take the mean of the sample means
        • $\bar{X}_G = \text{Mean of sample means} = \frac{\bar{X}_1 + \bar{X}_2+…+\bar{X}_k}{k}$
      • or compute the mean of all values
        • $\bar{X}_G = \frac{X_1 + X_2+…+X_N}{N}$
    • If the sample sizes are unequal
      • compute the mean of all values
        • $\bar{X}_G = \frac{X_1 + X_2+…+X_N}{N}$
    • Both computations are shown in the code sketch after this list
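A minimal Python sketch of both computations, using the product prices from the table above (equal sample sizes of 4):

```python
# Minimal sketch: the grand mean computed two equivalent ways
# (product prices from the "Comparing Samples" table above).
import numpy as np

samples = [
    np.array([12, 15, 10, 14]),  # Product A
    np.array([40, 45, 50, 60]),  # Product B
    np.array([65, 45, 30, 40]),  # Product C
]

# Mean of the sample means (valid here because all samples have the same size)
grand_mean_from_means = np.mean([s.mean() for s in samples])

# Mean of all N pooled values (also works for unequal sample sizes)
grand_mean_from_values = np.concatenate(samples).mean()

print(grand_mean_from_means, grand_mean_from_values)  # identical for equal sizes
```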

Between-group Variability

  • What can we conclude from the deviation of each sample mean from the mean of means?
    • The smaller the distance between the sample means (the group means are close to each other)
      • the less likely it is that the population means differ significantly
    • The greater the distance between the sample means (the group means are far from each other)
      • the more likely it is that the population means differ significantly

Within-group Variability

  • In which situation are the means significantly different?
    • (Illustration: one set of samples with less variability, one with more variability)
    • The smaller the variability of each individual sample (little chance of overlap)
      • the more likely it is that the population means differ significantly
    • The greater the variability of each individual sample (greater chance of overlap)
      • the less likely it is that the population means differ significantly

ANalysis Of VAriance (ANOVA)

  • One Test to compare $n$ means
  • One-way ANOVA
    • One Independent Variable
  • Two-way ANOVA
    • Two Independent Variables

One-way ANOVA

  • $H_0: \mu_1 = \mu_2 = \mu_3$
  • $H_A:$ At least one pair of samples is significantly different
  • $F = \frac{between-group~variability}{within-group~variability}$
    • If we get a small statistic
      • Within-group > Between-group
        • the means are not significantly different from each other
        • Fail to reject the Null
    • If we get a large statistic
      • Between-group > Within-group
        • the means are significantly different from each other
        • Reject the Null
  • Higher within-group variability is in favor of the Null Hypothesis
  • Higher between-group variability is in favor of the Alternate Hypothesis

ANOVA

  • Between-group variability $ = \frac{n\Sigma(\bar{X}_k-\bar{X}_G)^2}{k-1} $, where $n$ is the size of each sample (for unequal sample sizes, weight each term by $n_k$ instead)
  • Within-group variability $ = \frac{\Sigma(X_i-\bar{X}_k)^2}{N-k} $, where $N$ is the total number of values from all samples and $k$ is the number of samples
  • $F = \frac{\frac{n\Sigma(\bar{X}_k-\bar{X}_G)^2}{k-1}}{\frac{\Sigma(X_i-\bar{X}_k)^2}{N-k}} = \frac{\frac{SS_{between}}{df_{between}}}{\frac{SS_{within}}{df_{within}}} = \frac{MS_{between}}{MS_{within}}$ (a code sketch of these formulas follows this list)
    • SS - Sum of Squares
    • MS - Mean Square
  • $df_{total} = df_{between} + df_{within} = N - 1$
  • $SS_{total} = \Sigma(X_i-\bar{X}_G)^2 = SS_{between} + SS_{within}$
  • The $F$-statistic is never negative
    • not symmetrical
    • positively skewed
    • peaks near 1, since if there is no difference between the population means, the between-group and within-group variability will be roughly the same
    • always non-directional ($\ne$)
    • critical region is on the right side only
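The formulas above translate almost line-for-line into code. Here is a minimal sketch (the function name one_way_anova_f is just for illustration, not from any library):

```python
# Minimal sketch of the one-way ANOVA F-statistic, following the formulas above.
import numpy as np

def one_way_anova_f(samples):
    """samples: list of 1-D arrays, one per group. Returns (F, df_between, df_within)."""
    samples = [np.asarray(s, dtype=float) for s in samples]
    k = len(samples)                          # number of samples (groups)
    N = sum(len(s) for s in samples)          # total number of values
    grand_mean = np.concatenate(samples).mean()

    # SS_between: squared deviation of each group mean from the grand mean,
    # weighted by that group's size
    ss_between = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
    # SS_within: squared deviations of each value from its own group mean
    ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)

    df_between, df_within = k - 1, N - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within, df_between, df_within
```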

Example One-way ANOVA

Same sample size

  • Are there significant differences in the prices of items from three brands, based on the prices of some random shirts?
    • [15, 12, 14, 11]
    • [39, 45, 48, 60]
    • [65, 45, 32, 38]
  • Hypothesis
    • $H_0: \mu_1 = \mu_2 = \mu_3$
    • $H_A:$ At least one pair of samples is significantly different
  • Samples
    • brand1 = [15, 12, 14, 11]
    • brand2 = [39, 45, 48, 60]
    • brand3 = [65, 45, 32, 38]
  • Means
    • $\bar{X_1} = 13;~ \bar{X_2} = 48;~ \bar{X_3} = 45$
    • $\bar{X_G} = \frac{13 + 48 + 45}{3} = 35.33$ (equal size)
    • $ \bar{X_G} = \frac{15+12+14+11+39+45+48+60+65+45+32+38}{12} = 35.33$ (if unequal)
  • SS_Between
    • $ n \Sigma (\bar{X}_k - \bar{X}_G)^2 $
    • $SS_{between} = 4 * [ (13-35.33)^2 + (48-35.33)^2 + (45-35.33)^2 ] = 3010.67$
  • df_between
    • $ df_{between} = k - 1 = 2 $; k is the number of groups
  • SS_Within
    • $ \Sigma (X_i - \bar{X}_k)^2 $
    • $ SS_{within}(1) = (15-13)^2 + (12-13)^2 + (14-13)^2 + (11-13)^2 $
    • $ SS_{within}(2) = (39-48)^2 + (45-48)^2 + (48-48)^2 + (60-48)^2$
    • $ SS_{within}(3) = (65-45)^2 + (45-45)^2 + (32-45)^2 + (38-45)^2 $
    • $ SS_{within} = SS_{within}(1) + SS_{within}(2) + SS_{within}(3) $
    • $ SS_{within} = 862.00 $
  • df_within
    • $ df_{within} = N - k = 12 - 3 = 9 $
  • MS_between
    • $MS_{between} = \frac{SS_{between}}{df_{between}} = \frac{3010.67}{2} = 1505.33 $
  • MS_within
    • $MS_{within} = \frac{SS_{within}}{df_{within}} = \frac{862}{9} = 95.78 $
  • Fstatistic
    • $ Fstatistic = \frac{MS_{between}}{MS_{within}} = \frac{1505.33}{95.78} = 15.72 $
  • Look up $F_{critical}$ in the F-table for $\alpha = 0.05$, $df_{between} = 2$, $df_{within} = 9$
  • Since $F_{statistic} = 15.72 > F_{critical} = 4.26$
    • Reject the Null
  • The F-table doesn't give a p-value; the SciPy sketch below computes an exact one
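The whole calculation can be cross-checked with SciPy, which also returns the exact p-value that the F-table cannot provide (a minimal sketch, assuming SciPy is installed):

```python
# Minimal sketch: verify the worked example with SciPy and get an exact p-value.
from scipy import stats

brand1 = [15, 12, 14, 11]
brand2 = [39, 45, 48, 60]
brand3 = [65, 45, 32, 38]

f_stat, p_value = stats.f_oneway(brand1, brand2, brand3)
print(f_stat, p_value)             # F ≈ 15.72, p ≈ 0.001

# Critical value from the F-distribution instead of a printed F-table
f_crit = stats.f.ppf(0.95, dfn=2, dfd=9)
print(f_crit)                      # ≈ 4.26
```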

Different sample size

  • Are there significant differences in the prices of items from three brands, based on the prices of some random shirts:
    • Brand 1: [15, 12, 14]
    • Brand 2: [39, 45, 48, 60]
    • Brand 3: [65, 45, 32]
  • Use α = 1% to test if there is a significant difference. Show all of the computations and reasoning.
  • Answers
    • (i) Write the Hypothesis Test:
      • $H_0: \mu_1 = \mu_2 = \mu_3$
      • $H_A:$ not all three population means are equal, or equivalently
        • $H_A:$ At least one pair of samples is significantly different
    • (ii) What is the value of Sum of Squared between (SS_between)
      • Ans: 2435.167
        • $n = [3, 4, 3]; \bar{X} = [13.67, 48, 47.33]$
        • $\bar{X}_G = \frac{15 + 12 + 14 + 39 + 45 + 48 + 60 + 65 + 45 + 32}{10} = 37.5$
        • $SS_{\text{between}} = 3(13.67-37.5)^2 + 4(48-37.5)^2 + 3(47.33-37.5)^2$
        • $SS_{\text{between}} = 2435.167$
    • (iii) What is the value of the degree of freedom between (df_between)
      • Ans: $k-1 = 2$
    • (iv) What is the value of Mean Squared between (MS_between)
      • Ans: $ MS_{between} = \frac{SS_{between}}{df_{between}} = \frac{2435.167}{2}=1217.583$
    • (v) What is the value of Sum of Squared within (SS_within)
      • $SS_{within}1 = (15-13.67)^2 + (12-13.67)^2 + (14-13.67)^2 = 4.67$
      • $SS_{within}2 = (39-48)^2 + (45-48)^2 + (48-48)^2 + (60-48)^2 = 234$
      • $SS_{within}3 = (65-47.33)^2 + (45-47.33)^2 + (32-47.33)^2 = 552.67$
      • $SS_{within} = SS_{within}1 + SS_{within}2 + SS_{within}3$
      • $SS_{within} = 4.67 + 234 + 552.67 = 791.33 $
    • (vi) What is the value of degree of freedom within (df_within)
      • Ans: $ df_{within} = N-k = (3+4+3) - 3 = 7$
    • (vii) What is the value of Mean Squared within (MS_within)
      • Ans: $ MS_{within} = \frac{SS_{within}}{df_{within}} = \frac{791.33}{7} = 113.048$
    • (viii) What is the value of Fstatistic
      • Ans: $Fstatistic = \frac{MS_{between}}{MS_{within}} = \frac{1217.583}{113.048} = 10.771 $
    • (ix) What is the value of Fcritical
      • $Nr = df_{between} = 2; Dr = df_{within} = 7$
      • $Col = 2; Row = 7 \Rightarrow F(0.01, 2, 7) = 9.547$
    • (x) Do you reject the Null Hypothesis or Fail to Reject the Null. Also write the interpretation of rejecting the null or fail to reject the null.
      • Ans: Since $10.771 > 9.547$, reject the Null in favor of the Alternative (a SciPy cross-check follows this list)
      • At least one pair of samples is significantly different
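As with the previous example, SciPy can both verify the hand computation and look up $F_{critical}$ without a printed table (a minimal sketch):

```python
# Minimal sketch: verify the unequal-sample-size example with SciPy.
from scipy import stats

brand1 = [15, 12, 14]
brand2 = [39, 45, 48, 60]
brand3 = [65, 45, 32]

f_stat, p_value = stats.f_oneway(brand1, brand2, brand3)
print(f_stat, p_value)                 # F ≈ 10.77, p ≈ 0.007

# Critical value at alpha = 1% with df_between = 2 and df_within = 7
f_crit = stats.f.ppf(0.99, dfn=2, dfd=7)
print(f_crit, f_stat > f_crit)         # ≈ 9.55, True -> reject H0
```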

Two-way ANOVA

  • The only difference between one-way and two-way ANOVA is the number of independent variables. A one-way ANOVA has one independent variable, while a two-way ANOVA has two.
  • A two-way ANOVA is used to determine whether or not there is a statistically significant difference between the means of three or more independent groups that have been split on two factors.
  • The purpose of a two-way ANOVA is to determine how two factors impact a response variable, and to determine whether or not there is an interaction between the two factors on the response variable.
  • For example, suppose a botanist wants to explore how sunlight exposure and watering frequency affect plant growth.
  • She plants 30 seeds and lets them grow for two months under different conditions for sunlight exposure and watering frequency. After two months, she records the height of each plant.
  • Response variable
    • plant growth
  • Factors
    • sunlight exposure, watering frequency
  • Questions
    • Does sunlight exposure affect plant growth?
    • Does watering frequency affect plant growth?
    • Is there an interaction effect between sunlight exposure and watering frequency? (e.g. the effect that sunlight exposure has on the plants is dependent on watering frequency)
  • Two-Way ANOVA Assumptions
    1. Normality – The response variable is approximately normally distributed for each group.

    2. Equal Variances – The variances for each group should be roughly equal.

    3. Independence – The observations in each group are independent of each other and the observations within groups were obtained by a random sample.

Example

```r
water = c('daily', 'daily', 'daily', 'daily', 'daily', 
          'daily', 'daily', 'daily', 'daily', 'daily', 
          'daily', 'daily', 'daily', 'daily', 'daily', 
          'weekly', 'weekly', 'weekly', 'weekly', 'weekly', 
          'weekly', 'weekly', 'weekly', 'weekly', 'weekly', 
          'weekly', 'weekly', 'weekly', 'weekly', 'weekly')

sun = c('low', 'low', 'low', 'low', 'low', 
        'med', 'med', 'med', 'med', 'med', 
        'high', 'high', 'high', 'high', 'high', 
        'low', 'low', 'low', 'low', 'low', 
        'med', 'med', 'med', 'med', 'med', 
        'high', 'high', 'high', 'high', 'high')

height = c(6, 6, 6, 5, 6, 5, 5, 6, 4, 5, 
           6, 6, 7, 8, 7, 3, 4, 4, 4, 5, 
           4, 4, 4, 4, 4, 5, 6, 6, 7, 8)

#create data frame
data <- data.frame(water = water,
                   sun = sun,
                   height = height)
print(data)

model <- aov(height ~ water * sun, data = data)
summary(model)
```

The same two-way ANOVA in Python (pandas + statsmodels):

```python
import pandas as pd

water = ['daily', 'daily', 'daily', 'daily', 'daily', 
         'daily', 'daily', 'daily', 'daily', 'daily', 
         'daily', 'daily', 'daily', 'daily', 'daily', 
         'weekly', 'weekly', 'weekly', 'weekly', 'weekly', 
         'weekly', 'weekly', 'weekly', 'weekly', 'weekly', 
         'weekly', 'weekly', 'weekly', 'weekly', 'weekly']

sun = ['low', 'low', 'low', 'low', 'low', 
       'med', 'med', 'med', 'med', 'med', 
       'high', 'high', 'high', 'high', 'high', 
       'low', 'low', 'low', 'low', 'low', 
       'med', 'med', 'med', 'med', 'med', 
       'high', 'high', 'high', 'high', 'high']

height = [6, 6, 6, 5, 6, 5, 5, 6, 4, 5, 
          6, 6, 7, 8, 7, 3, 4, 4, 4, 5, 
          4, 4, 4, 4, 4, 5, 6, 6, 7, 8]

data = {'water': water,'sun': sun,'height': height}
df = pd.DataFrame(data)
print(df.head(3))

# !pip3 install statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols

#perform two-way ANOVA
model = ols('height ~ C(water) + C(sun) + C(water):C(sun)', data=df).fit()
sm.stats.anova_lm(model, typ=2)
```

|                 |    sum_sq |   df |       F |       PR(>F) |
| --------------- | --------- | ---- | ------- | ------------ |
| C(water)        |  8.533333 |  1.0 | 16.0000 | 0.000527 *** |
| C(sun)          | 24.866667 |  2.0 | 23.3125 | 0.000002 *** |
| C(water):C(sun) |  2.466667 |  2.0 |  2.3125 | 0.120667     |
| Residual        | 12.800000 | 24.0 |     NaN | NaN          |
  • Interpret the results.
    • We can see the following p-values for each of the factors in the table:
      • water: p-value = .000527
      • sun: p-value = .000002
      • water*sun: p-value = .120667
    • Since the p-values for water and sun are both less than .05, this means that both factors have a statistically significant effect on plant height.
    • Since the p-value for the interaction effect (.120667) is not less than .05, there is no statistically significant interaction effect between sunlight exposure and watering frequency.
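Finally, the two-way ANOVA assumptions listed earlier can be checked on this fitted model, e.g. with a Shapiro-Wilk test on the residuals (normality) and Levene's test across the treatment cells (equal variances). A minimal sketch, reusing model and df from the Python code above:

```python
# Minimal sketch: check normality and equal-variance assumptions
# for the fitted two-way ANOVA model above.
from scipy import stats

# Normality: Shapiro-Wilk test on the model residuals
shapiro_stat, shapiro_p = stats.shapiro(model.resid)
print('Shapiro-Wilk p =', shapiro_p)   # p > .05 -> no evidence against normality

# Equal variances: Levene's test across the water x sun treatment cells
groups = [g['height'].values for _, g in df.groupby(['water', 'sun'])]
levene_stat, levene_p = stats.levene(*groups)
print('Levene p =', levene_p)          # p > .05 -> no evidence of unequal variances
```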