Chi-Square

This post explains the chi-square test.

Chapter 19 Witte

Chi-Square ($\chi^2$) Test for Qualitative (Nominal) Data

  • Used when data are qualitative with nominal measurement (categories with no inherent order, unlike ordinal data)
  • The chi-square test focuses on any discrepancies between the observed frequencies in the sample and the corresponding set of expected frequencies, which are derived from the null hypothesis.
  • One-variable $\chi^2$ - Goodness of fit
    • When data are distributed along a single qualitative variable, the one-variable $\chi^2$ test evaluates these discrepancies as a test for “goodness of fit.”
  • Two-variable $\chi^2$ - Test for independence
    • When data are cross-classified along two qualitative variables, the two-variable $\chi^2$ test evaluates these discrepancies as a “test of independence” or a lack of predictability between the two qualitative variables.

$ \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} $
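The formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the helper name `chi_square_statistic` and the toy frequencies are assumptions chosen only to show the arithmetic.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O_i - E_i)^2 / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Toy frequencies (hypothetical, chosen only to illustrate the formula)
toy_observed = [18, 22, 20]
toy_expected = [20, 20, 20]

toy_stat = chi_square_statistic(toy_observed, toy_expected)
print(f"chi-square = {toy_stat:.2f}")
```

Each category contributes its squared discrepancy scaled by the expected frequency, so a large gap in a category with a small expected count weighs heavily in the total.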

One-variable $\chi^2$ - Goodness of fit

Example - Blood Types

  • Your blood belongs to one of four genetically determined types: O, A, B, or AB.
  • A bulletin issued by a large blood bank claims that these four blood types are distributed according to the following proportions:
    • $P_O = 0.44;~P_A = 0.41;~P_B = 0.10;~P_{AB} = 0.05 $
    • Values of population proportions always must sum to $1.00$.
  • Let’s treat this claim as a null hypothesis to be tested with a random sample of 100 students from a large university.

| Frequency | O | A | B | AB | Total |
|---|---|---|---|---|---|
| $O_i$ (Sample) | 38 | 38 | 20 | 4 | 100 |
| $E_i$ (as per claim) | 44 | 41 | 10 | 5 | 100 |

  • Evaluating Discrepancies
    • The crucial question is whether the discrepancies between observed and expected frequencies are small enough to be regarded as a common outcome, given that the null hypothesis is true. If so, the null hypothesis is retained.
    • Otherwise, if the discrepancies are large enough to qualify as a rare outcome, the null hypothesis is rejected.
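  • $ \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} = \frac{(38-44)^2}{44} + \frac{(38-41)^2}{41} + \frac{(20-10)^2}{10} + \frac{(4-5)^2}{5} = 0.82 + 0.22 + 10.00 + 0.20 = 11.24 $
  • $df = n - 1 = 4 - 1 = 3, \chi^2 = 11.24$
  • Chi-square table: https://www.statisticshowto.com/tables/chi-squared-table-right-tail/
  • $ p \approx 0.01 < 0.05 $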

  • Reject the null hypothesis
  • The distribution of blood types in the student population differs from the proportions claimed by the blood bank
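The blood-type calculation can be checked with a short standard-library sketch. The helper `chi2_sf_df3` is an assumption introduced here for illustration: it uses the closed-form right-tail probability that holds only for a chi-square distribution with exactly 3 degrees of freedom, so it replaces the table lookup rather than reproducing it.

```python
import math

# Observed counts from the sample of 100 students (O, A, B, AB)
observed = [38, 38, 20, 4]
# Proportions claimed by the blood bank, i.e. the null hypothesis
claimed = [0.44, 0.41, 0.10, 0.05]

n = sum(observed)
expected = [n * p for p in claimed]  # about 44, 41, 10, 5

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1


def chi2_sf_df3(x):
    """Right-tail probability P(X > x) for a chi-square variable with 3 df only."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_df3(chi2)
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p_value:.4f}")
```

The printed p-value should land close to the 0.01 read from the table. Most of the statistic comes from type B, where 20 students were observed against an expected 10.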

Example - Zodiac sign

  • 256 visual artists were surveyed to find out their zodiac signs.

  • The results were: Aries (29), Taurus (24), Gemini (22), Cancer (19), Leo (21), Virgo (18), Libra (19), Scorpio (20), Sagittarius (23), Capricorn (18), Aquarius (20), Pisces (23).

  • Test the hypothesis that zodiac signs are evenly distributed across visual artists.

    • Categories = [‘Aries’, ‘Taurus’, ‘Gemini’, ‘Cancer’, ‘Leo’, ‘Virgo’, ‘Libra’, ‘Scorpio’, ‘Sagittarius’, ‘Capricorn’, ‘Aquarius’, ‘Pisces’]
    • Observed = $O_i = [29, 24, 22, 19, 21, 18, 19, 20, 23, 18, 20, 23]$
    • Expected frequency per sign under the null hypothesis $= 256 / 12 = 21.33$
    • Expected = $ E_i = [21.33, 21.33, 21.33, 21.33, 21.33, 21.33, 21.33, 21.33, 21.33, 21.33, 21.33, 21.33] $
    • $ \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} = 5.094 $
    • $df = 12 - 1 = 11, \chi^2 = 5.094$
    • Chi-square table: https://www.statisticshowto.com/tables/chi-squared-table-right-tail/
    • $ p \approx 0.92 > 0.05 $ (the table places it between .90 and .95)

    • Fail to reject the null hypothesis
    • The data give no evidence that zodiac signs are unevenly distributed across visual artists.
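The zodiac example can be sketched the same way, this time using the critical-value form of the decision instead of a p-value. The constant `CRITICAL_05_DF11 = 19.675` is an assumption taken from a standard chi-square table (right-tail area 0.05, 11 degrees of freedom). It does not appear in the notes above.

```python
# Observed counts for the 12 signs, in the order listed above
observed = [29, 24, 22, 19, 21, 18, 19, 20, 23, 18, 20, 23]

k = len(observed)                   # number of categories
expected_each = sum(observed) / k   # equal split under the null hypothesis

chi2 = sum((o - expected_each) ** 2 / expected_each for o in observed)
df = k - 1

# Table value for alpha = 0.05 and df = 11 (assumption, from a standard table)
CRITICAL_05_DF11 = 19.675

reject_null = chi2 > CRITICAL_05_DF11
print(f"chi2 = {chi2:.3f}, df = {df}, reject null: {reject_null}")
```

Because the statistic falls far below the critical value, the decision agrees with the p-value reasoning: the null hypothesis of an even distribution is retained.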

Two-variable $\chi^2$ - Test for independence

| Students | Major 1 | Major 2 | Major 3 | Total |
|---|---|---|---|---|
| Male | 39 | 30 | 51 | 120 |
| Female | 21 | 40 | 19 | 80 |
| Total | 60 | 70 | 70 | 200 |

  • Hypothesis
    • $ H_0: \text{Major and Gender are independent, i.e. the gender proportions are the same in every major} $
    • $ H_A: \text{Major and Gender are not independent} $
  • The table above gives the observed frequencies
  • Expected Frequencies
    • Total students = 200
      • 120 are Male and 80 are Female
      • 120/200 are Male and 80/200 are Female
      • 0.6 are Male and 0.4 are Female
      • Under independence, every major should show this same 0.6 : 0.4 split
        • Major 1 has 60 students, so a 0.6 : 0.4 split gives
          • 36 Male and 24 Female
        • Major 2 has 70 students, so a 0.6 : 0.4 split gives
          • 42 Male and 28 Female
        • Major 3 has 70 students, so a 0.6 : 0.4 split gives
          • 42 Male and 28 Female
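
| Students | Major 1 | Major 2 | Major 3 | Total |
|---|---|---|---|---|
| Male $O_i$ | 39 | 30 | 51 | 120 |
| Male $E_i$ | $60 \times \frac{120}{200} = 36$ | $70 \times \frac{120}{200} = 42$ | $70 \times \frac{120}{200} = 42$ | |
| Female $O_i$ | 21 | 40 | 19 | 80 |
| Female $E_i$ | $60 \times \frac{80}{200} = 24$ | $70 \times \frac{80}{200} = 28$ | $70 \times \frac{80}{200} = 28$ | |
| Total | 60 | 70 | 70 | 200 |
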
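The expected frequencies in the table can be reproduced from the margins alone: each cell is its row total times its column total, divided by the grand total. The sketch below is illustrative only, and the variable names are assumptions.

```python
# Observed counts per gender, ordered as Major 1, Major 2, Major 3
observed = {
    "Male": [39, 30, 51],
    "Female": [21, 40, 19],
}

row_totals = {gender: sum(counts) for gender, counts in observed.items()}
col_totals = [sum(column) for column in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Expected count for a cell = row total * column total / grand total
expected = {
    gender: [row_totals[gender] * col / grand_total for col in col_totals]
    for gender in observed
}

for gender, counts in expected.items():
    print(gender, counts)
```

Note that the expected counts keep the same row and column totals as the observed table. Only the way students are spread across the cells changes.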
  • $ \text{Expected Frequency} = \frac{\text{Row Total} \times \text{Col Total}}{\text{Grand Total}} $
  • $ \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} $
  • $ \chi^2 = \frac{(39-36)^2}{36} + \frac{(30-42)^2}{42} + \frac{(51-42)^2}{42} + \frac{(21-24)^2}{24} + \frac{(40-28)^2}{28} + \frac{(19-28)^2}{28}$
  • $ \chi^2 = 0.25 + 3.43 + 1.93 + 0.38 + 5.14 + 2.89 = 14.02$
  • Degrees of freedom
    • $df = (c-1)(r-1)$
      • $c$ is the number of columns and $r$ is the number of rows (totals excluded)
    • $df = (3-1)(2-1) = 2 \times 1 = 2$
  • Chi-square table: https://naneja.github.io/files/statistics/tables.pdf
    • $df = 2; \chi^2 = 14.02$
    • $p \approx 0.001 < 0.05$
    • Reject the null hypothesis
      • There is a relationship between Gender and Major
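The whole test of independence can be sketched end to end with the standard library. One assumption is worth flagging: the p-value line uses the identity $P(\chi^2 > x) = e^{-x/2}$, which holds only when $df = 2$, so this sketch would need a general chi-square tail function (for example from a statistics library) for tables of any other size.

```python
import math

# Observed counts: rows are Male, Female; columns are Major 1, 2, 3
observed = [
    [39, 30, 51],
    [21, 40, 19],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total  # expected count
        chi2 += (o - e) ** 2 / e

df = (len(col_totals) - 1) * (len(row_totals) - 1)

# Closed-form right-tail probability, valid only for df == 2 (assumption)
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.2f}, df = {df}, p = {p_value:.4f}")
```

The result should match the hand calculation above, with a p-value close to the 0.001 read from the table.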