Chi-Square
This post explains the chi-square test.
Reference: Witte, Chapter 19
Chi-Square ($\chi^2$) Test for Qualitative (Nominal) Data
- Used when data are qualitative with nominal measurement (categories with no inherent order, unlike ordinal data)
- The chi-square test focuses on discrepancies between the observed frequencies and the corresponding expected frequencies, which are derived from the null hypothesis.
- One-variable $\chi^2$ - Goodness of fit
- When data are distributed along a single qualitative variable, the one-variable $\chi^2$ test evaluates these discrepancies as a test for “goodness of fit.”
- Two-variable $\chi^2$ - Test for independence
- When data are cross-classified along two qualitative variables, the two-variable $\chi^2$ test evaluates these discrepancies as a “test of independence” or a lack of predictability between the two qualitative variables.
$ \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} $
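As a minimal sketch, the formula above can be written as a small Python function. The function name `chi_square_statistic` and the toy counts are illustrative assumptions, not part of the original notes.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical toy data: 60 observations over 3 categories,
# with equal expected counts under the null hypothesis.
toy_observed = [25, 15, 20]
toy_expected = [20, 20, 20]

toy_chi2 = chi_square_statistic(toy_observed, toy_expected)
print(toy_chi2)  # (25 + 25 + 0) / 20 = 2.5
```

Larger discrepancies between observed and expected counts give a larger statistic; identical lists give exactly zero.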
One-variable $\chi^2$ - Goodness of fit
Example - Blood Types
- Your blood belongs to one of four genetically determined types: O, A, B, or AB.
- A bulletin issued by a large blood bank claims that these four blood types are distributed according to the following proportions:
- $P_O = 0.44;~P_A = 0.41;~P_B = 0.10;~P_{AB} = 0.05 $
- Population proportions must always sum to $1.00$.
- Let’s treat this claim as a null hypothesis to be tested with a random sample of 100 students from a large university.
Frequency | O | A | B | AB | Total |
---|---|---|---|---|---|
$O_i$ (Sample) | 38 | 38 | 20 | 4 | 100 |
$E_i$ (as per claim) | 44 | 41 | 10 | 5 | 100 |
- Evaluating Discrepancies
- The crucial question is whether the discrepancies between observed and expected frequencies are small enough to be regarded as a common outcome, given that the null hypothesis is true. If so, the null hypothesis is retained.
- Otherwise, if the discrepancies are large enough to qualify as a rare outcome, the null hypothesis is rejected.
- $ \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} = 11.24 $
- $df = 3, \chi^2 = 11.24$
- https://www.statisticshowto.com/tables/chi-squared-table-right-tail/
$ p \approx 0.01 < 0.05 $
- Reject the null hypothesis
- The distribution of blood types in the student population differs from the proportions claimed by the blood bank
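The blood type example can be reproduced with the Python standard library alone. This is a sketch, not the method used in the original notes: the helper `chi2_sf` is a hypothetical name, and it computes the right-tail probability for integer degrees of freedom through the standard recurrence for the regularized upper incomplete gamma function instead of a table lookup.

```python
import math


def chi2_sf(x, df):
    """Right-tail probability P(X >= x) for a chi-square variable with integer df.

    Uses Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1),
    starting from Q(1/2, z) = erfc(sqrt(z)) or Q(1, z) = exp(-z).
    """
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1.0
    return q


# Claimed population proportions (null hypothesis) and the sample counts.
proportions = {"O": 0.44, "A": 0.41, "B": 0.10, "AB": 0.05}
observed = {"O": 38, "A": 38, "B": 20, "AB": 4}

n = sum(observed.values())                          # sample size, 100
expected = {k: n * p for k, p in proportions.items()}
chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
df = len(observed) - 1                              # categories minus one
p_value = chi2_sf(chi2, df)

print(f"chi2 = {chi2:.2f}, df = {df}, p = {p_value:.4f}")
```

The B category alone contributes $(20 - 10)^2 / 10 = 10$ to the statistic, so most of the evidence against the claim comes from that one blood type.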
Example - Zodiac sign
256 visual artists were surveyed to find out their zodiac signs.
The results were: Aries (29), Taurus (24), Gemini (22), Cancer (19), Leo (21), Virgo (18), Libra (19), Scorpio (20), Sagittarius (23), Capricorn (18), Aquarius (20), Pisces (23).
Test the hypothesis that zodiac signs are evenly distributed across visual artists.
- Categories = [‘Aries’, ‘Taurus’, ‘Gemini’, ‘Cancer’, ‘Leo’, ‘Virgo’, ‘Libra’, ‘Scorpio’, ‘Sagittarius’, ‘Capricorn’, ‘Aquarius’, ‘Pisces’]
- Observed = $O_i = [29, 24, 22, 19, 21, 18, 19, 20, 23, 18, 20, 23]$
- Expected frequency per sign under the null hypothesis $= 256 / 12 \approx 21.33$
- Expected = $ E_i = [21.33, 21.33, 21.33, 21.33, 21.33, 21.33, 21.33, 21.33, 21.33, 21.33, 21.33, 21.33] $
- $ \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} = 5.094 $
- $df = 11, \chi^2 = 5.094$
- https://www.statisticshowto.com/tables/chi-squared-table-right-tail/
$ p \approx 0.92 > 0.05 $ (the table places it between 0.90 and 0.95)
- Fail to reject the null hypothesis
- The sample gives no evidence that zodiac signs are unevenly distributed across visual artists.
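The zodiac example can also be decided by comparing the statistic against a critical value instead of a p-value. In this sketch the constant `CRITICAL_05_DF11` is an assumption taken from a standard right-tail chi-square table ($df = 11$, significance level 0.05); the variable names are illustrative.

```python
observed = [29, 24, 22, 19, 21, 18, 19, 20, 23, 18, 20, 23]

n = len(observed) and sum(observed)   # total artists surveyed, 256
k = len(observed)                     # number of zodiac signs, 12
expected_each = n / k                 # equal share per sign under the null

chi2 = sum((o - expected_each) ** 2 / expected_each for o in observed)
df = k - 1

# Assumed table value: right-tail critical value for df = 11 at alpha = 0.05.
CRITICAL_05_DF11 = 19.675
reject_null = chi2 > CRITICAL_05_DF11

print(f"chi2 = {chi2:.3f}, df = {df}, reject null: {reject_null}")
```

Because the statistic falls well below the critical value, the decision is the same as the one reached from the p-value.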
Two-variable $\chi^2$ - Test for independence
Students | Major 1 | Major 2 | Major 3 | Total |
---|---|---|---|---|
Male | 39 | 30 | 51 | 120 |
Female | 21 | 40 | 19 | 80 |
Total | 60 | 70 | 70 | 200 |
- Hypothesis
- $ H_0: \text{Major and Gender are independent, i.e. the gender distribution is the same in every major} $
- $ H_A: \text{Major and Gender are not independent} $
- The table above gives the observed frequencies
- Expected Frequencies
- Total students = 200
- 120 are Male and 80 are Female
- 120/200 are Male and 80/200 are Female
- 0.6 are Male and 0.4 are Female
- Under the null hypothesis, every major should show this same proportion
- Major 1 has 60 students, so they should split in the ratio 0.6:0.4
- 36 Male and 24 Female
- Major 2 has 70 students, so they should split in the ratio 0.6:0.4
- 42 Male and 28 Female
- Major 3 has 70 students, so they should split in the ratio 0.6:0.4
- 42 Male and 28 Female
Students | Major 1 | Major 2 | Major 3 | Total |
---|---|---|---|---|
Male $O_i$ | 39 | 30 | 51 | 120 |
Male $E_i$ | $ 60 \times \frac{120}{200} = 36$ | $ 70 \times \frac{120}{200} = 42$ | $ 70 \times \frac{120}{200} = 42$ | |
Female $O_i$ | 21 | 40 | 19 | 80 |
Female $E_i$ | $ 60 \times \frac{80}{200} = 24$ | $ 70 \times \frac{80}{200} = 28$ | $ 70 \times \frac{80}{200} = 28$ | |
Total | 60 | 70 | 70 | 200 |
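The expected counts in the table above come from multiplying each row total by each column total and dividing by the grand total. A minimal sketch of that calculation, with illustrative variable names that are not from the original notes:

```python
# Observed counts: rows are Male, Female; columns are Major 1, 2, 3.
observed = [
    [39, 30, 51],
    [21, 40, 19],
]

row_totals = [sum(row) for row in observed]         # [120, 80]
col_totals = [sum(col) for col in zip(*observed)]   # [60, 70, 70]
grand_total = sum(row_totals)                       # 200

# Expected count for each cell = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for label, row in zip(["Male", "Female"], expected):
    print(label, row)
```

The expected rows and columns keep the same totals as the observed table, which is a quick sanity check on the arithmetic.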
- $ \text{Expected Frequency} = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}} $
- $ \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} $
- $ \chi^2 = \frac{(39-36)^2}{36} + \frac{(30-42)^2}{42} + \frac{(51-42)^2}{42} + \frac{(21-24)^2}{24} + \frac{(40-28)^2}{28} + \frac{(19-28)^2}{28}$
- $ \chi^2 = 0.25 + 3.43 + 1.93 + 0.38 + 5.14 + 2.89 = 14.02$
- Degrees of freedom
- $df = (c-1)(r-1)$
- $c$ is the number of columns and $r$ is the number of rows
- $df = (3-1)(2-1) = 2 \times 1 = 2$
- https://naneja.github.io/files/statistics/tables.pdf
- $df = 2; \chi^2 = 14.02$
- $p < 0.001 < 0.05$
- Reject the null hypothesis
- There is a relationship between Gender and Major
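The whole test of independence can be sketched end to end in plain Python. Two assumptions are worth flagging: the variable names are illustrative, and the p-value line uses the closed form $P(\chi^2 \ge x) = e^{-x/2}$, which holds only for $df = 2$, so the code guards against any other table shape.

```python
import math

# Observed counts: rows are Male, Female; columns are Major 1, 2, 3.
observed = [
    [39, 30, 51],
    [21, 40, 19],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Sum (O - E)^2 / E over every cell of the table.
chi2 = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

# Degrees of freedom: (columns - 1) * (rows - 1).
df = (len(observed[0]) - 1) * (len(observed) - 1)

# Closed form valid only when df == 2.
if df != 2:
    raise ValueError("exp(-chi2 / 2) is the right-tail probability only for df = 2")
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.2f}, df = {df}, p = {p_value:.4f}")
```

For tables of other sizes, a chi-square table or a statistics library would be needed for the p-value; the statistic and degrees of freedom are computed the same way.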