Chi-square

This post explains the chi-square test.

Chi-Square Statistic

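$ \tilde{\chi}^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} $

Here $O_i$ is the observed count in category $i$, $E_i$ is the expected count under the null hypothesis, and $n$ is the number of categories.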

  • There are two types of chi-square tests. Both use the chi-square statistic and distribution, but for different purposes:
    • A chi-square goodness of fit test determines whether sample data are consistent with a hypothesized population distribution.
    • A chi-square test for independence compares two variables in a contingency table to see whether they are related. More generally, it tests whether the distributions of categorical variables differ from one another.
      • A very small chi-square test statistic means that the observed data fit the expected data extremely well, so there is little evidence against the null hypothesis (for the independence test, no evidence that the variables are related).
      • A very large chi-square test statistic means that the observed data do not fit the expected data well, which is evidence against the null hypothesis (for the independence test, evidence that the variables are related).
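As a rough illustration of the test for independence described above, the sketch below runs it on a 2x2 contingency table. The counts are hypothetical, made up purely for illustration, and the variable names are assumptions. It uses `scipy.stats.chi2_contingency` with `correction=False` so that the statistic matches the plain formula (no Yates continuity correction).

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table (made-up counts, for illustration only):
# rows = group A / group B, columns = outcome yes / outcome no
table = np.array([[30, 20],
                  [20, 30]])

# correction=False turns off Yates' continuity correction,
# so the statistic is the plain sum of (O - E)^2 / E
chi2_stat, p_value, dof, expected = stats.chi2_contingency(table, correction=False)

print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p_value = {p_value:.3f}")
print(expected)  # expected counts if the two variables were independent
```

With these made-up counts every expected cell is 25, so each cell contributes (5)^2 / 25 = 1 and the statistic is 4 with 1 degree of freedom.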

Example

  • 256 visual artists were surveyed to find out their zodiac sign. The results were: Aries (29), Taurus (24), Gemini (22), Cancer (19), Leo (21), Virgo (18), Libra (19), Scorpio (20), Sagittarius (23), Capricorn (18), Aquarius (20), Pisces (23).

  • Test the hypothesis that zodiac signs are evenly distributed across visual artists.

    • import numpy as np
      from scipy import stats

      categories = ['Aries', 'Taurus', 'Gemini', 'Cancer', 'Leo', 'Virgo', 'Libra', 'Scorpio', 'Sagittarius', 'Capricorn', 'Aquarius', 'Pisces']
      observed = [29, 24, 22, 19, 21, 18, 19, 20, 23, 18, 20, 23]

      # Under the null hypothesis every sign is equally likely,
      # so each expected count equals the mean of the observed counts.
      mean = np.mean(observed) # 21.33
      expected = [mean] * len(categories) # repeat

      # Per-category contributions (O - E)^2 / E
      component = []
      for obs, exp in zip(observed, expected):
          c = (obs - exp)**2 / exp
          component.append(c)

      chi_square_statistic = sum(component)
      print(f'chi_square_statistic = {chi_square_statistic:.3f}') # 5.094

      # degrees of freedom = number of categories - 1
      df = len(observed) - 1
      print(df) # 11

      # df = 11 and statistic = 5.094
      # https://www.statisticshowto.com/tables/chi-squared-table-right-tail/
      # p_value = between .90 and .95

      # Right-tail probability; sf is equivalent to 1 - cdf
      p_value = stats.chi2.sf(chi_square_statistic, df)
      print(f'p_value = {p_value:.3f}') # 0.927
      
    • from scipy import stats

      # Pass only the observed counts: when f_exp is omitted,
      # stats.chisquare assumes all categories are equally likely.
      chi_square_statistic, p_value = stats.chisquare(observed)

      print(f'chi_square_statistic={chi_square_statistic:.3f}, p_value={p_value:.3f}')
      # chi_square_statistic=5.094, p_value=0.927
      
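  • p-value = 0.927

    • We fail to reject the null hypothesis: the p-value is far larger than the usual significance levels of 0.01 to 0.05 (1% to 5%), so the data give no evidence that zodiac signs are unevenly distributed among visual artists.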
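The same decision can also be reached by comparing the statistic with a critical value instead of reading off the p-value. The following is a minimal sketch, assuming a significance level of 0.05 (the `alpha` name and value are assumptions) and reusing the statistic and degrees of freedom from the zodiac example.

```python
from scipy import stats

chi_square_statistic = 5.094  # statistic from the zodiac example
df = 11                       # 12 categories - 1
alpha = 0.05                  # assumed significance level

# Critical value: the point with right-tail probability alpha
critical_value = stats.chi2.ppf(1 - alpha, df)

# Reject the null only if the statistic exceeds the critical value
reject_null = chi_square_statistic > critical_value
print(f"critical_value = {critical_value:.3f}, reject_null = {reject_null}")
```

Because 5.094 is well below the critical value of about 19.675, the null hypothesis is not rejected, which agrees with the p-value of 0.927.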