Sampling Distributions

This post covers Sampling Distributions.

Inferential Statistics

  • Use a sample statistic to estimate a population parameter.
  • A statistic computed from a random sample is a random variable, because its value changes from sample to sample; the population parameter, by contrast, is fixed.
  • Since the statistic is a random variable, can we find its distribution, and in particular its mean and standard deviation?

Sampling Distribution

  • The sampling distribution of a statistic is the probability distribution of that statistic over all possible samples of a given size

  • Example: Pumpkin Weights

    • Population

      • | Pumpkin         | A  | B  | C  | D | E  | F  |
        |-----------------|----|----|----|---|----|----|
        | Weight (pounds) | 19 | 14 | 15 | 9 | 10 | 17 |

        Population Mean = 14

      • Means of all samples of size 2 (without replacement)

      • import collections
        import itertools

        import matplotlib.pyplot as plt
        import numpy as np

        data = {'A':19, 'B':14, 'C':15, 'D':9, 'E':10, 'F':17}

        sample_size = 2

        # Every possible sample of pumpkins (without replacement)
        pumpkins_samples = itertools.combinations(data, sample_size)

        # Mean weight of each sample
        samples_mean = []
        for sample in pumpkins_samples:
            sample_weight = []
            for i in range(sample_size):
                sample_weight.append(data[sample[i]])
            sample_mean = np.mean(sample_weight)
            samples_mean.append(sample_mean)

        samples_mean = sorted(samples_mean)
        samples_mean_freq = collections.Counter(samples_mean)

        # Convert counts into relative frequencies (probabilities)
        total = sum(samples_mean_freq.values(), 0.0)

        for k, v in samples_mean_freq.items():
            samples_mean_freq[k] = v/total

        x = list(samples_mean_freq.keys())
        y = list(samples_mean_freq.values())

        # Bar chart of the sampling distribution of the sample mean
        plt.bar(x=list(range(len(y))), height=y);
        plt.xticks(ticks=list(range(len(y))), labels=x);

        plt.title('Sampling Distribution', fontsize=14)
        plt.xlabel('sample mean', fontsize=14)
        plt.ylabel('prob', fontsize=14);


  • The population mean is 14, and only one of the 15 possible samples (pumpkins A and D) has a sample mean equal to it

  • Thus, using a sample mean to estimate the population mean generally involves some error

  • Compute mean of all sample means

    * ```python
      print(np.mean(samples_mean)) # 14.0
      ```

    • Although an individual sample mean is usually in error, its expected value is exactly the population mean.

    • In other words, if the sampling experiment is repeated many times, the overall average of the sample means equals the population mean

  • Sample Size = 5

    * ```python
      import collections
      import itertools

      import matplotlib.pyplot as plt
      import numpy as np

      data = {'A':19, 'B':14, 'C':15, 'D':9, 'E':10, 'F':17}

      sample_size = 5

      # Every possible sample of 5 pumpkins (without replacement)
      pumpkins_samples = itertools.combinations(data, sample_size)

      # Mean weight of each sample
      samples_mean = []
      for sample in pumpkins_samples:
          sample_weight = []
          for i in range(sample_size):
              sample_weight.append(data[sample[i]])
          sample_mean = np.mean(sample_weight)
          samples_mean.append(sample_mean)

      samples_mean = sorted(samples_mean)
      samples_mean_freq = collections.Counter(samples_mean)

      # Convert counts into relative frequencies (probabilities)
      total = sum(samples_mean_freq.values(), 0.0)

      for k, v in samples_mean_freq.items():
          samples_mean_freq[k] = v/total

      x = list(samples_mean_freq.keys())
      y = list(samples_mean_freq.values())

      # Bar chart of the sampling distribution of the sample mean
      plt.bar(x=list(range(len(y))), height=y);
      plt.xticks(ticks=list(range(len(y))), labels=x);

      plt.title('Sampling Distribution', fontsize=14)
      plt.xlabel('sample mean', fontsize=14)
      plt.ylabel('prob', fontsize=14);
      ```

    • Compute mean of all sample means

      * ```python
        print(np.mean(samples_mean)) # approximately 14 (up to floating-point rounding)
        ```

  • Sample Means with size 2 and 5

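    * ```python
      import itertools
      import matplotlib.pyplot as plt
      import numpy as np

      data = {'A':19, 'B':14, 'C':15, 'D':9, 'E':10, 'F':17}

      def all_sample_means(sample_size):
          # Mean weight of every possible sample of the given size
          return [np.mean([data[k] for k in sample])
                  for sample in itertools.combinations(data, sample_size)]

      sample_means_2 = all_sample_means(2)
      sample_means_5 = all_sample_means(5)

      plt.scatter(x=sample_means_2, y=[2]*len(sample_means_2));
      plt.scatter(x=sample_means_5, y=[5]*len(sample_means_5));
      plt.axvline(x=14);  # population mean

      plt.ylim(0,6);

      plt.title('Sample Means for size 2 and 5', fontsize=14)
      plt.xlabel('mean', fontsize=14)
      plt.ylabel('sample size', fontsize=14)
      ```

    • Using the sample mean to estimate the population mean involves sampling error. However, the error is on average smaller with the larger sample size (n=5) than with the smaller sample size (n=2)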

Sampling Error

  • The error that results from using a sample characteristic to estimate a population characteristic

  • Sample means cluster more closely around the population mean as the sample size increases

  • The possible sampling error therefore decreases as the sample size increases
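
As a rough check of these points, the sketch below enumerates every sample (without replacement) from the pumpkin population for sizes 2 through 5 and reports the largest absolute error and the standard deviation of the sample means. The helper name `sample_means` and the two result dictionaries are introduced here for illustration only; they are not part of the original example.

```python
import itertools
import statistics

# Pumpkin weights (pounds) from the example above
data = {'A': 19, 'B': 14, 'C': 15, 'D': 9, 'E': 10, 'F': 17}
population_mean = statistics.fmean(data.values())

def sample_means(sample_size):
    """Mean weight of every possible sample of the given size (no replacement)."""
    return [statistics.fmean(data[k] for k in sample)
            for sample in itertools.combinations(data, sample_size)]

spread_by_size = {}     # SD of the sample means, keyed by sample size
max_error_by_size = {}  # largest |sample mean - population mean|, keyed by sample size
for n in range(2, 6):
    means = sample_means(n)
    spread_by_size[n] = statistics.pstdev(means)
    max_error_by_size[n] = max(abs(m - population_mean) for m in means)
    print(f"n={n}: max error={max_error_by_size[n]:.2f}, "
          f"SD of sample means={spread_by_size[n]:.3f}")
```

Both the largest possible error and the spread of the sample means shrink as the sample size grows, which is the pattern the scatter plot of sizes 2 and 5 suggests.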

What happens when we don’t have a population to sample from?

  • Sampling distribution of the sample mean
    • Population is normally distributed
    • Population is not normally distributed

Population is Normal

  • Population: mean $\mu$ and SD $\sigma$

  • The sampling distribution of the sample mean is also normal, irrespective of sample size

  • If the population is large compared to the sample size, or sampling is done with replacement

    • the sampling distribution has mean $ \mu $ and SD $ \frac{\sigma}{\sqrt{n}} $
  • The term standard error is used for the standard deviation of a statistic

    • Standard error of the sample mean: $SE(\bar{X}) = SD(\bar{X}) = \frac{\sigma}{\sqrt{n}}$

      Mean of the sample mean: $ \mu_{\bar{X}} = \mu $

      Z-score of the sample mean: $ z = \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} $

  • Example: Speedboat Engines

    • The engines made by Ford for speedboats have an average power of 220 horsepower (HP) and standard deviation of 15 HP. You can assume the distribution of power follows a normal distribution.

    • Consumer Reports® is testing the engines and will dispute the company’s claim if the sample mean is less than 215 HP. If they take a sample of 4 engines, what is the probability the mean is less than 215?

      • Find $ P(\bar{X} < 215) $

      • Since the population is normally distributed, $ \bar{X} $ is also normally distributed, with mean 220 and SD $ \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{4}} = 7.5$

      • $ P(\bar{X} < 215) = P(Z < \frac{215-220}{7.5}) = P(Z < -0.67) = 0.2514 $

      • Thus, probability that the mean is less than 215 HP is 25.14%.

      • https://online.stat.psu.edu/stat500/sites/stat500/files/inline-images/500l4ex4.1.png
    • If Consumer Reports® samples 100 engines, what is the probability that the sample mean will be less than 215?

      • $\mu=220$, $SD=\frac{15}{\sqrt{100}}=1.5$

      • $ P(\bar{X} < 215) = P(Z < \frac{215-220}{1.5}) = P(Z < -3.33) = 0.00043 $

      • Probability is 0.043%
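
The two speedboat calculations can be reproduced with the standard library's `statistics.NormalDist`. This is a minimal sketch; the function name `prob_mean_below` is an illustrative assumption. Because the z-score is not rounded to two decimals here, the n=4 result comes out near 0.2525 rather than the table value 0.2514.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 220, 15  # population mean and SD of engine power (HP)

def prob_mean_below(threshold, n):
    """P(sample mean < threshold) for a sample of size n from a normal population."""
    standard_error = sigma / sqrt(n)
    return NormalDist(mu, standard_error).cdf(threshold)

p_n4 = prob_mean_below(215, 4)      # standard error 7.5
p_n100 = prob_mean_below(215, 100)  # standard error 1.5
print(f"n=4:   P(mean < 215) = {p_n4:.4f}")
print(f"n=100: P(mean < 215) = {p_n100:.5f}")
```

The same 5 HP shortfall is quite plausible for a sample of 4 engines but very unlikely for a sample of 100, because the standard error drops from 7.5 to 1.5.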

Population is not Normal

Central Limit Theorem

  • For a large sample size, $ \bar{x} $ is approximately normally distributed, regardless of the distribution of the population being sampled
  • If the population has mean $ \mu $ and SD $ \sigma $, then $\bar{x}$ has mean $\mu$ and SD $\frac{\sigma}{\sqrt{n}}$
  • The CLT applies to the sample mean from any distribution, whether left skewed or right skewed

  • As long as the sample size is large, the distribution of sample means is approximately normal

  • For practice questions, "large" is taken to mean $n>30$

  • Central Limit Theorem demonstration

    • The sample mean is not normal if the population is skewed and the sample size is small
    • Replicating the experiment 10,000 times is enough to show the shape of the sampling distribution
    • If the population is normal, the sample mean is normal even for $n=2$
    • If the population is skewed, the distribution of the sample mean looks more normal as $n$ gets larger
    • In all cases:
      • Mean of the sample mean $=$ population mean
      • Standard error of the sample mean $ = \frac{\sigma}{\sqrt{n}} $
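
The demonstration can be approximated with a short simulation. The sketch below is an assumption-laden illustration, not the original demo: it takes an Exponential(1) population (right skewed, mean 1, SD 1) as the skewed case, replicates the experiment 10,000 times for n=2 and n=30, and compares the mean, spread, and skewness of the resulting sample means. The `skewness` helper is defined here for illustration.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable
replications = 10_000
pop_mean, pop_sd = 1.0, 1.0  # mean and SD of an Exponential(1) population

def skewness(values):
    """Mean cubed deviation divided by the cubed SD (0 for a symmetric shape)."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean((v - m) ** 3 for v in values) / sd ** 3

results = {}
for n in (2, 30):
    # One sample mean per replication
    means = [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
             for _ in range(replications)]
    results[n] = (statistics.fmean(means), statistics.pstdev(means), skewness(means))
    print(f"n={n}: mean={results[n][0]:.3f}, SD={results[n][1]:.3f} "
          f"(sigma/sqrt(n)={pop_sd / n ** 0.5:.3f}), skewness={results[n][2]:.2f}")
```

For both sample sizes the simulated mean and SD should sit close to $\mu$ and $\frac{\sigma}{\sqrt{n}}$, while the skewness is markedly smaller for n=30 than for n=2.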

Sampling Distribution of the Sample Mean

  • Sampling distribution of the sample mean can be defined with help from Central Limit Theorem:

  • Mean of the sampling distribution of the sample mean $=$ population mean, $\mu$

  • Standard error (deviation) of the sampling distribution of the sample mean $ = \frac{\sigma}{\sqrt{n}} $

  • Normal if:

    • Population distribution is normal or
    • Sample size is large ($ n>30 $)
  • Example: Weights of Baby Giraffes

    • The weights of baby giraffes are known to have a mean of 125 pounds and a standard deviation of 15 pounds.

    • If we obtained a random sample of 40 baby giraffes,

      1. what is the probability that the sample mean will be between 120 and 130 pounds?
      2. what is the 75th percentile of the sample means of size n=40?
    • Solution:

      • It is not known whether the population is normal, but $n = 40 > 30$, so the Central Limit Theorem can be applied

      • The sampling distribution of the sample mean has mean $ \mu = 125$ and standard error $ \sigma_{\bar{X}} = \frac{15}{\sqrt{40}} = 2.37170825$

      • $ P(120<\bar{X}<130) = P( \frac{120-125}{2.372} < Z < \frac{130-125}{2.372}) = P(-2.108< Z <2.108)$

      • $ = P(Z <2.108) - P(Z <-2.108) = 0.9826 - 0.0174 = 0.9652 $

      • $ 96.52\% $

      • For the 75th percentile, find $a$ such that $ P(Z<a) = 0.75 $, which gives $ a = 0.6745 $

      • $ 0.6745 = \frac{\bar{x}-125}{2.372} \implies \bar{x} = 125 + 0.6745 \times 2.372 = 126.6 $

      • The 75th percentile of all sample means of size n=40 is 126.6 pounds
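
Both giraffe answers can be checked with `statistics.NormalDist`, treating the sampling distribution as normal on the strength of the Central Limit Theorem. This is a minimal sketch; the variable names are illustrative. The probability comes out near 0.965 rather than 0.9652 because the z-scores are not rounded to table precision.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 125, 15, 40  # population mean, population SD, sample size

# Approximate sampling distribution of the sample mean (by the CLT)
sampling_dist = NormalDist(mu, sigma / sqrt(n))

prob_between = sampling_dist.cdf(130) - sampling_dist.cdf(120)
percentile_75 = sampling_dist.inv_cdf(0.75)

print(f"P(120 < mean < 130) = {prob_between:.4f}")
print(f"75th percentile of the sample mean = {percentile_75:.1f} pounds")
```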

Sampling Distribution of the Sample Proportion

  • Notations

    • $ p $ is the population proportion. It is a fixed value
    • $ n $ is the size of the random sample
    • $ \hat{p} $ is the sample proportion. It varies based on the sample
  • Example: Favorite Color

    • | Name  | A     | B    | C      | D      | E    |
      |-------|-------|------|--------|--------|------|
      | Color | Green | Blue | Yellow | Purple | Blue |

      Proportion of the population who prefer Blue

      • $ p = \frac{2}{5} = 0.4 $
    • Proportion who prefer Blue, estimated from a sample (as if the population were unknown)

    • Sample of 2

      • # %matplotlib inline  (Jupyter-only magic; uncomment when running in a notebook)

        import collections
        import itertools

        import matplotlib.pyplot as plt

        def get_blues_prop(sample, data):
            # Fraction of people in the sample whose favorite color is Blue
            n = len(sample)
            blues = 0
            for item in sample:
                color = data[item]
                blues += color == 'Blue'
            return blues/n

        def normalize_freq(freqs):
            # Convert raw counts into relative frequencies (probabilities)
            total = sum(freqs.values())
            for k,v in freqs.items():
                freqs[k] = v/total
            return freqs

        data = {'A':'Green', 'B':'Blue', 'C':'Yellow', 'D':'Purple', 'E':'Blue'}
              
        sample_size = 2
              
        samples = itertools.combinations(data, sample_size)
              
        props = sorted([get_blues_prop(sample, data) for sample in samples])
              
        freqs = collections.Counter(props)
              
        freqs = normalize_freq(freqs)
              
        print(freqs)
              
        labels = list(freqs.keys())
        height = list(freqs.values())
        x = range(len(height))
              
        plt.bar(x=x, height=height);
        plt.xticks(ticks=x, labels=labels);
        plt.ylabel('Prob', fontsize=16)
        plt.xlabel('P (Blue)', fontsize=16);
        plt.title('Sampling Distribution of P(Blue)', fontsize=16);
        
      • PMF:

      • | P(Blue)     | 0   | 0.5 | 1.0 |
        |-------------|-----|-----|-----|
        | Probability | 0.3 | 0.6 | 0.1 |
      • The true proportion is 2/5 = 0.4
      • With n=2, the only possible sample proportions are 0, 0.5, and 1.0, so no sample gives a proportion equal to the true proportion
    • Sample of 4

      • | P(Blue)     | 0.25 | 0.5 |
        |-------------|------|-----|
        | Probability | 0.4  | 0.6 |
    • The sample proportion, like the sample mean, is subject to sampling error

    • The larger the sample size, the smaller the spread of the sampling distribution
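
To see the shrinking spread directly, the sketch below enumerates the exact sampling distribution of the sample proportion for n = 2, 3, and 4 from the five-person population, using `fractions.Fraction` to avoid rounding. The function name `proportion_distribution` and the `variances` dictionary are illustrative assumptions, not part of the original code.

```python
import collections
import itertools
from fractions import Fraction

data = {'A': 'Green', 'B': 'Blue', 'C': 'Yellow', 'D': 'Purple', 'E': 'Blue'}

def proportion_distribution(sample_size):
    """Exact sampling distribution of the proportion of Blue (no replacement)."""
    samples = list(itertools.combinations(data, sample_size))
    counts = collections.Counter(
        Fraction(sum(data[k] == 'Blue' for k in sample), sample_size)
        for sample in samples)
    return {p_hat: Fraction(c, len(samples)) for p_hat, c in sorted(counts.items())}

distributions = {n: proportion_distribution(n) for n in (2, 3, 4)}
variances = {}
for n, dist in distributions.items():
    mean = sum(p_hat * prob for p_hat, prob in dist.items())
    variances[n] = sum((p_hat - mean) ** 2 * prob for p_hat, prob in dist.items())
    print(f"n={n}: mean={float(mean):.2f}, variance={float(variances[n]):.4f}")
```

For every sample size the mean of the sample proportion equals the true proportion 0.4, while the variance falls as n increases.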

Normal Approximation to the Binomial

  • How to apply the Central Limit Theorem to find the sampling distribution of the sample proportion