Sampling Distributions

Published: February 22, 2021

This post covers Sampling Distributions.

Z-Table

Please refer to a standard normal table (z-table) for the probabilities used below.

Inferential Statistics

Use a sample statistic to estimate a population parameter.
A sample statistic computed from a random sample is a random variable, whereas the population parameter is a fixed value.
Since the statistic is a random variable, can we find its distribution, including its mean and standard deviation?

Sampling Distribution

The sampling distribution of a statistic is the probability distribution of that statistic over all possible samples of a given size.

Example: Pumpkin Weights

Population

| Pumpkin | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| Weight (pounds) | 19 | 14 | 15 | 9 | 10 | 17 |

Population mean = 14

Mean of all samples of size 2 (without replacement)

%matplotlib inline
import itertools
import numpy as np
import collections
import matplotlib.pyplot as plt
      
def get_weights(sample):
    sample_weight = []
    for s in sample:
        w = data[s]
        sample_weight.append(w)
    return sample_weight
      
data = {'A':19, 'B':14, 'C':15, 'D':9, 'E':10, 'F':17}
      
sample_size = 2
      
# All possible combinations of sample size two
pumpkins_samples = itertools.combinations(data, sample_size)
      
samples_mean = []
for sample in pumpkins_samples:
    sample_weight = get_weights(sample)
    sample_mean = np.mean(sample_weight)
    samples_mean.append(sample_mean)
          
sample_means_2 = sorted(samples_mean)
samples_mean_freq = collections.Counter(sample_means_2)
      
keys = list(samples_mean_freq.keys()) 
values = list(samples_mean_freq.values()) 
      
values = [v/sum(values) for v in values]
      
plt.bar(range(len(samples_mean_freq)), values, tick_label=keys)
      
plt.title('Sampling Distribution', fontsize=14)
plt.xlabel('Sample Mean', fontsize=14)
plt.ylabel('Prob', fontsize=14);

The population mean is 14, and only one of the 15 possible samples (pumpkins A and D) has a sample mean equal to it.
Thus, using a sample mean to estimate the population mean generally involves some error.
Compute the mean of all sample means:
```
print(np.mean(samples_mean)) # 14
```
- Even though an individual sample mean has error, its expected value is exactly the population mean.
- Equivalently, if the experiment is repeated many times, the overall average of the sample means approaches the population mean.

Sample Size = 5

%matplotlib inline
import itertools
import numpy as np
import collections
import matplotlib.pyplot as plt
  
def get_weights(sample):
    sample_weight = []
    for s in sample:
        w = data[s]
        sample_weight.append(w)
    return sample_weight
  
data = {'A':19, 'B':14, 'C':15, 'D':9, 'E':10, 'F':17}
  
sample_size = 5
  
# All possible combinations of the chosen sample size (five here)
pumpkins_samples = itertools.combinations(data, sample_size)
  
samples_mean = []
for sample in pumpkins_samples:
    sample_weight = get_weights(sample)
    sample_mean = np.mean(sample_weight)
    samples_mean.append(sample_mean)
      
sample_means_5 = sorted(samples_mean)
samples_mean_freq = collections.Counter(sample_means_5)
  
keys = list(samples_mean_freq.keys()) 
values = list(samples_mean_freq.values()) 
  
values = [v/sum(values) for v in values]
  
plt.bar(range(len(samples_mean_freq)), values, tick_label=keys)
  
plt.title('Sampling Distribution', fontsize=14)
plt.xlabel('Sample Mean', fontsize=14)
plt.ylabel('Prob', fontsize=14);

Compute mean of all sample means
```
print(np.mean(samples_mean)) # 14
```

Sample Means with size 2 and 5

plt.scatter(x=sample_means_2, y=[2]*len(sample_means_2));
plt.scatter(x=sample_means_5, y=[5]*len(sample_means_5));
plt.axvline(x=14);
  
plt.ylim(0,6);
  
plt.title('Sample Means for size 2 and 5', fontsize=14)
plt.xlabel('mean', fontsize=14)
plt.ylabel('sample size', fontsize=14);

- Using the sample mean to estimate the population mean involves sampling error. However, the error is smaller on average with a larger sample size (n=5) than with a smaller one (n=2).

Sampling Error

Sampling error is the error that results from using a sample characteristic to estimate a population characteristic.
Sample means cluster more closely around the population mean as the sample size increases.
The possible sampling error decreases as the sample size increases.
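
A minimal sketch of this effect, reusing the pumpkin weights from the example above. It enumerates every sample of size n (without replacement) and reports the worst-case sampling error and the spread of the sample means. The helper name `sample_means` is made up for this illustration.

```python
import itertools
import statistics

# Pumpkin weights from the example above (this is the whole population).
weights = [19, 14, 15, 9, 10, 17]
population_mean = sum(weights) / len(weights)

def sample_means(values, n):
    """Means of every possible sample of size n, drawn without replacement."""
    return [sum(s) / n for s in itertools.combinations(values, n)]

spread_by_size = {}
for n in range(1, len(weights) + 1):
    means = sample_means(weights, n)
    # Largest absolute sampling error over all possible samples of size n.
    worst_error = max(abs(m - population_mean) for m in means)
    # Spread of the sampling distribution: SD of all possible sample means.
    spread_by_size[n] = statistics.pstdev(means)
    print(f"n={n}: samples={len(means):2d}, "
          f"worst error={worst_error:.2f}, SD of means={spread_by_size[n]:.3f}")
```

Both the worst-case error and the SD of the sample means shrink as n grows, and reach zero when the sample is the whole population.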

What happens when we don’t have the whole population to sample from?
The sampling distribution of the sample mean is then considered in two cases:
- Population is normally distributed
- Population is not normally distributed

Population is Normal

Population: mean $\mu$ and SD $\sigma$
The sampling distribution of the sample mean is also normal, irrespective of the sample size.
If the population is large compared to the sample size, or sampling is done with replacement:
- the sampling distribution has mean $ \mu $ and SD $ \frac{\sigma}{\sqrt{n}} $
The term standard error is used for the standard deviation of a statistic:
- Standard error of the sample mean: $SE(\bar{X}) = SD(\bar{X}) = \frac{\sigma}{\sqrt{n}}$
- Mean of the sample mean: $ E(\bar{X}) = \mu $
- Z-score of the sample mean: $ z = \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} $
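
As a rough check of the $\frac{\sigma}{\sqrt{n}}$ formula, the sketch below simulates repeated samples from a normal population. The values $\mu=220$, $\sigma=15$, $n=4$, the seed, and the number of replications are assumptions chosen only for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

mu, sigma = 220, 15      # assumed population mean and SD
n = 4                    # sample size
replications = 20_000    # number of repeated samples

# Draw many samples of size n and record each sample mean.
means = [
    statistics.fmean([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(replications)
]

simulated_mean = statistics.fmean(means)
simulated_se = statistics.pstdev(means)
theoretical_se = sigma / math.sqrt(n)

print(f"mean of sample means: {simulated_mean:.2f} (theory: {mu})")
print(f"SD of sample means:   {simulated_se:.2f} (theory: {theoretical_se:.2f})")
```

The simulated values should land close to, but not exactly on, the theoretical ones, since the simulation itself is a random sample of sample means.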
Example: Speedboat Engines
- The engines made by Ford for speedboats have an average power of 220 horsepower (HP) and a standard deviation of 15 HP. Assume the power follows a normal distribution.
- Consumer Reports® is testing the engines and will dispute the company’s claim if the sample mean is less than 215 HP. If they take a sample of 4 engines, what is the probability that the sample mean is less than 215 HP?
  - Find $ P(\bar{X} < 215) $
  - Since the population is normal, $ \bar{X} $ is normal with mean 220 and SD $ \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{4}} = 7.5$
  - $ P(\bar{X} < 215) = P(Z < \frac{215-220}{7.5}) \approx P(Z < -0.67) = 0.2514 $
  - Thus, the probability that the sample mean is less than 215 HP is about 25.14%.
  - https://online.stat.psu.edu/stat500/sites/stat500/files/inline-images/500l4ex4.1.png
- If Consumer Reports® samples 100 engines, what is the probability that the sample mean will be less than 215 HP?
  - $\mu=220$ and $SE=\frac{15}{\sqrt{100}}=1.5$
  - $ P(\bar{X} < 215) = P(Z < \frac{215-220}{1.5}) \approx P(Z < -3.33) = 0.00043 $
  - The probability is about 0.043%.
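
The same two probabilities can be computed without a z-table. This sketch uses `statistics.NormalDist` from the Python standard library; `prob_mean_below` is a hypothetical helper named for this illustration. Because z is not rounded to two decimals here, the n=4 result comes out near 0.2525 rather than the table value 0.2514.

```python
import math
from statistics import NormalDist

mu, sigma = 220, 15   # population mean and SD of engine power (HP)
threshold = 215       # cutoff below which the claim is disputed (HP)

def prob_mean_below(threshold, mu, sigma, n):
    """P(sample mean < threshold) when the sample mean is Normal(mu, sigma/sqrt(n))."""
    standard_error = sigma / math.sqrt(n)
    return NormalDist(mu, standard_error).cdf(threshold)

p_4 = prob_mean_below(threshold, mu, sigma, 4)
p_100 = prob_mean_below(threshold, mu, sigma, 100)

print(f"n=4:   P(mean < 215) = {p_4:.4f}")
print(f"n=100: P(mean < 215) = {p_100:.5f}")
```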

Population is not Normal

Central Limit Theorem
- For a large sample size, $ \bar{x} $ is approximately normally distributed, regardless of the distribution of the population being sampled
- If the population has mean $ \mu$ and SD $ \sigma $, then $\bar{x}$ has mean $\mu$ and SD $\frac{\sigma}{\sqrt{n}}$
- The CLT applies to the sample mean regardless of the population's shape (for example, left-skewed or right-skewed)
- As long as the sample size is large, the distribution of sample means is approximately normal
- For exercises, "large" usually means $n>30$
- Central Limit Theorem demonstration
  - https://onlinestatbook.com/stat_sim/sampling_dist/
    - Click Begin
    - Select Normal
      - Click Animated repeatedly, or click 5, 10,000 or 100,000, to draw samples
      - Watch the distribution of means update
  - Observations
    - The sample mean is not normal if the population is skewed and the sample size is small
    - Replicating the experiment 10,000 times gives a good picture of the distribution
    - If the population is normal, then the sample mean is normal even if n=2
    - If the population is skewed, the distribution of the sample mean looks normal as $n$ gets larger
    - In all cases:
      - Mean of the sample mean $=$ population mean
      - Standard error of the sample mean $ = \frac{\sigma}{\sqrt{n}} $
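
The observations above can be reproduced offline with a small simulation. This is only a sketch under stated assumptions: the skewed population is taken to be exponential with rate 1 (so $\mu = \sigma = 1$), and the seed, sample sizes, and replication count are arbitrary choices. Skewness near 0 is used as a crude indicator of a bell-shaped distribution.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

rate = 1.0             # assumed exponential population (right-skewed): mean = SD = 1/rate
mu = sigma = 1 / rate
replications = 10_000  # number of repeated samples per sample size

def skewness(values):
    """Sample skewness: about 0 for a symmetric, bell-shaped distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

results = {}
for n in (2, 30, 100):
    # Each entry is the mean of one sample of size n from the skewed population.
    means = [
        statistics.fmean([random.expovariate(rate) for _ in range(n)])
        for _ in range(replications)
    ]
    results[n] = {
        "mean": statistics.fmean(means),
        "se": statistics.pstdev(means),
        "skew": skewness(means),
    }
    print(f"n={n:3d}: mean={results[n]['mean']:.3f}, "
          f"SE={results[n]['se']:.3f} (theory {sigma / math.sqrt(n):.3f}), "
          f"skewness={results[n]['skew']:.2f}")
```

The mean and standard error should track $\mu$ and $\frac{\sigma}{\sqrt{n}}$ for every n, while the skewness of the sample means drops toward 0 only as n gets larger.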


Sampling Distribution of the Sample Mean

The sampling distribution of the sample mean can be described with help from the Central Limit Theorem:
Mean of the sampling distribution of the sample mean $=$ population mean, $\mu$
Standard error (standard deviation) of the sampling distribution of the sample mean $ = \frac{\sigma}{\sqrt{n}} $
The sampling distribution is normal (exactly or approximately) if:
- The population distribution is normal, or
- The sample size is large ($ n>30 $)
Example: Weights of Baby Giraffes
- The weights of baby giraffes are known to have a mean of 125 pounds and a standard deviation of 15 pounds.
- If we obtained a random sample of 40 baby giraffes,
  1. what is the probability that the sample mean will be between 120 and 130 pounds?
  2. what is the 75th percentile of the sample means of size n=40?
- Solution:
  - It is not known whether the population is normal, but $n = 40 > 30$, so the Central Limit Theorem can be applied
  - The sampling distribution of the sample mean has mean $ \mu = 125$ and standard error $ \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{40}} \approx 2.372$
  - $ P(120<\bar{X}<130) = P( \frac{120-125}{2.372} < Z < \frac{130-125}{2.372}) = P(-2.108< Z <2.108)$
  - $ \approx P(Z <2.11) - P(Z <-2.11) = 0.9826 - 0.0174 = 0.9652 $
  - So the probability is about $ 96.52\% $
  - For the 75th percentile, $ P(Z<a) = 0.75 \implies a = 0.6745 $
  - $ 0.6745 = \frac{\bar{x}-125}{2.372} \implies \bar{x} = 125 + 0.6745 \times 2.372 \approx 126.6 $
  - The 75th percentile of all sample means of size n=40 is about 126.6 pounds
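
Both answers can be cross-checked in a few lines. This sketch assumes the CLT approximation holds and uses `statistics.NormalDist` from the standard library; the variable names are illustrative. Because z is not rounded, the probability comes out near 0.965 rather than the table-based 0.9652.

```python
import math
from statistics import NormalDist

mu, sigma, n = 125, 15, 40   # population mean, population SD, sample size

# Approximate sampling distribution of the sample mean (by the CLT).
sampling_dist = NormalDist(mu, sigma / math.sqrt(n))

# 1. Probability that the sample mean falls between 120 and 130 pounds.
p_between = sampling_dist.cdf(130) - sampling_dist.cdf(120)

# 2. 75th percentile of the sample means.
percentile_75 = sampling_dist.inv_cdf(0.75)

print(f"P(120 < mean < 130) = {p_between:.4f}")
print(f"75th percentile     = {percentile_75:.1f} pounds")
```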

Sampling Distribution of the Sample Proportion

Notations
- $ p $ is the population proportion. It is a fixed value
- $ n $ is the size of the random sample
- $ \hat{p} $ is the sample proportion. It varies based on the sample

Example: Favorite Color

| Name | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| Color | Green | Blue | Yellow | Purple | Blue |

Proportion of the population who prefer Blue:
- $ p = \frac{2}{5} = 0.4 $

Proportion who prefer Blue based on a sample (when the population is unknown):

Sample of 2

%matplotlib inline
      
import itertools
import collections
import random
      
import matplotlib.pyplot as plt
      
import numpy as np
import pandas as pd
      
def get_blues_prop(sample, data):
    n = len(sample)
    blues = 0
    for item in sample:
        color = data[item]
        blues += color == 'Blue'
    return blues/n
      
def normalize_freq(freqs):
    total = sum(freqs.values())
    for k,v in freqs.items():
        freqs[k] = v/total
    return freqs
        
data = {'A':'Green', 'B':'Blue', 'C':'Yellow', 'D':'Purple', 'E':'Blue'}
      
sample_size = 2
      
samples = itertools.combinations(data, sample_size)
      
props = sorted([get_blues_prop(sample, data) for sample in samples])
      
freqs = collections.Counter(props)
      
freqs = normalize_freq(freqs)
      
print(freqs)
      
labels = list(freqs.keys())
height = list(freqs.values())
x = range(len(height))
      
plt.bar(x=x, height=height, tick_label=labels);
plt.ylabel('Prob', fontsize=16)
plt.xlabel('P (Blue)', fontsize=16);
plt.title('Sampling Distribution of P(Blue)', fontsize=16);

PMF:

| P(Blue) | 0 | 0.5 | 1.0 |
| --- | --- | --- | --- |
| Probability | 0.3 | 0.6 | 0.1 |

The true proportion is 2/5 = 0.4.
With n=2, no possible sample proportion equals the true proportion, so every sample carries some sampling error.

Sample of 4

%matplotlib inline
    
import itertools
import collections
import random
    
import matplotlib.pyplot as plt
    
import numpy as np
import pandas as pd
    
def get_blues_prop(sample, data):
    n = len(sample)
    blues = 0
    for item in sample:
        color = data[item]
        blues += color == 'Blue'
    return blues/n
    
def normalize_freq(freqs):
    total = sum(freqs.values())
    for k,v in freqs.items():
        freqs[k] = v/total
    return freqs
      
data = {'A':'Green', 'B':'Blue', 'C':'Yellow', 'D':'Purple', 'E':'Blue'}
    
sample_size = 4
    
samples = itertools.combinations(data, sample_size)
    
props = sorted([get_blues_prop(sample, data) for sample in samples])
    
freqs = collections.Counter(props)
    
freqs = normalize_freq(freqs)
    
print(freqs)
    
labels = list(freqs.keys())
height = list(freqs.values())
x = range(len(height))
    
plt.bar(x=x, height=height, tick_label=labels);
#plt.xticks(ticks=x, labels=labels);
plt.ylabel('Prob', fontsize=16)
plt.xlabel('P (Blue)', fontsize=16);
plt.title('Sampling Distribution of P(Blue)', fontsize=16);

| P(Blue) | 0.25 | 0.5 |
| --- | --- | --- |
| Probability | 0.4 | 0.6 |

The sampling distribution of the sample proportion also shows sampling error.
The larger the sample size, the smaller the spread of the distribution.
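
To put numbers on the spread, the sketch below enumerates every sample of each size from the five-person population and computes the exact mean and SD of $\hat{p}$. It uses `fractions.Fraction` to avoid rounding; `proportion_distribution` is a hypothetical helper written for this illustration.

```python
import itertools
import math
from fractions import Fraction

colors = {'A': 'Green', 'B': 'Blue', 'C': 'Yellow', 'D': 'Purple', 'E': 'Blue'}

def proportion_distribution(data, n, target='Blue'):
    """Exact PMF of the sample proportion over all samples of size n (without replacement)."""
    samples = list(itertools.combinations(data, n))
    pmf = {}
    for sample in samples:
        p_hat = Fraction(sum(data[name] == target for name in sample), n)
        # Every sample is equally likely, so each adds 1/len(samples).
        pmf[p_hat] = pmf.get(p_hat, 0) + Fraction(1, len(samples))
    return pmf

summary = {}
for n in range(1, len(colors) + 1):
    pmf = proportion_distribution(colors, n)
    mean = sum(p * w for p, w in pmf.items())
    variance = sum((p - mean) ** 2 * w for p, w in pmf.items())
    summary[n] = (mean, math.sqrt(variance))
    print(f"n={n}: mean of p-hat={float(mean):.2f}, SD of p-hat={summary[n][1]:.3f}")
```

For every sample size the mean of $\hat{p}$ equals the true proportion 0.4, while its SD shrinks as n grows.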
