Sampling Distributions
Published:
This post covers Sampling Distributions.
Inferential Statistics
 Use sample statistic to estimate population parameter.
 If we obtain a random sample and compute sample statistic, which is a random variable, however, population parameter is fixed
 If the statistic is random variable, can we find its distribution? mean and standard deviation?
Sampling Distribution
sampling distribution of statistic is a probability distribution
Example: Pumpkin Weights
Population
Pumpkin A B C D E F Weight (pounds) 19 14 15 9 10 17 Population Mean = 14
Mean of all Samples of size 2 (without replacement)
import itertools data = {'A':19, 'B':14, 'C':15, 'D':9, 'E':10, 'F':17} sample_size = 2 pumpkins_samples = itertools.combinations(data, sample_size) samples_mean = [] for sample in pumpkins_samples: sample_weight = [] for i in range(sample_size): sample_weight.append(data[sample[i]]) sample_mean = np.mean(sample_weight) samples_mean.append(sample_mean) samples_mean = sorted(samples_mean) samples_mean_freq = collections.Counter(samples_mean) total = sum(samples_mean_freq.values(), 0.0) for k, v in samples_mean_freq.items(): samples_mean_freq[k] = v/total x = list(samples_mean_freq.keys()) y = list(samples_mean_freq.values()) plt.bar(x=list(range(len(y))), height=y); plt.xticks(ticks=list(range(len(y))), labels=x); plt.title('Sampling Distribution', fontsize=14) plt.xlabel('sample mean', fontsize=14) plt.ylabel('prob', fontsize=14);
Population Mean is 14 and there is only one case where sample mean equals population mean
Thus, we have error when we use sample mean to estimate population mean
Compute mean of all sample means
print(np.mean(samples_mean)) # 14
Even though individual sample mean has error but expected value is right, exactly the population mean.
Overall average of sample mean is exactly the population mean if experiment is repeated
Sample Size = 5
data = {'A':19, 'B':14, 'C':15, 'D':9, 'E':10, 'F':17} sample_size = 5 pumpkins_samples = itertools.combinations(data, sample_size) samples_mean = [] for sample in pumpkins_samples: sample_weight = [] for i in range(sample_size): sample_weight.append(data[sample[i]]) sample_mean = np.mean(sample_weight) samples_mean.append(sample_mean) samples_mean = sorted(samples_mean) samples_mean_freq = collections.Counter(samples_mean) total = sum(samples_mean_freq.values(), 0.0) for k, v in samples_mean_freq.items(): samples_mean_freq[k] = v/total x = list(samples_mean_freq.keys()) y = list(samples_mean_freq.values()) plt.bar(x=list(range(len(y))), height=y); plt.xticks(ticks=list(range(len(y))), labels=x); plt.title('Sampling Distribution', fontsize=14) plt.xlabel('sample mean', fontsize=14) plt.ylabel('prob', fontsize=14);
Compute mean of all sample means
print(np.mean(samples_mean)) # 14
Sample Means with size 2 and 5
plt.scatter(x=sample_means_2, y=[2]*len(sample_means_2)); plt.scatter(x=sample_means_5, y=[5]*len(sample_means_5)); plt.axvline(x=14); plt.ylim(0,6); plt.title('Sample Means for size 2 and 5', fontsize=14) plt.xlabel('mean', fontsize=14) plt.ylabel('sample size', fontsize=14)
 Sample mean to estimate population mean involves sampling error. However, the error on average is smaller with large sample size (n=5) than with lesser sample size (n=2)
Sampling Error
Error resulting from using a sample characteristic to estimate Population characteristic
Sample means cluster closely to population means when sample size increases
Possible Sampling error decreases as sample size increases
What happens when we don’t have population to sample from?
 Sampling distribution of the sample mean
 Population is normally distributed
 Population is not normally distributed
Population is Normal
Population: $Mean=\mu ~and~ SD=\sigma$
Sampling Distribution of sample mean will also be normal irrespective of sample size
If population is large compared to sample size or sampling is done with replacement
 sampling distribution has mean $ \mu $ and SD $ \frac{\sigma}{\sqrt{n}} $
Standard Error term is used for standard deviation of a statistic
Standard Error (Deviation) $SE(\bar{X}) = SD(\bar{X}) = \frac{\sigma}{\sqrt{n}}$
Sample Mean $ \mu $
ZScore of Sample Mean $ z = \frac{\bar{x}\mu}{\frac{\sigma}{\sqrt{n}}} $
Example: Speedboat Engines
The engines made by Ford for speedboats have an average power of 220 horsepower (HP) and standard deviation of 15 HP. You can assume the distribution of power follows a normal distribution.
Consumer Reports® is testing the engines and will dispute the company’s claim if the sample mean is less than 215 HP. If they take a sample of 4 engines, what is the probability the mean is less than 215?
Find $ P(\bar{X} < 215) $
Since population is normal distribution, implies, $ \bar{X} $ has normal distribution with mean 220 and SD $ \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{4}} = 7.5$
$ P(\bar{X} < 215) = P(Z < \frac{215220}{7.5}) = P(Z < 0.67) = 0.2514 $
Thus, probability that the mean is less than 215 HP is 25.14%.
https://online.stat.psu.edu/stat500/sites/stat500/files/inlineimages/500l4ex4.1.png
If Consumer Reports® samples 100 engines, what is the probability that the sample mean will be less than 215?
$\mu=220 ~ SD=\frac{15}{\sqrt{100}}=1.5$
$ P(\bar{X} < 215) = P(Z < \frac{215220}{1.5}) = P(Z < 3.33) = 0.00043 $
Probability is 0.043%
Population is not Normal
Central Limit Theorem
 For a large sample size, $ \bar{x} $ is approximately normally distributed, regardless of the distribution of the population one samples from
 If population has mean $ \mu$ and SD $ \sigma $ then $\bar{x}$ has mean $\mu$ and SD $\frac{\sigma}{\sqrt{n}}$
CLT applies to sample mean from any distribution (left skewed or right skewed)
As long as sample size is large, the distribution of sample means will follow approximate Normal Distribution
for questions large means n>30
Central Limit Theorem demonstration

 Sample mean is not normal if population is skewed and sample size is small
 Replication of experiment by 10,000 is good
 If population is normal then sample mean is normal even if n=2
 If population is skewed, distribution of sample mean looks normal when $n$ gets larger
 In all cases:
 Mean of Sample Mean $\equiv$ Population Mean
 Standard Error of sample mean $ \equiv \frac{\sigma}{\sqrt{n}} $

Sampling Distribution of the Sample Mean
Sampling distribution of the sample mean can be defined with help from Central Limit Theorem:
Mean of Sampling Distribution of Sample Mean $\equiv$ Population Mean, $\mu$
Standard Error (Deviation) of Sampling Distribution of Sample Mean, $ \equiv \frac{\sigma}{\sqrt{n}} $
Normal if:
 Population distribution is normal or
 Sample size is large ($ n>30 $)
Example: Weights of Baby Giraffes
The weights of baby giraffes are known to have a mean of 125 pounds and a standard deviation of 15 pounds.
If we obtained a random sample of 40 baby giraffes,
 what is the probability that the sample mean will be between 120 and 130 pounds?
 what is the 75th percentile of the sample means of size n=40?
Solution:
Population not known if normal but n > 40 implies central limit theorem can be applied
Sampling distribution of sample mean will have $ \mu = 125$ and standard error $ \sigma = \frac{15}{\sqrt{40}} = 2.37170825$
$ P(120<\bar{X}<130) = P( \frac{120125}{2.372} < Z < \frac{130125}{2.372}) = P(2.108< Z <2.108)$
$ = P(Z <2.108)  P(Z <2.108) = 0.9826  0.0174 = 0.9652 $
$ 96.52\% $
75th Percentile is $ P(Z<a) = 0.75 \implies a = .6745 $
$ .6745 = \frac{\bar{X}125}{2.372} \implies 126.6 $
75th percentile of all sample means of size n=40 is 126.6
Sampling Distribution of the Sample Proportion
Notations
 $ p $ is the population proportion. It is a fixed value
 $ n $ is the size of the random sample
 $ \hat{p} $ is the sample proportion. It varies based on the sample
Example: Favorite Color
Name A B C D E Color Green Blue Yellow Purple Blue Proportion who prefer Blue from Population
 $ p = \frac{2}{5} = .4 $
Proportion who prefer Blue based on Sample (Population unknown)
Sample of 2
%matplotlib inline import itertools import collections import random import matplotlib.pyplot as plt import numpy as np import pandas as pd def get_blues_prop(sample, data): n = len(sample) blues = 0 for item in sample: color = data[item] blues += color == 'Blue' return blues/n def normalize_freq(freqs): total = sum(freqs.values()) for k,v in freqs.items(): freqs[k] = v/total return freqs data = {'A':'Green', 'B':'Blue', 'C':'Yellow', 'D':'Purple', 'E':'Blue'} sample_size = 2 samples = itertools.combinations(data, sample_size) props = sorted([get_blues_prop(sample, data) for sample in samples]) freqs = collections.Counter(props) freqs = normalize_freq(freqs) print(freqs) labels = list(freqs.keys()) height = list(freqs.values()) x = range(len(height)) plt.bar(x=x, height=height); plt.xticks(ticks=x, labels=labels); plt.ylabel('Prob', fontsize=16) plt.xlabel('P (Blue)', fontsize=16); plt.title('Sampling Distribution of P(Blue)', fontsize=16);
PMF:
P(Blue) 0 0.5 1.0 Probability 0.3 0.6 0.1  True Proportion is 2/5 = 0.4
 n=2, doesn’t imply that sampling proportion is equal to true proportion
Sample of 4
P(Blue) 0.25 0.5 Probability 0.4 0.6
Sampling Distribution of sample proportion will also have sampling error
Larger the sample size, smaller the spead of distribution
Normal Approximation to the Binomial
 How to apply Central Limit Theorem to find sampling distribution of the sample proportion