Estimation


This post covers Estimation.

• Population Mean = 37.72
• Population Standard Deviation = $\sigma$ = 16.04
• Sample Size = 35
• Sample Mean = 40
• Standard Error = $\frac{\sigma}{\sqrt{n}}$ = $\frac{16.04}{\sqrt{35}} = 2.71$
• What is the best guess for the mean of the treated population, given a sample mean of 40 and a sample size of 35?
• Point Estimate = 40
• What range is the mean of the treated population likely to fall in if the sample mean is 40?
• 68–95–99.7 rule
• Approx 68% of sample means fall within 2.711 (1 standard error) of the population mean
• Approx 95% of sample means fall within 5.423 (2 standard errors) of the population mean
• Margin of Error
• There is about a 95% chance that the sample mean 40 lies within 5.423 of the new population mean, so that mean is likely between 34.577 and 45.423
• 2*std_error = 5.423 is the Margin of Error
• Interval Estimate for Population Mean
• About 95% of sample means fall within 2 standard errors (5.423) of the population mean
• Confidence Interval Bounds
• $\mu - 2 * \frac{\sigma}{\sqrt{n}} < 40 < \mu + 2 * \frac{\sigma}{\sqrt{n}}$
• $- 2 * \frac{\sigma}{\sqrt{n}} < 40 - \mu < 2 * \frac{\sigma}{\sqrt{n}}$
• $-40 - 2 * \frac{\sigma}{\sqrt{n}} < - \mu < -40 + 2 * \frac{\sigma}{\sqrt{n}}$
• $40 + 2 * \frac{\sigma}{\sqrt{n}} > \mu > 40 - 2 * \frac{\sigma}{\sqrt{n}}$
• $40 - 2 * \frac{\sigma}{\sqrt{n}} < \mu < 40 + 2 * \frac{\sigma}{\sqrt{n}}$
• 95% confidence interval for the mean is 34.577, 45.423
• Exact Z-scores
• What are the Z-score values that bound 95% of the data
• Z Score values that bound 95% of the data are -1.96, 1.96
• 95% CI with exact Z-Scores
• 95% of sample means fall within 1.96 standard errors from the population mean
• 95% confidence interval for mean: (34.68644, 45.31356)
• Generalize Point Estimate
• Let $\bar{X}$ be mean of sample and $\mu$ be population mean
• What is point estimate?
• $\bar{X}$
• What is Interval Estimate?
• $\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},~\bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$
• Large Sample Size
• n = 250
• Standard Error = $\frac{\sigma}{\sqrt{n}}$ = $\frac{16.04}{\sqrt{250}} = 1.01$
• 95% confidence interval for mean: (38.02, 41.98)
• Bigger Sample, Smaller CI
• 95% confidence interval when n=35 is (34.6884, 45.3116)
• 95% confidence interval when n=250 is (38.0204, 41.9796)
• Z for 98% CI
• Z Score values that bound 98% of the data are -2.33, 2.33
• 98% confidence interval for mean: (37.6467, 42.3533) for sample size 250
• Critical Values of Z
• $\pm 2.33$ - critical values of Z for 98% CI
• $\pm 1.96$ - critical values of Z for 95% CI

L02 - Estimation

%matplotlib inline

import numpy as np
import pandas as pd
import scipy.stats
import itertools
import math
import matplotlib.pyplot as plt
import seaborn as sns
import random
import os

def get_data(data_file):
    # Read one numeric value per line from the file
    with open(data_file, 'r') as fp:
        data = fp.readlines()

    data = [float(d.strip()) for d in data]
    return data

score_file = './data/Klout/score.csv'
data = get_data(score_file)
print(len(data))


Point Estimate for Population Mean

sample_size = 35
sample_mean = 40
std_error = 2.71

# What would be best guess of mean of population
# for treated population
# if we have sample (n=35, mean=40)
# from previous lesson
point_estimate = sample_mean
print(point_estimate)
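
The standard error above is hard-coded as 2.71. As a minimal sketch, it can be computed from the population standard deviation and the sample size instead. The helper name `standard_error` is an assumption made for illustration and is not part of the original notebook.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean for population std `sigma` and sample size `n`."""
    return sigma / math.sqrt(n)

# Values from this post: sigma = 16.04, n = 35
se_35 = standard_error(16.04, 35)
print(f"{se_35:.2f}")  # 2.71
```

Computing the value keeps the later intervals consistent if the sample size changes.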

%matplotlib inline
import matplotlib.pyplot as plt
import random
import numpy as np

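fig, ax1 = plt.subplots()

# Histogram of the population scores
ax1.hist(data, alpha=0.5);

# Second y-axis for the density curves
ax2 = ax1.twinx()

# Draw 10 random samples of size 35 and plot the normal curve
# implied by each sample's mean and standard error.
# Loop-local names keep std_error (2.71) from being overwritten.
for i in range(10):
    sample = random.sample(data, 35)
    mean_i = np.mean(sample)
    std_error_i = np.std(sample)/np.sqrt(len(sample))

    values = np.random.normal(loc=mean_i, scale=std_error_i, size=1000)
    sns.kdeplot(values, color="red", ax=ax2);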

# What will be range of population mean
# for treated population
# if sample mean is 40

msg = f'Approx 68% of sample means fall within {std_error} (1 standard error) of the population mean'
print(msg)

msg = f'Approx 95% of sample means fall within {2*std_error} (2 standard errors) of the population mean'
print(msg)
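
The 68% and 95% figures come from the normal distribution. A short sketch, assuming the sampling distribution is normal, checks them with `scipy.stats.norm.cdf`. The function name `prob_within` is hypothetical.

```python
from scipy.stats import norm

def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return norm.cdf(k) - norm.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard errors: {prob_within(k):.4f}")
```

The three values are approximately 0.6827, 0.9545 and 0.9973, which is where the 68–95–99.7 rule gets its name.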

def plot_sd(mean, se, ax=None, color='blue'):
    # Draw 1000 values from a normal distribution and plot their density
    values = np.random.normal(loc=mean, scale=se, size=1000)
    if ax is not None:
        sns.kdeplot(values, ax=ax, color=color);
    else:
        sns.kdeplot(values, color=color);


Margin of Error

ax = plt.subplot()

plot_sd(sample_mean, std_error, ax=ax)

ax.axvline(sample_mean, color='red',
           label=sample_mean);

x1 = sample_mean - 2*std_error
x2 = sample_mean + 2*std_error

ax.axvline(x1, color='green', label=x1);
ax.axvline(x2, color='blue', label=x2);
plt.legend();

msg = "There is about a 95% chance that the sample mean"
msg += f" {sample_mean} lies within {2*std_error:.2f} of the new population mean,"
msg += f" so that mean is likely between {x1:.2f} and {x2:.2f}"
print(msg)

msg = f"2*std_error {2*std_error:.2f} is"
msg += " the Margin of Error"
print(msg)
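
The margin of error can also be written as a small function of the population standard deviation, the sample size and the multiplier. This is a sketch only; `margin_of_error` and its default `z=2.0` (the rule-of-thumb multiplier used above) are names and choices introduced here for illustration.

```python
import math

def margin_of_error(sigma, n, z=2.0):
    """Half-width of the interval: z standard errors of the mean."""
    return z * sigma / math.sqrt(n)

# Values from this post: sigma = 16.04, n = 35, sample mean = 40
moe = margin_of_error(16.04, 35)
lower, upper = 40 - moe, 40 + moe
print(f"margin of error = {moe:.3f}, interval = ({lower:.3f}, {upper:.3f})")
```

With the unrounded standard error this reproduces the 5.423 margin and the (34.577, 45.423) interval quoted in the summary.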


Interval Estimate for Population Mean

• $\mu - 2 * \frac{\sigma}{\sqrt{n}},~~\mu + 2 * \frac{\sigma}{\sqrt{n}}$
msg = f"About 95% of sample means fall within 2 standard errors ({2 * std_error}) of the population mean"
print(msg)


Confidence Interval Bounds

• $\mu - 2 * \frac{\sigma}{\sqrt{n}} < 40 < \mu + 2 * \frac{\sigma}{\sqrt{n}}$
• $- 2 * \frac{\sigma}{\sqrt{n}} < 40 - \mu < 2 * \frac{\sigma}{\sqrt{n}}$
• $-40 - 2 * \frac{\sigma}{\sqrt{n}} < - \mu < -40 + 2 * \frac{\sigma}{\sqrt{n}}$
• $40 + 2 * \frac{\sigma}{\sqrt{n}} > \mu > 40 - 2 * \frac{\sigma}{\sqrt{n}}$
• $40 - 2 * \frac{\sigma}{\sqrt{n}} < \mu < 40 + 2 * \frac{\sigma}{\sqrt{n}}$
print(f'95% confidence interval for the mean is {sample_mean - 2 * std_error}, {sample_mean + 2 * std_error}')

ax = plt.subplot()

plot_sd(sample_mean - 2 * std_error, std_error,
        ax=ax, color='blue')

plot_sd(sample_mean + 2 * std_error, std_error,
        ax=ax, color='green')

ax.axvline(sample_mean - 2 * std_error,
           label=sample_mean - 2 * std_error,
           color='blue');

ax.axvline(sample_mean + 2 * std_error,
           label=sample_mean + 2 * std_error,
           color='green');

ax.axvline(sample_mean, color='red');

plt.legend();


Exact Z-scores

# What are the Z-score values that bound
# 95% of the data
ax = plt.subplot()

sample_mean_ = 0
std_error_ = 1

plot_sd(sample_mean_, std_error_, ax=ax)

ax.axvline(sample_mean_ - 2* std_error_,
           color='red',
           label=sample_mean_ - 2* std_error_);
ax.axvline(sample_mean_ + 2* std_error_,
           color='green',
           label=sample_mean_ + 2* std_error_);

area1 = 2.5/100 # Less than red line
z1 = scipy.stats.norm.ppf(area1)

area2 =  area1 + .95
z2 = scipy.stats.norm.ppf(area2)

msg = f"Z Score values that bound 95%"
msg += f" of the data are {z1:.2f}, {z2:.2f}"

print(msg)
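
The same lookup generalizes to any confidence level: split the leftover area evenly between the two tails and ask `scipy.stats.norm.ppf` for the upper bound. The sketch below assumes a two-sided interval; `z_critical` is a hypothetical helper name.

```python
from scipy.stats import norm

def z_critical(confidence):
    """Two-sided critical z-value for a confidence level such as 0.95."""
    tail = (1 - confidence) / 2   # area left in each tail
    return norm.ppf(1 - tail)     # z-value with that much area above it

for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}: +/-{z_critical(level):.2f}")
```

By symmetry the lower bound is the negative of the returned value.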


95% CI with exact Z-Scores

• 95% of sample means fall within 1.96 standard errors from the population mean
CI95_lower = sample_mean - 1.96 * std_error
CI95_upper = sample_mean + 1.96 * std_error
CI95 = CI95_lower, CI95_upper

print(f'95% confidence interval for mean: {CI95}')
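
One way to see what "95% confidence" means is to simulate many samples and count how often the interval around each sample mean contains the true mean. The sketch below assumes a synthetic, normally distributed population with the mean and standard deviation quoted in this post; it does not use the Klout data, and the seed and number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 37.72, 16.04, 35, 10_000

# Draw `trials` samples of size n and compute each sample mean
sample_means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)

# Build the 95% interval around every sample mean
se = sigma / np.sqrt(n)
lower = sample_means - 1.96 * se
upper = sample_means + 1.96 * se

# Fraction of intervals that contain the true population mean
coverage = np.mean((lower < mu) & (mu < upper))
print(f"coverage = {coverage:.3f}")
```

The printed fraction should land close to 0.95, with small run-to-run variation if the seed is changed.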


Generalize Point Estimate

• Let $\bar{X}$ be mean of sample and $\mu$ be population mean
• What is point estimate?
• $\bar{X}$
• What is Interval Estimate?
• $\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},~\bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$
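
The general interval estimate can be wrapped in one function. This is a sketch under the assumptions used throughout the post (known population standard deviation, normal sampling distribution); the name `confidence_interval` and its parameters are introduced here for illustration.

```python
import math
from scipy.stats import norm

def confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Interval x_bar +/- z * sigma / sqrt(n) for a known population std."""
    z = norm.ppf(1 - (1 - confidence) / 2)  # two-sided critical value
    moe = z * sigma / math.sqrt(n)          # margin of error
    return x_bar - moe, x_bar + moe

ci_low, ci_high = confidence_interval(40, 16.04, 35)
print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
```

Passing `confidence=0.98` gives the wider 98% interval without looking up the critical value by hand.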

CI Range for Larger Sample Size

• sample_size = 35
• sample_mean = 40
• std_error = 2.71
• 95% confidence interval is (34.6884, 45.3116)
# 95% CI when the sample is larger
# (n=250, mean=40)

sample_size = 250
sample_mean = 40
std_error = 1.01

CI95_lower = sample_mean - 1.96 * std_error
CI95_upper = sample_mean + 1.96 * std_error
CI95 = CI95_lower, CI95_upper

print(f'95% confidence interval for mean: {CI95}')


Bigger Sample, Smaller CI

• 95% confidence interval when n=35 is (34.6884, 45.3116)
• 95% confidence interval when n=250 is (38.0204, 41.9796)
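
The interval narrows because the standard error shrinks with the square root of the sample size. A small sketch, using the population standard deviation from this post and a hypothetical helper `ci_width`, makes the relationship explicit.

```python
import math

def ci_width(sigma, n, z=1.96):
    """Full width of the interval: twice the margin of error."""
    return 2 * z * sigma / math.sqrt(n)

sigma = 16.04
for n in (35, 250, 1000):
    print(f"n={n:5d}  width={ci_width(sigma, n):.2f}")
```

Because of the square root, quadrupling the sample size only halves the width.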

Treatment Effect

• occurs when the intervention affects the population mean
• When the sample mean is far on the tails of the sampling distribution, and therefore unlikely to have occurred by chance, there is evidence for a treatment effect
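
How far into the tail a sample mean falls can be measured with a z-score and the area beyond it. The sketch below assumes a one-sided (upper-tail) question and applies it to the numbers from the start of this post; `upper_tail_probability` is a hypothetical helper name.

```python
import math
from scipy.stats import norm

def upper_tail_probability(x_bar, mu, sigma, n):
    """z-score of a sample mean and the probability of a mean at least that large."""
    se = sigma / math.sqrt(n)
    z = (x_bar - mu) / se
    return z, norm.sf(z)  # sf(z) = 1 - cdf(z)

# Population mean 37.72, sigma 16.04, sample of 35 with mean 40
z, p = upper_tail_probability(40, 37.72, 16.04, 35)
print(f"z = {z:.2f}, upper-tail probability = {p:.2f}")
```

A small tail probability means the sample mean would rarely occur by chance under the untreated population; a large one means it would not be unusual.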

Z for 98% CI

# What are the Z-score values that bound
# 98% of the data

ax = plt.subplot()

sample_mean_ = 0
std_error_ = 1

plot_sd(sample_mean_, std_error_, ax=ax)

ax.axvline(sample_mean_ - 3* std_error_,
           color='red',
           label=sample_mean_ - 3* std_error_);

ax.axvline(sample_mean_ + 3* std_error_,
           color='green',
           label=sample_mean_ + 3* std_error_);

area1 = 1/100 # Lower-tail area (1%)
z1 = scipy.stats.norm.ppf(area1)

area2 =  area1 + .98
z2 = scipy.stats.norm.ppf(area2)

msg = "Z Score values that bound 98% of the data"
msg += f" are {z1:.2f}, {z2:.2f}"
print(msg)


98% CI

# 98% CI for the larger sample
# (n=250, mean=40)

sample_size = 250
sample_mean = 40
std_error = 1.01

CI98_lower = sample_mean - 2.33 * std_error
CI98_upper = sample_mean + 2.33 * std_error
CI98 = CI98_lower, CI98_upper

print(f'98% confidence interval for mean: {CI98}')


Critical Values of Z

• $\pm 2.33$ - critical values of Z for 98% CI
• $\pm 1.96$ - critical values of Z for 95% CI

Engagement Ratio

data_file = './data/EngagementRatio/EngagementRatio.csv'
data = get_data(data_file)
n = len(data)
print(n)

# Population Parameters
print(f'Population Mean={np.mean(data):.3f} and Standard Deviation = {np.std(data):.3f}')

# Sample of 20 Students with mean 0.13
sample_size = 20
X_bar = 0.13

# Point Estimate
print(X_bar)

# Interval Estimate
std_error = np.std(data)/np.sqrt(sample_size)
print(f'{std_error:.3f}')

ll = X_bar - 1.96 * std_error
ul = X_bar + 1.96 * std_error

print(f'95% CI Interval Estimate is {ll:.3f}, {ul:.3f}')

msg = f'1.96*std_error {1.96*std_error:.3f} is the Margin of Error'
print(msg)


Measure of Engagement and Learning

# Population Parameters

# Measure of Engagement
mu_e = 7.5
sigma_e = .64

# Measure of Learning
mu_l = 8.2
sigma_l = .73


Experiment - did it increase?

# Experiment on a sample of 20
sample_size = 20
x_bar_e = 8.94
x_bar_l = 8.35

# Measure of Engagement - Sampling Distribution
mean_e = mu_e
std_error_e = sigma_e/np.sqrt(sample_size)
print(f'Mean {mean_e}, Standard Error {std_error_e:.3f}')

# Measure of Learning - Sampling Distribution
mean_l = mu_l
std_error_l = sigma_l/np.sqrt(sample_size)
print(f'Mean {mean_l}, Standard Error {std_error_l:.3f}')

# Where does the sample fall on the sampling distribution
z_e = (x_bar_e - mu_e) / std_error_e
z_l = (x_bar_l - mu_l) / std_error_l

print(f'z_e={z_e:.2f}, z_l={z_l:.2f}')

area_e = 1 - scipy.stats.norm.cdf(z_e)
area_l = 1 - scipy.stats.norm.cdf(z_l)

print(f'prob_e={area_e:.2f}, prob_l = {area_l:.2f}')

print('Experiment seems to have had an effect on engagement, but not learning')



