Estimation

Published: April 08, 2021

This post covers Estimation.

Population Mean = 37.72
Population Standard Deviation = $\sigma$ = 16.04
Sample Size = 35
Sample Mean = 40
Standard Error = $\frac{\sigma}{\sqrt{n}}$ = $\frac{16.04}{\sqrt{35}} = 2.71$
What is the best guess for the mean of the treated population, given a sample of size 35 with mean 40?
- Point Estimate = 40
What range is plausible for the mean of the treated population if the sample mean is 40?
- 68–95–99.7 rule
- Approx 68% of sample means fall within 2.711 of 40
- Approx 95% of sample means fall within 5.423 of 40
Margin of Error
- About 95% of sample means fall within 2 standard errors of the population mean, so the new population mean is likely between 34.577 and 45.423
- 2*std_error 5.423 is the Margin of Error
Interval Estimate for Population Mean
- 95% of sample means will be within 2 standard errors (5.423) of the population mean, so the interval estimate is 40 ± 5.423
Confidence Interval Bounds
- $ \mu - 2 * \frac{\sigma}{\sqrt{n}} < 40 < \mu + 2 * \frac{\sigma}{\sqrt{n}} $
- $ - 2 * \frac{\sigma}{\sqrt{n}} < 40 - \mu < 2 * \frac{\sigma}{\sqrt{n}} $
- $ -40 - 2 * \frac{\sigma}{\sqrt{n}} < - \mu < -40 + 2 * \frac{\sigma}{\sqrt{n}} $
- $ 40 + 2 * \frac{\sigma}{\sqrt{n}} > \mu > 40 - 2 * \frac{\sigma}{\sqrt{n}} $
- $ 40 - 2 * \frac{\sigma}{\sqrt{n}} < \mu < 40 + 2 * \frac{\sigma}{\sqrt{n}} $
- 95% confidence interval for the mean is 34.577, 45.423
Exact Z-scores
- What are the Z-score values that bound 95% of the data
  - Z Score values that bound 95% of the data are -1.96, 1.96
- 95% CI with exact Z-Scores
  - 95% of sample means fall within 1.96 standard errors from the population mean
  - 95% confidence interval for mean: (34.68644, 45.31356)
- Generalize Point Estimate
  - Let $\bar{X}$ be mean of sample and $\mu$ be population mean
    - What is point estimate?
      - $\bar{X}$
    - What is Interval Estimate?
      - $\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},~\bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$
  - Large Sample Size
    - n = 250
    - Standard Error = $\frac{\sigma}{\sqrt{n}}$ = $\frac{16.04}{\sqrt{250}} = 1.01 $
  - 95% confidence interval for mean: (38.02, 41.98)
- Bigger Sample, Smaller CI
  - 95% confidence interval when n=35 is (34.6884, 45.3116)
  - 95% confidence interval when n=250 is (38.0204, 41.9796)
- Z for 98% CI
  - Z Score values that bound 98% of the data are -2.33, 2.33
- 98% confidence interval for mean: (37.6467, 42.3533) for sample size 250
- Critical Values of Z
  - $\pm 2.33$ - critical values of Z for 98% CI
  - $\pm 1.96$ - critical values of Z for 95% CI

L02 - Estimation

%matplotlib inline

import numpy as np
import pandas as pd
import scipy.stats  # import the stats submodule explicitly; it is used as scipy.stats.norm below
import itertools
import math
import matplotlib.pyplot as plt
import seaborn as sns
import random
import os

def get_data(data_file):
    with open(data_file, 'r') as fp:
        data = fp.readlines()

    data = [float(d.strip()) for d in data]
    return data

score_file = './data/Klout/score.csv'
data = get_data(score_file)
print(len(data))

Point Estimate for Population Mean

sample_size = 35
sample_mean = 40
std_error = 2.71

# Best guess for the mean of the treated population,
# given the sample from the previous lesson
# (n=35, mean=40)
point_estimate = sample_mean
print(point_estimate)

fig, ax1 = plt.subplots()

# Histogram of the full population of scores
ax1.hist(data, alpha=0.5);

# One shared secondary axis for all density curves
ax2 = ax1.twinx()

for i in range(10):
    # Draw a random sample of 35 scores and estimate its standard error.
    # Separate names are used so that sample_mean and std_error
    # defined above are not overwritten.
    sample = random.sample(data, 35)
    sample_mean_i = np.mean(sample)
    std_error_i = np.std(sample) / np.sqrt(len(sample))

    # Approximate sampling distribution implied by this sample
    values = np.random.normal(loc=sample_mean_i, scale=std_error_i, size=1000)
    sns.kdeplot(values, color="red", ax=ax2);

# What will be range of population mean
# for treated population
# if sample mean is 40

msg = f'Approx 68% of sample means fall within {std_error} of {sample_mean}'
print(msg)

msg = f'Approx 95% of sample means fall within {2*std_error} of {sample_mean}'
print(msg)
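The 68% and 95% figures can be checked empirically. The sketch below is only an illustration: it assumes a synthetic normal population with the mean and standard deviation quoted in this post (it does not read the Klout file), and it uses the standard library's `random` module. The function name `fraction_within` is hypothetical.

```python
import math
import random

# Assumption: a synthetic normal population stands in for the real scores,
# using the population parameters quoted in this post.
MU, SIGMA, N = 37.72, 16.04, 35
STD_ERROR = SIGMA / math.sqrt(N)

def fraction_within(k, trials=5000, seed=0):
    """Fraction of simulated sample means within k standard errors of MU."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Mean of one simulated sample of size N
        sample_mean = sum(rng.gauss(MU, SIGMA) for _ in range(N)) / N
        if abs(sample_mean - MU) <= k * STD_ERROR:
            hits += 1
    return hits / trials

within_1 = fraction_within(1)
within_2 = fraction_within(2)
print(f"within 1 SE: {within_1:.3f}, within 2 SE: {within_2:.3f}")
```

With enough trials the two fractions should land close to 0.68 and 0.95, which is what the 68–95–99.7 rule predicts for a normal sampling distribution.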

def plot_sd(mean, se, ax=None, color='blue'):
    # Draw 1000 values from a normal distribution with the given mean and standard error
    values = np.random.normal(loc=mean, scale=se, size=1000)
    if ax:
        sns.kdeplot(values, ax=ax, color=color);
    else:
        sns.kdeplot(values, color=color);

Margin of Error

ax = plt.subplot()

plot_sd(sample_mean, std_error, ax=ax)

ax.axvline(sample_mean, color='red', 
           label=sample_mean);

x1 = sample_mean - 2*std_error
x2 = sample_mean + 2*std_error

ax.axvline(x1, color='green', label=x1);
ax.axvline(x2, color='blue', label=x2);
plt.legend();


msg = "About 95% of sample means fall within 2 standard errors"
msg += " of the population mean, so the new population mean"
msg += f" is likely between {x1} and {x2}"
print(msg)

msg = f"2*std_error {2*std_error} is"
msg += " Margin of Error"
print(msg)

Interval Estimate for Population Mean

$ \mu - 2 * \frac{\sigma}{\sqrt{n}},~~\mu + 2 * \frac{\sigma}{\sqrt{n}} $

msg = f"95% of sample means will be within 2 standard errors ({2 * std_error}) of the population mean, so the interval estimate is {sample_mean} ± {2 * std_error}"
print(msg)

Confidence Interval Bounds

- $ \mu - 2 * \frac{\sigma}{\sqrt{n}} < 40 < \mu + 2 * \frac{\sigma}{\sqrt{n}} $
- $ - 2 * \frac{\sigma}{\sqrt{n}} < 40 - \mu < 2 * \frac{\sigma}{\sqrt{n}} $
- $ -40 - 2 * \frac{\sigma}{\sqrt{n}} < - \mu < -40 + 2 * \frac{\sigma}{\sqrt{n}} $
- $ 40 + 2 * \frac{\sigma}{\sqrt{n}} > \mu > 40 - 2 * \frac{\sigma}{\sqrt{n}} $
- $ 40 - 2 * \frac{\sigma}{\sqrt{n}} < \mu < 40 + 2 * \frac{\sigma}{\sqrt{n}} $

print(f'95% confidence interval for the mean is {sample_mean - 2 * std_error}, {sample_mean + 2 * std_error}')

ax = plt.subplot()

plot_sd(sample_mean - 2 * std_error, std_error, 
        ax=ax, color='blue')

plot_sd(sample_mean + 2 * std_error, std_error, 
        ax=ax, color='green')

ax.axvline(sample_mean - 2 * std_error, 
           label=sample_mean - 2 * std_error, 
           color='blue');

ax.axvline(sample_mean + 2 * std_error, 
           label=sample_mean + 2 * std_error, 
           color='green');

ax.axvline(sample_mean, color='red');

plt.legend();

Exact Z-scores

# What are the Z-score values that bound 
# 95% of the data
ax = plt.subplot()

sample_mean_ = 0
std_error_ = 1

plot_sd(sample_mean_, std_error_, ax=ax)

ax.axvline(sample_mean_ - 2* std_error_, 
           color='red', 
           label=sample_mean_ - 2* std_error_);
ax.axvline(sample_mean_ + 2* std_error_, 
           color='green', 
           label=sample_mean_ + 2* std_error_);

area1 = 2.5/100 # Area to the left of the red line
z1 = scipy.stats.norm.ppf(area1)

area2 =  area1 + .95
z2 = scipy.stats.norm.ppf(area2)

msg = f"Z Score values that bound 95%"
msg += f" of the data are {z1:.2f}, {z2:.2f}"

print(msg)

95% CI with exact Z-Scores

95% of sample means fall within 1.96 standard errors from the population mean

CI95_lower = sample_mean - 1.96 * std_error
CI95_upper = sample_mean + 1.96 * std_error
CI95 = CI95_lower, CI95_upper

print(f'95% confidence interval for mean: {CI95}')

Generalize Point Estimate

Let $\bar{X}$ be mean of sample and $\mu$ be population mean
- What is point estimate?
  - $\bar{X}$
- What is Interval Estimate?
  - $\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},~\bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$
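As a minimal sketch of this generalization, the interval estimate can be wrapped in one function. The name `z_interval` and its parameters are illustrative, not from the original notebook, and the standard library's `statistics.NormalDist` is swapped in for `scipy.stats.norm` so the snippet has no third-party dependencies. It assumes the population standard deviation is known.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Return (lower, upper) of a z-based confidence interval with known sigma."""
    std_error = sigma / math.sqrt(n)
    # Two-sided critical value: leave (1 - confidence) / 2 in each tail.
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_crit * std_error
    return x_bar - margin, x_bar + margin

sigma = 16.04  # population standard deviation from this post
for n, conf in [(35, 0.95), (250, 0.95), (250, 0.98)]:
    lower, upper = z_interval(40, sigma, n, conf)
    print(f"n={n}, {conf:.0%} CI: ({lower:.2f}, {upper:.2f})")
```

Because the standard error and critical value are not rounded here, the bounds can differ in the second decimal from figures computed with a rounded standard error such as 2.71 or 1.01.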

CI Range for Larger Sample Size

sample_size = 35
sample_mean = 40
std_error = 2.71
95% confidence interval is (34.6884, 45.3116)

# Same sample mean as before, but with a larger sample (n=250),
# so the standard error shrinks to 16.04 / sqrt(250), about 1.01

sample_size = 250
sample_mean = 40
std_error = 1.01

CI95_lower = sample_mean - 1.96 * std_error
CI95_upper = sample_mean + 1.96 * std_error
CI95 = CI95_lower, CI95_upper

print(f'95% confidence interval for mean: {CI95}')

Bigger Sample, Smaller CI

95% confidence interval when n=35 is (34.6884, 45.3116)
95% confidence interval when n=250 is (38.0204, 41.9796)
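The shrinkage follows from the $\frac{\sigma}{\sqrt{n}}$ term: the full width of the interval is $2 \cdot 1.96 \cdot \frac{\sigma}{\sqrt{n}}$, so quadrupling the sample size halves the width. A small sketch, where the helper name `ci_width` and the extra sample sizes 140 and 1000 are chosen only for illustration:

```python
import math

SIGMA = 16.04   # population standard deviation from this post
Z_95 = 1.96     # critical value for a 95% interval

def ci_width(n, sigma=SIGMA, z=Z_95):
    """Full width of the z-interval: 2 * z * sigma / sqrt(n)."""
    return 2 * z * sigma / math.sqrt(n)

# 140 = 4 * 35 and 1000 = 4 * 250, so each pair shows the width halving
for n in (35, 140, 250, 1000):
    print(f"n={n:>4}: width = {ci_width(n):.2f}")
```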

Treatment Effect

A treatment effect occurs when the intervention changes the population mean.
When the sample mean falls far out in the tails of the sampling distribution, and is therefore unlikely to have occurred by chance, there is evidence of a treatment effect.
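To make "far on the tails" concrete, one can locate the sample mean on the sampling distribution with a z-score and look at the tail area beyond it. The sketch below reuses the Klout numbers from earlier in this post and, as an assumption, uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm`.

```python
import math
from statistics import NormalDist

mu, sigma = 37.72, 16.04      # population parameters from this post
n, x_bar = 35, 40             # treated sample

std_error = sigma / math.sqrt(n)
z = (x_bar - mu) / std_error
# One-sided probability of a sample mean at least this large by chance alone
p_upper = 1 - NormalDist().cdf(z)
print(f"z = {z:.2f}, P(sample mean >= {x_bar}) = {p_upper:.2f}")
```

Here the sample mean sits less than one standard error above the population mean, so by this measure alone it would be weak evidence of a treatment effect.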

Z for 98% CI

# What are the Z-score values that bound
# 98% of the data

ax = plt.subplot()

sample_mean_ = 0
std_error_ = 1

plot_sd(sample_mean_, std_error_, ax=ax)

ax.axvline(sample_mean_ - 3* std_error_, 
           color='red', 
           label=sample_mean_ - 3* std_error_);

ax.axvline(sample_mean_ + 3* std_error_, 
           color='green', 
           label=sample_mean_ + 3* std_error_);

area1 = 1/100 # Area in the lower tail for a 98% interval
z1 = scipy.stats.norm.ppf(area1)

area2 =  area1 + .98
z2 = scipy.stats.norm.ppf(area2)

msg = "Z Score values that bound 98% of the data"
msg += f" are {z1:.2f}, {z2:.2f}"
print(msg)

98% CI

# 98% interval for the larger sample (n=250, mean=40),
# using the critical value 2.33 found above

sample_size = 250
sample_mean = 40
std_error = 1.01

CI98_lower = sample_mean - 2.33 * std_error
CI98_upper = sample_mean + 2.33 * std_error
CI98 = CI98_lower, CI98_upper

print(f'98% confidence interval for mean: {CI98}')

Critical Values of Z

$\pm 2.33$ - critical values of Z for 98% CI
$\pm 1.96$ - critical values of Z for 95% CI
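Critical values for other confidence levels follow the same recipe: put half of the leftover probability in each tail and invert the standard normal CDF. A minimal sketch, where `critical_z` is a hypothetical helper name and `statistics.NormalDist().inv_cdf` plays the role of `scipy.stats.norm.ppf`:

```python
from statistics import NormalDist

def critical_z(confidence):
    """Positive z such that the central `confidence` share of N(0, 1) lies in (-z, z)."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

for confidence in (0.90, 0.95, 0.98, 0.99):
    print(f"{confidence:.0%} CI: z = ±{critical_z(confidence):.2f}")
```

Higher confidence pushes the critical value outward, which widens the interval for a fixed standard error.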

Engagement Ratio

data_file = './data/EngagementRatio/EngagementRatio.csv'
data = get_data(data_file)
n = len(data)
print(n)

# Population Parameters
print(f'Population Mean={np.mean(data):.3f} and Standard Deviation = {np.std(data):.3f}')

# Sample of 20 Students with mean 0.13
sample_size = 20
X_bar = 0.13

# Point Estimate
print(X_bar)

# Interval Estimate
std_error = np.std(data)/np.sqrt(sample_size)
print(f'{std_error:.3f}')

ll = X_bar - 1.96 * std_error 
ul = X_bar + 1.96 * std_error

print(f'95% CI Interval Estimate is {ll:.3f}, {ul:.3f}')

msg = f'1.96*std_error {1.96*std_error:.3f} is the Margin of Error'
print(msg)

Measure of Engagement and Learning

# Population Parameters

# Measure of Engagement
mu_e = 7.5
sigma_e = .64

# Measure of Learning
mu_l = 8.2
sigma_l = .73

Experiment - did it increase?

# Experiment on a sample of 20
sample_size = 20
x_bar_e = 8.94
x_bar_l = 8.35

# Measure of Engagement - Sampling Distribution
mean_e = mu_e
std_error_e = sigma_e/np.sqrt(sample_size)
print(f'Mean {mean_e}, Standard Error {std_error_e:.3f}')

# Measure of Learning - Sampling Distribution
mean_l = mu_l
std_error_l = sigma_l/np.sqrt(sample_size)
print(f'Mean {mean_l}, Standard Error {std_error_l:.3f}')

# Where does each sample mean fall on its sampling distribution?
z_e = (x_bar_e - mu_e) / std_error_e
z_l = (x_bar_l - mu_l) / std_error_l

print(f'z_e={z_e:.2f}, z_l={z_l:.2f}')

area_e = 1 - scipy.stats.norm.cdf(z_e)
area_l = 1 - scipy.stats.norm.cdf(z_l)

print(f'prob_e={area_e:.2f}, prob_l = {area_l:.2f}')

print('Experiment seems to have had an effect on engagement, but not learning')
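The same conclusion can be cross-checked with interval estimates: build a 95% interval around each sample mean and see whether the original population mean falls inside it. This is a sketch under the assumption that the population standard deviations are unchanged by the experiment. The helper `ci_95` and the `measures` dictionary are illustrative names.

```python
import math

def ci_95(x_bar, sigma, n, z=1.96):
    """95% z-interval around a sample mean, assuming known population sigma."""
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

n = 20
measures = {
    # name: (population mean, population sigma, sample mean) from this post
    "engagement": (7.5, 0.64, 8.94),
    "learning": (8.2, 0.73, 8.35),
}

for name, (mu, sigma, x_bar) in measures.items():
    lower, upper = ci_95(x_bar, sigma, n)
    inside = lower <= mu <= upper
    print(f"{name}: 95% CI ({lower:.2f}, {upper:.2f}), contains mu={mu}: {inside}")
```

The engagement interval excludes the original mean of 7.5 while the learning interval includes 8.2, which agrees with the z-score reading above.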
