# Descriptive Statistics

This post covers an introduction to descriptive statistics.

# Introduction

• A descriptive statistic is a summary statistic that quantitatively describes or summarises features of a collection of information

• Descriptive statistics is distinguished from inferential statistics (or inductive statistics) by its aim to summarise a sample, rather than use the data to learn about the population that the sample of data is thought to represent.

• Inferential Statistics
• Population: group of interest
• Sample: subset of population
• Statistic: numeric summary from sample
• Parameter: numeric summary of the population
• Measures commonly used to describe a data set are:

• measures of central tendency
• mean, median and mode
• measures of variability or dispersion
• standard deviation (or variance) and the minimum/maximum values of the variables; kurtosis and skewness, which describe the shape of the distribution, are often reported alongside them
• Use in Statistical Analysis

• Descriptive statistics provide summaries of the sample and its observations, either quantitative (summary statistics) or visual (plots)
• A collection of such summarisation techniques is known as Exploratory Data Analysis (EDA)
• Univariate Analysis

• Describe the distribution of single variable: central tendency/dispersion
• Mean, median, mode, range, quartile, variance, standard deviation, skewness, histogram
• Bivariate and Multivariate Analysis

• Descriptive statistics may also be used to describe the relationship between pairs of variables

• Scatter plots, Cross Tabulation or Contingency Table

| Sales | Product A | Product B | Total |
|---|---|---|---|
| 2019 | 1000 | 1200 | 2200 |
| 2020 | 800 | 1500 | 2300 |
| Total | 1800 | 2700 | 4500 |

• Quantitative Measure of Dependence

• Correlation
• Pearson - when both variables are continuous (measures linear association)
• Spearman’s rho - when one or both variables are ordinal rather than continuous (rank-based)
• Covariance
• Slope in regression analysis
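
The three measures of dependence listed above are closely related, which a small sketch may make concrete. The data are hypothetical (hours studied versus exam score), the helper names `covariance`, `pearson_r` and `slope` are illustrative rather than taken from any library, and the population (divide-by-n) convention is assumed.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pearson_r(xs, ys):
    """Pearson correlation: covariance rescaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

def slope(xs, ys):
    """Least-squares slope of y on x: covariance divided by the variance of x."""
    return covariance(xs, ys) / covariance(xs, xs)

# Hypothetical data: hours studied and exam score
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

print(covariance(hours, score))  # about 8.2
print(pearson_r(hours, score))   # close to 1: strong positive linear association
print(slope(hours, score))       # about 4.1 score points per extra hour
```

Covariance depends on the units of both variables, whereas the correlation is unit-free and always lies between -1 and 1. The slope shares the sign of both.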

# Data Types

• The data type determines which analyses and plots are appropriate
• Two broad types: quantitative and categorical

## Quantitative

• Numeric values that allow mathematical operations

| Discrete | Continuous |
|---|---|
| Number of students | Age (age in years can be subdivided into months, days, hours, ...) |

## Categorical

• Group or set of items

| Nominal (no order) | Ordinal (ordered) |
|---|---|
| Gender | Rating: F to A* |

# Notations:

• Random Variable, $X$
• Observed value of random variable, $x_i$

# Quantitative data - Discrete or Continuous

• Center: Mean, Median, Mode
• Spread: Numeric (Range, IQR, SD, Var) and Visual (Histogram, Box plot with five-number summary)
• Shape: Histogram/Box plot - Symmetrical, Right Skewed, Left Skewed
• Outliers: Histogram

## Measure of Center

• Mean
• Sum/Count, $\bar{x} = \frac{1}{n}\sum\limits_{i=1}^{n}{x_i}$
• Median
• Middle value of the sorted data set (the mean of the two middle values when the count is even)
• Mode
• Most frequent value in the data set
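
As a quick check of the three definitions, here is a minimal sketch using Python's standard `statistics` module. It reuses the seven values from the five-number example later in this post; `multimode` assumes Python 3.8 or newer.

```python
from statistics import mean, median, multimode

data = [3, 1, 2, 8, 5, 10, 3]

print(mean(data))       # 32 / 7, roughly 4.57
print(median(data))     # 3, the middle of the sorted values 1, 2, 3, 3, 5, 8, 10
print(multimode(data))  # [3]; a list, because a data set can have several modes
```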

## Measure of Spread

### Numeric Measure

• Range
• Maximum - Minimum
• Interquartile Range
• Q3 - Q1
• Standard Deviation
• Single value that measures spread of the data
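• On average, how much each value varies from the mean of the values, i.e. the average variation
• Compute the mean, $\bar{x} = \frac{1}{n}\sum\limits_{i=1}^{n}{x_i}$
• The plain average deviation, $\frac{1}{n}\sum\limits_{i=1}^{n}(x_i - \bar{x})$, does not work: negative and positive deviations cancel, so it is always zero
• Instead, square the deviations before averaging (the variance) and take the square root: $SD = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(x_i - \bar{x})^2}$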
• Variance
• Single value that measures spread of the data
• Average squared difference of each value from the mean
• $Variance = \frac{1}{n}\sum\limits_{i=1}^{n}(x_i - \bar{x})^2$
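
The numeric measures can be traced step by step on a small data set. This is a sketch on hypothetical values chosen so the arithmetic is exact. It follows the divide-by-n formulas used in this post.

```python
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical values
n = len(data)
mean = sum(data) / n               # 5.0

deviations = [x - mean for x in data]
print(sum(deviations) / n)         # 0.0: positive and negative deviations cancel

variance = sum(d ** 2 for d in deviations) / n
sd = sqrt(variance)
print(variance, sd)                # 4.0 2.0

value_range = max(data) - min(data)
print(value_range)                 # 7
```

The standard library's `statistics.pvariance` and `statistics.pstdev` use the same divide-by-n convention, while `statistics.variance` and `statistics.stdev` divide by n - 1, which is the usual choice when estimating from a sample.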

### Visual

• Histogram: represents distribution of numerical data

• Bin the values by dividing the entire range into a series of intervals and counting how many values fall into each interval

• Bins are usually consecutive, non-overlapping and adjacent, and often of equal size

• If the bins are of equal size:

• A rectangle is erected over each bin with height proportional to the frequency (the number of values in the bin)
• If the bins are not of equal size:

• A rectangle is erected so that its area is proportional to the frequency of values in the bin; the vertical axis then represents frequency density
• Rectangles touch each other to indicate that the data are continuous

• A histogram differs from a bar chart, which is for categorical data; bar charts may have gaps between bars to make the distinction visible

• Examples (figure panels): Symmetric Unimodal, Skewed Right, Skewed Left, Bimodal, Multimodal, Symmetric

• We may plot using different bin widths to learn more about the data
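
The binning rules for histograms can be sketched in plain Python. The function names `equal_width_counts` and `frequency_density` are hypothetical helpers, not a library API. The sketch assumes the data contain at least two distinct values and that the last bin includes its right edge.

```python
def equal_width_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin, so clamp the index
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

def frequency_density(values, edges):
    """For unequal bins, return count / bin width for each bin."""
    densities = []
    last = len(edges) - 2
    for i in range(len(edges) - 1):
        left, right = edges[i], edges[i + 1]
        # Bins are [left, right); the last bin also includes its right edge
        count = sum(1 for v in values
                    if left <= v < right or (i == last and v == right))
        densities.append(count / (right - left))
    return densities

data = [3, 1, 2, 8, 5, 10, 3]
print(equal_width_counts(data, 3))             # [4, 1, 2]
print(frequency_density(data, [0, 2, 4, 10]))  # [0.5, 1.5, 0.5]
```

With the unequal bins, the last bin holds as many values as the middle one (three each) but is three times as wide, so its density, and therefore its bar height, is a third as large.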

• Five-number Summary

• A set of descriptive statistics that provides information about a dataset

• Consists of five sample percentiles for a single variable

• Sample minimum
• Lower / 1st quartile
• Median / middle value
• Upper / 3rd quartile
• Sample maximum
• Provides information about location (median), spread (quartiles) and range (minimum and maximum)

• Example

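```python
import numpy as np

def fivenum(data):
    """Five-number summary."""
    # 'method' replaced the deprecated 'interpolation' keyword in NumPy 1.22
    return np.percentile(data, [0, 25, 50, 75, 100], method='midpoint')

# Number of moons of each planet in the Solar System
moons = [0, 0, 1, 2, 63, 61, 27, 13]

print(fivenum(moons))
# Prints the values 0, 0.5, 7.5, 44 and 63 (array formatting varies by NumPy version)
```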

• Find the five-number summary of 3, 1, 2, 8, 5, 10, 3

• Nums = 3, 1, 2, 8, 5, 10, 3
• Ordered Nums = 1, 2, 3, 3, 5, 8, 10
• Min = 1; Max = 10; Median (Q2) = 3
• Q1 is median of 1, 2, 3 = 2 [25% data below this value]
• Q3 is median of 5, 8, 10 = 8 [75% data below this value]
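
The hand calculation above takes each quartile as the median of one half of the sorted data. A minimal sketch of that rule follows, assuming at least two values. `five_number_summary` is a hypothetical helper name, and other quartile conventions exist that can give slightly different answers.

```python
from statistics import median

def five_number_summary(data):
    """Min, Q1, median, Q3, max, with quartiles as medians of the two halves."""
    s = sorted(data)
    n = len(s)
    half = n // 2
    lower = s[:half]
    # For an odd count the overall median is left out of both halves
    upper = s[half + 1:] if n % 2 else s[half:]
    return (s[0], median(lower), median(s), median(upper), s[-1])

print(five_number_summary([3, 1, 2, 8, 5, 10, 3]))  # (1, 2, 3, 8, 10)
```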
• Box Plot

• Used to quickly compare the spread of datasets
• Boxplot with whiskers from minimum to maximum
• Same boxplot with whiskers of at most 1.5 IQR
• From the upper quartile, a distance of 1.5 times the IQR is measured out, and a whisker is drawn up to the largest observed point in the dataset that falls within this distance
• Similarly, a distance of 1.5 times the IQR is measured out below the lower quartile, and a whisker is drawn down to the smallest observed point in the dataset that falls within this distance
• All other observed points are plotted as outliers
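
The 1.5 IQR whisker rule can be sketched as follows. The data are hypothetical (the seven values from the five-number example plus one extreme value, 30). Quartiles are taken as medians of the two halves, which is only one of several conventions. `box_plot_stats` is an illustrative name.

```python
from statistics import median

def box_plot_stats(data, k=1.5):
    """Quartiles, whisker ends and outliers under the k * IQR rule."""
    s = sorted(data)
    n = len(s)
    half = n // 2
    q1 = median(s[:half])
    q3 = median(s[half + 1:] if n % 2 else s[half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers end at the most extreme observed points inside the fences
    inside = [x for x in s if low_fence <= x <= high_fence]
    outliers = [x for x in s if x < low_fence or x > high_fence]
    return {"q1": q1, "q3": q3,
            "whiskers": (inside[0], inside[-1]), "outliers": outliers}

data = [3, 1, 2, 8, 5, 10, 3, 30]  # 30 is a hypothetical extreme value
stats = box_plot_stats(data)
print(stats["whiskers"])  # (1, 10)
print(stats["outliers"])  # [30]
```

Here Q1 = 2.5 and Q3 = 9, so the IQR is 6.5 and the upper fence sits at 9 + 9.75 = 18.75. The value 30 lies beyond it and is plotted as an outlier, while the upper whisker stops at 10.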

## Measure of Shape

• Histogram / Box Plot
• Symmetrical
• Mean ≈ Median ≈ Mode (for a unimodal distribution)
• Right Skewed
• Median < Mean
• Longer Whisker on the right of Box Plot
• Left Skewed
• Mean < Median
• Longer whisker on the left of Box Plot
• Examples:
• Symmetrical or Bell Curve: exam scores; heights of people
• Left Skewed: Age of death
• Right Skewed: Distribution of wealth
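
The mean-versus-median comparison above can be turned into a rough check. This is a rule-of-thumb sketch on hypothetical data, not a formal skewness measure. `skew_direction` and its `tolerance` parameter are illustrative names.

```python
from statistics import mean, median

def skew_direction(data, tolerance=0.0):
    """Guess the skew direction by comparing the mean with the median."""
    m, med = mean(data), median(data)
    if m - med > tolerance:
        return "right skewed"       # long right tail pulls the mean up
    if med - m > tolerance:
        return "left skewed"        # long left tail pulls the mean down
    return "roughly symmetrical"

print(skew_direction([1, 2, 2, 3, 3, 3, 4, 20]))        # right skewed
print(skew_direction([1, 17, 18, 18, 19, 19, 19, 20]))  # left skewed
print(skew_direction([1, 2, 3, 4, 5]))                  # roughly symmetrical
```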

## Outliers

• Values that are far from the rest of the dataset
• A simple check is to look at the histogram and see whether a value lies far from the others
• Outliers may distort summary statistics, e.g. the mean salary of CEOs when one company is Apple and the rest are small-scale; the standard deviation is not a good measure in this case either
• If an outlier is a typo, correct or remove it
• Report the five-number summary when outliers are present, since a single number may be misleading
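
The effect of a single outlier on the summary statistics is easy to see numerically. The salaries below are hypothetical (in thousands) and simply mimic the CEO example.

```python
from statistics import mean, median, pstdev

salaries = [50, 55, 60, 65, 70]      # hypothetical small-company salaries
with_outlier = salaries + [10_000]   # one hypothetical very large salary

# Without the outlier the mean and median agree (both 60)
print(mean(salaries), median(salaries))

# With it the mean jumps to about 1716.7, while the median only moves to 62.5
print(mean(with_outlier), median(with_outlier))

# The standard deviation is inflated as well
print(pstdev(salaries), pstdev(with_outlier))
```

The median barely moves because it depends only on the middle of the ordered values, which is why the five-number summary is the safer report here.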

## Guidelines

• Plot the data
• If there are outliers, handle them
• If Symmetrical, Bell Shaped
• Mean and Standard Deviation
• If Skewed
• Five number summary
