Descriptive Statistics
This post is an introduction to descriptive statistics.
Descriptive statistics
A descriptive statistic is a summary statistic that quantitatively describes or summarises features of a collection of information.
Descriptive statistics is distinguished from inferential statistics (or inductive statistics) by its aim to summarise a sample, rather than use the data to learn about the population that the sample of data is thought to represent.
- Inferential Statistics
  - Population: group of interest
  - Sample: subset of the population
  - Statistic: numeric summary of the sample
  - Parameter: numeric summary of the population
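To make the four terms concrete, here is a minimal sketch using only the Python standard library. The population (ages of 1,000 club members), the sample size of 50 and the fixed seed are all made up for illustration.

```python
import random
import statistics

# Hypothetical population: ages of all 1,000 members of a club.
random.seed(0)  # fixed seed so the sketch is reproducible
population = [random.randint(18, 80) for _ in range(1000)]

# Parameter: a numeric summary of the whole population.
population_mean = statistics.mean(population)

# Sample: a subset of the population, drawn without replacement.
sample = random.sample(population, 50)

# Statistic: the same summary, computed from the sample only.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two numbers are usually close but not equal. Descriptive statistics stops at reporting the sample mean, while inferential statistics uses it to reason about the population mean.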
Measures that are commonly used to describe a data set are
- measures of central tendency
  - mean, median and mode
- measures of variability or dispersion
  - standard deviation (or variance), the minimum/maximum values of the variables, kurtosis, and skewness
Use in Statistical Analysis
- Descriptive statistics provide summaries of the sample and its observations, either quantitative (summary statistics) or visual
- A collection of such summarisation techniques forms Exploratory Data Analysis (EDA)
Univariate Analysis
- Describe the distribution of a single variable: central tendency/dispersion
  - Mean, median, mode, range, quartiles, variance, standard deviation, skewness, histogram
Bivariate and Multivariate Analysis
Descriptive statistics may also be used to describe the relationship between pairs of variables
Scatter plots, Cross Tabulation or Contingency Table

| Sales | Product A | Product B | Total |
|-------|-----------|-----------|-------|
| 2019  | 1000      | 1200      | 2200  |
| 2020  | 800       | 1500      | 2300  |
| Total | 1800      | 2700      | 4500  |

Quantitative Measure of Dependence
- Correlation
  - Pearson - when both variables are continuous
  - Spearman’s rho - if one or both are not continuous
- Covariance
- Slope in regression analysis
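The measures of dependence listed above can be sketched in plain Python. This is an illustrative implementation, not a reference one: the data (hours studied and exam score) are made up, the covariance uses the population form (divide by $n$), and the rank helper assumes there are no tied values.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    sx = math.sqrt(covariance(xs, xs))
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

def ranks(values):
    """Rank from 1 (smallest) upwards; assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

def slope(xs, ys):
    """Least-squares slope of y on x: cov(x, y) / var(x)."""
    return covariance(xs, ys) / covariance(xs, xs)

# Hypothetical data: hours studied and exam score for five students.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 70, 90]

print("covariance:", covariance(hours, score))
print("pearson:   ", pearson(hours, score))
print("spearman:  ", spearman(hours, score))
print("slope:     ", slope(hours, score))
```

Because the scores rise with every extra hour but not along a straight line, Spearman's rho comes out at (approximately) 1 while Pearson's coefficient is somewhat lower.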
Data Types
The data type determines which analyses and plots are appropriate
Quantitative
- Numeric values that allow mathematical operations
- Discrete
  - e.g. number of students
- Continuous
  - e.g. age [since age in years can be divided into months, then days, hours, ...]
Categorical
Group or set of items
Nominal (no order)
- Gender
Ordinal (ordered)
- e.g. rating: F to A*
For Discrete or Continuous Quantitative data
We describe four aspects: Center, Spread, Shape, Outliers
Random Variable, $X$
Observed value of random variable, $x_i$
Measure of Center
- Mean
  - Sum/Count, $\bar{x} = \frac{1}{n}\sum\limits_{i=1}^{n}{x_i}$
- Median
  - Middle value of the sorted data set (the average of the two middle values when the count is even)
- Mode
  - Most frequent value in the data set
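A minimal sketch of the three measures of center, using the standard library `statistics` module on a small made-up data set:

```python
import statistics

# Small made-up data set.
values = [3, 1, 2, 8, 5, 10, 3]

mean = statistics.mean(values)      # sum / count = 32 / 7
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value

print(f"mean   = {mean:.3f}")
print(f"median = {median}")
print(f"mode   = {mode}")
```

Here the median and the mode are both 3, while the mean (about 4.571) is pulled upwards by the larger values 8 and 10.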
Measure of Spread
Numeric Measure
- Range
  - Maximum - Minimum
- Interquartile Range
  - Q3 - Q1
- Standard Deviation
  - Single value that measures the spread of the data
  - Roughly, how much each value varies from the mean of the values on average
  - Compute $\bar{x} = \frac{1}{n}\sum\limits_{i=1}^{n}{x_i}$
  - A first attempt is the average deviation, $\frac{1}{n}\sum\limits_{i=1}^{n}{(x_i - \bar{x})}$
    - Problem: some deviations are negative and some positive, so they cancel and the result is always zero
  - Instead, compute the square root of the variance: $SD = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}{(x_i - \bar{x})^2}}$
- Variance
  - Single value that measures the spread of the data
  - Average squared difference of each value from the mean
  - $Variance = \frac{1}{n}\sum\limits_{i=1}^{n}{(x_i - \bar{x})^2}$
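The numeric measures of spread can be checked with a short sketch. The data set is made up. `statistics.pvariance` and `statistics.pstdev` divide by $n$, matching the formulas in this post, whereas `statistics.variance` and `statistics.stdev` divide by $n - 1$. Quartile conventions differ between tools, so the IQR shown depends on the default method of `statistics.quantiles`.

```python
import statistics

# Made-up data set of eight values.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)  # 5

# Plain deviations from the mean cancel out, so their sum is zero.
deviations = [x - mean for x in data]
print("sum of deviations:", sum(deviations))

# Squaring removes the sign: variance is the average squared deviation.
variance = statistics.pvariance(data)
sd = statistics.pstdev(data)  # square root of the variance
print("variance:", variance)
print("standard deviation:", sd)

# Range and interquartile range.
data_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # one of several quartile conventions
iqr = q3 - q1
print("range:", data_range)
print("IQR:", iqr)
```

For this data the variance is 4 and the standard deviation is 2, in the same units as the data.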
Visual
Histogram: represents the distribution of numerical data
Bin the values by dividing the entire range into a series of intervals, then count how many values fall into each interval
Bins are usually consecutive, non-overlapping and adjacent, and often of equal size
If the bins are of equal size:
- A rectangle is erected over each bin with height proportional to the frequency (number of values in the bin)
If the bins are not of equal size:
- A rectangle is erected so that its area is proportional to the frequency of values in the bin. Thus, the vertical axis represents frequency density
The rectangles touch each other to indicate continuous data
This differs from a bar chart, which is for categorical data. Bar charts may have gaps between the bars to indicate visually that the chart is a bar chart
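The binning step, and the frequency density used for unequal bins, can be sketched as follows. The ages and bin edges are made up, and the half-open bin convention (each bin includes its left edge, the last bin also includes its right edge) is one common choice rather than the only one.

```python
def histogram(values, edges):
    """Count values per bin.

    Bin i covers [edges[i], edges[i + 1]); the last bin also includes its
    right edge. Values outside the overall range are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts

def frequency_density(counts, edges):
    """Frequency divided by bin width, so that bar area equals frequency."""
    return [c / (edges[i + 1] - edges[i]) for i, c in enumerate(counts)]

# Hypothetical ages, binned with unequal widths 5, 5, 10 and 40.
ages = [1, 3, 4, 7, 8, 12, 15, 18, 22, 35, 41, 58]
edges = [0, 5, 10, 20, 60]

counts = histogram(ages, edges)
density = frequency_density(counts, edges)
print(counts)   # [3, 2, 3, 4]
print(density)  # [0.6, 0.4, 0.3, 0.1]
```

The widest bin holds the most values, but its density is the lowest. Plotting raw counts as heights would overstate it.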
Examples
Example histogram shapes: symmetric unimodal, skewed right, skewed left, bimodal, multimodal, symmetric.
We may plot the data using different bin widths to learn more about it.
Five Number Summary
The five-number summary is a set of descriptive statistics that provides information about a dataset
It consists of five sample percentiles for a univariate variable
- Sample minimum
- Lower / 1st Quartile
- Median / middle value
- Upper / 3rd Quartile
- Sample Maximum
Provides information about location (median), spread (quartiles) and range (minimum and maximum)

```python
import numpy as np

def fivenum(data):
    """Five-number summary."""
    # NumPy 1.22+ names this argument `method`; older releases call it `interpolation`.
    return np.percentile(data, [0, 25, 50, 75, 100], method='midpoint')

# Number of moons of each planet in the Solar System
moons = [0, 0, 1, 2, 63, 61, 27, 13]
print(fivenum(moons))
# [ 0.   0.5  7.5 44.  63. ]
```
Find the five-number summary of 3, 1, 2, 8, 5, 10, 3
- Nums = 3, 1, 2, 8, 5, 10, 3
- Ordered Nums = 1, 2, 3, 3, 5, 8, 10
- Min = 1; Max = 10; Median (Q2) = 3
- Q1 is the median of the lower half 1, 2, 3 = 2 [about 25% of the data lies below this value]
- Q3 is the median of the upper half 5, 8, 10 = 8 [about 75% of the data lies below this value]
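The hand calculation above can be reproduced with a short sketch. It uses the "median of each half" rule from the worked example (the middle value is left out of both halves when the count is odd). This is only one of several quartile conventions, so percentile-based functions may return slightly different quartiles. The function name is made up for illustration.

```python
import statistics

def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves rule.

    Assumes at least two values.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]  # values before the middle position
    # For an odd count the middle value belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return (
        data[0],
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
        data[-1],
    )

print(five_number_summary([3, 1, 2, 8, 5, 10, 3]))  # (1, 2, 3, 8, 10)
```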
Box Plot
- Used to quickly compare the spread of datasets
  - Boxplot with whiskers from minimum to maximum
  - The same boxplot with whiskers limited to at most 1.5 IQR
    - From the upper quartile, a distance of 1.5 times the IQR is measured out and a whisker is drawn up to the largest observed point from the dataset that falls within this distance.
    - Similarly, a distance of 1.5 times the IQR is measured out below the lower quartile and a whisker is drawn down to the smallest observed point from the dataset that falls within this distance.
    - All other observed points are plotted as outliers
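The 1.5 IQR whisker rule can be sketched without any plotting library. The data are made up (a small data set with one extra large value, 30), the helper names are hypothetical, and the quartiles are computed as the median of each half, which is an assumption about the convention in use.

```python
import statistics

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return statistics.median(lower), statistics.median(upper)

def boxplot_whiskers(values):
    """Return (lower whisker end, upper whisker end, outliers)."""
    q1, q3 = quartiles(values)
    reach = 1.5 * (q3 - q1)  # maximum whisker length
    low_limit, high_limit = q1 - reach, q3 + reach
    inside = [v for v in values if low_limit <= v <= high_limit]
    outliers = sorted(v for v in values if v < low_limit or v > high_limit)
    # Whiskers end at the most extreme observed points within the limits.
    return min(inside), max(inside), outliers

values = [3, 1, 2, 8, 5, 10, 3, 30]
print(boxplot_whiskers(values))  # (1, 10, [30])
```

Here Q1 = 2.5 and Q3 = 9, so the upper limit is 9 + 1.5 × 6.5 = 18.75. The upper whisker stops at 10, the largest observation below that limit, and 30 is drawn as an outlier.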
Measure of Shape
- Histogram / Box Plot
  - Symmetrical
    - Mean = Median = Mode (approximately, for single-peaked data)
  - Right Skewed
    - Median < Mean
    - Longer whisker on the right of the box plot
  - Left Skewed
    - Mean < Median
    - Longer whisker on the left of the box plot
- Examples:
  - Symmetrical or Bell Curve: exam scores; heights of people
  - Left Skewed: age at death
  - Right Skewed: distribution of wealth
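The mean-versus-median rule of thumb above can be turned into a small check. This is a rough heuristic rather than a formal skewness test; the function name, the `tolerance` parameter and all three data sets are made up for illustration.

```python
import statistics

def describe_skew(values, tolerance=0.0):
    """Classify shape by comparing the mean with the median (rule of thumb)."""
    difference = statistics.mean(values) - statistics.median(values)
    if abs(difference) <= tolerance:
        return "roughly symmetrical"
    return "right skewed" if difference > 0 else "left skewed"

wealth = [10, 12, 15, 18, 20, 25, 400]       # a few very large values
age_at_death = [2, 60, 70, 75, 80, 85, 88]   # a few very small values
exam_scores = [40, 50, 60, 70, 80]           # evenly spread

print(describe_skew(wealth))        # right skewed
print(describe_skew(age_at_death))  # left skewed
print(describe_skew(exam_scores))   # roughly symmetrical
```

In practice the comparison is best read alongside a histogram or box plot, since very different shapes can share the same mean and median.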
Outliers
Values that are far from the rest of the dataset
We may simply look at the histogram and check whether a value lies far from the other values
Outliers can distort summary statistics such as the mean, e.g. the mean salary of CEOs when one company is Apple and the rest are small-scale businesses; the standard deviation is not a good measure in this case either
If the outlier is a typo, correct or remove it
Report the five-number summary when outliers are present, since a single number may be misleading
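A minimal sketch of how a single outlier distorts the mean and standard deviation while barely moving the median. The salaries (in thousands) are made up.

```python
import statistics

# Hypothetical CEO salaries in thousands; the last one is an extreme outlier.
salaries = [90, 100, 110, 120, 130]
with_outlier = salaries + [100_000]

for label, data in [("without outlier", salaries), ("with outlier", with_outlier)]:
    print(
        f"{label:16s}"
        f" mean={statistics.mean(data):10.1f}"
        f" median={statistics.median(data):7.1f}"
        f" sd={statistics.pstdev(data):10.1f}"
    )
```

The median only moves from 110 to 115, while the mean jumps from 110 to more than 16,000, a value that describes none of the six companies.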
Guidelines
- Plot the data
- If there are outliers, handle them
- If Symmetrical, Bell Shaped
  - Report the mean and standard deviation
- If Skewed
  - Report the five number summary