# Descriptive Statistics


This post is an introduction to descriptive statistics.

# Descriptive statistics

A descriptive statistic is a summary statistic that quantitatively describes or summarises features of a collection of information.

Descriptive statistics is distinguished from inferential statistics (or inductive statistics) by its aim to summarise a sample, rather than use the data to learn about the population that the sample of data is thought to represent.

- Inferential Statistics
  - Population: group of interest
  - Sample: subset of the population
  - Statistic: numeric summary of a sample
  - Parameter: numeric summary of the population

Measures that are commonly used to describe a data set are:

- measures of central tendency
  - mean, median and mode

- measures of variability or dispersion
  - standard deviation (or variance), the minimum/maximum values of the variables, kurtosis, and skewness

Use in Statistical Analysis

- Descriptive statistics provide summaries about the sample/observations, either quantitative (summary statistics) or visual
- A collection of such summarisation techniques forms the basis of Exploratory Data Analysis (EDA)

Univariate Analysis

- Describe the distribution of a single variable: central tendency/dispersion
  - Mean, median, mode, range, quartile, variance, standard deviation, skewness, histogram

Bivariate and Multivariate Analysis

Descriptive statistics may also be used to describe the relationship between pairs of variables

Visual and tabular tools include scatter plots and cross tabulations (contingency tables)

| Sales | Product A | Product B | Total |
|-------|-----------|-----------|-------|
| 2019  | 1000      | 1200      | 2200  |
| 2020  | 800       | 1500      | 2300  |
| Total | 1800      | 2700      | 4500  |

Quantitative Measure of Dependence

- Correlation
  - Pearson - when both variables are continuous
  - Spearman's rho - if one or both are not continuous

- Covariance
- Slope in regression analysis
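The three measures of dependence listed above are closely related, which a small sketch can make concrete. The data and the function names (`covariance`, `pearson`, `slope`) are assumptions made for illustration, and the covariance uses a $\frac{1}{n}$ divisor to match the formulas in this post.

```python
from math import sqrt

def covariance(xs, ys):
    """Population covariance: mean product of deviations from the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def pearson(xs, ys):
    """Pearson correlation: covariance rescaled to lie between -1 and 1."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

def slope(xs, ys):
    """Slope of the least-squares line of y on x: cov(x, y) / var(x)."""
    return covariance(xs, ys) / covariance(xs, xs)

# Hypothetical paired observations, for illustration only
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

print(covariance(hours, score))
print(pearson(hours, score))
print(slope(hours, score))
```

Covariance depends on the units of both variables, while the correlation is unit-free, which is why correlation is usually easier to compare across datasets.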


## Data Types

Knowing the data type is important for choosing the appropriate analysis and plots

Quantitative

- Numeric values that allow mathematical operations
  - Discrete
    - e.g. Number of students

  - Continuous
    - e.g. Age [since age in years can be divided into months, then days, hours, ...]

Categorical

Group or set of items

Nominal (no order)

- Gender

Ordinal (ordered)

- e.g. rating: F to A*

### For Discrete or Continuous Quantitative data

Such data is described by its center, spread, shape and outliers

Random Variable, $X$

Observed value of random variable, $x_i$

#### Measure of Center

- Mean
  - Sum/Count, $\bar{x} = \frac{1}{n}\sum\limits_{i=1}^{n}{x_i}$

- Median
  - Middle value of the ordered data set (the average of the two middle values when the count is even)

- Mode
  - Most frequent value in the data set
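As a minimal sketch, the three measures of center can be computed with Python's standard `statistics` module. The sample values are only an illustration.

```python
import statistics

data = [3, 1, 2, 8, 5, 10, 3]

mean = sum(data) / len(data)        # sum / count
median = statistics.median(data)    # middle value of the sorted data
mode = statistics.mode(data)        # most frequent value

print(mean, median, mode)
```

Here the mean (about 4.57) is pulled above the median (3) by the larger values 8 and 10.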

#### Measure of Spread

Numeric Measure

- Range
  - Maximum - Minimum

- Interquartile Range
  - Q3 - Q1

- Standard Deviation
  - Single value that measures the spread of the data
  - On average, how much each value varies from the mean of the values, i.e. the average variation
  - Compute the mean, $\bar{x} = \frac{1}{n}\sum\limits_{i=1}^{n}{x_i}$
  - A naive average deviation would be $\frac{1}{n}\sum\limits_{i=1}^{n}{(x_i - \bar{x})}$
  - Problem: some deviations are negative and some positive, so they cancel out and the sum is always zero
  - Instead, square the deviations (giving the variance) and take the square root: $SD = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}{(x_i - \bar{x})^2}}$

- Variance
  - Single value that measures the spread of the data
  - Average squared difference of each value from the mean
  - $Variance = \frac{1}{n}\sum\limits_{i=1}^{n}{(x_i - \bar{x})^2}$
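The numeric measures of spread above can be sketched in a few lines of standard-library Python. The sample values are an illustration, the variance uses the $\frac{1}{n}$ divisor from the formulas above, and the quartiles come from `statistics.quantiles` with its default method, which is one of several quartile conventions.

```python
import statistics

data = [3, 1, 2, 8, 5, 10, 3]
n = len(data)
mean = sum(data) / n

data_range = max(data) - min(data)

# Quartiles with the default ("exclusive") method of statistics.quantiles
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

deviations = [x - mean for x in data]
mean_deviation = sum(deviations) / n            # cancels to zero (up to rounding)
variance = sum(d ** 2 for d in deviations) / n  # average squared deviation
std_dev = variance ** 0.5                       # square root of the variance

print(data_range, iqr, mean_deviation, variance, std_dev)
```

The near-zero `mean_deviation` shows why the raw deviations cannot be averaged directly, and why they are squared first.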

Visual

Histogram: represents the distribution of numerical data

Bin the values by dividing the entire range into a series of intervals, then count how many values fall into each interval

Bins are usually consecutive, non-overlapping and adjacent, and often of equal size

If the bins are of equal size:

- A rectangle is erected over each bin with height proportional to the frequency (number of values in the bin)

If the bins are not of equal size:

- A rectangle is erected so that its area is proportional to the frequency of values in the bin. Thus, the vertical axis represents frequency density

The rectangles touch each other to indicate that the data is continuous

This differs from a bar chart, which is for categorical data. Bar charts may have gaps between the bars to indicate visually that they are bar charts
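The binning step, and the frequency density used for unequal bins, can be sketched without any plotting library. The function name `histogram`, the bin edges and the ages below are assumptions made for illustration.

```python
def histogram(data, edges):
    """Count values per bin and compute frequency density (count / bin width).

    Bins are half-open [left, right), except the last bin,
    which also includes its right edge.
    """
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for x in data:
        for i in range(len(counts)):
            in_bin = edges[i] <= x < edges[i + 1]
            on_final_edge = i == last and x == edges[-1]
            if in_bin or on_final_edge:
                counts[i] += 1
                break
    densities = [count / (edges[i + 1] - edges[i])
                 for i, count in enumerate(counts)]
    return counts, densities

# Hypothetical ages, binned with unequal widths: 10, 10, 20 and 20 years
ages = [2, 5, 7, 12, 15, 18, 21, 25, 33, 38, 45, 60]
counts, densities = histogram(ages, [0, 10, 20, 40, 60])

print(counts)
print(densities)
```

The third bin holds the most values, but because it is twice as wide its density is lower than that of the first two bins, which is what the rectangle heights should reflect.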

Examples

[Figure: example histogram shapes - Symmetric Unimodal, Skewed Right, Skewed Left, Bimodal, Multimodal, Symmetric]

We may plot the data using different bin widths to learn more about it

The five-number summary is a set of descriptive statistics that provides information about a dataset

It consists of five sample percentiles of a single variable

- Sample minimum
- Lower / 1st Quartile
- Median / middle value
- Upper / 3rd Quartile
- Sample Maximum

It provides information about location (median), spread (quartiles) and range (minimum and maximum)

```python
import numpy as np

def fivenum(data):
    """Five-number summary."""
    # 'method' replaces the deprecated 'interpolation' keyword (NumPy >= 1.22)
    return np.percentile(data, [0, 25, 50, 75, 100], method='midpoint')

# Number of moons of each planet in the Solar System
moons = [0, 0, 1, 2, 63, 61, 27, 13]
print(fivenum(moons))
# [ 0.   0.5  7.5 44.  63. ]
```

Example: find the five-number summary of 3, 1, 2, 8, 5, 10, 3

- Nums = 3, 1, 2, 8, 5, 10, 3
- Ordered Nums = 1, 2, 3, 3, 5, 8, 10
- Min = 1; Max = 10; Median (Q2) = 3
- Q1 is the median of the lower half 1, 2, 3 = 2 [about 25% of the data lies below this value]
- Q3 is the median of the upper half 5, 8, 10 = 8 [about 75% of the data lies below this value]
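The hand calculation above takes the quartiles as medians of the lower and upper halves of the ordered data. A minimal sketch of that approach follows. The function name `five_number_summary` is an assumption, and other quartile conventions (such as interpolation between neighbouring values) can give slightly different results.

```python
import statistics

def five_number_summary(data):
    """Min, Q1, median, Q3, max using the median-of-halves method.

    Needs at least two values. For an odd count, the middle value
    is excluded from both halves.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]     # values before the median position
    upper = ordered[-half:]    # values after the median position
    return (ordered[0],
            statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper),
            ordered[-1])

print(five_number_summary([3, 1, 2, 8, 5, 10, 3]))
# (1, 2, 3, 8, 10)
```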

- A box plot is used to quickly compare the spread of datasets
  - Box plot with whiskers from minimum to maximum
  - The same box plot with whiskers limited to 1.5 IQR
    - From the upper quartile, a distance of 1.5 times the IQR is measured out upwards, and a whisker is drawn up to the largest observed point that falls within this distance.
    - Similarly, a distance of 1.5 times the IQR is measured out below the lower quartile, and a whisker is drawn down to the smallest observed point that falls within this distance.
    - All other observed points are plotted as outliers
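The 1.5 IQR whisker rule can be sketched as follows. The function name `whiskers_and_outliers` and the data are assumptions for illustration, and the quartiles use the default method of `statistics.quantiles`, so a plotting library with a different quartile convention may place the fences slightly differently.

```python
import statistics

def whiskers_and_outliers(data):
    """Return (lower whisker, upper whisker, outliers) under the 1.5 IQR rule."""
    ordered = sorted(data)
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observed points inside the fences
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return inside[0], inside[-1], outliers

# Hypothetical data with one unusually large value
low, high, outliers = whiskers_and_outliers([1, 2, 3, 3, 5, 8, 10, 30])
print(low, high, outliers)
```

With this data the upper whisker stops at 10, and 30 is plotted separately as an outlier.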


### Measure of Shape

- Histogram / Box Plot
  - Symmetrical
    - Mean = Median = Mode

  - Right Skewed
    - Median < Mean
    - Longer whisker on the right of the Box Plot

  - Left Skewed
    - Mean < Median
    - Longer whisker on the left of the Box Plot

- Examples:
  - Symmetrical or Bell Curve: Scores in an exam; heights of persons
  - Left Skewed: Age at death
  - Right Skewed: Distribution of wealth
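The mean-versus-median comparison above can be turned into a rough check of shape. This is only a heuristic sketch, not a formal skewness measure; the function name `skew_direction` and the datasets are assumptions for illustration.

```python
import statistics

def skew_direction(data):
    """Rough heuristic: compare the mean with the median."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    if mean > median:
        return "right skewed"
    if mean < median:
        return "left skewed"
    return "symmetrical"

# Hypothetical datasets, for illustration only
print(skew_direction([1, 2, 2, 3, 3, 3, 4, 20]))         # long tail to the right
print(skew_direction([1, 17, 18, 18, 19, 19, 19, 20]))   # long tail to the left
print(skew_direction([1, 2, 3, 4, 5]))                   # balanced around 3
```

Real samples are rarely exactly symmetrical, so in practice a small difference between the mean and the median is usually treated as roughly symmetrical.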

### Outliers

Values that are far from the rest of the dataset

We may just look at the histogram and observe whether a value is far from the other values

Outliers may distort summary statistics, e.g. the mean salary of CEOs when one company is Apple and the rest are small-scale; the standard deviation is not a good measure in this case either

If an outlier is a typo, correct or remove it

Report the five-number summary when outliers are present, since a single number may be misleading
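A small sketch shows how a single outlier affects the mean and standard deviation while leaving the median almost unchanged. The salary figures are hypothetical and chosen only to make the effect visible.

```python
import statistics

# Hypothetical salaries in thousands, for illustration only
salaries = [50, 52, 55, 58, 60]
with_outlier = salaries + [5000]

for label, values in [("without outlier", salaries),
                      ("with outlier", with_outlier)]:
    print(label,
          "mean:", statistics.mean(values),
          "median:", statistics.median(values),
          "sd:", statistics.pstdev(values))
```

The mean jumps from 55 to roughly 879, while the median only moves from 55 to 56.5, which is why the median and quartiles are the safer summary here.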

## Guidelines

- Plot the data
- If there are outliers, handle them
- If symmetrical and bell shaped
  - Report the mean and standard deviation

- If skewed
  - Report the five-number summary