Descriptive Statistics

This lesson is from Introduction to Descriptive Statistics by Jackie Nicholas, Mathematics Learning Centre, University of Sydney

Introduction

  • A descriptive statistic is a summary statistic that quantitatively describes or summarises features of a collection of information
  • Descriptive statistics is distinguished from inferential statistics (or inductive statistics) by its aim to summarise a sample, rather than use the data to learn about the population that the sample of data is thought to represent.

    • Inferential Statistics
      • Population: group of interest
      • Sample: subset of population
      • Statistic: numeric summary from sample
      • Parameter: numeric summary of the population
  • Measures that are commonly used to describe a data set are

    • measures of central tendency and
      • mean, median and mode
    • measures of variability or dispersion
      • standard deviation (or variance), the minimum/maximum values of the variables, kurtosis, and skewness

Data Types

  • The data type determines which analyses and plots are appropriate
  • Quantitative and Categorical

Quantitative

  • Numeric values that allow mathematical operations

    | Discrete | Continuous |
    | --- | --- |
    | Number of students | Age [age in years can be divided further into months, then days, hours, ...] |

Categorical

  • Group or set of items

    | Nominal (no order) | Ordinal (ordered) |
    | --- | --- |
    | Gender | Rating: F to A* |

Notations:

  • Random Variable, $X$
  • Observed value of random variable, $x_i$

Quantitative data - Discrete or Continuous

  • Center: Mean, Median, Mode
  • Spread: Numeric (Range, IQR, SD, Var) and Visual (Histogram, Box with five-number summary)
  • Shape: Histogram/Box plot - Symmetrical, Right Skewed, Left Skewed
  • Outliers: Histogram

Measures of Central Tendency

Mean

  • Sum/Count, $ \mu = \frac{1}{n}\sum\limits_{i=1}^{n}{x_i}$

Median

  • Middle value of a data set

Mode

  • Most frequent number in the data set

The best one to use in a given situation depends on the type of variable given.

  • For example, suppose a class of 20 students own among them a total of 17 pets as shown in the following table.

    • Which measure of central tendency should we use here?

    • | Type of Pet | Number |
      | --- | --- |
      | Cat | 5 |
      | Dog | 4 |
      | Goldfish | 3 |
      | Rabbit | 1 |
      | Bird | 4 |
    • Focus is on Type of Pet

      • The appropriate average is the mode
        • Modal category - cat occurs most often
    • Focus is on average number of pets owned

      • 17 Pets and 20 students
      • mean = 17/20 = 0.85
    • Average number of pets per student

      • we need different data

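        • | Number of Pets | Frequency |
          | --- | --- |
          | 0 | 11 |
          | 1 | 4 |
          | 2 | 3 |
          | 3 | 1 |
          | 4 | 1 |

          $Mean = \frac{11\times0 + 4\times1 + 3\times2 + 1\times3 + 1\times4}{20} = \frac{17}{20} = 0.85$

        • Here $11\times0$ stands for $0+0+\dots+0$ (11 times), one zero for each student with no pets.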

        • Data set: 20 students have pets as: $0,0,0,0,0,0,0,0,0,0,0,~1,1,1,1,~2,2,2,~3,~4$

        • Median = 0 and Mode = 0
  • Comparison

    • For quantitative variables, the mean is in some respects a better measure of central tendency than the median.

    • One reason is that the mean is calculated from all the observed values.

    • The median uses all the observed values only for ranking; its value is determined by the middle value, or the two middle values, alone.

    • Another advantage is that the mean is relatively stable from one sample to the next.

    • If we take many samples from the same population, their means tend to vary less than their medians.

    • When there are a few extreme values, however, the median is preferred to the mean as a measure of central tendency.

    • The mean is very sensitive to extreme values and may not be an adequate measure of central tendency in such cases.

    • This is shown in the following illustration.

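      • | Number of Pets | Frequency |
        | --- | --- |
        | 0 | 11 |
        | 1 | 4 |
        | 2 | 3 |
        | 3 | 1 |
        | 4 | 1 |
        | 18 | 1 |
      • With the extra student (21 students, 35 pets), the mean changes from 0.85 to $35/21 \approx 1.67$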

      • However, Median = 0 and Mode = 0
    • The outlier raises the mean dramatically, so the median is now a better indication of the center of the distribution than the mean.

    • Unless there are clear outliers, the mean is usually used to describe the center of a distribution.

    • The mean may also be thought of as the balance point of distribution.
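The pet example can be checked with a short sketch using Python's standard `statistics` module. The helper `expand` is a hypothetical name introduced here to turn a frequency table into a list of observations; the frequencies are those in the two tables above.

```python
from statistics import mean, median, mode

def expand(freq_table):
    """Turn {value: frequency} into a sorted list of individual observations."""
    return sorted(v for v, f in freq_table.items() for _ in range(f))

# Number of pets -> number of students, taken from the tables in the text
pets = expand({0: 11, 1: 4, 2: 3, 3: 1, 4: 1})                      # 20 students
pets_with_outlier = expand({0: 11, 1: 4, 2: 3, 3: 1, 4: 1, 18: 1})  # 21 students

for label, data in [("original", pets), ("with outlier", pets_with_outlier)]:
    print(label, round(mean(data), 2), median(data), mode(data))
```

The mean moves from 0.85 to about 1.67 when the student with 18 pets is included, while the median and mode both stay at 0.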

Measures of Dispersion

Range

  • $Range = Maximum - Minimum$

  • The range, however, is a very limited measure of dispersion. Because it is based on just two observations, the lowest and the highest, it gives a misleading picture of dispersion if either of these values is an outlier.

  • We are looking for a measure of dispersion that will appropriately reflect the variability of the observed data.

  • We will now focus on the standard deviation, which is the most often used measure of dispersion.

Standard Deviation

  • Consider the following scenario: we have a collection of data in which there is no fluctuation in the observed values.

  • If all of the observations had the same value, say 5,5,5, the mean would also have the same value, 5.

  • Each observation would be the same as the previous one and would not deviate from the mean.

  • Consider another scenario: we have a collection of observations with some variability. The observed values then deviate from the mean by varying amounts.

  • The standard deviation may be thought of as a form of average of all of these deviations from the mean.

  • Variance

    • Single value that measures spread of the data
    • Average squared difference of each value from the mean
    • or $Variance = \sigma^2 = \frac{1}{n}\sum\limits_{i=1}^{n}{(x_i - \mu)^2}$
  • Standard Deviation is square root of the variance

    • $SD = \sigma = \sqrt{Variance}$
  • MS Excel

    • The StdDevP summary function should be used when the entire population is used in the calculation.

    • When a sample of the data is used, not the entire population, then use the StdDev summary function.

    • | Data | Deviations (Mean=16) | Squared Deviations |
      | --- | --- | --- |
      | 18 | 2 | 4 |
      | 18 | 2 | 4 |
      | 16 | 0 | 0 |
      | 16 | 0 | 0 |
      | 13 | -3 | 9 |
      | 15 | -1 | 1 |
    • Variance = Mean of Squared Deviations = 3
    • SD = $\sqrt{3} = 1.732$
    • or STDEV.P()
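As a cross-check outside Excel, the same table can be reproduced step by step in Python. This is a minimal sketch that treats the six values as the whole population (dividing by n), which is what `statistics.pvariance` and `statistics.pstdev` also do.

```python
from math import sqrt
from statistics import pvariance, pstdev

data = [18, 18, 16, 16, 13, 15]

n = len(data)
mu = sum(data) / n                      # mean = 16
deviations = [x - mu for x in data]     # 2, 2, 0, 0, -3, -1
squared = [d ** 2 for d in deviations]  # 4, 4, 0, 0, 9, 1
variance = sum(squared) / n             # mean of squared deviations = 3
sd = sqrt(variance)

print(mu, variance, round(sd, 3))
print(pvariance(data), round(pstdev(data), 3))  # library equivalents
```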

Interquartile Range

  • It is also possible to assess dispersion or spread using the interquartile range (IQR).
  • It is useful when the median is employed as a measure of central tendency.
  • The IQR is the width of the interval that contains the middle 50% of the distribution.
  • A quartile is a type of quantile: the three quartiles split the sorted data into four parts of roughly equal size.
  • To compute quartiles, the data must be sorted from smallest to largest; quartiles are therefore a kind of order statistic.
  • The three quartiles are defined as follows:
    • The first quartile (Q1) is the middle value between the minimum and the median, i.e. the median of the lower half of the data.
      • It is also known as the lower quartile or 25th empirical quartile, since 25% of the data lies below it.
    • The second quartile (Q2) of a data set is the median of the data set; hence, 50% of the data is located below this point.
    • The third quartile (Q3) is the middle value between the median and the maximum, i.e. the median of the upper half of the data.
      • It is also known as the upper quartile or 75th empirical quartile, since 75% of the data lies below it.
  • Five-number summary
    • Minimum, Maximum, and the three quartiles
    • This summary is significant in statistics because it offers information on the data’s center and spread.
    • Knowing the lower and higher quartiles offers insight into the size of the spread and if the dataset is skewed to one side or the other of the spectrum.
    • This is because, while the data points are divided equally among the quartiles, the distances between quartiles need not be equal (Q3 − Q2 may differ from Q2 − Q1). The difference Q3 − Q1 is the interquartile range (IQR).
    • While the maximum and minimum show the overall spread of the data, the upper and lower quartiles give more detailed information on
      • the location of specific data points,
      • the presence of outliers in the data, and
      • the difference in spread between the middle 50% of the data and the outer data points.
  • Using the interquartile range to explain data sets is especially beneficial in situations when there are extreme values.
  • Compared with the range and, to a lesser degree, the standard deviation, the IQR is less sensitive to extreme values because it depends only on the middle 50% of the data.
  • If there are data sets with extreme values, it may be more acceptable to utilize the median to indicate central tendency and the interquartile range to describe spread.
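To see how differently the three measures react to an extreme value, here is a small sketch on the pet data from earlier, with and without the student who owns 18 pets. It uses `statistics.quantiles` with `method="inclusive"`, which interpolates between observations; that is an assumption of this sketch, and other quartile conventions (such as the median of each half) can give slightly different values.

```python
from statistics import pstdev, quantiles

def spread(data):
    """Return the range, population SD and IQR of a data set."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "range": max(data) - min(data),
        "sd": round(pstdev(data), 2),
        "iqr": q3 - q1,
    }

pets = [0] * 11 + [1] * 4 + [2] * 3 + [3, 4]

print(spread(pets))
print(spread(pets + [18]))
```

With the extra value the range jumps from 4 to 18 and the standard deviation roughly triples, while the IQR only moves from 1.25 to 2.0.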

Estimates of the Mean and Variance

  • We have focused our attention so far on calculating a population’s mean, standard deviation, and variance.
  • We have used the Greek letters $\mu, \sigma^2, \sigma$ for these population quantities
  • In statistics, we are mostly concerned with analysing data from a sample chosen from a population in order to draw conclusions about the population as a whole.
  • Our data sets are typically obtained from random sampling of the general population.
  • When we have a random sample of size n, we can utilize the information from the sample to estimate the mean and variance of the population in the following ways:
  • $\bar{x} = \frac{1}{n}\sum\limits_{i=1}^{n}{x_i} $
  • $s^2 = \frac{1}{n-1}\sum\limits_{i=1}^{n}{(x_i - \bar{x})^2}$
    • Dividing by $n$ tends to underestimate the population variance
    • Dividing by $n-1$ instead makes $s^2$ an unbiased estimator of the population variance
    • $s^2$ is the estimated population variance [it is not strictly accurate to call it the sample variance]
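The effect of dividing by n − 1 can be illustrated with a small exhaustive experiment. The sketch below treats the six values from the standard deviation example as a complete population (an assumption made only for illustration), draws every possible ordered sample of size 2 with replacement, and averages the two kinds of variance estimate.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [18, 18, 16, 16, 13, 15]
sigma2 = pvariance(population)                 # true population variance (divide by n)

samples = list(product(population, repeat=2))  # all 36 ordered samples of size 2

# Average each estimator over every possible sample
avg_div_n = mean([float(pvariance(s)) for s in samples])        # divides by n
avg_div_n_minus_1 = mean([float(variance(s)) for s in samples]) # divides by n - 1

print(sigma2, avg_div_n, avg_div_n_minus_1)
```

Averaged over all samples, the n − 1 estimator matches the population variance of 3, whereas dividing by n gives only 1.5.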

Presenting Data Using Histograms and Bar Graphs

Histograms

  • A histogram is a kind of graph that is often used in statistics to depict data.

  • Creating a histogram is accomplished by separating the data into a number of classes, and then representing the number of instances of each class or frequency with a vertical rectangle.

  • 0,5,15,15,25,25,25,25,35,35,35,35,35,35,45,45,45,45,45,45,45,55,55,55,55,55,55,55,55,65,65,65,65,65,65,65,65,65,65,65,65,65,65,65,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,85,85,85,85,85,85,85,85,85,85,95,95,95,100

  • The area of the rectangle reflects frequency of the class.

    • | Range of marks | Frequency |
      | --- | --- |
      | 1–10 | 2 |
      | 11–20 | 2 |
      | 21–30 | 4 |
      | 31–40 | 6 |
      | 41–50 | 7 |
      | 51–60 | 8 |
      | 61–70 | 15 |
      | 71–80 | 22 |
      | 81–90 | 10 |
      | 91–100 | 4 |
      | TOTAL | 80 |
  • Class Interval: 1–10

  • MS Excel

    • https://trumpexcel.com/histogram-in-excel/

      Right Click on Bin -> Format Data Series

    • Bin Width = 10

  • Because each column has the same width (one class interval), the height of each column is proportional to its area.
  • The total area represents the total number of people in the sample.
  • We can estimate the number of people who scored between 26 and 40 from the histogram: half of the 21–30 class plus all of the 31–40 class.
    • $(\frac{1}{2}\times4) + (1\times6) = 8$
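The same area argument can be written as a small function. This is a sketch, not part of the original lesson: `estimate_count` is a hypothetical helper, and it assumes that marks are spread evenly within each class and that a class such as 21–30 covers the interval (20, 30].

```python
# (first mark, last mark, frequency) for each class in the table above
classes = [(1, 10, 2), (11, 20, 2), (21, 30, 4), (31, 40, 6), (41, 50, 7),
           (51, 60, 8), (61, 70, 15), (71, 80, 22), (81, 90, 10), (91, 100, 4)]

def estimate_count(low, high):
    """Estimate how many marks lie in (low, high] from the class frequencies."""
    total = 0.0
    for first, last, freq in classes:
        left, right = first - 1, last  # class treated as the interval (left, right]
        overlap = max(0, min(high, right) - max(low, left))
        total += freq * overlap / (right - left)  # share of the column's area
    return total

print(estimate_count(25, 40))   # marks from 26 to 40
print(estimate_count(0, 100))   # whole histogram
```

The first call gives 8.0, matching the hand calculation, and the second recovers the total of 80.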

Exploratory Data Analysis (EDA)

  • Descriptive statistics provide summaries of a sample or set of observations, either quantitative (summary statistics) or visual
  • A collection of summarisation techniques - Exploratory Data Analysis (EDA)

  • Univariate Analysis

    • Describe the distribution of single variable: central tendency/dispersion
      • Mean, median, mode, range, quartile, variance, standard deviation, skewness, histogram
  • Bivariate and Multivariate Analysis

    • Descriptive Statistics may also be used to describe relationship between pairs of variables

      • Scatter plots, Cross Tabulation or Contingency Table

        | Sales | Product A | Product B | Total |
        | --- | --- | --- | --- |
        | 2019 | 1000 | 1200 | 2200 |
        | 2020 | 800 | 1500 | 2300 |
        | Total | 1800 | 2700 | 4500 |
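The row and column totals of a contingency table like this one are simple sums over the cells. A minimal sketch in plain Python, with the sales figures copied from the table:

```python
sales = {
    2019: {"Product A": 1000, "Product B": 1200},
    2020: {"Product A": 800, "Product B": 1500},
}

# Row totals: one sum per year
row_totals = {year: sum(row.values()) for year, row in sales.items()}

# Column totals: one sum per product
col_totals = {}
for row in sales.values():
    for product, amount in row.items():
        col_totals[product] = col_totals.get(product, 0) + amount

grand_total = sum(row_totals.values())

print(row_totals, col_totals, grand_total)
```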
      • Quantitative Measure of Dependence

        • Correlation
          • Pearson - when both variables are continuous
          • Spearman’s rho - if one or both are not continuous
        • Covariance
        • Slope in regression analysis
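Covariance, Pearson correlation and the regression slope are all built from the same sums of deviations. The sketch below computes them by hand; the function name `dependence_measures` and the five (x, y) pairs are hypothetical and chosen only for illustration.

```python
from math import sqrt

def dependence_measures(xs, ys):
    """Sample covariance, Pearson r, and least-squares slope of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return {
        "covariance": sxy / (n - 1),
        "pearson_r": sxy / sqrt(sxx * syy),
        "slope": sxy / sxx,
    }

# Hypothetical illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(dependence_measures(x, y))
```

For these values the covariance is 1.5, the slope is 0.6 and Pearson's r is about 0.77. Unlike the covariance and slope, r does not depend on the units of x and y.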

Visual

  • Histogram: represents distribution of numerical data

    • Bin the range of values by dividing entire range into a series of intervals and count how many values fall into each interval

    • Usually Consecutive, non-overlapping, adjacent, often equal size

    • If the bins are of equal size:

      • Rectangle is erected over the bin with height proportional to frequency (no of values in each bin)
    • If the bins are not of equal size:

      • Rectangle is erected so that area is proportional to frequency of values in the bin. Thus, vertical axis represents frequency density
    • Rectangles touch each other to indicate continuous data

    • A histogram differs from a bar chart, which is used for categorical data. Bar charts may leave gaps between bars to indicate visually that they are bar charts

    • Examples

      [Figure: example histogram shapes: symmetric unimodal, skewed right, skewed left, bimodal, multimodal, symmetric]
    • Plotting with different bin widths can reveal different features of the data
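For unequal bins, the height of each rectangle is the frequency density (frequency divided by bin width). A minimal sketch, assuming the first two classes of the marks table (1–10 and 11–20) are merged into one wider bin:

```python
# (lower edge, upper edge, frequency); the first bin merges classes 1-10 and 11-20
bins = [(0, 20, 4), (20, 30, 4), (30, 40, 6)]

# Height of each rectangle: frequency per unit of width
densities = [freq / (upper - lower) for lower, upper, freq in bins]

# Area of each rectangle recovers the frequency
areas = [d * (upper - lower) for d, (lower, upper, _) in zip(densities, bins)]

print(densities)
print(areas)
```

The merged bin is drawn at height 0.2 rather than 4, so its area still represents 4 observations.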

    • Five-number Summary

      • A set of descriptive statistics that provides information about a dataset

      • Consists of five sample percentiles for a univariate variable

        • Sample minimum
        • Lower / 1st Quartile
        • Median / middle value
        • Upper / 3rd Quartile
        • Sample Maximum
      • Provides information about location (median); spread (quartile); range (min and max)

      • Example

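        • Using NumPy:

          import numpy as np

          def fivenum(data):
              """Five-number summary: minimum, Q1, median, Q3, maximum."""
              # 'midpoint' averages the two nearest observations when a
              # percentile falls between them. NumPy >= 1.22 takes this option
              # as `method=`; older versions used `interpolation=`.
              return np.percentile(data, [0, 25, 50, 75, 100], method='midpoint')

          # Number of moons of each planet in the Solar System
          moons = [0, 0, 1, 2, 63, 61, 27, 13]

          print(fivenum(moons))
          # Prints the five values 0, 0.5, 7.5, 44 and 63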
          
        • Find 5 number summary of 3, 1, 2, 8, 5, 10, 3

          • Nums = 3, 1, 2, 8, 5, 10, 3
          • Ordered Nums = 1, 2, 3, 3, 5, 8, 10
          • Min = 1; Max = 10; Median (Q2) = 3
          • Q1 is median of 1, 2, 3 = 2 [25% data below this value]
          • Q3 is median of 5, 8, 10 = 8 [75% data below this value]
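The hand calculation above follows the "median of each half" rule, which can be sketched as below. `five_number_summary` is a hypothetical helper name; it assumes at least two values and, for an odd number of values, leaves the median itself out of both halves.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values before the median position
    upper = data[-half:]   # values after the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])

print(five_number_summary([3, 1, 2, 8, 5, 10, 3]))  # (1, 2, 3, 8, 10)
```

Library routines that interpolate between observations, such as `np.percentile`, can return slightly different quartiles for the same data.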
  • Box Plot

    • Used to quickly compare the spread of datasets
      • Boxplot with whiskers from minimum to maximum
      • Same boxplot with whiskers limited to 1.5 IQR
        • Above the upper quartile, a distance of 1.5 times the IQR is measured out, and a whisker is drawn up to the largest observed point that falls within this distance.
        • Similarly, a distance of 1.5 times the IQR is measured out below the lower quartile, and a whisker is drawn down to the smallest observed point that falls within this distance.
        • All other observed points are plotted as outliers
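The 1.5 IQR rule can be sketched as follows, using the pet data with the 18-pet student as input. The function name is hypothetical, and the quartiles come from `statistics.quantiles` with `method="inclusive"`, which is only one of several quartile conventions.

```python
from statistics import quantiles

def whiskers_and_outliers(data, k=1.5):
    """Return (lower whisker, upper whisker, outliers) for a box plot."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
    # Whiskers end at the most extreme observations still inside the fences
    return min(inside), max(inside), outliers

pets_with_outlier = [0] * 11 + [1] * 4 + [2] * 3 + [3, 4, 18]

print(whiskers_and_outliers(pets_with_outlier))  # (0, 4, [18])
```

Here Q1 = 0 and Q3 = 2, so the fences sit at −3 and 5: the whiskers run from 0 to 4 and the value 18 is plotted as an outlier.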

Measure of Shape

  • Histogram / Box Plot
    • Symmetrical
      • Mean = Median = Mode
    • Right Skewed
      • Median < Mean
      • Longer Whisker on the right of Box Plot
    • Left Skewed
      • Mean < Median
      • Longer whisker on the left of Box Plot
  • Examples:
    • Symmetrical or Bell Curve: Scores in Exam; Heights of persons
    • Left Skewed: Age of death
    • Right Skewed: Distribution of wealth
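The mean and median comparison can be checked numerically, together with a moment-based skewness coefficient. This is a sketch under assumptions: `skewness` implements the population form m3 / m2^1.5, which is one common definition among several, and the left-skewed data set is simply the pet data mirrored.

```python
from statistics import mean, median

def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(data)
    mu = sum(data) / n
    m2 = sum((x - mu) ** 2 for x in data) / n
    m3 = sum((x - mu) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

right_skewed = [0] * 11 + [1] * 4 + [2] * 3 + [3, 4]  # pet data from earlier
left_skewed = [4 - x for x in right_skewed]           # mirrored copy
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, mean(data), median(data), round(skewness(data), 2))
```

The right-skewed data has mean above median and positive skewness, the mirrored data has the opposite, and the symmetric data has mean equal to median and zero skewness.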

Outliers

  • Values that are far from rest of dataset
  • May just look at the histogram and observe if the value is far from other values
  • Outliers may distort summary statistics. For example, the mean salary of a group of CEOs is dominated by a single value when one company is Apple and the rest are small businesses; the standard deviation is not a good measure in this case either
  • If an outlier is a typo, correct or remove it
  • Report the five-number summary when outliers are present, since a single number may be misleading

Guidelines

  • Plot the data
  • If outliers then handle
  • If Symmetrical, Bell Shaped
    • Mean and Standard Deviation
  • If Skewed
    • Five number summary