Descriptive Statistics


This post covers an introduction to descriptive statistics.

Descriptive statistics

  • A descriptive statistic is a summary statistic that quantitatively describes or summarises features of a collection of information

  • Descriptive statistics is distinguished from inferential statistics (or inductive statistics) by its aim to summarise a sample, rather than use the data to learn about the population that the sample of data is thought to represent.

    • Inferential Statistics
      • Population: group of interest
      • Sample: subset of population
      • Statistic: numeric summary from sample
      • Parameter: numeric summary of the population
  • Measures that are commonly used to describe a data set are

    • measures of central tendency and
      • mean, median and mode
    • measures of variability or dispersion
      • standard deviation (or variance), the minimum/maximum values of the variables, kurtosis, and skewness
  • Use in Statistical Analysis

    • Descriptive statistics provide summaries of the sample or observations, either quantitative (summary statistics) or visual (plots)
    • Together, these summarisation techniques are often grouped under Exploratory Data Analysis (EDA)
  • Univariate Analysis

    • Describe the distribution of single variable: central tendency/dispersion
      • Mean, median, mode, range, quartile, variance, standard deviation, skewness, histogram
  • Bivariate and Multivariate Analysis

    • Descriptive statistics may also be used to describe the relationship between pairs of variables

      • Scatter plots, Cross Tabulation or Contingency Table

        Sales    Product A    Product B    Total
        2019          1000         1200     2200
        2020           800         1500     2300
        Total         1800         2700     4500
      • Quantitative Measures of Dependence

        • Correlation
          • Pearson - when both variables are continuous
          • Spearman’s rho - if one or both are not continuous
        • Covariance
        • Slope in regression analysis
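The dependence measures listed above can be sketched with the standard library alone. This is a minimal illustration, not a reference implementation: the helper names and the `hours`/`score` data are hypothetical, the covariance uses the sample (n - 1) convention, and Spearman's rho is computed as the Pearson correlation of average ranks.

```python
from statistics import mean

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

def slope(xs, ys):
    """Least-squares slope of y regressed on x: cov(x, y) / var(x)."""
    return covariance(xs, ys) / covariance(xs, xs)

# Hypothetical data: hours studied and test score
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

print(covariance(hours, score), pearson(hours, score))
print(spearman(hours, score), slope(hours, score))
```

All three quantities share the same numerator (the covariance); they differ only in how it is scaled. The sketch assumes neither variable is constant, since a zero variance would cause a division by zero.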

Data Types

  • The data type determines which analyses and plots are appropriate

  • Quantitative

    • Numeric values that allow mathematical operations
    • Discrete
      • Number of students
    • Continuous
      • e.g. age, since age in years can be subdivided into months, days, hours, and so on
  • Categorical

    • Group or set of items

    • Nominal (no order)

      • Gender
    • Ordinal (ordered)

      • e.g. rating: F to A*
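As a rough illustration of why the data type matters, the sketch below picks a different summary for each type. The data and the seven-step grade scale are hypothetical; only the end points F and A* come from the text above.

```python
from statistics import mean, median_low, mode

# Quantitative (discrete): arithmetic is meaningful, so a mean makes sense
class_sizes = [28, 31, 25, 30]
mean_size = mean(class_sizes)

# Categorical, nominal: no order, so only counts and the mode are meaningful
genders = ["F", "M", "F", "F", "M"]
most_common = mode(genders)

# Categorical, ordinal: order is meaningful, so a median can be taken
# by mapping each grade to its position on the (assumed) scale
grade_order = ["F", "E", "D", "C", "B", "A", "A*"]
grades = ["B", "A", "C", "B", "A*"]
positions = [grade_order.index(g) for g in grades]
median_grade = grade_order[median_low(positions)]

print(mean_size, most_common, median_grade)
```

`median_low` is used so that the result is always an actual position on the scale, never an average of two positions.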

For Discrete or Continuous Quantitative data

  • Four aspects to describe: center, spread, shape, and outliers

  • Random Variable, $X$

  • Observed value of random variable, $x_i$

    Measure of Center

    • Mean
      • Sum/Count, $\bar{x} = \frac{1}{n}\sum\limits_{i=1}^{n}{x_i}$
    • Median
      • Middle value of a data set
    • Mode
      • Most frequent number in the data set
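The three measures of center can be computed with Python's standard `statistics` module. The sketch below uses the same seven numbers as the five-number summary exercise later in this post.

```python
from statistics import mean, median, mode

nums = [3, 1, 2, 8, 5, 10, 3]

center_mean = mean(nums)      # sum / count = 32 / 7
center_median = median(nums)  # middle value of the sorted data
center_mode = mode(nums)      # most frequent value

print(center_mean, center_median, center_mode)
```

Here the mean (about 4.57) is pulled above the median (3) by the two large values 8 and 10.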

    Measure of Spread

    • Numeric Measure

      • Range
        • Maximum - Minimum
      • Interquartile Range
        • Q3 - Q1
      • Standard Deviation
        • Single value that measures spread of the data
        • On average, how much each value varies from the mean of the values i.e. average variation
        • Compute $\bar{x} = \frac{1}{n}\sum\limits_{i=1}^{n}{x_i}$
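        • A naive attempt averages the raw deviations: $\frac{1}{n}\sum\limits_{i=1}^{n}(x_i - \bar{x})$
        • Problem: the negative and positive deviations cancel, so this average is always zero
        • Instead, square the deviations (this gives the variance) and take the square root: $SD = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(x_i - \bar{x})^2}$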
      • Variance
        • Single value that measures spread of the data
        • Average squared difference of each value from the mean
        • $Variance = \frac{1}{n}\sum\limits_{i=1}^{n}(x_i - \bar{x})^2$
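A small worked example of the numeric measures of spread, following the 1/n formulas above. The data values are hypothetical, chosen so that the arithmetic comes out in round numbers.

```python
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
n = len(data)
x_bar = sum(data) / n            # mean = 5.0

# Averaging raw deviations is useless: positives and negatives cancel
mean_deviation = sum(x - x_bar for x in data) / n

# Squaring removes the sign; the square root restores the original units
variance = sum((x - x_bar) ** 2 for x in data) / n
sd = sqrt(variance)

value_range = max(data) - min(data)

print(mean_deviation, variance, sd, value_range)
```

Note that many libraries default to the sample convention (dividing by n - 1) instead of n, so their results may differ slightly from this formula.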
    • Visual

      • Histogram: represents distribution of numerical data

        • Bin the range of values by dividing entire range into a series of intervals and count how many values fall into each interval

        • Bins are usually consecutive, non-overlapping, and adjacent, and often of equal size

        • If the bins are of equal size:

          • A rectangle is erected over each bin with height proportional to the frequency (the number of values in the bin)
        • If the bins are not of equal size:

          • A rectangle is erected so that its area is proportional to the frequency of values in the bin. The vertical axis then represents frequency density
        • The rectangles touch each other to indicate that the data is continuous

        • This differs from a bar chart, which is for categorical data. Bar charts may leave gaps between bars to signal visually that they are bar charts

        • Examples

          [Figure: example histogram shapes: symmetric unimodal, skewed right, skewed left, bimodal, multimodal, symmetric]
        • Plotting the same data with different bin widths can reveal different features of the data
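The binning step behind a histogram can be sketched without a plotting library. The `bin_counts` helper and the `ages` data are hypothetical; the sketch assumes half-open bins, with the last bin also including its right edge. It shows both cases described above: equal bins, where height is the count, and unequal bins, where height is the frequency density.

```python
def bin_counts(values, edges):
    """Count values per bin [edges[i], edges[i+1]); the last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts

ages = [1, 3, 4, 7, 8, 12, 15, 18, 25, 40]  # hypothetical data

# Equal-width bins: bar height is simply the count
equal_edges = [0, 10, 20, 30, 40]
equal_counts = bin_counts(ages, equal_edges)

# Unequal bins: bar height is count / width, so that area equals the count
unequal_edges = [0, 5, 10, 20, 40]
unequal_counts = bin_counts(ages, unequal_edges)
densities = [
    count / (hi - lo)
    for count, lo, hi in zip(unequal_counts, unequal_edges, unequal_edges[1:])
]

print(equal_counts, unequal_counts, densities)
```

With the unequal bins, the widest bin (20 to 40) holds two values but gets the lowest bar, because those two values are spread over a width of 20.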

        • Five-number Summary

          • A set of descriptive statistics that provides information about a dataset

          • Consists of five sample percentiles of a single (univariate) variable

            • Sample minimum
            • Lower / 1st Quartile
            • Median / middle value
            • Upper / 3rd Quartile
            • Sample Maximum
          • Provides information about location (median), spread (quartiles), and range (minimum and maximum)

          • Example

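              import numpy as np

              def fivenum(data):
                  """Five-number summary: min, Q1, median, Q3, max."""
                  # 'midpoint' averages the two neighbouring values when a percentile falls between them.
                  # NumPy 1.22+ names this keyword `method`; older versions call it `interpolation`.
                  return np.percentile(data, [0, 25, 50, 75, 100], method='midpoint')

              # Number of moons of each planet in the Solar System
              moons = [0, 0, 1, 2, 63, 61, 27, 13]

              print(fivenum(moons))
              # Prints an array with the values 0, 0.5, 7.5, 44, 63
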
            • Find the five-number summary of 3, 1, 2, 8, 5, 10, 3

              • Nums = 3, 1, 2, 8, 5, 10, 3
              • Ordered Nums = 1, 2, 3, 3, 5, 8, 10
              • Min = 1; Max = 10; Median (Q2) = 3
              • Q1 is the median of the lower half 1, 2, 3 = 2 (about 25% of the data lies below this value)
              • Q3 is the median of the upper half 5, 8, 10 = 8 (about 75% of the data lies below this value)
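The manual steps above (sort, take the median, then take the median of each half) can be sketched as a small function. The name `five_number_summary` is hypothetical, and this "median of halves" rule is only one of several quartile conventions; it assumes at least two values.

```python
from statistics import median

def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves rule.

    For an odd number of values the overall median is excluded from both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

print(five_number_summary([3, 1, 2, 8, 5, 10, 3]))
print(five_number_summary([0, 0, 1, 2, 63, 61, 27, 13]))
```

For these two datasets the rule gives the same values as the hand calculation and as the midpoint-percentile approach, although different conventions can disagree on other data.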
    • Box Plot

      • Used to quickly compare the spread of datasets
        • Boxplot with whiskers from minimum to maximum
        • Same boxplot with whiskers limited to at most 1.5 IQR
          • Above the upper quartile, a distance of 1.5 times the IQR is measured out, and a whisker is drawn up to the largest observed point that falls within this distance.
          • Similarly, a distance of 1.5 times the IQR is measured out below the lower quartile, and a whisker is drawn down to the smallest observed point that falls within this distance.
          • All other observed points are plotted as outliers
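The 1.5 IQR whisker rule can be sketched as follows. The function name and the salary data are hypothetical, and the quartiles come from `statistics.quantiles` with the "inclusive" method, which is one convention among several.

```python
from statistics import quantiles

def whiskers_and_outliers(values):
    """Return (low whisker, high whisker, outliers) under the 1.5 IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observed points inside the fences
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return min(inside), max(inside), outliers

salaries = [30, 32, 35, 36, 38, 40, 42, 120]  # hypothetical, in thousands
low_whisker, high_whisker, outliers = whiskers_and_outliers(salaries)

print(low_whisker, high_whisker, outliers)
```

The whiskers stop at observed data points (30 and 42 here), not at the fences themselves, and the value 120 is reported separately as an outlier.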

    Measure of Shape

    • Histogram / Box Plot
      • Symmetrical
        • Mean ≈ Median ≈ Mode (for a unimodal distribution)
      • Right Skewed
        • Median < Mean
        • Longer Whisker on the right of Box Plot
      • Left Skewed
        • Mean < Median
        • Longer whisker on the left of Box Plot
    • Examples:
      • Symmetrical or Bell Curve: Scores in Exam; Heights of persons
      • Left Skewed: Age of death
      • Right Skewed: Distribution of wealth
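The mean-versus-median comparison above can be turned into a rough shape check. This is only a heuristic sketch: the `shape_hint` function, its 5% tolerance, and all three datasets are assumptions made for illustration, and a histogram remains the more reliable guide.

```python
from statistics import mean, median

def shape_hint(values, tolerance=0.05):
    """Guess the shape from mean vs median.

    The gap is compared against a fraction (tolerance) of the data range;
    the 5% default is an arbitrary choice.
    """
    m, med = mean(values), median(values)
    spread = max(values) - min(values)
    if spread == 0 or abs(m - med) <= tolerance * spread:
        return "roughly symmetric"
    return "right skewed" if m > med else "left skewed"

exam_scores = [55, 60, 65, 70, 75, 80, 85]  # hypothetical
age_at_death = [2, 65, 70, 72, 75, 78, 80]  # hypothetical
wealth = [1, 2, 2, 3, 3, 4, 50]             # hypothetical

for name, values in [("exam", exam_scores), ("age", age_at_death), ("wealth", wealth)]:
    print(name, shape_hint(values))
```

The long tail drags the mean toward it while the median stays put, which is why the mean falls below the median for the left-skewed ages and above it for the right-skewed wealth figures.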

    Outliers

    • Values that are far from the rest of the dataset

    • A histogram often reveals them: look for values that lie far from the others

    • Outliers can distort summary statistics, e.g. the mean salary of CEOs when one company is Apple and the rest are small-scale firms. The standard deviation is not a good measure in this case either

    • If an outlier is a typo, correct or remove it

    • Report the five-number summary when outliers are present, since a single number may be misleading
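The CEO salary scenario can be illustrated with made-up numbers. The figures below are hypothetical (in thousands) and only show how a single extreme value moves the mean and standard deviation while barely moving the median.

```python
from statistics import mean, median, pstdev

small_firms = [90, 100, 110, 120, 130]  # hypothetical CEO salaries, in thousands
with_giant = small_firms + [100_000]    # one very large company added

mean_before, mean_after = mean(small_firms), mean(with_giant)
median_before, median_after = median(small_firms), median(with_giant)
sd_before, sd_after = pstdev(small_firms), pstdev(with_giant)

print(mean_before, mean_after)      # the mean jumps by orders of magnitude
print(median_before, median_after)  # the median barely moves
print(sd_before, sd_after)          # the standard deviation is inflated too
```

After the outlier is added, neither the mean nor the standard deviation describes a typical firm any more, whereas the median still does.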

  • Guidelines

    • Plot the data
    • If there are outliers, handle them
    • If the distribution is symmetrical and bell-shaped
      • Report the mean and standard deviation
    • If the distribution is skewed
      • Report the five-number summary
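The last two guidelines can be sketched as a single function that chooses which summary to report. Everything here is an assumption made for illustration: the `summarise` name, the use of the moment coefficient of skewness, the 0.5 threshold, and the data. Plotting the data and handling outliers still have to happen first.

```python
from statistics import mean, pstdev, quantiles

def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (0 for symmetric data)."""
    m = mean(data)
    m2 = sum((x - m) ** 2 for x in data) / len(data)
    m3 = sum((x - m) ** 3 for x in data) / len(data)
    return m3 / m2 ** 1.5 if m2 else 0.0

def summarise(data, skew_threshold=0.5):
    """Mean and SD for roughly symmetric data, five-number summary otherwise.

    The threshold is an arbitrary cut-off, not a standard value.
    """
    if abs(skewness(data)) < skew_threshold:
        return {"mean": mean(data), "sd": pstdev(data)}
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return {"min": min(data), "q1": q1, "median": q2, "q3": q3, "max": max(data)}

heights = [160, 165, 170, 175, 180]  # hypothetical, symmetric
wealth = [1, 2, 2, 3, 3, 4, 50]      # hypothetical, right skewed

print(summarise(heights))
print(summarise(wealth))
```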