Outliers

Published: January 01, 2022

This post covers four ways to detect outliers in Python: a scatter plot, a standard-deviation cutoff, a boxplot, and Z-scores.

https://towardsdatascience.com/5-ways-to-detect-outliers-that-every-data-scientist-should-know-python-code-70a54335a623
https://medium.datadriveninvestor.com/finding-outliers-in-dataset-using-python-efc3fce6ce32

import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt

from scipy import stats

Data

# scale standard-normal draws by 20 and shift by 20, so the sample has mean ~20 and standard deviation ~20
data = np.random.randn(50000)  * 20 + 20
data

Method 1 - Scatter Plot

plt.scatter(x=range(len(data)), y=data);

Method 2 - Standard Deviation

If the data are approximately normally distributed, about 68% of the values lie within one standard deviation of the mean, about 95% within two, and about 99.7% within three. Values more than three standard deviations from the mean are therefore rare and can be flagged as potential outliers.
https://miro.medium.com/max/1400/1*rV7rq7F_uB5gwjzzGJ9VqA.png
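The percentages above can be checked empirically. The following is a minimal sketch, assuming a seeded sample drawn the same way as `data`; the names `sample` and `fraction_within` are illustrative and not part of the original post.

```python
import numpy as np

# Hypothetical seeded sample, generated like `data` (mean ~20, std ~20)
rng = np.random.default_rng(0)
sample = rng.standard_normal(50_000) * 20 + 20


def fraction_within(values, k):
    """Fraction of values lying within k standard deviations of the mean."""
    values = np.asarray(values)
    return np.mean(np.abs(values - values.mean()) <= k * values.std())


# For roughly normal data these should land near 0.68, 0.95 and 0.997
for k in (1, 2, 3):
    print(f"within {k} SD: {fraction_within(sample, k):.4f}")
```

With 50,000 points, about 0.3% of them (on the order of 135) are expected to fall outside three standard deviations even though nothing is wrong with them, so a 3-SD cutoff flags candidates rather than proving anything.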


def outliers_SD(data):
    # List that accumulates the anomalies
    anomalies = []

    # Set the lower and upper limits to 3 standard deviations from the mean
    data_std = np.std(data)
    data_mean = np.mean(data)

    lower_limit = data_mean - data_std * 3
    upper_limit = data_mean + data_std * 3

    # Collect every value that falls outside the limits
    for value in data:
        if value > upper_limit or value < lower_limit:
            anomalies.append(value)
    return anomalies

outliers = outliers_SD(data)

plt.scatter(range(len(outliers)), outliers);

Method 3 - Boxplots


https://miro.medium.com/max/1280/1*AU07MCIdvUnjskY1XH9auw.png

https://miro.medium.com/max/1400/1*J5Xm0X-phCJJ-DKZMZ_88w.png

sns.boxplot(data=data);
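The boxplot shows outliers visually but does not return them. By default, seaborn draws the whiskers out to 1.5 times the interquartile range (IQR) beyond the quartiles and plots anything further away as individual points. The same rule can be applied directly to obtain the flagged values. This is a minimal sketch; the function name `outliers_iqr`, the parameter `k`, and the `toy` array are assumptions made for illustration.

```python
import numpy as np


def outliers_iqr(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    values = np.asarray(values)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Boolean mask selects points outside the [lower, upper] fences
    return values[(values < lower) | (values > upper)]


# Small illustrative array with one obvious outlier
toy = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(outliers_iqr(toy))  # [100]
```

Calling `outliers_iqr(data)` on the sample above should flag roughly the points drawn beyond the whiskers in the plot.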

Method 4 - Using Z score

A Z-score expresses how many standard deviations an observation lies from the mean. Standardizing the data this way re-centers it to mean 0 and re-scales it to standard deviation 1. It does not change the shape of the distribution, so the result is standard normal only if the data were normal to begin with.
After standardizing, look for data points that are too far from zero; these are treated as outliers.
A threshold of 3 is the most common choice: a data point is flagged as an outlier if its Z-score is greater than 3 or less than -3.
$ \text{Z-score} = \frac{\text{Observation} - \text{Mean}}{\text{Standard Deviation}} $
$ z = \frac{X - \mu}{\sigma} $

def outliers_zscore(data, threshold=3):
    outliers=[]
    
    mean = np.mean(data)
    std = np.std(data)
    
    for x in data:
        z = (x - mean)/std
        
        if np.abs(z) > threshold:
            outliers.append(x)
            
    return outliers

outliers = outliers_zscore(data)

plt.scatter(range(len(outliers)), outliers);

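def outliers_zscore(data, threshold=3):
    # Vectorized version of the function above, using scipy.stats.zscore
    z = np.abs(stats.zscore(data))
    # Use the threshold argument rather than a hard-coded 3
    outliers_idx = np.where(z > threshold)

    # data must be a NumPy array so it can be indexed by position
    outliers = data[outliers_idx]
    return outliers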

outliers = outliers_zscore(data)

plt.scatter(range(len(outliers)), outliers);
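The loop-based and SciPy-based versions should flag the same points, because `np.std` and `stats.zscore` both use the population standard deviation (`ddof=0`) by default. The following is a minimal sketch of that check, assuming a small seeded sample; the names `sample`, `manual_z`, and `scipy_z` are illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical seeded sample, generated like `data`
rng = np.random.default_rng(0)
sample = rng.standard_normal(1_000) * 20 + 20

# Z-scores computed by hand, as in the loop-based function
manual_z = (sample - np.mean(sample)) / np.std(sample)

# Z-scores computed by SciPy, as in the vectorized function
scipy_z = stats.zscore(sample)

# Both approaches agree up to floating-point error
print(np.allclose(manual_z, scipy_z))  # True
```

If a sample standard deviation is wanted instead, pass `ddof=1` to both `np.std` and `stats.zscore` so the two stay consistent.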
