Outliers

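This post covers four ways to detect outliers in a one-dimensional numeric sample: a scatter plot, a standard-deviation rule, boxplots, and Z-scores.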

  • https://towardsdatascience.com/5-ways-to-detect-outliers-that-every-data-scientist-should-know-python-code-70a54335a623

  • https://medium.datadriveninvestor.com/finding-outliers-in-dataset-using-python-efc3fce6ce32

import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt

from scipy import stats

Data

# Scale standard-normal samples by 20 and shift them by 20,
# giving a mean of about 20 and a standard deviation of about 20
data = np.random.randn(50000) * 20 + 20
data
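Because `np.random.randn` draws fresh values on every run, the plots and outlier counts change each time. The sketch below shows one way to make the sample repeatable, assuming NumPy's `default_rng` generator; the name `seeded_data` and the seed value 42 are illustrative choices, not part of the original notebook.

```python
import numpy as np

# Assumption: seed 42 is arbitrary; any fixed seed makes the run repeatable.
rng = np.random.default_rng(42)

# 50,000 draws from a normal distribution with mean 20 and standard deviation 20,
# the same scale and shift applied to np.random.randn above.
seeded_data = rng.normal(loc=20, scale=20, size=50_000)

print(seeded_data.mean(), seeded_data.std())
```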

Method 1 - Scatter Plot

plt.scatter(x=range(len(data)), y=data);

Method 2 - Standard Deviation

  • If the data are approximately normally distributed, about 68% of the values lie within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations.

  • https://miro.medium.com/max/1400/1*rV7rq7F_uB5gwjzzGJ9VqA.png

def outliers_SD(data):
    # Define a list to accumulate anomalies
    anomalies = []

    # Set the lower and upper limits to 3 standard deviations from the mean
    data_std = np.std(data)
    data_mean = np.mean(data)

    lower_limit = data_mean - data_std * 3
    upper_limit = data_mean + data_std * 3

    # Collect every value that falls outside the limits
    for value in data:
        if value > upper_limit or value < lower_limit:
            anomalies.append(value)
    return anomalies

outliers = outliers_SD(data)

plt.scatter(range(len(outliers)), outliers);
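The 68% / 95% / 99.7% figures quoted for this method can be checked empirically. The sketch below counts the share of a normal sample that falls within one, two, and three standard deviations of its mean; `fraction_within`, `sample`, and the seed are hypothetical names and values introduced for illustration.

```python
import numpy as np

def fraction_within(values, k):
    """Fraction of values within k standard deviations of the mean."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    return float(np.mean(np.abs(values - mean) <= k * std))

# Assumption: a seeded normal sample with the same mean and spread as `data`.
rng = np.random.default_rng(0)
sample = rng.normal(loc=20, scale=20, size=50_000)

for k in (1, 2, 3):
    print(f"within {k} SD: {fraction_within(sample, k):.4f}")
```

With 50,000 normal draws, roughly 0.3% of them (on the order of 135 points) are expected to fall outside three standard deviations even though nothing is wrong with them, so the rule flags candidates for inspection rather than proving that a value is erroneous.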

Method 3 - Boxplots

  • https://miro.medium.com/max/1280/1*AU07MCIdvUnjskY1XH9auw.png

  • https://miro.medium.com/max/1400/1*J5Xm0X-phCJJ-DKZMZ_88w.png

sns.boxplot(data=data);
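A boxplot shows outliers only visually. To get the flagged values as an array, the usual rule behind the plot can be applied directly: points more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile are drawn as individual markers. The sketch below assumes that 1.5 x IQR convention, which is the usual default for boxplot whiskers; `outliers_iqr` and `whis` are hypothetical names, and the quartiles come from `np.percentile`, whose interpolation may differ slightly from what a given plotting library uses.

```python
import numpy as np

def outliers_iqr(values, whis=1.5):
    """Values lying more than whis * IQR outside the first or third quartile."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    return values[(values < lower_fence) | (values > upper_fence)]

# Small illustrative sample with one obvious extreme value.
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
print(outliers_iqr(sample))  # -> [100.]
```

Unlike the standard-deviation rule, the quartiles are barely affected by the extreme values themselves, which makes this rule less sensitive to the outliers it is trying to find.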

Method 4 - Using Z-score

  • The Z-score expresses each value as the number of standard deviations it lies from the mean. Standardizing gives the data a mean of 0 and a standard deviation of 1; it does not change the shape of the distribution, so it does not by itself make the data normal.

  • Re-scale and center the data, then look for points that lie far from zero. Those points are treated as outliers.

  • In most cases a threshold of 3 is used: a point whose Z-score is greater than 3 or less than -3 is flagged as an outlier.

  • $ \text{Z-score} = \frac{\text{Observation} - \text{Mean}}{\text{Standard Deviation}} $

  • $ z = \frac{X - \mu}{\sigma} $
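
As a worked example, take the nominal mean of 20 and standard deviation of 20 used to generate the data above, and a hypothetical observation of 85 (an illustrative value, not taken from the sample):

  • $ z = \frac{85 - 20}{20} = 3.25 $

Because 3.25 is greater than 3, this observation would be flagged. A hypothetical observation of 70 gives $ z = \frac{70 - 20}{20} = 2.5 $, which stays inside the threshold.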

def outliers_zscore(data, threshold=3):
    outliers = []

    mean = np.mean(data)
    std = np.std(data)

    for x in data:
        # Number of standard deviations between x and the mean
        z = (x - mean) / std

        if np.abs(z) > threshold:
            outliers.append(x)

    return outliers

outliers = outliers_zscore(data)

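plt.scatter(range(len(outliers)), outliers);

# Same check, vectorized with scipy.stats.zscore
def outliers_zscore(data, threshold=3):
    z = stats.zscore(data)
    z = np.abs(z)
    # Use the threshold argument rather than a hard-coded 3
    outliers_idx = np.where(z > threshold)

    outliers = data[outliers_idx]
    return outliers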

outliers = outliers_zscore(data)

plt.scatter(range(len(outliers)), outliers);
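
With a threshold of 3, the Z-score test is algebraically the same criterion as the three-standard-deviation limits of Method 2: |z| > 3 exactly when the value lies outside mean ± 3 standard deviations. The sketch below checks this on a seeded sample using plain NumPy; `sd_rule_mask`, `zscore_mask`, and `sample` are hypothetical names, and agreement is only guaranteed up to floating-point rounding for values sitting exactly on the boundary.

```python
import numpy as np

def sd_rule_mask(values, k=3):
    """True where a value lies outside mean +/- k standard deviations."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    return (values > mean + k * std) | (values < mean - k * std)

def zscore_mask(values, threshold=3):
    """True where the absolute Z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Assumption: a seeded normal sample with the same mean and spread as `data`.
rng = np.random.default_rng(1)
sample = rng.normal(loc=20, scale=20, size=50_000)

same = np.array_equal(sd_rule_mask(sample), zscore_mask(sample))
print(same, int(zscore_mask(sample).sum()))
```

Returning a boolean mask instead of a list of values also keeps the positions of the flagged points, which is convenient when the outliers need to be removed from, or highlighted in, the original array.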