Visualization

This post covers Bivariate Visualization.

Bivariate

  • Datasets: /posts/python/setup
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np

Bivariate Plots

  • Scatter plot
    • Quantitative vs Quantitative
  • Violin Plot
    • Quantitative vs Qualitative
  • Clustered Bar chart
    • Qualitative vs Qualitative

Scatterplot and Correlation

Source: Wikipedia

  • Examples of scatter diagrams with different values of correlation coefficient (ρ)

Source: Wikipedia

  • Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the strength and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0 but in that case the correlation coefficient is undefined because the variance of Y is zero.
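
The caption's point, that the correlation coefficient reflects the strength and direction of a linear relationship but not its slope, can be checked numerically. The following is a minimal sketch using `np.corrcoef`; the arrays are synthetic and made up for illustration (they are not part of the diabetes dataset), and `pearson_r` is a hypothetical helper name.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)

def pearson_r(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

# Two perfectly linear relationships with very different slopes:
# the correlation is 1 in both cases, so it says nothing about the slope.
r_steep = pearson_r(x, 10.0 * x)
r_shallow = pearson_r(x, 0.1 * x)

# A negative slope flips the sign of the correlation.
r_negative = pearson_r(x, -2.0 * x)

# A strong but nonlinear (symmetric) relationship: y is fully determined by x,
# yet the linear correlation is essentially zero.
r_parabola = pearson_r(x, x ** 2)

print(r_steep, r_shallow, r_negative, r_parabola)
```
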
df = pd.read_csv('./data/diabetes/diabetes.csv')
print(df.shape)
print(df.info())

Scatter plot: Quantitative vs Quantitative

plt.scatter(data=df, x='BMI', y='SkinThickness');
plt.xlabel('BMI', fontsize=16)
plt.ylabel('SkinThickness', fontsize=16)
plt.show()

plt.scatter(data=df, x='Glucose', y='Insulin');
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('Insulin', fontsize=16)
plt.show()

plt.scatter(data=df, x='Glucose', y='BloodPressure');
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('BloodPressure', fontsize=16)
plt.show()

sns.regplot(data=df, x='BMI', y='SkinThickness');
plt.xlabel('BMI', fontsize=16)
plt.ylabel('SkinThickness', fontsize=16)
plt.show()

sns.regplot(data=df, x='Glucose', y='Insulin');
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('Insulin', fontsize=16)
plt.show()

sns.regplot(data=df, x='Glucose', y='BloodPressure');
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('BloodPressure', fontsize=16);
# No explanation

sns.regplot(data=df, x='Outcome', y='SkinThickness', fit_reg=False);
plt.xlabel('Outcome', fontsize=16)
plt.ylabel('SkinThickness', fontsize=16)
plt.show()

sns.regplot(data=df, x='Glucose', y='BloodPressure', fit_reg=False);
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('BloodPressure', fontsize=16);
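
With its default settings, `sns.regplot` overlays a fitted straight line on the scatter plot, and `fit_reg=False` suppresses it. As a rough sketch of what such a line represents, the snippet below fits an ordinary least-squares line with `np.polyfit`. The data is synthetic, and the slope and intercept used to generate it are assumptions chosen for illustration, not values from the diabetes dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 plus a little noise (illustrative only).
x = np.linspace(0.0, 10.0, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# A degree-1 polynomial fit is an ordinary least-squares straight line.
slope, intercept = np.polyfit(x, y, deg=1)

# Points on the fitted line, ready to be drawn over a scatter plot.
y_fit = slope * x + intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The recovered slope and intercept should land close to the values used to generate the data; on real data the line summarizes only the linear trend.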

Transparency and Jitter

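  • useful when too many points overlap: transparency (alpha) makes dense regions appear darker, and jitter spreads out points that share the same discrete value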
sns.regplot(data=df, x='Outcome', y='SkinThickness', x_jitter=0.1, fit_reg=False, scatter_kws={'alpha':.4});
plt.xlabel('Outcome', fontsize=16)
plt.ylabel('SkinThickness', fontsize=16)
plt.show()

sns.regplot(data=df, x='Glucose', y='BloodPressure', fit_reg=False, scatter_kws={'alpha':.4})
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('BloodPressure', fontsize=16);
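
Jitter adds a small random offset to each plotted position so that points sharing the same discrete value no longer sit exactly on top of each other; the underlying data is not changed. The sketch below imitates the idea with uniform noise in NumPy. The `jitter` helper and the toy array are hypothetical and written only for illustration; this is the same idea as `x_jitter`, not seaborn's actual implementation.

```python
import numpy as np

def jitter(values, width, rng):
    """Return a copy of `values` with uniform noise in [-width, width] added."""
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(-width, width, size=values.shape)

rng = np.random.default_rng(42)

# A binary variable such as Outcome: every point sits at exactly 0 or 1.
outcome = np.array([0, 0, 0, 1, 1, 0, 1, 0])

outcome_jittered = jitter(outcome, width=0.1, rng=rng)

# The jittered positions are used only for plotting; the offsets stay within
# +/- 0.1, so the two groups remain clearly separated.
print(np.round(outcome_jittered, 3))
```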

Heat Map

  • a 2-D histogram: the plane is divided into bins, and color encodes how many points fall in each bin (density)
  • good for discrete quantitative vs discrete quantitative
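bins_x = np.arange(0, 199+20, 20)  # edges 0, 20, ..., 200: adding the step makes the last edge cover the maximum (199)
bins_y = np.arange(0, 122+15, 15)  # edges 0, 15, ..., 135: covers the maximum (122)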

plt.hist2d(data=df, x='Glucose', y='BloodPressure', cmin=0.6, cmap='plasma_r', bins=[bins_x, bins_y])
plt.colorbar()
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('BloodPressure', fontsize=16);
df[['Glucose', 'BloodPressure']].describe()
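
A heat map of this kind is a two-dimensional histogram: each cell counts the points that fall inside one (x, y) bin, and the color encodes that count. As a sketch of the underlying computation, `np.histogram2d` returns the count matrix directly. The arrays and bin edges below are made up for illustration; the masking step mimics what `cmin` does in `plt.hist2d`, where bins with a count below `cmin` are not displayed.

```python
import numpy as np

# Small made-up sample of two discrete quantitative variables.
x = np.array([1, 1, 2, 7, 8, 8, 8])
y = np.array([1, 2, 1, 6, 7, 7, 9])

bins_x = np.arange(0, 10 + 5, 5)  # edges 0, 5, 10
bins_y = np.arange(0, 10 + 5, 5)  # edges 0, 5, 10

# counts[i, j] is the number of points in x-bin i and y-bin j.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=[bins_x, bins_y])
print(counts)

# Hide empty cells, similar in spirit to cmin=0.6: any bin whose count
# is below the threshold is replaced by NaN and left uncolored.
masked = np.where(counts < 0.6, np.nan, counts)
print(masked)
```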

Violin Plot: Quantitative vs Qualitative

  • Violin plots
    • similar to box plots, but they also show the probability density of the data at different values, usually smoothed by a kernel density estimator
    • more informative than a plain box plot
      • a box plot shows only summary statistics such as mean/median and interquartile ranges
      • a violin plot shows the full distribution of the data
    • useful when the data distribution is multimodal (more than one peak)
      • a violin plot shows the presence of the different peaks, their positions, and their relative amplitudes
sns.violinplot(data=df, x='Outcome', y='Glucose');
plt.show()

color = sns.color_palette()[0]  # use one color for both violins
sns.violinplot(data=df, x='Outcome', y='Glucose', color=color, inner=None);
plt.xticks(range(2), ['No Diab', 'Diab'], fontsize=16);
plt.show()

sns.violinplot(data=df, x='Outcome', y='Glucose', color=color, inner='quartile');
plt.xticks(range(2), ['No Diab', 'Diab'], fontsize=16);
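
To illustrate why a density view helps with multimodal data, the sketch below builds a small bimodal sample, computes the box-plot style summary (median and quartiles), and evaluates a hand-written Gaussian kernel density estimate. Everything here is an assumption for illustration: the data is synthetic, `gaussian_kde_1d` is a hypothetical helper, and the fixed bandwidth of 0.5 is arbitrary (seaborn chooses its own bandwidth).

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One Gaussian bump per sample, evaluated at every grid point.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

# Bimodal toy data: one cluster around -3 and one around +3.
samples = np.concatenate([np.linspace(-4.0, -2.0, 50), np.linspace(2.0, 4.0, 50)])

# Box-plot style summary: the median lands at 0, where there is no data at all.
q1, median, q3 = np.percentile(samples, [25, 50, 75])

# Density estimate: two clear peaks with an almost empty valley between them.
grid = np.linspace(-8.0, 8.0, 321)
density = gaussian_kde_1d(samples, grid, bandwidth=0.5)

density_at_zero = density[np.argmin(np.abs(grid))]
density_at_peak = density[np.argmin(np.abs(grid - 3.0))]
print(median, density_at_zero, density_at_peak)
```

The summary statistics alone would suggest a distribution centered on 0, while the density estimate makes the two separate peaks visible.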

Box Plot

sns.boxplot(data=df, x='Outcome', y='Glucose');

plt.xticks(range(2), ['No Diab', 'Diab'], fontsize=16);

Clustered Bar Chart: Qualitative vs Qualitative

df = pd.read_csv('./data/titanic/train.csv')
print(df.shape)
print(df.info())
def clean_gender(df):
    # Map every spelling variant to a single lowercase label.
    # Assigning the result back avoids a chained inplace replace,
    # which newer pandas versions warn about and may silently ignore.
    df['Gender'] = df['Gender'].replace(
        {'M': 'male', 'Male': 'male', 'F': 'female', 'Female': 'female'}
    )
    return df

df = clean_gender(df)
sns.countplot(data=df, x='Survived', hue='Gender');
plt.xticks(range(2), ['Not Survived', 'Survived'], fontsize=16);
plt.show()

sns.countplot(data=df, x='Survived', hue='Parch');
plt.xticks(range(2), ['Not Survived', 'Survived'], fontsize=16);
plt.show()

# Replace the numeric codes with readable labels (assigned back rather than chained inplace)
df['Survived'] = df['Survived'].replace({0: 'NotSurvived', 1: 'Survived'})

sns.countplot(data=df, x='Parch', hue='Survived');
plt.legend(loc='center right');
#plt.xticks(range(2), ['Not Survived', 'Survived'], fontsize=16);
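
A clustered bar chart is a drawing of a two-way table of counts: one bar per combination of the two categorical variables. The sketch below builds that table with `pd.crosstab` on a tiny made-up DataFrame. The rows are invented for illustration (they are not real Titanic records), and the `Survived` and `Gender` column names are assumed to match the data used above.

```python
import pandas as pd

# Tiny made-up sample (not real Titanic rows).
toy = pd.DataFrame({
    'Survived': [0, 0, 0, 1, 1, 1, 0, 1],
    'Gender': ['male', 'male', 'female', 'female', 'female', 'male', 'male', 'female'],
})

# Rows: levels of Survived; columns: levels of Gender; cells: counts.
counts = pd.crosstab(toy['Survived'], toy['Gender'])
print(counts)
```

Each cell of the table corresponds to the height of one bar in a call such as `sns.countplot(data=toy, x='Survived', hue='Gender')`.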

Faceting

  • splits the data by the levels of a categorical variable and draws one subplot per level; useful when a categorical variable has many levels
df = pd.read_csv('./data/diabetes/diabetes.csv')

bins = np.arange(0, 200+20, 20) # shared bin edges for every facet; the last edge (200) covers the largest Glucose value

#print(bins.shape)

g = sns.FacetGrid(data=df, col='Outcome', col_wrap=2, sharey=True);

g.map(plt.hist, 'Glucose', bins=bins);
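
Faceting amounts to splitting the data by the levels of a categorical variable and drawing the same plot for each subset, with shared bin edges so that the panels are comparable. A rough NumPy sketch of that idea follows; the arrays are synthetic and only stand in for the `Glucose` and `Outcome` columns.

```python
import numpy as np

# Synthetic stand-ins (illustrative only).
glucose = np.array([85, 90, 110, 150, 95, 160, 170, 120, 199, 100])
outcome = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# Shared bin edges; the last edge (200) covers the largest value.
bins = np.arange(0, 200 + 20, 20)

# One histogram per level of the categorical variable, all using the same bins.
facet_counts = {}
for level in np.unique(outcome):
    counts, _ = np.histogram(glucose[outcome == level], bins=bins)
    facet_counts[int(level)] = counts

for level, counts in facet_counts.items():
    print(level, counts)
```

Because both histograms use the same edges, their bars can be compared bin by bin, which is what a shared `bins` argument gives each panel of a `FacetGrid`.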