Bivariate Visualization

This post covers Bivariate Visualization.

• Datasets: /posts/python/setup
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np


Bivariate Plots

• Scatter plot: quantitative vs. quantitative
• Violin plot: quantitative vs. qualitative
• Clustered bar chart: qualitative vs. qualitative

Scatterplot and Correlation

Source: Wikipedia

• Examples of scatter diagrams with different values of correlation coefficient (ρ)

Source: Wikipedia

• Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the strength and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0 but in that case the correlation coefficient is undefined because the variance of Y is zero.
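
To make the caption concrete, here is a minimal sketch on synthetic data. The arrays and the `pearson` helper are illustrative and not part of the original notebook. It computes the Pearson coefficient with `np.corrcoef` for two lines with very different slopes and for a parabola.

```python
import numpy as np

# Synthetic x values, symmetric around zero (illustrative data only)
x = np.linspace(-1, 1, 201)

steep = 5.0 * x      # strong slope, perfectly linear
shallow = 0.1 * x    # weak slope, still perfectly linear
parabola = x ** 2    # clear relationship, but not a linear one


def pearson(a, b):
    """Return the Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_steep = pearson(x, steep)
r_shallow = pearson(x, shallow)
r_parabola = pearson(x, parabola)

print(round(r_steep, 3), round(r_shallow, 3), round(abs(r_parabola), 3))
```

Both lines give a coefficient of 1 regardless of slope, while the parabola gives a coefficient of about 0 even though y is fully determined by x.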
df = pd.read_csv('./data/diabetes/diabetes.csv')
print(df.shape)
print(df.info())


Scatter Plot: Quantitative vs Quantitative

plt.scatter(data=df, x='BMI', y='SkinThickness');
plt.xlabel('BMI', fontsize=16)
plt.ylabel('SkinThickness', fontsize=16)
plt.show()

plt.scatter(data=df, x='Glucose', y='Insulin');
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('Insulin', fontsize=16)
plt.show()

plt.scatter(data=df, x='Glucose', y='BloodPressure');
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('BloodPressure', fontsize=16)

sns.regplot(data=df, x='BMI', y='SkinThickness');
plt.xlabel('BMI', fontsize=16)
plt.ylabel('SkinThickness', fontsize=16)
plt.show()

sns.regplot(data=df, x='Glucose', y='Insulin');
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('Insulin', fontsize=16)
plt.show()

sns.regplot(data=df, x='Glucose', y='BloodPressure');
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('BloodPressure', fontsize=16);

# fit_reg=False draws only the scatter points, without the fitted regression line

sns.regplot(data=df, x='Outcome', y='SkinThickness', fit_reg=False);
plt.xlabel('Outcome', fontsize=16)
plt.ylabel('SkinThickness', fontsize=16)
plt.show()

sns.regplot(data=df, x='Glucose', y='BloodPressure', fit_reg=False);
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('BloodPressure', fontsize=16);


Transparency and Jitter

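• useful when too many points overlap: transparency (alpha) makes dense regions darker, and jitter spreads out points that share the same discrete value
sns.regplot(data=df, x='Outcome', y='SkinThickness', x_jitter=0.1, fit_reg=False, scatter_kws={'alpha':.4});
plt.xlabel('Outcome', fontsize=16)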
plt.ylabel('SkinThickness', fontsize=16)
plt.show()

sns.regplot(data=df, x='Glucose', y='BloodPressure', fit_reg=False, scatter_kws={'alpha':.4})
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('BloodPressure', fontsize=16);
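
As a rough sketch of what jitter does, the snippet below adds small uniform noise to a discrete variable by hand. The `outcome` array and `jitter_width` name are made up for illustration. As far as the seaborn documentation describes it, `x_jitter` does something similar to the plotted positions only and leaves the underlying data untouched.

```python
import numpy as np

rng = np.random.default_rng(42)

# A discrete variable with only two distinct values (illustrative data)
outcome = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Shift every point by a random offset in [-jitter_width, +jitter_width]
jitter_width = 0.1
jittered = outcome + rng.uniform(-jitter_width, jitter_width, size=outcome.size)

# Each point stays close to its original value, so the groups remain distinct
print(np.round(jittered, 2))
```

Keeping the width well below half the gap between neighbouring values means the groups never blend into each other.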


Heat Map

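• shows the relationship through color: each cell's color encodes how many points fall into it
• good for discrete quantitative vs discrete quantitative
# Extend each range by one bin width so the last edge lies at or beyond the maximum value
bins_x = np.arange(0, 199+20, 20)
bins_y = np.arange(0, 122+15, 15)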

plt.hist2d(data=df, x='Glucose', y='BloodPressure', cmin=0.6, cmap='plasma_r', bins=[bins_x, bins_y])
plt.colorbar()
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('BloodPressure', fontsize=16);

df[['Glucose', 'BloodPressure']].describe()
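
One detail worth checking with fixed-width bins is whether the last edge reaches the maximum value, because points beyond the last edge are silently left out of the counts. The sketch below uses `np.histogram2d` on a few made-up values to show this. The `make_bin_edges` helper and the sample arrays are hypothetical.

```python
import numpy as np


def make_bin_edges(values, step):
    """Edges from 0 upward in `step` increments, reaching at least max(values)."""
    return np.arange(0, np.max(values) + step, step)


# Illustrative values that include the maxima used above (199 and 122)
glucose = np.array([0, 44, 99, 117, 140, 199])
blood_pressure = np.array([0, 62, 72, 80, 110, 122])

edges_x = make_bin_edges(glucose, 20)         # 0, 20, ..., 200
edges_y = make_bin_edges(blood_pressure, 15)  # 0, 15, ..., 135

counts, _, _ = np.histogram2d(glucose, blood_pressure, bins=[edges_x, edges_y])

# Edges that stop short of the maximum (here at 120) drop the point at 122
short_edges_y = np.arange(0, 122 + 5, 15)
counts_short, _, _ = np.histogram2d(glucose, blood_pressure, bins=[edges_x, short_edges_y])

print(int(counts.sum()), int(counts_short.sum()))
```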


Violin Plot: Quantitative vs Qualitative

sns.violinplot(data=df, x='Outcome', y='Glucose');

color = sns.color_palette()[0]
sns.violinplot(data=df, x='Outcome', y='Glucose', color=color, inner=None);
plt.xticks(range(2), ['No Diab', 'Diab'], fontsize=16);


Box Plot

sns.boxplot(data=df, x='Outcome', y='Glucose');

plt.xticks(range(2), ['No Diab', 'Diab'], fontsize=16);

color = sns.color_palette()[0]

sns.violinplot(data=df, x='Outcome', y='Glucose', color=color, inner='quartile');

plt.xticks(range(2), ['No Diab', 'Diab'], fontsize=16);
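
The box in a box plot and the lines drawn by `inner='quartile'` both mark the quartiles of each group. A minimal sketch of computing them directly with pandas follows. The `df_demo` frame is made-up data that only reuses the column names from above.

```python
import pandas as pd

# Small illustrative frame with the same column names as the diabetes data
df_demo = pd.DataFrame({
    "Outcome": [0] * 5 + [1] * 5,
    "Glucose": [80, 90, 100, 110, 120, 120, 140, 160, 180, 200],
})

# First quartile, median and third quartile of Glucose for each Outcome group
quartiles = (
    df_demo.groupby("Outcome")["Glucose"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

print(quartiles)
```

Comparing these numbers with the plot is a quick way to confirm that the figure is being read correctly.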


Clustered Barchart: Qualitative vs Qualitative

df = pd.read_csv('./data/titanic/train.csv')
print(df.shape)
print(df.info())

def clean_gender(df):
    # Map the inconsistent gender labels onto 'male' / 'female'
    df['Gender'] = df['Gender'].replace({'M': 'male', 'Male': 'male', 'F': 'female', 'Female': 'female'})
    return df

df = clean_gender(df)

sns.countplot(data=df, x='Survived', hue='Gender');

plt.xticks(range(2), ['Not Survived', 'Survived'], fontsize=16);

sns.countplot(data=df, x='Survived', hue='Parch');

plt.xticks(range(2), ['Not Survived', 'Survived'], fontsize=16);

# Relabel the outcome so the legend shows readable names
df['Survived'] = df['Survived'].replace({0: 'NotSurvived', 1: 'Survived'})

sns.countplot(data=df, x='Parch', hue='Survived');
plt.legend(loc='center right');
#plt.xticks(range(2), ['Not Survived', 'Survived'], fontsize=16);
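
The bar heights in a clustered bar chart are the counts of each combination of the two categorical variables. As a sketch, the same table can be produced with `pd.crosstab`. The `df_demo` frame below is made-up data that only reuses the column names from above.

```python
import pandas as pd

# Illustrative data with two categorical columns
df_demo = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1, 0],
    "Gender": ["male", "male", "female", "female", "male", "female"],
})

# Rows: Survived levels, columns: Gender levels, cells: number of rows
counts = pd.crosstab(df_demo["Survived"], df_demo["Gender"])

print(counts)
```

Each row of the table corresponds to one cluster of bars, and each column to one hue.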


Faceting

• useful when a categorical variable has several levels: each level gets its own panel
# Quantitative vs Qualitative

# Reload the diabetes data; df currently holds the Titanic data
df = pd.read_csv('./data/diabetes/diabetes.csv')

color = sns.color_palette()[0]

sns.violinplot(data=df, x='Outcome', y='Glucose', color=color, inner='quartile');

plt.xticks(range(2), ['No Diab', 'Diab'], fontsize=16);

bins = np.arange(0, 199+20, 20) # shared bin edges (0 to 200) so every facet uses the same bins

#print(bins.shape)

g = sns.FacetGrid(data=df, col='Outcome', col_wrap=2, sharey=True);

g.map(plt.hist, 'Glucose', bins=bins);
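
Each facet is a histogram of one group drawn on the same bin edges. A minimal sketch of the counts behind the panels, computed with `np.histogram` on made-up data (the `df_demo` frame and `facet_counts` name are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative data with the same column names as the diabetes data
df_demo = pd.DataFrame({
    "Outcome": [0] * 5 + [1] * 5,
    "Glucose": [85, 90, 99, 110, 130, 120, 150, 165, 180, 199],
})

# One set of bin edges shared by all groups, so the panels are comparable
bins = np.arange(0, 220, 20)  # 0, 20, ..., 200

# One histogram per level of the faceting variable
facet_counts = {
    int(level): np.histogram(group, bins=bins)[0]
    for level, group in df_demo.groupby("Outcome")["Glucose"]
}

for level, counts in facet_counts.items():
    print(level, counts)
```

Because the edges are shared, a bar at the same position means the same Glucose range in every panel.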




