Visualization

Published: January 01, 2022

This post covers Bivariate Visualization.

Bivariate

Datasets: /posts/python/setup

%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np

Bivariate Plots

Scatter plot
- Quantitative vs Quantitative
Violin Plot
- Quantitative vs Qualitative
Clustered Bar chart
- Qualitative vs Qualitative

Scatterplot and Correlation

Pearson Correlation Coefficient
- measures the strength and direction of the linear correlation between two numeric variables, on a scale from -1 to +1

Figure: Examples of scatter diagrams with different values of correlation coefficient (ρ). Source: Wikipedia

Figure: Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the strength and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero. Source: Wikipedia
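The point made in the caption above (correlation tracks linearity, not slope) can be checked numerically. The sketch below is only an illustration on synthetic data: the helper name `pearson_r` and the arrays are assumptions, not part of the original notebook.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Synthetic x values, symmetric around zero (hypothetical data)
x = np.linspace(-1.0, 1.0, 201)

r_shallow = pearson_r(x, 0.5 * x + 3.0)    # gentle positive slope
r_steep = pearson_r(x, 10.0 * x - 2.0)     # steep positive slope
r_negative = pearson_r(x, -4.0 * x)        # decreasing line
r_parabola = pearson_r(x, x ** 2)          # strong but nonlinear relationship

print(r_shallow, r_steep, r_negative, r_parabola)
```

Both increasing lines give a coefficient of 1 despite very different slopes, the decreasing line gives -1, and the parabola gives a value close to 0 even though y is fully determined by x. On the real data, `df['Glucose'].corr(df['Insulin'])` should give the same kind of number for a pair of columns.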

df = pd.read_csv('./data/diabetes/diabetes.csv')
print(df.shape)
df.info()  # info() prints its summary itself and returns None, so no print() is needed

Scatter plot: Quantitative vs Quantitative

plt.scatter(data=df, x='BMI', y='SkinThickness');
plt.xlabel('BMI', fontsize=16)
plt.ylabel('SkinThickness', fontsize=16)
plt.show()

plt.scatter(data=df, x='Glucose', y='Insulin');
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('Insulin', fontsize=16)
plt.show()

plt.scatter(data=df, x='Glucose', y='BloodPressure');
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('BloodPressure', fontsize=16)
plt.show()

sns.regplot(data=df, x='BMI', y='SkinThickness');
plt.xlabel('BMI', fontsize=16)
plt.ylabel('SkinThickness', fontsize=16)
plt.show()

sns.regplot(data=df, x='Glucose', y='Insulin');
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('Insulin', fontsize=16)
plt.show()

sns.regplot(data=df, x='Glucose', y='BloodPressure');
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('BloodPressure', fontsize=16);

# fit_reg=False turns off the regression line, leaving a plain scatter plot

sns.regplot(data=df, x='Outcome', y='SkinThickness', fit_reg=False);
plt.xlabel('Outcome', fontsize=16)  # label matches the x variable actually plotted
plt.ylabel('SkinThickness', fontsize=16)
plt.show()

sns.regplot(data=df, x='Glucose', y='BloodPressure', fit_reg=False);
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('BloodPressure', fontsize=16);

Transparency and Jitter

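When many points overlap, the density of the data is hard to read. Transparency (alpha) and jitter (small random offsets) make the overlapping points visible.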

sns.regplot(data=df, x='Outcome', y='SkinThickness', x_jitter=0.1, fit_reg=False, scatter_kws={'alpha':.4});
plt.xlabel('Outcome', fontsize=16)  # label matches the x variable actually plotted
plt.ylabel('SkinThickness', fontsize=16)
plt.show()

sns.regplot(data=df, x='Glucose', y='BloodPressure', fit_reg=False, scatter_kws={'alpha':.4})
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('BloodPressure', fontsize=16);
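To make the idea of jitter concrete, here is a minimal sketch of what adding uniform noise to a discrete variable does. The helper `add_jitter` and the toy `outcome` array are hypothetical. The sketch only illustrates the idea behind `x_jitter` and is not seaborn's implementation.

```python
import numpy as np

def add_jitter(values, width, seed=0):
    """Return a copy of values shifted by uniform random noise in [-width, width]."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(-width, width, size=values.shape)

# A discrete variable: without jitter, all points stack on x=0 and x=1
outcome = np.array([0, 0, 0, 1, 1, 1])
jittered = add_jitter(outcome, width=0.1)

print(jittered)
```

The noise changes only where the points are drawn, so each point stays within 0.1 of its true value and the two groups remain clearly separated.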

Heat Map

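A heat map shows the relationship between two variables as a grid whose cell color encodes the number of points in each cell.
It works well for discrete quantitative vs discrete quantitative data.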

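# Extend each range by one bin width so the last edge reaches the column maximum
# (assumed here to be 199 for Glucose and 122 for BloodPressure); otherwise the largest values are dropped
bins_x = np.arange(0, 199+20, 20)
bins_y = np.arange(0, 122+15, 15)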

plt.hist2d(data=df, x='Glucose', y='BloodPressure', cmin=0.6, cmap='plasma_r', bins=[bins_x, bins_y])
plt.colorbar()
plt.xlabel('Glucose', fontsize=16)
plt.ylabel('BloodPressure', fontsize=16);

df[['Glucose', 'BloodPressure']].describe()
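The color of each heat map cell encodes a count. As a rough sketch of what is being counted, `np.histogram2d` returns the same kind of grid as a plain array. The toy values and bin edges below are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: five (x, y) points
x = np.array([10, 15, 30, 30, 55])
y = np.array([5, 8, 20, 25, 40])

# Three bins per axis
x_edges = np.array([0, 20, 40, 60])
y_edges = np.array([0, 15, 30, 45])

# counts[i, j] is the number of points with x in bin i and y in bin j
counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])

print(counts)
```

Cells with a count of zero are the ones that `cmin` leaves blank in the plot above, since any value below `cmin` is not drawn.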

Violin Plot: Quantitative vs Qualitative

Violin plots
- similar to box plots, but they also show the probability density of the data at different values, usually smoothed by a kernel density estimator
- more informative than a plain box plot
  - a box plot shows only summary statistics such as the median and interquartile range
  - a violin plot shows the full distribution of the data
- especially useful when the data distribution is multimodal (more than one peak)
  - the violin plot shows the presence of the different peaks, their positions and relative amplitudes

sns.violinplot(data=df, x='Outcome', y='Glucose');

color = sns.color_palette()[0]
sns.violinplot(data=df, x='Outcome', y='Glucose', color=color, inner=None);
plt.xticks(range(2), ['No Diab', 'Diab'], fontsize=16);

color = sns.color_palette()[0]

sns.violinplot(data=df, x='Outcome', y='Glucose', color=color, inner='quartile');

plt.xticks(range(2), ['No Diab', 'Diab'], fontsize=16);

Box Plot

sns.boxplot(data=df, x='Outcome', y='Glucose');

plt.xticks(range(2), ['No Diab', 'Diab'], fontsize=16);

Clustered Bar Chart: Qualitative vs Qualitative

df = pd.read_csv('./data/titanic/train.csv')
print(df.shape)
df.info()  # info() prints its summary itself and returns None, so no print() is needed

def clean_gender(df):
    # Map the spelling variants onto two canonical labels; other values are left unchanged.
    # Assigning the result back avoids chained inplace=True, which may not modify df in recent pandas.
    gender_map = {'M': 'male', 'Male': 'male', 'F': 'female', 'Female': 'female'}
    df['Gender'] = df['Gender'].replace(gender_map)
    return df

df = clean_gender(df)

sns.countplot(data=df, x='Survived', hue='Gender');

plt.xticks(range(2), ['Not Survived', 'Survived'], fontsize=16);

sns.countplot(data=df, x='Survived', hue='Parch');

plt.xticks(range(2), ['Not Survived', 'Survived'], fontsize=16);

# Assign the result back instead of using chained inplace=True, which may not modify df in recent pandas
df['Survived'] = df['Survived'].replace({0: 'NotSurvived', 1: 'Survived'})

sns.countplot(data=df, x='Parch', hue='Survived');
plt.legend(loc='center right');
#plt.xticks(range(2), ['Not Survived', 'Survived'], fontsize=16);
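The bar heights in a clustered bar chart are the counts for each combination of two categorical variables. As a small illustration on made-up data (the `toy` frame is hypothetical and only reuses the column names from above), `pd.crosstab` produces those counts as a table.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic frame above
toy = pd.DataFrame({
    'Survived': [0, 0, 1, 1, 1, 0],
    'Gender': ['male', 'male', 'female', 'female', 'male', 'female'],
})

# Rows are the levels of Survived, columns are the levels of Gender
counts = pd.crosstab(toy['Survived'], toy['Gender'])

print(counts)
```

Each cell of the table corresponds to one bar in the `countplot` call with `x='Survived'` and `hue='Gender'`.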

Faceting

Faceting draws the same plot once for each level of a categorical variable. It is useful when a categorical variable has many levels.

df = pd.read_csv('./data/diabetes/diabetes.csv')

bins = np.arange(0, 200+20, 20) # shared bin edges for every facet; +20 so the last edge (200) covers the largest Glucose values

#print(bins.shape)

g = sns.FacetGrid(data=df, col='Outcome', col_wrap=2, sharey=True);

g.map(plt.hist, 'Glucose', bins=bins);
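Conceptually, faceting splits the data by the levels of one variable and repeats the same plot on each subset with shared settings. A minimal sketch of that split, using a hypothetical `toy` frame and `np.histogram` in place of the plotting call:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data reusing the column names from the diabetes frame
toy = pd.DataFrame({
    'Outcome': [0, 0, 0, 1, 1],
    'Glucose': [85, 90, 110, 150, 170],
})

# The same bin edges are used for every subset so the histograms are comparable
bins = np.arange(0, 200 + 20, 20)

# One histogram (array of counts per bin) for each level of Outcome
per_facet = {
    level: np.histogram(group['Glucose'], bins=bins)[0]
    for level, group in toy.groupby('Outcome')
}

for level, hist in per_facet.items():
    print(level, hist)
```

Because every subset uses the same edges, the count arrays have the same length and can be compared bin by bin, which is what sharing `bins` across the grid achieves in the plot.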
