Data Wrangling

1 minute read

Published: January 01, 2022

This post covers Data Wrangling.

Data Wrangling

Data Gathering, Assessing, and Cleaning
Common issues:
- Missing Data
- Duplicate Data
- Typo error
- Different Format
- Inaccuracy in the data

Data Gathering

Gather data using Facebook, Twitter API

Download Titanic Dataset

Unzip file if needed

import zipfile
      
with zipfile.ZipFile('./data/titanic.zip', 'r') as fp:
    fp.extractall(path='data')

import pandas as pd
      
csv_file = './data/titanic/train.csv'
      
df = pd.read_csv(csv_file)
display(df.head(2))

Data Assessing

Identify issues related to Quality and Tidiness using Visual and Programmatic tools

import pandas as pd
    
csv_file = './data/titanic/train.csv'
df = pd.read_csv(csv_file)
display(df.info())

Data Cleaning

Define cleaning plan - Code to clean - Test if cleaned

# Rename Columns
cols = {'SibSp':'SiblingSpouseAboard', 'Parch':'ParentsChildrenAboard'}
    
df_clean.rename(columns=cols, inplace=True)
df_clean.head(2)

# Replace values
csv_file = './data/titanic/train.csv'
df = pd.read_csv(csv_file)
df_clean = df.copy()
    
display(df_clean.Gender.value_counts())
    
male_list = ['M', 'Male']
female_list = ['F', 'Female']
    
for text in male_list:
    df_clean.Gender.replace(to_replace=text, value='male', inplace=True)
        
for text in female_list:
    df_clean.Gender.replace(to_replace=text, value='female', inplace=True)
print()
display(df_clean.Gender.value_counts())

# Replace values
csv_file = './data/titanic/train.csv'
df = pd.read_csv(csv_file)
df_clean = df.copy()
    
display(df_clean.Gender.value_counts())
    
male_list = ['M', 'Male']
female_list = ['F', 'Female']
    
for text in male_list:
    df_clean.loc[df_clean.Gender==text, 'Gender'] = 'male'
        
for text in female_list:
    df_clean.loc[df_clean.Gender==text, 'Gender'] = 'female'
print()
display(df_clean.Gender.value_counts())

# Test
for text in male_list:
    assert text not in df_clean.Gender.values
        
for text in female_list:
    assert text not in df_clean.Gender.values

def clean_gender(df):
    male_list = ["M", "Male"] # replaced with male
    female_list = ["F", "Female"] # replaced with female
    for text in male_list:
        df.Gender.replace(to_replace=text, value="male", inplace=True)
    for text in female_list:
        df.Gender.replace(to_replace=text, value="female", inplace=True)

Data Analysis and Visualisation

# Percentage of Passenger Survived
    
display(df_clean.Survived.value_counts())
print()
    
survived = df_clean.Survived.value_counts()[1]
total = df_clean.Survived.count()
    
print(f'Survived {survived} from Total {total}: {survived/total:.2f}%')

# Plots
%matplotlib inline
    
df_clean.Survived.value_counts().plot(kind='pie', legend=True);

df_clean.Survived.value_counts().plot(kind='bar', legend=True);

df_clean.Survived.value_counts().plot(kind='barh', legend=True);

df_titanic = pd.read_csv("data/titanic.csv")
clean_gender(df_titanic)
    
plt.figure(figsize=(3,3))
    
labels = ["Not Survived", "Survived"]
ax = df_titanic.Survived.value_counts().plot(kind='pie', labels=labels);
    
ax.yaxis.set_visible(False)
    
plt.legend(loc='upper left', bbox_to_anchor=(1.05, 1));
# upper right, upper left, lower left, lower right, right, 
# center left, center right, lower center, upper center, center

df_titanic = pd.read_csv("data/titanic.csv")
clean_gender(df_titanic)
    
plt.figure(figsize=(5,3))
    
ax = df_titanic.Survived.value_counts().plot(kind="bar")
    
labels = ["Not Survived", "Survived"]
ax.set_xticklabels(labels, rotation=45, fontsize=14);
plt.title("Titanic Survival");
ax.yaxis.set_visible(False)

df_titanic = pd.read_csv("data/titanic.csv")
clean_gender(df_titanic)
    
plt.figure(figsize=(5,3))
    
ax = df_titanic.Survived.value_counts().plot(kind="barh")
    
labels = ["Not Survived", "Survived"]
ax.set_yticklabels(labels, rotation=0, fontsize=14);
plt.title("Titanic Survival");
ax.xaxis.set_visible(False)

Share on

Twitter Facebook LinkedIn

Data Wrangling

Data Wrangling

Data Gathering

Data Assessing

Data Cleaning

Data Analysis and Visualisation

Share on

You May Also Enjoy

Applied Software Design

Code: CMake and Catch2

C++

Pointers: slide 1

C++

Arrays and Vectors: slide 1

C++

Functions: slide 1