Data Wrangling
Published:
This post covers Data Wrangling.
Data Wrangling
Data Gathering, Assessing, and Cleaning
Common issues:
- Missing Data
- Duplicate Data
- Typo error
- Different Format
- Inaccuracy in the data
Data Gathering
Gather data using Facebook, Twitter API
Download Titanic Dataset
Unzip file if needed
import zipfile with zipfile.ZipFile('./data/titanic.zip', 'r') as fp: fp.extractall(path='data')
import pandas as pd csv_file = './data/titanic/train.csv' df = pd.read_csv(csv_file) display(df.head(2))
Data Assessing
Identify issues related to Quality and Tidiness using Visual and Programmatic tools
import pandas as pd csv_file = './data/titanic/train.csv' df = pd.read_csv(csv_file) display(df.info())
Data Cleaning
Define cleaning plan - Code to clean - Test if cleaned
# Rename Columns cols = {'SibSp':'SiblingSpouseAboard', 'Parch':'ParentsChildrenAboard'} df_clean.rename(columns=cols, inplace=True) df_clean.head(2)
# Replace values csv_file = './data/titanic/train.csv' df = pd.read_csv(csv_file) df_clean = df.copy() display(df_clean.Gender.value_counts()) male_list = ['M', 'Male'] female_list = ['F', 'Female'] for text in male_list: df_clean.Gender.replace(to_replace=text, value='male', inplace=True) for text in female_list: df_clean.Gender.replace(to_replace=text, value='female', inplace=True) print() display(df_clean.Gender.value_counts())
# Replace values csv_file = './data/titanic/train.csv' df = pd.read_csv(csv_file) df_clean = df.copy() display(df_clean.Gender.value_counts()) male_list = ['M', 'Male'] female_list = ['F', 'Female'] for text in male_list: df_clean.loc[df_clean.Gender==text, 'Gender'] = 'male' for text in female_list: df_clean.loc[df_clean.Gender==text, 'Gender'] = 'female' print() display(df_clean.Gender.value_counts())
# Test for text in male_list: assert text not in df_clean.Gender.values for text in female_list: assert text not in df_clean.Gender.values
Data Analysis and Visualisation
# Percentage of Passenger Survived display(df_clean.Survived.value_counts()) print() survived = df_clean.Survived.value_counts()[1] total = df_clean.Survived.count() print(f'Survived {survived} from Total {total}: {survived/total:.2f}%')
# Plots %matplotlib inline df_clean.Survived.value_counts().plot(kind='pie', legend=True);
df_clean.Survived.value_counts().plot(kind='bar', legend=True);
df_clean.Survived.value_counts().plot(kind='barh', legend=True);
%matplotlib inline import numpy as np import matplotlib.pyplot as plt labels = np.full(shape=len(df_clean.Gender.value_counts()), fill_value='', dtype=object) labels[0] = 'Not Survived' labels[1] = 'Survived' df_clean.Survived.value_counts().plot(kind='pie', legend=True, labels=labels);
labels = np.full(shape=len(df_clean.Gender.value_counts()), fill_value='', dtype=object) labels[0] = 'Not Survived' labels[1] = 'Survived' ax = df_clean.Survived.value_counts().plot(kind='bar'); ax.set_xticklabels(labels, rotation=45, fontsize=14); plt.title('Titanic');
labels = np.full(shape=len(df_clean.Gender.value_counts()), fill_value='', dtype=object) labels[0] = 'Not Survived' labels[1] = 'Survived' ax = df_clean.Survived.value_counts().plot(kind='barh'); ax.set_yticklabels(labels, fontsize=14); plt.title('Titanic');