Data Wrangling

1 minute read

Published:

This post covers Data Wrangling.

Data Wrangling

  • Data Gathering, Assessing, and Cleaning

  • Common issues:

    • Missing Data
    • Duplicate Data
    • Typo error
    • Different Format
    • Inaccuracy in the data
  • Data Gathering

    • Gather data using Facebook, Twitter API

    • Download Titanic Dataset

      • Unzip file if needed

      • import zipfile
              
        with zipfile.ZipFile('./data/titanic.zip', 'r') as fp:
            fp.extractall(path='data')
        
      • import pandas as pd
              
        csv_file = './data/titanic/train.csv'
              
        df = pd.read_csv(csv_file)
        display(df.head(2))
        
  • Data Assessing

    • Identify issues related to Quality and Tidiness using Visual and Programmatic tools

    • import pandas as pd
          
      csv_file = './data/titanic/train.csv'
      df = pd.read_csv(csv_file)
      display(df.info())
      
  • Data Cleaning

    • Define cleaning plan - Code to clean - Test if cleaned

    • # Rename Columns
      cols = {'SibSp':'SiblingSpouseAboard', 'Parch':'ParentsChildrenAboard'}
          
      df_clean.rename(columns=cols, inplace=True)
      df_clean.head(2)
      
    • # Replace values
      csv_file = './data/titanic/train.csv'
      df = pd.read_csv(csv_file)
      df_clean = df.copy()
          
      display(df_clean.Gender.value_counts())
          
      male_list = ['M', 'Male']
      female_list = ['F', 'Female']
          
      for text in male_list:
          df_clean.Gender.replace(to_replace=text, value='male', inplace=True)
              
      for text in female_list:
          df_clean.Gender.replace(to_replace=text, value='female', inplace=True)
      print()
      display(df_clean.Gender.value_counts())
      
    • # Replace values
      csv_file = './data/titanic/train.csv'
      df = pd.read_csv(csv_file)
      df_clean = df.copy()
          
      display(df_clean.Gender.value_counts())
          
      male_list = ['M', 'Male']
      female_list = ['F', 'Female']
          
      for text in male_list:
          df_clean.loc[df_clean.Gender==text, 'Gender'] = 'male'
              
      for text in female_list:
          df_clean.loc[df_clean.Gender==text, 'Gender'] = 'female'
      print()
      display(df_clean.Gender.value_counts())
      
    • # Test
      for text in male_list:
          assert text not in df_clean.Gender.values
              
      for text in female_list:
          assert text not in df_clean.Gender.values
      

    Data Analysis and Visualisation

    • # Percentage of Passenger Survived
          
      display(df_clean.Survived.value_counts())
      print()
          
      survived = df_clean.Survived.value_counts()[1]
      total = df_clean.Survived.count()
          
      print(f'Survived {survived} from Total {total}: {survived/total:.2f}%')
      
    • # Plots
      %matplotlib inline
          
      df_clean.Survived.value_counts().plot(kind='pie', legend=True);
      
    • df_clean.Survived.value_counts().plot(kind='bar', legend=True);
      
    • df_clean.Survived.value_counts().plot(kind='barh', legend=True);
      
    • %matplotlib inline
      import numpy as np
      import matplotlib.pyplot as plt
          
      labels = np.full(shape=len(df_clean.Gender.value_counts()), fill_value='', dtype=object)
      labels[0] = 'Not Survived'
      labels[1] = 'Survived'
          
      df_clean.Survived.value_counts().plot(kind='pie', legend=True, labels=labels);
      
    • labels = np.full(shape=len(df_clean.Gender.value_counts()), fill_value='', dtype=object)
      labels[0] = 'Not Survived'
      labels[1] = 'Survived'
          
      ax = df_clean.Survived.value_counts().plot(kind='bar');
      ax.set_xticklabels(labels, rotation=45, fontsize=14);
          
      plt.title('Titanic');
      
    • labels = np.full(shape=len(df_clean.Gender.value_counts()), fill_value='', dtype=object)
      labels[0] = 'Not Survived'
      labels[1] = 'Survived'
          
      ax = df_clean.Survived.value_counts().plot(kind='barh');
      ax.set_yticklabels(labels, fontsize=14);
      plt.title('Titanic');