Python - Pandas
Published:
This post covers an introduction to Pandas.
Hello Pandas!
DataFrame
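import pandas as pd
from IPython.display import display  # already available inside Jupyter; needed when running elsewhere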
sales = {'Product A': [80, 100, 120],
'Product B': [45, 50, 55]}
df = pd.DataFrame(sales)
display(df)
# Same data, with the years as row labels instead of the default 0, 1, 2 index
df = pd.DataFrame(sales, index=[2018, 2019, 2020])
display(df)
Series
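# A Series is a single column of values; the default index is 0, 1, 2
sales_A = pd.Series([80, 100, 120])
display(sales_A)
print()
# Custom index labels
sales_A = pd.Series([80, 100, 120], index=[2018, 2019, 2020])
display(sales_A)
print()
# A Series can also carry a name
sales_A = pd.Series([80, 100, 120], index=[2018, 2019, 2020], name='Product_A')
display(sales_A)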
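A DataFrame can be thought of as a set of Series that share one index. The sketch below reuses the sales figures from above to show the round trip; the variable names `sales_a`, `sales_b`, and `col` are only illustrative.

```python
import pandas as pd

years = [2018, 2019, 2020]
sales_a = pd.Series([80, 100, 120], index=years, name='Product A')
sales_b = pd.Series([45, 50, 55], index=years, name='Product B')

# Concatenating along axis=1 puts each Series in its own column,
# and the Series names become the column names
df = pd.concat([sales_a, sales_b], axis=1)

# Selecting one column gives back a Series with the same index and name
col = df['Product A']
print(type(col).__name__)
print(col.name)
print(df.shape)
```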
Reading Data Files
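# Titanic training data from Kaggle: https://www.kaggle.com/c/titanic/data
csv_titanic = './data/titanic/train.csv'
df_titanic = pd.read_csv(csv_titanic)
print(df_titanic.shape)         # (number of rows, number of columns)
display(df_titanic.head())      # first 5 rows (the default)
display(df_titanic.head(2))     # first 2 rows
display(df_titanic.tail())      # last 5 rows (the default)
display(df_titanic.tail(2))     # last 2 rows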
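To try `read_csv`, `head`, and `tail` without downloading anything, the same calls can be run on an in-memory CSV. This is a minimal sketch: the CSV text and its column names are made up for illustration and are not the Titanic data.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """id,name,score
1,Ann,90
2,Bob,85
3,Cho,70
4,Dev,60
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (rows, columns)
print(df.head(2))  # first two rows
print(df.tail(1))  # last row
```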
Indexing, Selecting & Assigning
Native accessors
csv_titanic = './data/titanic/train.csv'
df_titanic = pd.read_csv(csv_titanic)
display(df_titanic.head(1))
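# Bracket access works for any column name
print(df_titanic['Survived'])
# Attribute access works when the name is a valid Python identifier
print(df_titanic.Survived)
# A column is a Series, so a single value can be read by its index label
print(df_titanic.Name[0])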
Indexing in pandas
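# iloc selects by integer position: row-first, column-second
display(df_titanic.iloc[0])              # first row
display(df_titanic.iloc[0:3])            # rows 0 to 2 (end excluded)
display(df_titanic.iloc[0:3, 0:4])       # rows 0 to 2, columns 0 to 3
display(df_titanic.iloc[:5, 3])          # first five rows of column 3
print()
display(df_titanic.iloc[[0, 5, -2], 3])  # rows 0, 5 and second-to-last, column 3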
Label-based Selection
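# loc selects by label; label slices include the end label
display(df_titanic.loc[0])                                 # row labeled 0
display(df_titanic.loc[0:3])                               # rows labeled 0 through 3 (four rows)
display(df_titanic.loc[:, ['Name', 'Survived', 'Age']])    # all rows, three columns
display(df_titanic.loc[:3, 'Name'])                        # rows up to label 3, one column
display(df_titanic.loc[[1, 10, 100], ['Name', 'Survived']])  # specific rows and columns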
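The main difference between the two accessors is how slices treat the end point: `iloc` follows Python's half-open convention, while `loc` includes the end label. A small sketch on a made-up DataFrame (the `value` column and letter labels are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 20, 30, 40, 50]})  # default index 0..4

by_position = df.iloc[0:3]  # positions 0, 1, 2 (end excluded)
by_label = df.loc[0:3]      # labels 0, 1, 2, 3 (end included)
print(len(by_position), len(by_label))

# The same rule holds for non-numeric labels
letters = pd.DataFrame({'value': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])
upto_c = letters.loc['a':'c']  # includes the row labeled 'c'
print(upto_c.index.tolist())
```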
Manipulating the index
df_titanic = pd.read_csv(csv_titanic)
df_titanic.set_index('PassengerId', inplace=True)
display(df_titanic.head(2))
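After `set_index`, `loc` looks rows up by the new labels while `iloc` still counts positions from zero. The sketch below uses a tiny made-up table (the IDs and names are hypothetical) and the non-inplace form, which returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical data, not taken from the Titanic file
df = pd.DataFrame({'PassengerId': [101, 102, 103],
                   'Name': ['Ann', 'Bob', 'Cho']})

indexed = df.set_index('PassengerId')  # returns a new DataFrame

name_by_label = indexed.loc[102, 'Name']        # label lookup on the new index
name_by_position = indexed.iloc[0]['Name']      # position 0 is still the first row
print(name_by_label, name_by_position)

# reset_index moves the index back into an ordinary column
restored = indexed.reset_index()
print(restored.columns.tolist())
```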
Conditional Selection
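# A comparison returns a boolean Series, one True/False per row
display(df_titanic.Survived == 1)
# Passing that Series to loc keeps only the True rows
display(df_titanic.loc[df_titanic.Survived == 1])
# Combine conditions with & (and) and | (or); each condition needs its own parentheses
query = (df_titanic.Survived == 1) & (df_titanic.Age < 20)
display(df_titanic.loc[query])
query = (df_titanic.Survived == 1) & ((df_titanic.Age < 20) | (df_titanic.Pclass == 1))
display(df_titanic.loc[query])
# isin matches any value in the list
query = df_titanic.Cabin.isin(['C123', 'C85'])
display(df_titanic.loc[query])
# notnull keeps rows where the value is present
query = df_titanic.Age.notnull()
display(df_titanic.loc[query])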
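The conditions above are ordinary boolean Series, so they can be inspected, counted, and negated before being used as a filter. A minimal sketch on made-up data (the `age` and `survived` values are hypothetical) that also shows how a missing value behaves in a comparison:

```python
import pandas as pd

# Hypothetical data; None becomes NaN in the numeric column
df = pd.DataFrame({'age': [15, 40, None, 18],
                   'survived': [1, 0, 1, 1]})

# Parentheses are required because & binds more tightly than == and <
mask = (df['survived'] == 1) & (df['age'] < 20)

print(mask.tolist())               # a comparison with NaN evaluates to False
print(df.loc[mask].index.tolist())
print(df.loc[~mask].index.tolist())  # ~ negates the mask
print(int(df['age'].notnull().sum()))
```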
Assigning data
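# Assigning a constant fills every row of the new column with that value
df_titanic['NewClass'] = 'everyone'
display(df_titanic.head(3))
# Assigning an iterable of the same length sets one value per row (counting down to 1)
df_titanic['PassengerIdBackwards'] = range(len(df_titanic), 0, -1)
display(df_titanic.head(3))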
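Assignment also combines with the selection tools covered earlier: passing a boolean condition and a column name to `loc` updates only the matching rows. The sketch below uses a made-up table; the `score`, `group`, and `rank_backwards` columns are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'score': [90, 60, 75]})

df['group'] = 'everyone'                      # constant broadcast to every row
df['rank_backwards'] = range(len(df), 0, -1)  # one value per row: 3, 2, 1

# Update only the rows where the condition holds
df.loc[df['score'] >= 75, 'group'] = 'pass'

print(df)
```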