Pandas tutorial

Exploring the basics of the Pandas library
tutorial

Published: November 21, 2021

Pandas is an open-source library built on top of the NumPy library. It is used for working with data sets, i.e. analyzing, cleaning, exploring, and manipulating data.

Creating a Dataframe

# a Python dictionary
person = {
    'first': 'Pranav', 
    'last': 'Shirole', 
    'email': '123pranav@email.com'
}
people = {
    'first': ['Pranav', 'Jane', 'John'], 
    'last': ['Shirole', 'Doe', 'Doe'], 
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com']
}
# view all emails
people['email']
['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com']

Let’s create a dataframe from the above dictionary.

import pandas as pd
import numpy as np 
# create a dataframe
df = pd.DataFrame(people)
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
# access column of a dataframe
df['email']
0    123pranav@email.com
1      janedoe@email.com
2      johndoe@email.com
Name: email, dtype: object

The above code returns a Series object. A Series has an index, which you can see on the left (0, 1, 2).

df.email
0    123pranav@email.com
1      janedoe@email.com
2      johndoe@email.com
Name: email, dtype: object

It’s better to use the brackets rather than the dot notation because if a column name is the same as an existing DataFrame method or attribute, dot notation will return that method or attribute instead of the column’s data.
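A quick sketch of the collision problem, using a hypothetical column named count (which clashes with the DataFrame.count method):

```python
import pandas as pd

# hypothetical example: a column named 'count' shadows the DataFrame.count method
df = pd.DataFrame({'count': [1, 2, 3]})

# dot notation returns the bound method, not the column
print(type(df.count))

# bracket notation returns the actual column, a Series
print(df['count'])
```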

type(df['email'])
pandas.core.series.Series

A Series is rows of data of a single column. It’s a 1-D array.
A DataFrame is rows and columns of data, a 2-D array. It’s a container of multiple Series objects.
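A Series can also be built directly with pd.Series, which the tutorial doesn’t show; a minimal sketch (the index labels 'a' and 'b' are arbitrary):

```python
import pandas as pd

# build a Series directly, with a custom index
s = pd.Series(['123pranav@email.com', 'janedoe@email.com'], index=['a', 'b'])
print(s.loc['a'])

# a DataFrame can be thought of as a dict of Series sharing one index
df = pd.DataFrame({'email': s})
print(type(df['email']))
```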

# access multiple columns
# pass a list of columns inside the brackets
df[['last', 'email']]
last email
0 Shirole 123pranav@email.com
1 Doe janedoe@email.com
2 Doe johndoe@email.com

The above code returns a filtered-down DataFrame.

# this is no longer a Series since we selected multiple columns
type(df[['last', 'email']])
pandas.core.frame.DataFrame
# see all columns
df.columns
Index(['first', 'last', 'email'], dtype='object')

Get rows and columns using loc and iloc

iloc

iloc allows us to access rows by integer location.

# get first row
# returns a Series that contains the values of the first row of data
df.iloc[0]
first                 Pranav
last                 Shirole
email    123pranav@email.com
Name: 0, dtype: object

The above code returns a Series that contains the values of the first row of data. Note that when accessing a row, the resulting Series is indexed by the column names.

# get first and second row / multiple rows
# returns a DataFrame
df.iloc[[0, 1]]
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com

The above code returns a DataFrame.

We can also select columns with loc and iloc. The rows will be the first value and the columns will be the second value that we pass inside the brackets. So if we think of loc and iloc as functions, the rows are the first argument and the columns are the second argument.

# grab the email address of the first two rows
# index of email column will be 2 since it's the third column
df.iloc[[0, 1], 2]
0    123pranav@email.com
1      janedoe@email.com
Name: email, dtype: object

loc

For loc, we search by labels.

df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
# get the first row (similar to iloc)
df.loc[0]
first                 Pranav
last                 Shirole
email    123pranav@email.com
Name: 0, dtype: object
# get first two rows (similar to iloc)
df.loc[[0, 1]]
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
# grab the email address of the first two rows
df.loc[[0, 1], 'email']
0    123pranav@email.com
1      janedoe@email.com
Name: email, dtype: object

We can also pass in a list of columns with loc and iloc.

# get last name and email of first two rows
# the output columns follow the order in which you list them, regardless of their order in the main dataframe
df.loc[[0, 1], ['email', 'last']]
email last
0 123pranav@email.com Shirole
1 janedoe@email.com Doe
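One difference worth knowing: slices behave differently in the two accessors. loc slices are label-based and include the end label, while iloc slices are position-based and exclude the end position, like regular Python lists. A quick check (the dataframe is rebuilt here so the snippet is self-contained):

```python
import pandas as pd

people = {
    'first': ['Pranav', 'Jane', 'John'],
    'last': ['Shirole', 'Doe', 'Doe'],
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com'],
}
df = pd.DataFrame(people)

# loc slices are label-based and INCLUDE the end label: rows 0 and 1
print(len(df.loc[0:1]))

# iloc slices are position-based and EXCLUDE the end position: row 0 only
print(len(df.iloc[0:1]))
```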

Indexes

df['email']
0    123pranav@email.com
1      janedoe@email.com
2      johndoe@email.com
Name: email, dtype: object

We can set the email column as the index for the dataframe. The dataframe doesn’t actually change unless you use inplace=True, which is nice since it lets us experiment with our dataset.

# set email column as the index for the dataframe
df.set_index('email', inplace=True)
df
first last
email
123pranav@email.com Pranav Shirole
janedoe@email.com Jane Doe
johndoe@email.com John Doe
df.index
Index(['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com'], dtype='object', name='email')

Why would changing the index be useful?
Because it enables us to see all the information on someone just by using their email.

# see all information on someone by using email
df.loc['123pranav@email.com']
first     Pranav
last     Shirole
Name: 123pranav@email.com, dtype: object
# get just the last name from the email
df.loc['123pranav@email.com', 'last']
'Shirole'

Note that once you change the index, you cannot use the default index to locate your rows. But you can still use iloc.

# The below code will now result in an error since the index has changed
# df.loc[0]

# but you can still use iloc
df.iloc[0]
first     Pranav
last     Shirole
Name: 123pranav@email.com, dtype: object

You can also reset the index back to the default.

# reset index
df.reset_index(inplace=True)
df
email first last
0 123pranav@email.com Pranav Shirole
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe

Filtering

You can filter specific data from the dataset using pandas.

# get everyone with the last name Doe
df['last'] == 'Doe'
0    False
1     True
2     True
Name: last, dtype: bool

We get a Series object with Boolean values. True values are the ones that met our filter criteria. False values are the ones that did not meet our filter criteria.

filt = (df['last'] == 'Doe')
df[filt]
email first last
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe
# this also works without setting a variable
df[df['last'] == 'Doe']
email first last
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe

You can filter data using loc by passing in a series of Boolean values.

# this also does the same thing as above
df.loc[filt]
email first last
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe

This is great because we can still grab data for specific columns as well.

# get only email IDs for filter
# remember: df.loc[rows, cols]
df.loc[filt, 'email']
1    janedoe@email.com
2    johndoe@email.com
Name: email, dtype: object
# get all email IDs where last name is Doe and first name is John
filt = (df['last'] == 'Doe') & (df['first'] == 'John')
df.loc[filt, 'email']
2    johndoe@email.com
Name: email, dtype: object
# get all email IDs where last name is Shirole or first name is John
filt = (df['last'] == 'Shirole') | (df['first'] == 'John')
df.loc[filt, 'email']
0    123pranav@email.com
2      johndoe@email.com
Name: email, dtype: object

You can get the opposite of a filter using ~.

# get email IDs where last name is not Shirole or first name is not John
df.loc[~filt, 'email']
1    janedoe@email.com
Name: email, dtype: object
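Another handy way to build a filter, not shown above, is the isin method, which checks each value against a list. A sketch on the same people data (rebuilt so the snippet runs on its own; 'Stark' is just a hypothetical extra value to match):

```python
import pandas as pd

people = {
    'first': ['Pranav', 'Jane', 'John'],
    'last': ['Shirole', 'Doe', 'Doe'],
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com'],
}
df = pd.DataFrame(people)

# isin builds a Boolean mask that is True wherever the value appears in the list
filt = df['last'].isin(['Doe', 'Stark'])
print(df.loc[filt, 'email'].tolist())
```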

Updating Rows and Columns

Updating Columns

df = pd.DataFrame(people)
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
# all columns
df.columns
Index(['first', 'last', 'email'], dtype='object')
# rename all columns at once
df.columns = ['first name', 'last name', 'email']
df
first name last name email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
# upper case all column names
df.columns = [x.upper() for x in df.columns]
df
FIRST NAME LAST NAME EMAIL
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
# remove spaces in column names or replace them with something else
df.columns = df.columns.str.replace(' ', '_')
df
FIRST_NAME LAST_NAME EMAIL
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
# lower case all column names
df.columns = [x.lower() for x in df.columns]
df
first_name last_name email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com

We can rename only some columns by passing in a dictionary of column names. We need to include inplace=True for the changes to take effect.

# key: old_name , value: new_name
df.rename(columns = {'first_name': 'first', 'last_name': 'last'}, inplace=True)
df 
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com

Updating Rows

# get row with name John Doe
df.loc[2]
first                 John
last                   Doe
email    johndoe@email.com
Name: 2, dtype: object
# change all values for a row
df.loc[2] = ['John', 'Smith', 'JohnSmith@email.com']
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Smith JohnSmith@email.com
# get only two columns of one row
df.loc[2, ['last', 'email']]
last                   Smith
email    JohnSmith@email.com
Name: 2, dtype: object
# change only certain values for a row
df.loc[2, ['last', 'email']] = ['Doe', 'JohnDoe@email.com']
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe JohnDoe@email.com
# change a single value
df.loc[2, 'last'] = 'Smith'
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Smith JohnDoe@email.com

The at accessor can be used to look up or change a single value. But for consistency you can, and arguably should, just use loc.

df.at[2, 'last'] = 'Doe'
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe JohnDoe@email.com
filt = (df['email'] == 'JohnDoe@email.com')
df[filt]
first last email
2 John Doe JohnDoe@email.com
# change John Doe's last name to Smith
df[filt]['last']
2    Doe
Name: last, dtype: object
# this gives a warning and does not make the change
# chained indexing like this assigns to a copy of the dataframe, not the original
df[filt]['last'] = 'Smith'
df
/tmp/ipykernel_183/3199811375.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[filt]['last'] = 'Smith'
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe JohnDoe@email.com

Just use loc or at when setting values.

df.loc[filt, 'last'] = 'Smith'
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Smith JohnDoe@email.com
# lowercase all emails
# this does not actually make the change
df['email'].str.lower()
0    123pranav@email.com
1      janedoe@email.com
2      johndoe@email.com
Name: email, dtype: object
# this makes the change
df['email'] = df['email'].str.lower()
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Smith johndoe@email.com

Other methods to update

apply, map, applymap, replace

apply

apply is used for calling a function on our values. It works on both a Series and a DataFrame.

Using apply on a Series.

# check length of email addresses 
df['email'].apply(len)
0    19
1    17
2    17
Name: email, dtype: int64
# make email uppercase
def update_email(email):
    return email.upper()
df['email'].apply(update_email)
0    123PRANAV@EMAIL.COM
1      JANEDOE@EMAIL.COM
2      JOHNDOE@EMAIL.COM
Name: email, dtype: object
df['email'] = df['email'].apply(update_email)
df
first last email
0 Pranav Shirole 123PRANAV@EMAIL.COM
1 Jane Doe JANEDOE@EMAIL.COM
2 John Smith JOHNDOE@EMAIL.COM

lambda functions are anonymous functions, without a name.

# convert emails back to lowercase
df['email'] = df['email'].apply(lambda x: x.lower())
df 
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Smith johndoe@email.com

Using apply on a DataFrame.

# running apply on a Series returns a Series
df['email'].apply(len)
0    19
1    17
2    17
Name: email, dtype: int64
# applies the len function to all the Series (i.e. columns) in the DataFrame
df.apply(len)
first    3
last     3
email    3
dtype: int64

The above code returns the number of rows in each column.

# the apply above does the same as this, but for every column
len(df['email'])
3
# count across columns, i.e. sideways (how many columns for each index)
df.apply(len, axis='columns')
0    3
1    3
2    3
dtype: int64
# grab minimum value for a Series
df.apply(pd.Series.min) # will choose alphabetically
first                   Jane
last                     Doe
email    123pranav@email.com
dtype: object

The above code returns the lowest value alphabetically in the first, last and email columns respectively.

# same as above (the lambda receives each column as a Series)
df.apply(lambda x: x.min())
first                   Jane
last                     Doe
email    123pranav@email.com
dtype: object

This is usually more useful when your dataframe contains numerical data.

Running apply on a Series applies a function to every value in the Series. Running apply on a DataFrame applies a function to every Series in the DataFrame.

applymap

applymap applies a function to every individual element in the DataFrame. It only works on DataFrames. (In newer versions of pandas, applymap has been deprecated in favor of the DataFrame.map method.)

df.applymap(len)
first last email
0 6 7 19
1 4 3 17
2 4 5 17
df.applymap(str.lower)
first last email
0 pranav shirole 123pranav@email.com
1 jane doe janedoe@email.com
2 john smith johndoe@email.com

map

The map method is used for substituting each value in a Series with another value. It only works on a Series.

# substitute the first names
df['first'].map({'Corey': 'Chris', 'Jane': 'Mary'})
0     NaN
1    Mary
2     NaN
Name: first, dtype: object

Note that the values we didn’t substitute were converted to NaN values, which may not be what we want.
If you want to keep the other names, use the replace method.

replace

df['first'].replace({'Corey': 'Chris', 'Jane': 'Mary'})
0    Pranav
1      Mary
2      John
Name: first, dtype: object
df['first'] = df['first'].replace({'Corey': 'Chris', 'Jane': 'Mary'})
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Mary Doe janedoe@email.com
2 John Smith johndoe@email.com

Add/Remove Rows and Columns

df = pd.DataFrame(people)
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com

Add column

# combine the name columns
df['first'] + ' ' + df['last']
0    Pranav Shirole
1          Jane Doe
2          John Doe
dtype: object
# add a column named full_name
df['full_name'] = df['first'] + ' ' + df['last']
df
first last email full_name
0 Pranav Shirole 123pranav@email.com Pranav Shirole
1 Jane Doe janedoe@email.com Jane Doe
2 John Doe johndoe@email.com John Doe

Drop column

# drop the columns named first and last
df.drop(columns=['first', 'last'], inplace=True)
df
email full_name
0 123pranav@email.com Pranav Shirole
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe

We can also reverse this process and split the full_name column into two different columns.

# split full_name into two columns
# split where there is a space
df['full_name'].str.split(' ')
0    [Pranav, Shirole]
1          [Jane, Doe]
2          [John, Doe]
Name: full_name, dtype: object

The result of the above code is, for each row, a list in which the first name is the first value and the last name is the second value.
To assign these values to two different columns, we need to expand the lists using the expand argument.

# assign list values to two different columns
df['full_name'].str.split(' ', expand=True)
0 1
0 Pranav Shirole
1 Jane Doe
2 John Doe

We get two columns of the split result.

df[['first', 'last']] = df['full_name'].str.split(' ', expand=True)
df
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
1 janedoe@email.com Jane Doe Jane Doe
2 johndoe@email.com John Doe John Doe

We have added the first and last columns with the values of the list.

Add rows

You can add a new row using append. Make sure to use ignore_index=True as an argument.

# add a single row using append
# use ignore_index=True
df.append({'first': 'Tony'}, ignore_index=True)
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
1 janedoe@email.com Jane Doe Jane Doe
2 johndoe@email.com John Doe John Doe
3 NaN NaN Tony NaN

Since we only assigned a first name to our row, the other values in the columns are NaN (not a number), used for missing values.
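Note that append was deprecated in pandas 1.4 and removed in pandas 2.0; in newer versions the same row addition is done with pd.concat. A minimal sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'first': ['Jane'], 'last': ['Doe']})
new_row = pd.DataFrame({'first': ['Tony'], 'last': ['Stark']})

# concatenate the two frames and renumber the index, like append(ignore_index=True)
df = pd.concat([df, new_row], ignore_index=True)
print(df)
```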

Let’s create another dataframe.

people2 = {
    'first': ['Tony', 'Steve'], 
    'last': ['Stark', 'Rogers'], 
    'email': ['ironman@email.com', 'cap@email.com']
}
df2 = pd.DataFrame(people2)
df2
first last email
0 Tony Stark ironman@email.com
1 Steve Rogers cap@email.com

Let’s add the two dataframes together.

# add df2 to df
df.append(df2, ignore_index=True)
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
1 janedoe@email.com Jane Doe Jane Doe
2 johndoe@email.com John Doe John Doe
3 ironman@email.com NaN Tony Stark
4 cap@email.com NaN Steve Rogers

Since append has no inplace argument, we need to assign the result back to df to make the changes permanent.

df = df.append(df2, ignore_index=True)
df
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
1 janedoe@email.com Jane Doe Jane Doe
2 johndoe@email.com John Doe John Doe
3 ironman@email.com NaN Tony Stark
4 cap@email.com NaN Steve Rogers

Remove rows

To remove rows, we can pass in the indexes that we want to remove using the drop method. To apply the changes, use inplace=True.

# drop index 4
# to apply changes, use inplace=True
df.drop(index=4)
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
1 janedoe@email.com Jane Doe Jane Doe
2 johndoe@email.com John Doe John Doe
3 ironman@email.com NaN Tony Stark

To drop particular rows with a condition, you can pass in the indexes of the filter.

# drop rows where last name is Doe
# note the index attribute used at the end
df.drop(index=df[df['last'] == 'Doe'].index)
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
3 ironman@email.com NaN Tony Stark
4 cap@email.com NaN Steve Rogers
# or, same as above pulling the condition out
# this is more readable
filt = df['last'] == 'Doe'
df.drop(index=df[filt].index)
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
3 ironman@email.com NaN Tony Stark
4 cap@email.com NaN Steve Rogers
df.drop(['full_name'], axis=1, inplace=True)

Sorting

Sort columns

# sort by last name
df.sort_values(by='last')
email first last
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe
4 cap@email.com Steve Rogers
0 123pranav@email.com Pranav Shirole
3 ironman@email.com Tony Stark
# sort by descending order
df.sort_values(by='last', ascending=False)
email first last
3 ironman@email.com Tony Stark
0 123pranav@email.com Pranav Shirole
4 cap@email.com Steve Rogers
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe

When sorting on multiple columns, rows with identical values in the first column are then sorted by the second column.
So in the case below, the dataframe first sorts by last name, and then sorts by first name for the rows that share the last name Doe.

# pass a list to sort by
df.sort_values(by=['last', 'first'])
email first last
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe
4 cap@email.com Steve Rogers
0 123pranav@email.com Pranav Shirole
3 ironman@email.com Tony Stark
people3 = {
    'first': ['Pranav', 'Jane', 'John', 'Thor'], 
    'last': ['Shirole', 'Doe', 'Doe', 'Odinson'], 
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com', 'thor@email.com']
}
df = pd.DataFrame(people3)
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
3 Thor Odinson thor@email.com

We can also have one column sorted in ascending order and another in descending order.

# sort one column by ascending and another by descending
# last name in descending and first in ascending
# pass a list to ascending too
df.sort_values(by=['last', 'first'], ascending=[False, True], inplace=True)
df
first last email
0 Pranav Shirole 123pranav@email.com
3 Thor Odinson thor@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com

Here, we see that the indexes have changed in accordance with the sorted values.
We can set the indexes back to the default values using sort_index.

df.sort_index(inplace=True)
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
3 Thor Odinson thor@email.com

To sort only a single column, i.e. a Series, we can use sort_values.

# sort a single column (i.e. a single Series)
# just leave the arguments blank
df['last'].sort_values()
1        Doe
2        Doe
3    Odinson
0    Shirole
Name: last, dtype: object

Grouping and Aggregating

people4 = {
    'first': ['Pranav', 'Jane', 'John', 'Thor', 'Tony', 'Steve', 'Bruce', 'Clark'], 
    'last': ['Shirole', 'Doe', 'Doe', 'Odinson', 'Stark', 'Rogers', 'Wayne', 'Kent'], 
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com', 'thor@email.com', 
              'ironman@email.com', 'cap@email.com', 'thebatman@email.com', None],
    'age': [30, 35, 25, 90, 50, 75, 40, np.nan],
    'salary': [80000, 120000, 100000, 80000, 500000, 45000, 350000, None],
    'work': ['analyst', 'developer', 'developer', 'avenger', 'avenger', 'avenger', 'vigilante', 'superhero'],
    'country': ['India', 'India', 'India', 'USA', 'USA', 'USA', 'USA', 'Krypton']
}
df = pd.DataFrame(people4)
df
first last email age salary work country
0 Pranav Shirole 123pranav@email.com 30.0 80000.0 analyst India
1 Jane Doe janedoe@email.com 35.0 120000.0 developer India
2 John Doe johndoe@email.com 25.0 100000.0 developer India
3 Thor Odinson thor@email.com 90.0 80000.0 avenger USA
4 Tony Stark ironman@email.com 50.0 500000.0 avenger USA
5 Steve Rogers cap@email.com 75.0 45000.0 avenger USA
6 Bruce Wayne thebatman@email.com 40.0 350000.0 vigilante USA
7 Clark Kent None NaN NaN superhero Krypton

Aggregating

# look at mean salaries
# ignore the NaN values
df['salary'].mean()
182142.85714285713
# look at median salaries
# ignore the NaN values
df['salary'].median()
100000.0
# run the median function on the entire dataframe
# only numerical columns are used (newer pandas versions require numeric_only=True here)
df.median()
age           40.0
salary    100000.0
dtype: float64
# how many people have a salary
df['salary'].count()
7
# what kind of work do the people do
df['work']
0      analyst
1    developer
2    developer
3      avenger
4      avenger
5      avenger
6    vigilante
7    superhero
Name: work, dtype: object
# how many people are avengers 
df['work'].value_counts()
avenger      3
developer    2
vigilante    1
analyst      1
superhero    1
Name: work, dtype: int64

We can use the normalize=True argument for the value_counts method to view percentages.

# what percentage of people are avengers
df['work'].value_counts(normalize=True)
avenger      0.375
developer    0.250
vigilante    0.125
analyst      0.125
superhero    0.125
Name: work, dtype: float64

37.5% of the people are avengers. 25% are developers.

Grouping

Let’s group the people by their country.

df.groupby(['country'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f30d4c6d610>

This DataFrameGroupBy object contains a bunch of groups.
Let’s assign the GroupBy object to a variable so we can reuse it later.

# set GroupBy object as a variable
country_grp = df.groupby(['country'])
# get all the people from USA
country_grp.get_group('USA')
first last email age salary work country
3 Thor Odinson thor@email.com 90.0 80000.0 avenger USA
4 Tony Stark ironman@email.com 50.0 500000.0 avenger USA
5 Steve Rogers cap@email.com 75.0 45000.0 avenger USA
6 Bruce Wayne thebatman@email.com 40.0 350000.0 vigilante USA

We can also use a filter to perform the same task as above.

filt = df['country'] == 'USA'
df.loc[filt]
first last email age salary work country
3 Thor Odinson thor@email.com 90.0 80000.0 avenger USA
4 Tony Stark ironman@email.com 50.0 500000.0 avenger USA
5 Steve Rogers cap@email.com 75.0 45000.0 avenger USA
6 Bruce Wayne thebatman@email.com 40.0 350000.0 vigilante USA
# let's check work by country, here for the USA
df.loc[filt]['work'].value_counts()
avenger      3
vigilante    1
Name: work, dtype: int64
# let's check the same data as above for India
filt = df['country'] == 'India'
df.loc[filt]
first last email age salary work country
0 Pranav Shirole 123pranav@email.com 30.0 80000.0 analyst India
1 Jane Doe janedoe@email.com 35.0 120000.0 developer India
2 John Doe johndoe@email.com 25.0 100000.0 developer India
df.loc[filt]['work'].value_counts()
developer    2
analyst      1
Name: work, dtype: int64
country_grp['work'].value_counts().loc['India']
work
developer    2
analyst      1
Name: work, dtype: int64

The above code (using the DataFrameGroupBy object) is useful because we can run a single expression for every country in the dataset, instead of writing a separate filter for each country.
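The groups can also be iterated over directly, which the tutorial doesn’t show; each iteration yields the group label and its sub-DataFrame. A sketch with a toy subset of the data:

```python
import pandas as pd

# toy data, a small subset of the tutorial's columns
df = pd.DataFrame({
    'country': ['India', 'India', 'USA'],
    'salary': [80000, 120000, 500000],
})

# each iteration yields (group label, sub-DataFrame for that group)
medians = {}
for name, group in df.groupby('country'):
    medians[name] = group['salary'].median()

print(medians)
```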

country_grp['work'].value_counts().loc['USA']
work
avenger      3
vigilante    1
Name: work, dtype: int64
# median salaries by country
country_grp['salary'].median()
country
India      100000.0
Krypton         NaN
USA        215000.0
Name: salary, dtype: float64
# median salary for India
country_grp['salary'].median().loc['India']
100000.0

Multiple aggregate functions

We can also get multiple aggregate functions at once using agg.

country_grp['salary'].agg(['median', 'mean'])
median mean
country
India 100000.0 100000.0
Krypton NaN NaN
USA 215000.0 243750.0
# for USA
country_grp['salary'].agg(['median', 'mean']).loc['USA']
median    215000.0
mean      243750.0
Name: USA, dtype: float64
# for USA and India
country_grp['salary'].agg(['median', 'mean']).loc[['USA', 'India']]
median mean
country
USA 215000.0 243750.0
India 100000.0 100000.0
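agg also accepts a dictionary mapping column names to aggregation functions, so different columns can use different aggregations. A sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['India', 'India', 'USA'],
    'salary': [80000, 120000, 500000],
    'age': [30, 35, 50],
})

# median of salary, mean of age, computed per country
result = df.groupby('country').agg({'salary': 'median', 'age': 'mean'})
print(result)
```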

We can calculate sums using sum. It works on numbers as well as Boolean data types (where True counts as 1 and False as 0).

# how many people are developers in India
filt = df['country'] == 'India'
df.loc[filt]['work'].str.contains('deve').sum()
2
# how many people are avengers in each country
avengers = country_grp['work'].apply(lambda x: x.str.contains('aven').sum())
avengers
country
India      0
Krypton    0
USA        3
Name: work, dtype: int64
# total number of people from each country
country_respondents = df['country'].value_counts()
country_respondents
USA        4
India      3
Krypton    1
Name: country, dtype: int64

We can combine more than one Series together using concat.

# how many people are avengers, group by country
avengers_df = pd.concat([country_respondents, avengers], axis='columns', sort=True)
avengers_df
country work
India 3 0
Krypton 1 0
USA 4 3
# rename columns appropriately
avengers_df.rename(columns={'country':'number of people', 'work':'are avengers'}, inplace=True)
avengers_df
number of people are avengers
India 3 0
Krypton 1 0
USA 4 3
# add another column for the percentage of people who are avengers
avengers_df['pct are avengers'] = (avengers_df['are avengers']/avengers_df['number of people'])*100
avengers_df
number of people are avengers pct are avengers
India 3 0 0.0
Krypton 1 0 0.0
USA 4 3 75.0
# sort countries by the largest number of people who are avengers
avengers_df.sort_values(by='pct are avengers', ascending=False, inplace=True)
avengers_df
number of people are avengers pct are avengers
USA 4 3 75.0
India 3 0 0.0
Krypton 1 0 0.0
# for USA
avengers_df.loc['USA']
number of people     4.0
are avengers         3.0
pct are avengers    75.0
Name: USA, dtype: float64

Cleaning Data

people5 = {
    'first': ['Pranav', 'Jane', 'John', 'Bruce', np.nan, None, 'NA'], 
    'last': ['Shirole', 'Doe', 'Doe', 'Wayne', np.nan, None, 'Missing'], 
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com', 
              None, np.nan, 'anonymous@email.com', 'Missing'],
    'age': ['32', '38', '40', '45', None, None, 'Missing']
}
df = pd.DataFrame(people5)
df
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne None 45
4 NaN NaN NaN None
5 None None anonymous@email.com None
6 NA Missing Missing Missing
# drop missing values
df.dropna()
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
6 NA Missing Missing Missing

Two of the default arguments of dropna are as follows:

df.dropna(axis='index', how='any')

The axis argument can be set to either index or columns. index tells pandas to drop rows that have missing values; columns tells pandas to drop columns with NA values. The how argument can be set to either any or all. any drops rows with one or more missing values; all drops only rows in which every value is missing.

df.dropna(axis='columns', how='all')
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne None 45
4 NaN NaN NaN None
5 None None anonymous@email.com None
6 NA Missing Missing Missing
df.dropna(axis='columns', how='any')
0
1
2
3
4
5
6

We get an empty dataframe because any columns with even a single missing value are dropped.
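Besides how, dropna also accepts a thresh argument: keep only the rows (or columns) that have at least that many non-missing values. A small sketch with toy data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'first': ['Jane', np.nan, np.nan],
    'last': ['Doe', 'Wayne', np.nan],
    'email': ['janedoe@email.com', np.nan, np.nan],
})

# keep rows that have at least 2 non-NA values
# row 0 has 3, row 1 has 1, row 2 has 0, so only row 0 survives
print(df.dropna(thresh=2))
```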

# get only rows that do not have a missing email address
df.dropna(axis='index', how='any', subset=['email'])
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
5 None None anonymous@email.com None
6 NA Missing Missing Missing
# drop all rows that don't have a last name or an email
df.dropna(axis='index', how='all', subset=['last', 'email']) 
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne None 45
5 None None anonymous@email.com None
6 NA Missing Missing Missing
# replace "NA" and "Missing" values
df.replace('NA', np.nan, inplace=True)
df.replace('Missing', np.nan, inplace=True)
df
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne None 45
4 NaN NaN NaN None
5 None None anonymous@email.com None
6 NaN NaN NaN NaN
df.dropna()
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
df.dropna(axis='index', how='all', subset=['last', 'email']) 
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne None 45
5 None None anonymous@email.com None

You can see which values would and would not be treated as NA by using the isna method.

# check which values are missing
df.isna()
first last email age
0 False False False False
1 False False False False
2 False False False False
3 False False True False
4 True True True True
5 True True False True
6 True True True True

To substitute NA values with a specific value, use the fillna method.

# fill NA values with another value
df.fillna('MISSING')
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne MISSING 45
4 MISSING MISSING MISSING MISSING
5 MISSING MISSING anonymous@email.com MISSING
6 MISSING MISSING MISSING MISSING

Check the data types using the dtypes attribute.

df.dtypes
first    object
last     object
email    object
age      object
dtype: object

So if we wanted the average age, it wouldn’t work with the current object data type.

type(np.nan)
float

As you can see above, NaN values have the data type float. This means that if your dataset has NaN values and you want to perform some math on the numbers, you need to convert the column’s data type to float (not int). Another option would be to convert the missing values into another number, such as 0, and then convert the data type to int, but in most cases that would be a bad idea (e.g. when you want to find an average).
Note: If you have a dataframe in which all values share the same data type and you want to convert every column at once to another data type, you can use the astype method of the DataFrame object. For example, you can convert all int columns to float using df.astype(float).
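The note above can be sketched quickly with a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# convert every column at once from int to float
df_float = df.astype(float)
print(df_float.dtypes)
```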

# convert age dtype to float
df['age'] = df['age'].astype(float)
df.dtypes
first     object
last      object
email     object
age      float64
dtype: object
# take average of the age column
df['age'].mean()
38.75