Pandas tutorial

Exploring the basics of the Pandas library
tutorial

Published: November 21, 2021

Pandas is an open-source library built on top of the NumPy library. It is used for working with data sets, i.e. analyzing, cleaning, exploring, and manipulating data.

Creating a Dataframe

# a Python dictionary
person = {
    'first': 'Pranav', 
    'last': 'Shirole', 
    'email': '123pranav@email.com'
}
people = {
    'first': ['Pranav', 'Jane', 'John'], 
    'last': ['Shirole', 'Doe', 'Doe'], 
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com']
}
# view all emails
people['email']
['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com']

Let’s create a dataframe from the above dictionary.

import pandas as pd
import numpy as np 
# create a dataframe
df = pd.DataFrame(people)
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
# access column of a dataframe
df['email']
0    123pranav@email.com
1      janedoe@email.com
2      johndoe@email.com
Name: email, dtype: object

The above code returns a Series object. A Series has an index, which you can see on the left (0, 1, 2).

df.email
0    123pranav@email.com
1      janedoe@email.com
2      johndoe@email.com
Name: email, dtype: object

It’s better to use the brackets rather than the dot notation because if a column name is the same as an existing DataFrame method or attribute, dot notation will return that method or attribute instead of the column’s data.
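A quick sketch of the collision problem, using a hypothetical column named count (which clashes with the DataFrame.count method):

```python
import pandas as pd

# hypothetical example: a column named 'count' shadows the DataFrame.count method
df = pd.DataFrame({'count': [1, 2, 3]})

# dot notation returns the bound method, not the column
print(type(df.count))

# bracket notation returns the actual column, a Series
print(df['count'])
```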

type(df['email'])
pandas.core.series.Series

A Series is rows of data of a single column. It’s a 1-D array.
A DataFrame is rows and columns of data, a 2-D array. It’s a container of multiple Series objects.
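A Series can also be built directly with pd.Series, which the tutorial doesn’t show; a minimal sketch (the index labels 'a' and 'b' are arbitrary):

```python
import pandas as pd

# build a Series directly, with a custom index
s = pd.Series(['123pranav@email.com', 'janedoe@email.com'], index=['a', 'b'])
print(s.loc['a'])

# a DataFrame can be thought of as a dict of Series sharing one index
df = pd.DataFrame({'email': s})
print(type(df['email']))
```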

# access multiple columns
# pass a list of columns inside the brackets
df[['last', 'email']]
last email
0 Shirole 123pranav@email.com
1 Doe janedoe@email.com
2 Doe johndoe@email.com

The above code returns a filtered-down DataFrame.

# this is no longer a Series since we selected multiple columns
type(df[['last', 'email']])
pandas.core.frame.DataFrame
# see all columns
df.columns
Index(['first', 'last', 'email'], dtype='object')

Get rows and columns using loc and iloc

iloc

iloc allows us to access rows by integer location.

# get first row
# returns a Series that contains the values of the first row of data
df.iloc[0]
first                 Pranav
last                 Shirole
email    123pranav@email.com
Name: 0, dtype: object

The above code returns a Series that contains the values of the first row of data. Note that when accessing a row, the resulting Series is indexed by the column names.

# get first and second row / multiple rows
# returns a DataFrame
df.iloc[[0, 1]]
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com

The above code returns a DataFrame.

We can also select columns with loc and iloc. The rows will be the first value and the columns will be the second value that we pass inside the brackets. So if we think of loc and iloc as functions, the rows are the first argument and the columns are the second argument.

# grab the email address of the first two rows
# index of email column will be 2 since it's the third column
df.iloc[[0, 1], 2]
0    123pranav@email.com
1      janedoe@email.com
Name: email, dtype: object

loc

For loc, we search by labels.

df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
# get the first row (similar to iloc)
df.loc[0]
first                 Pranav
last                 Shirole
email    123pranav@email.com
Name: 0, dtype: object
# get first two rows (similar to iloc)
df.loc[[0, 1]]
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
# grab the email address of the first two rows
df.loc[[0, 1], 'email']
0    123pranav@email.com
1      janedoe@email.com
Name: email, dtype: object

We can also pass in a list of columns with loc and iloc.

# get last name and email of first two rows
# the output columns follow the order in which you list them, regardless of their order in the main dataframe
df.loc[[0, 1], ['email', 'last']]
email last
0 123pranav@email.com Shirole
1 janedoe@email.com Doe
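One difference worth knowing: slices behave differently in the two accessors. loc slices are label-based and include the end label, while iloc slices are position-based and exclude the end position, like regular Python lists. A quick check (the dataframe is rebuilt here so the snippet is self-contained):

```python
import pandas as pd

people = {
    'first': ['Pranav', 'Jane', 'John'],
    'last': ['Shirole', 'Doe', 'Doe'],
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com'],
}
df = pd.DataFrame(people)

# loc slices are label-based and INCLUDE the end label: rows 0 and 1
print(len(df.loc[0:1]))

# iloc slices are position-based and EXCLUDE the end position: row 0 only
print(len(df.iloc[0:1]))
```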

Indexes

df['email']
0    123pranav@email.com
1      janedoe@email.com
2      johndoe@email.com
Name: email, dtype: object

We can set the email column as the index for the dataframe. The dataframe doesn’t actually change unless you use inplace=True, which is nice since it lets us experiment with our dataset.

# set email column as the index for the dataframe
df.set_index('email', inplace=True)
df
first last
email
123pranav@email.com Pranav Shirole
janedoe@email.com Jane Doe
johndoe@email.com John Doe
df.index
Index(['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com'], dtype='object', name='email')

Why would changing the index be useful?
Because it enables us to see all the information on someone just by using their email.

# see all information on someone by using email
df.loc['123pranav@email.com']
first     Pranav
last     Shirole
Name: 123pranav@email.com, dtype: object
# get just the last name from the email
df.loc['123pranav@email.com', 'last']
'Shirole'

Note that once you change the index, you cannot use the default index to locate your rows. But you can still use iloc.

# The below code will now result in an error since the index has changed
# df.loc[0]

# but you can still use iloc
df.iloc[0]
first     Pranav
last     Shirole
Name: 123pranav@email.com, dtype: object

You can also reset the index back to the default.

# reset index
df.reset_index(inplace=True)
df
email first last
0 123pranav@email.com Pranav Shirole
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe

Filtering

You can filter specific data from the dataset using pandas.

# get everyone with the last name Doe
df['last'] == 'Doe'
0    False
1     True
2     True
Name: last, dtype: bool

We get a Series object with Boolean values. True values are the ones that met our filter criteria. False values are the ones that did not meet our filter criteria.

filt = (df['last'] == 'Doe')
df[filt]
email first last
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe
# this also works without setting a variable
df[df['last'] == 'Doe']
email first last
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe

You can filter data using loc by passing in a series of Boolean values.

# this also does the same thing as above
df.loc[filt]
email first last
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe

This is great because we can still grab data for specific columns as well.

# get only email IDs for filter
# remember: df.loc[rows, cols]
df.loc[filt, 'email']
1    janedoe@email.com
2    johndoe@email.com
Name: email, dtype: object
# get all email IDs where last name is Doe and first name is John
filt = (df['last'] == 'Doe') & (df['first'] == 'John')
df.loc[filt, 'email']
2    johndoe@email.com
Name: email, dtype: object
# get all email IDs where last name is Shirole or first name is John
filt = (df['last'] == 'Shirole') | (df['first'] == 'John')
df.loc[filt, 'email']
0    123pranav@email.com
2      johndoe@email.com
Name: email, dtype: object

You can get the opposite of a filter using ~.

# get email IDs where last name is not Shirole or first name is not John
df.loc[~filt, 'email']
1    janedoe@email.com
Name: email, dtype: object
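Another handy way to build a filter, not shown above, is the isin method, which checks each value against a list. A sketch on the same people data (rebuilt so the snippet runs on its own; 'Stark' is just a hypothetical extra value to match):

```python
import pandas as pd

people = {
    'first': ['Pranav', 'Jane', 'John'],
    'last': ['Shirole', 'Doe', 'Doe'],
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com'],
}
df = pd.DataFrame(people)

# isin builds a Boolean mask that is True wherever the value appears in the list
filt = df['last'].isin(['Doe', 'Stark'])
print(df.loc[filt, 'email'].tolist())
```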

Updating Rows and Columns

Updating Columns

df = pd.DataFrame(people)
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
# all columns
df.columns
Index(['first', 'last', 'email'], dtype='object')
# rename all columns at once
df.columns = ['first name', 'last name', 'email']
df
first name last name email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
# upper case all column names
df.columns = [x.upper() for x in df.columns]
df
FIRST NAME LAST NAME EMAIL
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
# remove spaces in column names or replace them with something else
df.columns = df.columns.str.replace(' ', '_')
df
FIRST_NAME LAST_NAME EMAIL
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
# lower case all column names
df.columns = [x.lower() for x in df.columns]
df
first_name last_name email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com

We can rename only some columns by passing in a dictionary of column names. We need to include inplace=True for the changes to take effect.

# key: old_name , value: new_name
df.rename(columns = {'first_name': 'first', 'last_name': 'last'}, inplace=True)
df 
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com

Updating Rows

# get row with name John Doe
df.loc[2]
first                 John
last                   Doe
email    johndoe@email.com
Name: 2, dtype: object
# change all values for a row
df.loc[2] = ['John', 'Smith', 'JohnSmith@email.com']
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Smith JohnSmith@email.com
# get only two columns of one row
df.loc[2, ['last', 'email']]
last                   Smith
email    JohnSmith@email.com
Name: 2, dtype: object
# change only certain values for a row
df.loc[2, ['last', 'email']] = ['Doe', 'JohnDoe@email.com']
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe JohnDoe@email.com
# change a single value
df.loc[2, 'last'] = 'Smith'
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Smith JohnDoe@email.com

The at accessor can be used to look up or change a single value. But for consistency you can, and arguably should, just use loc.

df.at[2, 'last'] = 'Doe'
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe JohnDoe@email.com
filt = (df['email'] == 'JohnDoe@email.com')
df[filt]
first last email
2 John Doe JohnDoe@email.com
# change John Doe's last name to Smith
df[filt]['last']
2    Doe
Name: last, dtype: object
# this gives a warning and does not make the change
# chained indexing like this assigns to a copy of the dataframe, not the original
df[filt]['last'] = 'Smith'
df
/tmp/ipykernel_183/3199811375.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[filt]['last'] = 'Smith'
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe JohnDoe@email.com

Just use loc or at when setting values.

df.loc[filt, 'last'] = 'Smith'
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Smith JohnDoe@email.com
# lowercase all emails
# this does not actually make the change
df['email'].str.lower()
0    123pranav@email.com
1      janedoe@email.com
2      johndoe@email.com
Name: email, dtype: object
# this makes the change
df['email'] = df['email'].str.lower()
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Smith johndoe@email.com

Other methods to update

apply, map, applymap, replace

apply

apply is used for calling a function on our values. It works on both a Series and a DataFrame.

Using apply on a Series.

# check length of email addresses 
df['email'].apply(len)
0    19
1    17
2    17
Name: email, dtype: int64
# make email uppercase
def update_email(email):
    return email.upper()
df['email'].apply(update_email)
0    123PRANAV@EMAIL.COM
1      JANEDOE@EMAIL.COM
2      JOHNDOE@EMAIL.COM
Name: email, dtype: object
df['email'] = df['email'].apply(update_email)
df
first last email
0 Pranav Shirole 123PRANAV@EMAIL.COM
1 Jane Doe JANEDOE@EMAIL.COM
2 John Smith JOHNDOE@EMAIL.COM

lambda functions are anonymous functions, without a name.

# convert emails back to lowercase
df['email'] = df['email'].apply(lambda x: x.lower())
df 
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Smith johndoe@email.com

Using apply on a DataFrame.

# running apply on a Series returns a Series
df['email'].apply(len)
0    19
1    17
2    17
Name: email, dtype: int64
# applies the len function to all the Series (i.e. columns) in the DataFrame
df.apply(len)
first    3
last     3
email    3
dtype: int64

The above code returns the number of rows in each column.

# the apply above does the same as this, but for every column
len(df['email'])
3
# count across columns, i.e. sideways (how many columns for each index)
df.apply(len, axis='columns')
0    3
1    3
2    3
dtype: int64
# grab minimum value for a Series
df.apply(pd.Series.min) # will choose alphabetically
first                   Jane
last                     Doe
email    123pranav@email.com
dtype: object

The above code returns the lowest value alphabetically in the first, last and email columns respectively.

# same as above (the lambda receives each column as a Series)
df.apply(lambda x: x.min())
first                   Jane
last                     Doe
email    123pranav@email.com
dtype: object

This is usually more useful when your dataframe contains numerical data.

Running apply on a Series applies a function to every value in the Series. Running apply on a DataFrame applies a function to every Series in the DataFrame.

applymap

applymap applies a function to every individual element in the DataFrame. It only works on DataFrames. (In newer versions of pandas, applymap has been deprecated in favor of the DataFrame.map method.)

df.applymap(len)
first last email
0 6 7 19
1 4 3 17
2 4 5 17
df.applymap(str.lower)
first last email
0 pranav shirole 123pranav@email.com
1 jane doe janedoe@email.com
2 john smith johndoe@email.com

map

The map method is used for substituting each value in a Series with another value. It only works on a Series.

# substitute the first names
df['first'].map({'Corey': 'Chris', 'Jane': 'Mary'})
0     NaN
1    Mary
2     NaN
Name: first, dtype: object

Note that the values we didn’t substitute were converted to NaN values, which may not be what we want.
If you want to keep the other names, use the replace method.

replace

df['first'].replace({'Corey': 'Chris', 'Jane': 'Mary'})
0    Pranav
1      Mary
2      John
Name: first, dtype: object
df['first'] = df['first'].replace({'Corey': 'Chris', 'Jane': 'Mary'})
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Mary Doe janedoe@email.com
2 John Smith johndoe@email.com

Add/Remove Rows and Columns

df = pd.DataFrame(people)
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com

Add column

# combine the name columns
df['first'] + ' ' + df['last']
0    Pranav Shirole
1          Jane Doe
2          John Doe
dtype: object
# add a column named full_name
df['full_name'] = df['first'] + ' ' + df['last']
df
first last email full_name
0 Pranav Shirole 123pranav@email.com Pranav Shirole
1 Jane Doe janedoe@email.com Jane Doe
2 John Doe johndoe@email.com John Doe

Drop column

# drop the columns named first and last
df.drop(columns=['first', 'last'], inplace=True)
df
email full_name
0 123pranav@email.com Pranav Shirole
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe

We can also reverse this process and split the full_name column into two different columns.

# split full_name into two columns
# split where there is a space
df['full_name'].str.split(' ')
0    [Pranav, Shirole]
1          [Jane, Doe]
2          [John, Doe]
Name: full_name, dtype: object

The result of the above code is, for each row, a list in which the first name is the first value and the last name is the second value.
To assign these values to two different columns, we need to expand the lists using the expand argument.

# assign list values to two different columns
df['full_name'].str.split(' ', expand=True)
0 1
0 Pranav Shirole
1 Jane Doe
2 John Doe

We get two columns of the split result.

df[['first', 'last']] = df['full_name'].str.split(' ', expand=True)
df
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
1 janedoe@email.com Jane Doe Jane Doe
2 johndoe@email.com John Doe John Doe

We have added the first and last columns with the values of the list.

Add rows

You can add a new row using append. Make sure to use ignore_index=True as an argument.

# add a single row using append
# use ignore_index=True
df.append({'first': 'Tony'}, ignore_index=True)
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
1 janedoe@email.com Jane Doe Jane Doe
2 johndoe@email.com John Doe John Doe
3 NaN NaN Tony NaN

Since we only assigned a first name to our row, the other values in the columns are NaN (not a number), used for missing values.
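Note that append was deprecated in pandas 1.4 and removed in pandas 2.0; in newer versions the same row addition is done with pd.concat. A minimal sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'first': ['Jane'], 'last': ['Doe']})
new_row = pd.DataFrame({'first': ['Tony'], 'last': ['Stark']})

# concatenate the two frames and renumber the index, like append(ignore_index=True)
df = pd.concat([df, new_row], ignore_index=True)
print(df)
```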

Let’s create another dataframe.

people2 = {
    'first': ['Tony', 'Steve'], 
    'last': ['Stark', 'Rogers'], 
    'email': ['ironman@email.com', 'cap@email.com']
}
df2 = pd.DataFrame(people2)
df2
first last email
0 Tony Stark ironman@email.com
1 Steve Rogers cap@email.com

Let’s add the two dataframes together.

# add df2 to df
df.append(df2, ignore_index=True)
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
1 janedoe@email.com Jane Doe Jane Doe
2 johndoe@email.com John Doe John Doe
3 ironman@email.com NaN Tony Stark
4 cap@email.com NaN Steve Rogers

Since append has no inplace argument, we need to assign the result back to df to make the changes permanent.

df = df.append(df2, ignore_index=True)
df
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
1 janedoe@email.com Jane Doe Jane Doe
2 johndoe@email.com John Doe John Doe
3 ironman@email.com NaN Tony Stark
4 cap@email.com NaN Steve Rogers

Remove rows

To remove rows, we can pass in the indexes that we want to remove using the drop method. To apply the changes, use inplace=True.

# drop index 4
# to apply changes, use inplace=True
df.drop(index=4)
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
1 janedoe@email.com Jane Doe Jane Doe
2 johndoe@email.com John Doe John Doe
3 ironman@email.com NaN Tony Stark

To drop particular rows with a condition, you can pass in the indexes of the filter.

# drop rows where last name is Doe
# note the index attribute used at the end
df.drop(index=df[df['last'] == 'Doe'].index)
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
3 ironman@email.com NaN Tony Stark
4 cap@email.com NaN Steve Rogers
# or, same as above pulling the condition out
# this is more readable
filt = df['last'] == 'Doe'
df.drop(index=df[filt].index)
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
3 ironman@email.com NaN Tony Stark
4 cap@email.com NaN Steve Rogers
df.drop(['full_name'], axis=1, inplace=True)

Sorting

Sort columns

# sort by last name
df.sort_values(by='last')
email first last
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe
4 cap@email.com Steve Rogers
0 123pranav@email.com Pranav Shirole
3 ironman@email.com Tony Stark
# sort by descending order
df.sort_values(by='last', ascending=False)
email first last
3 ironman@email.com Tony Stark
0 123pranav@email.com Pranav Shirole
4 cap@email.com Steve Rogers
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe

When sorting on multiple columns, rows with identical values in the first column are then sorted by the second column.
So in the case below, the dataframe first sorts by last name, and then sorts by first name for the rows that share the last name Doe.

# pass a list to sort by
df.sort_values(by=['last', 'first'])
email first last
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe
4 cap@email.com Steve Rogers
0 123pranav@email.com Pranav Shirole
3 ironman@email.com Tony Stark
people3 = {
    'first': ['Pranav', 'Jane', 'John', 'Thor'], 
    'last': ['Shirole', 'Doe', 'Doe', 'Odinson'], 
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com', 'thor@email.com']
}
df = pd.DataFrame(people3)
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
3 Thor Odinson thor@email.com

We can also have one column sorted in ascending order and another in descending order.

# sort one column by ascending and another by descending
# last name in descending and first in ascending
# pass a list to ascending too
df.sort_values(by=['last', 'first'], ascending=[False, True], inplace=True)
df
first last email
0 Pranav Shirole 123pranav@email.com
3 Thor Odinson thor@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com

Here, we see that the indexes have changed in accordance with the sorted values.
We can set the indexes back to the default values using sort_index.

df.sort_index(inplace=True)
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
3 Thor Odinson thor@email.com

To sort only a single column, i.e. a Series, we can use sort_values.

# sort a single column (i.e. a single Series)
# just leave the arguments blank
df['last'].sort_values()
1        Doe
2        Doe
3    Odinson
0    Shirole
Name: last, dtype: object

Grouping and Aggregating

people4 = {
    'first': ['Pranav', 'Jane', 'John', 'Thor', 'Tony', 'Steve', 'Bruce', 'Clark'], 
    'last': ['Shirole', 'Doe', 'Doe', 'Odinson', 'Stark', 'Rogers', 'Wayne', 'Kent'], 
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com', 'thor@email.com', 
              'ironman@email.com', 'cap@email.com', 'thebatman@email.com', None],
    'age': [30, 35, 25, 90, 50, 75, 40, np.nan],
    'salary': [80000, 120000, 100000, 80000, 500000, 45000, 350000, None],
    'work': ['analyst', 'developer', 'developer', 'avenger', 'avenger', 'avenger', 'vigilante', 'superhero'],
    'country': ['India', 'India', 'India', 'USA', 'USA', 'USA', 'USA', 'Krypton']
}
df = pd.DataFrame(people4)
df
first last email age salary work country
0 Pranav Shirole 123pranav@email.com 30.0 80000.0 analyst India
1 Jane Doe janedoe@email.com 35.0 120000.0 developer India
2 John Doe johndoe@email.com 25.0 100000.0 developer India
3 Thor Odinson thor@email.com 90.0 80000.0 avenger USA
4 Tony Stark ironman@email.com 50.0 500000.0 avenger USA
5 Steve Rogers cap@email.com 75.0 45000.0 avenger USA
6 Bruce Wayne thebatman@email.com 40.0 350000.0 vigilante USA
7 Clark Kent None NaN NaN superhero Krypton

Aggregating

# look at mean salaries
# ignore the NaN values
df['salary'].mean()
182142.85714285713
# look at median salaries
# ignore the NaN values
df['salary'].median()
100000.0
# run the median function on the entire dataframe
# only numerical columns are used (newer pandas versions require numeric_only=True here)
df.median()
age           40.0
salary    100000.0
dtype: float64
# how many people have a salary
df['salary'].count()
7
# what kind of work do the people do
df['work']
0      analyst
1    developer
2    developer
3      avenger
4      avenger
5      avenger
6    vigilante
7    superhero
Name: work, dtype: object
# how many people are avengers 
df['work'].value_counts()
avenger      3
developer    2
vigilante    1
analyst      1
superhero    1
Name: work, dtype: int64

We can use the normalize=True argument for the value_counts method to view percentages.

# what percentage of people are avengers
df['work'].value_counts(normalize=True)
avenger      0.375
developer    0.250
vigilante    0.125
analyst      0.125
superhero    0.125
Name: work, dtype: float64

37.5% of the people are avengers. 25% are developers.

Grouping

Let’s group the people by their country.

df.groupby(['country'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f30d4c6d610>

This DataFrameGroupBy object contains a bunch of groups.
Let’s assign the GroupBy object to a variable so we can reuse it later.

# set GroupBy object as a variable
country_grp = df.groupby(['country'])
# get all the people from USA
country_grp.get_group('USA')
first last email age salary work country
3 Thor Odinson thor@email.com 90.0 80000.0 avenger USA
4 Tony Stark ironman@email.com 50.0 500000.0 avenger USA
5 Steve Rogers cap@email.com 75.0 45000.0 avenger USA
6 Bruce Wayne thebatman@email.com 40.0 350000.0 vigilante USA

We can also use a filter to perform the same task as above.

filt = df['country'] == 'USA'
df.loc[filt]
first last email age salary work country
3 Thor Odinson thor@email.com 90.0 80000.0 avenger USA
4 Tony Stark ironman@email.com 50.0 500000.0 avenger USA
5 Steve Rogers cap@email.com 75.0 45000.0 avenger USA
6 Bruce Wayne thebatman@email.com 40.0 350000.0 vigilante USA
# let's check work by country, here for the USA
df.loc[filt]['work'].value_counts()
avenger      3
vigilante    1
Name: work, dtype: int64
# let's check the same data as above for India
filt = df['country'] == 'India'
df.loc[filt]
first last email age salary work country
0 Pranav Shirole 123pranav@email.com 30.0 80000.0 analyst India
1 Jane Doe janedoe@email.com 35.0 120000.0 developer India
2 John Doe johndoe@email.com 25.0 100000.0 developer India
df.loc[filt]['work'].value_counts()
developer    2
analyst      1
Name: work, dtype: int64
country_grp['work'].value_counts().loc['India']
work
developer    2
analyst      1
Name: work, dtype: int64

The above code (using the DataFrameGroupBy object) is useful because we can run a single expression for every country in the dataset, instead of writing a separate filter for each country.
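The groups can also be iterated over directly, which the tutorial doesn’t show; each iteration yields the group label and its sub-DataFrame. A sketch with a toy subset of the data:

```python
import pandas as pd

# toy data, a small subset of the tutorial's columns
df = pd.DataFrame({
    'country': ['India', 'India', 'USA'],
    'salary': [80000, 120000, 500000],
})

# each iteration yields (group label, sub-DataFrame for that group)
medians = {}
for name, group in df.groupby('country'):
    medians[name] = group['salary'].median()

print(medians)
```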

country_grp['work'].value_counts().loc['USA']
work
avenger      3
vigilante    1
Name: work, dtype: int64
# median salaries by country
country_grp['salary'].median()
country
India      100000.0
Krypton         NaN
USA        215000.0
Name: salary, dtype: float64
# median salary for India
country_grp['salary'].median().loc['India']
100000.0

Multiple aggregate functions

We can also get multiple aggregate functions at once using agg.

country_grp['salary'].agg(['median', 'mean'])
median mean
country
India 100000.0 100000.0
Krypton NaN NaN
USA 215000.0 243750.0
# for USA
country_grp['salary'].agg(['median', 'mean']).loc['USA']
median    215000.0
mean      243750.0
Name: USA, dtype: float64
# for USA and India
country_grp['salary'].agg(['median', 'mean']).loc[['USA', 'India']]
median mean
country
USA 215000.0 243750.0
India 100000.0 100000.0
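agg also accepts a dictionary mapping column names to aggregation functions, so different columns can use different aggregations. A sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['India', 'India', 'USA'],
    'salary': [80000, 120000, 500000],
    'age': [30, 35, 50],
})

# median of salary, mean of age, computed per country
result = df.groupby('country').agg({'salary': 'median', 'age': 'mean'})
print(result)
```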

We can calculate sums using sum. It works on numbers as well as Boolean data types (where True counts as 1 and False as 0).

# how many people are developers in India
filt = df['country'] == 'India'
df.loc[filt]['work'].str.contains('deve').sum()
2
# how many people are avengers in each country
avengers = country_grp['work'].apply(lambda x: x.str.contains('aven').sum())
avengers
country
India      0
Krypton    0
USA        3
Name: work, dtype: int64
# total number of people from each country
country_respondents = df['country'].value_counts()
country_respondents
USA        4
India      3
Krypton    1
Name: country, dtype: int64

We can combine more than one Series together using concat.

# how many people are avengers, group by country
avengers_df = pd.concat([country_respondents, avengers], axis='columns', sort=True)
avengers_df
country work
India 3 0
Krypton 1 0
USA 4 3
# rename columns appropriately
avengers_df.rename(columns={'country':'number of people', 'work':'are avengers'}, inplace=True)
avengers_df
number of people are avengers
India 3 0
Krypton 1 0
USA 4 3
# add another column for the percentage of people who are avengers
avengers_df['pct are avengers'] = (avengers_df['are avengers']/avengers_df['number of people'])*100
avengers_df
number of people are avengers pct are avengers
India 3 0 0.0
Krypton 1 0 0.0
USA 4 3 75.0
# sort countries by the largest number of people who are avengers
avengers_df.sort_values(by='pct are avengers', ascending=False, inplace=True)
avengers_df
number of people are avengers pct are avengers
USA 4 3 75.0
India 3 0 0.0
Krypton 1 0 0.0
# for USA
avengers_df.loc['USA']
number of people     4.0
are avengers         3.0
pct are avengers    75.0
Name: USA, dtype: float64

Cleaning Data

people5 = {
    'first': ['Pranav', 'Jane', 'John', 'Bruce', np.nan, None, 'NA'], 
    'last': ['Shirole', 'Doe', 'Doe', 'Wayne', np.nan, None, 'Missing'], 
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com', 
              None, np.nan, 'anonymous@email.com', 'Missing'],
    'age': ['32', '38', '40', '45', None, None, 'Missing']
}
df = pd.DataFrame(people5)
df
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne None 45
4 NaN NaN NaN None
5 None None anonymous@email.com None
6 NA Missing Missing Missing
# drop missing values
df.dropna()
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
6 NA Missing Missing Missing

Two of the default arguments of dropna are as follows:

df.dropna(axis='index', how='any')

The axis argument can be set to either index or columns. index tells pandas to drop rows that have missing values; columns tells pandas to drop columns with NA values. The how argument can be set to either any or all. any drops rows with one or more missing values; all drops only rows in which every value is missing.

df.dropna(axis='columns', how='all')
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne None 45
4 NaN NaN NaN None
5 None None anonymous@email.com None
6 NA Missing Missing Missing
df.dropna(axis='columns', how='any')
0
1
2
3
4
5
6

We get an empty dataframe because any columns with even a single missing value are dropped.
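Besides how, dropna also accepts a thresh argument: keep only the rows (or columns) that have at least that many non-missing values. A small sketch with toy data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'first': ['Jane', np.nan, np.nan],
    'last': ['Doe', 'Wayne', np.nan],
    'email': ['janedoe@email.com', np.nan, np.nan],
})

# keep rows that have at least 2 non-NA values
# row 0 has 3, row 1 has 1, row 2 has 0, so only row 0 survives
print(df.dropna(thresh=2))
```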

# get only rows that do not have a missing email address
df.dropna(axis='index', how='any', subset=['email'])
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
5 None None anonymous@email.com None
6 NA Missing Missing Missing
# drop all rows that don't have a last name or an email
df.dropna(axis='index', how='all', subset=['last', 'email']) 
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne None 45
5 None None anonymous@email.com None
6 NA Missing Missing Missing
# replace "NA" and "Missing" values
df.replace('NA', np.nan, inplace=True)
df.replace('Missing', np.nan, inplace=True)
df
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne None 45
4 NaN NaN NaN None
5 None None anonymous@email.com None
6 NaN NaN NaN NaN
df.dropna()
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
df.dropna(axis='index', how='all', subset=['last', 'email']) 
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne None 45
5 None None anonymous@email.com None

You can see which values would and would not be treated as NA by using the isna method.

# check which values are missing
df.isna()
first last email age
0 False False False False
1 False False False False
2 False False False False
3 False False True False
4 True True True True
5 True True False True
6 True True True True

To substitute NA values with a specific value, use the fillna method.

# fill NA values with another value
df.fillna('MISSING')
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne MISSING 45
4 MISSING MISSING MISSING MISSING
5 MISSING MISSING anonymous@email.com MISSING
6 MISSING MISSING MISSING MISSING

Check the data types using the dtypes attribute.

df.dtypes
first    object
last     object
email    object
age      object
dtype: object

So if we wanted the average age, it wouldn’t work with the current object data type.

type(np.nan)
float

As you can see above, NaN values have the data type float. This means that if your dataset has NaN values and you want to perform some math on the numbers, you need to convert the column’s data type to float (not int). Another option would be to convert the missing values into another number, such as 0, and then convert the data type to int, but in most cases that would be a bad idea (e.g. when you want to find an average).
Note: If you have a dataframe in which all values share the same data type and you want to convert every column at once to another data type, you can use the astype method of the DataFrame object. For example, you can convert all int columns to float using df.astype(float).
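The note above can be sketched quickly with a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# convert every column at once from int to float
df_float = df.astype(float)
print(df_float.dtypes)
```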

# convert age dtype to float
df['age'] = df['age'].astype(float)
df.dtypes
first     object
last      object
email     object
age      float64
dtype: object
# take average of the age column
df['age'].mean()
38.75