Pandas is an open-source library that is built on top of the NumPy library. It is used for working with data sets, i.e. analyzing, cleaning, exploring, and manipulating data.
It’s better to use brackets rather than dot notation because if a column name is the same as a method or attribute of a DataFrame, dot notation will resolve to that method or attribute instead of the column.
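As a quick illustration (using a hypothetical column name): if df had a column called count, dot notation would collide with the built-in count method, while brackets always return the column.
# hypothetical: suppose df had a column named 'count'
# df.count     -> resolves to the DataFrame.count method
# df['count']  -> always returns the column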
type(df['email'])
pandas.core.series.Series
A Series holds the rows of data of a single column. It’s a 1-D array.
A DataFrame is rows and columns of data, a 2-D array. It’s a container of multiple Series objects.
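For a concrete starting point, here is a minimal sketch of the dataframe used in the examples below, built from the values that appear in the outputs (the pandas and numpy imports are assumed throughout):
import numpy as np
import pandas as pd

# sample people data, consistent with the outputs shown below
df = pd.DataFrame({
    'first': ['Pranav', 'Jane', 'John'],
    'last': ['Shirole', 'Doe', 'Doe'],
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com'],
})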
# access multiple columns
# pass a list of columns inside the brackets
df[['last', 'email']]
      last                email
0  Shirole  123pranav@email.com
1      Doe    janedoe@email.com
2      Doe    johndoe@email.com
The above code returns a filtered-down DataFrame.
# this is no longer a Series since it has more than 1 row and column
type(df[['last', 'email']])
pandas.core.frame.DataFrame
# see all columns
df.columns
Index(['first', 'last', 'email'], dtype='object')
Get rows and columns using loc and iloc
iloc
iloc allows us to access rows by integer location.
# get first row
# returns a Series that contains the values of the first row of data
df.iloc[0]
first Pranav
last Shirole
email 123pranav@email.com
Name: 0, dtype: object
The above code returns a Series that contains the values of the first row of data. Also, when accessing a row this way, the resulting Series is indexed by the column names.
# get first and second row / multiple rows
# returns a DataFrame
df.iloc[[0, 1]]
    first     last                email
0  Pranav  Shirole  123pranav@email.com
1    Jane      Doe    janedoe@email.com
The above code returns a DataFrame.
We can also select columns with loc and iloc. The rows are the first value and the columns are the second value we pass in the brackets. So if we think of loc and iloc as functions, rows are the first argument and columns are the second argument.
# grab the email address of the first two rows
# the index of the email column will be 2 since it's the third column
df.iloc[[0, 1], 2]
We can also pass in a list of columns with loc and iloc.
# get last name and email of the first two rows
# the output will follow the order in which the columns are passed, regardless of the main dataframe
df.loc[[0, 1], ['email', 'last']]
We can set the email column as the index for the dataframe. The dataframe doesn’t actually change unless you pass inplace=True, which is nice since it lets us experiment with our dataset.
# set the email column as the index for the dataframe
df.set_index('email', inplace=True)
Why would changing the index be useful?
Because it enables us to see all the information on someone just by using their email.
# see all information on someone by using their email
df.loc['123pranav@email.com']
first Pranav
last Shirole
Name: 123pranav@email.com, dtype: object
# get just the last name from the email
df.loc['123pranav@email.com', 'last']
'Shirole'
Note that once you change the index, you cannot use the default index to locate your rows. But you can still use iloc.
# the below code will now result in an error since the index has changed
# df.loc[0]
# but you can still use iloc
df.iloc[0]
first Pranav
last Shirole
Name: 123pranav@email.com, dtype: object
You can also reset the index back to the default.
# reset the index
df.reset_index(inplace=True)
df
                 email   first     last
0  123pranav@email.com  Pranav  Shirole
1    janedoe@email.com    Jane      Doe
2    johndoe@email.com    John      Doe
Filtering
You can filter specific data from the dataset using pandas.
# get everyone with the last name Doe
df['last'] == 'Doe'
0 False
1 True
2 True
Name: last, dtype: bool
We get a Series object with Boolean values. True values are the ones that met our filter criteria. False values are the ones that did not meet our filter criteria.
filt = (df['last'] == 'Doe')
df[filt]
               email first last
1  janedoe@email.com  Jane  Doe
2  johndoe@email.com  John  Doe
# this also works without setting a variable
df[df['last'] == 'Doe']
               email first last
1  janedoe@email.com  Jane  Doe
2  johndoe@email.com  John  Doe
You can filter data using loc by passing in a series of Boolean values.
# this also does the same thing as above
df.loc[filt]
               email first last
1  janedoe@email.com  Jane  Doe
2  johndoe@email.com  John  Doe
This is great because we can still grab data for specific columns as well.
# get only email IDs for the filter
# remember: df.loc[rows, cols]
df.loc[filt, 'email']
# change John Doe's last name to Smith
# (note: filt now matches only John's row, as the output shows)
df[filt]['last']
2 Doe
Name: last, dtype: object
#collapse-output
# this gives a warning and does not work
# you cannot change the last name with this method
df[filt]['last'] = 'Smith'
df
/tmp/ipykernel_183/3199811375.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df[filt]['last'] = 'Smith'
    first     last                email
0  Pranav  Shirole  123pranav@email.com
1    Jane      Doe    janedoe@email.com
2    John      Doe    JohnDoe@email.com
Instead, just use loc or at when setting values.
df.loc[filt, 'last'] = 'Smith'
df
    first     last                email
0  Pranav  Shirole  123pranav@email.com
1    Jane      Doe    janedoe@email.com
2    John    Smith    JohnDoe@email.com
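For setting a single value, the at accessor works the same way; a minimal sketch:
# at reads/writes one cell by row label and column name
df.at[2, 'last'] = 'Smith'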
# lowercase all emails
# this does not actually make the change
df['email'].str.lower()
Lambda functions are anonymous functions, i.e. functions without a name.
# convert emails back to lowercase
df['email'] = df['email'].apply(lambda x: x.lower())
df
    first     last                email
0  Pranav  Shirole  123pranav@email.com
1    Jane      Doe    janedoe@email.com
2    John    Smith    johndoe@email.com
Using apply on a DataFrame.
# running apply on a Series returns a Series
df['email'].apply(len)
0 19
1 17
2 17
Name: email, dtype: int64
# applies the len function to all the Series (i.e. columns) in the DataFrame
df.apply(len)
first 3
last 3
email 3
dtype: int64
The above code returns the number of rows in each column.
# the above apply does the same as this, but for every column
len(df['email'])
3
# count across columns, i.e. sideways (how many columns for each index)
df.apply(len, axis='columns')
0 3
1 3
2 3
dtype: int64
# grab the minimum value for each Series
df.apply(pd.Series.min)  # will choose alphabetically for strings
first Jane
last Doe
email 123pranav@email.com
dtype: object
The above code returns the lowest value alphabetically in the first, last and email columns respectively.
# same as above (for each Series)
df.apply(lambda x: x.min())
first Jane
last Doe
email 123pranav@email.com
dtype: object
This is usually more useful when your dataframe contains numerical data.
Running apply on a Series applies a function to every value in the Series. Running apply on a DataFrame applies a function to every Series (column) in the DataFrame.
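A small sketch with made-up numbers to make the distinction concrete:
# Series.apply: the function receives each value
# DataFrame.apply: the function receives each column (a Series)
nums = pd.DataFrame({'a': [3, 1, 2], 'b': [30, 10, 20]})
nums['a'].apply(lambda x: x * 2)     # element-wise: 6, 2, 4
nums.apply(lambda col: col.max())    # per column: a -> 3, b -> 30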
applymap
applymap applies a function to every individual element in the DataFrame. It only works on DataFrames.
df.applymap(len)
   first  last  email
0      6     7     19
1      4     3     17
2      4     5     17
df.applymap(str.lower)
    first     last                email
0  pranav  shirole  123pranav@email.com
1    jane      doe    janedoe@email.com
2    john    smith    johndoe@email.com
map
The map method is used for substituting each value in a Series with another value. It only works on a Series.
# substitute the first names
df['first'].map({'Corey': 'Chris', 'Jane': 'Mary'})
0 NaN
1 Mary
2 NaN
Name: first, dtype: object
Note that the values we didn’t substitute were converted to NaN values. That may not be what we want.
If you want to keep the other names, use the replace method.
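A sketch of the same substitution using replace, which leaves unmatched values untouched:
# substitute the first names, keeping all other names as-is
df['first'].replace({'Corey': 'Chris', 'Jane': 'Mary'})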
Now suppose the dataframe has a full_name column that we want to split into first and last names. Splitting it with str.split(' ') gives, for each row, a list in which the first name is the first value and the last name is the second value. To assign these values to two different columns, we need to expand the lists using the expand argument.
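A sketch of the list form, assuming a full_name column like the one in the outputs below:
# without expand, each row becomes a Python list
df['full_name'].str.split(' ')
# 0    [Pranav, Shirole]
# 1          [Jane, Doe]
# 2          [John, Doe]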
# assign list values to two different columns
df['full_name'].str.split(' ', expand=True)
We can add new rows from another dataframe using the append method. Since there is no inplace argument here, we need to assign the result back to df to make the changes permanent.
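Here df2 holds the new rows; a hypothetical sketch of it, inferred from the appended output below:
# hypothetical df2 containing the rows to append
df2 = pd.DataFrame({
    'first': ['Tony', 'Steve'],
    'last': ['Stark', 'Rogers'],
    'email': ['ironman@email.com', 'cap@email.com'],
})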
df = df.append(df2, ignore_index=True)
df
                 email       full_name   first     last
0  123pranav@email.com  Pranav Shirole  Pranav  Shirole
1    janedoe@email.com        Jane Doe    Jane      Doe
2    johndoe@email.com        John Doe    John      Doe
3    ironman@email.com             NaN    Tony    Stark
4        cap@email.com             NaN   Steve   Rogers
Remove rows
To remove rows, we can pass in the indexes that we want to remove using the drop method. To apply the changes, use inplace=True.
# drop index 4
# to apply the changes, use inplace=True
df.drop(index=4)
                 email       full_name   first     last
0  123pranav@email.com  Pranav Shirole  Pranav  Shirole
1    janedoe@email.com        Jane Doe    Jane      Doe
2    johndoe@email.com        John Doe    John      Doe
3    ironman@email.com             NaN    Tony    Stark
To drop particular rows with a condition, you can pass in the indexes of the filter.
# drop rows where the last name is Doe
# note the .index attribute used at the end
df.drop(index=df[df['last'] == 'Doe'].index)
                 email       full_name   first     last
0  123pranav@email.com  Pranav Shirole  Pranav  Shirole
3    ironman@email.com             NaN    Tony    Stark
4        cap@email.com             NaN   Steve   Rogers
# or, same as above, pulling the condition out
# this is more readable
filt = df['last'] == 'Doe'
df.drop(index=df[filt].index)
                 email       full_name   first     last
0  123pranav@email.com  Pranav Shirole  Pranav  Shirole
3    ironman@email.com             NaN    Tony    Stark
4        cap@email.com             NaN   Steve   Rogers
df.drop(['full_name'], axis=1, inplace=True)
Sorting
Sort columns
# sort by last name
df.sort_values(by='last')
                 email   first     last
1    janedoe@email.com    Jane      Doe
2    johndoe@email.com    John      Doe
4        cap@email.com   Steve   Rogers
0  123pranav@email.com  Pranav  Shirole
3    ironman@email.com    Tony    Stark
# sort in descending order
df.sort_values(by='last', ascending=False)
                 email   first     last
3    ironman@email.com    Tony    Stark
0  123pranav@email.com  Pranav  Shirole
4        cap@email.com   Steve   Rogers
1    janedoe@email.com    Jane      Doe
2    johndoe@email.com    John      Doe
When sorting on multiple columns, if the first column has identical values, it will then sort on the second column value.
So in the case below, if we sort by both columns, it will first sort by last name and then sort by first name for the rows that share the last name Doe.
# pass a list to sort by
df.sort_values(by=['last', 'first'])
We can also have one column sorted in ascending order and another in descending order.
# sort one column ascending and another descending
# last name in descending and first name in ascending order
# pass a list to ascending too
df.sort_values(by=['last', 'first'], ascending=[False, True], inplace=True)
df
    first     last                email
0  Pranav  Shirole  123pranav@email.com
3    Thor  Odinson       thor@email.com
1    Jane      Doe    janedoe@email.com
2    John      Doe    johndoe@email.com
Here, we see that the indexes have changed in accordance with the sorted values.
We can put the rows back in their original order using sort_index, which sorts the dataframe by its index.
df.sort_index(inplace=True)
df
    first     last                email
0  Pranav  Shirole  123pranav@email.com
1    Jane      Doe    janedoe@email.com
2    John      Doe    johndoe@email.com
3    Thor  Odinson       thor@email.com
To sort only a single column, i.e. a Series, we can use sort_values.
# sort a single column (i.e. a single Series)
# just leave the arguments blank
df['last'].sort_values()
1 Doe
2 Doe
3 Odinson
0 Shirole
Name: last, dtype: object
Of all the people in the dataset, 37.5% are avengers and 25% are developers.
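Percentages like these can be computed with value_counts(normalize=True); a sketch, assuming the work column used in the rest of this section:
# fraction of people in each line of work
df['work'].value_counts(normalize=True)
# avenger      0.375
# developer    0.250
# ...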
Grouping
Let’s group the people by their country.
df.groupby(['country'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f30d4c6d610>
This DataFrameGroupBy object contains a bunch of groups.
Let’s set the GroupBy object as a variable to reuse it later.
# set the GroupBy object as a variable
country_grp = df.groupby(['country'])
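To see which groups were formed, the groups attribute maps each label to the row indexes it contains:
# peek at the group labels and their row indexes
country_grp.groups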
# get all the people from the USA
country_grp.get_group('USA')
   first     last                email   age    salary       work country
3   Thor  Odinson       thor@email.com  90.0   80000.0    avenger     USA
4   Tony    Stark    ironman@email.com  50.0  500000.0    avenger     USA
5  Steve   Rogers        cap@email.com  75.0   45000.0    avenger     USA
6  Bruce    Wayne  thebatman@email.com  40.0  350000.0  vigilante     USA
We can also use a filter to perform the same task as above.
filt = df['country'] == 'USA'
df.loc[filt]
   first     last                email   age    salary       work country
3   Thor  Odinson       thor@email.com  90.0   80000.0    avenger     USA
4   Tony    Stark    ironman@email.com  50.0  500000.0    avenger     USA
5  Steve   Rogers        cap@email.com  75.0   45000.0    avenger     USA
6  Bruce    Wayne  thebatman@email.com  40.0  350000.0  vigilante     USA
# let's check work by country, here for the USA
df.loc[filt]['work'].value_counts()
avenger 3
vigilante 1
Name: work, dtype: int64
# let's check the same data as above for India
filt = df['country'] == 'India'
df.loc[filt]
    first     last                email   age    salary       work country
0  Pranav  Shirole  123pranav@email.com  30.0   80000.0    analyst   India
1    Jane      Doe    janedoe@email.com  35.0  120000.0  developer   India
2    John      Doe    johndoe@email.com  25.0  100000.0  developer   India
df.loc[filt]['work'].value_counts()
developer 2
analyst 1
Name: work, dtype: int64
country_grp['work'].value_counts().loc['India']
work
developer 2
analyst 1
Name: work, dtype: int64
The DataFrameGroupBy object is useful because one line of code gives us the result for every country in the dataset, without creating a separate filter for each country.
country_grp['work'].value_counts().loc['USA']
work
avenger 3
vigilante 1
Name: work, dtype: int64
# median salaries by country
country_grp['salary'].median()
country
India 100000.0
Krypton NaN
USA 215000.0
Name: salary, dtype: float64
# median salary for India
country_grp['salary'].median().loc['India']
100000.0
Multiple aggregate functions
We can also get multiple aggregate functions at once using agg.
country_grp['salary'].agg(['median', 'mean'])
           median      mean
country
India    100000.0  100000.0
Krypton       NaN       NaN
USA      215000.0  243750.0
# for the USA
country_grp['salary'].agg(['median', 'mean']).loc['USA']
median 215000.0
mean 243750.0
Name: USA, dtype: float64
# for the USA and India
country_grp['salary'].agg(['median', 'mean']).loc[['USA', 'India']]
           median      mean
country
USA      215000.0  243750.0
India    100000.0  100000.0
We can calculate sums using sum. It works on numbers as well as Boolean data types (where it will take True as 1 and False as 0).
# how many people are developers in India
filt = df['country'] == 'India'
df.loc[filt]['work'].str.contains('deve').sum()
2
# how many people are avengers in each country
avengers = country_grp['work'].apply(lambda x: x.str.contains('aven').sum())
avengers
country
India 0
Krypton 0
USA 3
Name: work, dtype: int64
# total number of people from each country
country_respondents = df['country'].value_counts()
country_respondents
USA 4
India 3
Krypton 1
Name: country, dtype: int64
We can combine more than one Series together using concat.
# how many people are avengers, grouped by country
avengers_df = pd.concat([country_respondents, avengers], axis='columns', sort=True)
avengers_df
         country  work
India          3     0
Krypton        1     0
USA            4     3
# rename the columns appropriately
avengers_df.rename(columns={'country': 'number of people', 'work': 'are avengers'}, inplace=True)
avengers_df
         number of people  are avengers
India                   3             0
Krypton                 1             0
USA                     4             3
# add another column for the percentage of people who are avengers
avengers_df['pct are avengers'] = (avengers_df['are avengers'] / avengers_df['number of people']) * 100
avengers_df
         number of people  are avengers  pct are avengers
India                   3             0               0.0
Krypton                 1             0               0.0
USA                     4             3              75.0
# sort countries by the largest percentage of people who are avengers
avengers_df.sort_values(by='pct are avengers', ascending=False, inplace=True)
avengers_df
         number of people  are avengers  pct are avengers
USA                     4             3              75.0
India                   3             0               0.0
Krypton                 1             0               0.0
# for the USA
avengers_df.loc['USA']
number of people 4.0
are avengers 3.0
pct are avengers 75.0
Name: USA, dtype: float64
Two of the default arguments of dropna are as follows:
df.dropna(axis='index', how='any')
The axis argument can be set to either index or columns: index tells pandas to drop rows that have missing values, while columns tells pandas to drop columns with missing values. The how argument can be set to either any or all: any drops a row (or column) with one or more missing values, while all drops a row (or column) only when all of its values are missing.
df.dropna(axis='columns', how='all')
    first     last                email      age
0  Pranav  Shirole  123pranav@email.com       32
1    Jane      Doe    janedoe@email.com       38
2    John      Doe    johndoe@email.com       40
3   Bruce    Wayne                 None       45
4     NaN      NaN                  NaN     None
5    None     None  anonymous@email.com     None
6      NA  Missing              Missing  Missing
df.dropna(axis='columns', how='any')
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6]
We get an empty dataframe because any columns with even a single missing value are dropped.
# get only rows that do not have a missing email address
df.dropna(axis='index', how='any', subset=['email'])
    first     last                email      age
0  Pranav  Shirole  123pranav@email.com       32
1    Jane      Doe    janedoe@email.com       38
2    John      Doe    johndoe@email.com       40
5    None     None  anonymous@email.com     None
6      NA  Missing              Missing  Missing
# drop rows that are missing both the last name and the email
df.dropna(axis='index', how='all', subset=['last', 'email'])
    first     last                email      age
0  Pranav  Shirole  123pranav@email.com       32
1    Jane      Doe    janedoe@email.com       38
2    John      Doe    johndoe@email.com       40
3   Bruce    Wayne                 None       45
5    None     None  anonymous@email.com     None
6      NA  Missing              Missing  Missing
# replace "NA" and "Missing" values with NaN
df.replace('NA', np.nan, inplace=True)
df.replace('Missing', np.nan, inplace=True)
You can see which values would and would not be treated as NA by using the isna method.
# check which values are missing
df.isna()
   first   last  email    age
0  False  False  False  False
1  False  False  False  False
2  False  False  False  False
3  False  False   True  False
4   True   True   True   True
5   True   True  False   True
6   True   True   True   True
To substitute NA values with a specific value, use the fillna method.
# fill NA values with another value
df.fillna('MISSING')
     first     last                email      age
0   Pranav  Shirole  123pranav@email.com       32
1     Jane      Doe    janedoe@email.com       38
2     John      Doe    johndoe@email.com       40
3    Bruce    Wayne              MISSING       45
4  MISSING  MISSING              MISSING  MISSING
5  MISSING  MISSING  anonymous@email.com  MISSING
6  MISSING  MISSING              MISSING  MISSING
Check the data types using the dtypes attribute.
df.dtypes
first object
last object
email object
age object
dtype: object
So if we wanted the average age, it wouldn’t work with the current object data type.
type(np.nan)
float
As you can see above, NaN values have the float data type. This means that if your dataset has NaN values and you want to perform some math on the numbers, you need to convert your column's data type to float (not int). Another option would be to convert the missing values into another number, like 0, and then convert the data type to int, but in most cases this would be a bad idea (for example, when you want to find an average).
Note: If you have a dataframe whose values all share the same data type and you want to convert all columns at once to another data type, you can use the astype method of the DataFrame object. For example, you can convert all int columns to float using df.astype(float).
# convert the age dtype to float
df['age'] = df['age'].astype(float)
df.dtypes
first object
last object
email object
age float64
dtype: object
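Now numeric operations work; for example, the average age (pandas skips NaN values by default):
# compute the average age, ignoring NaN values
df['age'].mean()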