Cleaning and preparing data for model training (Part 2 - Time Series Data)

We will be working with a time series dataset that gives us information about flights; the airline names, dates of journey, cities, times, etc. The dataset already consists of a train set and a test set.
What we will be doing is preparing the data for modeling.

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

matplotlib.rcParams['figure.figsize'] = (12, 6)
pd.set_option('display.max_columns', None)

df_train = pd.read_excel('data/flights_india/train.xlsx')
df_train.head()

	Airline	Date_of_Journey	Source	Destination	Route	Dep_Time	Arrival_Time	Duration	Total_Stops	Additional_Info	Price
0	IndiGo	24/03/2019	Banglore	New Delhi	BLR → DEL	22:20	01:10 22 Mar	2h 50m	non-stop	No info	3897
1	Air India	1/05/2019	Kolkata	Banglore	CCU → IXR → BBI → BLR	05:50	13:15	7h 25m	2 stops	No info	7662
2	Jet Airways	9/06/2019	Delhi	Cochin	DEL → LKO → BOM → COK	09:25	04:25 10 Jun	19h	2 stops	No info	13882
3	IndiGo	12/05/2019	Kolkata	Banglore	CCU → NAG → BLR	18:05	23:30	5h 25m	1 stop	No info	6218
4	IndiGo	01/03/2019	Banglore	New Delhi	BLR → NAG → DEL	16:50	21:35	4h 45m	1 stop	No info	13302

df_train.shape

(10683, 11)

df_test = pd.read_excel('data/flights_india/test.xlsx')
df_test.head()

	Airline	Date_of_Journey	Source	Destination	Route	Dep_Time	Arrival_Time	Duration	Total_Stops	Additional_Info
0	Jet Airways	6/06/2019	Delhi	Cochin	DEL → BOM → COK	17:30	04:25 07 Jun	10h 55m	1 stop	No info
1	IndiGo	12/05/2019	Kolkata	Banglore	CCU → MAA → BLR	06:20	10:20	4h	1 stop	No info
2	Jet Airways	21/05/2019	Delhi	Cochin	DEL → BOM → COK	19:15	19:00 22 May	23h 45m	1 stop	In-flight meal not included
3	Multiple carriers	21/05/2019	Delhi	Cochin	DEL → BOM → COK	08:00	21:00	13h	1 stop	No info
4	Air Asia	24/06/2019	Banglore	Delhi	BLR → DEL	23:55	02:45 25 Jun	2h 50m	non-stop	No info

df_test.shape

(2671, 10)

Let’s combine the datasets.

df = df_train.append(df_test)
df.head(2)

/var/folders/45/dyvnkjyn6s1gwy371h4b1lk00000gn/T/ipykernel_4029/3796357120.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df = df_train.append(df_test)

	Airline	Date_of_Journey	Source	Destination	Route	Dep_Time	Arrival_Time	Duration	Total_Stops	Additional_Info	Price
0	IndiGo	24/03/2019	Banglore	New Delhi	BLR → DEL	22:20	01:10 22 Mar	2h 50m	non-stop	No info	3897.0
1	Air India	1/05/2019	Kolkata	Banglore	CCU → IXR → BBI → BLR	05:50	13:15	7h 25m	2 stops	No info	7662.0

df.shape

(13354, 11)

df.tail()

	Airline	Date_of_Journey	Source	Destination	Route	Dep_Time	Arrival_Time	Duration	Total_Stops	Additional_Info	Price
2666	Air India	6/06/2019	Kolkata	Banglore	CCU → DEL → BLR	20:30	20:25 07 Jun	23h 55m	1 stop	No info	NaN
2667	IndiGo	27/03/2019	Kolkata	Banglore	CCU → BLR	14:20	16:55	2h 35m	non-stop	No info	NaN
2668	Jet Airways	6/03/2019	Delhi	Cochin	DEL → BOM → COK	21:50	04:25 07 Mar	6h 35m	1 stop	No info	NaN
2669	Air India	6/03/2019	Delhi	Cochin	DEL → BOM → COK	04:00	19:15	15h 15m	1 stop	No info	NaN
2670	Multiple carriers	15/06/2019	Delhi	Cochin	DEL → BOM → COK	04:55	19:15	14h 20m	1 stop	No info	NaN

The date and time are in the string format, which should be converted to a datetime format. We should probably also derive features like day, month and year from the date.

Check for Missing Values

df.isnull().sum()

Airline               0
Date_of_Journey       0
Source                0
Destination           0
Route                 1
Dep_Time              0
Arrival_Time          0
Duration              0
Total_Stops           1
Additional_Info       0
Price              2671
dtype: int64

There are negligible missing values.

Feature Engineering

Date of journey

df['Date_of_Journey'].str.split('/')[0]

0    [24, 03, 2019]
0     [6, 06, 2019]
Name: Date_of_Journey, dtype: object

# get the date
df['Date_of_Journey'].str.split('/')[0].str[0]

0    24
0     6
Name: Date_of_Journey, dtype: object

df['Date'] = df['Date_of_Journey'].str.split('/').str[0]
df['Month'] = df['Date_of_Journey'].str.split('/').str[1]
df['Year'] = df['Date_of_Journey'].str.split('/').str[2]

df.head(2)

	Airline	Date_of_Journey	Source	Destination	Route	Dep_Time	Arrival_Time	Duration	Total_Stops	Additional_Info	Price	Date	Month	Year
0	IndiGo	24/03/2019	Banglore	New Delhi	BLR → DEL	22:20	01:10 22 Mar	2h 50m	non-stop	No info	3897.0	24	03	2019
1	Air India	1/05/2019	Kolkata	Banglore	CCU → IXR → BBI → BLR	05:50	13:15	7h 25m	2 stops	No info	7662.0	1	05	2019

Now we’ll change the data types of the new columns.

df['Date'] = df['Date'].astype(int)
df['Month'] = df['Month'].astype(int)
df['Year'] = df['Year'].astype(int)

Let’s drop the Date_of_Journey column.

df.drop('Date_of_Journey', axis=1, inplace=True)

Arrival time

df['Arrival_Time'].head(30)

0     01:10 22 Mar
1            13:15
2     04:25 10 Jun
3            23:30
4            21:35
5            11:25
6     10:25 13 Mar
7     05:05 02 Mar
8     10:25 13 Mar
9            19:15
10           23:00
11           22:55
12           13:55
13    09:20 10 May
14           19:50
15    19:15 04 Mar
16           13:15
17    12:35 13 Jun
18    19:15 13 Jun
19    12:35 28 May
20           19:20
21    13:20 19 Mar
22           06:50
23           21:50
24    08:15 07 May
25    12:35 10 Jun
26    03:35 02 Jun
27           12:55
28           12:35
29           09:20
Name: Arrival_Time, dtype: object

Here we don’t require the date after the time since we already have that in a different column now.

df['Arrival_Time'].str.split(' ')

0       [01:10, 22, Mar]
1                [13:15]
2       [04:25, 10, Jun]
3                [23:30]
4                [21:35]
              ...       
2666    [20:25, 07, Jun]
2667             [16:55]
2668    [04:25, 07, Mar]
2669             [19:15]
2670             [19:15]
Name: Arrival_Time, Length: 13354, dtype: object

We can see that this is a list and we only need the first value from it.

df['Arrival_Time'].str.split(' ').str[0]

# we can also use a lambda function 
# df['Arrival_Time'].apply(lambda x: x.split(' ')[0])

0       01:10
1       13:15
2       04:25
3       23:30
4       21:35
        ...  
2666    20:25
2667    16:55
2668    04:25
2669    19:15
2670    19:15
Name: Arrival_Time, Length: 13354, dtype: object

df['Arrival_Time'] = df['Arrival_Time'].str.split(' ').str[0]

We’ll split this feature for the hour and minute of arrival.

df['Arrival_Hour'] = df['Arrival_Time'].str.split(':').str[0]
df['Arrival_Minute'] = df['Arrival_Time'].str.split(':').str[1]

df.head(2)

	Airline	Source	Destination	Route	Dep_Time	Arrival_Time	Duration	Total_Stops	Additional_Info	Price	Date	Month	Year	Arrival_Hour	Arrival_Minute
0	IndiGo	Banglore	New Delhi	BLR → DEL	22:20	01:10	2h 50m	non-stop	No info	3897.0	24	3	2019	01	10
1	Air India	Kolkata	Banglore	CCU → IXR → BBI → BLR	05:50	13:15	7h 25m	2 stops	No info	7662.0	1	5	2019	13	15

# convert the data type
df['Arrival_Hour'] = df['Arrival_Hour'].astype(int)
df['Arrival_Minute'] = df['Arrival_Minute'].astype(int)

# drop column
df.drop('Arrival_Time', axis=1, inplace=True)

Departure Time

df['Dep_Time'].head()

0    22:20
1    05:50
2    09:25
3    18:05
4    16:50
Name: Dep_Time, dtype: object

Here we don’t require the date after the time since we already have that in a different column now.

df['Dep_Time'].str.split(' ')

0       [22:20]
1       [05:50]
2       [09:25]
3       [18:05]
4       [16:50]
         ...   
2666    [20:30]
2667    [14:20]
2668    [21:50]
2669    [04:00]
2670    [04:55]
Name: Dep_Time, Length: 13354, dtype: object

We can see that this is a list and we only need the first value from it.

df['Dep_Time'].str.split(' ').str[0]

# we can also use a lambda function 
# df['Arrival_Time'].apply(lambda x: x.split(' ')[0])

0       22:20
1       05:50
2       09:25
3       18:05
4       16:50
        ...  
2666    20:30
2667    14:20
2668    21:50
2669    04:00
2670    04:55
Name: Dep_Time, Length: 13354, dtype: object

df['Dep_Time'] = df['Dep_Time'].str.split(' ').str[0]

We’ll split this feature for the hour and minute of departure.

df['Dep_Hour'] = df['Dep_Time'].str.split(':').str[0]
df['Dep_Minute'] = df['Dep_Time'].str.split(':').str[1]

df.head(2)

	Airline	Source	Destination	Route	Dep_Time	Duration	Total_Stops	Additional_Info	Price	Date	Month	Year	Arrival_Hour	Arrival_Minute	Dep_Hour	Dep_Minute
0	IndiGo	Banglore	New Delhi	BLR → DEL	22:20	2h 50m	non-stop	No info	3897.0	24	3	2019	1	10	22	20
1	Air India	Kolkata	Banglore	CCU → IXR → BBI → BLR	05:50	7h 25m	2 stops	No info	7662.0	1	5	2019	13	15	05	50

# convert the data type
df['Dep_Hour'] = df['Dep_Hour'].astype(int)
df['Dep_Minute'] = df['Dep_Minute'].astype(int)

# drop column
df.drop('Dep_Time', axis=1, inplace=True)

Route

df['Route'].head()

0                BLR → DEL
1    CCU → IXR → BBI → BLR
2    DEL → LKO → BOM → COK
3          CCU → NAG → BLR
4          BLR → NAG → DEL
Name: Route, dtype: object

We can see that there can be multiple destinations for a particular route.

Total stops

We know that there is one null value in this column.

df['Total_Stops'].unique()

array(['non-stop', '2 stops', '1 stop', '3 stops', nan, '4 stops'],
      dtype=object)

# number of null values
df['Total_Stops'].isnull().sum()

# which row is this?
df[df['Total_Stops'].isnull()]

	Airline	Source	Destination	Route	Duration	Total_Stops	Additional_Info	Price	Date	Month	Year	Arrival_Hour	Arrival_Minute	Dep_Hour	Dep_Minute
9039	Air India	Delhi	Cochin	NaN	23h 40m	NaN	No info	7480.0	6	5	2019	9	25	9	45

We can see that the row which has a null value for Total_Stops also has a null value for the Route column - our only two missing values. It makes sense to just delete this row.

df = df.dropna(subset=['Route'])

df['Total_Stops'].isnull().sum()

Now we’ll map the values.

df['Total_Stops'] = df['Total_Stops'].map({'non-stop':0,'1 stop':1,'2 stops':2,'3 stops':3,'4 stops':4})

We’ll drop the Route column since it doesn’t look like it will help us.

# drop Route column
df.drop('Route', axis=1, inplace=True)

Duration

We should probably convert duration into minutes.

df['Duration'].head()

0    2h 50m
1    7h 25m
2       19h
3    5h 25m
4    4h 45m
Name: Duration, dtype: object

# convert values to Timedelta objects
df['Duration'] = df['Duration'].apply(pd.to_timedelta)

df['Duration_Min'] = df['Duration'].apply(lambda x: x.total_seconds() / 60)

df['Duration_Min'].head()

0     170.0
1     445.0
2    1140.0
3     325.0
4     285.0
Name: Duration_Min, dtype: float64

# drop column
df.drop('Duration', axis=1, inplace=True)

Additonal info

df['Additional_Info'].unique()

array(['No info', 'In-flight meal not included',
       'No check-in baggage included', '1 Short layover', 'No Info',
       '1 Long layover', 'Change airports', 'Business class',
       'Red-eye flight', '2 Long layover'], dtype=object)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13353 entries, 0 to 2670
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Airline          13353 non-null  object 
 1   Source           13353 non-null  object 
 2   Destination      13353 non-null  object 
 3   Total_Stops      13353 non-null  int64  
 4   Additional_Info  13353 non-null  object 
 5   Price            10682 non-null  float64
 6   Date             13353 non-null  int64  
 7   Month            13353 non-null  int64  
 8   Year             13353 non-null  int64  
 9   Arrival_Hour     13353 non-null  int64  
 10  Arrival_Minute   13353 non-null  int64  
 11  Dep_Hour         13353 non-null  int64  
 12  Dep_Minute       13353 non-null  int64  
 13  Duration_Min     13353 non-null  float64
dtypes: float64(2), int64(8), object(4)
memory usage: 1.5+ MB

Airline

df['Airline'].uniqueque()

array(['IndiGo', 'Air India', 'Jet Airways', 'SpiceJet',
       'Multiple carriers', 'GoAir', 'Vistara', 'Air Asia',
       'Vistara Premium economy', 'Jet Airways Business',
       'Multiple carriers Premium economy', 'Trujet'], dtype=object)

Let’s do label encoding for the airlines.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Airline'] = le.fit_transform(df['Airline'])

# do the same for other columns
df['Source'] = le.fit_transform(df['Source'])
df['Destination'] = le.fit_transform(df['Destination'])
df['Additional_Info'] = le.fit_transform(df['Additional_Info'])

df.shape

(13353, 14)

df.head()

	Airline	Source	Destination	Total_Stops	Additional_Info	Price	Date	Month	Year	Arrival_Hour	Arrival_Minute	Dep_Hour	Dep_Minute	Duration_Min
0	3	0	5	0	8	3897.0	24	3	2019	1	10	22	20	170.0
1	1	3	0	2	8	7662.0	1	5	2019	13	15	5	50	445.0
2	4	2	1	2	8	13882.0	9	6	2019	4	25	9	25	1140.0
3	3	3	0	1	8	6218.0	12	5	2019	23	30	18	5	325.0
4	3	0	5	1	8	13302.0	1	3	2019	21	35	16	50	285.0

The data is now ready for modeling.