Cleaning and preparing data for model training (Part 2 - Time Series Data)

Performing EDA on a time series dataset and preparing it for modeling
analysis
Published

July 1, 2022

We will be working with a time series dataset that gives us information about flights; the airline names, dates of journey, cities, times, etc. The dataset already consists of a train set and a test set.
What we will be doing is preparing the data for modeling.

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

matplotlib.rcParams['figure.figsize'] = (12, 6)
pd.set_option('display.max_columns', None)
df_train = pd.read_excel('data/flights_india/train.xlsx')
df_train.head()
Airline Date_of_Journey Source Destination Route Dep_Time Arrival_Time Duration Total_Stops Additional_Info Price
0 IndiGo 24/03/2019 Banglore New Delhi BLR → DEL 22:20 01:10 22 Mar 2h 50m non-stop No info 3897
1 Air India 1/05/2019 Kolkata Banglore CCU → IXR → BBI → BLR 05:50 13:15 7h 25m 2 stops No info 7662
2 Jet Airways 9/06/2019 Delhi Cochin DEL → LKO → BOM → COK 09:25 04:25 10 Jun 19h 2 stops No info 13882
3 IndiGo 12/05/2019 Kolkata Banglore CCU → NAG → BLR 18:05 23:30 5h 25m 1 stop No info 6218
4 IndiGo 01/03/2019 Banglore New Delhi BLR → NAG → DEL 16:50 21:35 4h 45m 1 stop No info 13302
df_train.shape
(10683, 11)
df_test = pd.read_excel('data/flights_india/test.xlsx')
df_test.head()
Airline Date_of_Journey Source Destination Route Dep_Time Arrival_Time Duration Total_Stops Additional_Info
0 Jet Airways 6/06/2019 Delhi Cochin DEL → BOM → COK 17:30 04:25 07 Jun 10h 55m 1 stop No info
1 IndiGo 12/05/2019 Kolkata Banglore CCU → MAA → BLR 06:20 10:20 4h 1 stop No info
2 Jet Airways 21/05/2019 Delhi Cochin DEL → BOM → COK 19:15 19:00 22 May 23h 45m 1 stop In-flight meal not included
3 Multiple carriers 21/05/2019 Delhi Cochin DEL → BOM → COK 08:00 21:00 13h 1 stop No info
4 Air Asia 24/06/2019 Banglore Delhi BLR → DEL 23:55 02:45 25 Jun 2h 50m non-stop No info
df_test.shape
(2671, 10)

Let’s combine the datasets.

df = df_train.append(df_test)
df.head(2)
/var/folders/45/dyvnkjyn6s1gwy371h4b1lk00000gn/T/ipykernel_4029/3796357120.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df = df_train.append(df_test)
Airline Date_of_Journey Source Destination Route Dep_Time Arrival_Time Duration Total_Stops Additional_Info Price
0 IndiGo 24/03/2019 Banglore New Delhi BLR → DEL 22:20 01:10 22 Mar 2h 50m non-stop No info 3897.0
1 Air India 1/05/2019 Kolkata Banglore CCU → IXR → BBI → BLR 05:50 13:15 7h 25m 2 stops No info 7662.0
df.shape
(13354, 11)
df.tail()
Airline Date_of_Journey Source Destination Route Dep_Time Arrival_Time Duration Total_Stops Additional_Info Price
2666 Air India 6/06/2019 Kolkata Banglore CCU → DEL → BLR 20:30 20:25 07 Jun 23h 55m 1 stop No info NaN
2667 IndiGo 27/03/2019 Kolkata Banglore CCU → BLR 14:20 16:55 2h 35m non-stop No info NaN
2668 Jet Airways 6/03/2019 Delhi Cochin DEL → BOM → COK 21:50 04:25 07 Mar 6h 35m 1 stop No info NaN
2669 Air India 6/03/2019 Delhi Cochin DEL → BOM → COK 04:00 19:15 15h 15m 1 stop No info NaN
2670 Multiple carriers 15/06/2019 Delhi Cochin DEL → BOM → COK 04:55 19:15 14h 20m 1 stop No info NaN

The date and time are in the string format, which should be converted to a datetime format. We should probably also derive features like day, month and year from the date.

Check for Missing Values

df.isnull().sum()
Airline               0
Date_of_Journey       0
Source                0
Destination           0
Route                 1
Dep_Time              0
Arrival_Time          0
Duration              0
Total_Stops           1
Additional_Info       0
Price              2671
dtype: int64

There are negligible missing values.

Feature Engineering

Date of journey

df['Date_of_Journey'].str.split('/')[0]
0    [24, 03, 2019]
0     [6, 06, 2019]
Name: Date_of_Journey, dtype: object
# get the date
df['Date_of_Journey'].str.split('/')[0].str[0]
0    24
0     6
Name: Date_of_Journey, dtype: object
df['Date'] = df['Date_of_Journey'].str.split('/').str[0]
df['Month'] = df['Date_of_Journey'].str.split('/').str[1]
df['Year'] = df['Date_of_Journey'].str.split('/').str[2]
df.head(2)
Airline Date_of_Journey Source Destination Route Dep_Time Arrival_Time Duration Total_Stops Additional_Info Price Date Month Year
0 IndiGo 24/03/2019 Banglore New Delhi BLR → DEL 22:20 01:10 22 Mar 2h 50m non-stop No info 3897.0 24 03 2019
1 Air India 1/05/2019 Kolkata Banglore CCU → IXR → BBI → BLR 05:50 13:15 7h 25m 2 stops No info 7662.0 1 05 2019

Now we’ll change the data types of the new columns.

df['Date'] = df['Date'].astype(int)
df['Month'] = df['Month'].astype(int)
df['Year'] = df['Year'].astype(int)

Let’s drop the Date_of_Journey column.

df.drop('Date_of_Journey', axis=1, inplace=True)

Arrival time

df['Arrival_Time'].head(30)
0     01:10 22 Mar
1            13:15
2     04:25 10 Jun
3            23:30
4            21:35
5            11:25
6     10:25 13 Mar
7     05:05 02 Mar
8     10:25 13 Mar
9            19:15
10           23:00
11           22:55
12           13:55
13    09:20 10 May
14           19:50
15    19:15 04 Mar
16           13:15
17    12:35 13 Jun
18    19:15 13 Jun
19    12:35 28 May
20           19:20
21    13:20 19 Mar
22           06:50
23           21:50
24    08:15 07 May
25    12:35 10 Jun
26    03:35 02 Jun
27           12:55
28           12:35
29           09:20
Name: Arrival_Time, dtype: object

Here we don’t require the date after the time since we already have that in a different column now.

df['Arrival_Time'].str.split(' ')
0       [01:10, 22, Mar]
1                [13:15]
2       [04:25, 10, Jun]
3                [23:30]
4                [21:35]
              ...       
2666    [20:25, 07, Jun]
2667             [16:55]
2668    [04:25, 07, Mar]
2669             [19:15]
2670             [19:15]
Name: Arrival_Time, Length: 13354, dtype: object

We can see that this is a list and we only need the first value from it.

df['Arrival_Time'].str.split(' ').str[0]

# we can also use a lambda function 
# df['Arrival_Time'].apply(lambda x: x.split(' ')[0])
0       01:10
1       13:15
2       04:25
3       23:30
4       21:35
        ...  
2666    20:25
2667    16:55
2668    04:25
2669    19:15
2670    19:15
Name: Arrival_Time, Length: 13354, dtype: object
df['Arrival_Time'] = df['Arrival_Time'].str.split(' ').str[0]

We’ll split this feature for the hour and minute of arrival.

df['Arrival_Hour'] = df['Arrival_Time'].str.split(':').str[0]
df['Arrival_Minute'] = df['Arrival_Time'].str.split(':').str[1]
df.head(2)
Airline Source Destination Route Dep_Time Arrival_Time Duration Total_Stops Additional_Info Price Date Month Year Arrival_Hour Arrival_Minute
0 IndiGo Banglore New Delhi BLR → DEL 22:20 01:10 2h 50m non-stop No info 3897.0 24 3 2019 01 10
1 Air India Kolkata Banglore CCU → IXR → BBI → BLR 05:50 13:15 7h 25m 2 stops No info 7662.0 1 5 2019 13 15
# convert the data type
df['Arrival_Hour'] = df['Arrival_Hour'].astype(int)
df['Arrival_Minute'] = df['Arrival_Minute'].astype(int)
# drop column
df.drop('Arrival_Time', axis=1, inplace=True)

Departure Time

df['Dep_Time'].head()
0    22:20
1    05:50
2    09:25
3    18:05
4    16:50
Name: Dep_Time, dtype: object

Here we don’t require the date after the time since we already have that in a different column now.

df['Dep_Time'].str.split(' ')
0       [22:20]
1       [05:50]
2       [09:25]
3       [18:05]
4       [16:50]
         ...   
2666    [20:30]
2667    [14:20]
2668    [21:50]
2669    [04:00]
2670    [04:55]
Name: Dep_Time, Length: 13354, dtype: object

We can see that this is a list and we only need the first value from it.

df['Dep_Time'].str.split(' ').str[0]

# we can also use a lambda function 
# df['Arrival_Time'].apply(lambda x: x.split(' ')[0])
0       22:20
1       05:50
2       09:25
3       18:05
4       16:50
        ...  
2666    20:30
2667    14:20
2668    21:50
2669    04:00
2670    04:55
Name: Dep_Time, Length: 13354, dtype: object
df['Dep_Time'] = df['Dep_Time'].str.split(' ').str[0]

We’ll split this feature for the hour and minute of departure.

df['Dep_Hour'] = df['Dep_Time'].str.split(':').str[0]
df['Dep_Minute'] = df['Dep_Time'].str.split(':').str[1]
df.head(2)
Airline Source Destination Route Dep_Time Duration Total_Stops Additional_Info Price Date Month Year Arrival_Hour Arrival_Minute Dep_Hour Dep_Minute
0 IndiGo Banglore New Delhi BLR → DEL 22:20 2h 50m non-stop No info 3897.0 24 3 2019 1 10 22 20
1 Air India Kolkata Banglore CCU → IXR → BBI → BLR 05:50 7h 25m 2 stops No info 7662.0 1 5 2019 13 15 05 50
# convert the data type
df['Dep_Hour'] = df['Dep_Hour'].astype(int)
df['Dep_Minute'] = df['Dep_Minute'].astype(int)
# drop column
df.drop('Dep_Time', axis=1, inplace=True)

Route

df['Route'].head()
0                BLR → DEL
1    CCU → IXR → BBI → BLR
2    DEL → LKO → BOM → COK
3          CCU → NAG → BLR
4          BLR → NAG → DEL
Name: Route, dtype: object

We can see that there can be multiple destinations for a particular route.

Total stops

We know that there is one null value in this column.

df['Total_Stops'].unique()
array(['non-stop', '2 stops', '1 stop', '3 stops', nan, '4 stops'],
      dtype=object)
# number of null values
df['Total_Stops'].isnull().sum()
1
# which row is this?
df[df['Total_Stops'].isnull()]
Airline Source Destination Route Duration Total_Stops Additional_Info Price Date Month Year Arrival_Hour Arrival_Minute Dep_Hour Dep_Minute
9039 Air India Delhi Cochin NaN 23h 40m NaN No info 7480.0 6 5 2019 9 25 9 45

We can see that the row which has a null value for Total_Stops also has a null value for the Route column - our only two missing values. It makes sense to just delete this row.

df = df.dropna(subset=['Route'])
df['Total_Stops'].isnull().sum()
0

Now we’ll map the values.

df['Total_Stops'] = df['Total_Stops'].map({'non-stop':0,'1 stop':1,'2 stops':2,'3 stops':3,'4 stops':4})

We’ll drop the Route column since it doesn’t look like it will help us.

# drop Route column
df.drop('Route', axis=1, inplace=True)

Duration

We should probably convert duration into minutes.

df['Duration'].head()
0    2h 50m
1    7h 25m
2       19h
3    5h 25m
4    4h 45m
Name: Duration, dtype: object
# convert values to Timedelta objects
df['Duration'] = df['Duration'].apply(pd.to_timedelta)

df['Duration_Min'] = df['Duration'].apply(lambda x: x.total_seconds() / 60)
df['Duration_Min'].head()
0     170.0
1     445.0
2    1140.0
3     325.0
4     285.0
Name: Duration_Min, dtype: float64
# drop column
df.drop('Duration', axis=1, inplace=True)

Additonal info

df['Additional_Info'].unique()
array(['No info', 'In-flight meal not included',
       'No check-in baggage included', '1 Short layover', 'No Info',
       '1 Long layover', 'Change airports', 'Business class',
       'Red-eye flight', '2 Long layover'], dtype=object)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 13353 entries, 0 to 2670
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Airline          13353 non-null  object 
 1   Source           13353 non-null  object 
 2   Destination      13353 non-null  object 
 3   Total_Stops      13353 non-null  int64  
 4   Additional_Info  13353 non-null  object 
 5   Price            10682 non-null  float64
 6   Date             13353 non-null  int64  
 7   Month            13353 non-null  int64  
 8   Year             13353 non-null  int64  
 9   Arrival_Hour     13353 non-null  int64  
 10  Arrival_Minute   13353 non-null  int64  
 11  Dep_Hour         13353 non-null  int64  
 12  Dep_Minute       13353 non-null  int64  
 13  Duration_Min     13353 non-null  float64
dtypes: float64(2), int64(8), object(4)
memory usage: 1.5+ MB

Airline

df['Airline'].uniqueque()
array(['IndiGo', 'Air India', 'Jet Airways', 'SpiceJet',
       'Multiple carriers', 'GoAir', 'Vistara', 'Air Asia',
       'Vistara Premium economy', 'Jet Airways Business',
       'Multiple carriers Premium economy', 'Trujet'], dtype=object)

Let’s do label encoding for the airlines.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Airline'] = le.fit_transform(df['Airline'])
# do the same for other columns
df['Source'] = le.fit_transform(df['Source'])
df['Destination'] = le.fit_transform(df['Destination'])
df['Additional_Info'] = le.fit_transform(df['Additional_Info'])
df.shape
(13353, 14)
df.head()
Airline Source Destination Total_Stops Additional_Info Price Date Month Year Arrival_Hour Arrival_Minute Dep_Hour Dep_Minute Duration_Min
0 3 0 5 0 8 3897.0 24 3 2019 1 10 22 20 170.0
1 1 3 0 2 8 7662.0 1 5 2019 13 15 5 50 445.0
2 4 2 1 2 8 13882.0 9 6 2019 4 25 9 25 1140.0
3 3 3 0 1 8 6218.0 12 5 2019 23 30 18 5 325.0
4 3 0 5 1 8 13302.0 1 3 2019 21 35 16 50 285.0

The data is now ready for modeling.