We will be working with a time series dataset that gives us information about flights; the airline names, dates of journey, cities, times, etc. The dataset already consists of a train set and a test set.
What we will be doing is preparing the data for modeling.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
'figure.figsize'] = (12, 6)
matplotlib.rcParams['display.max_columns', None) pd.set_option(
= pd.read_excel('data/flights_india/train.xlsx')
df_train df_train.head()
Airline | Date_of_Journey | Source | Destination | Route | Dep_Time | Arrival_Time | Duration | Total_Stops | Additional_Info | Price | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | IndiGo | 24/03/2019 | Banglore | New Delhi | BLR → DEL | 22:20 | 01:10 22 Mar | 2h 50m | non-stop | No info | 3897 |
1 | Air India | 1/05/2019 | Kolkata | Banglore | CCU → IXR → BBI → BLR | 05:50 | 13:15 | 7h 25m | 2 stops | No info | 7662 |
2 | Jet Airways | 9/06/2019 | Delhi | Cochin | DEL → LKO → BOM → COK | 09:25 | 04:25 10 Jun | 19h | 2 stops | No info | 13882 |
3 | IndiGo | 12/05/2019 | Kolkata | Banglore | CCU → NAG → BLR | 18:05 | 23:30 | 5h 25m | 1 stop | No info | 6218 |
4 | IndiGo | 01/03/2019 | Banglore | New Delhi | BLR → NAG → DEL | 16:50 | 21:35 | 4h 45m | 1 stop | No info | 13302 |
df_train.shape
(10683, 11)
= pd.read_excel('data/flights_india/test.xlsx')
df_test df_test.head()
Airline | Date_of_Journey | Source | Destination | Route | Dep_Time | Arrival_Time | Duration | Total_Stops | Additional_Info | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Jet Airways | 6/06/2019 | Delhi | Cochin | DEL → BOM → COK | 17:30 | 04:25 07 Jun | 10h 55m | 1 stop | No info |
1 | IndiGo | 12/05/2019 | Kolkata | Banglore | CCU → MAA → BLR | 06:20 | 10:20 | 4h | 1 stop | No info |
2 | Jet Airways | 21/05/2019 | Delhi | Cochin | DEL → BOM → COK | 19:15 | 19:00 22 May | 23h 45m | 1 stop | In-flight meal not included |
3 | Multiple carriers | 21/05/2019 | Delhi | Cochin | DEL → BOM → COK | 08:00 | 21:00 | 13h | 1 stop | No info |
4 | Air Asia | 24/06/2019 | Banglore | Delhi | BLR → DEL | 23:55 | 02:45 25 Jun | 2h 50m | non-stop | No info |
df_test.shape
(2671, 10)
Let’s combine the datasets.
= df_train.append(df_test)
df 2) df.head(
/var/folders/45/dyvnkjyn6s1gwy371h4b1lk00000gn/T/ipykernel_4029/3796357120.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
df = df_train.append(df_test)
Airline | Date_of_Journey | Source | Destination | Route | Dep_Time | Arrival_Time | Duration | Total_Stops | Additional_Info | Price | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | IndiGo | 24/03/2019 | Banglore | New Delhi | BLR → DEL | 22:20 | 01:10 22 Mar | 2h 50m | non-stop | No info | 3897.0 |
1 | Air India | 1/05/2019 | Kolkata | Banglore | CCU → IXR → BBI → BLR | 05:50 | 13:15 | 7h 25m | 2 stops | No info | 7662.0 |
df.shape
(13354, 11)
df.tail()
Airline | Date_of_Journey | Source | Destination | Route | Dep_Time | Arrival_Time | Duration | Total_Stops | Additional_Info | Price | |
---|---|---|---|---|---|---|---|---|---|---|---|
2666 | Air India | 6/06/2019 | Kolkata | Banglore | CCU → DEL → BLR | 20:30 | 20:25 07 Jun | 23h 55m | 1 stop | No info | NaN |
2667 | IndiGo | 27/03/2019 | Kolkata | Banglore | CCU → BLR | 14:20 | 16:55 | 2h 35m | non-stop | No info | NaN |
2668 | Jet Airways | 6/03/2019 | Delhi | Cochin | DEL → BOM → COK | 21:50 | 04:25 07 Mar | 6h 35m | 1 stop | No info | NaN |
2669 | Air India | 6/03/2019 | Delhi | Cochin | DEL → BOM → COK | 04:00 | 19:15 | 15h 15m | 1 stop | No info | NaN |
2670 | Multiple carriers | 15/06/2019 | Delhi | Cochin | DEL → BOM → COK | 04:55 | 19:15 | 14h 20m | 1 stop | No info | NaN |
The date and time are in the string format, which should be converted to a datetime format. We should probably also derive features like day, month and year from the date.
Check for Missing Values
sum() df.isnull().
Airline 0
Date_of_Journey 0
Source 0
Destination 0
Route 1
Dep_Time 0
Arrival_Time 0
Duration 0
Total_Stops 1
Additional_Info 0
Price 2671
dtype: int64
There are negligible missing values.
Feature Engineering
Date of journey
'Date_of_Journey'].str.split('/')[0] df[
0 [24, 03, 2019]
0 [6, 06, 2019]
Name: Date_of_Journey, dtype: object
# get the date
'Date_of_Journey'].str.split('/')[0].str[0] df[
0 24
0 6
Name: Date_of_Journey, dtype: object
'Date'] = df['Date_of_Journey'].str.split('/').str[0]
df['Month'] = df['Date_of_Journey'].str.split('/').str[1]
df['Year'] = df['Date_of_Journey'].str.split('/').str[2] df[
2) df.head(
Airline | Date_of_Journey | Source | Destination | Route | Dep_Time | Arrival_Time | Duration | Total_Stops | Additional_Info | Price | Date | Month | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | IndiGo | 24/03/2019 | Banglore | New Delhi | BLR → DEL | 22:20 | 01:10 22 Mar | 2h 50m | non-stop | No info | 3897.0 | 24 | 03 | 2019 |
1 | Air India | 1/05/2019 | Kolkata | Banglore | CCU → IXR → BBI → BLR | 05:50 | 13:15 | 7h 25m | 2 stops | No info | 7662.0 | 1 | 05 | 2019 |
Now we’ll change the data types of the new columns.
'Date'] = df['Date'].astype(int)
df['Month'] = df['Month'].astype(int)
df['Year'] = df['Year'].astype(int) df[
Let’s drop the Date_of_Journey
column.
'Date_of_Journey', axis=1, inplace=True) df.drop(
Arrival time
'Arrival_Time'].head(30) df[
0 01:10 22 Mar
1 13:15
2 04:25 10 Jun
3 23:30
4 21:35
5 11:25
6 10:25 13 Mar
7 05:05 02 Mar
8 10:25 13 Mar
9 19:15
10 23:00
11 22:55
12 13:55
13 09:20 10 May
14 19:50
15 19:15 04 Mar
16 13:15
17 12:35 13 Jun
18 19:15 13 Jun
19 12:35 28 May
20 19:20
21 13:20 19 Mar
22 06:50
23 21:50
24 08:15 07 May
25 12:35 10 Jun
26 03:35 02 Jun
27 12:55
28 12:35
29 09:20
Name: Arrival_Time, dtype: object
Here we don’t require the date after the time since we already have that in a different column now.
'Arrival_Time'].str.split(' ') df[
0 [01:10, 22, Mar]
1 [13:15]
2 [04:25, 10, Jun]
3 [23:30]
4 [21:35]
...
2666 [20:25, 07, Jun]
2667 [16:55]
2668 [04:25, 07, Mar]
2669 [19:15]
2670 [19:15]
Name: Arrival_Time, Length: 13354, dtype: object
We can see that this is a list and we only need the first value from it.
'Arrival_Time'].str.split(' ').str[0]
df[
# we can also use a lambda function
# df['Arrival_Time'].apply(lambda x: x.split(' ')[0])
0 01:10
1 13:15
2 04:25
3 23:30
4 21:35
...
2666 20:25
2667 16:55
2668 04:25
2669 19:15
2670 19:15
Name: Arrival_Time, Length: 13354, dtype: object
'Arrival_Time'] = df['Arrival_Time'].str.split(' ').str[0] df[
We’ll split this feature for the hour and minute of arrival.
'Arrival_Hour'] = df['Arrival_Time'].str.split(':').str[0]
df['Arrival_Minute'] = df['Arrival_Time'].str.split(':').str[1] df[
2) df.head(
Airline | Source | Destination | Route | Dep_Time | Arrival_Time | Duration | Total_Stops | Additional_Info | Price | Date | Month | Year | Arrival_Hour | Arrival_Minute | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | IndiGo | Banglore | New Delhi | BLR → DEL | 22:20 | 01:10 | 2h 50m | non-stop | No info | 3897.0 | 24 | 3 | 2019 | 01 | 10 |
1 | Air India | Kolkata | Banglore | CCU → IXR → BBI → BLR | 05:50 | 13:15 | 7h 25m | 2 stops | No info | 7662.0 | 1 | 5 | 2019 | 13 | 15 |
# convert the data type
'Arrival_Hour'] = df['Arrival_Hour'].astype(int)
df['Arrival_Minute'] = df['Arrival_Minute'].astype(int) df[
# drop column
'Arrival_Time', axis=1, inplace=True) df.drop(
Departure Time
'Dep_Time'].head() df[
0 22:20
1 05:50
2 09:25
3 18:05
4 16:50
Name: Dep_Time, dtype: object
Here we don’t require the date after the time since we already have that in a different column now.
'Dep_Time'].str.split(' ') df[
0 [22:20]
1 [05:50]
2 [09:25]
3 [18:05]
4 [16:50]
...
2666 [20:30]
2667 [14:20]
2668 [21:50]
2669 [04:00]
2670 [04:55]
Name: Dep_Time, Length: 13354, dtype: object
We can see that this is a list and we only need the first value from it.
'Dep_Time'].str.split(' ').str[0]
df[
# we can also use a lambda function
# df['Arrival_Time'].apply(lambda x: x.split(' ')[0])
0 22:20
1 05:50
2 09:25
3 18:05
4 16:50
...
2666 20:30
2667 14:20
2668 21:50
2669 04:00
2670 04:55
Name: Dep_Time, Length: 13354, dtype: object
'Dep_Time'] = df['Dep_Time'].str.split(' ').str[0] df[
We’ll split this feature for the hour and minute of departure.
'Dep_Hour'] = df['Dep_Time'].str.split(':').str[0]
df['Dep_Minute'] = df['Dep_Time'].str.split(':').str[1] df[
2) df.head(
Airline | Source | Destination | Route | Dep_Time | Duration | Total_Stops | Additional_Info | Price | Date | Month | Year | Arrival_Hour | Arrival_Minute | Dep_Hour | Dep_Minute | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | IndiGo | Banglore | New Delhi | BLR → DEL | 22:20 | 2h 50m | non-stop | No info | 3897.0 | 24 | 3 | 2019 | 1 | 10 | 22 | 20 |
1 | Air India | Kolkata | Banglore | CCU → IXR → BBI → BLR | 05:50 | 7h 25m | 2 stops | No info | 7662.0 | 1 | 5 | 2019 | 13 | 15 | 05 | 50 |
# convert the data type
'Dep_Hour'] = df['Dep_Hour'].astype(int)
df['Dep_Minute'] = df['Dep_Minute'].astype(int) df[
# drop column
'Dep_Time', axis=1, inplace=True) df.drop(
Route
'Route'].head() df[
0 BLR → DEL
1 CCU → IXR → BBI → BLR
2 DEL → LKO → BOM → COK
3 CCU → NAG → BLR
4 BLR → NAG → DEL
Name: Route, dtype: object
We can see that there can be multiple destinations for a particular route.
Total stops
We know that there is one null value in this column.
'Total_Stops'].unique() df[
array(['non-stop', '2 stops', '1 stop', '3 stops', nan, '4 stops'],
dtype=object)
# number of null values
'Total_Stops'].isnull().sum() df[
1
# which row is this?
'Total_Stops'].isnull()] df[df[
Airline | Source | Destination | Route | Duration | Total_Stops | Additional_Info | Price | Date | Month | Year | Arrival_Hour | Arrival_Minute | Dep_Hour | Dep_Minute | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9039 | Air India | Delhi | Cochin | NaN | 23h 40m | NaN | No info | 7480.0 | 6 | 5 | 2019 | 9 | 25 | 9 | 45 |
We can see that the row which has a null value for Total_Stops
also has a null value for the Route
column - our only two missing values. It makes sense to just delete this row.
= df.dropna(subset=['Route']) df
'Total_Stops'].isnull().sum() df[
0
Now we’ll map the values.
'Total_Stops'] = df['Total_Stops'].map({'non-stop':0,'1 stop':1,'2 stops':2,'3 stops':3,'4 stops':4}) df[
We’ll drop the Route
column since it doesn’t look like it will help us.
# drop Route column
'Route', axis=1, inplace=True) df.drop(
Duration
We should probably convert duration into minutes.
'Duration'].head() df[
0 2h 50m
1 7h 25m
2 19h
3 5h 25m
4 4h 45m
Name: Duration, dtype: object
# convert values to Timedelta objects
'Duration'] = df['Duration'].apply(pd.to_timedelta)
df[
'Duration_Min'] = df['Duration'].apply(lambda x: x.total_seconds() / 60) df[
'Duration_Min'].head() df[
0 170.0
1 445.0
2 1140.0
3 325.0
4 285.0
Name: Duration_Min, dtype: float64
# drop column
'Duration', axis=1, inplace=True) df.drop(
Additonal info
'Additional_Info'].unique() df[
array(['No info', 'In-flight meal not included',
'No check-in baggage included', '1 Short layover', 'No Info',
'1 Long layover', 'Change airports', 'Business class',
'Red-eye flight', '2 Long layover'], dtype=object)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 13353 entries, 0 to 2670
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Airline 13353 non-null object
1 Source 13353 non-null object
2 Destination 13353 non-null object
3 Total_Stops 13353 non-null int64
4 Additional_Info 13353 non-null object
5 Price 10682 non-null float64
6 Date 13353 non-null int64
7 Month 13353 non-null int64
8 Year 13353 non-null int64
9 Arrival_Hour 13353 non-null int64
10 Arrival_Minute 13353 non-null int64
11 Dep_Hour 13353 non-null int64
12 Dep_Minute 13353 non-null int64
13 Duration_Min 13353 non-null float64
dtypes: float64(2), int64(8), object(4)
memory usage: 1.5+ MB
Airline
'Airline'].uniqueque() df[
array(['IndiGo', 'Air India', 'Jet Airways', 'SpiceJet',
'Multiple carriers', 'GoAir', 'Vistara', 'Air Asia',
'Vistara Premium economy', 'Jet Airways Business',
'Multiple carriers Premium economy', 'Trujet'], dtype=object)
Let’s do label encoding for the airlines.
from sklearn.preprocessing import LabelEncoder
= LabelEncoder()
le 'Airline'] = le.fit_transform(df['Airline']) df[
# do the same for other columns
'Source'] = le.fit_transform(df['Source'])
df['Destination'] = le.fit_transform(df['Destination'])
df['Additional_Info'] = le.fit_transform(df['Additional_Info']) df[
df.shape
(13353, 14)
df.head()
Airline | Source | Destination | Total_Stops | Additional_Info | Price | Date | Month | Year | Arrival_Hour | Arrival_Minute | Dep_Hour | Dep_Minute | Duration_Min | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 0 | 5 | 0 | 8 | 3897.0 | 24 | 3 | 2019 | 1 | 10 | 22 | 20 | 170.0 |
1 | 1 | 3 | 0 | 2 | 8 | 7662.0 | 1 | 5 | 2019 | 13 | 15 | 5 | 50 | 445.0 |
2 | 4 | 2 | 1 | 2 | 8 | 13882.0 | 9 | 6 | 2019 | 4 | 25 | 9 | 25 | 1140.0 |
3 | 3 | 3 | 0 | 1 | 8 | 6218.0 | 12 | 5 | 2019 | 23 | 30 | 18 | 5 | 325.0 |
4 | 3 | 0 | 5 | 1 | 8 | 13302.0 | 1 | 3 | 2019 | 21 | 35 | 16 | 50 | 285.0 |
The data is now ready for modeling.