Handling categorical variables

A blog post on how to deal with categorical variables while analyzing data

Categories: analysis, tutorial
Published: March 15, 2022

A categorical (or qualitative) variable is a variable that can take on one of a limited, and usually fixed, number of possible values. In this blog post, we will see how to deal with categorical variables in any dataset. Most machine learning algorithms do not work well with string values as input variables, so we will discuss ways to convert these string variables into numerical ones. This process is called categorical variable encoding.

Types of categorical variables

  • Ordinal categorical variables: These are variables whose values follow a natural order. For example, a temperature variable can have values like low, medium, or high.
  • Nominal categorical variables: These are variables whose values do not follow a natural order. For example, gender values like male and female do not have any order.

Types of encoding

We’ll discuss two different types of encoding:

  • One-hot encoding: We create a new set of dummy (binary) variables equal in number to the number of categories (k) in the variable.
  • Dummy encoding: This also uses dummy (binary) variables, but instead of creating k dummy variables, it uses k-1. Dummy encoding removes the redundant category present in one-hot encoding, since the dropped category is implied whenever all the remaining dummies are zero.
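To make the difference concrete, here is a minimal sketch using a made-up color column (the values are hypothetical, for illustration only):

import pandas as pd

s = pd.Series(['red', 'green', 'blue'])

# one-hot encoding: k = 3 dummy variables, one per category
pd.get_dummies(s)

# dummy encoding: k - 1 = 2 dummy variables
# ('blue' is dropped and is implied when both remaining dummies are 0)
pd.get_dummies(s, drop_first=True)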


Implementation with Pandas

Both one-hot encoding and dummy encoding can be implemented in Pandas by using the get_dummies function.

pd.get_dummies(data, prefix, dummy_na, columns, drop_first)

  • data: Here we specify the data we need to encode. It can be a NumPy array, or a Pandas Series or DataFrame.
  • prefix: If we specify a prefix, it is prepended to the dummy column names so that we can easily identify them. The prefix can be specified as a string for a single column. For multiple columns, it is defined as a dictionary mapping column names to prefixes.
  • dummy_na: If False (default), missing values (NaN) are ignored when encoding the variables. If True, this will place missing data in a separate category (see the sketch after this list).
  • columns: This specifies the column names to be encoded. If None (default), all categorical columns in the data parameter will be encoded. If you specify column names as a list, only the specified columns will be encoded.
  • drop_first: This is the most important parameter. If False (default), this will perform one-hot encoding. If True, this will drop the first category of each variable and create k-1 dummy variables per categorical variable, i.e., perform dummy encoding.
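The diamonds data we use below has no missing values, so here is a quick standalone sketch of dummy_na using a made-up series containing a NaN (the values are hypothetical, for illustration only):

import numpy as np
import pandas as pd

s = pd.Series(['low', 'high', np.nan, 'medium'])

# dummy_na=True adds an extra dummy column for the missing value
pd.get_dummies(s, prefix='temp', dummy_na=True)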

Let’s see this in action.

Analyzing diamond data

The Diamonds dataset (available here) contains the prices and other attributes of almost 54,000 diamonds. Let’s explore it.

import pandas as pd
pd.set_option('display.max_columns', None)
df = pd.read_csv('data/diamonds.csv')
df.head()
Unnamed: 0 carat cut color clarity depth table price x y z
0 1 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 2 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 3 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 4 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 5 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
# drop the redundant index column
df.drop(columns=df.columns[0], inplace=True)
df.head()
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
df.shape
(53940, 10)

The dataset now consists of 53,940 rows and 10 columns.

# check for missing values
df.isnull().sum().sum()
0

There are no missing values in the dataset.
Let’s check how many categorical variables are present in the dataset.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object 
 2   color    53940 non-null  object 
 3   clarity  53940 non-null  object 
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64  
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB

There are three categorical variables in the dataset. They are cut, color, and clarity.
Let’s check their unique categories.

df['cut'].unique()
array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)
df['color'].unique()
array(['E', 'I', 'J', 'H', 'F', 'G', 'D'], dtype=object)
df['clarity'].unique()
array(['SI2', 'SI1', 'VS1', 'VS2', 'VVS2', 'VVS1', 'I1', 'IF'],
      dtype=object)
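Together, the three variables have 5 + 7 + 8 = 20 categories; keep this count in mind, as it explains the number of columns after encoding. A quick check:

# count unique categories per variable: cut 5, color 7, clarity 8
df[['cut', 'color', 'clarity']].nunique()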

Implementing one-hot encoding with Pandas

Let’s apply one-hot encoding to the color variable.

one_hot = pd.get_dummies(df['color'], prefix='color', drop_first=False)
one_hot.head()
color_D color_E color_F color_G color_H color_I color_J
0 0 1 0 0 0 0 0
1 0 1 0 0 0 0 0
2 0 1 0 0 0 0 0
3 0 0 0 0 0 1 0
4 0 0 0 0 0 0 1
one_hot.shape
(53940, 7)

This returns a pandas DataFrame of encoded data. The text specified in the prefix parameter is combined with the category names of the color variable. Now let’s add this one-hot encoding to the main dataset.

df_one_hot = pd.get_dummies(df, prefix='color', columns=['color'], drop_first=False)
df_one_hot.head()
carat cut clarity depth table price x y z color_D color_E color_F color_G color_H color_I color_J
0 0.23 Ideal SI2 61.5 55.0 326 3.95 3.98 2.43 0 1 0 0 0 0 0
1 0.21 Premium SI1 59.8 61.0 326 3.89 3.84 2.31 0 1 0 0 0 0 0
2 0.23 Good VS1 56.9 65.0 327 4.05 4.07 2.31 0 1 0 0 0 0 0
3 0.29 Premium VS2 62.4 58.0 334 4.20 4.23 2.63 0 0 0 0 0 1 0
4 0.31 Good SI2 63.3 58.0 335 4.34 4.35 2.75 0 0 0 0 0 0 1

Let’s apply one-hot encoding to all the categorical variables in the dataset.

df_one_hot = pd.get_dummies(df, prefix={
            'color':'color', 'cut':'cut', 'clarity':'clarity'}, 
            drop_first=False)
df_one_hot.head()
carat depth table price x y z cut_Fair cut_Good cut_Ideal cut_Premium cut_Very Good color_D color_E color_F color_G color_H color_I color_J clarity_I1 clarity_IF clarity_SI1 clarity_SI2 clarity_VS1 clarity_VS2 clarity_VVS1 clarity_VVS2
0 0.23 61.5 55.0 326 3.95 3.98 2.43 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
1 0.21 59.8 61.0 326 3.89 3.84 2.31 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0
2 0.23 56.9 65.0 327 4.05 4.07 2.31 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0
3 0.29 62.4 58.0 334 4.20 4.23 2.63 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0
4 0.31 63.3 58.0 335 4.34 4.35 2.75 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0
df_one_hot.shape
(53940, 27)

One-hot encoding has added 20 dummy variables to the dataset (5 for cut, 7 for color, and 8 for clarity) in place of the 3 original categorical columns.

Implementing dummy encoding with Pandas

To implement dummy encoding, we can follow the same steps as above, with the only difference being that the drop_first parameter must be set to True.

df_dummy = pd.get_dummies(df, prefix={
            'color':'color', 'cut':'cut', 'clarity':'clarity'}, 
            drop_first=True)
df_dummy.head()
carat depth table price x y z cut_Good cut_Ideal cut_Premium cut_Very Good color_E color_F color_G color_H color_I color_J clarity_IF clarity_SI1 clarity_SI2 clarity_VS1 clarity_VS2 clarity_VVS1 clarity_VVS2
0 0.23 61.5 55.0 326 3.95 3.98 2.43 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0
1 0.21 59.8 61.0 326 3.89 3.84 2.31 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0
2 0.23 56.9 65.0 327 4.05 4.07 2.31 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0
3 0.29 62.4 58.0 334 4.20 4.23 2.63 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0
4 0.31 63.3 58.0 335 4.34 4.35 2.75 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
df_dummy.shape
(53940, 24)

Dummy encoding has added 17 dummy variables to the dataset (4 for cut, 6 for color, and 7 for clarity), one fewer per variable than one-hot encoding.
So both one-hot and dummy encoding expand the feature space (dimensionality) of your dataset.


Implementing encoding with Scikit-learn

Both one-hot and dummy encoding can be implemented in Scikit-learn by using its OneHotEncoder class.

ohe = OneHotEncoder(categories, drop, sparse_output)
encoded_data = ohe.fit_transform(original_data)

fit_transform returns the encoded data as a NumPy array (or a sparse matrix, depending on sparse_output).

  • categories: The default is 'auto', which automatically determines the categories in each variable from the training data.
  • drop: The default is None, which performs one-hot encoding. To perform dummy encoding, set this parameter to 'first', which drops the first category of each variable.
  • sparse_output: Set this to False to return the output as a dense NumPy array. The default is True, which returns a sparse matrix. (In scikit-learn versions before 1.2, this parameter was called sparse.)

Implementing one-hot encoding with Scikit-learn

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(categories='auto', drop=None, sparse_output=False)

# only pass categorical variables
df_ohe = pd.DataFrame(ohe.fit_transform(df[['cut', 'color', 'clarity']]))
df_ohe.head()
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
2 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
3 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
4 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
df_ohe.shape
(53940, 20)
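The encoded columns lose their original names and are simply numbered 0-19. Assuming a reasonably recent scikit-learn (1.0 or later, where get_feature_names_out is available), one way to restore descriptive names is:

# recover descriptive column names from the fitted encoder
df_ohe.columns = ohe.get_feature_names_out(['cut', 'color', 'clarity'])
df_ohe.head()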

Implementing dummy encoding with Scikit-learn

dum = OneHotEncoder(categories='auto', drop='first', sparse_output=False)
# only pass categorical variables
df_dum = pd.DataFrame(dum.fit_transform(df[['cut', 'color', 'clarity']]))
df_dum.shape
(53940, 17)

Conclusion

Advantages of dummy encoding over one-hot encoding

  • Dummy encoding adds fewer dummy variables than one-hot encoding does (k-1 instead of k per variable).
  • Dummy encoding removes the redundant category in each categorical variable, avoiding perfectly correlated (multicollinear) dummy variables.

Advantages of Pandas get_dummies() over Scikit-learn OneHotEncoder()

  • The get_dummies() function returns the encoded data with variable names, and we can add prefixes to the dummy variables of each categorical variable.
  • The get_dummies() function returns the entire dataset, keeping the numerical variables as well.

When to use what?

Both one-hot and dummy encoding can be used for nominal and ordinal categorical variables. However, if you strictly want to keep the natural order of ordinal variables, you can use label encoding instead of the above two types.
One advantage of label encoding is that it does not expand the feature space at all: we simply replace category names with numbers, with no dummy variables involved.
The major disadvantage of label encoding is that machine learning algorithms may assume a numerical relationship between the encoded categories. For example, say we have a categorical variable Quality with three categories: “Fair”, “Good”, and “Premium”. An algorithm may interpret Premium (2) as twice as good as Good (1), when in reality no such relationship exists between the categories.
To avoid this, label encoding should only be applied to target (y) values, not to input (X) values.
In summary, label encoding is suitable for ordinal categorical variables, while one-hot encoding is preferred for nominal ones. Label encoding is simple and space-efficient but may introduce arbitrary numerical relationships. One-hot encoding provides explicit representations and is compatible with most algorithms, but it can lead to high dimensionality and sparse data. The choice depends on the characteristics of the data and the requirements of the machine learning problem at hand.
Label encoding can be applied with Scikit-learn’s LabelEncoder class. Let’s apply it to the cut variable in our diamonds dataset. This is for illustration purposes only, since we would not normally use label encoding for input (X) values.

from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integer codes in alphabetical order:
# Fair=0, Good=1, Ideal=2, Premium=3, Very Good=4
df['cut_enc'] = LabelEncoder().fit_transform(df['cut'])
df.head(10)
carat cut color clarity depth table price x y z cut_enc
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 2
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 3
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 1
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 3
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 1
5 0.24 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48 4
6 0.24 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47 4
7 0.26 Very Good H SI1 61.9 55.0 337 4.07 4.11 2.53 4
8 0.22 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49 0
9 0.23 Very Good H VS1 59.4 61.0 338 4.00 4.05 2.39 4

Since the new encoded column cut_enc has been added, we can now remove the cut column.
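A minimal sketch of that final cleanup:

# drop the original string column now that cut_enc carries the same information
df.drop(columns='cut', inplace=True)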