Handling categorical variables

A blog post on how to deal with categorical variables while analyzing data

Categories: analysis, tutorial
Published: March 15, 2022

A categorical (or qualitative) variable is a variable that can take on one of a limited, and usually fixed, number of possible values. In this blog post, we will see how to deal with categorical variables in any dataset. Most machine learning algorithms do not work well with string values as input variables, so we will discuss ways to convert these string variables into numerical ones. This process is called categorical variable encoding.

Types of categorical variables

  • Ordinal categorical variables: These are variables whose values follow a natural order. For example, a temperature variable can have values like low, medium, or high.
  • Nominal categorical variables: These are variables whose values do not follow a natural order. For example, gender values like male and female do not have any order.

Types of encoding

We’ll discuss two different types of encoding:

  • One-hot encoding: We create a new set of dummy (binary) variables equal in number to the number of categories (k) in the variable.
  • Dummy encoding: This also uses dummy (binary) variables, but instead of creating k dummy variables, it uses k-1. Dummy encoding removes the redundant category present in one-hot encoding, since the dropped category is implied whenever all the remaining dummies are zero.
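To make the difference concrete, here is a minimal sketch using a made-up color column (the values are hypothetical, for illustration only):

import pandas as pd

s = pd.Series(['red', 'green', 'blue'])

# one-hot encoding: k = 3 dummy variables, one per category
pd.get_dummies(s)

# dummy encoding: k - 1 = 2 dummy variables
# ('blue' is dropped and is implied when both remaining dummies are 0)
pd.get_dummies(s, drop_first=True)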


Implementation with Pandas

Both one-hot encoding and dummy encoding can be implemented in Pandas by using the get_dummies function.

pd.get_dummies(data, prefix, dummy_na, columns, drop_first)

  • data: Here we specify the data we need to encode. It can be a NumPy array, or a Pandas Series or DataFrame.
  • prefix: If we specify a prefix, it is prepended to the dummy column names so that we can easily identify them. The prefix can be specified as a string for a single column. For multiple columns, it is defined as a dictionary mapping column names to prefixes.
  • dummy_na: If False (default), missing values (NaN) are ignored when encoding the variables. If True, this will place missing data in a separate category (see the sketch after this list).
  • columns: This specifies the column names to be encoded. If None (default), all categorical columns in the data parameter will be encoded. If you specify column names as a list, only the specified columns will be encoded.
  • drop_first: This is the most important parameter. If False (default), this will perform one-hot encoding. If True, this will drop the first category of each variable and create k-1 dummy variables per categorical variable, i.e., perform dummy encoding.
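The diamonds data we use below has no missing values, so here is a quick standalone sketch of dummy_na using a made-up series containing a NaN (the values are hypothetical, for illustration only):

import numpy as np
import pandas as pd

s = pd.Series(['low', 'high', np.nan, 'medium'])

# dummy_na=True adds an extra dummy column for the missing value
pd.get_dummies(s, prefix='temp', dummy_na=True)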

Let’s see this in action.

Analyzing diamond data

The Diamonds dataset (available here) contains the prices and other attributes of almost 54,000 diamonds. Let’s explore it.

import pandas as pd
pd.set_option('display.max_columns', None)
df = pd.read_csv('data/diamonds.csv')
df.head()
Unnamed: 0 carat cut color clarity depth table price x y z
0 1 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 2 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 3 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 4 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 5 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
# drop the redundant index column
df.drop(columns=df.columns[0], inplace=True)
df.head()
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
df.shape
(53940, 10)

The dataset now consists of 53,940 rows and 10 columns.

# check for missing values
df.isnull().sum().sum()
0

There are no missing values in the dataset.
Let’s check how many categorical variables are present in the dataset.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object 
 2   color    53940 non-null  object 
 3   clarity  53940 non-null  object 
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64  
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB

There are three categorical variables in the dataset. They are cut, color, and clarity.
Let’s check their unique categories.

df['cut'].unique()
array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)
df['color'].unique()
array(['E', 'I', 'J', 'H', 'F', 'G', 'D'], dtype=object)
df['clarity'].unique()
array(['SI2', 'SI1', 'VS1', 'VS2', 'VVS2', 'VVS1', 'I1', 'IF'],
      dtype=object)
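Together, the three variables have 5 + 7 + 8 = 20 categories; keep this count in mind, as it explains the number of columns after encoding. A quick check:

# count unique categories per variable: cut 5, color 7, clarity 8
df[['cut', 'color', 'clarity']].nunique()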

Implementing one-hot encoding with Pandas

Let’s apply one-hot encoding to the color variable.

one_hot = pd.get_dummies(df['color'], prefix='color', drop_first=False)
one_hot.head()
color_D color_E color_F color_G color_H color_I color_J
0 0 1 0 0 0 0 0
1 0 1 0 0 0 0 0
2 0 1 0 0 0 0 0
3 0 0 0 0 0 1 0
4 0 0 0 0 0 0 1
one_hot.shape
(53940, 7)

This returns a pandas DataFrame of encoded data. The text specified in the prefix parameter is combined with the category names of the color variable. Now let’s add this one-hot encoding to the main dataset.

df_one_hot = pd.get_dummies(df, prefix='color', columns=['color'], drop_first=False)
df_one_hot.head()
carat cut clarity depth table price x y z color_D color_E color_F color_G color_H color_I color_J
0 0.23 Ideal SI2 61.5 55.0 326 3.95 3.98 2.43 0 1 0 0 0 0 0
1 0.21 Premium SI1 59.8 61.0 326 3.89 3.84 2.31 0 1 0 0 0 0 0
2 0.23 Good VS1 56.9 65.0 327 4.05 4.07 2.31 0 1 0 0 0 0 0
3 0.29 Premium VS2 62.4 58.0 334 4.20 4.23 2.63 0 0 0 0 0 1 0
4 0.31 Good SI2 63.3 58.0 335 4.34 4.35 2.75 0 0 0 0 0 0 1

Let’s apply one-hot encoding to all the categorical variables in the dataset.

df_one_hot = pd.get_dummies(df, prefix={
            'color':'color', 'cut':'cut', 'clarity':'clarity'}, 
            drop_first=False)
df_one_hot.head()
carat depth table price x y z cut_Fair cut_Good cut_Ideal cut_Premium cut_Very Good color_D color_E color_F color_G color_H color_I color_J clarity_I1 clarity_IF clarity_SI1 clarity_SI2 clarity_VS1 clarity_VS2 clarity_VVS1 clarity_VVS2
0 0.23 61.5 55.0 326 3.95 3.98 2.43 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
1 0.21 59.8 61.0 326 3.89 3.84 2.31 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0
2 0.23 56.9 65.0 327 4.05 4.07 2.31 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0
3 0.29 62.4 58.0 334 4.20 4.23 2.63 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0
4 0.31 63.3 58.0 335 4.34 4.35 2.75 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0
df_one_hot.shape
(53940, 27)

One-hot encoding has added 20 dummy variables to the dataset (5 for cut, 7 for color, and 8 for clarity) in place of the 3 original categorical columns.

Implementing dummy encoding with Pandas

To implement dummy encoding, we can follow the same steps as above, with the only difference being that the drop_first parameter must be set to True.

df_dummy = pd.get_dummies(df, prefix={
            'color':'color', 'cut':'cut', 'clarity':'clarity'}, 
            drop_first=True)
df_dummy.head()
carat depth table price x y z cut_Good cut_Ideal cut_Premium cut_Very Good color_E color_F color_G color_H color_I color_J clarity_IF clarity_SI1 clarity_SI2 clarity_VS1 clarity_VS2 clarity_VVS1 clarity_VVS2
0 0.23 61.5 55.0 326 3.95 3.98 2.43 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0
1 0.21 59.8 61.0 326 3.89 3.84 2.31 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0
2 0.23 56.9 65.0 327 4.05 4.07 2.31 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0
3 0.29 62.4 58.0 334 4.20 4.23 2.63 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0
4 0.31 63.3 58.0 335 4.34 4.35 2.75 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
df_dummy.shape
(53940, 24)

Dummy encoding has added 17 dummy variables to the dataset (4 for cut, 6 for color, and 7 for clarity), one fewer per variable than one-hot encoding.
So both one-hot and dummy encoding expand the feature space (dimensionality) of your dataset.


Implementing encoding with Scikit-learn

Both one-hot and dummy encoding can be implemented in Scikit-learn by using its OneHotEncoder class.

ohe = OneHotEncoder(categories, drop, sparse_output)
encoded_data = ohe.fit_transform(original_data)

fit_transform returns the encoded data as a NumPy array (or a sparse matrix, depending on sparse_output).

  • categories: The default is 'auto', which automatically determines the categories in each variable from the training data.
  • drop: The default is None, which performs one-hot encoding. To perform dummy encoding, set this parameter to 'first', which drops the first category of each variable.
  • sparse_output: Set this to False to return the output as a dense NumPy array. The default is True, which returns a sparse matrix. (In scikit-learn versions before 1.2, this parameter was called sparse.)

Implementing one-hot encoding with Scikit-learn

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(categories='auto', drop=None, sparse_output=False)

# only pass categorical variables
df_ohe = pd.DataFrame(ohe.fit_transform(df[['cut', 'color', 'clarity']]))
df_ohe.head()
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
2 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
3 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
4 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
df_ohe.shape
(53940, 20)
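The encoded columns lose their original names and are simply numbered 0-19. Assuming a reasonably recent scikit-learn (1.0 or later, where get_feature_names_out is available), one way to restore descriptive names is:

# recover descriptive column names from the fitted encoder
df_ohe.columns = ohe.get_feature_names_out(['cut', 'color', 'clarity'])
df_ohe.head()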

Implementing dummy encoding with Scikit-learn

dum = OneHotEncoder(categories='auto', drop='first', sparse_output=False)
# only pass categorical variables
df_dum = pd.DataFrame(dum.fit_transform(df[['cut', 'color', 'clarity']]))
df_dum.shape
(53940, 17)

Conclusion

Advantages of dummy encoding over one-hot encoding

  • Dummy encoding adds fewer dummy variables than one-hot encoding does (k-1 instead of k per variable).
  • Dummy encoding removes the redundant category in each categorical variable, avoiding perfectly correlated (multicollinear) dummy variables.

Advantages of Pandas get_dummies() over Scikit-learn OneHotEncoder()

  • The get_dummies() function returns the encoded data with variable names, and we can add prefixes to the dummy variables of each categorical variable.
  • The get_dummies() function returns the entire dataset, keeping the numerical variables as well.

When to use what?

Both one-hot and dummy encoding can be used for nominal and ordinal categorical variables. However, if you strictly want to keep the natural order of ordinal variables, you can use label encoding instead of the above two types.
One advantage of label encoding is that it does not expand the feature space at all: we simply replace category names with numbers, with no dummy variables involved.
The major disadvantage of label encoding is that machine learning algorithms may assume a numerical relationship between the encoded categories. For example, say we have a categorical variable Quality with three categories: “Fair”, “Good”, and “Premium”. An algorithm may interpret Premium (2) as twice as good as Good (1), when in reality no such relationship exists between the categories.
To avoid this, label encoding should only be applied to target (y) values, not to input (X) values.
In summary, label encoding is suitable for ordinal categorical variables, while one-hot encoding is preferred for nominal ones. Label encoding is simple and space-efficient but may introduce arbitrary numerical relationships. One-hot encoding provides explicit representations and is compatible with most algorithms, but it can lead to high dimensionality and sparse data. The choice depends on the characteristics of the data and the requirements of the machine learning problem at hand.
Label encoding can be applied with Scikit-learn’s LabelEncoder class. Let’s apply it to the cut variable in our diamonds dataset. This is for illustration purposes only, since we would not normally use label encoding for input (X) values.

from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integer codes in alphabetical order:
# Fair=0, Good=1, Ideal=2, Premium=3, Very Good=4
df['cut_enc'] = LabelEncoder().fit_transform(df['cut'])
df.head(10)
carat cut color clarity depth table price x y z cut_enc
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 2
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 3
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 1
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 3
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 1
5 0.24 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48 4
6 0.24 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47 4
7 0.26 Very Good H SI1 61.9 55.0 337 4.07 4.11 2.53 4
8 0.22 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49 0
9 0.23 Very Good H VS1 59.4 61.0 338 4.00 4.05 2.39 4

Since the new encoded column cut_enc has been added, we can now remove the cut column.
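A minimal sketch of that final cleanup:

# drop the original string column now that cut_enc carries the same information
df.drop(columns='cut', inplace=True)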