A categorical (or qualitative) variable is a variable that can take on one of a limited, and usually fixed, number of possible values. In this blog post, we will see how to handle categorical variables in a dataset. Most machine learning algorithms do not work well with string values as input, so we will discuss ways to convert these string variables into numerical ones. This process is called categorical variable encoding.
Types of categorical variables
- Ordinal categorical variables: These are variables whose values follow a natural order. For example, a temperature variable can take values like low, medium, or high.
- Nominal categorical variables: These are variables whose values do not follow a natural order. For example, gender values like male and female do not have any inherent order.
Types of encoding
We'll discuss two different types of encoding:

- One-hot encoding: We create a set of new dummy (binary) variables whose number equals the number of categories (k) in the variable.
- Dummy encoding: This also uses dummy (binary) variables, but instead of creating k dummy variables, it creates k-1. Dummy encoding removes the redundant category present in one-hot encoding.
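As a quick illustration, here is a minimal sketch (the `size` variable and its values are invented for this example) of what the two encodings produce for a variable with k = 3 categories:

```python
import pandas as pd

# hypothetical toy variable with k = 3 categories
toy = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium']})

# one-hot encoding: k = 3 dummy columns
print(pd.get_dummies(toy['size'], prefix='size'))

# dummy encoding: k - 1 = 2 dummy columns ('large', the first
# category in sorted order, is dropped)
print(pd.get_dummies(toy['size'], prefix='size', drop_first=True))
```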
Implementation with Pandas
Both one-hot encoding and dummy encoding can be implemented in Pandas with the `get_dummies()` function:

```python
pd.get_dummies(data, prefix, dummy_na, columns, drop_first)
```
- `data`: The data we need to encode. It can be a NumPy array, or a Pandas Series or DataFrame.
- `prefix`: If we specify a prefix, it is added to the new column names so we can easily identify them. For a single column, the prefix is given as a string; for multiple columns, it is given as a dictionary mapping column names to prefixes.
- `dummy_na`: If `False` (default), missing values (NaN) are ignored when encoding the variables. If `True`, missing values get a separate dummy category.
- `columns`: The column names to be encoded. If `None` (default), all categorical columns in `data` are encoded. If a list of column names is given, only those columns are encoded.
- `drop_first`: This is the most important parameter. If `False` (default), `get_dummies()` performs one-hot encoding. If `True`, it drops the first category of each variable and creates k-1 dummy variables per categorical variable, i.e. dummy encoding.
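To make the parameters concrete, here is a small sketch (the `temp` column and its values are invented for this example):

```python
import numpy as np
import pandas as pd

# hypothetical toy data with a missing value
toy = pd.DataFrame({'temp': ['low', 'high', np.nan, 'medium']})

# prefix given as a dict, dummy_na=True adds a separate NaN column,
# columns restricts the encoding to the listed columns
pd.get_dummies(toy, prefix={'temp': 'temp'}, dummy_na=True, columns=['temp'])
```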
Let’s see this in action.
Analyzing diamond data
The Diamonds dataset (available here) contains the prices and other attributes of almost 54,000 diamonds. Let’s explore it.
```python
import pandas as pd

pd.set_option('display.max_columns', None)

df = pd.read_csv('data/diamonds.csv')
df.head()
```
Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
1 | 2 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
2 | 3 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
3 | 4 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
4 | 5 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
```python
# drop the redundant index column
df.drop(columns=df.columns[0], inplace=True)
df.head()
```
carat | cut | color | clarity | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
```python
df.shape
```
(53940, 10)
The dataset now consists of 53,940 rows and 10 columns.
```python
# check for missing values
df.isnull().sum().sum()
```
0
There are no missing values in the dataset.
Let’s check how many categorical variables are present in the dataset.
```python
df.info()
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 carat 53940 non-null float64
1 cut 53940 non-null object
2 color 53940 non-null object
3 clarity 53940 non-null object
4 depth 53940 non-null float64
5 table 53940 non-null float64
6 price 53940 non-null int64
7 x 53940 non-null float64
8 y 53940 non-null float64
9 z 53940 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB
There are three categorical variables in the dataset: `cut`, `color`, and `clarity`.
Let’s check their unique categories.
```python
df['cut'].unique()
```
array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)
```python
df['color'].unique()
```
array(['E', 'I', 'J', 'H', 'F', 'G', 'D'], dtype=object)
```python
df['clarity'].unique()
```
array(['SI2', 'SI1', 'VS1', 'VS2', 'VVS2', 'VVS1', 'I1', 'IF'],
dtype=object)
Implementing one-hot encoding with Pandas
Let’s apply one-hot encoding to the color variable.
```python
one_hot = pd.get_dummies(df['color'], prefix='color', drop_first=False)
one_hot.head()
```
color_D | color_E | color_F | color_G | color_H | color_I | color_J | |
---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
```python
one_hot.shape
```
(53940, 7)
This returns a pandas DataFrame of encoded data. The text specified in the `prefix` parameter is prepended to the category names of the `color` variable to form the new column names. Now let's add this one-hot encoding back to the main dataset.
```python
df_one_hot = pd.get_dummies(df, prefix='color', columns=['color'], drop_first=False)
df_one_hot.head()
```
carat | cut | clarity | depth | table | price | x | y | z | color_D | color_E | color_F | color_G | color_H | color_I | color_J | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | Ideal | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 0.21 | Premium | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 0.23 | Good | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 0.29 | Premium | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 0.31 | Good | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
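As an aside, the same result could be obtained by manually joining the standalone `one_hot` frame from the previous step back onto the dataset; a sketch:

```python
# drop the original column and concatenate the dummy columns
df_one_hot_alt = pd.concat([df.drop(columns=['color']), one_hot], axis=1)
```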
Let’s apply one-hot encoding to all the categorical variables in the dataset.
```python
df_one_hot = pd.get_dummies(df,
                            prefix={'color': 'color', 'cut': 'cut', 'clarity': 'clarity'},
                            drop_first=False)
df_one_hot.head()
```
carat | depth | table | price | x | y | z | cut_Fair | cut_Good | cut_Ideal | cut_Premium | cut_Very Good | color_D | color_E | color_F | color_G | color_H | color_I | color_J | clarity_I1 | clarity_IF | clarity_SI1 | clarity_SI2 | clarity_VS1 | clarity_VS2 | clarity_VVS1 | clarity_VVS2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | 0.21 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 0.23 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3 | 0.29 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0.31 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
```python
df_one_hot.shape
```
(53940, 27)
One-hot encoding has added 20 dummy variables to the dataset.
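The count checks out: `cut`, `color`, and `clarity` have 5, 7, and 8 unique categories respectively, and one-hot encoding creates one dummy variable per category:

```python
# 5 + 7 + 8 = 20 dummy variables in total
df[['cut', 'color', 'clarity']].nunique().sum()
```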
Implementing dummy encoding with Pandas
To implement dummy encoding, we can follow the same steps as above; the only difference is that the `drop_first` parameter must be set to `True`.
```python
df_dummy = pd.get_dummies(df,
                          prefix={'color': 'color', 'cut': 'cut', 'clarity': 'clarity'},
                          drop_first=True)
df_dummy.head()
```
carat | depth | table | price | x | y | z | cut_Good | cut_Ideal | cut_Premium | cut_Very Good | color_E | color_F | color_G | color_H | color_I | color_J | clarity_IF | clarity_SI1 | clarity_SI2 | clarity_VS1 | clarity_VS2 | clarity_VVS1 | clarity_VVS2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | 0.21 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 0.23 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3 | 0.29 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0.31 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
```python
df_dummy.shape
```
(53940, 24)
Dummy encoding has added 17 dummy variables to the dataset.
So both one-hot and dummy encoding expand the feature space (dimensionality) of your dataset.
Implementing encoding with Scikit-learn
Both one-hot and dummy encoding can be implemented in Scikit-learn with its `OneHotEncoder` class:
```python
ohe = OneHotEncoder(categories, drop, sparse_output)
encoded_data = ohe.fit_transform(original_data)
```

This returns the encoded data as a NumPy array (or a SciPy sparse matrix, depending on the `sparse_output` parameter).
- `categories`: The default is `'auto'`, which automatically determines the categories in each variable.
- `drop`: The default is `None`, which performs one-hot encoding. To perform dummy encoding, set this parameter to `'first'`, which drops the first category of each variable.
- `sparse_output`: Set this to `False` to return the output as a NumPy array. The default is `True`, which returns a sparse matrix. (In scikit-learn versions before 1.2, this parameter was called `sparse`.)
Implementing one-hot encoding with Scikit-Learn
```python
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(categories='auto', drop=None, sparse_output=False)

# only pass the categorical variables
df_ohe = pd.DataFrame(ohe.fit_transform(df[['cut', 'color', 'clarity']]))
df_ohe.head()
```
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
```python
df_ohe.shape
```
(53940, 20)
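One thing to note: the scikit-learn output has lost the original column names (the columns are just numbered 0 to 19). Assuming scikit-learn ≥ 1.0, the fitted encoder can supply descriptive names via `get_feature_names_out()`:

```python
# attach column names generated by the fitted encoder (e.g. 'cut_Fair', 'color_D')
df_ohe.columns = ohe.get_feature_names_out(['cut', 'color', 'clarity'])
```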
Implementing dummy encoding with Scikit-Learn
```python
dum = OneHotEncoder(categories='auto', drop='first', sparse_output=False)

# only pass the categorical variables
df_dum = pd.DataFrame(dum.fit_transform(df[['cut', 'color', 'clarity']]))
df_dum.shape
```
(53940, 17)
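Another practical point in scikit-learn's favor, not shown above: a fitted `OneHotEncoder` can be reused on new data, and `handle_unknown='ignore'` makes it robust to categories it never saw during fitting. A minimal sketch, where `new_df` is a hypothetical DataFrame with the same three columns:

```python
# unseen categories are encoded as all-zero rows instead of raising an error
safe_ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
safe_ohe.fit(df[['cut', 'color', 'clarity']])
# encoded_new = safe_ohe.transform(new_df[['cut', 'color', 'clarity']])
```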
Conclusion
Advantages of dummy encoding over one-hot encoding
- Dummy encoding adds fewer dummy variables than one-hot encoding does.
- Dummy encoding removes the redundant category from each categorical variable, avoiding perfectly collinear dummy columns (the "dummy variable trap").
Advantages of Pandas `get_dummies()` over Scikit-learn `OneHotEncoder()`

- The `get_dummies()` function returns encoded data with meaningful variable names, and we can add a prefix to the dummy variables of each categorical variable.
- The `get_dummies()` function returns the entire dataset, keeping the numerical variables alongside the encoded ones.
When to use what?
Both one-hot and dummy encoding can be used for nominal and ordinal categorical variables. However, if you strictly want to keep the natural order of ordinal variables, you can use label encoding instead of the two types above.
One advantage of label encoding is that it does not expand the feature space at all: we simply replace category names with numbers, so no dummy variables are needed.
The major disadvantage of label encoding is that machine learning algorithms may assume relationships between the encoded categories that do not exist. For example, say we have a categorical variable Quality with three categories: "Fair", "Good", and "Premium". An algorithm may interpret Premium (2) as twice as good as Good (1), even though no such numerical relationship exists between the categories.
To avoid this, label encoding should only be applied to target (y) values, not to input (X) values.
In summary, label encoding is suitable for ordinal categorical variables, while one-hot encoding is preferred for nominal ones. Label encoding is simple and space-efficient but may introduce arbitrary numerical relationships. One-hot encoding gives an explicit, order-free representation that most algorithms handle well, but it can lead to high dimensionality and sparse data. The choice depends on the characteristics of the data and the requirements of the machine learning problem at hand.
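If you do want to preserve a known natural order in an ordinal variable, an explicit mapping is often safer than `LabelEncoder`, which assigns codes in alphabetical order. A sketch, assuming the commonly used cut-quality ordering Fair < Good < Very Good < Premium < Ideal:

```python
# explicit ordinal mapping from worst to best cut quality (assumed ordering)
cut_order = {'Fair': 0, 'Good': 1, 'Very Good': 2, 'Premium': 3, 'Ideal': 4}
cut_ordinal = df['cut'].map(cut_order)  # kept as a standalone Series here
```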
Label encoding can be applied with Scikit-learn's `LabelEncoder` class. Let's apply it to the `cut` variable in our diamonds dataset. This is for illustration purposes only, since we do not use label encoding to encode input (X) values.
```python
from sklearn.preprocessing import LabelEncoder

df['cut_enc'] = LabelEncoder().fit_transform(df['cut'])
df.head(10)
```
carat | cut | color | clarity | depth | table | price | x | y | z | cut_enc | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 2 |
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 3 |
2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 1 |
3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 | 3 |
4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 1 |
5 | 0.24 | Very Good | J | VVS2 | 62.8 | 57.0 | 336 | 3.94 | 3.96 | 2.48 | 4 |
6 | 0.24 | Very Good | I | VVS1 | 62.3 | 57.0 | 336 | 3.95 | 3.98 | 2.47 | 4 |
7 | 0.26 | Very Good | H | SI1 | 61.9 | 55.0 | 337 | 4.07 | 4.11 | 2.53 | 4 |
8 | 0.22 | Fair | E | VS2 | 65.1 | 61.0 | 337 | 3.87 | 3.78 | 2.49 | 0 |
9 | 0.23 | Very Good | H | VS1 | 59.4 | 61.0 | 338 | 4.00 | 4.05 | 2.39 | 4 |
Since the new encoded column `cut_enc` has been added, we can now remove the original `cut` column.
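For completeness, a sketch of that final cleanup step:

```python
# drop the original string column now that cut_enc holds the encoded values
df.drop(columns=['cut'], inplace=True)
```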