Your One-Stop Guide to Handling Categorical Features!
Every real-world dataset comes with its own unique blend of features. When working with real-world data, you will often have to deal with categorical data. Categorical features are those whose values contain a limited number of categories. Take the example of a gender feature that can have two categories: male and female.
Why do we have to handle categorical features?
The reason is that most Machine Learning algorithms accept only numerical values as input. So, we have to convert these categories into a format accepted by the learning algorithms.
In this article, I want to talk about five techniques to encode or convert categorical features into numbers, which are:
- Mapping Method
- Ordinal Encoding
- Label Encoding
- Pandas Dummies Method
- OneHot Encoding
Note that some of these encoding techniques produce the same output; the only difference is the implementation. The first three produce numerical outputs, while the last two produce one-hot matrices (of 1s and 0s).
Before we go through each technique, let’s load the data that we will work with. We are going to use the Titanic dataset, a classic in Machine Learning that you have probably seen before.
I find it much easier to get it from Seaborn’s built-in datasets (Seaborn is a Python visualization library). Let’s get it.
# Importing relevant libraries
import seaborn as sns
import pandas as pd

titanic = sns.load_dataset('titanic')
titanic.head()
Let’s take a quick glance through the dataset; it’s always good practice.
# Checking missing values
titanic.isnull().sum()

# Checking data info
titanic.info()
If you look at the Dtype column, you will see the data types of all features. Features like class and deck have the category type, but there are other categorical features that have the object type.
Let’s also inspect some categorical features. We can use titanic['feature_name'].value_counts(). To make that quick, let’s display some of them together.
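Here is a minimal sketch of that inspection (the four features shown are just my pick for illustration):

# Inspecting a few categorical features at once
for feature in ['sex', 'class', 'embark_town', 'who']:
    print(titanic[feature].value_counts(), '\n')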
Every time I work with data, I like to inspect some features to get an idea of what I have in my hands; it has remarkably helped my understanding later on. Now that we have an idea of what our data look like, let’s come back to our initial goal: how do we handle categorical features?
1. Mapping Method
The mapping method is a straightforward way to encode categorical features with few categories. Let’s apply it to the class feature, which has three categories: First, Second, and Third. We create a dictionary whose keys are the categories and whose values are the numbers to encode them into, and then map it onto the data frame.
Here is how it is done:
map_dict = {
    'First': 0,
    'Second': 1,
    'Third': 2
}

titanic['class'] = titanic['class'].map(map_dict)
If you look at the resulting data frame, the class feature no longer contains First, Second, or Third. Where we had First, it is now 0, and the same goes for the other categories.
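As a quick sanity check (not a required step), counting the values again should now show the numeric codes:

# The class feature should now contain 0, 1, and 2
titanic['class'].value_counts()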
2. Ordinal Encoding
Ordinal encoding uses sklearn’s OrdinalEncoder preprocessing class to convert text features into numbers. Its output is a NumPy array, but we can convert it back into a Pandas data frame.
from sklearn.preprocessing import OrdinalEncoder

cats_feats = titanic[['alive', 'alone']]
encoder = OrdinalEncoder()
cats_encoded = encoder.fit_transform(cats_feats)
The output cats_encoded is an array. Let's convert it back to the data frame.
titanic[['alive', 'alone']] = pd.DataFrame(cats_encoded, columns=cats_feats.columns, index=cats_feats.index)
Here is how the resulting data will look. Look closely at the concerned features: alive and alone.
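If you want to see which number was assigned to each category, the fitted encoder stores that in its categories_ attribute, where a category's position in its array is its encoded value:

# One array per encoded column; a category's position is its encoded value
print(encoder.categories_)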
3. Label Encoding
Label Encoding is meant for encoding categorical target features (per the sklearn documentation), but it can also be used to achieve our purpose of encoding other categorical features.
It also doesn’t support missing values. So, to keep things simple, let’s drop all missing values first.
titanic = sns.load_dataset('titanic')
titanic_cleaned = titanic.dropna()
Now we can encode the features. Let’s apply it to deck, which has 7 categories.
from sklearn.preprocessing import LabelEncoder

# LabelEncoder expects a 1D array, so we select the column with single brackets
deck_feat = titanic_cleaned['deck']
label_encoder = LabelEncoder()
deck_encoded = label_encoder.fit_transform(deck_feat)
Same as the ordinal encoder, the output of label_encoder is a NumPy array. Let's convert it back into the data frame.
titanic_cleaned['deck'] = pd.Series(deck_encoded, index=deck_feat.index)
Take a look at the resulting dataframe, specifically at the deck feature.
If you want to display the encoded classes, you can get them too.
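The fitted encoder exposes them through its classes_ attribute (a label's position is its assigned number), and inverse_transform reverses the encoding:

# Original categories, in encoding order (position = assigned number)
print(label_encoder.classes_)

# Reverse the encoding for a few values
print(label_encoder.inverse_transform([0, 1, 2]))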
4. Pandas Dummies Method
This is also a simple way to handle categorical features. It will create extra features based on the available categories. Let’s apply it to the who feature. And it takes only one line of code:
titanic[['man', 'woman']] = pd.get_dummies(titanic['who'], drop_first=True)
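A note on drop_first=True: who has three categories (child, man, woman), and dropping the first avoids a redundant column, since a passenger who is neither man nor woman must be a child. Without it, get_dummies creates one column per category:

# One dummy column per category when drop_first is not set
pd.get_dummies(titanic['who']).head()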
5. One Hot Encoding
One Hot Encoding is another commonly used technique for dealing with categorical features, and it is most effective for unordered categories. Here is what I mean by unordered categories: if you have 3 cities and encode them with the numbers (1, 2, 3) respectively, a machine learning model may learn that city 1 is closer to city 2 than to city 3. As that is a false assumption, the model will likely give incorrect predictions if the city feature plays an important role in the analysis. On the flip side, if you have a feature with ordered ranges like low, medium, and high, then numbers can be an effective encoding because you want to keep the sequence of these ranges.
Just like the dummies method we saw previously, One Hot Encoding will also create additional features corresponding to the values of the given categories, and the output is a one-hot matrix. A one-hot matrix is a binary representation in 1s and 0s: the active category of an instance is hot (1), while the others remain cold (0). Here is what a one-hot matrix looks like:
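For example, with three towns (an illustrative sketch, not output from our dataset):

town         Cherbourg  Queenstown  Southampton
Cherbourg            1           0            0
Queenstown           0           1            0
Southampton          0           0            1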
Now that we understand the idea behind one hot matrix, let’s implement it.
from sklearn.preprocessing import OneHotEncoder

one_hot = OneHotEncoder()
town_encoded = one_hot.fit_transform(titanic_cleaned[['embark_town']])

# The output of the one-hot encoder is a sparse matrix,
# so we need to convert it into a NumPy array.
town_encoded = town_encoded.toarray()

# categories_ is a list with one array per encoded column
columns = one_hot.categories_[0]
town_df = pd.DataFrame(town_encoded, columns=columns, index=titanic_cleaned.index)

drop_embark = titanic_cleaned.drop('embark_town', axis=1)
drop_embark[['Cherbourg', 'Queenstown', 'Southampton']] = town_df
Here is the resulting data frame.
Thank you for reading. Hopefully, this was helpful to you and you won’t ever have to struggle when dealing with categorical features! Would you like the full code? Here you go!
Every week, I write one article about Machine Learning. I am aiming to write more about the most pressing techniques and ideas. Connect with me on Twitter!