In this post I’m going to talk about one of the most creative aspects of data science: feature encoding. If you haven’t followed my previous posts on feature engineering, you can find them at this link before reading this one. As a reminder, the full pipeline is:
- Problem Definition
- Data Acquisition
- Feature Engineering
- Data Exploration
- Feature Extraction
- Missing Values Imputation
- Feature Encoding
- Feature Selection – Dimensionality Reduction
- Predictive Modelling
- Model Deployment
I wasn’t able to find any decent definition of feature encoding, so I will try to write one myself. Well… feature encoding is the collection of methods that transform the original features into new ones so that, firstly, they can be fed to current ML algorithm implementations and, secondly, they carry more information than the original ones.
OK, let’s explain the first point. A feature I use in the F1 prediction model is the driver. This is a categorical feature that takes several discrete values (e.g. ‘Alonso’, ‘Hamilton’, etc.). Many current ML algorithms do not directly accept a categorical feature in the form of strings; neural networks are a major example, and this format would only be accepted by some tree-based implementations. The way to transform such a feature into something that can, indeed, be fed to an ML algorithm is through feature encoding.
Let’s discuss categorical variables first, which almost always need some treatment (as discussed above).
Label encoding

This is a very simple method and is useful only for non-linear, tree-based algorithms. The idea is to assign a unique numerical value to each distinct categorical value. For instance, [‘Alonso’, ‘Hamilton’, ‘Vettel’] would become [0, 1, 2]. Caution here: this won’t work with linear methods, since the model will interpret that Vettel is further from Alonso than Hamilton is, which has no physical meaning. To be honest, I’d discourage using this even with tree-based algorithms. On the plus side, with this method the dimensionality of the dataset remains the same.
Count encoding

This is another simple but sometimes useful method. The concept is to replace each categorical value with its count in the training set (computing the counts on the full dataset would make overfitting almost guaranteed)! This method can be used with both linear and non-linear models and can handle unseen values by replacing them with ‘1’. As an example, [‘Alonso’, ‘Hamilton’, ‘Vettel’, ‘Alonso’] would become [2, 1, 1, 2]. As you can see, this may give the same encoding (here ‘1’) to different feature values (Hamilton and Vettel).
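Here is a short sketch of count encoding, including the unseen-value fallback described above (the ‘Verstappen’ value is my own hypothetical unseen example):

```python
from collections import Counter

# Counts are computed on the training set only.
train = ['Alonso', 'Hamilton', 'Vettel', 'Alonso']
counts = Counter(train)

def encode(values):
    # Unseen values fall back to a count of 1.
    return [counts.get(v, 1) for v in values]

print(encode(train))           # [2, 1, 1, 2]
print(encode(['Verstappen']))  # unseen value -> [1]
```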
LabelCount encoding

A method that combines the previous two ideas is LabelCount encoding, where the categorical values are ranked by their count in the training set. Unlike count encoding, this method won’t give the same encoding (here ‘1’) to different values; ties are broken arbitrarily. In this case, [‘Alonso’, ‘Hamilton’, ‘Vettel’, ‘Alonso’] could become [3, 2, 1, 3].
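A sketch of LabelCount encoding in plain Python. Since ties can be broken arbitrarily, I break them here by name just to keep the output deterministic, so the tied values may get different ranks than in the example above:

```python
from collections import Counter

train = ['Alonso', 'Hamilton', 'Vettel', 'Alonso']
counts = Counter(train)

# Rank categories by frequency: least frequent -> 1, most frequent -> K.
# The secondary sort key (the name itself) is an arbitrary tie-break.
ordered = sorted(counts, key=lambda v: (counts[v], v))
rank = {v: i + 1 for i, v in enumerate(ordered)}

encoded = [rank[v] for v in train]
print(encoded)  # [3, 1, 2, 3] with this particular tie-break
```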
One hot encoding (OHE)
OHE, aka one-of-K encoding, takes each distinct value of the original feature and turns it into a new column. The value for each of the new columns is either 1 or 0 indicating the presence or absence of this value in the specific sample (i.e. row). Using OHE, the feature [‘Alonso’, ‘Hamilton’, ‘Vettel’] would become [[1, 0, 0], [0, 1, 0], [0, 0, 1]] with the first column indicating the presence of Alonso, the second of Hamilton and so on.
As you can imagine, this process can create a very wide and sparse dataset; if a sparse format is utilized, though, this can still be a memory-friendly option. One drawback of OHE is exactly this dimensionality explosion for high-cardinality features, which can lead to difficulties in the subsequent modelling phase; another is that most implementations cannot handle values that are only seen in the test dataset and not during training.
Hash encoding

This is also known as feature hashing or the hashing trick. In essence, it is just OHE where the output array has a fixed length. Instead of turning each unique value of a feature into a new column, the value is passed through a hash function and its hash value is used as an index into an array (remember here that a hash function is any function that maps data of arbitrary size to data of fixed size). In the previous example, let’s assume we use a hash function with output dimensionality equal to two. The feature [‘Alonso’, ‘Hamilton’, ‘Vettel’] could become [[1, 0], [0, 1], [1, 0]]. As you can observe, there was a collision, since Alonso and Vettel were mapped to the same bucket. This may degrade the results, but the method keeps the dimensionality of the resulting data fixed and can handle new values that appear in the test dataset.
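A sketch of the hashing trick in plain Python. Python’s built-in `hash()` is randomized between processes, so I use a tiny deterministic string hash of my own purely for illustration (real implementations, such as scikit-learn’s `FeatureHasher`, use MurmurHash3); which values collide therefore depends on the hash function chosen:

```python
N_BUCKETS = 2  # fixed output dimensionality, regardless of vocabulary size

def bucket(value):
    # Toy deterministic string hash, for illustration only.
    h = 0
    for ch in value:
        h = (h * 31 + ord(ch)) % 2**32
    return h % N_BUCKETS

drivers = ['Alonso', 'Hamilton', 'Vettel']
encoded = [[1 if bucket(v) == i else 0 for i in range(N_BUCKETS)]
           for v in drivers]
print(encoded)  # each row has exactly one '1'; collisions are possible
```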
Target (or likelihood) encoding
One method that utilizes the actual target variable for encoding an independent categorical feature is target encoding. This method encodes each categorical value by the ratio of positive targets (in the case of binary classification) or the mean target value (in the case of regression). The encoding has to be constructed within a cross-validation loop in order to avoid overfitting. It is a bit tricky to implement but, from personal experience, I think it’s probably the most powerful encoding technique. A practical tip here: add some random noise to the encoded values to reduce overfitting, or calculate them using nested cross-validation!
Although this is not easy to illustrate with an example, I’ll do my best. So, the categorical feature is [‘Alonso’, ‘Hamilton’, ‘Alonso’, ‘Hamilton’] and the target value is [1, 2, 2, 5]. The encoding would produce [1.5, 3.5, 1.5, 3.5], as these are the mean target values of each driver. In fact, this is a form of model stacking, since the encoding is actually a single-variable model which ‘predicts’ the average target value.
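Here is a deliberately naive sketch of that example, computing in-sample category means. Note that this is exactly the leaky version warned against above: in a real pipeline the means should come from out-of-fold data (plus noise), not from the rows being encoded.

```python
from collections import defaultdict

drivers = ['Alonso', 'Hamilton', 'Alonso', 'Hamilton']
target = [1, 2, 2, 5]

# Accumulate per-category sums and counts, then take the mean.
sums, counts = defaultdict(float), defaultdict(int)
for d, y in zip(drivers, target):
    sums[d] += y
    counts[d] += 1
means = {d: sums[d] / counts[d] for d in sums}

encoded = [means[d] for d in drivers]
print(encoded)  # [1.5, 3.5, 1.5, 3.5]
```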
Before moving on, a few more tips on categorical features:
- NaN values should not be ignored; treat them as just another category level
- You may explore feature interactions, i.e. interactions between the categorical variables
- You can create new categorical features (e.g. country, city, street name, street number) from a single original feature (e.g. home address)
I hope you’re still following me. Now let’s discuss numerical features. I promise this part will be shorter.
Rounding

Rounding a numerical feature retains its most significant part while discarding the – sometimes unneeded – precision that can be just noise.
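A one-liner example with some made-up measurements of my own:

```python
# Keep one decimal place; the extra digits are likely just noise.
measurements = [23.4312, 23.4289, 57.9014]
rounded = [round(x, 1) for x in measurements]
print(rounded)  # [23.4, 23.4, 57.9]
```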
Binning

Break a numerical feature into a number of bins and encode it using the bin ID. The break-points used for binning can be set manually, by using quantiles, evenly spaced, or by using models to find the optimal bins. As an example, let’s assume that the original feature is [10, 1, 5, 6, 20] and the bin break-points are [2, 8]. The result would be [2, 0, 1, 1, 2].
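The example above can be reproduced with the standard library’s `bisect` (for DataFrames, `pandas.cut` does the same job):

```python
import bisect

feature = [10, 1, 5, 6, 20]
breakpoints = [2, 8]  # bins: x < 2 -> 0, 2 <= x < 8 -> 1, x >= 8 -> 2

encoded = [bisect.bisect(breakpoints, x) for x in feature]
print(encoded)  # [2, 0, 1, 1, 2]
```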
Scaling

This method scales a numerical variable by passing it through a function. Examples are MinMax scaling (which scales the variable into a pre-defined range), standard (z) scaling, square-root scaling, log scaling and the Anscombe transformation.
Wondering why you’d use something like this? Well, there are two purposes. Firstly, many algorithms, including neural networks and regularized linear regression, are sensitive to the scale of the input features and may converge slowly, or to a worse solution, if the inputs have not been scaled properly. Secondly, imagine trying to predict a target variable that is highly skewed. I bet your model will have difficulties doing it properly. If you root- or log-transform your target variable, its shape will be closer to a Gaussian distribution and I’m sure your regression model will make more accurate predictions (of course, don’t forget to apply the inverse transformation to the predicted values).
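A small sketch of two of the scalings mentioned above, on a made-up skewed variable (scikit-learn’s `StandardScaler` and `MinMaxScaler` are the usual production choices):

```python
import math

values = [1.0, 10.0, 100.0, 1000.0]  # a heavily right-skewed variable

# Standard (z) scaling: zero mean, unit variance.
mean = sum(values) / len(values)
std = (sum((x - mean) ** 2 for x in values) / len(values)) ** 0.5
z_scaled = [(x - mean) / std for x in values]

# Log scaling: log1p compresses the long right tail
# (and handles zeros gracefully).
log_scaled = [math.log1p(x) for x in values]
print(log_scaled)
```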
There are many more tricks of the trade that you can learn while practically playing around with a real task. However, I feel the methods above are more than enough to help you get started with this part of machine learning. If you are an ML practitioner, please do invest A LOT of time in feature encoding when solving an actual problem. The time spent here will surely pay off as increased performance in the subsequent modelling phase.
If you have any questions or ideas, I’ll be happy to help. If you just wonder how you’ll handle all this data we’ve just created, just subscribe to the newsletter so you don’t miss the next post on feature selection!