Up till now we have discussed how to create features and how to encode them. Hopefully, we’ve ended up with lots of variables and are now looking for a way to keep only the top-performing ones. In this post I’ll explain why and how we should select our features wisely.
- Problem Definition
- Data Acquisition
- Feature Engineering
- Data Exploration
- Feature Extraction
- Missing Values Imputation
- Feature Encoding
- Feature Selection – Dimensionality Reduction
- Predictive Modelling
- Model Deployment
Actually, I lied above: feature selection may also be part of the predictive modelling step, since some modelling algorithms perform variable selection within their optimization process. I’ll get back to this later on.
The main idea behind feature selection is that the data contains many features that are either redundant or irrelevant, and can thus be removed without much loss of information. In most cases, feature selection also improves model performance, because irrelevant or redundant (i.e. correlated) features hurt many models. Although feature selection should be distinguished from feature extraction, both methods are presented together here since they have the same effect on the data: they decrease the dimensionality of the dataset. In short, reducing dimensionality:
- simplifies the models, making them easier to interpret
- shortens training times
- boosts generalization performance by reducing overfitting
There are three families of feature selection methods, distinguished by how they combine the feature selection algorithm with the model building itself.
Filter methods select variables independently of the predictive modelling that will follow. These methods compute a statistic for each feature (e.g. the correlation of the feature with the target variable) and keep the N features with the highest scores. However, filter methods tend to select redundant variables because they do not consider the relationships between features. Example statistics include chi-square and mutual information.
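As a minimal sketch of a filter method, here is how you could score each feature against the target with mutual information and keep the top 5, using scikit-learn on a synthetic dataset (dataset and the choice of k=5 are illustrative assumptions, not from the post):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data: 20 features, only 5 of which are informative.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Score each feature independently against the target, keep the top 5.
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (200, 5)
```

Note that each feature is scored on its own, which is exactly why two highly correlated features can both end up in the selected set.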
Wrapper methods evaluate several subsets of features and judge each subset by the resulting predictive performance. In other words, a predictive model is built on various subsets of features; the subset yielding the best performance is kept. This process, however, is often impractical: first, it usually takes too much time to build and evaluate a model on every candidate subset, and second, it may lead to overfitting, especially when the number of observations is small.
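A common wrapper method is recursive feature elimination: repeatedly fit a model and drop the weakest feature until the desired number remains. A hedged sketch with scikit-learn (the logistic regression estimator and target of 5 features are my own illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Recursively fit the model and eliminate the lowest-weighted feature
# until only 5 features are left.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_.sum())  # 5 features kept
```

Even this greedy variant fits the model many times, which illustrates the computational cost mentioned above; exhaustively evaluating all subsets would be far worse.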
Embedded methods perform feature selection within the predictive modelling phase itself. This is what I meant when I said I was lying above. Some algorithms, during their internal optimization, shrink the effect of certain variables or ignore them entirely, so the result of the feature selection is assessed through the model’s predictive performance. Such algorithms perform feature selection and classification or regression simultaneously. Examples are all tree-based models (i.e. decision trees, random forests, GBMs) and regularized regression models (e.g. lasso regression).
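Lasso regression is the textbook embedded example: its L1 penalty drives the coefficients of uninformative features to exactly zero as part of fitting, so the surviving features are "selected" for free. A small sketch (synthetic data and alpha=1.0 are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 20 features, only 5 carry signal.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

# The L1 penalty zeroes out coefficients of weak features during training.
lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)  # indices of features with nonzero weight
print(len(kept))
```

The strength of the penalty (alpha) controls how aggressively features are dropped, so in practice it is tuned by cross-validation (e.g. with LassoCV).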
Feature extraction methods are another way to reduce the dimensionality of a dataset. Instead of keeping a subset of the variables, these techniques project all of them onto a lower-dimensional space, i.e. they combine the original features and transform them into a new, much smaller set. Although such techniques make it much harder (if not impossible) to interpret the resulting models, they are great for visualizing the dataset (see examples in my previous post) and may also lead to increased predictive performance.
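PCA is the most common feature extraction technique: it replaces the original features with a few linear combinations of them. A minimal sketch on the iris dataset (my choice of dataset and of 2 components, purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 150 samples, 4 features

# Project all 4 original features onto the 2 directions of highest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)  # (150, 2)
```

The two new columns are mixtures of all four original measurements, which is exactly why the transformed features are hard to interpret but convenient for 2-D visualization.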
You can find more details about feature selection and dimensionality reduction algorithms in my thesis project, which I completed a few years ago. I’ve also run some comparison experiments, and the code can be found on my github page.
Based on my practical experience, both filter and embedded methods can prove useful in most applied ML problems. I cannot stress enough how important this is, especially when the number of observations is low and the number of features is high. F1-predictor’s model mainly uses embedded methods for feature selection, but I also used some wrapper methods to further optimize the final feature set. As always, there is no free lunch in feature selection either: in each case, try various techniques (or combinations thereof), assess their performance, and see what works best for you!