After creating all the features, it quickly becomes evident that there is a lot of missing data. Missing data should be dealt with before moving on to the next phases of model building. In this post, I quickly describe the reasons behind the missing values and the ways to treat them.
- Problem Definition
- Data Acquisition
- Feature Engineering
- Data Exploration
- Feature Extraction
- Missing Values Imputation
- Feature Encoding
- Feature Selection – Dimensionality Reduction
- Predictive Modelling
- Model Deployment
During the previous feature extraction phase, two cases lead to missing values: in the first, the data are genuinely unavailable; in the second, the data are only ‘pseudo-missing’.
There are many examples of the first type of missing values. For instance, we have a feature called ‘previous race final position’. What if the driver did not participate in the previous race? Another case regards the feature ‘percentage difference in qualifying time from pole position’. What about the years where the qualifying format was different and there were no data for all three qualifying sessions?
The second type of missing values is artificial and arises from the merging of different datasets. For example, when we left join the main dataset (from the results table) with another dataset showing the races won in a season, we get a lot of NaNs, since the second extract does not include all the rows of the first one (most drivers never win a race in a season). In such cases, the ‘pseudo-missing’ data are simply filled in with zeros.
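As a quick sketch of this case, here is a left join between a toy results table and a per-season wins table; the driver codes and column names are made up for illustration, not taken from the actual dataset:

```python
import pandas as pd

# Hypothetical minimal versions of the two extracts: the main results table
# and a per-season wins table (only drivers who actually won appear in it).
results = pd.DataFrame({
    "driver": ["HAM", "VER", "LAT", "MSC"],
    "season": [2021, 2021, 2021, 2021],
})
wins = pd.DataFrame({
    "driver": ["HAM", "VER"],
    "season": [2021, 2021],
    "races_won": [8, 10],
})

# The left join keeps every row of the main dataset; drivers absent from
# the wins extract get NaN in 'races_won'.
merged = results.merge(wins, on=["driver", "season"], how="left")

# These NaNs are 'pseudo-missing': the true value is simply zero wins.
merged["races_won"] = merged["races_won"].fillna(0).astype(int)
```

The key point is that these NaNs are an artifact of the join, so a constant fill of zero is the correct treatment rather than an approximation.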
Referring to the cases when the data are indeed not available, here are the most common ways of imputing missing data:
Complete-case analysis, aka listwise deletion, deletes any cases (i.e. rows) with missing data so that only complete cases are left. Unless the dataset is very large, this is not good practice, since the remaining dataset may be too small to train any useful model. It can also introduce bias if the data are not missing at random, i.e. if observations with missing values differ systematically from complete cases.
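In pandas, listwise deletion is a one-liner; the feature names below are illustrative:

```python
import pandas as pd

# Toy dataset with two incomplete rows (feature names are hypothetical).
df = pd.DataFrame({
    "grid_position": [1, 2, None, 4],
    "quali_gap_pct": [0.0, 0.3, 0.5, None],
    "final_position": [1, 3, 2, 4],
})

# Keep only complete cases: any row with at least one NaN is dropped.
complete = df.dropna()
```

Here half the rows vanish, which is exactly the shrinkage problem described above.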
Mean/mode imputation, also known as marginal mean imputation, consists of computing the mean (or, for categorical variables, the mode) of feature X over the non-missing values and using it to impute the missing values of X. However, this distorts the distribution of the variable and weakens the relationships between variables, since correlations are pulled towards zero.
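A minimal sketch of mean and mode imputation, using made-up feature names:

```python
import pandas as pd

# Toy dataset: one continuous and one categorical feature, each with a gap.
df = pd.DataFrame({
    "quali_gap_pct": [0.0, 0.4, None, 0.8],
    "tyre": ["soft", None, "soft", "medium"],
})

# Continuous feature: fill with the mean of the observed values.
df["quali_gap_pct"] = df["quali_gap_pct"].fillna(df["quali_gap_pct"].mean())

# Categorical feature: fill with the mode (most frequent category).
df["tyre"] = df["tyre"].fillna(df["tyre"].mode()[0])
```

Note that every imputed row gets the same value, which is why the variance of the feature shrinks and correlations are dragged towards zero.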
Indicator variables for missingness of predictors
For categorical predictors, a simple and often useful approach is to add an extra category to the variable indicating missingness. For continuous features, a popular approach is to include an extra indicator identifying which observations have missing data on that variable. The missing values in the partially observed predictor can then be replaced with zero, with a very large (negative or positive) value, or with the median/mean. From personal experience, this method is by far the most practical and most commonly used one in applied machine learning.
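For a continuous feature, the indicator approach can be sketched like this (the feature name is hypothetical, and the median is used here as the constant fill):

```python
import pandas as pd

# Toy continuous feature with missing entries.
df = pd.DataFrame({"prev_race_position": [3.0, None, 1.0, None]})

# Binary flag recording which rows were originally missing.
df["prev_race_position_missing"] = df["prev_race_position"].isna().astype(int)

# Fill the gaps with a constant; the median of the observed values is one
# common choice (zero or a large sentinel value are alternatives).
median = df["prev_race_position"].median()
df["prev_race_position"] = df["prev_race_position"].fillna(median)
```

The model then sees both the filled value and the flag, so it can learn a separate effect for "this value was missing".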
Model-based imputation is a more advanced but more computationally demanding method. The idea is to train a predictive model on the cases without missing values, with the feature we want to impute as the target variable. We then use the trained model to predict the values for the cases where it is missing. The goal here is not causal inference but accurate prediction. This technique is also called conditional mean imputation. Ideally, it should be applied inside each cross-validation fold, so that no information from the validation set leaks into the imputation model.
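The sketch below fits a simple straight line as the predictive model, standing in for whatever model you would actually use; feature names are illustrative:

```python
import numpy as np
import pandas as pd

# Toy dataset: 'quali_gap_pct' has missing entries that we impute from
# 'grid_position' (both feature names are hypothetical).
df = pd.DataFrame({
    "grid_position": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "quali_gap_pct": [0.0, 0.2, np.nan, 0.6, np.nan, 1.0],
})

mask = df["quali_gap_pct"].isna()

# Fit the model on the complete cases only. A least-squares line stands in
# here for any predictive model (random forest, gradient boosting, ...).
coeffs = np.polyfit(df.loc[~mask, "grid_position"],
                    df.loc[~mask, "quali_gap_pct"], deg=1)

# Predict the missing entries with the trained model.
df.loc[mask, "quali_gap_pct"] = np.polyval(coeffs,
                                           df.loc[mask, "grid_position"])
```

Unlike marginal mean imputation, each imputed value depends on the other features of that row, which preserves relationships between variables much better.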
Multiple imputation differs from the previous methods in that the imputation process generates several data sets, each with different imputed values. There are three basic steps:
1. Impute the missing entries of the incomplete data set m times. This step results in m complete data sets. The missing values should be imputed using an appropriate model that incorporates random variation.
2. Analyze each of the m completed data sets. This step results in m analyses.
3. Integrate the m analysis results into a final result.
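The three steps can be sketched with a deliberately crude imputation model (random draws around the observed mean); a real application would use a proper model such as chained equations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature with missing entries (NaN).
x = np.array([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])
observed = x[~np.isnan(x)]
m = 5  # number of imputed data sets

estimates = []
for _ in range(m):
    xi = x.copy()
    # Step 1: impute with random variation -- here, draw each missing value
    # from a normal centred on the observed mean (a crude stand-in model).
    xi[np.isnan(xi)] = rng.normal(observed.mean(), observed.std(ddof=1),
                                  size=int(np.isnan(xi).sum()))
    # Step 2: analyse each completed data set (here, estimate the mean).
    estimates.append(xi.mean())

# Step 3: pool the m analyses into a single final result.
pooled = float(np.mean(estimates))
```

The random variation in step 1 is the crucial part: because each of the m data sets is imputed differently, the spread of the m analyses reflects the uncertainty introduced by the missing data.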
However, I have never heard of someone using this method in practice for applied ML problems.
Treating missing values is a well-researched area in the literature. The above is just a quick introduction to the topic, and there are many more imputation methods than the ones covered here. Here is a pointer for those who want to delve a little deeper into the subject:
Data Analysis Using Regression and Multilevel/Hierarchical Models – Chapter 25
Do you have any experience using some other method in a practical ML application?