After discussing Data Acquisition in my previous post, here I'm going to describe the most important part of any Machine Learning pipeline: Feature Engineering.
- Problem Definition
- Data Acquisition
- Feature Engineering
- Data Exploration
- Feature Extraction
- Missing Values Imputation
- Feature Encoding
- Feature Selection – Dimensionality Reduction
- Predictive Modelling
- Model Deployment
In my opinion, Feature Engineering is the most neglected ML component in the literature: it's rarely taught in school, it is difficult – if not impossible – to fully automate, and it is the most creative, difficult and time-consuming task of any data scientist. As it's often said on Kaggle, everyone uses the same algorithms; the biggest gains come from focusing on feature engineering. But what exactly is it?
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive (Wikipedia).
In this post I’m going to give a short description of the first part of feature engineering.
Working with data, especially data collected by someone else, is always hard since you don't know how it was captured. The first action every data scientist should take before running all the cool algorithms is to check what data is available.
When I started building the F1 race prediction model I wondered: what data exactly do I have? Do I have that data for all the previous seasons? Is it up to date? Are there any missing values, inconsistencies, or outliers?
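Questions like these can be answered with a quick audit in pandas. Here's a minimal sketch of the idea; the table and column names (`year`, `date`, `position`) are assumptions, and I use tiny inline frames in place of the real tables:

```python
import pandas as pd

# Toy stand-ins for the real races/results tables (in practice these
# would be loaded from the data source; column names are assumptions).
races = pd.DataFrame({
    "year": [2009, 2010, 2011],
    "date": ["2009-03-29", "2010-03-14", "2011-03-27"],
})
results = pd.DataFrame({
    "raceId": [1, 2, 3],
    "position": [1.0, None, 3.0],   # one missing value, to be caught below
})

# 1) Which seasons are covered?
seasons = (races["year"].min(), races["year"].max())

# 2) Is the data up to date? Look at the latest recorded race date.
latest = pd.to_datetime(races["date"]).max()

# 3) How many missing values does each column have?
missing = results.isnull().sum()

print(seasons, latest.date(), dict(missing))
```

A few lines like these, run against every table, give a first picture of coverage and quality before any modelling starts.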
Fortunately, the answer to the second and third questions is almost a yes, and I'll explain this 'almost' a bit later. The data are organized in 13 tables with a star-like schema. The main table is the results table, which holds all information regarding the outcome of each race (i.e. what position each driver started from, where he finished, what car he was driving, whether he retired or not, etc.).
There are also the drivers and constructors tables, including information such as driver name, nationality, birth date and a link to the respective Wikipedia article. Other tables show the drivers' and constructors' standings after each race. The qualifying table shows the exact lap times for each qualifying session, while the lapTimes and pitStops tables provide, as expected, each driver's race lap times and pit stop laps.
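With a star-like schema, assembling a modelling dataset mostly means joining the dimension tables onto the central results table. A sketch with toy frames (the column names `driverId`, `grid`, `position` are assumptions about the schema):

```python
import pandas as pd

# Toy versions of the central results table and the drivers dimension table.
results = pd.DataFrame({
    "raceId": [1, 1],
    "driverId": [10, 20],
    "grid": [1, 2],        # starting position
    "position": [1, 2],    # finishing position
})
drivers = pd.DataFrame({
    "driverId": [10, 20],
    "surname": ["Hamilton", "Alonso"],
    "nationality": ["British", "Spanish"],
})

# A left join keeps every result row even when a driver record is missing,
# which makes gaps in the dimension tables easy to spot afterwards.
merged = results.merge(drivers, on="driverId", how="left")
print(merged[["surname", "grid", "position"]])
```

The same pattern extends to the constructors, qualifying and standings tables, each joined on its own key.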
Having identified what data are available, I scanned all tables trying to spot anything that might cause problems downstream. As always, there were some issues (I'd have been really surprised if there weren't). Below are just a few examples, which also explain the 'almost' I mentioned before:
- Race lap times are only available since 2011. Lap times sound like very useful data to create features from, but would I be able to use them given the limited history?
- The qualifying session format has changed a lot during the years. There’s a nice write-up of the various formats in this article. This makes the ‘featurization’ of the qualifying sessions quite difficult.
- The ‘Spygate’ scandal in 2007 led to McLaren being excluded from the championship (and paying a 100 million USD fine). As a result, McLaren took the last position in the constructors' standings even though its car was actually really fast – if not the fastest. So, if I'm going to use the constructors' standings as features, I have to account for this ‘artificial’ demotion.
- Qualifying lap times have different formats (e.g. a lap could be shown as 1:36:129, 1.36.129, 1:36.129 etc.) or contain errors (e.g. ‘1:36:129*’) so parsing them needs custom handling for each case.
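The lap-time formats in the last bullet can actually be unified with a single parser rather than one handler per case: strip anything that isn't a digit or separator, then split on either `:` or `.`. A sketch of that idea (the function name and the millisecond convention are my own assumptions):

```python
import re

def parse_lap_time(raw):
    """Parse a qualifying lap time such as '1:36.129', '1:36:129' or
    '1.36.129' into total milliseconds, tolerating stray characters
    like a trailing '*'. Returns None when the value is unparsable."""
    if not isinstance(raw, str):
        return None
    cleaned = re.sub(r"[^0-9:.]", "", raw)   # drop '*' and other junk
    parts = re.split(r"[:.]", cleaned)       # treat ':' and '.' alike
    if len(parts) != 3:
        return None
    minutes, seconds, millis = (int(p) for p in parts)
    return (minutes * 60 + seconds) * 1000 + millis

print(parse_lap_time("1:36.129"))   # 96129
print(parse_lap_time("1:36:129*"))  # 96129
```

Normalising everything to milliseconds up front also makes later feature work (gaps to pole, session deltas) a matter of plain arithmetic.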
After identifying such issues, I jotted down ideas for the next step, feature extraction, which I'm going to explain in a future post.