Building an F1 prediction engine – Feature Engineering Part III

After creating all the features, it quickly becomes evident that there is a lot of missing data. Missing data should be dealt with before moving on to the next phases of model building. In this post, I’m going to briefly describe the reasons behind the missing values and ways to treat them.

As always, here is an overview of the pipeline:

  1. Problem Definition
  2. Data Acquisition
  3. Feature Engineering
    • Data Exploration
    • Feature Extraction
    • Missing Values Imputation
    • Feature Encoding
    • Feature Selection – Dimensionality Reduction
  4. Predictive Modelling
  5. Model Deployment

The feature extraction phase described previously produces two kinds of missing values: the first covers the cases where the data are genuinely not available, while the second consists of ‘pseudo-missing’ data.

There are many examples of the first type of missing value. For instance, we have a feature called ‘previous race final position’: what if the driver did not participate in the previous race? Another case concerns the feature ‘percentage difference in qualifying time from pole position’: what about the years when the qualifying format was different and there were no data for all three qualifying sessions?

The second type of missing values is artificial and arises from the merging of different datasets. For example, when we left join the main dataset (from the results table) with another dataset containing the races won in a season, we get a lot of NaNs, since the second extract does not include all the rows of the first one (most drivers never win a race in a season). In such cases, the ‘pseudo-missing’ data are simply filled in with zeros.
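As a quick pandas sketch of this zero-fill step (table and column names here are illustrative, not the project’s actual ones):

```python
import pandas as pd

# Illustrative extracts: the main results table and a per-season wins table
results = pd.DataFrame({
    "driver": ["HAM", "VET", "RAI", "STR"],
    "season": [2017] * 4,
    "points": [363, 317, 205, 40],
})
wins = pd.DataFrame({
    "driver": ["HAM", "VET"],
    "season": [2017, 2017],
    "races_won": [9, 5],
})

# A left join keeps every row of the main dataset; drivers absent from
# the wins extract get NaN in 'races_won'
merged = results.merge(wins, on=["driver", "season"], how="left")

# These NaNs are 'pseudo-missing': the true value is simply zero
merged["races_won"] = merged["races_won"].fillna(0).astype(int)
```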

Referring to the cases when the data are indeed not available, here are the most common ways of imputing missing data:

Complete-case analysis
Complete-case analysis, also known as listwise deletion, removes any cases (i.e. rows) with missing data, so only complete cases are left. If the dataset is not very large, this is bad practice, since the remaining dataset will be too small to train any useful model. It can also lead to bias when the data are not missing at random, i.e. when observations with missing values differ systematically from complete cases.
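In pandas, listwise deletion is a one-liner (toy data with hypothetical feature names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "quali_gap_pct": [0.0, 1.2, np.nan, 0.8],
    "prev_race_pos": [1.0, np.nan, 3.0, 4.0],
})

# Keep only the rows with no missing values at all
complete = df.dropna()
```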

Mean/Mode imputation
This method consists of computing the mean (or mode, for categorical variables) of feature X using the non-missing values and using it to impute the missing values of X. However, this distorts the distribution of the variable and weakens relationships between variables, since correlations are pulled towards zero. This method is also known as marginal mean imputation.
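A minimal sketch with pandas (hypothetical feature names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "quali_gap_pct": [0.0, 1.2, np.nan, 0.8],           # continuous -> mean
    "tyre_compound": ["soft", None, "soft", "medium"],  # categorical -> mode
})

# Impute with the mean of the observed values (mode for categoricals)
df["quali_gap_pct"] = df["quali_gap_pct"].fillna(df["quali_gap_pct"].mean())
df["tyre_compound"] = df["tyre_compound"].fillna(df["tyre_compound"].mode()[0])
```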

Indicator variables for missingness of predictors
For categorical predictors, a simple and often useful approach is to add an extra category to the variable indicating missingness. For continuous features, a popular approach is to include an extra indicator identifying which observations of that variable have missing data. The missing values in the partially observed predictor can then be replaced with zero, with a very large (negative or positive) value, or with the median/mean. From personal experience, this method is by far the most practical and most commonly used one in applied machine learning.
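For a continuous feature, the indicator approach looks like this (the sentinel value and column names are my own choices for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"prev_race_pos": [1.0, np.nan, 3.0, np.nan]})

# Flag which observations were missing, then fill with a sentinel;
# the median/mean or a large constant would work the same way
df["prev_race_pos_missing"] = df["prev_race_pos"].isna().astype(int)
df["prev_race_pos"] = df["prev_race_pos"].fillna(-1)
```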

Model-based imputation
This is a more advanced method, but it is more computationally demanding. The idea is to train a predictive model on all cases without missing values, using the feature we want to impute as the target variable. Then, we use the trained model to predict the values for the cases where the feature is missing. The goal here is not causal inference but accurate prediction. This technique is also called conditional mean imputation. Ideally, this approach should be applied inside each cross-validation fold, so that no information from the validation set leaks into the imputation model.
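A sketch of the idea with scikit-learn, using synthetic data and hypothetical feature names (not the actual project features):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "grid_pos": rng.integers(1, 21, size=n).astype(float),
    "fp3_pos": rng.integers(1, 21, size=n).astype(float),
})
# Feature to impute, with some entries knocked out
df["quali_gap_pct"] = 0.5 * df["grid_pos"] + rng.normal(0, 1, size=n)
df.loc[df.sample(40, random_state=0).index, "quali_gap_pct"] = np.nan

observed = df[df["quali_gap_pct"].notna()]
missing = df[df["quali_gap_pct"].isna()]

# Train on the complete cases, then predict for the incomplete ones
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(observed[["grid_pos", "fp3_pos"]], observed["quali_gap_pct"])
df.loc[missing.index, "quali_gap_pct"] = model.predict(
    missing[["grid_pos", "fp3_pos"]]
)
```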

Multiple imputation
This method differs from the previous ones since the imputation process generates several data sets, each with different imputed values. There are three basic steps:
1. Impute the missing entries of the incomplete dataset m times. This step results in m complete datasets. Missing values should be imputed using an appropriate model that incorporates random variation.
2. Analyze each of the m completed data sets. This step results in m analyses.
3. Integrate the m analysis results into a final result.
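Step 1 can be sketched with scikit-learn’s IterativeImputer; setting sample_posterior=True injects the random variation mentioned above, so each of the m completed datasets differs slightly (the data here are synthetic):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan  # knock out ~10% of the entries

m = 5
completed = []
for i in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    completed.append(imputer.fit_transform(X))

# Steps 2 and 3 would then analyse each of the m datasets and pool the results
```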
However, I have never heard of someone using this method in practice for applied ML problems.

Treating missing values is a well-researched area in the literature. The above is just a quick introduction to the topic, and there are many more imputation methods; much more information is available online. Here is a link for those who want to delve a bit deeper into the subject:
Data Analysis Using Regression and Multilevel/Hierarchical Models – Chapter 25

Do you have any experience using some other method in a practical ML application?

21 thoughts on “Building an F1 prediction engine – Feature Engineering Part III”

  1. Very interesting, and at least now it doesn’t seem completely incomprehensible to me. I have some thoughts about it and will write them soon here in the comments.

  2. For instance, we have a feature called ‘previous race final position’. What if the driver did not participate in the previous race?

    I’m sorry in advance, because I’m still a total noob here. But I think we should just eliminate results like this. For example, Kimi and the last race: should this result affect his average result? I think not. The same with Max: should his DNFs affect his overall performance and ability to take high places? I think not. The other question – I would add a new variable called, let’s say, “stability”, defined as finishes/starts, which would show the probability of the driver finishing. For example, Max can win, but his low stability (6 DNFs in 16 races, if I’m not mistaken) decreases his probability of winning. Again, sorry for the explanation, I wanted to give a short answer. If you have any questions, feel free to ask. I’d discuss it with pleasure.

    1. You’re right. You have to decide what to do with such missing values. In this case, I would add a new column as a missing value indicator and use something like ‘-1’ as the value itself.

      Here’s what I actually do: I remove from the dataset all rows where the respective driver has retired from the race. So, there’s no need to add any new column or to fill in these values. Therefore, future predictions are not affected by any recent DNFs.

      For the ‘stability’ feature, yep, it’s worth creating it and checking via cross-validation whether it improves performance or not.

      1. A few more thoughts about ‘stability’ – it can also rely on the knowledge of how many races the engine (and all those MGUs) has already done. This will be especially relevant now, at the end of the season. But there are some exceptions: as far as I know, Vettel had a new engine in the Malaysian GP, but it had defects. This variable will not affect the probable position in a race, but it can affect the prediction of the better driver of a pair in a race. For example, if we try to predict who will be better in a race, Verstappen or Ricciardo: if we know that Verstappen has relatively low stability, it can lower his probability of finishing higher than Ricciardo, because he is more likely to be a DNF. This situation happened in several GPs this year.

        1. I agree with you on that. However, I’m not sure where you can find these data. Have you found any reliable source?

  3. Yes, I usually find something in the news before a GP. For example, before Malaysia – gearboxes:

    Energy store on Vettel’s car

    And news about changes – a new MGU-H for Williams and Force India; Magnussen had a new MGU-H, MGU-K and accumulator; Grosjean had a new MGU-K, and so on. I found it here:

    Sorry, it’s in Russian, but I think it’s not so hard to translate. Also, I’m pretty sure the same news exists on some English-language site about F1.

    1. OK. But in this format it’s a bit difficult to combine this information with the main data. And I guess it will require quite a bit of manual effort.

      1. Unfortunately, yes – I can’t find any API to get this information automatically. But it can definitely affect the final result, as we saw with Vettel’s qualifying, for example.
        Actually, the knowledge about his technical difficulties before qualifying was worth more than all the other statistical data when I tried to predict who would be better in qualifying – he or Kimi. So far I try to predict it manually, but I’m on my way to making a model for driver-pair prediction and trying to decide which data I need for it.
        And now I think the data I need are (by importance):
        1) Any difficulties of the driver: technical, crash, health, etc.
        2) Results of the 3rd free practice.
        3) Results of the 1st and 2nd practices.
        4) Overall qualifying score between these drivers (for example, Hamilton:Bottas is 10:5)

        Everything else has a small effect on the result, I think.

  4. I hope in a month I will have it; so far I’m watching basic courses about ML (like this one, Azure ML and R, and there are many new things for me so far).
    Why these technologies? Well, Azure ML is free and user-friendly, and R because many examples are written in it and it was designed exactly for this. If you have any suggestions, propositions, etc. about technologies, I will be happy to hear them.

    1. I’ve never used Azure (although I’m planning to do so in the next few months) but I’m using R in my day-to-day job. However, the code for this blog is written in Python. I just find it easier for ML tasks due to the existence of scikit-learn.

      Can you reveal what exactly you’ll try to do? Is it something similar to f1-predictor or not?

  5. I want to try to just predict which of the two drivers of one team will be faster in qualifying, because I think it is relatively easy to predict in comparison to predicting the exact place in qualifying or even in the race.

    1. Great. From my personal experience so far, qualifying results are much harder to predict than race results.

  6. Maybe, if trying to predict the exact place, because it does not depend on previous results as strongly as race results depend on qualifying results. But if we just try to predict “will driver 1 be faster than driver 2?” (a simple yes/no decision), I think it is simpler and doesn’t have many factors that affect the result.

    1. This approach (i.e. using driver-per-driver comparisons to predict the final race result) is called the pairwise approach and this is what I’m actually doing 🙂
      I’m writing a blog post on the possible approaches to this prediction problem and I’ll share it as soon as it’s ready.

  7. I just made datasets of all the practices this year, and I’m trying to arrange the data by reading this book (and Google) in Azure ML Studio. Slowly but surely it’s moving forward. I hope next week I will post my first steps on my blog.
