Building an F1 prediction engine – Feature Engineering Part II

Having explored the data, checked what is and isn’t available, and identified inconsistencies and potential problems, it is time to get creative. Feature extraction is the step where an experienced data scientist can really make the difference and improve the subsequent model’s accuracy, often by much more than is possible through algorithm selection and fine-tuning.

In this step, you should try to brainstorm as many features as you can, so it is worth spending as much time as possible on it. Do not censor any ideas here. Just note them down and implement them; the data will tell you whether they help or not.

This is where we are in our ML pipeline:

  1. Problem Definition
  2. Data Acquisition
  3. Feature Engineering
    • Data Exploration
    • Feature Extraction
    • Missing Values Imputation
    • Feature Encoding
    • Feature Selection – Dimensionality Reduction
  4. Predictive Modelling
  5. Model Deployment

To come up with features, it helped me to organise my thoughts around three feature groups: one referring to the drivers, one referring to the constructors, and one for features not directly related to the previous two. Below I show some features for each group. Note that the lists are indicative and not exhaustive.

Driver Features

  • Qualifying position
  • Driver name
  • Driver age at that time
  • Years in F1
  • Qualifying time as a percentage of the pole-position time (e.g. 102.3%)
  • Starts in front-row
  • Races won in career
  • Races won in season till that race
  • Races started
  • Races finished
  • Pole positions won
  • Drivers championships won
  • Driver championship classification last year
  • Drivers championship position this season
  • Max, min, avg positions gained/lost during last X races
  • Max, min, avg finishing position in the last X races
  • Correlation between qualifying and race results per driver
  • Previous race final position
  • Previous race qualifying position
  • Positions gained in previous race
  • Race and Qualifying position in same race last year
  • Positions gained in same race last year
  • Race time as a percentage of the winner’s time in the last race (e.g. 102.3%)
  • Number of pit-stops in same race last year
  • Avg lap-time excl. pit stops in last race
  • Avg lap-time consistency excl. pit stops in last race
  • Max/min/avg/std speed in previous race
  • Rank on avg/std of speed in previous race
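
As a rough illustration of how two of the driver features above could be computed, here is a minimal Python sketch. The function names, the input layout (plain chronological lists of grid and finishing positions) and the sample numbers are assumptions made for the example, not the code behind the actual model.

```python
from statistics import mean


def pct_of_pole(quali_time_s, pole_time_s):
    """Qualifying time as a percentage of the pole time (pole itself is 100.0)."""
    return quali_time_s / pole_time_s * 100.0


def positions_gained_stats(grid, finish, last_x):
    """Max, min and average positions gained over the last `last_x` races.

    `grid` and `finish` are chronological lists (oldest first) holding only
    races that took place BEFORE the race being predicted, so that no
    information from the future leaks into the feature.
    """
    if last_x < 1:
        raise ValueError("last_x must be at least 1")
    # Positive values mean the driver finished ahead of the grid slot.
    gained = [g - f for g, f in zip(grid, finish)][-last_x:]
    if not gained:
        return None  # e.g. a rookie: leave the feature missing
    return {"max": max(gained), "min": min(gained), "avg": mean(gained)}


# Made-up numbers, purely for illustration.
print(round(pct_of_pole(92.07, 90.0), 1))
print(positions_gained_stats([5, 3, 10, 2], [3, 3, 12, 1], last_x=3))
```

Returning `None` for drivers without history keeps the decision about how to fill the gap for the Missing Values Imputation step.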

Constructors Features

  • Constructors name
  • Constructors championships won
  • Constructors races won
  • Constructors races won that season
  • Constructors championships won in last X years
  • Constructors championship classification last year
  • Constructors championship position at the time
  • Max (Team-mate qualifying position, Driver qualifying position)
  • Max, min, avg positions gained/lost during last X races
  • Max, min, avg position in the last X races
  • Top speed as a percentage of the fastest top speed in the last race (e.g. 99.5%)
  • Times retired
  • Times retired in last X races
  • Max/min/avg speed in previous race
  • Rank on avg/std of speed in previous race
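
A few of the constructor features above are simple enough to sketch directly. The snippet below is only an illustration: the function names and the input format (one retirement count per past race) are hypothetical choices for the example.

```python
def worst_team_quali(driver_quali, teammate_quali):
    """Max of the two qualifying positions, i.e. the team's worse grid slot."""
    return max(driver_quali, teammate_quali)


def pct_of_top_speed(team_top_speed, best_top_speed):
    """Team top speed as a percentage of the fastest top speed in the race."""
    return team_top_speed / best_top_speed * 100.0


def retirements_in_last_x(retired_per_race, last_x):
    """Total retirements of a constructor over its last `last_x` races.

    `retired_per_race` is a chronological list (oldest first) with the number
    of the team's cars that retired in each past race (0, 1 or 2).
    """
    if last_x < 1:
        raise ValueError("last_x must be at least 1")
    return sum(retired_per_race[-last_x:])


# Made-up numbers, purely for illustration.
print(worst_team_quali(4, 7))
print(round(pct_of_top_speed(318.4, 320.0), 1))
print(retirements_in_last_x([0, 2, 1, 0, 1], last_x=3))
```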

Other Features

  • Circuit name
  • Race rank in season (i.e. 1-21)
  • Year
  • Average overtakes per race
  • Correlation between race and qualifying results per circuit
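
The per-circuit correlation between qualifying and race results could be computed along the following lines. This is a sketch under assumptions: the `(circuit, grid, finish)` tuple layout and the function names are invented for the example, and plain Pearson correlation on the positions is used (with no tied positions this coincides with rank correlation).

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long lists, or None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None  # no variation in one of the series
    return cov / sqrt(vx * vy)


def quali_race_correlation_by_circuit(results):
    """Map each circuit to the correlation between grid and finishing position.

    `results` is an iterable of (circuit, grid_position, finish_position).
    """
    by_circuit = defaultdict(lambda: ([], []))
    for circuit, grid, finish in results:
        by_circuit[circuit][0].append(grid)
        by_circuit[circuit][1].append(finish)
    return {c: pearson(g, f) for c, (g, f) in by_circuit.items()}


# Made-up results, purely for illustration.
sample = [
    ("monaco", 1, 1), ("monaco", 2, 2), ("monaco", 3, 3),
    ("spa", 1, 3), ("spa", 2, 1), ("spa", 3, 2),
]
print(quali_race_correlation_by_circuit(sample))
```

A value close to 1 would suggest a circuit where the starting order largely decides the result. As with the other historical features, only races before the one being predicted should go into the calculation.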

Please share your thoughts and suggestions on anything you could add to this list!

18 thoughts on “Building an F1 prediction engine – Feature Engineering Part II”

  1. Driver’s nationality: local drivers usually perform better on their home tracks. Driver’s position on this track last year, or over the last few years: maybe it’s a favourite or least favourite track for someone.

    1. Hadn’t thought of the 1st one, thanks! I’ll try to check whether it proves useful. The 2nd one is already implemented (Race and Qualifying position in same race last year) 🙂

  2. It would be interesting to see more details, because I’m also thinking about building a prediction model for F1 results (actually, I asked Google about it and your blog was on the first page, which is how I found this site). I will be using different technologies to store and show results, though. Maybe soon I will ask you about some technical details of the prediction engine, hoping for some help ).

  3. Hey Stergio, just found out about this site and I was excited with your idea.
    Even though this post might be outdated (about 8 months) I could suggest some features on the constructor side regarding the ‘Retiring’ case. For example Total Retirements in the past X years, or even in the past X races. Also, based on the weather forecast these days you could add another variable (some weather-encoding one).

    I’d be happy to hear what you think of it.
    Keep up the good work!

    1. Hi Andrea,

      Thanks for your comments 🙂

      I totally agree with your suggestions. Especially about the weather data! I would love to find some source of consistent data for each quali and race day. I haven’t managed to find such a dataset though. Do you happen to know anything on that?

      1. Hey Stergio,
        I could suggest the Python weather module, which returns predictions based on location.
        I don’t know the accuracy level of those predictions, but for experimenting with the features it should be OK.
        Furthermore, the Weather API could be useful to you for re-training your models or even doing some historical EDA based on historical weather data. You can check it out here:

        Looking forward to further suggestions/ improvements/recommendations and discussion on this stuff.

        Have a great day!

        1. Thank you Andrea! I’ll have a look at those APIs. However, from a first look, they do not seem to offer historical data for a long period of time.

          The current model is trained on data since 2000, so I would need to have some weather data at least for the past few years.

          I think that the ‘easiest’ improvements to the model could come from better feature engineering on data. Any ideas? What would help discriminate if a driver will finish ahead of another one?

        1. Hi Athanasie, I checked the link and it’s a really nice source of data. However, it provides historical data up to 7 days ago.

          I’ll see if I find time to create a script to collect weather data from now on.
          Thanks a lot anyway! 🙂

          1. I hope it will help in the future 😉

            For starters, I suggest that creating just a few nominal weather flags will be enough, e.g. rain, fog, sunshine, etc. I believe that if all available weather data were included in the dataset, since the data instances are not so many, they would raise the complexity without giving enough valuable information to the classifier. Sometimes less is more when transforming continuous variables to nominal ones!

            You have an amazing project Stergios and we all learn from it!

          2. Yeah, I totally agree with ‘classifying’ the weather data into a few specific categories. I hope I’ll be able to collect this kind of data in the future!

            Thanks a lot for your nice words!! 🙂
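
The nominal weather flags suggested in this thread could look roughly like the sketch below. Everything in it is an assumption made for illustration: the input measurements, the category names and especially the thresholds, which would have to be adapted to whatever weather source ends up being used.

```python
def weather_flag(rain_mm, visibility_km, cloud_cover_pct):
    """Collapse raw weather measurements into one nominal category.

    The thresholds are illustrative guesses, not tuned values.
    """
    if rain_mm > 0.0:
        return "rain"
    if visibility_km < 1.0:
        return "fog"
    if cloud_cover_pct >= 50.0:
        return "cloudy"
    return "sunshine"


# Made-up measurements, purely for illustration.
print(weather_flag(rain_mm=3.2, visibility_km=8.0, cloud_cover_pct=90.0))
print(weather_flag(rain_mm=0.0, visibility_km=10.0, cloud_cover_pct=10.0))
```

Keeping the number of categories small limits the extra complexity on a dataset with relatively few race instances, which is the concern raised above.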

  4. Hi, this is super interesting stuff. I’ve got a bit of a quants background from uni but haven’t done any in years, so not sure if my comment makes a lot of sense…
    Just curious if your model takes into account type of track (e.g. favors straight line speed vs cornering grip) and related quality of the car.
    So, for example, using historic data on quali/race finishing positions you should be able to work out that when a car performs well on track x it is more likely to perform well on tracks a and b but less likely to perform well on tracks c and d.
    Hope that makes sense and would be very interested to hear your thoughts,

    1. Hi Erik, yes your comment makes total sense!
      I haven’t done exactly what you’re suggesting (like finding similar tracks where a certain car performs better) but I am including the race name as a feature. So the model is (or should be, at least) learning from the past data that car A is usually finishing within top-N positions in certain circuits while in others it doesn’t. It should also be learning that in certain circuits (e.g. Monaco) the starting position matters more compared to others.

      But it is surely worth investigating whether I can ‘group’ together some circuits that have similar characteristics and see whether this improves the model or not. Thanks!

      1. Hey,
        Awesome! Here are a few more thoughts:
        – Regarding ‘race name’ or the various ‘same race last year’ features, does the model take into account drivers swapping teams or constructors swapping engine manufacturers or designing completely new chassis? Because depending on these variables the performance at the same race last year obviously may be more or less significant.
        – Regarding ‘positions gained’, is this positions gained out of possible positions to be gained? I.e. Vettel starting in pole and finishing 1st obviously is as good as he could have possibly done.
        – Regarding retirements, does the model take into account whether the retirement was caused by the driver or the constructor? E.g. Verstappen making contact with Hamilton has no impact on Ricciardo, however Haas double dnf and Raikkonen dnf all because of Ferrari wheelnut design could impact Vettel.
        – Are the ‘performance in the last X races’ features weighted? I.e. form over the last 3 races may be more important than form over the last 6 races due to car updates and driver mentality.
        – Some other possible features: Results from practice sessions (laps completed, fastest lap, crash, etc.), upgrades to car made before weekend, driver success prior to F1.
        All the best,

        1. That’s the best feedback I’ve got so far! 🙂

          All your suggestions are truly useful and provide great ideas for feature engineering. My answers are below:
          – No, it does not do that directly. However, I’m also using information like ‘car-year’, which captures any big differences in performance year-over-year. Unfortunately, the source data do not have engine information at all. Regarding drivers swapping teams, of course I include the driver’s name and the respective car in the data, but not in the calculation of the ‘same race last year’ features. FYI, these kinds of features do add some value to the model but are not the most important ones.

          – That’s a great idea I hadn’t thought of! I’ll definitely try implementing it!

          – Retirements are currently ignored by the model. I’ve thought about it many times but I did not come up with a solution of how I can incorporate them to produce better predictions. Still, I do not have such detailed information on the retirement causes.

          – The ML model is a gradient boosting machine, i.e. a tree-based model. So, there is not anything similar to ‘weights’ as in linear regression. I’m letting the model figure out which of these features are more important than others.

          – Including data from practice sessions is something that I desperately want to have. However, such data are not included in the source data (coming from and are difficult to collect in a consistent way for the past, say 20, years of F1 races. Finally, regarding upgrades, I do not have such data (and even if I had, you cannot know if an upgrade will give increased performance – this would be better captured by using data from practice sessions). I’ve never thought of driver success prior to F1. I guess this would only be useful for rookie drivers. Still, I’m too lazy to gather such data!

          If you have any concrete ideas on how I could include retirements for improving the predictions, I’d be happy to hear them!
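
Two of the ideas discussed in this thread, positions gained relative to what was possible and form that weighs recent races more heavily, can be expressed as ordinary engineered features that a tree-based model simply receives as inputs. The sketch below is hypothetical: the function names, the default field size of 20, the choice to score a pole-to-win result as 1.0 and the decay value are all assumptions made for illustration.

```python
def positions_gained_ratio(grid, finish, field_size=20):
    """Positions gained (or lost) as a share of what was possible.

    Gains are divided by the places available ahead of the grid slot,
    losses by the places available behind it, so the result lies in [-1, 1].
    """
    gained = grid - finish
    if gained >= 0:
        possible = grid - 1
        # Holding pole to the flag is the best achievable result: score it 1.0.
        return 1.0 if possible == 0 else gained / possible
    return gained / (field_size - grid)


def recency_weighted_avg(finishes, decay=0.5):
    """Average finishing position with exponentially decaying weights.

    `finishes` is chronological (oldest first); the most recent race gets
    weight 1, the one before it `decay`, then `decay ** 2`, and so on.
    """
    if not finishes:
        return None
    weights = [decay ** age for age in range(len(finishes) - 1, -1, -1)]
    return sum(w * f for w, f in zip(weights, finishes)) / sum(weights)


# Made-up numbers, purely for illustration.
print(positions_gained_ratio(grid=5, finish=3))
print(recency_weighted_avg([8, 4, 2], decay=0.5))
```

With `decay=1.0` the weighted average reduces to the plain mean over the last X races, so both variants could be offered to the model side by side and compared.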

  5. Hi Stergios,

    I’ve just found your website and I am really enthusiastic about it! Currently I am studying Econometrics and for my Bachelor’s thesis I wrote a replication of a paper from Andrew Phillips you are probably familiar with. If not, his paper is a nice read and I’ve attached the reference below.

    Regarding the feature engineering, I think a possible idea is to include economic or political factors. Ross Brawn mentions in his book on his career that there are three areas where one has to win in order to be successful in F1: technically, economically and politically. Especially with upcoming regulation changes, it might be useful to include economic factors such as budgets and political factors, although I’m not sure how this can be incorporated. These factors will probably have a “lagged” effect, since if a team has a high budget in 2015, it will possibly affect the development of the 2015 car a bit, but will affect the 2016 car even more.

    Another example: will Ferrari lose its historic pay in the coming Concorde Agreement? It is very much possible that they will withdraw from F1, but if not, the budget will be lower, ceteris paribus.

    I have no data sources ready to provide such information, but I think it would be nice to include such factors. Do you agree?

    Regarding the data on retirements. I know that Phillips has such a data set where he rates these retirements as driver error or not. I reached out to him to use this data for my thesis and I got it, but I’m not sure if I’m allowed to share it with you. You might try to contact him via Twitter! I think he would find this website very interesting.

    In the spirit of data science: is it an idea to make a webscraping tool to scrape the practice results off the F1 website? Not sure if it’s possible but maybe a nice project on this website.

    I am curious what you think.

    Kind regards,

    F1 paper:
    Phillips, Andrew J. K. (2014), Uncovering Formula One driver performances from 1950 to 2013 by adjusting for team and competition effects, Journal of Quantitative Analysis in Sports, 10, issue 2.

    1. Hi Daan,

      Thank you for your nice comments!

      I’m aware of this paper from Andrew Phillips although I haven’t had the chance to go through it yet. I definitely should!

      Regarding the economic (e.g. budget) features, it makes total sense that those may help the model. Of course, collecting consistent data for the past, say, 20 years is rather hard. Please let me know if you happen to find any such source. On the other hand, the budget of any team is going to show up as increased performance. For instance, if Ferrari have a high budget then they will rank high in the constructors championship. Therefore, the model is indirectly already getting this information.

      On the political features, can you clarify what you mean? What political factors could be added and how can they be encoded to be used by a model?

      Lastly, about the F1 practice webscraping tool, getting the final classification for each practice session should be fairly easy (from I know that FIA also provides detailed lap times in pdf format. This should make any parser trickier. I may work on it some time later although at the moment I have frozen the development of my model due to limited time. I’d be happy to work with you if you decide to pursue this.

      Thank you,
