Building an F1 prediction engine – Feature Engineering Part II

June 4, 2017 Stergios Comments 18 comments

Having explored the data, checked what is available and what is not, and found any inconsistencies and potential problems, it is time to get creative. Feature extraction is the step where an experienced data scientist can really make a difference and improve the subsequent model’s accuracy, often by more than is possible through algorithm selection and fine-tuning.

In this step, you should try to brainstorm as many features as you can, so it is worth spending as much time on it as possible. Do not censor any ideas here. Just note them down and implement them, and the data will tell you whether each one helps or not.

This is where we are in our ML pipeline:

Problem Definition
Data Acquisition
Feature Engineering
- Data Exploration
- Feature Extraction
- Missing Values Imputation
- Feature Encoding
- Feature Selection – Dimensionality Reduction
Predictive Modelling
Model Deployment

To come up with features, it helped me to organise my thoughts around three feature groups: one referring to the drivers, one referring to the constructors, and one for features not directly related to the previous two categories. Below I show some features for each category. Note that the lists are indicative, not exhaustive.

Driver Features

Qualifying position
Driver name
Driver age at that time
Years in F1
Qualifying time as a percentage of the pole-position time (e.g. 102.3%)
Starts in front-row
Races won in career
Races won in season till that race
Races started
Races finished
Pole positions won
Drivers championships won
Driver championship classification last year
Drivers championship position this season
Max, min, avg positions gained/lost during last X races
Max, min, avg finishing position in the last X races
Correlation between qualifying and race results per driver
Previous race final position
Previous race qualifying position
Positions gained in previous race
Race and Qualifying position in same race last year
Positions gained in same race last year
Race time as a percentage of the winner’s time in the last race (e.g. 102.3%)
Number of pit-stops in same race last year
Avg lap-time excl. pit stops in last race
Avg lap-time consistency excl. pit stops in last race
Max/min/avg/std speed in previous race
Rank on avg/std of speed in previous race
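
As a rough illustration, here is a minimal pandas sketch of how a few of the driver features above could be derived: qualifying time relative to pole, positions gained, and max/min/avg positions gained over the last X races. The table layout, the column names (driver, round, grid, finish, quali_time_s), the helper lagged_rolling and all the numbers are assumptions made up for this example; they are not the schema or code used in the actual model. Each rolling statistic is shifted by one race so that a row only sees results that were known before that race.

```python
import pandas as pd

# Hypothetical toy results table; names and values are invented for illustration.
results = pd.DataFrame({
    "driver": ["HAM"] * 4 + ["VET"] * 4,
    "round": [1, 2, 3, 4] * 2,
    "grid": [1, 2, 1, 3, 2, 1, 4, 1],
    "finish": [2, 1, 1, 5, 1, 3, 2, 1],
    "quali_time_s": [82.188, 91.678, 88.769, 95.560,
                     82.456, 91.864, 89.247, 95.200],
})

# Qualifying time as a percentage of the fastest (pole) time in the same round.
pole_time = results.groupby("round")["quali_time_s"].transform("min")
results["quali_pct_of_pole"] = results["quali_time_s"] / pole_time * 100

# Positions gained (positive) or lost (negative) between grid and finish.
results["positions_gained"] = results["grid"] - results["finish"]


def lagged_rolling(series: pd.Series, window: int, stat: str) -> pd.Series:
    """Rolling statistic over the previous `window` races, excluding the current one."""
    rolled = series.shift(1).rolling(window, min_periods=1)
    return getattr(rolled, stat)()


results = results.sort_values(["driver", "round"]).reset_index(drop=True)

X = 3  # look-back window in races
for stat in ("max", "min", "mean"):
    results[f"gained_{stat}_last{X}"] = (
        results.groupby("driver")["positions_gained"]
        .transform(lagged_rolling, window=X, stat=stat)
    )

# Previous race final position for each driver.
results["prev_finish"] = results.groupby("driver")["finish"].shift(1)

print(results)
```

The first race of each driver has no history, so its lagged features come out as missing values and are left for the imputation step.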

Constructors Features

Constructors name
Constructors championships won
Constructors races won
Constructors races won in season till that race
Constructors championships won in last X years
Constructors championship classification last year
Constructors championship position at the time
Max (Team-mate qualifying position, Driver qualifying position)
Max, min, avg positions gained/lost during last X races
Max, min, avg position in the last X races
Top speed as a percentage of the highest top speed in the last race (e.g. 99.5%)
Times retired
Times retired in last X races
Max/min/avg speed in previous race
Rank on avg/std of speed in previous race
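
Two of the constructor features above can be sketched in the same way: the max of the two team-mates’ qualifying positions, and the number of retirements in the last X races. This is only a sketch under assumptions: the columns (constructor, round, driver, quali_pos, retired) and the toy values are hypothetical, and the real source data may flag retirements differently.

```python
import pandas as pd

# Hypothetical toy data: two drivers per constructor; `retired` marks a DNF.
results = pd.DataFrame({
    "round": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "constructor": ["mercedes", "mercedes", "ferrari", "ferrari"] * 3,
    "driver": ["HAM", "BOT", "VET", "RAI"] * 3,
    "quali_pos": [1, 3, 2, 4, 2, 4, 1, 3, 1, 2, 4, 3],
    "retired": [False, False, False, True,
                False, True, False, False,
                False, False, True, False],
})

# Max(team-mate qualifying position, driver qualifying position),
# i.e. the worse of the two cars' qualifying positions in that round.
results["team_worst_quali"] = (
    results.groupby(["constructor", "round"])["quali_pos"].transform("max")
)

# Retirements per constructor per round ...
per_round = (
    results.groupby(["constructor", "round"])["retired"]
    .sum()
    .rename("dnfs")
    .reset_index()
    .sort_values(["constructor", "round"])
)

# ... summed over the previous X rounds (shifted so the current race is excluded).
X = 2
per_round["dnfs_last_x"] = (
    per_round.groupby("constructor")["dnfs"]
    .transform(lambda s: s.shift(1).rolling(X, min_periods=1).sum())
)

results = results.merge(
    per_round[["constructor", "round", "dnfs_last_x"]],
    on=["constructor", "round"],
    how="left",
)

print(results)
```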

Other Features

Circuit name
Race rank in season (i.e. 1-21)
Year
Average overtakes per race
Correlation between race and qualifying results per circuit
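
The per-circuit correlation between qualifying and race results could be computed along the following lines. This sketch uses a Spearman rank correlation, calculated as the Pearson correlation of the ranks; that choice, the column names (circuit, grid, finish) and the toy numbers are all assumptions for illustration rather than the method used in the actual model.

```python
import pandas as pd

# Hypothetical toy data: one past race per circuit, five classified drivers each.
results = pd.DataFrame({
    "circuit": ["monaco"] * 5 + ["spa"] * 5,
    "grid":   [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "finish": [1, 2, 3, 5, 4, 3, 1, 5, 2, 4],
})


def rank_correlation(group: pd.DataFrame) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return group["grid"].rank().corr(group["finish"].rank())


# One correlation value per circuit.
circuit_corr = (
    results.groupby("circuit")[["grid", "finish"]]
    .apply(rank_correlation)
    .rename("quali_race_corr")
)

# Attach the circuit-level value to every row of that circuit.
results = results.merge(circuit_corr, left_on="circuit", right_index=True)

print(circuit_corr)
```

A high value suggests a circuit where the starting order tends to be preserved. When used as a model feature, the correlation should be computed only from races held before the one being predicted, otherwise the result being predicted leaks into its own feature.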

Please share your thoughts and suggestions on anything you could add to this list!

18 thoughts on “Building an F1 prediction engine – Feature Engineering Part II”

Bogdan says:

June 4, 2017 at 7:03 pm

Driver’s nationality; local drivers usually perform better on their home tracks. Driver’s position on this track last year, or in the last few years; maybe it’s a favourite or least favourite track for someone.

1. Stergios says:
  
  June 4, 2017 at 7:33 pm
  
  Hadn’t thought of the 1st one, thanks! I’ll try to check whether it proves useful. The 2nd one is already implemented (Race and Qualifying position in same race last year) 🙂
  
Bogdan says:

June 5, 2017 at 12:55 am

It will be interesting to see more details, because I’m also thinking about building a prediction model for F1 results (actually, I asked Google about it and your blog was on the first page; that’s how I found this site). But I will be using different technologies to store and show results. Maybe soon I will ask you about some technical details of the prediction engine, hoping for any help ).

1. Stergios says:
  
  June 5, 2017 at 9:15 am
  
  Hey, that’s great! Sure, I’ll be happy to help!
  
Andreas says:

February 15, 2018 at 11:21 pm

Hey Stergio, just found out about this site and I was excited with your idea.
Even though this post might be outdated (about 8 months) I could suggest some features on the constructor side regarding the ‘Retiring’ case. For example Total Retirements in the past X years, or even in the past X races. Also, based on the weather forecast these days you could add another variable (some weather-encoding one).

I’d be happy to hear what you think of it.
Keep up the good work!

1. Stergios says:
  
  February 17, 2018 at 5:52 pm
  
  Hi Andrea,
  
  Thanks for your comments 🙂
  
  I totally agree with your suggestions. Especially about the weather data! I would love to find some source of consistent data for each quali and race day. I haven’t managed to find such a dataset though. Do you happen to know anything on that?
  
  1. Andreas says:
    
    February 22, 2018 at 11:21 am
    
    Hey Stergio,
    I could suggest the Python weather module, which returns predictions based on location.
    https://pypi.python.org/pypi/weather-api/0.0.4
    I don’t know the accuracy level of those predictions, but in order for you to experiment with the features it should be ok.
    Furthermore, the Weather API could be useful to you for re-training your models or even doing some historical EDA based on historical weather data. You can check it out here:
    https://openweathermap.org/api.
    
    Looking forward to further suggestions/ improvements/recommendations and discussion on this stuff.
    
    Have a great day!
    
    1. Stergios says:
      
      February 24, 2018 at 12:17 am
      
      Thank you Andrea! I’ll have a look at those APIs. However, from a first look, they do not seem to offer historical data for a long period of time.
      
      The current model is trained on data since 2000, so I would need to have some weather data at least for the past few years.
      
      I think that the ‘easiest’ improvements to the model could come from better feature engineering on ergast.com data. Any ideas? What would help discriminate if a driver will finish ahead of another one?
      
  2. Athanasios says:
    
    February 23, 2018 at 2:24 pm
    
    Hi guys,
    
    I found this:
    http://www.myweather2.com/motor-racing.aspx
    
    It has weather data for motor racing and thus for F1 races. I haven’t checked if it has old weather data available but I hope it will help for future data collection.
    
    Cheers,
    Athanasios
    
    1. Stergios says:
      
      February 24, 2018 at 12:19 am
      
      Hi Athanasie, I checked the link and it’s a really nice source of data. However, it only provides historical data going back 7 days.
      
      I’ll see if I find time to create a script to collect weather data from now on.
      Thanks a lot anyway! 🙂
      
      1. Athanasios says:
        
        March 12, 2018 at 2:37 pm
        
        I hope it will help in the future 😉
        
        For starters, I suggest that creating just a few nominal weather flags will be enough, e.g. rain, fog, sunshine, something like that. I believe that if all available weather data are included in the dataset, then, since the data instances are not so many, they will raise the complexity without giving enough valuable information to the classifier. Sometimes less is better: transforming continuous variables to nominal ones!
        
        You have an amazing project Stergios and we all learn from it!
      2. Stergios says:
        
        March 12, 2018 at 8:17 pm
        
        Yeah, totally agree with ‘classifying’ the weather data into a few specific categories. I hope I’ll be able to collect this kind of data in the future!
        
        Thanks a lot for your nice words!! 🙂
Erik says:

April 10, 2018 at 8:23 pm

Hi, this is super interesting stuff. I’ve got a bit of a quants background from uni but haven’t done any in years, so not sure if my comment makes a lot of sense…
Just curious if your model takes into account type of track (e.g. favors straight line speed vs cornering grip) and related quality of the car.
So, for example, using historic data on quali/race finishing positions you should be able to work out that when a car performs well on track x it is more likely to perform well on tracks a and b but less likely to perform well on tracks c and d.
Hope that makes sense and would be very interested to hear your thoughts,
Erik

1. Stergios says:
  
  April 11, 2018 at 9:47 am
  
  Hi Erik, yes your comment makes total sense!
  I haven’t done exactly what you’re suggesting (like finding similar tracks where a certain car performs better) but I am including the race name as a feature. So the model is (or should be, at least) learning from the past data that car A is usually finishing within top-N positions in certain circuits while in others it doesn’t. It should also be learning that in certain circuits (e.g. Monaco) the starting position matters more compared to others.
  
  But it is surely worth investigating whether I can ‘group’ together some circuits that have similar characteristics and see whether this improves the model or not. Thanks!
  
  1. Erik says:
    
    April 12, 2018 at 6:55 pm
    
    Hey,
    Awesome! Here are a few more thoughts:
    – Regarding ‘race name’ or the various ‘same race last year’ features, does the model take into account drivers swapping teams or constructors swapping engine manufacturers or designing completely new chassis? Because depending on these variables the performance at the same race last year obviously may be more or less significant.
    – Regarding ‘positions gained’, is this positions gained out of possible positions to be gained? I.e. Vettel starting in pole and finishing 1st obviously is as good as he could have possibly done.
    – Regarding retirements, does the model take into account whether the retirement was caused by the driver or the constructor? E.g. Verstappen making contact with Hamilton has no impact on Ricciardo, however Haas double dnf and Raikkonen dnf all because of Ferrari wheelnut design could impact Vettel.
    – Are the ‘performance in the last X races’ features weighted? I.e. form over the last 3 races may be more important than form over the last 6 races due to car updates and driver mentality.
    – Some other possible features: Results from practice sessions (laps completed, fastest lap, crash, etc.), upgrades to car made before weekend, driver success prior to F1.
    All the best,
    Erik
    
    1. Stergios says:
      
      April 13, 2018 at 9:35 am
      
      That’s the best feedback I’ve got so far! 🙂
      
      All your suggestions are truly useful and provide great ideas for feature engineering. My answers are below:
      – No, it does not do that directly. However, I’m also using information like ‘car-year’, which captures any big differences in performance year-over-year. Unfortunately, the source data do not have engine information at all. Regarding drivers swapping teams, of course I include the driver’s name and the respective car in the data but not in the calculation of the ‘same race last year’ features. FYI, these kinds of features do add some value to the model but are not the most important ones.
      
      – That’s a great idea I hadn’t thought of! I’ll definitely try implementing it!
      
      – Retirements are currently ignored by the model. I’ve thought about it many times but I have not come up with a way to incorporate them to produce better predictions. Also, I do not have such detailed information on the retirement causes.
      
      – The ML model is a gradient boosting machine, i.e. a tree-based model. So, there is nothing similar to ‘weights’ as in linear regression. I’m letting the model figure out which of these features are more important than others.
      
      – Including data from practice sessions is something that I desperately want to have. However, such data are not included in the source data (coming from ergast.com) and are difficult to collect in a consistent way for the past, say 20, years of F1 races. Finally, regarding upgrades, I do not have such data (and even if I had, you cannot know if an upgrade will give increased performance; this would be better captured by using data from practice sessions). I’ve never thought of driver success prior to F1. I guess this would only be useful for rookie drivers. Still, I’m too lazy to gather such data!
      
      If you have any concrete ideas on how I could include retirements for improving the predictions, I’d be happy to hear them!
      
Daan says:

August 6, 2019 at 3:45 pm

Hi Stergios,

I’ve just found your website and I am really enthusiastic about it! Currently I am studying Econometrics and for my Bachelor’s thesis I wrote a replication of a paper from Andrew Phillips you are probably familiar with. If not, his paper is a nice read and I’ve attached the reference below.

Regarding the feature engineering, I think a possible idea is to include economic or political factors. Ross Brawn mentions in his book on his career that there are three areas where one has to win in order to be successful in F1: technically, economically and politically. Especially with upcoming regulation changes, it might be useful to include economic factors such as budgets, as well as political factors, although I’m not sure how these can be incorporated. These factors will probably have a “lagged” effect: if a team has a high budget in 2015, it will possibly affect the development of the 2015 car a bit, but will affect the 2016 car even more.

Another example: will Ferrari lose its historic pay in the coming Concorde Agreement? It is very much possible that they will withdraw from F1, but if not, the budget will be lower, ceteris paribus.

I have no data sources ready to provide such information, but I think it would be nice to include such factors. Do you agree?

Regarding the data on retirements. I know that Phillips has such a data set where he rates these retirements as driver error or not. I reached out to him to use this data for my thesis and I got it, but I’m not sure if I’m allowed to share it with you. You might try to contact him via Twitter! I think he would find this website very interesting.

In the spirit of data science: is it an idea to make a webscraping tool to scrape the practice results off the F1 website? Not sure if it’s possible but maybe a nice project on this website.

I am curious what you think.

Kind regards,
Daan

F1 paper:
Phillips, Andrew J. K. (2014). Uncovering Formula One driver performances from 1950 to 2013 by adjusting for team and competition effects. Journal of Quantitative Analysis in Sports, 10(2).

1. Stergios says:
  
  August 8, 2019 at 1:57 pm
  
  Hi Daan,
  
  Thank you for your nice comments!
  
  I’m aware of this paper from Andrew Phillips although I haven’t had the chance to go through it yet. I definitely should!
  
  Regarding the economic (e.g. budget) features, it makes total sense that those may help the model. Of course, collecting consistent data for the past, say, 20 years is rather hard. Please let me know if you happen to find any such source. On the other hand, the budget of any team is going to show up as increased performance. For instance, if Ferrari have a high budget then they will rank high in the constructors championship. Therefore, the model is indirectly already getting this information.
  
  On the political features, can you clarify what you mean? What political factors could be added and how can they be encoded to be used by a model?
  
  Lastly, about the F1 practice webscraping tool, getting the final classification for each practice session should be fairly easy (from F1.com). I know that the FIA also provides detailed lap times in PDF format, which would make any parser trickier to write. I may work on it some time later, although at the moment I have frozen the development of my model due to limited time. I’d be happy to work with you if you decide to pursue this.
  
  Thank you,
  Stergios
  

F1 predictor

Machine-learning based F1 race prediction engine
