Having defined what we are trying to achieve, it’s time to start thinking about what data we need and how we are going to obtain them in order to create features and train our models.
- Problem Definition
- Data Acquisition
- Feature Engineering
- Predictive Modelling
- Model Deployment
Since the challenge of predicting F1 race outcomes is so open, you have to think about what data you need and what data are actually available. My first thought was that I needed historical data for the Grands Prix: who won each race, where he started on the grid, what car he drove, how experienced he was and so on. Where would I get this data from? Should I scrape F1-related websites, or is there some ready-to-use data source?
Fortunately, there is one, and it is called ergast.com. Ergast.com is “an experimental web service which provides a historical record of motor racing data”. It provides historical data going back to 1950 (!!) and is updated after each race. Since I needed a lot of historical data (usually, the more the better for ML algorithms), I decided to use the database image instead of making many queries against the API.
I am more familiar with PostgreSQL, so I decided to load the data into a local PostgreSQL database. However, the database images provided are in MySQL format. First problem: as I found out, there is no easy way of importing a MySQL dump into a PostgreSQL DB. There are several type mismatches between the two formats that you have to fix manually. As this is a procedure I’ll have to repeat after each Grand Prix (the DB is updated, and I’ll need to download it again in order to update the prediction model), I had to automate it somehow. So I searched for a script that could do that for me and found a great MySQL to PostgreSQL converter. With a few additions to the code, my PostgreSQL-compatible .sql file was ready!
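To give a feel for the kind of mismatches involved, here is a minimal sketch of such a rewrite in Python. It is not the converter I used, just an illustration; the function name `mysql_to_postgres` and the rule list are my own assumptions, and a real dump needs many more cases (escaping, indexes, sequences and so on).

```python
import re

# Illustrative rewrite rules only. They are applied to the whole dump text,
# so they would also touch matching text inside INSERT data; a real
# converter has to be more careful than this.
_RULES = [
    # MySQL quotes identifiers with backticks, PostgreSQL with double quotes.
    (re.compile(r"`([^`]*)`"), r'"\1"'),
    # Display widths such as int(11) are not valid in PostgreSQL.
    (re.compile(r"\bbigint\(\d+\)", re.I), "bigint"),
    (re.compile(r"\b(?:tiny|small|medium)?int\(\d+\)", re.I), "integer"),
    # Type names that differ between the two systems.
    (re.compile(r"\bdatetime\b", re.I), "timestamp"),
    (re.compile(r"\bdouble\b(?!\s+precision)", re.I), "double precision"),
    # AUTO_INCREMENT is simply dropped here; PostgreSQL would use a
    # serial or identity column instead.
    (re.compile(r"\s+AUTO_INCREMENT\b(?:=\d+)?", re.I), ""),
    # Table options like ENGINE=... do not exist in PostgreSQL.
    (re.compile(r"\)\s*ENGINE=[^;]*;", re.I), ");"),
]


def mysql_to_postgres(sql: str) -> str:
    """Apply each rewrite rule in order to the text of a MySQL dump."""
    for pattern, replacement in _RULES:
        sql = pattern.sub(replacement, sql)
    return sql


example = (
    "CREATE TABLE `drivers` (\n"
    "  `driverId` int(11) NOT NULL AUTO_INCREMENT,\n"
    "  `dob` datetime DEFAULT NULL\n"
    ") ENGINE=MyISAM DEFAULT CHARSET=utf8;"
)
print(mysql_to_postgres(example))
```

Keeping the rules in one list makes it easy to add a new one whenever a fresh dump fails to load, which is roughly what "a few additions to the code" amounts to.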
After running the .sql file, my DB was populated with data in 13 tables. The DB has a star-like schema, with the results table at its center holding all the information about the outcome of each race. This table can be joined with the other tables to get information like driver name, circuit name, race data, finishing status (e.g. retired) etc.
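The join pattern described above can be sketched as follows. To keep the example self-contained it uses an in-memory SQLite database as a stand-in for PostgreSQL, with a cut-down version of the schema; the table and column names here are assumptions for illustration, and the rows are made up.

```python
import sqlite3

# In-memory stand-in for the real database. Only four of the tables are
# mimicked, with hypothetical column names and invented rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE drivers (driverId INTEGER PRIMARY KEY, surname TEXT);
CREATE TABLE races   (raceId INTEGER PRIMARY KEY, year INTEGER, name TEXT);
CREATE TABLE status  (statusId INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE results (
    resultId      INTEGER PRIMARY KEY,
    raceId        INTEGER REFERENCES races(raceId),
    driverId      INTEGER REFERENCES drivers(driverId),
    statusId      INTEGER REFERENCES status(statusId),
    grid          INTEGER,  -- starting position
    positionOrder INTEGER   -- finishing order
);
INSERT INTO drivers VALUES (1, 'Driver A'), (2, 'Driver B');
INSERT INTO races   VALUES (10, 2000, 'Example Grand Prix');
INSERT INTO status  VALUES (1, 'Finished'), (2, 'Retired');
INSERT INTO results VALUES (100, 10, 1, 1, 2, 1), (101, 10, 2, 2, 1, 2);
""")

# results sits in the middle of the star; each join resolves one of its
# foreign keys into something human-readable.
QUERY = """
SELECT r.name, d.surname, res.grid, res.positionOrder, s.status
FROM results AS res
JOIN races   AS r ON r.raceId   = res.raceId
JOIN drivers AS d ON d.driverId = res.driverId
JOIN status  AS s ON s.statusId = res.statusId
ORDER BY res.positionOrder
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The same query shape should carry over to the PostgreSQL database once the real table and column names are substituted.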
The next step, after getting the data into the DB, is to play around with it to see exactly what’s available and what’s not. This is what I’ll discuss in my next post!