In the previous series of posts I discussed and explained the steps involved in Feature Engineering. In this series, I will talk about the coolest part of applied ML: the Predictive Modelling phase. This is where you get to use all the ‘magic’ power of machine learning algorithms and see the performance of any models you build. In this post I’ll start by showing the most common evaluation metrics and then reveal the custom evaluation metric I use for assessing the F1-predictor results.
- Data Acquisition
- Feature Engineering
- Predictive Modelling
- Evaluation Metric Selection
- Cross-Validation Set-up
- Algorithmic Approach
- ML Algorithm Selection – Hyper-parameter Tuning
- Model Deployment
So, what is an evaluation metric and why do we need one (and only one)? Simply put, an evaluation metric is a formula that takes a model’s predictions and the actual target variable, compares them and returns a number showing how well or badly the model performs.
But why do we need one? We can always check the predictions ourselves and get a sense of the performance! Hmm… no, we can’t actually do that. How would you check thousands if not millions of predictions? Even if you could, would you trust your gut feeling for a real-world ML project? The answer is that you need an evaluation metric so that a computer can automatically and objectively assess the performance of the model. This way, you have a trusted assessment of the model and you can check whether a new feature improved this metric or not, whether a certain ML algorithm performs better than another one and so on. This metric will guide you in the right direction for many of the choices you have to make.
Now that we are persuaded about the necessity of such a metric, let’s see the most common ones.
Accuracy is the most intuitive metric used in classification problems and shows the percentage of cases that have been correctly predicted by the model. The major drawback of this metric is that it does not care about the prediction probabilities of a model, but only about whether the predicted class is correct or not. For example, a model that correctly identifies a dog picture with 51% probability gets exactly the same score as one that assigns 99% probability to that class.
For this reason (and because it lacks some nice mathematical properties), classification accuracy is rarely (if ever) used as an evaluation metric in the ML optimization process. However, it’s usually reported alongside the other results since it gives a clear understanding of the performance.
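To make the 51% vs 99% point concrete, here is a minimal sketch in plain Python. The `accuracy` helper, the 0.5 threshold and the four made-up labels are assumptions for illustration only, not part of F1-predictor.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of cases where the thresholded prediction matches the label."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels: 1 = dog picture, 0 = not a dog picture
y_true = [1, 1, 0, 0]

# Two hypothetical models: one barely sure, one very sure
hesitant = [0.51, 0.51, 0.49, 0.49]
confident = [0.99, 0.99, 0.01, 0.01]

# Both models pick the right class every time, so accuracy cannot tell them apart
print(accuracy(y_true, hesitant))
print(accuracy(y_true, confident))
```

Both calls return the same score, even though the second model is far more certain about its (correct) answers.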
Logarithmic loss (or log loss for short) addresses the two issues with classification accuracy mentioned above. This metric is used when we need a classifier that not only gives the correct class but also assigns high probabilities to its most probable output. As a result, any predictions with high probabilities on the wrong class are heavily penalized. Unlike accuracy, this metric does have mathematical properties (for example, it is smooth and differentiable with respect to the predicted probabilities) that make it well suited to ML model assessment. Note that there is also a multiclass version of this metric that works for multiclass (i.e. not binary) classification problems.
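As a rough illustration, the binary version of log loss can be sketched in a few lines of plain Python. The `log_loss` helper and the `eps` clipping value are assumptions made for this example; in practice you would normally use a library implementation.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary log loss: mean of -[t*log(p) + (1-t)*log(1-p)]."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        # Clip probabilities so that log(0) can never occur
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# One dog picture (label 1) scored by three hypothetical models
barely_right = log_loss([1], [0.51])   # correct class, low confidence
very_right = log_loss([1], [0.99])     # correct class, high confidence
very_wrong = log_loss([1], [0.01])     # wrong class, high confidence

print(very_right, barely_right, very_wrong)
```

The confident correct prediction gets the smallest loss, the hesitant correct one a larger loss, and the confident wrong one is penalized by far the most.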
Root Mean Squared Error (RMSE)
RMSE is the most common metric for regression problems. Instead of calculating the average of the absolute errors (as the Mean Absolute Error, MAE, does), this metric first squares the errors, averages the squares and then takes the square root of that average. This way, any predictions that are way off their actual target are penalized much more heavily. Whether RMSE is more suitable than MAE depends on the problem at hand.
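The difference between the two is easiest to see on a small example. The sketch below uses made-up numbers and hypothetical helper names (`mae`, `rmse`): both sets of predictions have the same average absolute error, but one concentrates it in a single large miss.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |error|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square, average, then take the square root."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

y_true = [10, 10, 10, 10]
evenly_off = [8, 12, 8, 12]     # every prediction is off by 2
one_big_miss = [10, 10, 10, 18] # three perfect predictions, one off by 8

print(mae(y_true, evenly_off), rmse(y_true, evenly_off))
print(mae(y_true, one_big_miss), rmse(y_true, one_big_miss))
```

MAE rates the two sets of predictions the same, while RMSE is twice as high for the set containing the single large miss.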
After this very concise summary of the metrics (you can view more of them on the sklearn page), here’s the custom metric used by F1-predictor:
Average Root Mean Squared Error excluding Retirements per Race (ARMSEeRpR)
I’m sure I didn’t need to provide an abbreviation of that metric 😛
Given that F1-predictor’s model tries to predict the finishing position of each driver in each race (i.e. not the probability of finishing in a specific position), this sounded to me like a regression task rather than a classification task. The rationale is that I want to penalize mistakes made by the prediction engine progressively, in proportion to how far off they are, and give zero penalty to absolutely correct predictions.
If I had treated this task as a classification task, then any prediction that was not correct would be considered equally wrong. For instance, if the model predicted 2nd position, it would get the same penalty no matter if the driver finished 3rd or 20th. There are also some learning-to-rank ML tasks and related metrics (e.g. NDCG) but I don’t think they fit well with the problem at hand.
The metric is calculated as follows:
- Any drivers who retire or are disqualified from the race are excluded from the calculation, while the rest of the grid is promoted by the respective number of places.
- The squared errors of all drivers who completed a race are averaged and the square root of that average is taken, in order to produce a single RMSE for each race.
The above is best explained in the following image:
The first three columns show the driver name, the predicted position and the actual finishing position of each driver. Drivers who were not classified at that race (for whatever reason) are denoted with an ‘R’. The 4th column shows the final predicted ordering of the drivers after ‘removing’ the ones who retired. You can see that the rest of the drivers were promoted some places up the order, depending on the number of drivers who retired and were predicted to finish ahead of them. The ‘error’ column is just the difference between the predicted position (after excluding retirements) and the actual finishing position. The ‘squared error’ column is just the ‘error’ column squared. Then, we calculate the average of this last column to get the ‘Mean Squared Error’ which, in this example, is 4.71. Its square root equals 2.17. If we average that last metric across all historical races, we get the final ARMSEeRpR.
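For readers who prefer code, the steps described above can be sketched roughly as follows. This is an illustration of the calculation as described in this post, not F1-predictor’s actual code: the function names (`race_rmse`, `armse_erpr`), the dictionary-based inputs and the two tiny made-up races are all assumptions, and the sketch assumes that the actual positions of classified drivers already run from 1 upwards.

```python
import math

def race_rmse(predicted, actual):
    """RMSE for one race, excluding drivers marked 'R' (not classified).

    predicted: dict of driver -> predicted position
    actual:    dict of driver -> finishing position, or 'R'
    """
    finishers = [d for d in predicted if actual[d] != "R"]
    # Promote the remaining drivers: re-rank the predicted order among finishers only
    ranked = sorted(finishers, key=lambda d: predicted[d])
    adjusted = {driver: rank for rank, driver in enumerate(ranked, start=1)}
    squared_errors = [(adjusted[d] - actual[d]) ** 2 for d in finishers]
    # Note: a race with no classified drivers would need special handling
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def armse_erpr(races):
    """Average of the per-race RMSEs over a list of (predicted, actual) pairs."""
    return sum(race_rmse(p, a) for p, a in races) / len(races)

# Made-up race 1: driver B retires, so C, D and E are each promoted one place
predicted_1 = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
actual_1 = {"A": 2, "B": "R", "C": 1, "D": 3, "E": 4}

# Made-up race 2: a perfect prediction
predicted_2 = {"X": 1, "Y": 2}
actual_2 = {"X": 1, "Y": 2}

print(race_rmse(predicted_1, actual_1))
print(armse_erpr([(predicted_1, actual_1), (predicted_2, actual_2)]))
```

In the first made-up race the adjusted predictions are A 1st, C 2nd, D 3rd and E 4th, so only A and C are off (by one place each), and the perfect second race pulls the average down.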
In later posts I’ll reveal F1-predictor’s ARMSEeRpR over the past few years’ races. To get an idea of whether this is good or not, I’ll compare it to a benchmark. For qualifying predictions, the benchmark will be the previous race’s qualifying result, while for race predictions the benchmark will be the starting position.
Do you have any questions or suggestions on the topic? Always happy to hear them 🙂