I have wanted to build this for the past three years: find out whether the drivers’ results in a race are reflected in the sentiment of their fans’ posts on social media. After launching this blog, I finally found the motivation to implement it. In this post I present an R Shiny app that shows the results, and I describe the process behind creating it (code included).
For those who want to skip the ‘how’ and just want to see the end result, here is the F1 sentiment analysis dashboard (best viewed on desktop).
- Getting the data
- Performing sentiment analysis
- Creation of an R Shiny app
Getting the data
What data are we going to use to calculate the fans’ sentiment about the drivers and teams? There are a few options here, e.g. Facebook, Twitter, other social media, or comments on blog posts. For this analysis I used Twitter data because I expected it to yield more data points (tweets in this case) than the other sources.
To this end, I created a Python script that leverages the Twitter streaming API and this Python library and collects all tweets that refer to a specific driver or team. For instance, tweets that contain ‘@alo_oficial’ refer to Fernando Alonso. A word of caution: these tweets are not written by him; they are fans’ tweets that refer to him. Using the Twitter streaming API was the only option if I wanted to get all the relevant tweets, because the limits of the REST API do not allow this.
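To make the matching step concrete, here is a minimal sketch of how a tweet's text could be mapped to the drivers or teams it mentions. It is not the actual collection script and it does not call the Twitter API. The `HANDLES` table and the `mentioned_entities` function are hypothetical names, and only the ‘@alo_oficial’ handle comes from this post.

```python
import re

# Hypothetical lookup table. Only '@alo_oficial' is taken from the post;
# the second entry is a placeholder, not a real account.
HANDLES = {
    "@alo_oficial": "Fernando Alonso",
    "@example_team": "Example Team",
}

# A mention is '@' followed by letters, digits or underscores.
MENTION = re.compile(r"@\w+")


def mentioned_entities(tweet_text):
    """Return the tracked drivers/teams mentioned in a tweet, in order of appearance."""
    found = []
    for mention in MENTION.findall(tweet_text.lower()):
        name = HANDLES.get(mention)
        if name is not None and name not in found:
            found.append(name)
    return found
```

Lower-casing the text before matching means that ‘@ALO_oficial’ and ‘@alo_oficial’ are treated as the same account.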
After finding all drivers’ and teams’ Twitter accounts (and discovering to my disappointment that neither Ferrari driver has one), I wrote this script and set up a cron job on an Amazon EC2 instance so that data collection runs around the clock. The script saves the results to CSV files (one file per week), which I download locally via an FTP client. To give an idea of the raw size of the data, this process has collected more than 2.5 million tweets since mid-February.
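The ‘one file per week’ layout can be sketched as follows. This is an illustration under my own assumptions rather than the script itself: the file naming scheme, the column names and the functions `weekly_csv_path` and `append_tweet` are all hypothetical, and weeks are taken to be ISO weeks.

```python
import csv
import os
from datetime import datetime

# Assumed column layout; the real script may store different fields.
FIELDS = ["created_at", "handle", "text"]


def weekly_csv_path(created_at, directory="."):
    """Build the path of the CSV file for the ISO week of a tweet timestamp."""
    iso_year, iso_week, _ = created_at.isocalendar()
    return os.path.join(directory, f"tweets_{iso_year}_W{iso_week:02d}.csv")


def append_tweet(row, created_at, directory="."):
    """Append one tweet to its weekly file, writing the header if the file is new."""
    path = weekly_csv_path(created_at, directory)
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({"created_at": created_at.isoformat(), **row})
    return path
```

Appending to a file chosen from the tweet's own timestamp means a long-running collector rolls over to a new file at the week boundary without any extra scheduling logic.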
Performing sentiment analysis
Sentiment Analysis is the process of determining whether a piece of writing is positive, negative or neutral. It’s also known as opinion mining, deriving the opinion or attitude of a speaker. A common use case for this technology is to discover how people feel about a particular topic (source: lexalytics.com).
There were a few options for performing this task. One obvious way would be to use an API (like lexalytics.com), but the number of texts you can analyze in the free trial version is quite low given the amount of data I collect. The other way is to train my own sentiment classifier. To pull this off, I’d need a training dataset of tweets with manually tagged sentiment. Using the training data, I’d train the model and then apply it to the F1 tweets.
And that’s what I did. Since I wanted to build something as quickly as possible, I followed this excellent blog post on analyzing sentiment using the text2vec R package. For those who don’t have time to check it out, the author uses a dataset of 1.6 million classified tweets, pre-processes them via TF-IDF and trains a linear model that achieves an AUC score of 0.875 on the test data. I won’t go into further detail on how the model calculates the sentiment. The main point here is that the model outputs a number between 0 and 1 for each tweet: 0 indicates that the tweet is totally negative and 1 that it is totally positive. A score of 0.5 means that the sentiment is neutral.
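For readers who want to see the idea in miniature, the sketch below weights words by TF-IDF and fits a logistic regression that returns a score between 0 and 1. It is a toy stand-in written in pure Python, not the text2vec pipeline from the post I followed. The four training tweets are made up, and all function names and hyperparameters are my own assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every token seen in training."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf(text, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen tokens are ignored."""
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(docs, labels, epochs=200, lr=0.5):
    """Logistic regression fitted with plain stochastic gradient descent."""
    idf = fit_idf(docs)
    vectors = [tfidf(doc, idf) for doc in docs]
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            p = sigmoid(bias + sum(weights[t] * v for t, v in vec.items()))
            err = p - y  # gradient of the log-loss with respect to the logit
            bias -= lr * err
            for t, v in vec.items():
                weights[t] -= lr * err * v
    return idf, weights, bias


def score(text, model):
    """Sentiment score in (0, 1): near 0 is negative, near 1 is positive."""
    idf, weights, bias = model
    vec = tfidf(text, idf)
    return sigmoid(bias + sum(weights[t] * v for t, v in vec.items()))


# Made-up training data: 1 = positive, 0 = negative.
TOY_TWEETS = [
    ("what a great race amazing win", 1),
    ("love this driver fantastic overtake", 1),
    ("terrible strategy awful pit stop", 0),
    ("hate this boring race bad result", 0),
]
MODEL = train([text for text, _ in TOY_TWEETS], [label for _, label in TOY_TWEETS])
```

Calling `score("amazing win", MODEL)` gives a value above 0.5 and `score("awful strategy", MODEL)` a value below it. A real classifier needs far more data than this, which is why the 1.6 million labelled tweets matter.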
Of course, I’m well aware of the limitations here. For instance, the training data (general tweets) and the actual data (F1 tweets) are different from each other, so the AUC measured on the test data probably overestimates how well the classifier performs on F1 tweets. Secondly, there may be other classifiers and data processing methods that achieve higher accuracy. As an AUC score of 0.875 is decent performance, I did not try different models or pre-processing techniques, although I may do so in the future.
After training this model, I applied it to the tweets referring to F1. For each driver and team, I calculated the average sentiment per day. For some drivers and teams, there are no tweets on some days; these are shown as NA in the data. This analysis is performed offline and the data will be manually updated every now and then. You can find the complete code here.
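The daily aggregation can be sketched like this. The actual analysis code is the one linked above; this is only an illustration. It assumes each scored tweet is an `(entity, day, score)` tuple and uses `None` to play the role of NA. The function name `daily_average` is hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_average(scored_tweets, start, end):
    """Average sentiment per entity and day; days without tweets become None (NA)."""
    buckets = defaultdict(list)
    for entity, day, score in scored_tweets:
        buckets[(entity, day)].append(score)

    entities = sorted({entity for entity, _, _ in scored_tweets})
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

    table = {}
    for entity in entities:
        row = []
        for day in days:
            scores = buckets.get((entity, day))
            row.append(sum(scores) / len(scores) if scores else None)
        table[entity] = row
    return days, table
```

Building the full calendar of days first, and only then looking up each day's scores, is what makes the gaps explicit instead of silently dropping them.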
R Shiny App
Designing and implementing a cool analysis is one thing; presenting it in a nice way is another. I could just show some aggregate stats and a few charts here in this blog post, but creating an interactive R Shiny app is more fun (and is actually not that hard to develop). You can find the current code for the Shiny app here.
In the dashboard you can select up to 3 drivers or teams and compare their fans’ sentiment over time. The dates of the races are shown on the chart so you can see whether a race result had a direct impact on the fans’ feelings. The dashboard reads the data from CSV files uploaded to Google Drive. It’s an easy solution if you want permanent data storage without messing around with databases.
There are a couple of things that pop up immediately. Firstly, drivers or teams with fewer tweets per day tend to show larger fluctuations. You can view the raw sentiment scores by setting the ‘smoothing’ slider to zero. Secondly, the classifier tends to produce scores exactly equal to 0.7021, which left no major differences across drivers and teams. Therefore, some tweets were removed, and the graph below shows the distribution of sentiment after this process.
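To illustrate what a ‘smoothing’ control does to a noisy daily series, here is a minimal sketch that uses a trailing moving average. The dashboard's actual smoothing method is not described here, so the moving average, the `smooth` function and its `window` parameter are assumptions for illustration only.

```python
def smooth(values, window):
    """Trailing moving average over the current day and the `window` days before it.

    None (NA) entries are skipped. A window of 0 returns the raw series,
    mirroring a slider set to zero.
    """
    if window <= 0:
        return list(values)
    out = []
    for i in range(len(values)):
        recent = [v for v in values[max(0, i - window): i + 1] if v is not None]
        out.append(sum(recent) / len(recent) if recent else None)
    return out
```

For example, `smooth([0.2, None, 0.8, 0.6], 1)` fills the missing second day with the previous day's 0.2 and pulls the last day towards its neighbour. Averaging over more days dampens the swings that show up when a daily mean rests on only a handful of tweets.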
Future improvements
- Show race results on top of the chart in order to ‘correlate’ the sentiment with the result
- Show top negative/positive tweets per driver or team selected
- Show a weekly aggregation of the sentiment score
- Improve sentiment analysis classifier
Any ideas to enhance the dashboard are, of course, welcome 🙂