Did We Predict the Hung Parliament?
In two of the most recent major democratic votes, Brexit and the U.S. elections, two things have happened.
1. The result has been a massive surprise.
2. Social media has been an accurate indicator of the outcome.
So when the snap election was announced in April this year, it got me thinking. Could we use Twitter and our Arrow reference architecture for big data to predict the outcome?
I decided it was worth a try.
The first thing I needed to decide was what I was actually going to measure. I settled on the percentage share of positive voice on Twitter for each of the main three parties.
The next task was to figure out how to measure this. I looked at the various metrics needed to quantify it on Twitter, and it quickly became clear that this was a much more complex task than I had imagined, due to the dimensionality of opinions and political biases.
Classifying each user with a single political bias and a single sentiment class doesn’t work, as they may talk positively about the Conservative party and then negatively about Labour.
fig.1 Twitter users' 10 dimensions of political opinion
Therefore, if I classified each user only once, I would miss out on impactful negative commentary.
But to keep pace with the flow of Twitter data we were detecting (around 1,000 messages a second at peak), I couldn’t classify every Tweet that came in, because I was relying on IBM Watson Cognitive REST APIs to provide the unstructured analytics.
I therefore implemented a number of ETL processes using the Arrow Reference Architecture for Big Data to refine the data pre-Watson and then speed up the processing once classification had occurred.
First, I used the keyword search API from Twitter to filter the returned results, then waited for 10 occurrences of a particular Twitter user ID. This helped me to focus on and analyse only users who were tweeting regularly about the election.
Once I had determined that someone was regularly tweeting about the election, it was then appropriate to further investigate them.
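A minimal Python sketch of this buffering step, assuming tweets arrive one at a time as a user ID plus text. The class name `UserTweetBuffer`, its `add` method and the choice to clear a user's buffer after handing over a batch are illustrative assumptions, not a description of the actual Arrow pipeline.

```python
from collections import defaultdict

# The threshold of 10 comes from the text; everything else here is hypothetical.
THRESHOLD = 10


class UserTweetBuffer:
    """Hold keyword-matched tweets per user until a user reaches the threshold."""

    def __init__(self, threshold=THRESHOLD):
        self.threshold = threshold
        self._pending = defaultdict(list)  # user_id -> list of tweet texts

    def add(self, user_id, text):
        """Store one tweet. Return the user's batch once it is full, else None."""
        batch = self._pending[user_id]
        batch.append(text)
        if len(batch) < self.threshold:
            return None
        # Assumption: hand the batch over for classification and start afresh.
        del self._pending[user_id]
        return batch


if __name__ == "__main__":
    buffer = UserTweetBuffer()
    for i in range(THRESHOLD):
        ready = buffer.add("user_123", f"tweet number {i}")
    # After the tenth tweet, `ready` holds the batch to send for classification.
    print(len(ready))
```

Buffering like this means only users who tweet repeatedly about the election ever reach the comparatively slow classification stage, which is the point of the filter described above.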
I sent the 10 tweets we had stored previously to a custom IBM Watson Natural Language Processing (NLP) Artificial Neural Network (ANN) to classify the user’s political sentiment and political bias.
At the same time it was important to determine the user’s influencer score. This is a metric I have calculated to determine the influence a particular user has on the Twitter audience.
I also wanted to determine each individual tweet’s influence score, which meant understanding how many times it was retweeted and favorited, and by whom.
We then took the positive commentary, plus the neutral commentary, minus the negative commentary - which gave us an arbitrary number for the size of voice.
We then represented this as a percentage share across the main three parties.
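The calculation above can be sketched in a few lines of Python. The function names, the party labels and the counts in the example are made up purely for illustration, and the sketch ignores the influence weighting because its formula is not given here. It also assumes the combined size of voice is positive.

```python
def size_of_voice(positive, neutral, negative):
    """Positive plus neutral minus negative commentary, as described above."""
    return positive + neutral - negative


def share_of_voice(counts):
    """Convert per-party commentary counts into a percentage share of voice.

    `counts` maps a party name to a dict with the keys
    'positive', 'neutral' and 'negative'.
    """
    voices = {party: size_of_voice(**c) for party, c in counts.items()}
    total = sum(voices.values())
    if total <= 0:
        # Assumption: a share is only meaningful when the total voice is positive.
        raise ValueError("total size of voice must be positive")
    return {party: 100.0 * voice / total for party, voice in voices.items()}


if __name__ == "__main__":
    # Illustrative numbers only, not election data.
    example = {
        "Party A": {"positive": 60, "neutral": 20, "negative": 20},
        "Party B": {"positive": 30, "neutral": 10, "negative": 10},
        "Party C": {"positive": 8, "neutral": 4, "negative": 2},
    }
    print(share_of_voice(example))
    # {'Party A': 60.0, 'Party B': 30.0, 'Party C': 10.0}
```

Note that a party with more negative than positive and neutral commentary would get a negative size of voice under this formula, so a real implementation would need to decide how to handle that case.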
So how did we do?
During the election we detected ~90 million tweets with political content. Of these, 3.3 million tweets came from 73,200 users who were detected tweeting more than 10 times with political keywords.
Our size of voice calculations saw Labour having almost double the size of voice of their competition in all our measurements. The Conservatives and Liberal Democrats had on average 1 / 15th the size of voice of Labour. See the graphic below.
fig.2 showing the size of positive, negative and neutral voice of each party
Correlating the data
Twitter defines its demographic as 37% of users aged between 18 and 29 and 25% aged between 30 and 49. In the UK there are an estimated 13 million users.
The election prediction system saw a consistent trend of Labour holding between 60% and 69% share of positive voice, the Conservatives 30% to 38% and the Lib Dems 1% to 3%.
fig.3 Showing the sustained share of voice levels across the three parties for the duration of the election.
This correlates well with the YouGov "How Britain voted at the 2017 general election" study, which shows that the age demographic that uses Twitter matches the age demographic that voted Labour, and at the percentage levels we saw during the run-up to the election.
This shows once again that Twitter was an accurate indicator of the outcome of the election, which again was not what the polls predicted.
fig.4 YouGov study showing the age demographics and how they voted.
To give a better idea of how all of this data looks visually, below is a photo of our big data reference architecture in action, displayed on the big screens in our Dowgate Office in London.