Given that we had many of the components in place and a "certain level" of trust in the power of social media*, we decided to go for a wild experiment: predict the results of the (now concluded) Euro elections using social signals, and publish the result before the polls closed. So, 48 days before the elections, we started monitoring the online discussions on Twitter around the names of the most popular parties (using their official names, Twitter account names and some frequently used abbreviations). All in all, we collected around 370K tweets during the pre-election period. At the same time, we also collected poll results from multiple sources in order to use them as a kind of "ground truth" for building our predictive models.
We then computed a set of features from the collected tweets. These features captured the percentage of messages mentioning a given party, the number of unique user accounts discussing it, and the sentiment (positive/negative) expressed towards each party. The features were computed on a daily and per-party basis. Due to the lack of a Greek sentiment lexicon (we are working on it!) and the not-so-effective performance of supervised learning approaches for sentiment detection in Greek, we had to turn to a heuristics-based method for automatically detecting the sentiment polarity (positive/negative) of a tweet: we used three different lexicons in English and translated them into Greek using Google Translate. To detect a tweet's sentiment, we used a naïve counting method and assigned the majority class label (positive/negative) to it. Finally, we applied a 7-day moving average filter to all features to smooth their values over time.
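To make the two heuristics concrete, here is a minimal sketch of lexicon-based majority-vote sentiment labelling and trailing 7-day smoothing. The lexicons and token lists are placeholders, not the actual translated resources we used.

```python
def tweet_sentiment(tokens, positive_lexicon, negative_lexicon):
    """Naive counting: assign the majority polarity of matched lexicon words."""
    pos = sum(1 for t in tokens if t in positive_lexicon)
    neg = sum(1 for t in tokens if t in negative_lexicon)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # ties or no matches carry no polarity signal


def moving_average(values, window=7):
    """Smooth a daily feature series with a trailing moving average."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

For example, `tweet_sentiment(["great", "bad", "great"], {"great"}, {"bad"})` returns `"positive"`, and early days of a series are averaged over however many days are available so far.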
In terms of public polls, we aggregated the results published by different polling companies between April 6th and May 24th to serve as target values for our predictive models. Every poll is usually conducted over a short period (2-3 days) on a sample of the country's population (normally around 1,000 to 1,400 people). For every poll, we initially set aside the percentage of undecided participants. We then took the percentage reported for every party on the polling dates as the actual percentage it would achieve if the elections were held on one of those days. When two or more polls were conducted on the same date, we used the sample size of every poll as a weight and assigned the weighted average percentage to every party. Finally, after calculating the percentages of every party for every date with a poll, we redistributed the "undecided" voters to all parties in proportion to their percentages.
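The two aggregation steps can be sketched as follows; the data structures are illustrative assumptions (a list of `(sample_size, {party: percentage})` pairs), not our actual pipeline code.

```python
def aggregate_polls(polls):
    """Combine same-day polls into a single estimate, weighting each poll
    by its sample size. `polls` is a list of (sample_size, {party: pct})."""
    total = sum(size for size, _ in polls)
    parties = polls[0][1].keys()
    return {p: sum(size * pcts[p] for size, pcts in polls) / total
            for p in parties}


def redistribute_undecided(pcts):
    """Assign the 'undecided' share to parties in proportion to their
    decided percentages, rescaling so the result sums to 100."""
    decided = sum(pcts.values())
    return {p: v * 100.0 / decided for p, v in pcts.items()}
```

For instance, two equal-sized polls reporting 30% and 40% for a party average to 35%, and two parties at 40% each (with 20% undecided) both rescale to 50%.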
Using the extracted features above and the target values from public polls, we trained four different regression models (Linear Regression, ε-SVM, Sequential Minimal Optimization, Gaussian Process) and produced the Euro election predictions about three hours before the polls closed. To be more precise, as input features we used both the Twitter features and the poll results at day t, and as the output we set the poll results at day t+1. On days when no poll was published, we used interpolation to derive a proxy value for that day.
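A minimal sketch of this day-ahead set-up, using linear interpolation for poll-free days and ordinary least squares standing in for the four regression models we actually trained (all names and shapes below are illustrative assumptions):

```python
import numpy as np


def interpolate_polls(days, observed_days, observed_pcts):
    """Fill in a proxy poll percentage for days without a published poll."""
    return np.interp(days, observed_days, observed_pcts)


def fit_day_ahead(features, polls):
    """features: (T, d) daily Twitter-feature matrix; polls: (T,) daily
    poll series for one party. Inputs at day t are [features_t, poll_t];
    the regression target is poll_{t+1}."""
    X = np.hstack([features[:-1], polls[:-1, None]])
    X = np.hstack([X, np.ones((len(X), 1))])  # intercept term
    y = polls[1:]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w


def predict_next(w, feature_row, poll_today):
    """One-step-ahead prediction for tomorrow's poll percentage."""
    x = np.concatenate([feature_row, [poll_today, 1.0]])
    return float(x @ w)
```

On a toy series that grows by one point per day, the fitted model recovers that trend and predicts the next day's value accordingly.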
Note that this is one of the few times that such predictions were released before Exit Poll estimates or official results were made public. Obviously, we were very curious to see what was going to happen...
The results turned out to be pretty accurate! The table below presents our predictions side-by-side with a number of different polls (note that for polls and exit-polls we re-adjusted the results by assigning all "undecided" equally to all parties), the Exit-Polls, and the actual results at the time of writing.
[Table: per-party percentages from SocialSensor, Poll Watch, Meta Polls, ΚΑΠΑ Research, Pulse, Alco, Rass, Public Issue, GPO and Marc, alongside the Exit Polls and the actual results]
To get a better idea of how close each poll or estimate is to the actual results, the following table presents the Mean Square Error between each of them and the actual results:
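For reference, the Mean Square Error used for this comparison is simply the average of the squared per-party differences between a poll's percentages and the actual results; a short sketch:

```python
def mse(predicted, actual):
    """Mean Square Error between predicted and actual party percentages."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
```

For example, predictions of (30%, 20%) against actual results of (28%, 21%) give an MSE of (4 + 1) / 2 = 2.5.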
[Table: Mean Square Error between each poll or estimate (Poll Watch, Meta Polls, ΚΑΠΑ Research, Pulse, Alco, Rass, Public Issue, GPO, Marc) and the actual results]
As can be seen, the SocialSensor approach produces better estimates than almost all competing poll results (with the exception of GPO and Marc, which outperformed even the Exit Polls) and approaches the accuracy of the Exit Poll prediction. The SocialSensor predictions are especially accurate on the top four parties compared to the other polls. Also note that the SocialSensor estimates are much cheaper to produce than the alternatives, since they are mostly based on automatic data analysis components. Obviously, such results should be taken with a grain of salt, since election prediction is an inherently challenging problem with a large number of factors affecting the outcome. It will be interesting to see whether this approach (or an extension of it) can be successfully (and repeatedly) applied in the future.
More details on our approach will be provided as part of the upcoming deliverable D2.3 Social stream mining framework and will be submitted for publication to an upcoming venue. For further information, please get in touch with Adam Tsakalidis (@adtsakal) and Symeon Papadopoulos (@sympapadopoulos).
* Obviously there has been much prior research work in this exciting area. Tumasjan et al. found a strong correlation between the volume and sentiment of tweets and the strength of political parties. In contrast, Metaxas et al. provide a critical review of the task, pointing out its extremely challenging aspects. Choy et al. combine sentiment analysis with reweighting techniques to predict the outcome of the 2011 Presidential election in Singapore. Finally, Tjong Kim Sang and Bos used entity counts and sentiment analysis to predict the results of the 2011 Dutch Senate Election. More recently, Lampos et al. argue that a substantial amount of messages referring to election entities (parties, candidates) should be filtered out and not used for predictive modelling, and propose a highly effective method to improve prediction performance.