Averaging Polls and Sampling Duration
In deciding how to aggregate public opinion data, the ORACLE team was tasked with determining which polls should have more influence on the model. A poll taken one day before the election, for example, should have a larger impact on the average than a poll taken seventy-six days before the election. This scheme can apply to many factors, including poll grade, sample size, and more.
Although our class ultimately voted to use the z-test method, in which we take a new batch of polls each week and replace the old ones if necessary, we are still curious whether a poll's duration -- the time elapsed between its first and last response -- factors into its accuracy.
To explain why duration could be an important factor, let's use an example. Imagine a tight race between Candidate A and Candidate B. A polling company plans to take two national polls, one over the course of a week, and another on just the Monday of that week. On Sunday night, a viral news story alleges that Candidate A habitually steals gum from his local pharmacy. But early Tuesday morning, the report is discredited when Candidate A claims he has been allergic to gum his entire life.
Here’s how the polls turned out:
| Poll | Candidate A | Candidate B |
|---|---|---|
| Week-long | 48% | 46% |
| Monday only | 40% | 52% |
There's an eye-popping difference between these two polls, and it's not hard to see why. Every respondent to the Monday poll was under the impression that Candidate A is a shoplifter, while only about 1 in 7 respondents to the week-long poll had the same misconception. This is an example of a low-duration poll capturing a brief shift in momentum, which can lead to incorrect interpretation of the data.
To test this concept, we weighted the polls in our 2016 collection by duration to see if the new average was considerably different. Because some poll durations were outliers (a number of polls lasted 20 or 30 days), we took the logarithm of each duration to bring values closer together. Then we multiplied each poll's Clinton percentage by the logarithm of its duration, summed the products, and divided by the sum of the logarithmic durations. Nationally, the two-party Clinton percentage went from 51.3% (before adjustment) to 51.1% (after adjustment), showing little, if any, difference.
| State | Two-party Clinton | Weighting | New two-party Clinton |
|---|---|---|---|
| US | 0.513 | Multiply every poll by its duration, divide by sum | 0.505 |
| US | 0.513 | Multiply every poll by log(duration + 1), divide by sum | 0.511 |
| Michigan | 0.533 | Multiply every poll by its duration, divide by sum | 0.538 |
| Michigan | 0.533 | Multiply every poll by log(duration + 1), divide by sum | 0.532 |
| Pennsylvania | 0.526 | Multiply every poll by its duration, divide by sum | 0.525 |
| Pennsylvania | 0.526 | Multiply every poll by log(duration + 1), divide by sum | 0.526 |
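The log-duration weighting described above can be sketched as follows. The poll records and field names (`clinton_pct`, `duration_days`) are invented for illustration; they are not the actual 2016 collection.

```python
import math

# Hypothetical poll records: Clinton share as a fraction, duration in days.
polls = [
    {"clinton_pct": 0.52, "duration_days": 3},
    {"clinton_pct": 0.50, "duration_days": 7},
    {"clinton_pct": 0.55, "duration_days": 25},
]

def log_duration_average(polls):
    """Weight each poll's Clinton share by log(duration + 1), so outlier
    durations (20 or 30 days) are pulled closer to the rest."""
    weights = [math.log(p["duration_days"] + 1) for p in polls]
    weighted_total = sum(w * p["clinton_pct"] for w, p in zip(weights, polls))
    return weighted_total / sum(weights)

print(f"{log_duration_average(polls):.4f}")
```

The `+ 1` inside the logarithm matches the weighting shown in the table and keeps one-day polls from receiving a weight of zero.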
Another way we checked the effect of duration was to test whether polls lasting longer than 2 days were more accurate than the full set of polls. After applying this check to swing states that Trump carried, including Michigan, Florida, and Ohio, we found no state with a difference greater than 1 percentage point.
| State | Rawpoll_Clinton | Rawpoll_Clinton (D > 2) | 2p_Clinton | 2p_Clinton (D > 2) |
|---|---|---|---|---|
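The duration filter can be sketched like this. Again, the poll records are made up for illustration; the 1-percentage-point threshold is the one used in the comparison above.

```python
# Hypothetical poll records for one state; not the actual 2016 data.
polls = [
    {"clinton_pct": 0.51, "duration_days": 1},
    {"clinton_pct": 0.53, "duration_days": 4},
    {"clinton_pct": 0.52, "duration_days": 6},
    {"clinton_pct": 0.50, "duration_days": 2},
]

def mean_clinton(subset):
    """Unweighted average of the Clinton share across a set of polls."""
    return sum(p["clinton_pct"] for p in subset) / len(subset)

all_polls_avg = mean_clinton(polls)
long_polls_avg = mean_clinton([p for p in polls if p["duration_days"] > 2])

# We checked whether this gap ever exceeded 1 percentage point per state.
gap = abs(long_polls_avg - all_polls_avg)
print(f"all: {all_polls_avg:.3f}, D>2: {long_polls_avg:.3f}, gap: {gap:.3f}")
```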
Our results do not confirm the duration effect we were looking for. There are several possible reasons for its absence, the most notable being that short-term momentum shifts do not radically change poll results. This makes sense, as research has shown that the majority of voters know whom they are voting for months before the election.
Overall, the 2018 method of averaging polls was not chosen by the 2020 class, partly because it overpredicted Clinton's vote. The z-test method described earlier overpredicted Clinton by only 1.36 percentage points, while the averaging method (which weighted by time) overpredicted her by 3 percentage points.
When looking at late polls in this election cycle, it may be helpful to analyze a poll's key qualities -- timing, grade, sample size -- before deciding whether that poll offers any new information. However, no individual feature should be considered make-or-break. Following the 2020 results, pollsters should examine whether duration was a better predictor of accuracy than it was in 2016.