I had decided to hold off talking about the election versus the media until everyone had calmed down. It's not that we don't have a lot of work ahead of us (we do), nor that there aren't plenty of things to talk about (there are), but everyone is so upset right now that it's hard to have a constructive argument.
But there is one thing that I need to address today, because it is about to mislead us all over again. And that is the constant narrative over the past few days that 'the polls failed us'.
This is simply not true, and blaming our failures on the polls is a lie. The polls were pretty good. What was terrible was our constant need to turn variable data into absolute facts.
Let me explain:
During this entire election, we have been told that Hillary Clinton would win the popular vote by about 4%, with a margin of error of about 3.5% on average.
This was exactly what happened. Hillary Clinton did win the popular vote, and when the remaining numbers come in, she is likely to end up winning the popular vote by 1-3%.
The polls weren't wrong about this. They were spot on.
The problem is that this wasn't the narrative in the media. In the media we talked about the 'chances of winning', which we presented like this (this is from FiveThirtyEight, but other media had the chances of winning even higher):
This might be mathematically correct in its theoretical form, but think about the narrative here. Think about what you are telling your readers.
What the media did was take data that was close to 50/50, and from these very minor fluctuations (well within the statistical variance that would render them irrelevant) conclude that Clinton had a 71.4% chance of winning.
Think of this narrative.
We are not saying that Clinton 'might win' or that she would probably win. No, we say that she has a 71.4% chance of winning, a very specific and absolute-sounding number that makes no sense to report given the margin of error. And it's a number that is massively higher than the 4% lead that the polls actually predicted.
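To see why a 'chance of winning' number is so fragile, here is a minimal sketch (not FiveThirtyEight's actual model) that converts a polling lead into a win probability under a simple normal-error assumption. The sigma values are purely illustrative:

```python
from math import erf, sqrt

def win_probability(lead_pct: float, sigma_pct: float) -> float:
    """P(candidate's true margin > 0), assuming the polling error
    around the observed lead is normally distributed with the
    given standard deviation (in percentage points)."""
    return 0.5 * (1 + erf(lead_pct / (sigma_pct * sqrt(2))))

# The same 4-point lead turns into very different headlines
# depending on how much total error we assume:
for sigma in (2, 4, 7):
    print(f"assumed error {sigma} pts -> {win_probability(4, sigma):.1%}")
```

Note that with an assumed error of about 7 points, a 4-point lead comes out near the famous 71.4% figure, while a tighter error assumption pushes the same lead above 95%. The headline probability is almost entirely a product of the uncertainty assumption, which is exactly what the narrative never mentioned.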
It's the same on the state level. A constant narrative over the past couple of days has been that "State-level polls were all wrong."
Again, no they weren't. They were mostly on point. I went over to the RealClearPolitics polling database and compared the polls with the actual election, and the result looks like this:
As you can see, the polls were generally pretty spot on, within or very close to the polled ranges. There are a few outliers (which we will get back to in a moment), but the real lesson here is just how big the variances are.
Take a place like Minnesota. If we only look at the polls themselves, we see that there was a 16 percentage point variance between them, predicting either that Clinton would win by 13 points, or that Trump would win by 3 points. And when we then account for the margin of error, we actually see a 24 percentage point variance between them.
But this was then reported as a sure win by the press, where we obsessed over single polls.
Democratic presidential nominee Hillary Clinton has expanded her lead over Republican Donald Trump in the state, according to a new Star Tribune Minnesota Poll.
Clinton leads Trump 47 to 39 percent in the poll of 625 registered Minnesota voters taken after last week's third and final presidential debate. It has a margin of sampling error of plus or minus 4 percentage points.
While she did not break 50 percent, Clinton made gains by nearly every one of the Minnesota Poll's measures. She leads among voters between ages 18 and 64, with her biggest lead in the 18-34 group; Trump catches up only among voters 65 and older, where the two candidates are tied.
This sounds like a sure thing.
The result, of course, was that Clinton didn't win 47-39; she won 46.9-45.4. Instead of an 8-point lead, she had only a 1.5-point lead.
So, were the polls wrong? No. Here is the graph for Minnesota:
What went wrong here was not the polls. The polls overall predicted exactly what would likely happen within the range of uncertainty that exists by only polling around 625 people each time. Within the limits of the data, the polls were spot on.
The people who failed here were us, the media, who obsessed over the numbers from a single poll, rather than thinking of them as part of a pattern. We should not report the result of a single poll, because that is insane given how few people we measure.
Instead, whenever a new poll comes in, we should add it to the range of polls we already have so that we over time get a more refined picture.
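A minimal sketch of that idea, using made-up margins for a single state (not real polls), would be to report the average and the spread together, rather than the latest number alone:

```python
import statistics

# Hypothetical Clinton-minus-Trump margins from successive polls
# in one state (illustrative numbers only):
margins = [13, 8, 5, 10, -3, 6]

# Pool every poll instead of headlining the newest one.
avg = statistics.mean(margins)
spread = max(margins) - min(margins)

print(f"average margin: {avg:+.1f} points")
print(f"spread between polls: {spread} points")
```

Reporting "+6.5 on average, with polls disagreeing by 16 points" is a very different, and far more honest, story than "the latest poll shows a 13-point lead".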
Another problem is the margin of error. Polling companies repeatedly report ridiculously low margins of error, which is made worse by the press, who report them as hard data points.
A margin of error is not a data point. It's a made-up guess of how wrong the polling company thinks their data might be, given how few people they measured.
This is based on a standard formula: take the standard deviation, divide it by the square root of the sample size, and multiply it by a factor tied to a chosen confidence level (usually 95%).
It all sounds very fancy, but it only accounts for random sampling error. It says nothing about nonresponse, flawed likely-voter models, or undecided voters changing their minds. As a measure of how wrong a poll might actually be, it is completely unreliable.
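For reference, the textbook calculation behind a reported margin of error looks roughly like this (a sketch for a simple random sample; real polls add weighting on top):

```python
from math import sqrt

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Classic margin of error in percentage points: z times the
    standard error of a proportion. This only covers random
    sampling error -- not nonresponse, undecideds, or bad
    likely-voter models."""
    return z * sqrt(p * (1 - p) / n) * 100

# A poll of 625 registered voters, like the Star Tribune poll:
print(f"{margin_of_error(625):.1f} points")
```

For 625 respondents this comes out just under 4 points, which is exactly the "plus or minus 4 percentage points" the Star Tribune reported. The formula is fine as far as it goes; the problem is treating it as the total uncertainty.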
Again, look at Minnesota. Here the polls themselves had a 16 percentage point variance between them, and yet each poll reported only a 4% margin of error. Doesn't anyone realize how ridiculous that is?
This is yet another example of us looking at the data and drawing completely insane conclusions from it.
There is generally nothing wrong with the polls. It's the narrative that fails. It's the media's unwillingness to report a 16 percentage point variance in the data. We are so afraid of telling people that we aren't sure about something that we instead focus on a single data point so that we can say that Clinton will get the specific number of 47% of the votes.
And as a result, we end up looking like idiots after every election. Our narrative told people something that turned out not to be true, even though the data was pretty good.
We need to stop this. We are shooting ourselves in the foot here.
But what about the outliers? What about the few states in which the polls were clearly not on point? Well, let's look at that.
As you can see in the graph above, there were a few states where the actual results ended up being wildly different from the polls. Some of this can be explained by the unrealistically narrow margins of error, and by not having enough polls to compare against.
But let's take a deeper look. First, we have South Dakota. Here the polls predicted that Trump would win by 7 to 14 percentage points (which is more like -3 to 24 points if you account for the real margin of error). As it happened, he actually won by 29.8 percentage points over Clinton.
That's quite a difference.
So what happened? Well, let's look at what polls actually found.
If we look at the latest poll before the election (which only had a sample size of 600 people), we find this:
As you can see here, this one poll found that Clinton would get 35% of the votes, but she ended up with 31.7%. That's well within the expected margin of error, so nothing wrong with that.
But the real wildcard here was the undecided voters, who almost entirely went to Trump, along with a small share of 3rd party voters who also seem to have shifted to Trump (although this too is within the margin of error).
So, this poll looks spectacularly wrong because it predicted only a 14 percentage point win for Trump, when he actually won by 29.8 points.
But again, it's not the data that is wrong. It's our conclusions. We concluded that the undecided would vote like everyone else. We ignored them as a data point, and mistakenly believed they wouldn't shift the data.
This is a common problem that we see not just in the media but with everyone who works with studies. If 40% of a known audience choose option A and 60% choose option B, we have a tendency to assume the unknown audience will split the same way.
But, this is generally not how the world works. And this is why we got it wrong.
It was not the poll that was wrong. It was our narrative stating that the variances didn't matter, and that the undecided wouldn't drastically turn to just one candidate. It was our conclusions that were wrong.
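Using the South Dakota numbers from above, a small sketch shows how much the result swings depending on what the undecided and 3rd party voters actually do:

```python
def final_margin(trump: float, clinton: float, rest: float,
                 rest_to_trump: float) -> float:
    """Projected Trump margin (in points) if `rest_to_trump` of the
    undecided/3rd-party share breaks for Trump and the remainder
    for Clinton."""
    t = trump + rest * rest_to_trump
    c = clinton + rest * (1 - rest_to_trump)
    return t - c

# South Dakota poll: Trump 49, Clinton 35, leaving 16 points
# of undecided and 3rd party voters (figures from the article).
print(final_margin(49, 35, 16, 0.5))  # even split: the +14 we reported
print(final_margin(49, 35, 16, 1.0))  # all break to Trump: +30
```

If the undecideds split evenly, you get the 14-point win the poll implied; if they all break one way, you get the roughly 30-point win that actually happened. Both outcomes were sitting in the same data the whole time.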
We decided to report this poll as having only a 4% margin of error, when we knew the real uncertainty was more like 8-16% (the reported margin of error plus the share of undecided voters).
How stupid is that?
Why are we doing this? Why are we reporting numbers as absolutes when the data clearly shows us that there is a huge level of uncertainty in them?
Also keep in mind that if a poll asks 400 people how they would vote, and 3% say they would vote for the 3rd party candidate Gary Johnson, that's only 12 people. So it's not surprising that, when the election comes around, the 3rd party numbers are a bit sketchy.
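A quick sketch of the arithmetic makes the point: a tiny subgroup in a small sample carries a margin of error that is huge relative to the subgroup itself.

```python
from math import sqrt

n = 400                       # poll sample size
p = 0.03                      # share naming the 3rd party candidate
respondents = n * p           # just 12 actual people
moe = 1.96 * sqrt(p * (1 - p) / n) * 100   # in percentage points

print(f"{respondents:.0f} people; {p * 100:.0f}% +/- {moe:.1f} points")
```

A reading of 3% plus or minus roughly 1.7 points means the true 3rd party share could plausibly be anywhere from under 1.5% to nearly 5%; the estimate could be off by more than half of itself in either direction.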
And we see the same pattern with the other outliers. Here we have West Virginia:
The problem in West Virginia was that it was only polled once, and that was back in August (which means a lot probably changed since then), and that poll covered only 385 people. So there is a massive amount of uncertainty here.
But just like in South Dakota, the pattern we see is that pretty much all the undecided voters went to Trump, with a substantial share of the 3rd party support going to Trump as well.
It's the same in Tennessee:
Again we see that the polls were pretty much spot on about Clinton, but all the undecided turned to Trump.
It's not the polls that were wrong. It's how we read the data. We knew that there were almost 10% undecided voters who might lean either way, and we knew that this poll only measured 600 people. And yet, we decided that we only had a 4% margin of error when in reality it was much higher.
Finally, there is one more thing we need to discuss, and that is what happened in all the states where either Clinton or Trump could have won.
In other words, these states:
As you can see here, for almost all of these states, the variance between the polls and the actual election results was well within the boundaries. Sure, Iowa and Ohio look a bit fuzzy, but most polls in both states had Trump winning overall. So the actual result is well within what was predicted.
The only three outliers here are Nevada, Pennsylvania and Wisconsin.
Nevada is an outlier because the polls generally showed Trump winning, whereas it was Clinton who actually won. The same goes for Pennsylvania, where most of the polls showed Clinton winning, but Trump actually won.
But again, it's well within the boundaries.
The only real problem was the polling from Wisconsin, which showed a win for Clinton by 6-8 percentage points, when it was actually Trump who ended up winning by 2 points.
So what happened there? Well, here is one example of the polls versus the actual results:
This is quite interesting. What we see is that more people ended up voting for Clinton than the poll predicted (even though she lost the state), but the wildcard here again is how the undecided all went to Trump, and how the massive 12% who said they wanted to vote for a 3rd party candidate ended up being only 4.7% ... with the rest voting for Trump instead.
It's the same pattern as before.
Again, the poll wasn't really wrong as such (although this one is a bit iffy). It correctly predicted how many votes Clinton would get, but we failed to understand the undecided voters (which is not the poll's fault), and the poll did misjudge the 3rd party support.
BTW: There is an interesting sub-story here about why the undecided reacted this way, because we saw the same thing with Brexit.
The likely reason is that people might have agreed with Trump's confrontational views but didn't want to express that in public. But another problem is also that we failed to look at the bigger picture.
One of the things we have seen from studies done by Gallup is that the US public is really tired of Washington. In listing what people consider to be the 'most important problem', people responded:
And when we look specifically at the trust in government, we see how that has sharply declined for more than 10 years.
With this in mind, it's perhaps not surprising that the undecided so strongly turned to Trump. He was their spokesperson against 'crooked Washington', and he even attacked his own party, while Clinton very firmly represented the establishment (nothing would have changed if she had been elected; it would just have been another four years of total government gridlock).
But this is another lesson for us in the media.
We have a tendency to only look at the moment. We only look at the latest polls, and even within those polls only at the simplest numbers. We don't put in the effort to take a step back and look at the whole, or at the macro trends that would have helped us understand why things were happening the way they were.
We need to stop blaming the data and instead blame our own narratives. We need to be better at not just reporting and reacting to what is happening around us, and spend more time analysing the larger issues. We don't need less data; we need more of it, from different angles. And we need to embrace the unknowns in that data instead of rushing to conclusions.
For instance, before the start of the election, how many newspapers went out to study what concerns people actually had so that they could design their editorial coverage and focus on that?
Why didn't we do that?