I’ll get back to the usual posting regimen on new research, but the recent election is a great time to popularize some ideas that are well known in the theory community, though perhaps not generally. Consider the problem of punditry. Nature is going to draw an election winner, perhaps in a correlated way, from 51 distributions representing each state plus DC. An “expert” is someone who knows the true distribution, e.g., “With .7 probability, independent from all other states, Obama will win in New Hampshire.” We wish to identify the true experts. You can see the problem: the true expert knows distributions, yet we who are evaluating the expert can only see one realization from each distribution.

When forecasts are made sequentially – imagine a weather forecaster declaring each day whether it will rain – there is a nice literature (done principally here at MEDS) on the problem of divining experts. Essentially, as first pointed out by Foster and Vohra in a 1998 Biometrika paper, imagine that you set a rule such that a true expert, who knows the underlying distribution each period, “passes” the rule with very high probability. It then turns out (this can be proven using a version of the minimax theorem) that a complete ignoramus who knows nothing of the underlying distribution can also pass your test. This is true *no matter what the test is*.

Now, the testing literature is interesting, but more interesting are the properties a good test for a forecaster might have. In an idea I first saw in a famous 1982 JASA paper, one minimally sensible rule might be called “calibration”. I am well-calibrated if, on the days when I predict rain with probability .4, it actually rains 40 percent of the time. Clearly this is not sufficient – I am well-calibrated if I simply predict the long-run empirical frequency of rain every day – but it seems a good minimum necessary condition. A law of large numbers argument shows that a true expert will pass a calibration test with arbitrarily high probability. With a lot of data points, we could simply bin predictions (say, all days where the prediction is between 40 and 45%) and graph each bin’s average prediction against the empirical frequency of rain on those days; a well-calibrated forecast would put every data point along the 45-degree line.
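The binning procedure just described is easy to sketch in code. This is a minimal illustration under my own assumptions about the data layout – forecasts arrive as (probability, 0/1 outcome) pairs – and the function name is mine, not from any standard library:

```python
# A minimal sketch of the binned calibration check described above.

def calibration_table(forecasts, outcomes, n_bins=10):
    """Bin predictions by stated probability and compare each bin's average
    forecast with the empirical frequency of the event in that bin."""
    bins = {}
    for p, y in zip(forecasts, outcomes):
        key = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the top bin
        bins.setdefault(key, []).append((p, y))
    table = []
    for key in sorted(bins):
        pairs = bins[key]
        mean_forecast = sum(p for p, _ in pairs) / len(pairs)
        empirical_freq = sum(y for _, y in pairs) / len(pairs)
        table.append((mean_forecast, empirical_freq, len(pairs)))
    # A well-calibrated forecaster's rows lie near the 45-degree line,
    # i.e. mean_forecast is close to empirical_freq in every bin.
    return table
```

For example, a forecaster who says .4 on ten days and sees rain on four of them produces a single row sitting exactly on the 45-degree line.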

Here is where we come to punditry. The recent election looks like a validation for data-driven pundits like Nate Silver, and in many ways it is; people were calling him and his peers idiots literally a week ago, yet Silver, Sam Wang and the rest more or less correctly named the winner in every state. But, you might say, what good is that? After all, aside from Florida, Intrade also correctly predicted the winner in every state. Here is where we can use calibration tests to figure out who is the better pundit.

Of the swing states which went for Obama, Intrade had Virginia and Colorado as tossups; Ohio, Iowa, New Hampshire and Florida as 2/3 favorites for the frontrunner (Obama in the first three, Romney in FL); and Wisconsin, Pennsylvania, Michigan, Nevada, Ohio and North Carolina as 75 to 85% chances for the frontrunner (Obama in the first five, Romney in NC). A well-calibrated prediction would have had half the tossup states go to each candidate, 67% of the second group of states go to the frontrunner, and 80% or so of the third group go to the frontrunner. That is, a well-calibrated Intrade forecast should have “missed” on .5*2+.33*4+.2*6, or roughly 3.5, states. Intrade actually missed only one.
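That arithmetic can be checked directly; a trivial sketch using the rounded probabilities quoted in the text:

```python
# Expected number of "missed" states for a well-calibrated forecast:
# sum over groups of (probability the frontrunner loses) x (states in group).
expected_misses = 0.5 * 2 + 0.33 * 4 + 0.2 * 6
print(round(expected_misses, 2))  # prints 3.52
```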

Doing the same exercise with Silver’s predictions, he had Florida as a tossup, Colorado as a .8 Obama favorite, NC a .8 Romney favorite, Iowa and NH about .85 Obama favorites, Ohio and Nevada about .9 Obama favorites, and the rest of the swing states a .95 or higher Obama favorite. A well-calibrated Silver forecast, then, should have been wrong on about 1.5 states. With Florida going to Obama, Silver will have correctly called every state. There is a very reasonable argument that Silver’s prediction would have been better had Florida gone to Romney! He would have called fewer states correctly in their binary outcomes, but his percentages would have been better calibrated.

That is, Silver is 1.5 states “off” the well-calibrated prediction, and Intrade 2.5 states “off”; we say, then, that Silver’s predictions were better. A similar calculation could be made for other pundits. Such an exercise is far better than the “how many states did you call correctly?” reckoning that you see, for example, here.
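The comparison can be sketched in a few lines. This is a minimal illustration using the grouped probabilities quoted above; the function name and the exact grouping are mine, and Silver’s near-certain states at .95 or above are omitted since they contribute almost nothing to the total:

```python
# Gap between the misses a well-calibrated forecast implies and the
# misses actually observed; a smaller gap suggests better calibration.

def calibration_gap(groups, actual_misses):
    """groups: list of (frontrunner probability, number of states)."""
    expected_misses = sum((1 - p) * n for p, n in groups)
    return abs(expected_misses - actual_misses)

# Intrade: 2 tossups, 4 states near 2/3, 6 states near .8; it missed one (FL).
intrade_gap = calibration_gap([(0.5, 2), (2 / 3, 4), (0.8, 6)], actual_misses=1)

# Silver: FL a tossup, CO and NC near .8, IA and NH near .85, OH and NV near .9;
# he missed none. (The .95-and-up states are omitted here.)
silver_gap = calibration_gap([(0.5, 1), (0.8, 2), (0.85, 2), (0.9, 2)], actual_misses=0)
```

By this measure Silver’s gap (about 1.4 under these assumptions) is smaller than Intrade’s (about 2.5), matching the conclusion above.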

Two caveats. First, the events here are both simultaneous and correlated. If we had an even number of Romney-leaning and Obama-leaning swing states, the correlation would matter much less, but given how many of the close states were predicted to go for Obama, you might worry that even true experts will either get no states wrong or a whole bunch of states wrong. This is a fair point, but in the absence of total correlation it is tangential to the general argument that forecasters of probabilistic binary events who get “everything right” should *not* be correct on every event. Second, you may note that forecasters like Silver also made predictions about vote shares and other factors. I agree that it is much more useful to distinguish good and bad forecasters using that more detailed, non-binary data, but even there, nature gives us only one realization from the true underlying distribution in each state, so the point about calibration still applies. (EDIT: I should note here that if you’re interested in calibrated forecasts, there are much more sophisticated ways of doing the type of analysis I did above, though with the same qualitative point. Google “Brier score” for a particularly well-known way to evaluate binary outcomes; the Brier score can be decomposed in a way that extracts something very similar to the more basic analysis above. In general, scoring rules belong to a branch of statistics that we economists very much like; unlike pure frequentism or Bayesianism, scoring rules and other Wald-style statistics implicitly set out a decision problem with a maximand before doing any analysis. Very satisfying.)
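For the curious, the Brier score mentioned in the edit is simple to state; a minimal sketch (the function name is mine):

```python
# Brier score for binary outcomes: the mean squared error between stated
# probabilities and 0/1 realizations. Lower is better; always saying .5
# scores .25, and perfect foresight scores 0.

def brier_score(forecasts, outcomes):
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)
```

For instance, `brier_score([0.9, 0.8, 0.5], [1, 1, 0])` is (0.01 + 0.04 + 0.25)/3 = 0.1. The standard (Murphy) decomposition splits this score into reliability, resolution, and uncertainty terms, and the reliability term captures essentially the calibration idea used above.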

Back to our regular style of post tomorrow.

My friend Andrew Thomas actually did a little calibration exercise for some of the vote-share predictions.

Great – thanks!

That Foster and Vohra paper’s conclusion is very interesting, but it went over my head. If you have time, could you explain that one? Thanks once more for this great blog!

Indeed, I wrote about that paper a couple of years back. If you click on “Expert Testing” as a category, you can find much more. It has also been pointed out to me that I remembered wrong: Foster/Vohra just handles testing with calibration, while Sandroni (2003) is the more general “all tests are manipulable” result.