Category Archives: Expert Testing

“At Least Do No Harm: The Use of Scarce Data,” A. Sandroni (2014)

This paper by Alvaro Sandroni in the new issue of AEJ:Micro is only four pages long, and has only one theorem whose proof is completely straightforward. Nonetheless, you might find it surprising if you don’t know the literature on expert testing.

Here’s the problem. I have some belief p about which events (perhaps only one, perhaps many) will occur in the future, but this belief is relatively uninformed. You come up to me and say, hey, I actually *know* the distribution, and it is p*. How should I incentivize you to truthfully reveal your knowledge? This step is actually an old one: all we need is something called a proper scoring rule, the Brier Score being the most famous. If someone makes N predictions f(i) about the probability of binary events i occurring, then the Brier Score is the sum of the squared difference between each prediction and its outcome {0,1}, divided by N. So, for example, if there are three events, you say all three will independently happen with p=.5, and the actual outcomes are {0,1,0}, your score is 1/3*[(.5-1)^2+2*(.5-0)^2], or .25. The Brier Score being a proper scoring rule means that your expected score is lowest if you actually predict the true probability distribution. That being the case, all I need to do is pay you more the lower your Brier Score is, and if you are risk-neutral you, being the expert, will truthfully reveal your knowledge. There are more complicated scoring rules that can handle general non-binary outcomes, of course. (If you don’t know what a scoring rule is, it might be worthwhile to convince yourself why a rule equal to the summed absolute value of deviations between prediction and outcome is not proper.)
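To make the arithmetic concrete, here is a small Python sketch that recomputes the three-event example and then compares squared loss with absolute loss. The helper names, the grid of candidate reports, and the true probability of .7 are my own illustrative assumptions, not anything from the paper.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecasts and realized 0/1 outcomes."""
    return sum((f - x) ** 2 for f, x in zip(forecasts, outcomes)) / len(forecasts)


def squared_loss(report, outcome):
    return (report - outcome) ** 2


def absolute_loss(report, outcome):
    return abs(report - outcome)


def expected_loss(report, true_prob, loss):
    """Expected loss of a report when the event occurs with probability true_prob."""
    return true_prob * loss(report, 1) + (1 - true_prob) * loss(report, 0)


# Candidate reports 0.00, 0.01, ..., 1.00 (an illustrative grid).
grid = [i / 100 for i in range(101)]
true_prob = 0.7  # assumed true probability, for illustration only

best_squared = min(grid, key=lambda r: expected_loss(r, true_prob, squared_loss))
best_absolute = min(grid, key=lambda r: expected_loss(r, true_prob, absolute_loss))

print(brier_score([0.5, 0.5, 0.5], [0, 1, 0]))  # 0.25
print(best_squared)   # 0.7
print(best_absolute)  # 1.0
```

Under squared loss the best report is the true probability, while under absolute loss the expert does best by exaggerating all the way to 1, which is why the absolute-deviation rule is not proper.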

That’s all well and good, but a literature over the past decade or so called “expert testing” has dealt with the more general problem of knowing who is actually an expert at all. It turns out that it is incredibly challenging to screen experts from charlatans when it comes to probabilistic forecasts. The basic (too basic, I’m afraid) reason is that your screening rule can only condition on realizations, but the expert is expected to know a much more complicated object, the probability distributions of each event. Imagine you want to use the following rule, called calibration, to test weathermen: on days where rain was predicted p=.4, it actually does rain close to 40 percent of those days. A charlatan has no idea whether it will rain today or tomorrow, but after making a year of predictions, notices that most of his predictions are “too low”. When rain was predicted with .6, it rained 80 percent of the time, and when predicted with .7, it rained 72 percent of the time, etc. What should the charlatan do? Start predicting rain every day, to become “better calibrated”. As the number of days grows large, this trick gets the charlatan closer and closer to calibration.

But, you say, surely I can notice such an obviously tricky strategy. That implicitly means you want to use a more complicated test to screen the charlatans from the experts. And a famous result of Foster and Vohra (which apparently was very hard to publish because so many referees simply didn’t believe the proof!) says that any test which passes experts with high probability for any realization of nature as the number of predictions gets large can be passed by a suitably clever and strategic charlatan with high probability. And, indeed, the proof of this turns out to be a straightforward application of an abstract minimax theorem proven by Fan in the early 1950s.

Back, now, to the original problem of this post. If I know you are an expert, I can get your information with a payment that is maximized when a proper scoring rule is minimized. But what if, in addition to wanting info when it is good, I don’t want to be harmed when you are a charlatan? And further, what if only a single prediction is being made? The expert testing results mean that screening good from bad is going to be a challenge no matter how much data I have. If you are a charlatan and are always incentivized to report my prior, then I am not hurt. But if you actually know the true probabilities, I want to pay you according to a proper scoring rule. Try this payment scheme: if you predict my prior p, then you get a payment ε which does not depend on the realization of the data. If you predict anything else, you get an expected payment based on a proper scoring rule, and that expected payment is greater than ε. So the informed expert is incentivized to report truthfully (there is a straightforward modification of the above if the informed expert is not risk-neutral). How can we get the charlatan to always report p? If the charlatan has minmax preferences as in Gilboa-Schmeidler, then the payoff is ε if p is reported no matter how the data realizes. If, however, the probability distribution actually is p, and the charlatan ever reports anything other than p, then since payoffs are based on a proper scoring rule, in that “worst-case scenario” the charlatan’s expected payoff is less than ε, hence she will never report anything other than p due to the minmax preferences. I wouldn’t worry too much about the minmax assumption, since it makes quite a bit of sense as a utility function for a charlatan that must make a decision what to announce under a complete veil of ignorance about nature’s true distribution.
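One way to see the logic is with a toy version of such a contract for a single binary event. The sketch below is only an assumed construction in the spirit of the scheme described above, not necessarily the contract in the paper: reporting the prior pays a flat ε, and any other report pays ε plus the improvement in quadratic score over the prior. The names `payment` and `worst_case_payment`, the value of ε, and the numbers .5 and .8 are all hypothetical, and in this simplified version realized payments can be negative.

```python
EPSILON = 0.05  # flat payment for echoing the tester's prior (illustrative value)


def payment(report, outcome, prior, epsilon=EPSILON):
    """Flat epsilon for reporting the prior; otherwise epsilon plus the
    improvement in quadratic score of the report over the prior."""
    if report == prior:
        return epsilon
    return epsilon + (prior - outcome) ** 2 - (report - outcome) ** 2


def expected_payment(report, true_prob, prior, epsilon=EPSILON):
    """Expected payment when the event truly occurs with probability true_prob."""
    return (true_prob * payment(report, 1, prior, epsilon)
            + (1 - true_prob) * payment(report, 0, prior, epsilon))


def worst_case_payment(report, prior, epsilon=EPSILON, steps=100):
    """Worst-case evaluation: lowest expected payment over a grid of possible truths."""
    return min(expected_payment(report, k / steps, prior, epsilon)
               for k in range(steps + 1))


prior = 0.5   # the tester's uninformed belief (assumed)
truth = 0.8   # what the informed expert knows (assumed)
grid = [i / 100 for i in range(101)]

informed_best = max(grid, key=lambda r: expected_payment(r, truth, prior))
charlatan_best = max(grid, key=lambda r: worst_case_payment(r, prior))

print(informed_best)   # 0.8
print(charlatan_best)  # 0.5
```

The informed expert, who evaluates payments under the true distribution, reports .8 and expects more than ε. The charlatan, who evaluates each report by its worst case, sees that any report other than the prior does strictly worse than ε when the truth happens to equal the prior, and so echoes the prior.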

Final AEJ:Micro version, which is unfortunately behind a paywall (IDEAS page). I can’t find an ungated version of this article. It remains a mystery why the AEA is still gating articles in the AEJ journals. This is especially true of AEJ:Micro, a society-run journal whose main competitor, Theoretical Economics, is completely open access.


“How Better Information Can Garble Experts’ Advice,” M. Elliott, B. Golub & A. Kirilenko (2012)

Ben Golub from Stanford GSB is on the market this year following a postdoc. This paper, which I hear is currently under submission, makes a simple and straightforward theoretical point, but it does have some worrying implications for public policy. Consider a set of experts which society queries about the chance of some probabilistic event; the authors mention the severity of a flu, or the risk of a financial crisis, as examples. These experts all have different private information. Given their private information, and the (unmodeled) payoff they receive from a prediction, they weigh the risk of type I and type II errors.

Now imagine that information improves for each expert (restricting to two experts as in the paper). With the new information, any previously achievable combination of type I and type II errors is still achievable, and there is now the possibility of making predictions with strictly fewer type I and type II errors. This means that the “error frontier” expands outward for each expert. To be precise, suppose each agent gets a signal in [0,1] whose cdf is G(i) for expert i if the event will actually occur. A new signal that generates a second cdf G2(i) which first order stochastically dominates G(i) is an information improvement. Imagine both experts receive information improvements. Is this socially useful? It turns out that it is not necessarily a good thing.

How? Imagine that expert 1 is optimizing by making x1 type I errors and y1 type II errors given his signal, and expert 2 is optimizing by making x2 type I errors and y2 type II errors. Initially expert 1 is making very few type I errors, and expert 2 is making very few type II errors. Information improves for both, pushing out the “error frontier”. At the new optimum for expert 1, he makes more type I errors, but many fewer type II errors. Likewise, at the new optimum, expert 2 makes more type II errors and fewer type I errors. Indeed, it can be the case that expert 1 after the information improvement is making more type I and type II errors than expert 2 did in her original prediction, and that expert 2 is now making more type I and type II errors than expert 1 did in his original prediction. That is, the new set of predictions is a Blackwell garbling of the original set of predictions, and hence less useful to society no matter what decision rule society uses when applying the information to some problem. Note that this result does not depend on experts trying to out-guess each other or anything similar.

Is such a perverse outcome unusual? Not necessarily. Let both experts be “envious” before new information arrives, meaning that both experts prefer the other’s bundle of type I and type II errors to any such bundle the expert can choose himself. Let the agents’ payoffs not depend on the prediction of the other agents. Finally, let the new information be a “technology transfer”, meaning a sharing of some knowledge already known to one or both agents. That is, after the new information arrives, the error frontier of both agents lies within the convex hull of their original combined error frontiers. With envious agents, there is always a technology transfer that makes society worse off. All of the above holds even when experts are not required to make discrete {0,1} predictions.

This is all to say that, as the authors note, “better diagnostic technology need not lead to better diagnoses”. But let’s not go too far: there is no principal-agent game here. You may wonder if society can design payment rules to experts to avoid such perversity. We have a literature, now large, on expert testing, where you want to avoid paying “fake experts” for information. Though you can’t generally tell experts and charlatans apart, Shmaya and Echenique have a paper showing that there do exist mechanisms to ensure that, at least, I am not harmed “too much” by the charlatans’ advice. It is not clear whether a mechanism exists for paying experts which ensures that information improvements are strictly better for society. By Blackwell’s theorem, more information is strictly better for the principal, so incentivizing the experts to express their entire type I-type II error frontier (which is equivalent to expressing their prior and their signal) would work. How to do that is a job for another paper.

July 2012 working paper (unavailable on Repec IDEAS).

“Expressible Inspections,” T-W. Hu & E. Shmaya (2012)

The expert testing literature is really quite cool. The world is throwing out some stochastic process – a string of stock prices, or rain totals, or whatever – and an “expert” claims to know the process. I don’t know anything at all but I want to “test” whether the expert actually knows the process or is a charlatan. A number of results have been proven here: with tests that decide in finite time, it is generically true that fake experts have a strategy that can pass any test, though this manipulability result disappears if we suitably restrict the class of possible stochastic functions nature can be playing, or if we make the tester a Bayesian, or if we consider certain mechanism design problems where the tester is using the data from the test in certain ways rather than just testing what the expert knows. A particularly interesting note about how experts might be successfully tested was Fortnow and Vohra, who found that restricting the set of strategies to those which are computable in a “reasonable” amount of time, in complexity terms, was sufficient to make it impossible for false experts to manipulate some tests.

Hu and Shmaya then take the next step and derive precisely the conditions on “complexity” of strategies which are necessary and sufficient for the manipulability result to hold. They restrict tests to those which are computable, in the sense that they can be described by a Turing machine; if you know the Church-Turing thesis, this essentially means any test which I can describe to you in words, or write down in a contract, is allowed. If supposed experts can only use strategies which are Turing computable, then there does exist a test which is nonmanipulable while still accepting true experts with high probability. This is simply a strengthening of the result in Fortnow and Vohra. But if supposed experts can use slightly more complicated strategies – strategies that are computable with a Turing machine associated with an oracle that can answer the “halting problem” – then they can manipulate any computable test. The word “slightly” in the previous sentence is precise in the sense that according to something computer scientists call the arithmetic hierarchy, the machine-plus-oracle above is the first class of machines that can answer more complex questions than a simple Turing machine.

The heart of the proof involves a famous result in computability theory called the Enumeration Theorem. Essentially, the set of functions which can be computed by a (necessarily finite) Turing machine is countable, since the set of possible Turing machine programs is just a (countable) set of finite sequences of symbols. The Enumeration Theorem just defines a Universal Program U(m,n) such that if program 11256 in the countable list of programs outputs f(m) when m is input, then U(m,11256) outputs f(m) as well. It turns out to be impossible for a Turing machine to figure out the domain of its own Universal program (this is the halting problem). The computability of “winning” strategies for a false expert given a computable test is intimately linked to being able to solve halting problems, which gives us our result.

It strikes me that the expert testing literature is probably close to tapped out when it comes to “basic” results like “what tests can be manipulated given domain of nature’s strategies X and domain of expert strategies Y?” But broader questions in expert testing – particularly at the intersection of mechanism design and testing, or the links between inductive reasoning and testing – are still wide open. And these are not mathematical or philosophical curiosities. Imagine a hedge fund claims to garner above average risk-adjusted returns using special knowledge of the underlying stochastic process guiding the stock market, and you propose to test this before investing. Many straightforward tests are not only beatable by a fake expert in theory, but the strategy to beat the test is simple to derive: you can be fooled. A presentation I caught by Dean Foster a couple months back – forgive me if I have mentioned this before on this site – argued that some investment firms not only can, but do, take advantage of investors in precisely the way the testing literature says they can. Identifying fakes, or at least not being harmed by them, is a serious question.

Working Paper (IDEAS). Paper forthcoming in Theoretical Economics.

“The Reproducible Properties of Correct Forecasts,” A. Sandroni (2003)

Here’s a paper that really deserves to be better-known. Last week, I mentioned a result by Foster and Vohra that says that completely uninformed people can pass calibration tests if they are allowed to make predictions that are mixed strategies. Recall that being calibrated means that of the times when a forecaster predicts, say, rain with probability .3, nature actually does rain with probability .3. An application of the minimax theorem says that I can “fool” a calibration test by playing a suitably complex mixed strategy even when I, as a forecaster, actually have no idea what probability distribution nature is playing.

Now you might think that the result implies calibration tests are too “weak” in some sense. That is, if nature is playing {0,1,0,1…} and you predict .5 every period, then you are calibrated, but in no real sense are you making good predictions. A series of papers following Foster-Vohra (Hart and Mas-Colell have a couple, as do Kalai, Lehrer and Smorodinsky) looked at stronger and stronger versions of calibration, such as those that required subsequence predictions to calibrate as well, but kept coming up with the result that a clever mixed strategy could still fool the tests.

In the present paper, Sandroni shows that any test can be manipulated. That is, let a tester choose some test that will, no matter what probability distribution nature is playing, “pass” someone who actually knows that distribution with probability arbitrarily close to 1. Another fairly simple application of the minimax theorem (Fan’s theorem for infinite games, in this case) shows that a fake forecaster who does not know the true distribution can also pass that test! That is a devastating result, as far as I’m concerned, for our ability to judge science.

It may not be obvious what this result means. If nature is playing, say, “rain with p=1 every period”, then why not just use the test “your prediction every period must be exactly what nature does or you fail”. In that case, someone who knew what nature was doing would pass but the fake predictor surely wouldn’t. The problem with that (false) counterexample is that the proposed test will not pass a true forecaster almost all of the time no matter what probability distribution nature is playing, and there is no way for the tester to know in advance that nature is playing something so deterministic. If nature was actually playing “rain with p=.99 every period”, the proposed test would fail the knowledgeable forecaster because, simply by the draw of the probability distribution, he would sometimes predict rain when nature draws sun.
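A quick back-of-the-envelope calculation shows how badly the “exact match” test treats a true expert once nature is even slightly random. The function name and the horizons below are illustrative assumptions; the calculation simply supposes the informed forecaster predicts rain every day when rain has probability .99, so that he passes only if it rains every single day.

```python
def pass_probability(rain_prob, periods):
    """Chance that independent daily draws are 'rain' on every one of `periods` days."""
    return rain_prob ** periods


for periods in (10, 100, 1000):
    # Probability that the knowledgeable forecaster survives the exact-match test.
    print(periods, pass_probability(0.99, periods))
```

With p=1 the expert always passes, but with p=.99 his chance of passing falls to roughly a third after 100 days and is essentially zero after 1000, so the test does not pass true experts with high probability for every distribution nature might play.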

A couple of caveats are in order. First, tests here are restricted to finite tests in the sense that we test at period n whether a test has been passed using the predictions and draws of nature up to period n. I will discuss next week results by Dekel and Feinberg, as well as by Eran Shmaya, which deal with “infinite” predictions; measure-theoretic issues make those proofs much less intuitive than the one presented here (though perhaps this is only on the surface, since the present proof uses Fan’s theorem, which uses Hahn-Banach, which uses Axiom of Choice…). Second, you might wonder why I care about testing experts in and of itself. There doesn’t seem to be any decision theory here. Might I not care about accidentally letting through a “fake” expert if her false prediction is not terribly harmful to whatever decision I’m going to make? For instance, as a classmate of mine pointed out, even if I as the tester don’t know the true distribution of nature, perhaps I only need to know (for utility purposes) who is the expert if nature is playing a deterministic strategy. In that case, I might be fine with a test that “rejects” true predictions when nature is playing a strategy that is not deterministic, as long as I can tell who is a real expert when it comes to deterministic nature. Is such a test possible? Results like this, at the intersection of decision theory, mechanism design, and the statistical literature, really look like the present frontier in this line of research, and I’ll discuss some of them in the next few weeks. (Final version – it appears to be ungated at the official IJGT website, but let me know in comments if this doesn’t open for you…)

“Asymptotic Calibration,” D. Foster & R. Vohra (1998)

In the last post, I wrote about Dawid’s result that no forecasting technique, no matter how clever, will be able to calibrate itself against nature in a coherent way. Is there a way to save calibration? Foster and Vohra claim yes: let forecasters play mixed strategies. That is, rather than predict a 40 percent chance of rain given the history observed and beliefs about what nature will do this period, instead play a strategy that predicts a 60 percent chance of rain with .5 probability and a 20 percent chance of rain with .5 probability. Though in some sense of expectation (I’m abusing the term here), this strategy still predicts a 40 percent chance of rain, the predictions will follow a distribution rather than be a point.

Foster and Vohra let nature choose its joint distribution after seeing the forecaster’s joint distribution. Nature tries to make the agent as poorly calibrated as possible. Nature is even allowed to condition its time t strategies on the history of forecasts made by the agent up until time t-1. In particular, if p is a forecast (from a finite set A of arbitrary fineness) of how often rain arrives, q(p) is the fraction of days it actually rains when p is projected, and n(p,t) is the number of days p is forecast up until time t, then let (q(p)-p)^2 times the proportion of days p is forecast, summed over all possible predictions p, be the calibration, and let an agent be well calibrated with respect to nature’s strategy if her forecasts are such that that term goes to zero as time goes to infinity.
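The calibration score just defined is easy to compute from a record of forecasts and outcomes. The following Python sketch is my own minimal rendering of that definition (the function name and the toy weather sequence are assumptions for illustration): for each distinct forecast p it takes the squared gap between p and the realized rain frequency q(p), weights it by the share of days on which p was forecast, and sums.

```python
from collections import defaultdict


def calibration_score(forecasts, outcomes):
    """Sum over distinct forecasts p of (q(p) - p)^2 times the share of days p was forecast,
    where q(p) is the fraction of those days on which it actually rained (outcome 1)."""
    days_by_forecast = defaultdict(list)
    for p, rained in zip(forecasts, outcomes):
        days_by_forecast[p].append(rained)

    total_days = len(forecasts)
    score = 0.0
    for p, days in days_by_forecast.items():
        q = sum(days) / len(days)          # realized rain frequency when p was forecast
        score += (q - p) ** 2 * len(days) / total_days
    return score


# Toy data: nature alternates sun (0) and rain (1) for 100 days.
alternating = [0, 1] * 50

print(calibration_score([0.5] * 100, alternating))  # 0.0
print(calibration_score([0.3] * 100, alternating))  # positive: .3 is forecast but it rains half the time
```

A forecaster is well calibrated in the sense above if this number goes to zero as the number of days grows.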

Incredibly, there is a mixed strategy that is sufficient to defeat this malevolent nature. The proof relies on the minimax theorem. You can think of nature and the forecaster as playing a two-player zero sum game. Since the set of forecasts is assumed finite (say, forecasts and nature only produce rain up to the ability to measure it, in increments of .01 inches, with max of 50 inches in one day), von Neumann’s famous theorem applies, and I can just look for the value of the game.

The proof in Foster-Vohra is algebraically tedious, but Fudenberg and Levine give a very simple technique for calibrating in their short followup published in GEB. Essentially, the agent should play each strategy K times, where K is sufficiently large (this is the “initialization stage”). After initialization (which is finite, and during which nature can only beat the agent by a total amount that is finite), every period can be considered a zero-sum stage game. Applying minimax again, nature will choose the strategy that increases how poorly calibrated the agent becomes by the greatest amount, under the assumption that nature correctly forecasts the strategy the agent will use. The agent’s calibration score will increase the most when nature plays the strategy the agent has used the least. But since every action has been used many times, the amount by which nature can try to ruin a person’s calibration in any one period is bounded by an amount that is decreasing in K. From here, it is easy to show that the average increase in the forecaster’s error in any given period is bounded by an arbitrarily small number, and that therefore asymptotically the deviation from perfect calibration is bounded by an arbitrarily small number. That is, agents playing mixed strategies make it difficult for even the most malevolent nature to throw off their calibration.

You may be wondering: why is calibration a good criterion for forecasters anyway? The proof here essentially says that I can be well-calibrated even when I make completely uninformed forecasts for KN periods, where N is the number of possible predictions. Perhaps an asymptotic definition of good forecasting is not the most sensible? But, then, what rule should we use to judge forecasts? It turns out the answer to that question is not at all obvious; more importantly, economists have a very nice solution to the problem of what rule to use. Instead of choosing a test arbitrarily, why not show that all tests in a class, so big that it contains nearly any reasonable test, will have the same problems as calibration? More on this forthcoming. (Final published version, in Biometrika)

“The Well-Calibrated Bayesian,” A.P. Dawid (1982)

I’m helping run a reading group on a subfield of economics called “expert testing” this semester, so I’ll be posting a number of notes on the topic over the next couple months, as well as an annotated reading list at the end of March. The subfield in many ways develops out of the calibration literature in statistics, of which this article by Dawid in JASA is the classic article.

Consider a forecaster who tries to predict a stream of draws by nature. For instance, let nature choose a probability of rain every period, and let a weatherman likewise try to predict this probability (measurable to all weather occurring as of yesterday). How should we judge the accuracy of such forecasts? This turns out to be a totally nontrivial problem. “How close you are to being right about the true distribution” doesn’t work, since we only see the draw from nature’s mixed strategy – rain or no rain – and not the underlying strategy.

Dawid proposes calibration as a rule. Let nature play and the forecaster predict an arbitrarily long sequence of days. Consider all the days where the forecaster projects rain with probability x, say 30 percent. Then a forecaster is well-calibrated if, in fact, it rained on 30 percent of those days. Calibration is, at best, a minimal property for good forecasting. For instance, just predicting the long-run probability of rain, every day, will ensure a forecaster is well-calibrated.

It is proven that Bayesian agents cannot be subjectively miscalibrated, assuming that forecasters sequentially make predictions from a fixed distribution that is conditional on all past data (i.e., on when it has rained and when not in the past), and assuming that forecasters are coherent, a term due to de Finetti that essentially means the forecaster’s subjective probabilities follow the normal rules of probabilities. That is, after making sufficiently many predictions, a forecaster must believe the empirical event “rain” will occur exactly p percent of the time on days where rain was predicted to occur with probability p. Forecasters cannot believe themselves miscalibrated, no matter what evidence they see to the contrary. The basic reason is that, at time zero, the forecaster had already computed in a coherent way what he will predict conditional on seeing the history that he in fact sees. If he wants to “change his mind” when later making his sequential forecasts – say, after predicting snow over and over when it had in fact not snowed – he would essentially need to have two different subjective probabilities in his head, the original conditional one, and the new conditional one. This would violate coherence.

Now, this might not be a problem: perhaps there is a joint distribution over histories which an agent can play that can never become miscalibrated. That is, my forecasting technique is so good that whatever stream of weather nature throws my way, my predictions are calibrated with that reality. Unfortunately, it is very easy to construct a “malevolent” nature – that is, a nature playing minimax against an agent trying to predict it – who will cause any forecasting system to miscalibrate. Dawid and Oakes, both in 1985 articles in JASA, produce simple examples. Basically, if an agent’s probability distribution forecast conditional on history A says to predict rain with probability more than .5, then nature plays sun, and vice versa. In this way, the agent is always poorly calibrated. The implication, essentially, is that no forecasting system – that is, no econometric technique no matter how sophisticated – will be able to always model natural processes. The implication for learning in games is particularly devastating, because even if we think nature doesn’t try to fool us, we certainly can believe that opposing players will try to fool our forecasting algorithm in competitive games of learning. (Final version – published in JASA 1982)
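The malevolent nature described above takes only a few lines to simulate. In this Python sketch the forecasting rule (`empirical_frequency_forecaster`) is a hypothetical stand-in for “any deterministic forecasting system”, and all names are my own; the adversary is exactly the rule in the text, playing sun when the forecast probability of rain exceeds .5 and rain otherwise.

```python
def malevolent_outcome(forecast):
    """Nature's reply: sun (0) if rain is forecast with probability above .5, rain (1) otherwise."""
    return 0 if forecast > 0.5 else 1


def play(forecaster, periods):
    """Run a deterministic forecaster against malevolent nature; return forecasts and outcomes."""
    history, forecasts = [], []
    for _ in range(periods):
        forecast = forecaster(history)      # forecast may depend on all past weather
        forecasts.append(forecast)
        history.append(malevolent_outcome(forecast))
    return forecasts, history


def empirical_frequency_forecaster(history):
    """Hypothetical rule: forecast the rain frequency observed so far (.5 with no data)."""
    return sum(history) / len(history) if history else 0.5


forecasts, outcomes = play(empirical_frequency_forecaster, 200)
smallest_gap = min(abs(f - x) for f, x in zip(forecasts, outcomes))
print(smallest_gap)  # never below 0.5, whatever deterministic rule is plugged in
```

Because every day with the same forecast gets the same weather, the realized rain frequency on those days is either 0 or 1, which is at least .5 away from the forecast; the forecaster can never be calibrated against this nature.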

“The Importance of Being Honest,” N. Klein (2010)

(A ridiculous number of talented young game theorists are currently enjoying summer’s last gasp in the old Sicilian town of Erice for this year’s IMS Stochastic Methods in Game Theory conference. Over the next few days, I’ll be posting – from Chicago, not from the beaches of Italy, alas – about a few papers from that conference that struck me as particularly interesting. I can’t resist a good pun, so let’s start with Nicholas Klein’s expert testing paper…)

Classic mechanism design problems involved properly incentivizing agents to work when effort cannot be observed. A series of recent papers, the great concentration of which seem to be coming from here at Northwestern, extend mechanism design to the problem of expert testing: how do I know when an agent, who knows more about a topic than me, is accurately reporting the results of his experiments?

Klein considers the testing problem in the form of a three-armed bandit, with one safe arm (i.e., one arm where the agent shirks and does nothing), one “cheating” arm that achieves breakthroughs with probability p, and one “research” arm that achieves breakthroughs with probability q should the principal’s hypothesis be true, and with probability 0 otherwise. Imagine that I own a pharma factory, and I think a certain chemical process will produce a useful new drug, but I don’t know how to do the experiments. If the scientist cheats and pours chemicals together in some way which is infeasible when we go to mass production, then he will create the compound I want with probability p. If he uses my hypothesized technique, and my hypothesis is true, he will create the drug with probability q, but if my hypothesis is not true, he will never create the drug. The agent has until time T to experiment as he will, with the principal observing – at any given time t – any success, not knowing whether it was achieved by cheating or research. At time T, the principal receives a payoff if the research arm provided a success, so the principal wants the agent to experiment with the research arm exclusively until a success is achieved, and then to play the safe arm, for which the agent will be paid zero. There is a clear incentive problem, though. If the agent is off equilibrium, he may have a belief that the research arm is very, very unlikely to succeed: indeed, because of that belief, he may feel a success is more probable with the cheating arm than with the research arm. For agents with such beliefs, if the principal pays upon seeing a success, the agent will be paid more than zero, but the principal will make zero. Is there a better incentive system?

There is. Note that, since the problem is continuous, if the agent plays “research”, his belief about the probability the hypothesis is true will converge to the true probability. Assume that, given perfect monitoring (i.e., no possibility of cheating, but still a probability of shirking, or playing the safe arm) and such a true probability, it is worthwhile for the principal to pay the agent to experiment. Because q>p, there must be a length of time T, and an integer m, such that if the agent is only paid after m successes, rather than after 1 success, the agent will only play the research arm since it will be more likely to reach m than the cheating arm given that the agent has true beliefs about the truth of the hypothesis. Klein shows that optimal payments only pay agents after m>1 successes. Once we ensure that agents do not cheat, we can ensure they experiment optimally (in the sense of a traditional bandit problem) by letting the agent payment be conditional on the time of the second success, which will only come from research and not cheating, with such payment varying to exactly take into account the value of information from experimentation. Clearly the final payment must be higher than the first-best payment where there are no agency problems (and indeed, it must be higher than the second-best payment where only shirking is a concern).

It is not stated in the paper, but it seems to me that if q is not greater than p, no incentive compatible scheme will elicit optimal experimentation by the agent. Perhaps we should have a General Theorem of Expert Testing: that experts can rarely be incentivized to tell the truth. (Klein notes that this draft, from April, is incomplete and preliminary: I do not have a newer version)
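To see why paying only after m successes can tilt the agent toward research, here is a rough static calculation in Python. It is a sketch under strong assumptions of my own, not Klein's model: successes on each arm arrive as a Poisson process, the research arm (when the hypothesis is true) has the higher arrival rate, the agent commits to one arm for the whole horizon, and all the numbers are made up.

```python
import math


def prob_at_least(m, rate, horizon):
    """P(N >= m) for N ~ Poisson(rate * horizon)."""
    mean = rate * horizon
    below = sum(math.exp(-mean) * mean ** k / math.factorial(k) for k in range(m))
    return 1.0 - below


research_rate = 2.0  # assumed success rate of research if the hypothesis is true
cheat_rate = 1.0     # assumed success rate of the cheating arm
horizon = 1.0        # assumed length of the experimentation period
belief = 0.3         # assumed probability the agent puts on the hypothesis being true

for m in (1, 2, 3, 5):
    research = belief * prob_at_least(m, research_rate, horizon)
    cheat = prob_at_least(m, cheat_rate, horizon)
    print(m, round(research, 4), round(cheat, 4))
```

With these numbers cheating is the better way to collect a payment tied to one success, while for a large enough threshold m research is more likely to get there, even for an agent who thinks the hypothesis is probably false.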
