“Learning While Voting: Determinants of Collective Experimentation,” B. Strulovici (2010)

Politics has a well-known status quo bias. Surely one could explain this as the result of some psychological factors. But can a preference for the status quo result from the voting mechanism itself, even if all voters are expected utility maximizers? Bruno Strulovici develops some results from optimal control (“If the proofs seem hard, it’s because they actually are rocket science”) to show that the answer is yes.

In a standard multi-armed bandit problem, agents choose whether to pull a “safe” arm with a known payoff or a “risky” arm with an unknown payoff. Pulling the risky arm has an option value, because if it turns out that the risky arm payoff is high, then I will get that high payoff from now until the game ends. If the risky arm payoff is low, then at some point I will just switch to the safe arm, and will play that arm forever since I never learn anything more about either arm. In the presence of externalities – we both pull arms and I can see the result of your pulls – there is too little experimentation since everyone wants to free ride. Results of this type are well known by now.

Strulovici’s model is a little different. We all vote, in continuous time, on whether society plays a safe arm or a risky arm, but the payoff from the risky arm differs across individuals. That is, some people are “winners” from a policy, and some are “losers”. Everyone begins unsure about which type they are, and each winner receives news that they are indeed a winner at a random time governed by a Poisson process. Anyone who has not received such news considers themselves a loser with more and more probability over time, simply by Bayes’ Law. Let x be the share of voters who must approve the risky arm if we are to continue pulling it; in majority voting, x is just fifty percent. Note that there is no learning externality here: we all have independent types, and anyway, if society decides to play risky, then everyone in society is forced to play the risky arm.
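The Bayesian drift toward pessimism can be made concrete with a small sketch. It assumes, as a simplification, that only winners ever receive news, that a winner's news arrives at a constant Poisson rate, and that the risky arm has been played continuously up to time t. The names p0 (prior probability of being a winner), lam (arrival rate) and belief_winner are mine, not Strulovici's notation.

```python
import math

def belief_winner(p0: float, lam: float, t: float) -> float:
    """Posterior probability of being a winner after seeing no news up to time t.

    Assumes losers never receive news and a winner's news arrives at rate lam.
    """
    no_news_if_winner = math.exp(-lam * t)  # chance a winner is still uninformed at t
    return p0 * no_news_if_winner / (p0 * no_news_if_winner + (1.0 - p0))

# Beliefs fall toward zero the longer the silence lasts.
for t in (0.0, 1.0, 5.0):
    print(t, round(belief_winner(0.5, 1.0, t), 4))
```

With these illustrative numbers the belief starts at the prior and falls steadily, which is all the "more and more probability" statement above requires.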

In such a world, there is too little experimentation, vis-a-vis the utilitarian social optimum. That is, there is a bias for status quo policies. Why? On the one hand, experimenting and finding out you are a winner is less valuable than in the single-agent bandit problem, because even if you pay experimentation costs and learn you are a winner, society may have enough non-winners that the majority votes for the safe arm at some later date. The discounted profits of learning you are a winner, then, are lower than in the single-agent problem. On the other hand, if you are more and more sure you are a loser, you will want to end the risky experiment quickly, because there is a chance that a sufficiently high number of other agents will later find out they are winners and therefore trap you, by majority vote, into playing the risky arm forever.

What if we, for instance, lowered or raised the cutoff at which the risky policy is continued? Instead of majority vote, we could require unanimity in order to keep implementing the risky arm. This will satisfy the potential losers: they never need fear that their vote to continue experimentation will trap them in the policy they don’t like. But it only makes things worse for potential winners: finding out I am a winner is even less valuable than in the majority rule case, since a single agent can end the risky policy that benefits me, and that I paid to learn benefits me. It turns out that for any fixed cutoff, there is suboptimal experimentation under some parameter values. This is worrying: the majority rule, for example, violates what Strulovici calls nonadversity, the property that I cannot be made worse off by experimenting and finding out that I am a winner. To see the violation, consider three voters, voting with majority rule on which arm to pull. If one agent receives notice that she is a winner, the other two know that if one of them also receives a winner signal, the risky arm will be pulled forever. In order to avoid being trapped in the policy they won’t like, these two remaining voters will stop the risky policy sooner. If learning is slow and the harm of being trapped in the risky policy when you are a loser is high, then the value of experimentation is negative even if, given your current Bayesian beliefs, the immediate payoff from experimentation is positive.

However, there is a way to save voting under experimentation: make the cutoff an increasing function of time. If you require more and more people to vote for the risky arm as time increases, in a particular parameter-dependent way, the socially optimal level of experimentation is achieved. The intuition here is that the number of sure winners – those that have received notification from the Poisson process that they are in fact winners – is an increasing function of time.
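A rough sketch of that intuition, under assumptions of my own: n voters with independent types, a common prior p0 of being a winner, news arriving to winners at Poisson rate lam, and the risky arm played without interruption up to time t. This is only the expected count of sure winners, not Strulovici's optimal cutoff rule, which depends on the parameters in a way not reproduced here.

```python
import math

def expected_sure_winners(n: int, p0: float, lam: float, t: float) -> float:
    """Expected number of voters who have learned they are winners by time t.

    Each voter is a winner with probability p0, and a winner has received
    news by time t with probability 1 - exp(-lam * t).
    """
    return n * p0 * (1.0 - math.exp(-lam * t))

# The expected count rises with t and levels off at n * p0.
for t in (0.0, 0.5, 2.0, 10.0):
    print(t, round(expected_sure_winners(100, 0.5, 1.0, t), 2))
```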

All of the results stated are robust to correlating types and to making the revelation of who is a winner private information (in the particular sense where the number of voters for the risky policy at any time is public knowledge).

http://faculty.wcas.northwestern.edu/~bhs675/VotEx.pdf (Final WP – final version published in May 2010 Econometrica)

