The multiarmed bandit is a true workhorse of modern mathematical economics. In a bandit problem, there are multiple arms you can pull, as in some types of slot machines. You have beliefs about the distribution of payoffs from pulling a given arm. For instance, there may be a safe arm which yields an expectation of one coin every time you pull it, and a risky arm which yields an expectation of 2 coins with prior probability 1/3, and 0 coins with prior probability 2/3. Returns are generally discounted. There is often a “value of experimentation”: agents will pull an arm with a lower current expected value than another arm because, for instance, learning that the risky arm above is the type with expected value 2 will increase my payoff from now until infinity, while I only pay the cost of experimenting now. In many single-person bandit problems, the optimal arm to pull can be found simply by computing a formula called the Gittins index, derived by J.C. Gittins in the 1970s.
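
To make the value of experimentation concrete, here is a minimal sketch of the two-arm example above. It rests on an assumption the example itself does not make: that a single pull of the risky arm fully reveals its type. The function names and the discount factor `delta` are mine, purely for illustration.

```python
def safe_forever(delta, safe=1.0):
    """Discounted value of pulling the safe arm in every period."""
    return safe / (1.0 - delta)


def experiment_once(delta, prior_good=1/3, good=2.0, bad=0.0, safe=1.0):
    """Value of pulling the risky arm once, then playing the better arm forever.

    Assumption (not from the text): one pull reveals the risky arm's type.
    """
    # Expected payoff today from the risky arm, before learning anything.
    today = prior_good * good + (1 - prior_good) * bad
    # From tomorrow on, play whichever arm is better given what was learned.
    tomorrow = prior_good * max(good, safe) + (1 - prior_good) * max(bad, safe)
    return today + delta * tomorrow / (1.0 - delta)


for delta in (0.3, 0.5, 0.9):
    print(delta, safe_forever(delta), experiment_once(delta))
```

Under these assumptions the risky arm has the lower current expected value (2/3 versus 1), yet experimenting is worth it whenever the discount factor exceeds 1/2: the one-period loss of 1/3 is outweighed by the chance of earning 2 rather than 1 in every later period.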

As far as I know, the first explicitly economic bandit paper is Rothschild’s 1974 JET “A two-armed bandit theory of market pricing.” Rothschild tries to explain why prices may be dispersed for the same product over time. His explanation is simple: consumer demand is unknown, and I try to learn demand by experimenting with prices. This produces a very particular form of price dispersion. Since Rothschild, a huge amount of economics work on bandit problems involves externalities: experimenting with the risky arm is socially valuable, but I bear all the cost privately, and don’t get all the benefit. This has been used, in many forms, extensively in the R&D literature by a number of economists you may know, like Bergemann, Besanko and Hopenhayn. Keller and Rady, along with Cripps, have a famous 2005 Econometrica paper involving exponential bandits (i.e., a safe arm and an arm that is either a total failure or a success, with the success learned while pulling that arm according to an exponential time distribution).

This 2010 paper, in Theoretical Economics, extends the R&D model to Poisson bandits. There are two arms being pulled by N firms in continuous time. One is a safe arm which pays a flow rate of s, and one is a risky arm which gives an expectation of either s' > s or s'' < s. The risky arm gives payoffs in lumps, so the only difference between the risky arm of type s' and the one of type s'' is that the Poisson arrival rate is lower for s''. This means that a single "success" on the risky arm does not tell me conclusively whether the arm is the good type or the bad type.
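
The reason a single success is inconclusive is just Bayes' rule with two positive arrival rates. The sketch below is my own illustration, not code from the paper; the names `lam_good` and `lam_bad` for the arrival rates of the two types, and `n_pulling` for the number of firms on the risky arm, are assumptions introduced here.

```python
import math


def posterior_after_success(p, lam_good, lam_bad):
    """Belief that the risky arm is the good type, right after one lump arrives."""
    return p * lam_good / (p * lam_good + (1 - p) * lam_bad)


def posterior_after_silence(p, lam_good, lam_bad, t, n_pulling=1):
    """Belief after n_pulling firms pull the risky arm for time t with no lump."""
    # Probability of zero arrivals in time t under each type.
    like_good = math.exp(-n_pulling * lam_good * t)
    like_bad = math.exp(-n_pulling * lam_bad * t)
    return p * like_good / (p * like_good + (1 - p) * like_bad)


# Poisson bandit: the bad type also pays lumps, only less often.
print(posterior_after_success(0.5, lam_good=2.0, lam_bad=1.0))
# Exponential bandit as a special case: the bad type never pays.
print(posterior_after_success(0.5, lam_good=2.0, lam_bad=0.0))
# No news is bad news: the belief drifts down while nothing arrives.
print(posterior_after_silence(0.5, lam_good=2.0, lam_bad=1.0, t=1.0))
```

With a bad-type arrival rate of zero, one success pushes the belief all the way to 1, which is the exponential case. With a positive bad-type rate, the belief jumps up but stays strictly below 1.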

For the usual free-riding reasons described above, experimentation will be suboptimal. But there is another interesting effect here. Let p1* be the belief about the risky arm being of type s' such that if there were only one firm, he would pull the risky arm if his belief were above p1* and pull the safe arm if it were below p1*. Keller and Rady prove that in any Markov perfect equilibrium with N firms, I am willing to spend some of my time pulling the risky arm even when my belief is below p1*. Why? They call this the “encouragement effect.” If there is just me, then the only benefit of pulling the risky arm when I am near p1* is that I might learn the risky arm is better than I previously thought by getting a Poisson success. But with N firms, getting a Poisson success both gives me this information and, by improving everyone’s belief about the quality of the risky arm, encourages others to also experiment with the risky arm in the future. Since payoffs exhibit strategic complementarity, I will benefit from their future experimentation.

There is one other neat result, which involves some technical tricks as well. We usually solve just for the symmetric MPE, for simplicity. In the symmetric MPE, which is unique, we all mix between the safe and risky arm as long as we are above some cutoff P. But as we get closer and closer to P, we are spending arbitrarily close to zero effort on the risky arm, so our posterior, given no successes, decreases only very slowly and we never reach P in finite time. This suggests that an asymmetric MPE may do better, even in a Pareto sense. Consider the following: near P, have one person experiment with full effort if the current belief is in some set B1, and have the other person experiment if the current belief is in B2. If it is my turn to experiment, I have multiple reasons to exert full effort: most importantly, because B1 and B2 are set up so that if I change the belief enough through my experimentation, the other person will take over the cost of experimenting. Characterizing the full set of MPE is difficult, of course.
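
The "never reach P in finite time" point is easy to see numerically. The sketch below is only an illustration under assumptions of my own: the no-success belief drift is taken to be dp/dt = -N * effort(p) * d_lambda * p * (1 - p), where d_lambda is the gap between the two arrival rates, and the vanishing effort schedule is a made-up linear ramp, not the equilibrium strategy from the paper. All names and parameter values are hypothetical.

```python
CUTOFF = 0.3  # hypothetical stopping belief P


def full_effort(p):
    """Everyone pulls the risky arm full time."""
    return 1.0


def vanishing_effort(p, p_full=0.5):
    """Hypothetical schedule: effort falls linearly to zero as p approaches CUTOFF."""
    return min(1.0, max(0.0, (p - CUTOFF) / (p_full - CUTOFF)))


def time_to_cutoff(effort, p0=0.6, n_firms=2, d_lambda=1.0, dt=0.001, t_max=10.0):
    """Euler-integrate the no-success belief drift.

    Returns the first time the belief is at or below CUTOFF, or None if that
    does not happen before t_max.
    """
    p, t = p0, 0.0
    while t < t_max:
        if p <= CUTOFF:
            return t
        p -= n_firms * effort(p) * d_lambda * p * (1 - p) * dt
        t += dt
    return None


print(time_to_cutoff(full_effort))
print(time_to_cutoff(vanishing_effort))
```

With full effort the belief crosses the cutoff quickly. With effort shrinking to zero, the distance to the cutoff decays roughly exponentially and the function returns None for any finite horizon you give it, which is the intuition for why taking turns at full effort near P can help.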

https://tspace.library.utoronto.ca/bitstream/1807/27188/1/20100275.pdf (Final version in TE volume 5, 2010. Theoretical Economics is an amazing journal. It is completely open access, allows reuse and republication under a generous CC license, doesn’t charge any publication fee, doesn’t charge any submission fee as of now, and has among the fastest turnaround time in the business. Is it any surprise that TE has, by many accounts, passed Elsevier’s JET as the top field journal in micro theory?)