
What Randomization Can and Cannot Do: The 2019 Nobel Prize

It is Nobel Prize season once again, a grand opportunity to dive into some of our field's most influential papers and to consider their legacy. This year's prize was inevitable, an award to Abhijit Banerjee, Esther Duflo, and Michael Kremer for popularizing the hugely influential experimental approach to development. It is only fitting that my writeup this year was delayed by the anti-government road blockades here in Ecuador, which kept me from the internet-enabled world – developing countries face many barriers to reaching prosperity, and rarely have I been so personally aware of the effects of place on productivity as I was this week!

The reason for the prize is straightforward: an entire branch of economics, development, looks absolutely different from what it looked like thirty years ago. Development used to be essentially a branch of economic growth. Researchers studied topics like the productivity of large versus small farms, the nature of “marketing” (or the nature of markets and how economically connected different regions in a country are), or the necessity of exports versus industrialization. Studies were almost wholly observational, deep data collections with throwaway references to old-school growth theory. Policy was largely driven by the subjective impression of donors or program managers about projects that “worked”. To be a bit too honest – it was a dull field, and hence a backwater. And worse than dull, it was a field where scientific progress was seriously lacking.

Banerjee has a lovely description of the state of affairs back in the 1990s. Lots of probably-good ideas were funded, informed deeply by history, but with very little convincing evidence that highly-funded projects were achieving their stated aims. The World Bank Sourcebook recommended everything from scholarships for girls to vouchers for poor children to citizens' report cards. Did these actually work? Banerjee quotes a program providing computer terminals in rural areas of Madhya Pradesh which explains that due to a lack of electricity and poor connectivity, "only a few of the kiosks have proved to be commercially viable", then notes, without irony, that "following the success of the initiative," similar programs would be funded. Clearly this state of affairs is unsatisfactory. Surely we should be able to evaluate the projects we've funded already? And better, surely we should structure those evaluations to inform future projects? Banerjee again: "the most useful thing a development economist can do in this environment is stand up for hard evidence."

And where do we get hard evidence? If by this we mean internal validity – that is, whether the effect we claim to have seen is actually caused by a particular policy in a particular setting – applied econometricians of the “credibility revolution” in labor in the 1980s and 1990s provided an answer. Either take advantage of natural variation with useful statistical properties, like the famed regression discontinuity, or else randomize treatment like a medical study. The idea here is that the assumptions needed to interpret a “treatment effect” are often less demanding than those needed to interpret the estimated parameter of an economic model, hence more likely to be “real”. The problem in development is that most of what we care about cannot be randomized. How are we, for instance, to randomize whether a country adopts import substitution industrialization or not, or randomize farm size under land reform – and at a scale large enough for statistical inference?

What Banerjee, Duflo, and Kremer noticed is that much of what development agencies do in practice has nothing to do with those large-scale interventions. The day-to-day work of development is making sure teachers show up to work, vaccines are distributed and taken up by children, corruption does not deter the creation of new businesses, and so on. By breaking down the work of development on the macro scale to evaluations of development at micro scale, we can at least say something credible about what works in these bite-size pieces. No longer should the World Bank Sourcebook give a list of recommended programs, based on handwaving. Rather, if we are to spend 100 million dollars sending computers to schools in a developing country, we should at least be able to say “when we spent 5 million on a pilot, we designed the pilot so as to learn that computers in that particular setting led to a 12% decrease in dropout rate, and hence a 34%-62% return on investment according to standard estimates of the link between human capital and productivity.” How to run those experiments? How should we set them up? Who can we get to pay for them? How do we deal with “piloting bias”, where the initial NGO we pilot with is more capable than the government we expect to act on evidence learned in the first study? How do we deal with spillovers from randomized experiments, econometrically? Banerjee, Duflo, and Kremer not only ran some of the famous early experiments, they also established the premier academic institution for running these experiments – J-PAL at MIT – and further wrote some of the best known practical guides to experiments in development.

Many of the experiments run by the three winners are now canonical. Let's start with Michael Kremer's paper on deworming, with Ted Miguel, in Econometrica. Everyone agreed that deworming kids infected with things like hookworm has large health benefits for the children directly treated. But since worms are spread by outdoor bathroom use and other poor hygiene practices, one infected kid can also harm nearby kids by spreading the disease. Kremer and Miguel suspected that one reason school attendance is so poor in some developing countries is the disease burden, and hence that reducing infections among one kid benefits the entire community, and neighboring ones as well, by reducing overall infection. By randomizing mass school-based deworming, and measuring school attendance both at the focal and at neighboring schools, they found that villages as far as 4km away saw higher school attendance (4km rather than the 6km in the original paper, following a correction of an error in the analysis). Note the good economics here: a change from individual to school-based deworming helps identify spillovers across schools, and some care goes into handling the spatial econometric issue whereby density of nearby schools equals density of nearby population equals differential baseline infection rates at these schools. An extra year of school attendance could therefore be "bought" by a donor for $3.50, much cheaper than other interventions such as textbook programs or additional teachers. Organizations like GiveWell still rate deworming among the most cost-effective educational interventions in the world: in terms of short-run impact, surely this is one of the single most important pieces of applied economics of the 21st century.

The laureates have also used experimental design to learn that some previously highly-regarded programs are not as important to development as you might suspect. Banerjee, Duflo, Rachel Glennerster and Cynthia Kinnan studied microfinance rollout in Hyderabad, randomizing the neighborhoods which received access to a major first-gen microlender. These programs generally offer woman-focused, joint-liability, high-interest loans a la the Nobel Peace Prize-winning Grameen Bank. 2,800 households across the city were initially surveyed about their family characteristics, lending behavior, consumption, and entrepreneurship; followups were performed a year after the microfinance rollout, and again three years later. While women in treated areas were 8.8 percentage points more likely to take a microloan, and existing entrepreneurs did in fact increase spending on their businesses, there was no long-run impact on education, health, or the likelihood that women make important family decisions, nor did microcredit make businesses more profitable. That is, credit constraints, at least in poor neighborhoods of Hyderabad, do not appear to be the main barrier to development; this is perhaps not very surprising, since higher-productivity firms in India in the 2000s already had access to reasonably well-developed credit markets, and surely they are the main driver of national income (followup work does see some benefits for very high talent, very poor entrepreneurs, but the long-run key result remains).

Let’s realize how wild this paper is: a literal Nobel Peace Prize was awarded for a form of lending that had not really been rigorously analyzed. This form of lending effectively did not exist in rich countries at the time they developed, so it is not a necessary condition for growth. And yet enormous amounts of money went into a somewhat-odd financial structure because donors were nonetheless convinced, on the basis of very flimsy evidence, that microlending was critical.

By replacing conjecture with evidence, and showing randomized trials can actually be run in many important development settings, the laureates' reformation of economic development has been unquestionably positive. Or has it? Before returning to the (truly!) positive aspects of Banerjee, Duflo and Kremer's research program, we must take a short negative turn. Because though Banerjee, Duflo, and Kremer are unquestionably the leaders of the field of development, and the most influential scholars for young economists working in that field, there is much more controversy about RCTs than you might suspect if all you've seen are the press accolades for the method. Donors love RCTs, as they help select the right projects. Journalists love RCTs, as they are simple to explain (Wired, in a typical example of this hyperbole: "But in the realm of human behavior, just as in the realm of medicine, there's no better way to gain insight than to compare the effect of an intervention to the effect of doing nothing at all. That is: You need a randomized controlled trial.") The "randomista" referees love RCTs – a tribe is a tribe, after all. But RCTs are not necessarily better for those who hope to understand economic development! The critiques are threefold.

First, while the method of random trials is great for impact or program evaluation, it is not great for understanding how similar but not exact replications will perform in different settings. That is, random trials have no specific claim to external validity, and indeed are worse than other methods on this count. Second, it is argued that development is much more than program evaluation, and that the reason real countries grow rich has essentially nothing to do with the kinds of policies studied in the papers we discussed above: the "economist as plumber" famously popularized by Duflo, one who rigorously diagnoses small problems and proposes solutions, is a fine job for a World Bank staffer, but a crazy use of the intelligence of our otherwise-leading scholars in development. Third, even if we only care about internal validity, and only care about the internal validity of some effect that can in principle be studied experimentally, the optimal experimental design is generally not an RCT.

The external validity problem is often seen to be one related to scale: well-run partner NGOs are just better at implementing any given policy than, say, a government, so the benefit of scaled-up interventions may be much lower than that identified by an experiment. We call this "piloting bias", but it isn't really the core problem. The core problem is that the mapping from one environment or one time to the next depends on many factors, and by definition the experiment cannot replicate those factors. A labor market intervention in a high-unemployment country cannot inform in an internally valid way about a low-unemployment country, or a country with different outside options for urban laborers, or a country with an alternative social safety net or cultural traditions about income sharing within families. Worse, the mapping from a partial equilibrium to a general equilibrium world is not at all obvious, and experiments do not inform as to the mapping. Giving cash transfers to some villagers may make them better off, but giving cash transfers to all villagers may cause land prices to rise, or cause more rent extraction by corrupt governments, or cause all sorts of other changes in relative prices.

You can see this issue in the Scientific Summary of this year’s Nobel. Literally, the introductory justification for RCTs is that, “[t]o give just a few examples, theory cannot tell us whether temporarily employing additional contract teachers with a possibility of re-employment is a more cost-effective way to raise the quality of education than reducing class sizes. Neither can it tell us whether microfinance programs effectively boost entrepreneurship among the poor. Nor does it reveal the extent to which subsidized health-care products will raise poor people’s investment in their own health.”

Theory cannot tell us the answers to these questions, but an internally valid randomized control trial can? Surely the wage of the contract teacher vis-a-vis more regular teachers and hence smaller class sizes matters? Surely it matters how well-trained these contract teachers are? Surely it matters what the incentives for investment in human capital by students in the given location are? To put this another way: run literally whatever experiment you want to run on this question in, say, rural Zambia in grade 4 in 2019. Then predict the cost-benefit ratio of having additional contract teachers versus more regular teachers in Bihar in high school in 2039. Who would think there is a link? Actually, let's be more precise: who would think there is a link between what you learned in Zambia and what will happen in Bihar which is not primarily theoretical? Having done no RCT, I can tell you that if the contract teachers are much cheaper per unit of human capital, we should use more of them. I can tell you that if the students speak two different languages, there is a greater benefit in having a teacher assistant to translate. I can tell you that if the government or other principal has the ability to undo outside incentives with a side contract, hence is not committed to the mechanism, dynamic mechanisms will not perform as well as you expect. These types of statements are theoretical: good old-fashioned substitution effects due to relative prices, or a priori production function issues, or basic mechanism design.

Things are worse still. It is not simply that an internally valid estimate of a treatment effect often tells us nothing about how that effect generalizes, but that the important questions in development cannot be answered with RCTs. Everyone working in development has heard this critique. But just because a critique is oft-repeated does not mean it is wrong. As Lant Pritchett argues, national development is a social process involving markets, institutions, politics, and organizations. RCTs have focused on, in his reckoning, “topics that account for roughly zero of the observed variation in human development outcomes.” Again, this isn’t to say that RCTs cannot study anything. Improving the function of developing world schools, figuring out why malaria nets are not used, investigating how to reintegrate civil war fighters: these are not minor issues, and it’s good that folks like this year’s Nobelists and their followers provide solid evidence on these topics. The question is one of balance. Are we, as economists are famously wont to do, simply looking for keys underneath the spotlight when we focus our attention on questions which are amenable to a randomized study? Has the focus on internal validity diverted effort from topics that are much more fundamental to the wealth of nations?

But fine. Let us consider that our question of interest can be studied in a randomized fashion. And let us assume that we do not expect piloting bias or other external validity concerns to be first-order. We still have an issue: even on internal validity, randomized control trials are not perfect. They are certainly not a "gold standard", and the econometricians who push back against this framing have good reason to do so. Two primary issues arise. First, to predict what will happen if I impose a policy, I am concerned that what I have learned in the past is biased (e.g., the people observed to use schooling subsidies are more diligent than those who would go to school if we made these subsidies universal). But I am also concerned about statistical inference: with small sample sizes, even an unbiased estimate will not predict very well. I recently talked with an organization doing recruitment who quasi-randomly recruited at a small number of colleges. On average, they attracted a handful of applicants in each college. They stopped recruiting at the colleges with two or fewer applicants after the first year. But of course random variation means the difference between two and four applicants is basically nil.
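To put rough numbers on that recruiting example (with hypothetical figures, not the organization's actual data): suppose every college is identical and yields three applicants per year on average. A quick simulation shows how often pure chance alone pushes a college below a "two or fewer" cutoff:

```python
import math
import random

random.seed(0)

def poisson_draw(lam):
    # Knuth's method for drawing a Poisson random variable (stdlib only)
    threshold, k, p = math.exp(-lam), 0, 1.0
    while p > threshold:
        k += 1
        p *= random.random()
    return k - 1

# Hypothetical: every college truly averages 3 applicants per year.
trials = 100_000
failures = sum(poisson_draw(3.0) <= 2 for _ in range(trials))
share = failures / trials
print(f"Share of identical colleges cut by chance alone: {share:.2f}")
```

With these made-up numbers, roughly 42% of perfectly identical colleges get dropped purely by sampling noise, which is exactly why "two applicants versus four" carries almost no information.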

In this vein, randomized trials tend to have very small sample sizes compared to observational studies. When this is combined with high “leverage” of outlier observations when multiple treatment arms are evaluated, particularly for heterogeneous effects, randomized trials often predict poorly out of sample even when unbiased (see Alwyn Young in the QJE on this point). Observational studies allow larger sample sizes, and hence often predict better even when they are biased. The theoretical assumptions of a structural model permit parameters to be estimated even more tightly, as we use a priori theory to effectively restrict the nature of economic effects.
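The bias-versus-variance tradeoff behind this point can be made concrete with a stylized simulation (all parameters below are mine, chosen for illustration, not drawn from any paper discussed here): an unbiased estimate from a small randomized sample can have a higher mean squared error than a biased estimate from a large observational sample.

```python
import random
import statistics

random.seed(1)

TRUE_EFFECT = 1.0  # the parameter we wish to predict

def experiment_estimate(n=30, noise=3.0):
    # Unbiased but small: mean outcome of a modest randomized sample
    return statistics.mean(random.gauss(TRUE_EFFECT, noise) for _ in range(n))

def observational_estimate(n=3000, noise=3.0, bias=0.2):
    # Biased (e.g., selection into treatment) but with a large sample
    return statistics.mean(random.gauss(TRUE_EFFECT + bias, noise) for _ in range(n))

reps = 2000
mse_rct = statistics.mean((experiment_estimate() - TRUE_EFFECT) ** 2 for _ in range(reps))
mse_obs = statistics.mean((observational_estimate() - TRUE_EFFECT) ** 2 for _ in range(reps))
print(f"MSE, unbiased small RCT:         {mse_rct:.3f}")
print(f"MSE, biased large observational: {mse_obs:.3f}")
```

With these invented numbers the biased design predicts better out of sample; shrink the bias or grow the experiment and the ranking flips, which is precisely why sample size cannot be waved away.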

We have thus far assumed the randomized trial is unbiased, but that is often suspect as well. Even if I randomly assign treatment, I have not necessarily randomly assigned spillovers in a balanced way, nor have I restricted untreated agents from rebalancing their effort or resources. A PhD student of ours on the market this year, Carlos Inoue, examined the effect of random allocation of a new coronary intervention in Brazilian hospitals. Following the arrival of this technology, good doctors moved to hospitals with the “randomized” technology. The estimated effect is therefore nothing like what would have been found had all hospitals adopted the intervention. This issue can be stated simply: randomizing treatment does not in practice hold all relevant covariates constant, and if your response is just “control for the covariates you worry about”, then we are back to the old setting of observational studies where we need a priori arguments about what these covariates are if we are to talk about the effects of a policy.

The irony is that Banerjee, Duflo and Kremer are often quite careful in how they motivate their work with traditional microeconomic theory. They rarely make grandiose claims of external validity when nothing of the sort can be shown by their experiment. Kremer is an ace theorist in his own right, Banerjee often relies on complex decision and game theory particularly in his early work, and no one can read the care with which Duflo handles issues of theory and external validity and think she is merely punting. Most of the complaints about their “randomista” followers do not fully apply to the work of the laureates themselves.

And none of the critiques above should be taken to mean that experiments cannot be incredibly useful to development. Indeed, the proof of the pudding is in the tasting: some of the small-scale interventions by Banerjee, Duflo, and Kremer have been successfully scaled up! To analogize to a firm, consider a plant manager interested in improving productivity. She could read books on operations research and try to implement ideas, but it surely is also useful to play around with experiments within her plant. Perhaps she will learn that it’s not incentives but rather lack of information that is the biggest reason workers are, say, applying car door hinges incorrectly. She may then redo training, and find fewer errors in cars produced at the plant over the next year. This evidence – not only the treatment effect, but also the rationale – can then be brought to other plants at the same company. All totally reasonable. Indeed, would we not find it insane for a manager to try things out, and make minor changes on the margin, before implementing a huge change to incentives or training? And of course the same goes, or should go, when the World Bank or DFID or USAID spend tons of money trying to solve some development issue.

On that point, what would even a skeptic agree a development experiment can do? First, it is generally better than other methods at identifying internally valid treatment effects, though still subject to the caveats above.

Second, it can fine-tune interventions along margins where theory gives little guidance. For instance, do people not take AIDS drugs because they don't believe they work, because they don't have the money, or because they want to continue having sex and no one will sleep with them if they are seen picking up antiretrovirals? My colleague Laura Derksen suspected that people are often unaware that antiretrovirals prevent transmission, hence in locations with high rates of HIV, it may be safer to sleep with someone taking antiretrovirals than with the population at large. She shows that informational interventions informing villagers about this property of antiretrovirals meaningfully increase takeup of medication. We learn from her study that it may be important in the case of AIDS prevention to correct this particular set of beliefs. Theory, of course, tells us little about how widespread these incorrect beliefs are, hence about the magnitude of this informational shift on drug takeup.

Third, experiments allow us to study policies that no one has yet implemented. Ignoring the problem of statistical identification in observational studies, there may be many policies we wish to implement which are wholly different in kind from those seen in the past. The negative income tax experiments of the 1970s are a classic example. Experiments give researchers more control. This additional control is of course balanced against the fact that we should expect super meaningful interventions to have already occurred, and we may have to perform experiments at relatively low scale due to cost. We should not be too small-minded here. There are now experimental development papers on topics thought to be outside the bounds of experiment. I've previously discussed on this site Kevin Donovan's work randomizing the placement of roads and bridges connecting remote villages to urban centers. What could be "less amenable" to randomization than the literal construction of a road and bridge network?

So where do we stand? It is unquestionable that a lot of development work in practice was based on the flimsiest of evidence. It is unquestionable that the armies of researchers Banerjee, Duflo, and Kremer have sent into the world via J-PAL and similar institutions have brought much more rigor to program evaluation. Some of these interventions are now literally improving the lives of millions of people with clear, well-identified, nonobvious policy. That is an incredible achievement! And there is something likeable about the desire of the ivory tower to get into the weeds of day-to-day policy. Michael Kremer on this point: "The modern movement for RCTs in development economics…is about innovation, as well as evaluation. It's a dynamic process of learning about a context through painstaking on-the-ground work, trying out different approaches, collecting good data with good causal identification, finding out that results do not fit pre-conceived theoretical ideas, working on a better theoretical understanding that fits the facts on the ground, and developing new ideas and approaches based on theory and then testing the new approaches." No objection here.

That said, we cannot ignore that there are serious people who seriously object to the J-PAL style of development. Deaton, who won the Nobel Prize only four years ago, writes the following, in line with our discussion above: "Randomized controlled trials cannot automatically trump other evidence, they do not occupy any special place in some hierarchy of evidence, nor does it make sense to refer to them as "hard" while other methods are "soft"… [T]he analysis of projects needs to be refocused towards the investigation of potentially generalizable mechanisms that explain why and in what contexts projects can be expected to work." Lant Pritchett argues that despite success persuading donors and policymakers, the evidence that RCTs lead to better policies at the governmental level, and hence better outcomes for people, is far from established. The barrier to the adoption of better policy is bad incentives, not a lack of knowledge about how given policies will perform. I think these critiques are quite valid, and the randomization movement in development often way overstates what they have, and could in principle have, learned. But let's give the last word to Chris Blattman on the skeptic's case for randomized trials in development: "if a little populist evangelism will get more evidence-based thinking in the world, and tip us marginally further from Great Leaps Forward, I have one thing to say: Hallelujah." Indeed. No one, randomista or not, longs to go back to the days of unjustified advice on development, particularly "Great Leap Forward" type programs without any real theoretical or empirical backing!

A few remaining bagatelles:

1) It is surprising how early this award was given. Though incredibly influential, the earliest published papers by any of the laureates mentioned in the Nobel scientific summary are from 2003 and 2004 (Miguel-Kremer on deworming, Duflo-Saez on retirement plans, Chattopadhyay and Duflo on female policymakers in India, Banerjee and Duflo on health in Rajasthan). This seems shockingly recent for a Nobel – I wonder if there are any other Nobel winners in economics who won entirely for work published so close to the prize announcement.

2) In my field, innovation, Kremer is most famous for his paper on patent buyouts (we discussed that paper on this site way back in 2010). How do we both incentivize new drug production but also get these drugs sold at marginal cost once invented? We think the drugmakers have better knowledge about how to produce and test a new drug than some bureaucrat, so we can’t finance drugs directly. If we give a patent, then high-value drugs return more to the inventor, but at massive deadweight loss. What we want to do is offer inventors some large fraction of the social return to their invention ex-post, in exchange for making production perfectly competitive. Kremer proposes patent auctions where the government pays a multiple of the winning bid with some probability, giving the drug to the public domain. The auction reveals the market value, and the multiple allows the government to account for consumer surplus and deadweight loss as well. There are many practical issues, but I have always found this an elegant, information-based attempt to solve the problem of innovation production, and it has been quite influential on those grounds.
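The moving parts of Kremer's buyout can be sketched in a few lines (the markup and buyout probability below are my illustrative placeholders, not Kremer's calibration): a second-price auction elicits the patent's private market value, the government usually buys at a markup meant to capture consumer surplus and deadweight loss, and the patent is occasionally sold to the top bidder so that bids stay honest.

```python
# Toy sketch of a patent buyout auction. Parameters are hypothetical.
MARKUP = 2.0       # assumed ratio of social value to private market value
BUYOUT_PROB = 0.9  # government buys with this probability

def run_buyout(private_values, coin):
    """coin is a uniform [0,1) draw supplied by the caller (kept explicit
    so the two branches of the mechanism are easy to inspect)."""
    bids = sorted(private_values, reverse=True)
    price = bids[1]  # second-highest bid sets the revealed market value
    if coin < BUYOUT_PROB:
        # Government pays the inventor a markup; drug enters the public domain
        return "public domain", MARKUP * price
    # Occasionally sell to the winner at the price, preserving bid incentives
    return "sold to top bidder", price

print(run_buyout([10.0, 9.0, 7.5], coin=0.3))   # the usual buyout branch
print(run_buyout([10.0, 9.0, 7.5], coin=0.95))  # the incentive-preserving branch
```

The design choice worth noticing is the occasional sale: if the government always bought, bidders would have no stake in the outcome and no reason to bid their true valuations.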

3) Somewhat ironically, Kremer also has a great 1990s growth paper with RCT-skeptics Pritchett, Easterly and Summers. The point is simple: growth rates by country vacillate wildly decade to decade. Knowing the 2000s, you likely would not have predicted countries like Ethiopia and Myanmar as growth miracles of the 2010s. Yet things like education, political systems, and so on are quite constant within-country across any two decade period. This necessarily means that shocks of some sort, whether from international demand, the political system, nonlinear cumulative effects, and so on, must be first-order for growth. A great, straightforward argument, well-explained.

4) There is some irony that two of Duflo’s most famous papers are not experiments at all. Her most cited paper by far is a piece of econometric theory on standard errors in difference-in-difference models, written with Marianne Bertrand. Her next most cited paper is a lovely study of the quasi-random school expansion policy in Indonesia, used to estimate the return on school construction and on education more generally. Nary a randomized experiment in sight in either paper.

5) I could go on all day about Michael Kremer’s 1990s essays. In addition to Patent Buyouts, two more of them appear on my class syllabi. The O-Ring theory is an elegant model of complementary inputs and labor market sorting, where slightly better “secretaries” earn much higher wages. The “One Million B.C.” paper notes that growth must have been low for most of human history, and that it was limited because low human density limited the spread of nonrivalrous ideas. It is the classic Malthus plus endogenous growth paper, and always a hit among students.

6) Ok, one more for Kremer, since "Elephants" is my favorite title in economics. Theoretically, expected future scarcity raises prices today. When people think elephants will go extinct, the price of ivory therefore rises, making extinction more likely as poaching incentives go up. What to do? Hold a government stockpile of ivory and commit to selling it if the stock of living elephants falls below a certain point. Elegant. And I can't help but think: how would one study this particular general equilibrium effect experimentally? I both believe the result and suspect that randomized trials are not a good way to understand it!

The 2017 Nobel: Richard Thaler

A true surprise this morning: the behavioral economist Richard Thaler from the University of Chicago has won the Nobel Prize in economics. It is not a surprise because it is undeserving; rather, it is a surprise because only four years ago, Thaler's natural co-laureate Bob Shiller won while Thaler was left the bridesmaid. But Thaler's influence on the profession, and the world, is unquestionable. There are few developed governments that do not have a "nudge" unit of some sort trying to take advantage of behavioral nudges to push people a touch in one way or another, including here in Ontario via my colleagues at BEAR. I will admit, perhaps under the undue influence of too many dead economists, that I am skeptical of nudging and behavioral finance on both positive and normative grounds, so this review will be one of friendly challenge rather than hagiography. I trust that there will be no shortage of wonderful positive reflections on Thaler's contribution to policy, particularly because he is the rare economist whose work is totally accessible to laymen and, more importantly, journalists.

Much of my skepticism is similar to how Fama thinks about behavioral finance: "I've always said they are very good at describing how individual behavior departs from rationality. That branch of it has been incredibly useful. It's the leap from there to what it implies about market pricing where the claims are not so well-documented in terms of empirical evidence." In other words, surely most people are not that informed and not that rational much of the time, but repeated experience, market selection, and other aggregative factors mean that this irrationality may not matter much for the economy at large. It is very easy to claim that since economists model "agents" as "rational", we would, for example, "not expect a gift on the day of the year in which she happened to get married, or be born" and indeed "would be perplexed by the idea of gifts at all" (Thaler 2015). This type of economist caricature is both widespread and absurd, I'm afraid. In order to understand the value of Thaler's work, we ought first to look at situations where behavioral factors matter in real world, equilibrium decisions of consequence, then figure out how common those situations are, and why.

The canonical example of Thaler’s useful behavioral nudges is his “Save More Tomorrow” pension plan, designed with Benartzi. Many individuals in defined contribution plans save too little, both because they are not good at calculating how much they need to save and because they are biased toward present consumption. You can, of course, force people to save a la Singapore, but we dislike these plans because individuals vary in their need and desire for saving, and because we find the reliance on government coercion to save heavy-handed. Alternatively, you can default defined-contribution plans to some positive savings rate, but it turns out people rarely move away from the default throughout their career, and hence save too little solely because they didn’t want too much removed from their first paycheck. Thaler and Benartzi have companies offer plans where you agree now to have your savings rate increased whenever you get a raise – for instance, if your salary goes up 2%, half of that raise will be directed into savings, until your savings rate reaches some sufficiently high cap. In this way, no one ever takes a nominal post-savings pay cut. People can, of course, leave the plan whenever they want. In their field experiments, savings rates did in fact soar (with takeup varying hugely depending on how information about the plan was presented), and future attrition from the plan was low.
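The escalation logic is simple enough to sketch in a few lines. All the parameters below (a 3% raise, half of each raise diverted, a 10% cap) are my own illustrative numbers, not figures from the Thaler-Benartzi paper; the point is only that the savings rate ratchets up while nominal take-home pay never falls.

```python
# Sketch of the Save More Tomorrow escalation mechanism with made-up numbers.
# Each raise of raise_pct percentage points diverts half of itself into savings,
# until the savings rate hits a cap; take-home pay never drops in nominal terms.

def smart_path(start_rate, raise_pct, diverted_share, cap, n_raises):
    """Track (salary, savings rate, take-home pay) across successive raises."""
    rate, salary, history = start_rate, 100.0, []
    for _ in range(n_raises):
        salary *= 1 + raise_pct
        rate = min(cap, rate + raise_pct * diverted_share)
        history.append((round(salary, 2), round(rate, 4), round(salary * (1 - rate), 2)))
    return history

path = smart_path(start_rate=0.02, raise_pct=0.03, diverted_share=0.5, cap=0.10, n_raises=8)
for salary, rate, take_home in path:
    print(salary, rate, take_home)
```

Because only part of each raise is diverted, take-home pay rises every period even as the savings rate climbs toward the cap, which is exactly why the plan never feels like a pay cut.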

This policy is what Thaler and Sunstein call “libertarian paternalism”. It is paternalistic because, yes, we think that you may make bad decisions from your own perspective, because you are not that bright, or because you are lazy, or because you have many other things demanding your attention. It is libertarian because there is no compulsion: anyone can opt out at their leisure. Results similar to Thaler and Benartzi’s have been found by Ashraf et al in a field experiment in the Philippines, and by Karlan et al in three countries, where just sending reminder messages that make savings goals more salient modestly increased savings.

So far, so good. We have three issues to unpack, however. First, when is this nudge acceptable on ethical grounds? Second, why does nudging generate such large effects here, and if the effects are large, why doesn’t the market simply provide them? Third, is the 401k savings case idiosyncratic or representative? The idea that homo economicus, the rational calculator, misses important features of human behavior and could do with some insights from psychology is not new, of course. Thaler’s prize is, at minimum, the fifth Nobel to go to someone pushing this general idea, since Herb Simon, Maurice Allais, Daniel Kahneman, and the aforementioned Bob Shiller have all already won. Copious empirical evidence, and indeed simple human observation, implies that people have behavioral biases, that they are not perfectly rational – as Thaler has noted, we see what looks like irrationality even in the composition of $100 million baseball rosters. The more militant behavioralists insist that ignoring these psychological factors is unscientific! And yet, and yet: the vast majority of economists, all of whom are by now familiar with these illustrious laureates and their work, still use fairly standard expected utility maximizing agents in nearly all of our papers. Unpacking the three issues above will clarify how that could possibly be so.

Let’s discuss ethics first. Simply arguing that organizations “must” make a choice (as Thaler and Sunstein do) is insufficient; we would not say that a firm which defaults consumers into autorenewal of a product they would rarely renew under active choice is acting “neutrally”. Nudges can be used for “good” or “evil”. Worse, whether a nudge is good or evil depends on the planner’s evaluation of the agent’s “inner rational self”, as Infante and Sugden, among others, have noted many times. That is, claiming paternalism is “only a nudge” does not excuse the paternalist from the usual moral philosophic critiques! Indeed, as Chetty and friends have argued, the more you believe behavioral biases exist and are “nudgeable”, the more careful you need to be as a policymaker about inadvertently reducing welfare. There is, I think, less controversy when we use nudges rather than coercion to reach some policy goal. For instance, if a policymaker wants to reduce energy usage, and is worried about distortionary taxation, nudges may (depending on how you think about social welfare with non-rational preferences!) be a better way to achieve the desired outcomes. But this goal is very different from the common justification that nudges somehow push people toward policies they actually like in their heart of hearts. Carroll et al have a very nice theoretical paper trying to untangle exactly what “better” means for behavioral agents, and exactly when the imprecision of nudges or defaults, given our imperfect knowledge of individuals’ heterogeneous preferences, makes attempts at libertarian paternalism worse than laissez faire.

What of the practical effects of nudges? How can they be so large, and in what contexts? Thaler has very convincingly shown that behavioral biases can affect real world behavior, and that understanding those biases means two policies which are identical from the perspective of a homo economicus model can have very different effects. But many economic situations involve players doing things repeatedly with feedback – where heuristics that approximate rationality evolve – or involve players who “perform poorly” being selected out of the game. For example, I can think of many simple nudges to get you or me to play better basketball. But when it comes to Michael Jordan, the first-order effects are surely how well he takes care of his health, the teammates he has around him, and so on. I can think of many heuristics useful for understanding how simple physics will operate, but I don’t think I can find many that would improve Einstein’s understanding of how the world works. The 401k situation is unusual because it is a decision with limited short-run feedback, taken by unsophisticated agents who will learn little even with experience. The natural alternative, of course, is to have agents outsource the difficult parts of the decision, to investment managers or the like. And these managers will make money by improving people’s earnings. No surprise that robo-advisors, index funds, and personal banking have all become more important as defined contribution plans have become more common! If we worry about behavioral biases, we ought worry especially about market imperfections that prevent the existence of designated agents who handle the difficult decisions for us.

The fact that agents can exist is one reason that irrationality in the lab may not translate into irrationality in the market. But even without agents, we might reasonably be suspicious of some claims of widespread irrationality. Consider Thaler’s famous endowment effect: how much you are willing to pay for, say, a coffee mug or a pen is much less than how much you would accept to have the coffee mug taken away from you. Indeed, it is not unusual in a study to find a ratio of three times or greater between the willingness-to-pay and willingness-to-accept amounts. But, of course, if these were “preferences”, you could be money pumped (see Yaari, applying a theorem of de Finetti, on the mathematics of the pump). Say you value the mug at ten bucks when you own it and five bucks when you don’t. Do we really think I can regularly get you to pay twice as much by loaning you the mug for free for a month? Do we see car companies letting you take a month-long test drive of a $20,000 car then letting you keep the car only if you pay $40,000, with some consumers accepting? Surely not. Now the reason why is partly what Laibson and Yariv argue, that money pumps do not exist in competitive economies since market pressure will compete away rents: someone else will offer you the car at $20,000 and you will just buy from them. But even if the car company is a monopolist, surely we find the magnitude of the money pump implied here to be on its face ridiculous.

Even worse are the dictator games introduced in Thaler’s 1986 fairness paper. Students were asked, upon being given $20, whether they wanted to give an anonymous student half of their endowment or 10%. Many of the students gave half! This experiment has been repeated many, many times, with similar effects. Does this mean economists are naive to neglect the social preferences of humans? Of course not! People are endowed with money and gifts all the time. They essentially never give any of it to random strangers – I feel confident assuming you, the reader, have never been handed some bills on the sidewalk by an office worker who just got a big bonus! Worse, the context of the experiment matters a ton (see John List on this point). Indeed, despite hundreds of lab experiments on dictator games, I feel far more confident predicting real world behavior following windfalls if we use a parsimonious homo economicus model than if we use the results of dictator games. Does this mean the games are useless? Of course not – studying what factors affect other-regarding preferences is interesting, and important. But how odd to have a branch of our field filled with people who see armchair theorizing about homo economicus as “unscientific”, yet take lab experiments literally even when they are so clearly contrary to the data!

To take one final example, consider Thaler’s famous model of “mental accounting”. In many experiments, he shows people have “budgets” set aside for various tasks. I have my “gas budget” and adjust my driving when gas prices change. I only sell stocks when I am up overall on that stock since I want my “mental account” of that particular transaction to be positive. But how important is this in the aggregate? Take the Engel curve. Budget shares devoted to food fall with income. This is widely established historically and in the cross section. Where is the mental account? Farber (2008 AER) even challenges the canonical behavioral account of taxi drivers working just enough hours to make their targeted income. As in the dictator game and the endowment effect, there is a gap between what is real, psychologically, and what is consequential enough to be first-order in our economic understanding of the world.

Let’s sum up. Thaler’s work is brilliant – it is a rare case of an economist taking psychology seriously and actually coming up with policy-relevant consequences like the 401k policy. But Thaler’s work is also dangerous to young economists who see biases everywhere. Experts in a field, and markets with their agents and mechanisms and all the other tricks they develop, are very, very good at ferreting out irrationality, and economists’ core skill lies in not missing those tricks.

Some remaining bagatelles: 1) Thaler and his PhD advisor, Sherwin Rosen, have one of the first papers on measuring the “statistical” value of a life, a technique now widely employed in health economics and policy. 2) Beyond his academic work, Thaler has won a modicum of fame as a popular writer (Nudge, written with Cass Sunstein, is canonical here) and for his brief turn as an actor alongside Selena Gomez in “The Big Short”. 3) Dick has a large literature on “fairness” in pricing, a topic which goes back to Thomas Aquinas, if not earlier. Many of the experiments Thaler performs, like the thought experiments of Aquinas, come down to the fact that many perceive market power to be unfair. Sure, I agree, but I’m not sure there’s much more that can be learned than this uncontroversial fact. 4) Law and econ has been massively influenced by Thaler. As a simple example, if endowment effects are real, then the assignment of property rights matters even when there are no transaction costs. Jolls et al 1998 go into more depth on this issue. 5) Thaler’s precise results in so-called behavioral finance are beyond my area of expertise, so I defer to John Cochrane’s comments following the 2013 Nobel. Eugene Fama is, I think, correct when he suggests that market efficiency generated by rational traders with risk aversion is the best model we have of financial behavior, where best is measured by “is this model useful for explaining the world.” The number of behavioral anomalies at the level of the market which persist and are relevant in the aggregate do not strike me as large, while the number of investors and policymakers who make dreadful decisions because they believe markets are driven by behavioral sentiments is large indeed!

Reinhard Selten and the making of modern game theory

Reinhard Selten, it is no exaggeration, is a founding father of two massive branches of modern economics: experiments and industrial organization. He passed away last week after a long and idiosyncratic life. Game theory as developed by the three co-Nobel laureates Selten, Nash, and Harsanyi is so embedded in economic reasoning today that, to a great extent, it has replaced price theory as the core organizing principle of our field. That this would happen was not always so clear, however.

Take a look at some canonical papers before 1980. Arrow’s Possibility Theorem simply assumed true preferences can be elicited; not until Gibbard and Satterthwaite do we answer the question of whether there is even a social choice rule that can elicit those preferences truthfully! Rothschild and Stiglitz’s celebrated 1976 essay on imperfect information in insurance markets defines equilibria in terms of individual rationality, best responses in the Cournot sense, and free entry. How odd this seems today – surely the natural equilibrium in an insurance market depends on beliefs about the knowledge held by others, and beliefs about those beliefs! Analyses of bargaining before Rubinstein’s 1982 breakthrough nearly always rely on axioms of psychology rather than strategic reasoning. Discussions of predatory pricing until the 1970s, at the very earliest, relied on arguments that we now find unacceptably loose in their treatment of beliefs.

What happened? Why didn’t modern game-theoretic treatment of strategic situations – principally those involving more than one agent but fewer than an infinite number, although even situations of perfect competition are now often motivated game theoretically – arrive soon after the proofs of von Neumann, Morgenstern, and Nash? Why wasn’t the Nash program, of finding justification in self-interested noncooperative reasoning for cooperative or axiom-driven behavior, immediately taken up? The problem was that the core concept of the Nash equilibrium simply permits too great a multiplicity of outcomes, some of which feel natural and others of which are less so. As such, a long search, driven essentially by a small community of mathematicians and economists, attempted to find the “right” refinements of Nash. And a small community it was: I recall Drew Fudenberg telling a story about a harrowing bus ride at an early game theory conference, where a fellow rider mentioned offhand that should they crash, the vast majority of game theorists in the world would be wiped out in one go!

Selten’s most renowned contribution came in the idea of perfection. The concept of subgame perfection was first proposed in a German-language journal in 1965 (making it one of the rare modern economic classics inaccessible to English speakers in the original, alongside Maurice Allais’ 1953 French-language paper in Econometrica which introduces the Allais paradox). Selten’s background up to 1965 is quite unusual. A young man during World War II, raised Protestant but with one Jewish parent, Selten fled Germany to work on farms, and only finished high school at 20 and college at 26. His two interests were mathematics, for which he worked on the then-unusual extensive form game for his doctoral degree, and experimentation, inspired by the small team of young professors at Frankfurt trying to pin down behavior in oligopoly through small lab studies.

In the 1965 paper, on demand inertia (paper is gated), Selten wrote a small game theoretic model to accompany the experiment, but realized there were many equilibria. The term “subgame perfect” was not introduced until 1974, also by Selten, but the idea itself is clear in the ’65 paper. He proposed that attention should focus on equilibria where, after every action, each player continues to act rationally from that point forward; that is, he proposed that in every “subgame”, or every game that could conceivably occur after some actions have been taken, equilibrium actions must remain an equilibrium. Consider predatory pricing: a firm considers lowering price below cost today to deter entry. It is a Nash equilibrium for entrants to believe the price would continue to stay low should they enter, and hence to not enter. But it is not subgame perfect: the entrant should reason that after entering, it is not worthwhile for the incumbent to continue to lose money once the entry has already occurred.

Complicated strings of deductions which rule out some actions based on faraway subgames can seem paradoxical, of course, and did even to Selten. In his famous Chain Store paradox, he considers a firm with stores in many locations choosing whether to price aggressively to deter entry, with one potential entrant in each town choosing, one at a time, whether to enter. Entrants prefer to enter if pricing is not aggressive, but prefer to remain out otherwise; incumbents prefer to price nonaggressively whether or not entry occurs. Reasoning backward, in the final town we have the simple one-shot predatory pricing case analyzed above, where we saw that entry is the only subgame perfect outcome. Therefore, the entrant in the second-to-last town knows that the incumbent will not fight entry aggressively in the final town, hence there is no benefit to doing so in the second-to-last town, hence entry occurs again. Reasoning similarly, entry occurs everywhere. But if the incumbent could commit in advance to pricing aggressively in, say, the first 10 towns, it would deter entry in those towns and hence its profits would improve. Such commitment may not be possible, but what if the incumbent’s reasoning ability is limited, and it doesn’t completely understand why aggressive pricing in early stages won’t deter the entrant in the 16th town? And what if entrants reason that the incumbent is not perfectly rational? Then aggressive pricing to deter entry can occur.
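The backward induction argument above can be made concrete in a few lines of code. The payoff numbers here are my own, chosen only to satisfy the orderings in the text (entrants prefer entering against a nonaggressive incumbent; incumbents prefer accommodation once entry has occurred); they are not from Selten's paper.

```python
# Backward induction in the chain-store game, with illustrative payoffs:
# entrant: stay out = 1, enter-and-accommodated = 2, enter-and-fought = 0
# incumbent (per town): accommodate entry = 2, fight entry = 0

def solve_chain_store(n_towns):
    """Solve the n-town entry game by backward induction, last town first."""
    play = []
    for town in reversed(range(n_towns)):
        # Under subgame perfection, continuation play in later towns is fixed
        # regardless of today's action, so it cancels out of the incumbent's
        # comparison: 2 (accommodate) vs 0 (fight) in the current town alone.
        incumbent = "accommodate" if 2 > 0 else "fight"
        # Anticipating accommodation, the entrant compares 2 (enter) vs 1 (out).
        payoff_enter = 2 if incumbent == "accommodate" else 0
        entrant = "enter" if payoff_enter > 1 else "stay out"
        play.append((town + 1, entrant, incumbent))
    return sorted(play)

for town, entrant, incumbent in solve_chain_store(20):
    print(town, entrant, incumbent)
```

The key line is the comment inside the loop: because continuation payoffs are identical whichever action the incumbent takes today, they drop out of every comparison, and entry-plus-accommodation unravels back through all twenty towns.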

That behavior may not be perfectly rational but rather bounded had been an idea of Selten’s since he read Herbert Simon as a young professor, but in his Nobel Prize biography, he argues that progress on a suitable general theory of bounded rationality has been hard to come by. The closest Selten comes to formalizing the idea is in his paper on trembling hand perfection in 1974, inspired by conversations with John Harsanyi. The problem with subgame perfection had been noted: if an opponent takes an action off the equilibrium path, it is “irrational”, so why should rationality of the opponent be assumed in the subgame that follows? Selten assumes that tiny mistakes can happen, putting even rational players into off-path subgames. Taking the limit as mistakes become infinitesimally rare produces the idea of trembling-hand perfection. The idea of trembles implicitly introduces the idea that players hold beliefs, at each information set, about what has happened in the game. Kreps and Wilson’s sequential equilibrium recast trembles as beliefs under uncertainty, showing that a slight modification of the trembling hand leads to an easier decision-theoretic interpretation, an easier computation of equilibria, and an outcome nearly identical to Selten’s original idea. Sequential equilibrium, of course, went on to become the workhorse solution concept in dynamic economics, a concept which underpins essentially all of modern industrial organization.

That Harsanyi, inventor of the Bayesian game, is credited by Selten for inspiring the trembling hand paper is no surprise. The two had met at a conference in Jerusalem in the mid-1960s, and they’d worked together both on applied projects for the US military and on pure theory research while Selten was visiting Berkeley. A classic 1972 paper of theirs on Nash bargaining with incomplete information (article is gated) begins the field of cooperative games with incomplete information. And this was no minor field: Roger Myerson, in his paper introducing mechanism design under incomplete information – the famous Bayesian revelation principle paper – shows that there exists a unique Selten-Harsanyi bargaining solution under incomplete information which is incentive compatible.

Myerson’s example is amazing. Consider building a bridge which costs $100. Two people will use the bridge. One values the bridge at $90. The other values the bridge at $90 with probability .9, and $30 with probability .1, where that valuation is the private knowledge of the second person. Note that in either case, the bridge is worth building. But who should pay? If you propose a 50/50 split, the bridge will simply not be built 10% of the time. If you propose an 80/20 split, where even in their worst-case situation each person gets a surplus of ten dollars, the outcome is unfair to player one 90% of the time (where “unfair” means it violates certain principles of fairness that Nash, and later Selten and Harsanyi, set out axiomatically). What of the 53/47 split that gives each party the same expected surplus? Again, this is not “interim incentive compatible”, in that player two will refuse to pay in the case he is the type that values the bridge only at $30. Myerson shows mathematically that both players will agree, once they know their private valuations, to the following deal, and that the deal satisfies the Selten-Nash fairness axioms: when player 2 claims to value at $90, the payment split is 49.5/50.5 and the bridge is always built, but when player 2 claims to value at $30, the entire cost is paid by player 1 and the bridge is built with only probability .439. Under this split, player 2 has the correct incentives to always reveal his true willingness to pay. The mechanism means that there is a 5.61 percent chance the bridge isn’t built, but the split of surplus from the bridge nonetheless does better than any other split which satisfies all of Harsanyi and Selten’s fairness axioms.
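The arithmetic in this example can be checked directly. One caveat: the text gives a 49.5/50.5 split without saying which player pays which share, so the snippet below assumes player 1 pays the larger share (50.5) when player 2 reports the high valuation; that assumption is mine, not Myerson's statement.

```python
# Numeric check of the bridge example described above. Cost 100; player 1 values
# the bridge at 90; player 2 at 90 w.p. 0.9 and 30 w.p. 0.1 (private knowledge).

# A 50/50 split: the realized low type refuses to pay, so the bridge fails 10% of the time.
assert 30.0 - 50.0 < 0

# A 53/47 split equalizes *expected* surplus across the two players...
s1 = 90.0 - 53.0
s2 = (0.9 * 90.0 + 0.1 * 30.0) - 47.0
assert abs(s1 - 37.0) < 1e-9 and abs(s2 - 37.0) < 1e-9
# ...but is not interim incentive compatible: the realized low type still refuses.
assert 30.0 - 47.0 < 0

# Myerson's mechanism: report 90 -> player 2 pays 49.5 (assumed side of the split),
# bridge built for sure; report 30 -> player 1 pays everything, built w.p. 0.439.
q_low = 0.439
truth_high = 90.0 - 49.5   # high type reports honestly
lie_high = q_low * 90.0    # high type pretends to be low, pays nothing
truth_low = q_low * 30.0
lie_low = 30.0 - 49.5
assert truth_high >= lie_high and truth_low >= lie_low   # incentive compatible
assert truth_high >= 0 and truth_low >= 0                # interim individually rational

# Failure probability: only the (prob .1) low-type report risks no bridge.
print(round(0.1 * (1 - q_low), 4))   # 0.0561 -- the 5.61% in the text
```

Note how close the high type's truthful payoff (40.5) sits to his payoff from lying (39.51): the build probability .439 is pinned down almost exactly by the requirement that the high type not gain from misreporting.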

Selten’s later work is, it appears to me, more scattered. His attempt with Harsanyi to formalize “the” equilibrium refinement, in a 1988 book, was valiant but in the end misguided. His papers on theoretical biology, inspired by his interest in long walks among the wildflowers, are rather tangential to his economics. And what of his experimental work? To understand Selten’s thinking, read the fascinating dialogue with himself that Selten gave as a Schwartz Lecture at Northwestern MEDS. In this dialogue, he imagines a debate between a Bayesian economist, an experimentalist, and an evolutionary biologist. The economist argues that “theory without theorems” is doomed to fail, that Bayesianism is normatively “correct”, and that Bayesian reasoning can easily be extended to include costs of reasoning or reasoning mistakes. The experimentalist argues that ad hoc assumptions are better than incorrect ones: just as human anatomy is complex and cannot be reduced to a few axioms, neither can social behavior. The biologist argues that learning a la Nelson and Winter is a descriptively accurate account of how humans behave, whereas high-level reasoning is not. The “chairman”, perhaps representing Selten himself, sums up the argument by saying that experiments which simply contradict Bayesianism are a waste of time, but that human social behavior surely depends on bounded rationality and hence empirical work ought be devoted to constructing a foundation for such a theory (shall we call this the “Selten program”?). And yet, this essay was from 1990, and we seem no closer to having such a theory, nor does it seem to me that behavioral research has fundamentally contradicted most of our core empirical understanding derived from theories with pure rationality. Selten’s program, it seems, remains not only incomplete, but perhaps not even first order; the same cannot be said of his theoretical constructs, as without perfection a great part of modern economics simply could not exist.

“Optimal Contracts for Experimentation,” M. Halac, N. Kartik & Q. Liu (2013)

Innovative activities have features not possessed by more standard modes of production. The eventual output, and its value, are subject to a lot of uncertainty. Effort can be difficult to monitor – it is often the case that the researcher knows more than management about what good science should look like. The inherent skill of the scientist is hard to observe. Output is generally only observed in discrete bunches.

These features make contracting for researchers inherently challenging. The classic reference here is Holmstrom’s 1989 JEBO, which just applies his great 1980s incentive contract papers to innovative activities. Take a risk-neutral firm. They should just work on the highest expected value project, right? Well, if workers are risk averse and supply unobserved effort, the optimal contract balances moral hazard (I would love to just pay you based on your output) and risk insurance (I would have to pay you to bear risk about the eventual output of the project). It turns out that the more uncertainty a project has, the more inefficient the information-constrained optimal contract becomes, so that even risk-neutral firms are biased toward relatively safe, lower expected value projects. Incentives within the firm matter in many other ways, as Holmstrom also points out: giving an employee multiple tasks when effort is unobserved makes it harder to provide proper incentives because the opportunity cost of a given project goes up, firms with a good reputation in capital markets will be reluctant to pursue risky projects since the option value of variance in reputation is lower (a la Doug Diamond’s 1989 JPE), and so on. Nonetheless, the first-order problem of providing incentives for a single researcher on a single project is hard enough!
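The risk-insurance tradeoff can be seen in the textbook linear-contract (CARA-normal) version of the moral hazard problem – a standard sketch rather than Holmstrom's 1989 setup itself. With output y = e + ε, ε ~ N(0, σ²), wage w = a + b·y, effort cost k·e²/2, and CARA risk aversion r, the well-known second-best piece rate is b* = 1/(1 + r·k·σ²), which falls as project uncertainty σ² rises: noisier projects get lower-powered incentives and hence less effort.

```python
# The risk-insurance tradeoff in the standard linear (CARA-normal) moral hazard
# model: output y = e + eps with eps ~ N(0, sigma2), wage w = a + b*y, effort
# cost k*e^2/2, CARA coefficient r. Second-best piece rate: b* = 1/(1 + r*k*sigma2).

def optimal_piece_rate(r, k, sigma2):
    """Second-best slope of the linear wage contract."""
    return 1.0 / (1.0 + r * k * sigma2)

def induced_effort(b, k):
    """The agent maximizes b*e - k*e^2/2, so chooses e = b/k."""
    return b / k

r, k = 2.0, 1.0
for sigma2 in [0.0, 0.5, 1.0, 2.0]:
    b = optimal_piece_rate(r, k, sigma2)
    print(sigma2, round(b, 3), round(induced_effort(b, k), 3))
```

With no output noise (σ² = 0) the first best is attained with a full-powered contract (b* = 1); as σ² grows, the piece rate and effort shrink, which is exactly the channel that biases even a risk-neutral firm toward safe projects.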

Holmstrom’s model doesn’t have any adverse selection, however: both employer and employee know what expected output will result from a given amount of effort. Nor is Holmstrom’s problem dynamic. Marina Halac, Navin Kartik and Qingmin Liu have taken up the unenviable task of solving the dynamic researcher contracting problem under adverse selection and moral hazard. Let a researcher be either a high type or a low type. In every period, the researcher can work on a risky project at cost c, or shirk at no cost. The project is either feasible or not, with probability b. If the employee shirks, or the project is bad, there will be no invention this period. If the employee works, the project is feasible, and the employee is a high type, the project succeeds with probability L1, and if the employee is low type, with probability L2&lt;L1. Note that as time goes on, if the employee works on the risky project, they continually update their beliefs about b. If enough time passes without an invention, the belief about b becomes low enough that everyone (efficiently) stops working on the risky project. The firm's goal is to get employees to exert optimal effort for the optimal number of periods given their type.

Here’s where things really get tricky. Who, in expectation and assuming efficient behavior, stops working on the risky project earlier conditional on not having finished the invention, the high type or the low type? On the one hand, for any belief about b, the high type is more likely to invent, hence since costs are identical for both types, the high type should expect to keep working longer. On the other hand, the high type learns more quickly whether the project is bad, and hence his belief about b declines more rapidly, so he ought expect to work for less time. That either case is possible makes solving for the optimal contract a real challenge, because I need to write the contracts for each type such that the low type does not ever prefer the high type’s contract and vice versa. To know whether these contracts are incentive compatible, I have to know what agents will do if they deviate to the “wrong” contract. The usual trick here is to use a single crossing result along the lines of “for any contract with properties P, action Y is more likely for higher types”. In the dynamic researcher problem, since efficient stopping times can vary nonmonotonically with researcher type, the single crossing trick doesn’t look so useful.
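The "learns more quickly" half of this tension is just Bayes' rule: after t periods of work with no success, the belief that the project is feasible is b_t = b(1−λ)^t / (b(1−λ)^t + (1−b)), where λ is the per-period success probability conditional on feasibility (L1 for the high type, L2 for the low). The parameters below are illustrative, not from the paper.

```python
# Posterior belief that the project is feasible after t unsuccessful periods of
# work, by Bayes' rule: b_t = b*(1-lam)**t / (b*(1-lam)**t + (1-b)).
# lam is the per-period success probability conditional on feasibility; the
# specific numbers here are made up for illustration.

def posterior(b0, lam, t):
    no_success = b0 * (1 - lam) ** t   # prob(feasible AND no success in t tries)
    return no_success / (no_success + (1 - b0))

b0, lam_high, lam_low = 0.5, 0.4, 0.2
for t in range(6):
    print(t, round(posterior(b0, lam_high, t), 3), round(posterior(b0, lam_low, t), 3))
```

For any shared prior, the high type's posterior falls strictly faster than the low type's, since each failure is stronger evidence against feasibility when success was more likely. That is exactly why the comparison of efficient stopping times can go either way: the high type is more productive at any given belief, but also becomes pessimistic sooner.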

The “simple” (where simple means a 30 page proof) case is when the higher types efficiently work longer in expectation. The information-constrained optimum involves inducing the high type to work efficiently, while providing the low type too little incentive to work for the efficient amount of time. Essentially, the high type is willing to work for less money per period if only you knew who he was. Asymmetric information means the high type can extract information rents. By reducing the incentive for the low type to work in later periods, the high type information rent is reduced, and hence the optimal mechanism trades off lower total surplus generated by the low type against lower information rents paid to the high type.

This constrained-optimal outcome can be implemented by paying scientists up front, and then letting them choose either a contract with progressively increasing penalties for lack of success each period, or a contract with a single large penalty if no success is achieved by the socially efficient high-type stopping time. Penalty contracts are also nice because they remain optimal even if scientists can keep their results secret: since secrecy just means paying more penalties, everyone has an incentive to reveal their invention as soon as they create it. The proof is worth going through if you’re into dynamic mechanism design; essentially, the authors use a clever set of relaxed problems where a form of single crossing will hold, then show that the resulting mechanism is feasible even under the actual problem constraints.

Finally, note that if there is only moral hazard (scientist type is observable) or only adverse selection (effort is observable), the efficient outcome is easy. With moral hazard, just make the agent pay the expected surplus up front, and then provide a bonus to him each period equal to the firm’s profit from an invention occurring then; we usually say in this case that “the firm is sold to the employee”. With adverse selection, we can contract on optimal effort, using total surplus to screen types as in the correlated information mechanism design literature. Even though the “distortion only at the bottom” result looks familiar from static adverse selection, the rationale here is different.

Sept 2013 working paper (No RePEc IDEAS version). The article appears to be under R&R at ReStud.

“Dynamic Commercialization Strategies for Disruptive Technologies: Evidence from the Speech Recognition Industry,” M. Marx, J. Gans & D. Hsu (2014)

Disruption. You can’t read a book about the tech industry without Clayton Christensen’s Innovator’s Dilemma coming up. Jobs loved it. Bezos loved it. Economists – well, they were a bit more confused. Here’s the story at its most elemental: in many industries, radical technologies are introduced. They perform very poorly initially, and so are ignored by the incumbent. These technologies rapidly improve, however, and the previously ignored entrants go on to dominate the industry. The lesson many tech industry folks take from this is that you ought to “disrupt yourself”. If there is a technology that can harm your most profitable business, then you should be the one to develop it; take Amazon’s “Lab126” Kindle skunkworks as an example.

There are a couple problems with this strategy, however (well, many problems actually, but I’ll save the rest for Jill Lepore’s harsh but lucid takedown of the disruption concept which recently made waves in the New Yorker). First, it simply isn’t true that all innovative industries are swept by “gales of creative destruction” – consider automobiles or pharma or oil, where the major players are essentially all quite old. Gans, Hsu and Scott Stern pointed out in a RAND article many years ago that if the market for ideas worked well, you would expect entrants with good ideas to just sell to incumbents, since the total surplus would be higher (less duplication of sales assets and the like) and since rents captured by the incumbent would be higher (less product market competition). That is, there’s no particular reason that highly innovative industries require constant churn of industry leaders.

The second problem concerns disrupting oneself or waiting to see which technologies will last. Imagine it is costly to investigate potentially disruptive technologies for the incumbent. For instance, selling mp3s in 2002 would have cannibalized existing CD sales at a retailer with a large existing CD business. Early on, the potentially disruptive technology isn’t “that good”, hence it is not in and of itself that profitable. Eventually, some of these potentially disruptive technologies will reveal themselves to actually be great improvements on the status quo. If that is the case, then, why not just let the entrant make these improvements/drive down costs/learn about market demand, and then buy them once they reveal that the potentially disruptive product is actually great? Presumably the incumbent even by this time still retains its initial advantage in logistics, sales, brand, etc. By waiting and buying instead of disrupting yourself, you can still earn those high profits on the CD business in 2002 even if mp3s had turned out to be a flash in the pan.

This is roughly the intuition in a new paper by Matt Marx – you may know his work on non-compete agreements – Gans and Hsu. Matt has also collected a great dataset from industry journals on every firm that ever operated in automated speech recognition. Using this data, the authors show that a policy by entrants of initial competition followed by licensing or acquisition is particularly common when the entrants come in with a “disruptive technology”. You should see these strategies, where the entrant proves the value of their technology and the incumbent waits to acquire, in industries where ideas are not terribly appropriable (why buy if you can steal?) and entry is not terribly expensive (in an area like biotech, clinical trials and the like are too expensive for very small firms). I would add that you also need complementary assets to be relatively hard to replicate; if they aren’t, the incumbent may well wind up being acquired rather than the entrant should the new technology prove successful!

Final July 2014 working paper (RePEc IDEAS). The paper is forthcoming in Management Science.

“Strategic Experimentation with Poisson Bandits,” G. Keller & S. Rady (2010)

The multiarmed bandit is a true workhorse of modern mathematical economics. In a bandit problem, there are multiple arms you can pull, as in some types of slot machines. You have beliefs about the distribution of payoffs from pulling a given arm. For instance, there may be a safe arm which yields an expected one coin every time you pull it, and a risky arm which, with prior probability 1/3, yields an expected 2 coins per pull and, with prior probability 2/3, yields 0. Returns are generally discounted. There is often a “value of experimentation”: agents will pull an arm with a lower current expected value than another arm because, for instance, learning that the risky arm above is the type with expected value 2 will increase my payoff from now until infinity, while I only pay the cost of experimenting now. In many single-person bandit problems, the optimal arm to pull can be found simply using a formula called the Gittins index, derived by J.C. Gittins in the 1970s.
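The value of experimentation in that two-armed example is easy to compute if we simplify so that a single pull of the risky arm fully reveals its type (a sketch under that assumption, not the general Gittins machinery; function names are mine):

```python
def value_safe(delta):
    # Discounted value of pulling the safe arm (1 coin per pull) forever.
    return 1 / (1 - delta)

def value_experiment(delta, p_good=1/3):
    # Pull the risky arm once, assuming one pull fully reveals its type:
    # the good type pays 2 every period, the bad type pays 0.
    stick_with_good = 2 + delta * 2 / (1 - delta)   # keep pulling the risky arm
    revert_to_safe = 0 + delta * 1 / (1 - delta)    # switch back to the safe arm
    return p_good * stick_with_good + (1 - p_good) * revert_to_safe
```

Even though the risky arm's myopic expected payoff (2/3 of a coin) is below the safe arm's, a patient agent experiments: at delta = 0.9, value_experiment exceeds value_safe, while a very impatient agent (delta = 0.1) sticks with the safe arm.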

As far as I know, the first explicitly economic bandit paper is Rothschild’s 1974 JET “A two-armed bandit theory of market pricing.” Rothschild tries to explain why prices may be disperse for the same product over time. His explanation is simple: consumer demand is unknown, and I try to learn demand by experimenting with prices. This produces a very particular form of price dispersion. Since Rothschild, a huge amount of economics work on bandit problems involves externalities: experimenting with the risky arm is socially valuable, but I bear all the cost privately, and don’t get all the benefit. This has been used, in many forms, extensively in the R&D literature by a number of economists you may know, like Bergemann, Besanko and Hopenhayn. Keller and Rady, along with Cripps, have a famous 2005 Econometrica involving exponential bandits (i.e., a safe arm and an arm that is either a total failure or a success, with the success learned while pulling that arm according to an exponential time distribution).

This 2010 paper, in Theoretical Economics, expands the R&D model to Poisson bandits. There are two arms being pulled by N firms in continuous time. One is a safe arm which pays a flow rate of s, and one is a risky arm which pays either an expected flow of s'>s or of s''<s. The risky arm gives payoffs in lumps, so the only difference between a risky arm of type s' and one of type s'' is that the Poisson arrival rate is slower for s''. This means that a single “success” on the risky arm does not tell me conclusively whether the arm is the good type or the bad type.
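The inconclusiveness of a single lump is just Bayes' rule with Poisson likelihoods. A quick sketch (notation mine: lam_good and lam_bad are the arrival rates under the s' and s'' types):

```python
import math

def posterior_after_lump(p, lam_good, lam_bad):
    # One Poisson arrival multiplies the odds of the good type by
    # lam_good / lam_bad -- informative, but never conclusive if lam_bad > 0.
    odds = (p / (1 - p)) * (lam_good / lam_bad)
    return odds / (1 + odds)

def posterior_after_quiet(p, lam_good, lam_bad, t):
    # A quiet interval of length t multiplies the odds of the good type by
    # exp(-(lam_good - lam_bad) * t), so beliefs drift down between lumps.
    odds = (p / (1 - p)) * math.exp(-(lam_good - lam_bad) * t)
    return odds / (1 + odds)
```

Contrast this with the exponential-bandit case of Keller, Rady and Cripps, where lam_bad = 0 and a single success would push the posterior all the way to 1.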

For the usual free-riding reasons discussed above, experimentation will be suboptimal. But there is another interesting effect here. Let p1* be the belief about the risky arm being of type s' such that a lone firm would pull the risky arm if its belief were above p1* and the safe arm if it were below p1*. Keller and Rady prove that in any Markov perfect equilibrium with N firms, I am willing to spend some of my time pulling the risky arm even when my belief is below p1*. Why? They call this the “encouragement effect.” If there is just me, then the only benefit of pulling the risky arm when I am near p1* is that I might learn the risky arm is better than I previously thought by getting a Poisson success. But with N firms, a Poisson success both gives me this information and, by improving everyone’s belief about the quality of the risky arm, encourages others to experiment with the risky arm in the future. Since payoffs exhibit strategic complementarity, I benefit from their future experimentation.

There is one other neat result, which involves some technical tricks as well. We usually solve just for the symmetric MPE, for simplicity. In the symmetric MPE, which is unique, we all mix between the safe and risky arm as long as we are above some cutoff belief P. But as we get closer and closer to P, we spend arbitrarily little effort on the risky arm, so our posterior, given bad news, decreases only very slowly, and we never reach P in finite time. This suggests that an asymmetric MPE may do better, even in a Pareto sense. Consider the following: near P, have one person experiment with full effort if the current belief is in some set B1, and have the other experiment if the current belief is in B2. If it is my turn to experiment, I have multiple reasons to exert full effort: most importantly, because B1 and B2 are set up so that if I change the belief enough through my experimentation, the other person will take over the cost of experimenting. Characterizing the full set of MPE is difficult, of course.

https://tspace.library.utoronto.ca/bitstream/1807/27188/1/20100275.pdf (Final version in TE issue 5, 2010. Theoretical Economics is an amazing journal. It is completely open access, allows reuse and republication under a generous CC license, doesn’t charge any publication fee, doesn’t charge any submission fee as of now, and has among the fastest turnaround time in the business. Is it any surprise that TE has, by many accounts, passed Elsevier’s JET as the top field journal in micro theory?)

“Nuclear Power Reactors: A Study in Technological Lock-In,” R. Cowan (1990)

If you want to start a heated debate among historians of technology, just express an opinion about the importance of path dependence and then watch the sparks fly. Do “bad” technologies prevail because of random factors, what Brian Arthur calls “historical small events”? Or are what look like bad technologies actually good ones that prevailed for sensible reasons? Could optimal policy improve things? More on that last question in the final paragraph.

Robin Cowan, now at Maastricht, is an economist right in my sweet spot: a theorist interested in technology who enjoys the occasional dig through historical archives. His PhD dissertation concerns conditions for technological lock-in. Basically, increasing returns to scale (learning-by-using, for example) and unknown future benefits of a given research line (here is where the multiarmed bandit comes in) generally will lead to 1) a sole technology dominating the market, 2) each technology, regardless of underlying quality, having a positive probability of being that dominant technology, and 3) cycling between technologies early in the lifecycle. In the present paper, Cowan examines the history of nuclear power reactors through this framework; apropos to the previous post on this site, I think what Cowan does is a much more sensible test of a theory than any sort of statistical process.
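Feature 2 – each technology, regardless of quality, dominating with positive probability – already shows up in the simplest increasing-returns toy model, an Arthur-style Polya urn (my own illustrative sketch, not Cowan's bandit model): each new adopter picks a technology with probability equal to its current installed-base share.

```python
import random

def simulate_adoption(n_adopters, seed=None):
    # Polya-urn adoption: each adopter picks technology A with probability
    # equal to A's current market share. Start each technology with one user.
    rng = random.Random(seed)
    a, b = 1, 1
    for _ in range(n_adopters):
        if rng.random() < a / (a + b):
            a += 1
        else:
            b += 1
    return a / (a + b)  # final market share of technology A

# Across runs the final share is spread over (0, 1): sometimes A locks in,
# sometimes B does, even though the two technologies are identical here.
shares = [simulate_adoption(500, seed=s) for s in range(200)]
```

With identical technologies the limiting share is (approximately) uniform, so "historical small events" alone determine the winner; adding a quality gap shifts the odds but does not eliminate the chance that the worse technology locks in.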

Nuclear power is interesting because, at least as of 1990, light water nuclear power reactors are dominant despite the fact that many other types of reactors appear to have underlying quality/cost combinations as good or better. How did light water come to dominate? After WW2, the US had a monopoly on enriched uranium production, and was unwilling to share because of national security concerns. Development of nuclear power technology was also driven by military concerns: nuclear submarines could stay underwater longer, for example. A military focus led policymakers to concentrate research effort on small reactors which could be developed quickly.

In the 1950s, following the Soviet atomic bomb, US nuclear power policy shifted somewhat toward developing power for civilians. There was a belief that the Soviets would develop allies in exchange for Soviet nuclear power plants, and so the US began pushing “Atoms for Peace” civilian nuclear power to counter that threat. There was an urgency to such development, and because light water reactors had already been developed and accepted for submarine use, they were the quickest to develop for civilian power plant export. A handful of US firms with experience in light water heavily subsidized the capital cost of their plants, which led to rapid adoption in the early 1960s. Because of learning-by-doing, light water plant costs quickly decreased, and because of network effects – more users means more knowledge of potential safety risks, for example – a number of nations adopted light water plants soon after. During this period, other technologies like heavy water and gas graphite suffered temporary setbacks which you can think of as a bad draw in a multiarmed bandit. Because of future uncertainty in the bandit model, and learning-by-using, light water plants locked themselves in. As of 1990, at least, Cowan notes that experts both then, as well as in the 50s and 60s, did not believe that light water was necessarily the best civilian nuclear power technology.

Much more detail is found in the paper. One thing to worry about when reading, though, is the conflation of path dependence in general and socially suboptimal path dependence. Imagine two technologies with identical output and marginal cost, but one with fixed research cost 7 and one with fixed research cost 10. If the second is adopted by everyone, it appears naively that the “wrong” technology has won out. But what if the cost of 10 was already borne by military researchers developing a similar product? In that case, the second technology is socially optimal. The multiarmed bandit has similar issues – in the face of uncertainty about nuclear power technology quality, it is not obvious that a social planner would have done anything different; indeed, many important decisions were made by the US Navy. I only mention this distinction because a friend and I have a model of technology that generates similar path dependence, but in a way that can absolutely be countered by better policy, and I’m not sure how Cowan’s historical example speaks to our model.

http://dimetic.dime-eu.org/dimetic_files/cowan1990.pdf (Final Journal of Economic History 1990 version)

“How Demanding is the Revealed Preference Approach to Demand?,” T. Beatty & I. Crawford (2011)

If you’ve read this site at all, you know that I see little value in “testing” economic theories, but if we’re going to do it, we ought at least do it in a way that makes a bit of sense. There are a ton of studies testing whether agents (here meaning not just humans; Chen and coauthors have a series of papers about revealed preference and other forms of maximizing behavior in Capuchin monkeys!) have preferences that can be described by the standard model: a concave, monotonic, continuous utility function that is time-invariant. Generally, the studies do find such maximizing behavior. But this may mean nothing: a theory that is trivially satisfied will never be shown to violate utility maximization, and indeed lots of experiments and empirical datasets see so little variation in prices that nearly any set of choices can be rationalized.

Beatty and Crawford propose a simple fix here. Consider an experiment with only two goods and two price/income bundles. There is a feasible mixture among those two goods for each bundle. Consider the share of income under each price/income bundle spent on each of the two goods. If, say, 75% of income is spent on Good A under price/income bundle 1, then utility maximization may be consistent with spending anywhere between 0 and 89% of income on Good A under price/income bundle 2. Imagine drawing a square with “income share spent on Good A under price/income bundle 1” on the x-axis, and “income share on A under bundle 2” on the y-axis. Some sets of choices will lie in a part of that square which is incompatible with utility maximization. The greater the proportion of total area which is incompatible with utility maximization, the more restrictive a test of utility maximizing behavior will be. The idea extends in a straightforward way to tests with N goods and M choices.

Beatty and Crawford assume you want a measure of “how well” agents do in a test of revealed preference as a function of both the pass rate (what proportion of the sample does not reject utility maximizing behavior) and the test difficulty (how often a random number generator selecting bundles would pass); if this all sounds like redefining the concept of statistical power, it should. It turns out that r minus a, where r is the pass rate and a is the test difficulty, has some nice axiomatic properties; I’m not totally convinced this part of the paper is that important, so I’ll leave it for you to read. The authors then apply this idea to some Spanish consumption data, where households were tracked for eight quarters. They find that about 96% of households in the sample pass: they show no purchases which violate utility maximizing behavior. But the variation in prices and quarterly income is so minimal that utility maximizing behavior imposes almost no constraints: 91% of random number generators would “pass” given the same variation in prices and incomes.

What do we learn from an exercise like this? There is definitely some benefit: if you want to design experiments concerning revealed preference, the measure in the present paper is useful indeed for helping choose precisely what variation in incomes and prices to use in order to subject revealed preference to a “tough” test. But this assumes you want to test at all. “Science is underdetermined,” they shout from the rooftops! Even if people showed behavior that “rejected” utility maximization, we would surely ask, first, by how much; second, are you sure “budget” and “price” are measured correctly (there is Varian’s error in price measurement, and no one is using lifetime income adjusted for credit constraints when talking about “budgets”); third, are you just rejecting concavity and not maximizing behavior; fourth, are there not preference shocks over a two-year period, such as my newfound desire to buy diapers after a newborn arrives; and so on. I think such critiques would be accepted by essentially any economist. Those of the philosophic school that I like to discuss on this site would further note that the model of utility maximization is not necessarily meant to be predictive, that we know it is “wrong” in that clearly people do not always act as if they are maximizers, and that the Max U model is nonetheless useful as an epistemic device for social science researchers.

http://www.tc.umn.edu/~tbeatty/working_papers/revisedpowerpaper.pdf (Final working paper – final version published in AER October 2011)

“Stakes Matter in Ultimatum Games,” S. Andersen, S. Ertac, U. Gneezy, M. Hoffman & J. List (2011)

[Update, 9/7/2011: A comment at Cheap Talk mentioned a new paper by Nicholas Bardsley which I find quite relevant to the final paragraph of this post. Essentially, Bardsley is able to completely change (as far as I’m concerned) the “sharing” characteristic of the dictator game just by changing the action set available to players; if the dictator can also “take” money, and not simply share, then they do take indeed. The Hawthorne Effect Is Real, shout the villagers from the mountaintop.]

Here is one more experimental paper, which I believe is forthcoming in the AER as well. Experimentalists love the Ultimatum Game. In the Ultimatum Game, two anonymous people are matched and one of them is given X dollars. She is told to propose a split of the money between herself and the other player. The other player can then either accept his share of the split, or reject, in which case both parties get nothing. Tons of experiments over the past 20 years – everywhere from U.S. undergraduate labs to tribes in the Amazon – have found offers that tend to be rather high (30-50%) alongside high rejection rates on low offers. This is “strange” (more on this shortly) to economists because the unique subgame perfect Nash equilibrium is to offer one penny, and for the responder to accept. Even if you think that the so-called paradox is nothing of the sort – rather, people are unused to one-shot games and are instead trying to develop reputation in a repeated game called Life – there is an even stranger stylized fact: changing stakes doesn’t seem to affect behavior. That is, whether the stakes are 1 dollar, 10 dollars or 100 dollars, people still reject. Why aren’t people responding to incentives at all?
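The “one penny” benchmark comes from backward induction on monetary payoffs alone; with a discrete pie it takes only a few lines to verify (a toy check, assuming a money-maximizing responder who accepts any strictly positive offer and rejects an offer of exactly zero):

```python
def spne_offer(pie_in_cents=1000):
    # Backward induction: a money-maximizing responder accepts any strictly
    # positive offer, so the proposer's best response is the smallest offer
    # that gets accepted.
    best_offer, best_payoff = 0, 0  # offering zero is rejected here
    for offer in range(1, pie_in_cents + 1):
        proposer_payoff = pie_in_cents - offer  # accepted for sure
        if proposer_payoff > best_payoff:
            best_offer, best_payoff = offer, proposer_payoff
    return best_offer  # one cent, regardless of the pie size
```

The experimental puzzle is precisely that observed offers and rejections sit nowhere near this prediction, whatever the pie size in the loop above.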

I remember a study a few years ago, from Indonesia perhaps, where many days worth of wages were being rejected seemingly out of spite. (And speaking of spite, ultimatum game papers are great examples of economists abusing language. One man’s “unfair offers were consistently rejected” is another man’s “primitive spite seems more important to responders than rational thought.”)

Andersen et al (more on this also in a second) play the ultimatum game in India using stakes that range up to a year’s income. And unsurprisingly, stakes matter a lot. No matter how low the split, only once is an offer rejected at the year’s-income stakes, and that offer was less than 10% of the pie. As stakes increase from 20 rupees up to 20,000, the rejection rate for a given split falls, and it seems to fall fastest when stakes get very large. The takeaway: even given all of the experimental results on the Ultimatum game, spite is probably not terribly important vis-a-vis more standard incentives across the range of “very important economic phenomena.” None of this is to say that CEOs won’t cost their firm millions out of spite – surely they sometimes do – but claims that human nature is hardwired for fairness or spite (or whatever you want to call it) even at the expense of standard maximizing behavior are limited claims indeed.

Two final notes here. First, I think economists need to come to some conclusion concerning norms on experimental papers. Econ has long had a standard of giving author billing only to those who were essential for the idea and the completion of a paper – rarely has this meant more than three authors. Credit for data collection, straightforward math, coding, etc. has generally been given in the acknowledgments. A lot of econ psych and experimental work strikes me as fighting that norm: five and six authors have become standard. (I should caveat this by saying that in the present paper, I have no idea how workload was divided; rather, I think it’s undeniable that more generally the work expected of a coauthor in experimental papers is lower than that which was traditional in economics.)

Second, and I’m sure someone has done this but I don’t have a cite, the “standard” instructions in ultimatum games seem to prime the results to a ridiculous degree. Imagine the following exercise. Give 100 dollars to a research subject (Mr. A). Afterwards, tell some other subject (Ms. B) that 100 dollars was given to Mr. A. Tell Mr. A that the other subject knows he was given the money, but don’t prime him to “share” or “offer a split” or anything similar. Later, tell Ms. B that she can, if she wishes, reverse the result and take the 100 dollars away from A – if she does so, had Mr. A happened to have given her some of the money, that would also be taken. I hope we can agree that if you did such an experiment, A would share no money and B would show no spite, as neither has been primed to see the 100 dollars as something that should have been shared in the first place. One doesn’t normally expect anonymous strangers to share their good fortune with you, surely. That is, feelings of spite, jealousy and fairness can be, and are, primed by researchers. I think this is worth keeping in mind when trying to apply the experimental results on ultimatum games to the real economy.

http://openarchive.cbs.dk/bitstream/handle/10398/8244/ECON_wp1-2011.pdf?sequence=1 (January 2011 working paper, forthcoming in the AER)

“A Continuous Dilemma,” D. Friedman & R. Oprea (2011)

I feel pretty confident that the two lab experiment papers I will write about today will represent the only such posts on that field here for quite a while. Both results are interesting, but as an outsider to experimental econ, I’m quite surprised that these represent the “state of the art”, and at some level both must since both are forthcoming in the AER.

In the present paper, Friedman and Oprea run three versions of the prisoner’s dilemma: a one-shot game, a one-minute continuous time game where players must “wait” 7.5 seconds to react to an opponent’s change of strategy, and a one-minute continuous time game with no limit on reaction speed aside from human reaction time. We’ve known since Nash that finitely-repeated prisoner’s dilemmas can only support defect-every-period as an equilibrium (by a simple backward induction unraveling argument), but that infinitely-repeated prisoner’s dilemmas can support any payoffs between the cooperate payoff and the defect payoff in equilibrium (by the Fudenberg-Maskin folk theorem). Two results from the 1980s save us a bit here. First, as the underrated Radner pointed out, if you can react quickly to an opponent’s deviation, then you can only lose a tiny bit by cooperating and hoping your opponent cooperates also. That is, with a very high number of periods, “cooperate until almost the very end” is an almost-dominant strategy. If your opponent defects, you defect almost immediately afterward, and thereafter both players play the “unique” equilibrium defect-defect. If your opponent does not defect, you both continue to cooperate until the very end. Regardless of your opponent’s strategy, “cooperate until your opponent defects for the first time” earns only a tiny bit less than the best possible payoff against that strategy, so such strategies form an epsilon-equilibrium. Second, Simon and Stinchcombe (1989) show that in continuous time games, backward induction cannot be applied and something like the folk theorem holds.
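Radner's bound is easy to check numerically. In the sketch below (my own toy payoffs: temptation 3, reward 2, punishment 1, sucker 0), the opponent plays a grim trigger that reacts `lag` periods after observing a defection, and we compute the most a player can gain by defecting at the best possible time rather than cooperating throughout:

```python
def deviation_gain(periods, lag):
    # Stage payoffs to the row player: (C,C)=2, (D,C)=3, (D,D)=1, (C,D)=0.
    # The opponent plays grim trigger, switching to D `lag` periods after the
    # first observed defection. Compare the best-timed switch to permanent
    # defection against cooperating in every period.
    cooperate_total = 2 * periods
    best_deviation = 0
    for start in range(periods):              # defect from `start` onward
        total = 2 * start                     # mutual cooperation before start
        for t in range(start, periods):
            opponent_cooperates = t < start + lag
            total += 3 if opponent_cooperates else 1
        best_deviation = max(best_deviation, total)
    return best_deviation - cooperate_total
```

With a one-period reaction lag the gain is a single unit no matter how long the game is, so the per-period epsilon of the cooperate-until-defected-upon profile shrinks like 1/T; slower reactions (larger `lag`) scale the gain up proportionally.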

Friedman and Oprea test this in the lab. Basically none of their subjects cooperate in the one-shot game, and cooperation steadily increases as the minimum wait to react drops from 30 seconds toward continuous time. In the treatment where the only restriction on reaction time is human response time, cooperation occurs 80-90% of the time, essentially encompassing the entire game in every session except for the last few seconds. A modification of Radner’s insight shows that this type of cutoff strategy is an epsilon-equilibrium, and that the expected amount of cooperation given the limits on reaction time is reasonable. The authors do not fully solve for the (epsilon-)equilibria of their game – I have no idea how they got away with this, but I would love to know what they said to the referees! In any case, the intuition for why cutoff strategies are nearly dominant seems reasonable, although it should be noted that this intuition is essentially Radner’s and not anything novel to the present paper.

So what’s the takeaway? For a theoretically-minded reader, I think the experimental results here are simply more justification for taking care in interpreting Nash predictions for actions in lengthy, finitely-repeated games. Even for modeling purposes, it might be reasonable to see more work on epsilon-equilibria in, say, oligopoly behavior; cartel pricing is much easier to support when prices and quantities are very quickly reported if we look at that type of equilibria. I still find it a bit strange that the authors do not, as far as I can tell, attempt to distinguish between different types of theoretical explanation for high rates of cooperation in repeated games. Is there infection from beliefs a la Kreps et al’s “Gang of Four” paper? (This does not appear to be the case to me, since I believe Gang of Four can sustain cooperation all the way to the horizon.) Would bounded rationality matter? (Both players’ complete action profile over time is available throughout the game in the present paper.) There are many other explanations that could be tested here. (Indeed, Bigoni et al have a new paper following up the present results with some discussion of infinite versus finite horizon continuous time games.)

http://faculty.arts.ubc.ca/roprea/prisonerEX.pdf (Dec 2010 working paper. Final version forthcoming in the AER. If you’re coming from a theory background, there are many norms in experimental econ that will strike you as strange – writing about an experiment with 36 American undergraduates who self-select into lab studies as if it were representative of human behavior, for example – but I’m afraid that battle has already been lost. Best just to read experimental work for what it is; some interesting insights for theory lie inside even despite these peccadillos.)
