Category Archives: Experimentation

“Strategic Experimentation with Poisson Bandits,” G. Keller & S. Rady (2010)

The multiarmed bandit is a true workhorse of modern mathematical economics. In a bandit problem, there are multiple arms you can pull, as on some types of slot machines. You have beliefs about the distribution of payoffs from pulling a given arm. For instance, there may be a safe arm which yields an expectation of one coin every time you pull it, and a risky arm which, with prior probability 1/3, yields an expectation of 2 coins per pull and, with prior probability 2/3, yields an expectation of 0. Returns are generally discounted. There is often a “value of experimentation”: agents will pull an arm with a lower current expected value than another arm because, for instance, learning that the risky arm above is the type with expected value 2 will increase my payoff from now until infinity, while I pay the cost of experimenting only now. In many single-person bandit problems, the optimal arm to pull can be computed simply using a formula called the Gittins index, derived by J.C. Gittins in the 1970s.
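
To make the “value of experimentation” concrete, here is a minimal numerical sketch of the two-armed example above, under a simplifying assumption of my own (a single pull of the risky arm fully reveals its type) and with a made-up discount factor:

```python
# Toy illustration of the value of experimentation (my own simplification,
# not a general bandit solution): one pull of the risky arm reveals its type.
delta = 0.9                        # made-up discount factor

safe_forever = 1 / (1 - delta)     # pull the safe arm every period: 1 + delta + delta^2 + ...

# Pull the risky arm once: with prior 1/3 it is the good type and pays 2 per
# period forever; with prior 2/3 it pays 0 today and we switch back to the
# safe arm from tomorrow on.
experiment = (1/3) * 2 / (1 - delta) + (2/3) * delta * 1 / (1 - delta)

print(safe_forever, experiment)    # 10.0 vs. ~12.67
```

Even though the risky arm’s current expected payoff (2/3 of a coin) is below the safe arm’s, experimenting first is worth more because the information it generates pays off forever.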

As far as I know, the first explicitly economic bandit paper is Rothschild’s 1974 JET article “A two-armed bandit theory of market pricing.” Rothschild tries to explain why prices may be dispersed for the same product over time. His explanation is simple: consumer demand is unknown, and the firm tries to learn demand by experimenting with prices. This produces a very particular form of price dispersion. Since Rothschild, a huge amount of economic work on bandit problems has involved externalities: experimenting with the risky arm is socially valuable, but I bear the cost privately and do not capture all of the benefit. This idea has been used, in many forms, throughout the R&D literature by a number of economists you may know, like Bergemann, Besanko and Hopenhayn. Keller and Rady, along with Cripps, have a famous 2005 Econometrica on exponential bandits (i.e., a safe arm and a risky arm that is either a total failure or a success, with success learned while pulling that arm according to an exponential time distribution).

This 2010 paper, in Theoretical Economics, extends the R&D model to Poisson bandits. There are two arms being pulled by N firms in continuous time. One is a safe arm which pays a known flow rate s, and the other is a risky arm whose expected payoff is either s'>s or s''<s. The risky arm pays off in lumps, and the only difference between the risky arm of type s' and the one of type s'' is that the Poisson arrival rate of those lumps is slower for s''. This means that a single “success” on the risky arm does not tell me conclusively whether the arm is the good type or the bad type.
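
To see why a single lump is not conclusive, here is a minimal sketch of the Bayesian belief dynamics for a single firm devoting full effort to the risky arm. The arrival rates, prior, horizon and true type below are all made up for illustration; the s' and s'' flow payoffs in the paper just scale these arrival rates by the expected lump size.

```python
import numpy as np

# Belief updating for a Poisson bandit: lumps arrive at rate lam_good if the
# arm is good and lam_bad if it is bad (made-up rates). On an arrival the
# belief jumps up by Bayes' rule but never to 1; with no arrival it drifts down.
rng = np.random.default_rng(0)
lam_good, lam_bad = 2.0, 0.5
dt, T = 0.01, 5.0
p = 1 / 3                       # prior that the risky arm is the good type
true_lam = lam_bad              # suppose the arm is actually the bad type

for _ in range(int(T / dt)):
    if rng.random() < true_lam * dt:                               # a lump arrives
        p = p * lam_good / (p * lam_good + (1 - p) * lam_bad)      # discrete upward jump
    else:                                                          # no news is (mildly) bad news
        p -= p * (1 - p) * (lam_good - lam_bad) * dt

print(f"posterior after {T} units of full-time experimentation: {p:.3f}")
```

A success pushes the belief up by a finite amount rather than settling the question, and quiet stretches push it back down – the gradual-learning feature the rest of the analysis turns on.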

For the usual free-riding reasons described above, experimentation will be suboptimal. But there is another interesting effect here. Let p1* be the belief that the risky arm is of type s' such that a lone firm would pull the risky arm if its belief were above p1* and the safe arm if its belief were below p1*. Keller and Rady prove that in any Markov perfect equilibrium with N firms, I am willing to spend some of my time pulling the risky arm even when my belief is below p1*. Why? They call this the “encouragement effect.” If I am alone, the only benefit of pulling the risky arm when my belief is near p1* is that I might learn the risky arm is better than I previously thought by getting a Poisson success. But with N firms, a Poisson success both gives me this information and, by improving everyone’s belief about the quality of the risky arm, encourages the others to experiment with the risky arm in the future. Since payoffs exhibit strategic complementarity, I benefit from their future experimentation.

There is one other neat result, which involves some technical tricks as well. We usually solve just for the symmetric MPE, for simplicity. In the symmetric MPE, which is unique, all firms mix between the safe and risky arm as long as the common belief is above some cutoff P. But as the belief gets closer and closer to P, each firm spends arbitrarily little effort on the risky arm, so the posterior, given bad news, falls only very slowly and never actually reaches P in finite time. This suggests that an asymmetric MPE may do better, even in a Pareto sense. Consider the following: near P, have one player experiment with full effort if the current belief is in some set B1, and have the other player experiment if the current belief is in B2. If it is my turn to experiment, I have multiple reasons to exert full effort, most importantly that B1 and B2 are constructed so that if I move the belief enough through my experimentation, the other player takes over the cost of experimenting. Characterizing the full set of MPE is difficult, of course.

https://tspace.library.utoronto.ca/bitstream/1807/27188/1/20100275.pdf (Final version in TE issue 5, 2010. Theoretical Economics is an amazing journal. It is completely open access, allows reuse and republication under a generous CC license, doesn’t charge any publication fee, doesn’t charge any submission fee as of now, and has among the fastest turnaround times in the business. Is it any surprise that TE has, by many accounts, passed Elsevier’s JET as the top field journal in micro theory?)

“Nuclear Power Reactors: A Study in Technological Lock-In,” R. Cowan (1990)

If you want to start a heated debate among historians of technology, just express an opinion about the importance of path dependence and then watch the sparks fly. Do “bad” technologies prevail because of random factors, what Brian Arthur calls “historical small events”? Or are what look like bad technologies actually good ones that prevailed for sensible reasons? Could optimal policy improve things? More on that last question in the final paragraph.

Robin Cowan, now at Maastricht, is an economist right in my sweet spot: a theorist interested in technology who enjoys the occasional dig through historical archives. His PhD dissertation concerns conditions for technological lock-in. Basically, increasing returns to scale (learning-by-using, for example) and unknown future benefits of a given research line (here is where the multiarmed bandit comes in) generally lead to 1) a sole technology dominating the market, 2) each technology, regardless of underlying quality, having a positive probability of being that dominant technology, and 3) cycling between technologies early in the lifecycle. In the present paper, Cowan examines the history of nuclear power reactors through this framework; apropos of the previous post on this site, I think what Cowan does is a much more sensible test of a theory than any sort of purely statistical exercise.

Nuclear power is interesting because, at least as of 1990, light water reactors are dominant despite the fact that many other reactor types appear to have underlying quality/cost combinations at least as good. How did light water come to dominate? After WW2, the US had a monopoly on enriched uranium production, and was unwilling to share because of national security concerns. Development of nuclear power technology was also driven by military concerns: nuclear submarines could stay underwater longer, for example. A military focus led policymakers to concentrate research effort on small reactors which could be developed quickly.

In the 1950s, following the Soviet atomic bomb, US nuclear power policy shifted somewhat toward developing power for civilians. There was a belief that the Soviets would win allies by offering them Soviet nuclear power plants, so the US began pushing “Atoms for Peace” civilian nuclear power to counter that threat. There was an urgency to such development, and because light water reactors had already been developed and accepted for submarine use, they were the quickest to ready for civilian power plant export. A handful of US firms with experience in light water heavily subsidized the capital cost of their plants, which led to rapid adoption in the early 1960s. Because of learning-by-doing, light water plant costs quickly decreased, and because of network effects – more users means more knowledge of potential safety risks, for example – a number of nations adopted light water plants soon after. During this period, other technologies like heavy water and gas graphite suffered temporary setbacks, which you can think of as bad draws in a multiarmed bandit. Because of future uncertainty in the bandit model, plus learning-by-using, light water plants locked themselves in. Cowan notes that experts in 1990, as well as back in the 50s and 60s, did not believe that light water was necessarily the best civilian nuclear power technology.

Much more detail is found in the paper. One thing to worry about when reading, though, is the conflation of path dependence in general with socially suboptimal path dependence. Imagine two technologies with identical output and marginal cost, but one with fixed research cost 7 and the other with fixed research cost 10. If the second is adopted by everyone, it appears naively that the “wrong” technology has won out. But what if the cost of 10 was already borne by military researchers developing a similar product? In that case, adopting the second technology is socially optimal. The multiarmed bandit has similar issues – in the face of uncertainty about nuclear power technology quality, it is not obvious that a social planner would have done anything different; indeed, many important decisions were made by the US Navy. I only mention this distinction because a friend and I have a model of technology that generates similar path dependence, but in a way that can absolutely be countered by better policy, and I’m not sure how Cowan’s historical example speaks to our model.

http://dimetic.dime-eu.org/dimetic_files/cowan1990.pdf (Final Journal of Economic History 1990 version)

“How Demanding is the Revealed Preference Approach to Demand?,” T. Beatty & I. Crawford (2011)

If you’ve read this site at all, you know that I see little value in “testing” economic theories, but if we’re going to do it, we ought at least do it in a way that makes a bit of sense. There are a ton of studies testing whether agents (here meaning not just humans; Chen and coauthors have a series of papers about revealed preference and other forms of maximizing behavior in capuchin monkeys!) have preferences that can be described by the standard model: a concave, monotonic, continuous utility function that is time-invariant. Generally, the studies do find such maximizing behavior. But this may mean nothing: a test that is trivially passed will never reveal a violation of utility maximization, and indeed lots of experiments and empirical datasets contain so little variation in prices that nearly any set of choices can be rationalized.

Beatty and Crawford propose a simple fix. Consider an experiment with only two goods and two price/income bundles, where under each bundle the consumer chooses some feasible mixture of the two goods. Now consider the share of income spent on each good under each price/income bundle. If, say, 75% of income is spent on Good A under price/income bundle 1, then utility maximization may be consistent with spending anywhere between, for example, 0 and 89% of income on Good A under price/income bundle 2. Imagine drawing a square with “income share spent on Good A under bundle 1” on the x-axis and “income share spent on Good A under bundle 2” on the y-axis. Some sets of choices will lie in a part of that square which is incompatible with utility maximization. The greater the proportion of the total area which is incompatible with utility maximization, the more restrictive a test of utility maximizing behavior will be. The idea extends in a straightforward way to tests with N goods and M choices.
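
Here is a minimal sketch of that logic in the two-good, two-budget case, in the spirit of the Bronars-style power measure the authors build on: draw random budget shares on each budget and count the fraction of the square that would violate revealed preference. The prices, incomes and number of draws are made up for illustration, not taken from the paper.

```python
import numpy as np

# Two budgets over two goods (made-up prices and incomes). For each random
# pair of budget shares, check whether the implied choices violate WARP,
# which for two observations is equivalent to GARP.
rng = np.random.default_rng(0)
p = np.array([[1.0, 2.0],    # prices of goods A and B under budget 1
              [2.0, 1.0]])   # prices under budget 2
m = np.array([10.0, 10.0])   # income under each budget

def violates_warp(shares):
    """shares[t] = income share spent on good A under budget t."""
    x = np.column_stack([shares * m / p[:, 0], (1 - shares) * m / p[:, 1]])
    r01 = p[0] @ x[1] <= m[0] + 1e-12   # bundle 2 was affordable when bundle 1 was chosen
    r10 = p[1] @ x[0] <= m[1] + 1e-12   # bundle 1 was affordable when bundle 2 was chosen
    return r01 and r10 and not np.allclose(x[0], x[1])

draws = rng.uniform(size=(50_000, 2))
power = np.mean([violates_warp(s) for s in draws])
print(f"share of random choices the test would catch: {power:.1%}")
```

With these particular budgets only about a tenth of the square is ruled out, which is exactly the sense in which a revealed preference “test” can be nearly vacuous when budgets barely move.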

Beatty and Crawford want a measure of “how well” agents do in a test of revealed preference as a function of both the pass rate (what proportion of the sample does not reject utility maximizing behavior) and the test difficulty (how often randomly generated choices would pass); if this all sounds like redefining the concept of statistical power, it should. It turns out that r minus a, where r is the pass rate and a is the share of random behavior that would pass, has some nice axiomatic properties; I’m not totally convinced this part of the paper is that important, so I’ll leave it for you to read. The authors then apply this idea to some Spanish consumption data, where households were tracked for eight quarters. They find that about 96% of households in the sample pass: they show no purchases which violate utility maximizing behavior. But the variation in prices and quarterly income is so minimal that utility maximization imposes almost no constraints: 91% of randomly generated choice sequences would “pass” given the same variation in prices and incomes.

What do we learn from an exercise like this? There is definitely some benefit: if you want to design experiments concerning revealed preference, the measure in the present paper is indeed useful for choosing precisely what variation in incomes and prices to use in order to subject revealed preference to a “tough” test. But this assumes you want to test at all. “Science is underdetermined,” they shout from the rooftops! Even if people showed behavior that “rejected” utility maximization, we would surely ask, first, by how much; second, whether “budget” and “price” are measured correctly (there is Varian’s work on measurement error in prices, and no one is using lifetime income adjusted for credit constraints when talking about “budgets”); third, whether we are just rejecting concavity rather than maximizing behavior; fourth, whether there are preference shocks over a two-year period, such as my newfound desire to buy diapers after a newborn arrives; and so on. I think such critiques would be accepted by essentially any economist. Those of the philosophic school that I like to discuss on this site would further note that the model of utility maximization is not necessarily meant to be predictive, that we know it is “wrong” in that clearly people do not always act as if they are maximizers, and that the Max U model is nonetheless useful as an epistemic device for social science researchers.

http://www.tc.umn.edu/~tbeatty/working_papers/revisedpowerpaper.pdf (Final working paper – final version published in AER October 2011)

“Stakes Matter in Ultimatum Games,” S. Andersen, S. Ertac, U. Gneezy, M. Hoffman & J. List (2011)

[Update, 9/7/2011: A comment at Cheap Talk mentioned a new paper by Nicholas Bardsley which I find quite relevant to the final paragraph of this post. Essentially, Bardsley is able to completely change (as far as I’m concerned) the “sharing” result in the dictator game just by changing the action set available to players: if the dictator can also “take” money, and not simply share, then take they do. The Hawthorne Effect Is Real, shout the villagers from the mountaintop.]

Here is one more experimental paper, which I believe is also forthcoming in the AER. Experimentalists love the Ultimatum Game. In the Ultimatum Game, two anonymous people are matched and one of them is given X dollars. She is told to propose a split of the money between herself and the other player. The other player can then either accept his share of the split, or reject, in which case both parties get nothing. Tons of experiments over the past 20 years, everywhere from U.S. undergraduate labs to tribes in the Amazon, have found offers that tend to be rather high (30-50% of the stake) and high rejection rates for low offers. This is “strange” (more on this shortly) to economists because the unique subgame perfect Nash equilibrium is for the proposer to offer a penny and for the responder to accept. Even if you think the so-called paradox is nothing of the sort – rather, people are unused to one-shot games and are instead trying to build a reputation in a repeated game called Life – there is an even stranger stylized fact: changing the stakes doesn’t seem to affect behavior. That is, whether the stakes are 1 dollar, 10 dollars or 100 dollars, people still reject low offers. Why aren’t people responding to incentives at all?
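
For completeness, here is a minimal backward-induction sketch of that benchmark prediction, using a hypothetical $10 stake discretized into cents (my own toy setup, not anything from the paper): a purely self-interested responder accepts any strictly positive offer, so the proposer’s best offer is a single cent.

```python
# Ultimatum game with a 1000-cent stake. The responder compares each offer to
# the zero payoff from rejecting; the proposer then picks the offer that
# maximizes what she keeps, given the responder's acceptance rule.
stake = 1000  # cents

def responder_accepts(offer):
    return offer > 0                      # any positive amount beats getting nothing

proposer_payoff = {offer: (stake - offer if responder_accepts(offer) else 0)
                   for offer in range(stake + 1)}
best_offer = max(proposer_payoff, key=proposer_payoff.get)
print(best_offer, proposer_payoff[best_offer])   # offer 1 cent, keep 999
```

The experimental facts – generous offers and costly rejections – are interesting precisely because they sit so far from this benchmark.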

I remember a study a few years ago, from Indonesia perhaps, where many days’ worth of wages were being rejected seemingly out of spite. (And speaking of spite, ultimatum game papers are great examples of economists abusing language. One man’s “unfair offers were consistently rejected” is another man’s “primitive spite seems more important to responders than rational thought.”)

Andersen et al (more on that “et al” in a second) play the ultimatum game in India using stakes that range up to a year’s income. Unsurprisingly, stakes matter a lot. With stakes of a year’s income, only one offer was rejected, no matter how low the split, and that offer was less than 10% of the stake. As stakes increase from 20 rupees up to 20,000, the rejection rate for any given split falls, and it seems to fall fastest once the stakes get very large. The takeaway: even given all of the experimental results on the Ultimatum Game, spite is probably not terribly important vis-a-vis more standard incentives across the range of “very important economic phenomena.” None of this is to say that CEOs won’t cost their firm millions out of spite – surely they sometimes do – but claims that human nature is hardwired for fairness or spite or whatever you want to call it, even at the expense of standard maximizing behavior, are limited claims indeed.

Two final notes here. First, I think economists need to come to some conclusion concerning authorship norms on experimental papers. Econ has long had a standard of giving author billing only to those who were essential for the idea and the completion of a paper – rarely has this meant more than three authors. Credit for data collection, straightforward math, coding, etc. has generally been given in the acknowledgments. A lot of econ psych and experimental work strikes me as fighting that norm: five and six authors have become standard. (I should caveat this by saying that in the present paper, I have no idea how the workload was divided; rather, I think it’s undeniable that, more generally, the work expected of a coauthor on experimental papers is lower than what was traditional in economics.)

Second, and I’m sure someone has done this but I don’t have a cite, the “standard” instructions in ultimatum games seem to prime the results to a ridiculous degree. Imagine the following exercise. Give 100 dollars to a research subject (Mr. A). Afterwards, tell some other subject (Ms. B) that 100 dollars was given to Mr. A. Tell Mr. A that the other subject knows he was given the money, but don’t prime him to “share” or “offer a split” or anything similar. Later, tell Ms. B that she can, if she wishes, reverse the result and take the 100 dollars away from Mr. A – and if Mr. A had happened to give her some of the money, that would be taken back as well. I hope we can agree that if you ran such an experiment, A would share no money and B would show no spite, as neither has been primed to see the 100 dollars as something that should have been shared in the first place. One doesn’t normally expect anonymous strangers to share their good fortune, surely. That is, feelings of spite, jealousy and fairness can be, and are, primed by researchers. I think this is worth keeping in mind when trying to apply the experimental results on ultimatum games to the real economy.

http://openarchive.cbs.dk/bitstream/handle/10398/8244/ECON_wp1-2011.pdf?sequence=1 (January 2011 working paper, forthcoming in the AER)

“A Continuous Dilemma,” D. Friedman & R. Oprea (2011)

I feel pretty confident that the two lab experiment papers I will write about today will be the only such posts on that field here for quite a while. Both results are interesting, but as an outsider to experimental econ, I’m quite surprised that these represent the “state of the art” – and at some level they must, since both are forthcoming in the AER.

In the present paper, Friedman and Oprea run three versions of the prisoner’s dilemma: a one-shot game, a one-minute continuous-time game where players must “wait” 7.5 seconds to react to an opponent’s change of strategy, and a one-minute continuous-time game with no limit on reaction speed aside from human reaction time. We’ve known essentially since Nash that the finitely repeated prisoner’s dilemma can only support defection in every period as an equilibrium (by a simple backward-induction unraveling argument), but that the infinitely repeated prisoner’s dilemma can support anything from the mutual-defection payoff up to the mutual-cooperation payoff in equilibrium (by the Fudenberg-Maskin folk theorem). Two results from the 1980s save us a bit here. First, as the underrated Radner pointed out, if you can react quickly to an opponent’s deviation, then you can lose only a tiny bit by cooperating and hoping your opponent cooperates as well. That is, with a very large number of periods, “cooperate until almost the very end” is an “almost” dominant equilibrium. If your opponent defects, you defect almost immediately afterward, and thereafter both players play the stage game’s unique equilibrium, defect-defect. If your opponent does not defect, you both continue to cooperate until the very end. Regardless of your opponent’s strategy, “cooperate until the opponent defects for the first time” earns only a tiny bit less than the best you could possibly earn against that strategy – the loss is bounded by what a defector can grab during your reaction lag. Second, Simon and Stinchcombe (1989) show that in continuous-time games the backward-induction unraveling does not bite and something like the folk theorem applies.
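
Here is a minimal numerical sketch of Radner’s point, using toy stage payoffs and a discrete-time approximation of my own choosing (nothing here is the paper’s parameterization): against a grim-trigger opponent who reacts with a short lag, the most you can ever gain by defecting is the temptation premium earned during that lag, which is tiny relative to the payoff from a minute of cooperation.

```python
# Per-tick payoffs: mutual cooperation, temptation (I defect while you still
# cooperate), mutual defection. A 60-second game in 0.1-second ticks, with the
# opponent playing grim trigger and reacting 3 ticks (0.3 s) after my defection.
coop, temptation, both_defect = 1.0, 1.5, 0.0
ticks, lag = 600, 3

def payoff_if_i_defect_at(t):
    # Cooperate for t ticks, earn the temptation payoff during the opponent's
    # reaction lag, then mutual defection for the remainder of the game.
    return coop * t + temptation * lag + both_defect * (ticks - t - lag)

best_deviation = max(payoff_if_i_defect_at(t) for t in range(ticks - lag + 1))
cooperate_throughout = coop * ticks
print(best_deviation - cooperate_throughout)   # (temptation - coop) * lag = 1.5, vs. 600 from cooperating
```

Shrink the reaction lag and the gain from ever defecting shrinks with it, which is why near-continuous monitoring makes mutual cooperation an epsilon-equilibrium of even the finitely repeated game.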

Friedman and Oprea test this in the lab. Basically none of their subjects cooperate in the one-shot game, and cooperation steadily increases as the minimum wait to react drops from 30 seconds toward nearly continuous adjustment. In the treatment where the only restriction on reaction time is human response time, cooperation occurs 80-90% of the time, essentially spanning the entire game in every session except for the last few seconds. A modification of Radner’s insight shows that this type of cutoff strategy is an epsilon-equilibrium, and that the observed level of cooperation, given the limits on reaction time, is reasonable. The authors do not fully solve for the (epsilon-)equilibria of their game – I have no idea how they got away with this, but I would love to know what they said to the referees! In any case, the intuition for why cutoff strategies are nearly dominant seems reasonable, although it should be noted that this intuition is essentially Radner’s and not anything novel to the present paper.

So what’s the takeaway? For a theoretically minded reader, I think the experimental results here are simply more justification for taking care in interpreting Nash predictions of play in lengthy, finitely repeated games. Even for modeling purposes, it might be reasonable to see more work on epsilon-equilibria in, say, oligopoly behavior; cartel pricing is much easier to support when prices and quantities are reported very quickly if we look at that type of equilibrium. I still find it a bit strange that the authors do not, as far as I can tell, attempt to distinguish between different theoretical explanations for high rates of cooperation in repeated games. Is there infection from beliefs a la Kreps et al’s Gang of Four paper? (This does not appear to be the case to me, since I believe Gang of Four can sustain cooperation all the way to the horizon.) Would bounded rationality matter? (Both players’ complete action profiles over time are available throughout the game in the present paper.) There are many other explanations that could be tested here. (Indeed, Bigoni et al have a new paper following up the present results with some discussion of infinite versus finite horizon continuous-time games.)

http://faculty.arts.ubc.ca/roprea/prisonerEX.pdf (Dec 2010 working paper. Final version forthcoming in the AER. If you’re coming from a theory background, there are many norms in experimental econ that will strike you as strange – writing about an experiment with 36 American undergraduates who self-select into lab studies as if it were representative of human behavior, for example – but I’m afraid that battle has already been lost. Best just to read experimental work for what it is; some interesting insights for theory lie inside, even despite these peccadillos.)

“The Role of Theory in Field Experiments,” D. Card, S. Dellavigna & U. Malmendier (2011)

This article, I hope, will be widely read by economists working on field experiments. And it comes with David Card’s name right on the title page; this is certainly not a name that one associates with structural modeling!

Field experiments and randomized control trials are booming at the moment. Until the past decade, an average year saw a single field experiment published across the top five journals; now 8 to 10 appear each year. The vast majority of these papers are atheoretical, though I have a small complaint about the definition of “theoretical” which I’ll leave for the final paragraph of this post. The same atheoretical nature largely holds for lab experiments; I am generally very receptive to field experiments and much less so to lab experiments, so I’ll leave out discussion of the lab for now.

(That said, I’m curious for the lab types out there: are there any good examples of lab experiments which have overturned a key economic insight? By overturned, I mean that the reversal was accepted as valid by many economists. I don’t mean “behavioral theory” like Kahneman-Tversky. I mean an actual lab experiment in the style of the German School – we ought to call it that at this point. It just seems to me that many of the “surprising” results turn out not to hold once we move to economically relevant behavior in the market. The “gift reciprocity” paper by Fehr and coauthors is a great example, and Card, Dellavigna and Malmendier discuss it. In the lab, people “work” much harder when they get paid a surprisingly high wage. In field and natural experiments trying to replicate this, with Gneezy and List (2006) being the canonical example, there is no such economically relevant effect. I would love some counterexamples of this phenomenon, though: I’m trying my best to keep an open mind!)

But back to field experiments. After noting the paucity of theory in most experimental papers, the authors give three examples of where theory could have played a role. In the gift reciprocity/wages literature mentioned above, there are many potential explanations for what is going on in the lab. Perhaps workers feel inequity aversion and don’t want to “rip off” unprofitable employers. Perhaps they simply act under reciprocity: if you pay me a high wage, I’ll work hard even in a one-shot game. A properly designed field experiment can distinguish between the two. An even better example is charitable giving. List and Lucking-Reiley ran a famous 2002 field experiment examining whether giving to charity could be affected by, for example, claiming in the brochure that the goal of the fundraising drive had already almost been reached. But can’t we learn much more about charity? Do people give because of warm glow? Or because of social pressure? Or some other reason? Dellavigna, List and Malmendier have a wonderful 2010 paper that writes down a basic structural model of giving and introduces just enough randomization into the experimental design to identify all of the parameters. They find that social pressure is important, and that door-to-door fundraising can actually lower total social welfare, even taking into account the gain from whatever public good the charity is raising money for. And their results link back nicely to earlier theory and forward to future experiments along similar lines. Now that’s great work!

The complaints against structural models have always seemed hollow to me. As Card, Dellavigna and Malmendier note, every paper, structural or not, makes implicit assumptions when interpreting results. Why not make them in a way that is both clear and guided by the huge body of theoretical knowledge that social science has already developed? The authors note a turn away from structural models in experiments after the negative income tax papers of the 70s and 80s were judged failures, in some sense, due to the difficulty of interpreting their results. This argument was always a bit ridiculous: all social science results are hard to interpret, and there’s no way around this. Writing up research so that it seems more clearcut to a policy audience does not mean the evidence actually is clearcut.

I do have one quibble with this paper, though – and I think the authors will sympathize with this complaint given their case studies. The authors divide experimental papers into four groups: descriptive, single model, competing models and parameter estimation. Single model, to take one example, is defined as a paper that lays out a formal model and tests one or more implications thereof; similar definitions are given for competing models and parameter estimation. Once we get over Friedman’s 1953 model of economic methodology, though, we’ve got to realize that “testing” models is far, far from the only link between theory and data. Theory is useful to empirics because it can point toward interesting and nonobvious questions, because it can be used to justify nontestable econometric assumptions, because it allows for reasonable discussion of counterfactuals, because it allows empirical studies to be linked into a broader conception of knowledge, because it allows results to be interpreted correctly, and so on. I’d argue that checking whether papers “test” models is almost irrelevant for knowing whether empirical papers properly use theory. Let me give my favorite example, which I used in a presentation to empirical economists last year. Imagine you study government-mandated hospital report cards, and find that two years into the program there is no evidence that hospitals or patients are changing behavior based on the ratings, but that 20% of patients looked at the report cards at some point. An atheoretical paper might suggest that these report card programs are a waste of money. A theoretically guided paper would note that game theorists have shown reputational equilibria are often discontinuous, and that perhaps if more patients were induced to look at the report cards (maybe by directly mailing them to each household once a year), hospitals would begin to react by giving better care. There is no testing of a theoretical model or anything similar, but there is certainly great use of theory! (Perhaps of interest: my two favorite job market papers of the last couple of years, those of Ben Handel and Heidi Williams, both use theory in one of the ways above rather than in the direct “let’s use data to test a theoretical model” framework…)

Similar comments apply to theorists’ use of empirical research, of course, but let’s save that for another day.

http://elsa.berkeley.edu/~sdellavi/wp/FieldExperimentJEPFeb11Tris.pdf (February 2011 working paper – forthcoming in the JEP)

“Reviews, Reputation and Revenue: The Case of Yelp.com,” M. Luca (2010)

I’m doing some work related to social learning, and a friend passed along the present paper by a recent job market candidate. It’s quite clever, and a great use of the wealth of data now available to the empirically-minded economist.

Here’s the question: there are tons of ways products, stores and restaurants develop reputation. One of these ways is reviews. How important is that extra Michelin star, or higher Zagat rating, or better word of mouth? And how could we ever separate the effect of reputation from the underlying quality of the restaurant?

Luca scrapes restaurant review data from Yelp, which really began penetrating Seattle in 2005; Yelp data is great because it includes review dates, so you can go back in time and reconstruct, with some error due to deleted reviews, what the review profile used to look like. Luca also has, incredibly, 7 years of restaurant revenue data from the city of Seattle. Just put the two together and you can track how restaurant reviews are correlated with revenue.

But what of causality? Here’s the clever bit. He notes that Yelp aggregates reviews into a displayed star rating, rounded to the nearest half star: a restaurant with average review 3.24 gets 3 stars, while one with average 3.25 gets 3.5 stars. Since no one actually reads all of, say, 200 reviews of a given restaurant, the star rating can be said to represent reputation, while the actual review average represents underlying restaurant quality. It’s 2011, so this calls for some regression discontinuity (apparently, some grad students at Harvard call the empirical publication gatekeepers “the identification Taliban”; at least the present paper gets the internal validity right and doesn’t seem to have too many interpretive problems with external validity).
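
To fix ideas, here is a toy simulation of the rounding-based discontinuity with entirely made-up numbers (nothing below comes from Luca’s data or estimates): log revenue depends smoothly on the underlying average rating plus a jump from the displayed half-star, and comparing restaurants just below and just above a rounding cutoff recovers that jump while holding underlying quality essentially fixed.

```python
import numpy as np

# Simulated restaurants: continuous average rating, Yelp-style rounding to the
# nearest half star, and log revenue with a smooth quality effect plus a
# made-up 0.09-per-star display effect (about 4.5% per half star).
rng = np.random.default_rng(1)
n = 200_000
avg_rating = rng.uniform(2.5, 4.0, n)
stars = np.round(avg_rating * 2) / 2
log_revenue = 0.3 * avg_rating + 0.09 * stars + rng.normal(0, 0.1, n)

# Local comparison around the 3.25 cutoff, where the display jumps from 3 to
# 3.5 stars while underlying quality barely changes.
w = 0.02
below = log_revenue[(avg_rating >= 3.25 - w) & (avg_rating < 3.25)]
above = log_revenue[(avg_rating >= 3.25) & (avg_rating < 3.25 + w)]
print(f"estimated jump at the cutoff: {above.mean() - below.mean():.3f} log points")
```

The same comparison on real data is what lets Luca attribute the revenue jump to reputation rather than to the smoothly varying quality underneath it.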

Holding underlying quality constant, the discontinuous jump of a half star is worth a 4.5% increase in revenue in the relevant quarter. This is large, but not crazy: effects of similar magnitude have been found in recent work on restaurants moving from a “B” to an “A” sanitary score, and on changes in consumption after calorie information was posted in New York City. The effect is close to zero for chains – one interpretation is that no one Yelps restaurants they are already familiar with. I would have liked to see some sort of demographic check here as well: is the “Yelp effect” stronger in neighborhoods with younger, more internet-savvy consumers, as you might expect? Also, you may wonder whether there is manipulation by restaurant owners, given the large gains from a tiny jump in star rating. A quick and dirty distributional check doesn’t find any evidence of manipulation, but that may change after this paper gets published!

You may also be wondering why reputation matters at all: why don’t I just go to a good restaurant? The answer is social learning plus costs of experimentation. The paper I’m working on now follows this line of thought toward what I think is a rather surprising policy implication: more on this at a future date.

http://people.bu.edu/mluca/JMP.pdf (Working paper version – Luca was hired at HBS, so savvy use of a great dataset pays off!)
