A “causal empiricist” turn has swept through economics over the past couple decades. As a result, many economists are primarily interested in internally valid treatment effects according to the causal models of Rubin, meaning they are interested in credible statements of how some outcome Y is affected if you manipulate some treatment T given some covariates X. That is, to the extent that full functional form Y=f(X,T) is impossible to estimate because of unobserved confounding variables or similar, it turns out to still be possible to estimate some feature of that functional form, such as the average treatment effect E(f(X,1))-E(f(X,0)). At some point, people like Angrist and Imbens will win a Nobel prize not only for their applied work, but also for clarifying precisely what various techniques are estimating in a causal sense. For instance, an instrumental variable regression under a certain exclusion restriction (let’s call this an “auxiliary assumption”) estimates the average treatment effect along the local margin of people induced into treatment. If you try to estimate the same empirical feature using a different IV, and get a different treatment effect, we all know now that there wasn’t a “mistake” in either paper, but rather than the margins upon which the two different IVs operate may not be identical. Great stuff.

This causal model emphasis has been controversial, however. Social scientists have quibbled because causal estimates generally require the use of small, not-necessarily-general samples, such as those from a particular subset of the population or a particular set of countries, rather than national data or the universe of countries. Many statisticians have gone even further, suggestion that multiple regression with its linear parametric form does not take advantage of enough data in the joint distribution of (Y,X), and hence better predictions can be made with so-called machine learning algorithms. And the structural economists argue that the parameters we actually care about are much broader than regression coefficients or average treatment effects, and hence a full structural model of the data generating process is necessary. We have, then, four different techniques to analyze a dataset: multiple regression with control variables, causal empiricist methods like IV and regression discontinuity, machine learning, and structural models. What exactly do each of these estimate, and how do they relate?

Peter Aronow and Cyrus Samii, two hotshot young political economists, take a look at old fashioned multiple regression. Imagine you want to estimate y=a+bX+cT, where T is a possibly-binary treatment variable. Assume away any omitted variable bias, and more generally assume that all of the assumptions of the OLS model (linearity in covariates, etc.) hold. What does that coefficient c on the treatment indicator represent? This coefficient is a *weighted* combination of the individual estimated treatment effects, where more weight is given to units whose treatment status is not well explained by covariates. Intuitively, if you are regressing, say, the probability of civil war on participation in international institutions, then if a bunch of countries with very similar covariates all participate, the “treatment” of participation will be swept up by the covariates, whereas if a second group of countries with similar covariates all have different participation status, the regression will put a lot of weight toward those countries since differences in outcomes can be related to participation status.

This turns out to be quite consequential: Aronow and Samii look at one paper on FDI and find that even though the paper used a broadly representative sample of countries around the world, about 10% of the countries weighed more than 50% in the treatment effect estimate, with very little weight on a number of important regions, including all of the Asian tigers. In essence, the sample was general, but the *effective sample* once you account for weighting was just as limited as some of “nonrepresentative samples” people complain about when researchers have to resort to natural or quasinatural experiments! It turns out that similar effective vs. nominal representativeness results hold even with nonlinear models estimated via maximum likelihood, so this is not a result unique to OLS. Aronow and Samii’s result matters for interpreting bodies of knowledge as well. If you replicate a paper adding in an additional covariate, and get a different treatment effect, it may not reflect omitted variable bias! The difference may simply result from the additional covariate changing the effective weighting on the treatment effect.

So the “externally valid treatment effects” we have been estimating with multiple regression aren’t so representative at all. So when, then, is old fashioned multiple regression controlling for observable covariates a “good” way to learn about the world, compared to other techniques. I’ve tried to think through this is a uniform way; let’s see if it works. First consider machine learning, where we want to estimate y=f(X,T). Assume that there are no unobservables relevant to the estimation. The goal is to estimate the functional form f nonparametrically but to avoid overfitting, and statisticians have devised a number of very clever ways to do this. The proof that they work is in the pudding: cars drive themselves now. It is hard to see any reason why, if there are no unobservables, we wouldn’t want to use these machine learning/nonparametric techniques. However, at present the machine learning algorithms people use literally depend only on data in the joint distribution (X,Y), and not on any auxiliary assumptions. To interpret the marginal effect of a change in T as some sort of “treatment effect” that can be manipulated with policy, if estimated without auxiliary assumptions, requires some pretty heroic assumptions about the lack of omitted variable bias which essentially will never hold in most of the economic contexts we care about.

Now consider the causal model, where y=f(X,U,T) and you interested in what would happen with covariates X and unobservables U if treatment T was changed to a counterfactual. All of these techniques require a particular set of auxiliary assumptions: randomization requires the SUTVA assumption that treatment of one unit does not effect the independent variable of another unit, IV requires the exclusion restriction, diff-in-diff requires the parallel trends assumption, and so on. In general, auxiliary assumptions will only hold in certain specific contexts, and hence by construction the result will not be representative. Further, these assumptions are very limited in that they can’t recover every conditional aspect of y, but rather recover only summary statistics like the average treatment effect. Techniques like multiple regression with covariate controls, or machine learning nonparametric estimates, can draw on a more general dataset, but as Aronow and Samii pointed out, the marginal effect on treatment status they identify is not necessarily effectively drawing on a more general sample.

Structural folks are interested in estimating y=f(X,U,V(t),T), where U and V are unobserved, and the nature of unobserved variables V are affected by t. For example, V may be inflation expectations, T may be the interest rate, y may be inflation today, and X and U are observable and unobservable country characteristics. Put another way, the functional form of f may depend on how exactly T is modified, through V(t). This Lucas Critique problem is assumed away by the auxiliary assumptions in causal models. In order to identify a treatment effect, then, additional auxiliary assumptions generally derived from economic theory are needed in order to understand how V will change in response to a particular treatment type. Even more common is to use a set of auxiliary assumptions to find a sufficient statistic for the particular parameter desired, which may not even be a treatment effect. In this sense, structural estimation is similar to causal models in one way and different in two. It is similar in that it relies on auxiliary assumptions to help extract particular parameters of interest when there are unobservables that matter. It is different in that it permits unobservables to be functions of policy, and that it uses auxiliary assumptions whose credibility leans more heavily on non-obvious economic theory. In practice, structural models often also require auxiliary assumptions which do not come directly from economic theory, such as assumptions about the distribution of error terms which are motivated on the basis of statistical arguments, but *in principle* this distinction is not a first order difference.

We then have a nice typology. Even if you have a completely universal and representative dataset, multiple regression controlling for covariates does not generally give you a “generalizable” treatment effect. Machine learning can try to extract treatment effects when the data generating process is wildly nonlinear, but has the same nonrepresentativeness problem and the same “what about omitted variables” problem. Causal models can extract some parameters of interest from nonrepresentative datasets where it is reasonable to assume certain auxiliary assumptions hold. Structural models can extract more parameters of interest, sometimes from more broadly representative datasets, and even when there are unobservables that depend on the nature of the policy, but these models require auxiliary assumptions that can be harder to defend. The so-called sufficient statistics approach tries to retain the former advantages of structural models while reducing the heroics that auxiliary assumptions need to perform.

Aronow and Samii is forthcoming in the American Journal of Political Science; the final working paper is at the link. Related to this discussion, Ricardo Hausmann caused a bit of a stir online this week with his “constant adaptation rather than RCT” article. His essential idea was that, unlike with a new medical drug, social science interventions vary drastically depending on the exact place or context; that is, external validity matters so severely that slowly moving through “RCT: Try idea 1”, then “RCT: Try idea 2”, is less successful than smaller, less precise explorations of the “idea space”. He received a lot of pushback from the RCT crowd, but I think for the wrong reason: the constant iteration is *less* likely to discover underlying mechanisms than even an RCT, as it is still far too atheoretical. The link Hausmann makes to “lean manufacturing” is telling: GM famously (Henderson and Helper 2014) took photos of every square inch of their joint venture plant with NUMMI, and tried to replicate this plant in their other plants. But the underlying *reason* NUMMI and Toyota worked has to do with the credibility of various relational contracts, rather than the (constantly iterated) features of the shop floor. Iterating without attempting to glean the underlying mechanisms at play is not a rapid route to good policy.

*Edit: A handful of embarrassing typos corrected, 2/26/2016*