A “causal empiricist” turn has swept through economics over the past couple of decades. As a result, many economists are primarily interested in internally valid treatment effects according to the causal models of Rubin, meaning they are interested in credible statements of how some outcome Y is affected if you manipulate some treatment T given some covariates X. That is, to the extent that the full functional form Y=f(X,T) is impossible to estimate because of unobserved confounding variables or similar, it turns out to still be possible to estimate some feature of that functional form, such as the average treatment effect E(f(X,1))-E(f(X,0)). At some point, people like Angrist and Imbens will win a Nobel prize not only for their applied work, but also for clarifying precisely what various techniques are estimating in a causal sense. For instance, an instrumental variable regression under a certain exclusion restriction (let’s call this an “auxiliary assumption”) estimates the average treatment effect along the local margin of people induced into treatment. If you try to estimate the same empirical feature using a different IV, and get a different treatment effect, we all know now that there wasn’t a “mistake” in either paper, but rather that the margins upon which the two different IVs operate may not be identical. Great stuff.
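To make the “different IVs, different margins” point concrete, here is a minimal sketch rather than anything taken from the literature. The population, the effect sizes, and the names `POPULATION`, `BASELINE` and `wald_estimate` are all made up for illustration. Each unit is counted once with the instrument off and once with it on, which stands in for a perfectly balanced randomized instrument.

```python
# Hypothetical population: (number of units, treatment effect,
# responds to instrument 1, responds to instrument 2).
# Units that respond to neither instrument never take the treatment.
POPULATION = [
    (30, 1.0, True, False),   # compliers for instrument 1 only
    (20, 3.0, False, True),   # compliers for instrument 2 only
    (50, 2.0, False, False),  # never-takers
]
BASELINE = 5.0  # untreated outcome, assumed identical for everyone


def wald_estimate(population, instrument):
    """Reduced form divided by first stage for a binary instrument.

    Every unit is counted once with Z=0 and once with Z=1, which mimics an
    instrument that is perfectly balanced across unit types.
    """
    n_total = sum(n for n, *_ in population)
    mean_y = {0: 0.0, 1: 0.0}
    mean_t = {0: 0.0, 1: 0.0}
    for n, effect, responds_1, responds_2 in population:
        responds = responds_1 if instrument == 1 else responds_2
        for z in (0, 1):
            t = 1.0 if (responds and z == 1) else 0.0
            mean_y[z] += n * (BASELINE + effect * t) / n_total
            mean_t[z] += n * t / n_total
    return (mean_y[1] - mean_y[0]) / (mean_t[1] - mean_t[0])


late_1 = wald_estimate(POPULATION, instrument=1)
late_2 = wald_estimate(POPULATION, instrument=2)
ate = sum(n * effect for n, effect, *_ in POPULATION) / sum(n for n, *_ in POPULATION)

print(f"IV using instrument 1:     {late_1:.2f}")
print(f"IV using instrument 2:     {late_2:.2f}")
print(f"Population average effect: {ate:.2f}")
```

In this toy population the two instruments recover effects of 1 and 3 while the population average is 1.9. Neither IV estimate is wrong; each one averages over the people that particular instrument moves into treatment.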

This causal model emphasis has been controversial, however. Social scientists have quibbled because causal estimates generally require the use of small, not-necessarily-general samples, such as those from a particular subset of the population or a particular set of countries, rather than national data or the universe of countries. Many statisticians have gone even further, suggesting that multiple regression with its linear parametric form does not take advantage of enough data in the joint distribution of (Y,X), and hence better predictions can be made with so-called machine learning algorithms. And the structural economists argue that the parameters we actually care about are much broader than regression coefficients or average treatment effects, and hence a full structural model of the data generating process is necessary. We have, then, four different techniques to analyze a dataset: multiple regression with control variables, causal empiricist methods like IV and regression discontinuity, machine learning, and structural models. What exactly does each of these estimate, and how do they relate?

Peter Aronow and Cyrus Samii, two hotshot young political economists, take a look at old fashioned multiple regression. Imagine you want to estimate y=a+bX+cT, where T is a possibly-binary treatment variable. Assume away any omitted variable bias, and more generally assume that all of the assumptions of the OLS model (linearity in covariates, etc.) hold. What does that coefficient c on the treatment indicator represent? This coefficient is a *weighted* combination of the individual estimated treatment effects, where more weight is given to units whose treatment status is not well explained by covariates. Intuitively, if you are regressing, say, the probability of civil war on participation in international institutions, then if a bunch of countries with very similar covariates all participate, the “treatment” of participation will be swept up by the covariates, whereas if a second group of countries with similar covariates all have different participation status, the regression will put a lot of weight toward those countries since differences in outcomes can be related to participation status.
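A small worked example may help here. This is a sketch on invented, noise-free data, not Aronow and Samii's own code. It assumes a single discrete covariate (two strata) entered as dummies, in which case the weight on each stratum works out to n·p·(1-p), with p the share treated in that stratum. The strata, the effect sizes, and the function names are hypothetical.

```python
from statistics import fmean

# Hypothetical strata: (label, units, treated units, true treatment effect).
STRATA = [("A", 100, 50, 1.0), ("B", 100, 10, 5.0)]


def build_units(strata):
    """Noise-free outcomes: stratum baseline plus a stratum-specific effect."""
    units = []
    for index, (label, n, n_treated, effect) in enumerate(strata):
        for i in range(n):
            t = 1.0 if i < n_treated else 0.0
            units.append((label, t, 10.0 * index + effect * t))
    return units


def ols_treatment_coefficient(units):
    """Coefficient on T from regressing y on T and stratum dummies.

    Uses the Frisch-Waugh result: demean T and y within each stratum, then
    run a one-variable regression on the residuals.
    """
    by_stratum = {}
    for label, t, y in units:
        by_stratum.setdefault(label, []).append((t, y))
    numerator = 0.0
    denominator = 0.0
    for rows in by_stratum.values():
        t_bar = fmean([t for t, _ in rows])
        y_bar = fmean([y for _, y in rows])
        numerator += sum((t - t_bar) * (y - y_bar) for t, y in rows)
        denominator += sum((t - t_bar) ** 2 for t, _ in rows)
    return numerator / denominator


units = build_units(STRATA)
c_hat = ols_treatment_coefficient(units)

# Implied weight of each stratum: n * p * (1 - p), where p is the treated share.
weights = {label: n * (k / n) * (1 - k / n) for label, n, k, _ in STRATA}
weighted_effect = sum(weights[label] * effect for label, _, _, effect in STRATA) / sum(weights.values())
simple_average = sum(n * effect for _, n, _, effect in STRATA) / sum(n for _, n, _, _ in STRATA)

print(f"OLS coefficient on T:        {c_hat:.4f}")
print(f"Variance-weighted effect:    {weighted_effect:.4f}")
print(f"Simple average of effects:   {simple_average:.4f}")
```

Both strata have 100 units, yet stratum A, where treatment is split evenly, carries weight 25 against 9 for stratum B. The regression coefficient therefore lands near 2.06 rather than at the simple average of 3.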

This turns out to be quite consequential: Aronow and Samii look at one paper on FDI and find that even though the paper used a broadly representative sample of countries around the world, about 10% of the countries accounted for more than 50% of the weight in the treatment effect estimate, with very little weight on a number of important regions, including all of the Asian tigers. In essence, the sample was general, but the *effective sample* once you account for weighting was just as limited as some of the “nonrepresentative samples” people complain about when researchers have to resort to natural or quasinatural experiments! It turns out that similar effective vs. nominal representativeness results hold even with nonlinear models estimated via maximum likelihood, so this is not a result unique to OLS. Aronow and Samii’s result matters for interpreting bodies of knowledge as well. If you replicate a paper adding in an additional covariate, and get a different treatment effect, it may not reflect omitted variable bias! The difference may simply result from the additional covariate changing the effective weighting on the treatment effect.
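One way to look at an effective sample is to compute each unit's weight directly. The sketch below assumes the weights are the squared residuals from a linear regression of treatment on the covariates, which is my reading of the OLS case. The twenty-unit dataset and the function names are invented and have nothing to do with the FDI paper.

```python
# Hypothetical sample: 20 units with one covariate x and a binary treatment t.
# Treatment mostly follows x, with two units (indices 7 and 12) going
# against the pattern.
X = [i / 19 for i in range(20)]
T = [1.0 if i >= 10 else 0.0 for i in range(20)]
T[7], T[12] = 1.0, 0.0


def residuals_from_line(x, t):
    """Residuals from a one-covariate least-squares fit of t on x."""
    n = len(x)
    x_bar = sum(x) / n
    t_bar = sum(t) / n
    slope = sum((a - x_bar) * (b - t_bar) for a, b in zip(x, t)) / sum(
        (a - x_bar) ** 2 for a in x
    )
    return [b - (t_bar + slope * (a - x_bar)) for a, b in zip(x, t)]


def effective_weights(x, t):
    """Squared treatment residuals, normalized to sum to one."""
    squared = [r ** 2 for r in residuals_from_line(x, t)]
    total = sum(squared)
    return [s / total for s in squared]


weights = effective_weights(X, T)
top_k = max(1, len(weights) // 10)  # number of units in the top 10%
top_share = sum(sorted(weights, reverse=True)[:top_k])
heaviest = max(range(len(weights)), key=weights.__getitem__)

print(f"Share of weight held by the top {top_k} units: {top_share:.2f}")
print(f"Index of the most heavily weighted unit: {heaviest}")
```

The units whose treatment status the covariate predicts badly (here the two that break the pattern) dominate the estimate, while units at the extremes of x contribute almost nothing.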

So the “externally valid treatment effects” we have been estimating with multiple regression aren’t so representative at all. So when, then, is old fashioned multiple regression controlling for observable covariates a “good” way to learn about the world, compared to other techniques? I’ve tried to think through this in a uniform way; let’s see if it works. First consider machine learning, where we want to estimate y=f(X,T). Assume that there are no unobservables relevant to the estimation. The goal is to estimate the functional form f nonparametrically but to avoid overfitting, and statisticians have devised a number of very clever ways to do this. The proof that they work is in the pudding: cars drive themselves now. It is hard to see any reason why, if there are no unobservables, we wouldn’t want to use these machine learning/nonparametric techniques. However, at present the machine learning algorithms people use literally depend only on data in the joint distribution (X,Y), and not on any auxiliary assumptions. To interpret the marginal effect of a change in T as some sort of “treatment effect” that can be manipulated with policy, if estimated without auxiliary assumptions, requires some pretty heroic assumptions about the lack of omitted variable bias which essentially will never hold in most of the economic contexts we care about.
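A quick simulation illustrates the gap between predicting well and identifying a manipulable effect. This is a sketch under invented assumptions: an unobservable U drives both T and y, T has no effect on y at all, and a least-squares line stands in for a flexible learner (the true conditional mean is linear in this setup, so a nonparametric fit would converge to the same slope). All names are hypothetical.

```python
import random

rng = random.Random(0)
N = 20_000

# Hypothetical data generating process: U drives both T and y,
# and T has no effect on y at all.
U = [rng.gauss(0, 1) for _ in range(N)]
T = [u + rng.gauss(0, 1) for u in U]
Y = [u + rng.gauss(0, 1) for u in U]


def slope(x, y):
    """Least-squares slope of y on x (with an intercept)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sum(
        (a - x_bar) ** 2 for a in x
    )


def residualize(v, on):
    """Residuals of v after a least-squares fit on `on`."""
    b = slope(on, v)
    v_bar = sum(v) / len(v)
    on_bar = sum(on) / len(on)
    return [a - v_bar - b * (c - on_bar) for a, c in zip(v, on)]


# What any predictor fitted to the observed (T, y) pairs will pick up.
predictive_slope = slope(T, Y)
# What we could recover only if U were observed and controlled for.
adjusted_slope = slope(residualize(T, U), residualize(Y, U))

print(f"Marginal 'effect' of T from prediction alone: {predictive_slope:.2f}")
print(f"Effect of T once U is held fixed:             {adjusted_slope:.2f}")
```

The fitted relationship is a perfectly good predictor of y, with a slope near one half, but changing T by policy would do nothing to y. No amount of flexibility in the learner fixes that without an assumption about U.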

Now consider the causal model, where y=f(X,U,T) and you are interested in what would happen with covariates X and unobservables U if treatment T were changed to a counterfactual. All of these techniques require a particular set of auxiliary assumptions: randomization requires the SUTVA assumption that treatment of one unit does not affect the outcome of another unit, IV requires the exclusion restriction, diff-in-diff requires the parallel trends assumption, and so on. In general, auxiliary assumptions will only hold in certain specific contexts, and hence by construction the result will not be representative. Further, these assumptions are very limited in that they can’t recover every conditional aspect of y, but rather recover only summary statistics like the average treatment effect. Techniques like multiple regression with covariate controls, or machine learning nonparametric estimates, can draw on a more general dataset, but as Aronow and Samii pointed out, the marginal effect on treatment status they identify is not necessarily effectively drawing on a more general sample.

Structural folks are interested in estimating y=f(X,U,V(T),T), where U and V are unobserved, and the nature of the unobserved variables V is affected by T. For example, V may be inflation expectations, T may be the interest rate, y may be inflation today, and X and U are observable and unobservable country characteristics. Put another way, the functional form of f may depend on how exactly T is modified, through V(T). This Lucas Critique problem is assumed away by the auxiliary assumptions in causal models. In order to identify a treatment effect, then, additional auxiliary assumptions generally derived from economic theory are needed in order to understand how V will change in response to a particular treatment type. Even more common is to use a set of auxiliary assumptions to find a sufficient statistic for the particular parameter desired, which may not even be a treatment effect. In this sense, structural estimation is similar to causal models in one way and different in two. It is similar in that it relies on auxiliary assumptions to help extract particular parameters of interest when there are unobservables that matter. It is different in that it permits unobservables to be functions of policy, and that it uses auxiliary assumptions whose credibility leans more heavily on non-obvious economic theory. In practice, structural models often also require auxiliary assumptions which do not come directly from economic theory, such as assumptions about the distribution of error terms which are motivated on the basis of statistical arguments, but *in principle* this distinction is not a first order difference.

We then have a nice typology. Even if you have a completely universal and representative dataset, multiple regression controlling for covariates does not generally give you a “generalizable” treatment effect. Machine learning can try to extract treatment effects when the data generating process is wildly nonlinear, but has the same nonrepresentativeness problem and the same “what about omitted variables” problem. Causal models can extract some parameters of interest from nonrepresentative datasets where it is reasonable to assume certain auxiliary assumptions hold. Structural models can extract more parameters of interest, sometimes from more broadly representative datasets, and even when there are unobservables that depend on the nature of the policy, but these models require auxiliary assumptions that can be harder to defend. The so-called sufficient statistics approach tries to retain the former advantages of structural models while reducing the heroics that auxiliary assumptions need to perform.

Aronow and Samii is forthcoming in the American Journal of Political Science; the final working paper is at the link. Related to this discussion, Ricardo Hausmann caused a bit of a stir online this week with his “constant adaptation rather than RCT” article. His essential idea was that, unlike with a new medical drug, social science interventions vary drastically depending on the exact place or context; that is, external validity matters so severely that slowly moving through “RCT: Try idea 1”, then “RCT: Try idea 2”, is less successful than smaller, less precise explorations of the “idea space”. He received a lot of pushback from the RCT crowd, but I think for the wrong reason: the constant iteration is *less* likely to discover underlying mechanisms than even an RCT, as it is still far too atheoretical. The link Hausmann makes to “lean manufacturing” is telling: GM famously (Henderson and Helper 2014) took photos of every square inch of NUMMI, their joint venture plant with Toyota, and tried to replicate this plant in their other plants. But the underlying *reason* NUMMI and Toyota worked has to do with the credibility of various relational contracts, rather than the (constantly iterated) features of the shop floor. Iterating without attempting to glean the underlying mechanisms at play is not a rapid route to good policy.

*Edit: A handful of embarrassing typos corrected, 2/26/2016*

I’m confused why you claim machine learning can’t deal with omitted variables. Machine learning can be used to estimate parameters of any model. Different types of models require different machine learning methods; it’s not magic. But to say that omitted variables are out of scope just seems completely wrong. If your model has a catch-all error term, machine learning methods will find a parameter for it.

This is not correct. The problem is not that an omitted variable, if added, would fit the model better. The problem is that a particular form of omitted variable, in conjunction with auxiliary assumptions, will lend a causal interpretation to the marginal effects in the fitted model. Consider the simplest possible example, dating back to the 40s: supply and demand. You have a huge number of points (P,Q) representing price and quantity. What is the supply and demand curve? Price and quantity can both increase either because demand shifts out and supply is constant, or because supply shifts in and demand shifts out, or… We can use an IV of things known to shift only supply or demand to recover the curves. Now in principle the idea of IV should work with nonlinear ML style functions, but it turns out there are serious statistical problems with, say, nonlinear IV, and the same goes for other attempts to combine “causal assumptions” with “machine learned nonparametric regression”.
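To make the supply and demand example concrete, here is a sketch with invented linear curves and a made-up cost shifter `z` that moves supply only. The slopes, intercepts, and noise are assumptions chosen purely for illustration.

```python
import random

rng = random.Random(1)
N = 20_000
DEMAND_SLOPE = -1.0  # hypothetical true slope of the demand curve
SUPPLY_SLOPE = 1.0   # hypothetical true slope of the supply curve

prices, quantities, shifters = [], [], []
for _ in range(N):
    z = rng.gauss(0, 1)         # cost shock that moves supply only
    u_demand = rng.gauss(0, 1)  # unobserved demand shock
    u_supply = rng.gauss(0, 1)  # unobserved supply shock
    # Demand: q = 10 + DEMAND_SLOPE * p + u_demand
    # Supply: q = 2 + SUPPLY_SLOPE * p - z + u_supply
    # Setting the two equal and solving for the market-clearing price:
    p = (10 - 2 + z + u_demand - u_supply) / (SUPPLY_SLOPE - DEMAND_SLOPE)
    q = 10 + DEMAND_SLOPE * p + u_demand
    prices.append(p)
    quantities.append(q)
    shifters.append(z)


def cov(a, b):
    """Sample covariance (dividing by n is fine for a ratio of covariances)."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    return sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / len(a)


ols_slope = cov(quantities, prices) / cov(prices, prices)
iv_slope = cov(quantities, shifters) / cov(prices, shifters)

print(f"True demand slope:          {DEMAND_SLOPE:.2f}")
print(f"Fit of Q on P (no IV):      {ols_slope:.2f}")
print(f"IV using the supply shock:  {iv_slope:.2f}")
```

The cloud of (P,Q) points alone yields a slope that is neither curve, however well it fits. The assumption that `z` moves only supply is what turns the same data into an estimate of the demand slope.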

Thanks for another great post Kevin.

A note: I would say Aronow and Samii are two hotshot political scientists. Unless, of course, we are making an argument about the artificiality of the barriers between the two disciplines, in which case, I wholeheartedly agree.

Oh jeez – of course I mean “political scientists”, but us economists are imperialistic even subconsciously!

Thanks for the pointer to this interesting paper. However, I thought that Aronow and Samii are overselling their results a bit. The last section before the conclusion reveals that the typical setup of the matching literature (selection on observables plus common support or, in their terminology, unconfoundedness and positivity) allows one to estimate average causal effects that are representative for the population for which the assumptions hold. Aronow and Samii mention inverse probability weighting (IPW) rather than matching but this should be equivalent, as far as I know. Read with this in mind, the paper reminds us that multiple linear regression makes strong functional form assumptions (like constant coefficients) and does not generally identify an average treatment effect on the treated when treatment responses are heterogeneous. It’s good to make this point clear. But I don’t think that this is a very novel insight, although I could be mistaken. Taking matching or IPW as a solution to the problem, the focus should be on the positivity assumption which is crucial for the external validity of the results. But again, it should be obvious that one relies heavily on out-of-sample predictions (based on the assumed functional form) if positivity fails and instead any parametric regression technique is applied. Essentially one then would impute a counterfactual treatment status for data points where this counterfactual is observed with probability zero.

Am I doing the paper injustice here? Would love to hear your thoughts on this.

I was intrigued by your write up on Aronow and Samii, which led me to believe that you consider the problem of external validity to be central to statistical methodology. I agree with you, and would like to call your attention to a recent solution of this problem.

If you examine this paper http://ftp.cs.ucla.edu/pub/stat_ser/r450.pdf (or this http://ftp.cs.ucla.edu/pub/stat_ser/r400-reprint.pdf or this http://ftp.cs.ucla.edu/pub/stat_ser/r425.pdf), you see that external validity, including transporting experimental findings across heterogeneous populations and generalizing from a biased sample to the population at large, has been reduced to syntactic derivation. It can safely be considered a “solved problem”, in the sense that we have a full mathematical characterization (if and only if condition) of when a causal query can be answered (transported) in the target with information from the source.

Thanks for an interesting and useful summary of the arguments. Would you agree that there is congruence here with a critical realist position of concern about conventional quantitative methods in the social sciences? Simplistically, this would argue that the search for ‘average causal effects’ across populations is illusory, and what we need are better theories on context-specific configurations of causes for specified types of cases.