Topics:
Lars Pålsson Syll considers the following as important: Theory of Science & Methodology
The limitations of RCTs (wonkish)
Yours truly is extremely fond of science philosophers and economists like Nancy Cartwright and Angus Deaton. With razor-sharp intellects they immediately go for the essentials. They have no time for bullshit. And neither should we:
Randomised controlled trials (RCTs) have been sporadically used in economic research since the negative income tax experiments between 1968 and 1980 … and have been regularly used since then to evaluate labour market and welfare programmes … In recent years, they have spread widely in economics …
In a recent paper, we argue that some of the popularity of RCTs, among the public as well as some practitioners, rests on misunderstandings about what they are capable of accomplishing (Deaton and Cartwright 2016). Well-conducted RCTs could provide unbiased estimates of the average treatment effect (ATE) in the study population, provided no relevant differences between treatment and control are introduced post randomisation, which blinding of subjects, investigators, data collectors, and analysts serves to diminish. Unbiasedness says that, if we were to repeat the trial many times, we would be right on average. Yet we are almost never in such a situation, and with only one trial (as is virtually always the case) unbiasedness does nothing to prevent our single estimate from being very far away from the truth. If, as is often believed, randomisation were to guarantee that the treatment and control groups are identical except for the treatment, then indeed, we would have a precise – indeed exact – estimate of the ATE. But randomisation does nothing of the kind …
A well-conducted RCT can yield a credible estimate of an ATE in one specific population, namely the ‘study population’ from which the treatments and controls were selected. Sometimes this is enough; if we are doing a post hoc program evaluation, if we are testing a hypothesis that is supposed to be generally true, if we want to demonstrate that the treatment can work somewhere, or if the study population is a randomly drawn sample from the population of interest whose ATE we are trying to measure. Yet the study population is often not the population that we are interested in, especially if subjects must volunteer to be in the experiment and have their own reasons for participating or not …
More generally, demonstrating that a treatment works in one situation is exceedingly weak evidence that it will work in the same way elsewhere; this is the ‘transportation’ problem: what does it take to allow us to use the results in new contexts, whether policy contexts or in the development of theory? … No matter how rigorous or careful the RCT, if the bridge is built by a hand-waving simile that the policy context is somehow similar to the experimental context, the rigor in the trial does nothing to support a policy … Causal effects depend on the settings in which they are derived, and often depend on factors that might be constant within the experimental setting but different elsewhere … Without knowing why things happen and why people do things, we run the risk of worthless casual (‘fairy story’) causal theorising, and we have given up on one of the central tasks of economics.
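The point about unbiasedness in the passage quoted above can be made concrete with a small simulation. What follows is only a sketch under stated assumptions: the sample size, the true ATE of 1.0 and the noise level are hypothetical numbers chosen for illustration, not figures from Deaton and Cartwright. It repeats the same small trial many times, something we almost never get to do in practice, and compares the average of the estimates with the spread of the individual ones.

```python
import random
import statistics

def run_trial(rng, n=40, true_ate=1.0, noise_sd=5.0):
    """Simulate one RCT with n subjects; return the difference-in-means estimate.

    All parameter values are hypothetical and chosen only for illustration.
    """
    # Subjects differ in their baseline (untreated) outcomes.
    baseline = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    # Randomly assign exactly half of the subjects to treatment.
    treated_ids = set(rng.sample(range(n), n // 2))
    treated = [baseline[i] + true_ate for i in range(n) if i in treated_ids]
    control = [baseline[i] for i in range(n) if i not in treated_ids]
    return statistics.mean(treated) - statistics.mean(control)

def simulate(n_trials=5000, seed=0):
    """Repeat the trial n_trials times and summarise the estimates."""
    rng = random.Random(seed)
    estimates = [run_trial(rng) for _ in range(n_trials)]
    return statistics.mean(estimates), statistics.stdev(estimates), estimates

mean_est, sd_est, estimates = simulate()
# Share of single trials that get even the sign of the effect wrong.
wrong_sign = sum(e < 0 for e in estimates) / len(estimates)

print(f"average of estimates: {mean_est:.2f} (true ATE is 1.0)")
print(f"spread of single estimates (SD): {sd_est:.2f}")
print(f"share of trials with the wrong sign: {wrong_sign:.2f}")
```

Under these assumptions the average over many trials sits close to the true ATE, while any single trial can land far from it, sometimes on the wrong side of zero. That is all 'unbiasedness' promises.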
Nowadays many mainstream economists maintain that ‘imaginative empirical methods’ (such as natural experiments, field experiments, lab experiments, RCTs) can help us to answer questions concerning the external validity of economic models. In their view they are more or less tests of ‘an underlying economic model’ and enable economists to make the right selection from the ever expanding ‘collection of potentially applicable models.’
When looked at carefully, however, there are in fact few real reasons to share this optimism about the alleged ‘empirical turn’ in economics.
If we see experiments or field studies as theory tests or models that ultimately aspire to say something about the real ‘target system,’ then the problem of external validity is central (and was for a long time also a key reason why behavioural economists had trouble getting their research results published).
Assume that you have examined how the work performance of Chinese workers A is affected by B (‘treatment’). How can we extrapolate/generalize to new samples outside the original population (e.g. to the US)? How do we know that any replication attempt ‘succeeds’? How do we know when these replicated experimental results can be said to justify inferences made in samples from the original population? If, for example, P(A|B) is the conditional density function for the original sample, and we are interested in making an extrapolative prediction of E[P(A|B)], how can we know that the new sample’s density function is identical with the original? Unless we can give some really good argument for this being the case, inferences built on P(A|B) do not really say anything about the target system’s P'(A|B).
External validity/extrapolation/generalization is founded on the assumption that we can make inferences based on P(A|B) that are exportable to other populations for which P'(A|B) applies. Sure, if one can convincingly show that P and P' are similar enough, the problems are perhaps surmountable. But arbitrarily just introducing functional specification restrictions of the type invariance/stability/homogeneity is, at least for an epistemological realist, far from satisfactory. And often it is – unfortunately – exactly this that I see when I read mainstream economists’ RCTs and ‘experiments.’
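To see why P and P' need not be ‘similar enough,’ consider a minimal sketch. It assumes, purely hypothetically, that the treatment effect depends on a background factor Z (the effect is +2 when Z = 1 and −1 when Z = 0) and that the study and target populations differ only in how common Z is. The moderator, the effect sizes and the population shares are all invented for illustration.

```python
import random
import statistics

def rct_estimate(rng, n, share_z):
    """Difference-in-means estimate from a simulated RCT in a population
    where a fraction share_z of subjects have the (hypothetical) moderator Z = 1."""
    treated, control = [], []
    for _ in range(n):
        z = 1 if rng.random() < share_z else 0
        effect = 2.0 if z == 1 else -1.0      # assumption: effect depends on Z
        is_treated = rng.random() < 0.5        # coin-flip assignment
        y = rng.gauss(0.0, 1.0) + (effect if is_treated else 0.0)
        (treated if is_treated else control).append(y)
    return statistics.mean(treated) - statistics.mean(control)

rng = random.Random(1)
# Study population: Z = 1 for 80% of subjects, so the true ATE is 0.8*2 - 0.2*1 = 1.4.
study = rct_estimate(rng, n=20000, share_z=0.8)
# Target population: Z = 1 for 20% of subjects, so the true ATE is 0.2*2 - 0.8*1 = -0.4.
target = rct_estimate(rng, n=20000, share_z=0.2)

print(f"ATE estimated in the study population:  {study:.2f}")
print(f"ATE estimated in the target population: {target:.2f}")
```

Both trials are internally valid, yet in this sketch the study result tells us the wrong sign for the target population. Nothing in the first trial itself reveals that Z matters, which is why the similarity of P and P' has to be argued for and cannot simply be assumed.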
Many ‘experimentalists’ claim that it is easy to replicate experiments under different conditions and therefore a fortiori easy to test the robustness of experimental results. But is it really that easy? Population selection is almost never simple. Had the problem of external validity only been about inference from sample to population, this would be no critical problem. But the really interesting inferences are those we try to make from specific labs/experiments/fields to specific real world situations/institutions/structures that we are interested in understanding or (causally) explaining. And then the population problem is more difficult to tackle.
In randomized trials the researchers try to find out the causal effects that different variables of interest may have by changing circumstances randomly, a procedure somewhat (‘on average’) equivalent to the usual ceteris paribus assumption.
Besides the fact that ‘on average’ is not always ‘good enough,’ it amounts to nothing but hand waving to assume simpliciter, without argumentation, that it is tenable to treat social agents and relations as homogeneous and interchangeable entities.
Randomization is basically used to allow the econometrician to treat the population as consisting of interchangeable and homogeneous groups (‘treatment’ and ‘control’). The regression models one arrives at by using randomized trials tell us the average effect that variations in variable X have on the outcome variable Y, without having to explicitly control for effects of other explanatory variables R, S, T, etc. Everything is assumed to be essentially equal except the values taken by variable X.
In a usual regression context one would apply an ordinary least squares estimator (OLS) in trying to get an unbiased and consistent estimate:
Y = α + βX + ε,
where α is a constant intercept, β a constant ‘structural’ causal effect and ε an error term.
The problem here is that although we may get an estimate of the ‘true’ average causal effect, this may ‘mask’ important heterogeneous effects of a causal nature. Although we may get the right answer that the average causal effect is 0, those who are ‘treated’ (X=1) may have causal effects equal to −100 and those ‘not treated’ (X=0) may have causal effects equal to 100. Contemplating being treated or not, most people would probably be interested in knowing about this underlying heterogeneity and would not consider the OLS average effect particularly enlightening.
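A small sketch shows how an average of 0 can hide effects of ±100. It uses a variant of the example above and should be read as such: the assumption (mine, for illustration) is that the population consists of two equally large latent types, one that gains 100 from treatment and one that loses 100, with treatment assigned by coin flip. The OLS slope for Y = α + βX + ε is computed by hand as cov(X, Y)/var(X).

```python
import random
import statistics

def ols_slope(x, y):
    """OLS slope of y on x (with intercept): cov(x, y) / var(x)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

rng = random.Random(2)
n = 40000
# Hypothetical individual causal effects: even-indexed units gain 100,
# odd-indexed units lose 100, so the true average effect is exactly 0.
tau = [100.0 if i % 2 == 0 else -100.0 for i in range(n)]
# Coin-flip treatment assignment, independent of type.
x = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
# Observed outcome: baseline noise plus the individual effect if treated.
y = [rng.gauss(0.0, 1.0) + t * xi for t, xi in zip(tau, x)]

beta_pooled = ols_slope(x, y)               # what the usual regression reports
beta_gainers = ols_slope(x[0::2], y[0::2])  # only visible if types were observed
beta_losers = ols_slope(x[1::2], y[1::2])

print(f"pooled OLS estimate of beta: {beta_pooled:.2f}")
print(f"estimate among gainers:      {beta_gainers:.2f}")
print(f"estimate among losers:       {beta_losers:.2f}")
```

The pooled estimate is close to the true average of 0 and in that sense ‘right,’ but the subgroup estimates are only available here because the simulation knows each unit’s type. In a real trial the types are typically unobserved, and the regression output alone gives no hint of them.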
Limiting model assumptions in economic science always have to be closely examined. If we want to show that the mechanisms or causes that we isolate and handle in our models are stable, in the sense that they do not change when we ‘export’ them to our ‘target systems,’ we have to be able to show that they do not hold only under ceteris paribus conditions. If they do, they are a fortiori of only limited value to our understanding, explanations or predictions of real economic systems.
Most ‘randomistas’ underestimate the heterogeneity problem. It does not just turn up as an external validity problem when trying to ‘export’ regression results to different times or different target populations. It is also often an internal problem to the millions of regression estimates that economists produce every year.
Just like econometrics, randomization promises more than it can deliver, basically because it requires assumptions that in practice are not possible to maintain. And just like econometrics, randomization is basically a deductive method. Given the assumptions, these methods deliver deductive inferences. The problem, of course, is that we will never completely know when the assumptions are right. And although randomization may contribute to controlling for confounding, it does not guarantee it, since genuine randomness presupposes infinite experimentation and we know all real experimentation is finite. And even if randomization may help to establish average causal effects, it says nothing of individual effects unless homogeneity is added to the list of assumptions. Causal evidence generated by randomization procedures may be valid in ‘closed’ models, but what we usually are interested in is causal evidence in the real target system we happen to live in.
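That randomization in a finite experiment does not guarantee balance is easy to illustrate. The sketch below assumes a single standard-normal background covariate and counts how often random assignment leaves the two arms more than half a standard deviation apart on it. The sample sizes, the 0.5 threshold and the number of repetitions are hypothetical choices made only for illustration.

```python
import random

def imbalance_share(rng, n, reps, threshold=0.5):
    """Share of random assignments in which the two arms differ by more than
    `threshold` (in covariate SD units) on one background covariate."""
    half = n // 2
    count = 0
    for _ in range(reps):
        # Draw one covariate value per subject (assumed standard normal).
        covariate = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Random assignment: shuffle, then treat the first half.
        rng.shuffle(covariate)
        mean_treated = sum(covariate[:half]) / half
        mean_control = sum(covariate[half:]) / (n - half)
        if abs(mean_treated - mean_control) > threshold:
            count += 1
    return count / reps

rng = random.Random(3)
share_small = imbalance_share(rng, n=20, reps=2000)    # 10 subjects per arm
share_large = imbalance_share(rng, n=2000, reps=200)   # 1000 subjects per arm

print(f"n = 20:   share of badly imbalanced assignments = {share_small:.2f}")
print(f"n = 2000: share of badly imbalanced assignments = {share_large:.2f}")
```

With small samples a sizeable share of perfectly random assignments is badly imbalanced on this one covariate, and real subjects differ on many covariates at once. Balance is a property of the procedure in the long run, not of any given finite trial.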
When does a conclusion established in population X hold for target population Y? Only under very restrictive conditions!
‘Ideally controlled experiments’ tell us with certainty what causes what effects — but only given the right ‘closures.’ Making appropriate extrapolations from (ideal, accidental, natural or quasi) experiments to different settings, populations or target systems, is not easy. “It works there” is no evidence for “it will work here”. Causes deduced in an experimental setting still have to show that they come with an export-warrant to the target population/system. The causal background assumptions made have to be justified, and without licenses to export, the value of ‘rigorous’ and ‘precise’ methods — and ‘on-average-knowledge’ — is despairingly small.