**Topics:**

Lars Pålsson Syll considers the following as important: Statistics & Econometrics

## On the limited value of randomization

In *Social Science and Medicine* (December 2017), Angus Deaton & Nancy Cartwright argue that Randomized Controlled Trials (RCTs) do not have any warranted special status. They are, simply, far from being the ‘gold standard’ they are usually portrayed as:

Contrary to frequent claims in the applied literature, randomization does not equalize everything other than the treatment in the treatment and control groups, it does not automatically deliver a precise estimate of the average treatment effect (ATE), and it does not relieve us of the need to think about (observed or unobserved) covariates. Finding out whether an estimate was generated by chance is more difficult than commonly believed. At best, an RCT yields an unbiased estimate, but this property is of limited practical value. Even then, estimates apply only to the sample selected for the trial, often no more than a convenience sample, and justification is required to extend the results to other groups, including any population to which the trial sample belongs, or to any individual, including an individual in the trial. Demanding ‘external validity’ is unhelpful because it expects too much of an RCT while undervaluing its potential contribution. RCTs do indeed require minimal assumptions and can operate with little prior knowledge. This is an advantage when persuading distrustful audiences, but it is a disadvantage for cumulative scientific progress, where prior knowledge should be built upon, not discarded. RCTs can play a role in building scientific knowledge and useful predictions but they can only do so as part of a cumulative program, combining with other methods, including conceptual and theoretical development, to discover not ‘what works’, but ‘why things work’.

In a comment on Deaton & Cartwright, statistician Stephen Senn argues that on several issues concerning randomization Deaton & Cartwright “simply confuse the issue,” that their views are “simply misleading and unhelpful” and that they make “irrelevant” simulations:

My view is that randomisation should not be used as an excuse for ignoring what is known and observed but that it does deal validly with hidden confounders. It does not do this by delivering answers that are guaranteed to be correct; nothing can deliver that. It delivers answers about which valid probability statements can be made and, in an imperfect world, this has to be good enough. Another way I sometimes put it is like this: show me how you will analyse something and I will tell you what allocations are exchangeable. If you refuse to choose one at random I will say, “why? Do you have some magical thinking you’d like to share?”

Contrary to Senn, Andrew Gelman shares Deaton’s and Cartwright’s view that randomized trials often are overrated:

There is a strange form of reasoning we often see in science, which is the idea that a chain of reasoning is as strong as its strongest link. The social science and medical research literature is full of papers in which a randomized experiment is performed, a statistically significant comparison is found, and then story time begins, and continues, and continues—as if the rigor from the randomized experiment somehow suffuses through the entire analysis …

One way to get a sense of the limitations of controlled trials is to consider the conditions under which they can yield meaningful, repeatable inferences. The measurement needs to be relevant to the question being asked; missing data must be appropriately modeled; any relevant variables that differ between the sample and population must be included as potential treatment interactions; and the underlying effect should be large. It is difficult to expect these conditions to be satisfied without good substantive understanding. As Deaton and Cartwright put it, “when little prior knowledge is available, no method is likely to yield well-supported conclusions.” Much of the literature in statistics, econometrics, and epidemiology on causal identification misses this point, by focusing on the procedures of scientific investigation—in particular, tools such as randomization and p-values which are intended to enforce rigor—without recognizing that rigor is empty without something to be rigorous about.

Yours truly’s view is that nowadays many social scientists maintain that ‘imaginative empirical methods’ — such as natural experiments, field experiments, lab experiments, RCTs — can help us to answer questions concerning the external validity of models used in social sciences. In their view, they are more or less tests of ‘an underlying model’ that enable them to make the right selection from the ever-expanding ‘collection of potentially applicable models.’ When looked at carefully, however, there are in fact few real reasons to share this optimism.

Many ‘experimentalists’ claim that it is easy to replicate experiments under different conditions and therefore a fortiori easy to test the robustness of experimental results. But is it really that easy? Population selection is almost never simple. Had the problem of external validity only been about inference from sample to population, this would be no critical problem. But the really interesting inferences are those we try to make from specific labs/experiments/fields to specific real-world situations/institutions/structures that we are interested in understanding or (causally) explaining. And then the population problem is more difficult to tackle.

In randomized trials the researchers try to find out the causal effects that different variables of interest may have by changing circumstances randomly, a procedure somewhat (‘on average’) equivalent to the usual *ceteris paribus* assumption.

Besides the fact that ‘on average’ is not always ‘good enough,’ it amounts to nothing but hand waving to assume simpliciter, without argumentation, that it is tenable to treat social agents and relations as homogeneous and interchangeable entities.

Randomization is basically used to allow the econometrician to treat the population as consisting of ‘exchangeable’ and homogeneous groups (‘treatment’ and ‘control’). The regression models one arrives at by using randomized trials tell us the average effect that variations in variable X have on the outcome variable Y, without having to explicitly control for effects of other explanatory variables R, S, T, etc. Everything is assumed to be essentially equal except the values taken by variable X. But as noted by Jerome Cornfield, even if one of the main functions of randomization is to generate a sample space, there are

reasons for questioning the basic role of the sample space, i.e., of variations from sample to sample, in statistical theory. In practice, certain unusual samples would ordinarily be modified, adjusted or entirely discarded, if they in fact were obtained, even though they are part of the basic description of sampling variation. Savage reports that Fisher, when asked what he would do with a randomly selected Latin Square that turned out to be a Knut Vik Square, replied that “he thought he would draw again and that, ideally, a theory explicitly excluding regular squares should be developed.” But this option is not available in clinical trials and undesired baseline imbalances between treated and control groups can occur. There is often no alternative to reweighting or otherwise adjusting for these imbalances.
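To make Cornfield's point about baseline imbalances concrete, here is a minimal sketch in Python with NumPy. Everything in it is an illustrative assumption rather than anything taken from the quoted text: the function name `unlucky_trial`, the sample size, the true effect of 2, the covariate weight of 3, and the rule of redrawing allocations until the covariate gap exceeds 0.5 (which mimics an ‘unlucky’ randomization rather than a typical one).

```python
import numpy as np

def unlucky_trial(n=20, min_gap=0.5, seed=1):
    """Simulate one small trial whose random allocation happens to be imbalanced.

    Hypothetical setup: outcome y = 1 + 2*x + 3*z + noise, so the assumed
    true treatment effect is 2 and z is a prognostic baseline covariate.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, 1.0, size=n)  # baseline covariate, fixed before allocation

    # Keep drawing random allocations until one shows a visible baseline gap.
    while True:
        x = np.zeros(n)
        x[rng.permutation(n)[: n // 2]] = 1.0
        gap = z[x == 1].mean() - z[x == 0].mean()
        if abs(gap) > min_gap:
            break

    y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(0.0, 0.1, size=n)

    # Unadjusted estimate: OLS of y on an intercept and x only.
    design_plain = np.column_stack([np.ones(n), x])
    unadjusted = np.linalg.lstsq(design_plain, y, rcond=None)[0][1]

    # Adjusted estimate: OLS of y on an intercept, x and the covariate z.
    design_adj = np.column_stack([np.ones(n), x, z])
    adjusted = np.linalg.lstsq(design_adj, y, rcond=None)[0][1]

    return float(unadjusted), float(adjusted), float(gap)

if __name__ == "__main__":
    unadjusted, adjusted, gap = unlucky_trial()
    print(f"baseline gap in z: {gap:+.2f}")
    print(f"unadjusted estimate: {unadjusted:.2f}, adjusted estimate: {adjusted:.2f}")
```

In this stylized case the adjustment works only because the imbalanced covariate is observed and the outcome model is assumed to be linear; neither is guaranteed in a real trial.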

In a usual regression context, one would apply an ordinary least squares estimator (OLS) in trying to get an unbiased and consistent estimate:

Y = α + βX + ε,

where α is a constant intercept, β a constant ‘structural’ causal effect and ε an error term.

The problem here is that although we may get an estimate of the ‘true’ average causal effect, this may ‘mask’ important heterogeneous effects of a causal nature. Although we get the right answer of the average causal effect being 0, those who are ‘treated’ (X=1) may have causal effects equal to −100 and those ‘not treated’ (X=0) may have causal effects equal to 100. Contemplating whether to be treated or not, most people would probably be interested in knowing about this underlying heterogeneity and would not consider the OLS average effect particularly enlightening.
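The masking of heterogeneity can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `masked_heterogeneity`, the intercept of 5, the noise level, and the 50/50 split into two hidden subgroups with effects of −100 and +100 are all hypothetical choices made for illustration, and the subgroup split is assumed to be independent of the randomized treatment.

```python
import numpy as np

def masked_heterogeneity(n=10_000, seed=0):
    """OLS on a randomized treatment when individual effects are -100 or +100."""
    rng = np.random.default_rng(seed)

    # Hypothetical hidden subgroups: half the units have effect -100, half +100.
    effect = np.where(np.arange(n) % 2 == 0, -100.0, 100.0)
    rng.shuffle(effect)

    x = rng.integers(0, 2, size=n)           # randomized treatment indicator (0 or 1)
    eps = rng.normal(0.0, 1.0, size=n)       # error term
    y = 5.0 + effect * x + eps               # Y = alpha + beta_i * X + eps

    # Pooled OLS of Y on an intercept and X estimates the average effect.
    design = np.column_stack([np.ones(n), x])
    beta_hat = np.linalg.lstsq(design, y, rcond=None)[0][1]

    # Difference in means within each (normally unobserved) subgroup.
    by_group = {}
    for g in (-100.0, 100.0):
        in_group = effect == g
        by_group[g] = y[in_group & (x == 1)].mean() - y[in_group & (x == 0)].mean()

    return float(beta_hat), by_group

if __name__ == "__main__":
    beta_hat, by_group = masked_heterogeneity()
    print(f"pooled OLS estimate of beta: {beta_hat:.2f}")
    for g, est in by_group.items():
        print(f"subgroup with true effect {g:+.0f}: estimated {est:+.2f}")
```

The pooled coefficient comes out close to zero even though no single unit in the simulation has an effect anywhere near zero. The subgroup estimates are recoverable here only because the simulation knows who belongs to which group.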

Limiting model assumptions in science always have to be closely examined. If we are to show that the mechanisms or causes that we isolate and handle in our models are stable, in the sense that they do not change when we ‘export’ them to our ‘target systems,’ we have to show that they do not hold only under ceteris paribus conditions. If they do, they are a fortiori of limited value to our understanding, explanations or predictions of real-world systems.

Most ‘randomistas’ underestimate the heterogeneity problem. It does not just turn up as an external validity problem when trying to ‘export’ regression results to different times or different target populations. It is also often an internal problem to the millions of regression estimates that are produced every year.

Just like econometrics, randomization promises more than it can deliver, basically because it requires assumptions that in practice are not possible to maintain. And just like econometrics, randomization is basically a deductive method. Given the assumptions, these methods deliver deductive inferences. The problem, of course, is that we will never completely know when the assumptions are right. And although randomization may contribute to controlling for confounding, it does not guarantee it, since genuine randomness presupposes infinite experimentation and we know all real experimentation is finite. And even if randomization may help to establish average causal effects, it says nothing of individual effects unless homogeneity is added to the list of assumptions. Causal evidence generated by randomization procedures may be valid in ‘closed’ models, but what we usually are interested in is causal evidence in the real-world target system we happen to live in.
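The finite-sample point can be sketched numerically. The following is a rough illustration, not a formal argument: the function name `imbalance_share`, the standard-normal covariate, the threshold of 0.5 and the sample sizes are all assumptions chosen for the example. It asks how often a purely random 50/50 allocation leaves the two arms visibly different on a single covariate.

```python
import numpy as np

def imbalance_share(n, draws=2000, threshold=0.5, seed=0):
    """Share of random 50/50 allocations whose covariate gap exceeds threshold."""
    rng = np.random.default_rng(seed)
    covariate = rng.normal(0.0, 1.0, size=n)  # one fixed covariate for the n units

    count = 0
    for _ in range(draws):
        order = rng.permutation(n)            # a fresh random allocation
        treated = covariate[order[: n // 2]]
        control = covariate[order[n // 2 :]]
        if abs(treated.mean() - control.mean()) > threshold:
            count += 1
    return count / draws

if __name__ == "__main__":
    for n in (20, 200, 2000):
        print(f"n = {n:5d}: share of imbalanced allocations = {imbalance_share(n):.3f}")
```

With a small trial a non-trivial share of allocations is imbalanced on this one covariate, and the share shrinks only as the sample grows. With many covariates, observed or not, the chance that at least one of them is imbalanced in any given finite trial is correspondingly larger.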

Some statisticians and data scientists think that algorithmic formalisms somehow give them access to causality. That is, however, simply not true. Assuming ‘convenient’ things like ‘faithfulness,’ ‘exchangeability,’ or stability, is not to give proof. It’s to assume what has to be proven. Deductive-axiomatic methods used in statistics do not produce evidence for causal inferences. The real causality we are searching for is the one existing in the real world around us. If there is no warranted connection between axiomatically derived theorems and the real world, well, then we haven’t really obtained the causation we are looking for.

As social scientists, we have to confront the all-important question of how to handle uncertainty and randomness. Should we define randomness with probability? If we do, we have to accept that to speak of randomness we also have to presuppose the existence of nomological probability machines, since probabilities cannot be spoken of – and actually, to be strict, do not at all exist – without specifying such system-contexts. Accepting a domain of probability theory and sample space of infinite populations also implies that judgments are made on the basis of observations that are actually never made!

Infinitely repeated trials or samplings never take place in the real world. So that cannot be a sound inductive basis for science with aspirations of explaining real-world socio-economic processes, structures or events. It’s not tenable.

And as if this wasn’t enough, one could also seriously wonder what kind of ‘populations’ many statistical models ultimately are based on. Why should we as social scientists — and not as pure mathematicians working with formal-axiomatic systems without the urge to confront our models with real target systems — unquestioningly accept models based on concepts like the ‘infinite super populations’ used in e.g. the ‘potential outcome’ framework that has become so popular lately in social sciences?

Modelling assumptions made in statistics are more often than not made for reasons of mathematical tractability rather than verisimilitude. That is unfortunately also a reason why the methodological ‘rigour’ encountered when reading statistical research often is deceptive. The models constructed may seem technically advanced and very ‘sophisticated,’ but that’s usually only because the problems discussed here have been swept under the carpet. Assuming that our data are generated by ‘coin flips’ in an imaginary ‘superpopulation’ only means that we get answers to questions that we are not asking. The inferences made based on imaginary ‘superpopulations,’ well, they too are nothing but imaginary.

‘Ideally controlled experiments’ tell us with certainty what causes what effects — but only given the right ‘closures.’ Making appropriate extrapolations from (ideal, accidental, natural or quasi) experiments to different settings, populations or target systems, is not easy. ‘It works there’ is no evidence for ‘it will work here’. Causes deduced in an experimental setting still have to show that they come with an export warrant to the target population/system. The causal background assumptions made have to be justified, and without licenses to export, the value of ‘rigorous’ and ‘precise’ methods — and ‘on-average-knowledge’ — is often despairingly small.

In our days, serious arguments have been made from data. Beautiful, delicate theorems have been proved, although the connection with data analysis often remains to be established. And an enormous amount of fiction has been produced, masquerading as rigorous science …

Indeed, far-reaching claims have been made for the superiority of a quantitative template that depends on modeling — by those who manage to ignore the far-reaching assumptions behind the models. However, the assumptions often turn out to be unsupported by data. If so, the rigor of advanced quantitative methods is a matter of appearance rather than substance …

David A. Freedman, *Statistical Models and Causal Inference*