Significance tests — asking the wrong questions and getting the wrong answers Scientists have enthusiastically adopted significance testing and hypothesis testing because these methods appear to solve a fundamental problem: how to distinguish “real” effects from randomness or chance. Unfortunately significance testing and hypothesis testing are of limited scientific value – they often ask the wrong question and almost always give the wrong answer. And they are widely misinterpreted. Consider a clinical trial designed to investigate the effectiveness of new treatment for some disease. After the trial has been conducted the researchers might ask “is the observed effect of treatment real, or could it have arisen merely by chance?” If the calculated p value is less than 0.05 the researchers might claim the trial has demonstrated the treatment was effective. But even before the trial was conducted we could reasonably have expected the treatment was “effective” – almost all drugs have some biochemical action and all surgical interventions have some effects on health. Almost all health interventions have some effect, it’s just that some treatments have effects that are large enough to be useful and others have effects that are trivial and unimportant.
Topics:
Lars Pålsson Syll considers the following as important: Statistics & Econometrics
This could be interesting, too:
Lars Pålsson Syll writes The history of econometrics
Lars Pålsson Syll writes What statistics teachers get wrong!
Lars Pålsson Syll writes Statistical uncertainty
Lars Pålsson Syll writes The dangers of using pernicious fictions in statistics
Significance tests — asking the wrong questions and getting the wrong answers
Scientists have enthusiastically adopted significance testing and hypothesis testing because these methods appear to solve a fundamental problem: how to distinguish “real” effects from randomness or chance. Unfortunately significance testing and hypothesis testing are of limited scientific value – they often ask the wrong question and almost always give the wrong answer. And they are widely misinterpreted.
Consider a clinical trial designed to investigate the effectiveness of new treatment for some disease. After the trial has been conducted the researchers might ask “is the observed effect of treatment real, or could it have arisen merely by chance?” If the calculated p value is less than 0.05 the researchers might claim the trial has demonstrated the treatment was effective. But even before the trial was conducted we could reasonably have expected the treatment was “effective” – almost all drugs have some biochemical action and all surgical interventions have some effects on health. Almost all health interventions have some effect, it’s just that some treatments have effects that are large enough to be useful and others have effects that are trivial and unimportant.
So what’s the point in showing empirically that the null hypothesis is not true? Researchers who conduct clinical trials need to determine if the effect of treatment is big enough to make the intervention worthwhile, not whether the treatment has any effect at all.
A more technical issue is that p tells us the probability of observing the data given that the null hypothesis is true. But most scientists think p tells them the probability the null hypothesis is true given their data. The difference might sound subtle but it’s not. It is like the difference between the probability that a prime minister is male and the probability a male is prime minister! …
Significance testing and hypothesis testing are so widely misinterpreted that they impede progress in many areas of science. What can be done to hasten their demise? Senior scientists should ensure that a critical exploration of the methods of statistical inference is part of the training of all research students. Consumers of research should not be satisfied with statements that “X is effective”, or “Y has an effect”, especially when support for such claims is based on the evil p.
Decisions based on statistical significance testing certainly make life easier. But significance testing doesn’t give us the knowledge we want. It only gives an answer to a question we as researchers never ask — what is the probability of getting the result we have got, assuming that there is no difference between two sets of data (e. g. control group – experimental group, sample – population). On answering the question we really are interested in — how probable and reliable is our hypothesis — it remains silent.
Significance tests are not the kind of “severe test” that we are looking for in our search for being able to confirm or disconfirm empirical scientific hypothesis. This is problematic for many reasons, one being that there is a strong tendency to accept the null hypothesis since they can’t be rejected at the standard 5% significance level. In their standard form, significance tests bias against new hypotheses by making it hard to disconfirm the null hypothesis.
And as shown over and over again when it is applied, people have a tendency to read “not disconfirmed” as “probably confirmed.” Standard scientific methodology tells us that when there is only say a 10 % probability that pure sampling error could account for the observed difference between the data and the null hypothesis, it would be more “reasonable” to conclude that we have a case of disconfirmation. Especially if we perform many independent tests of our hypothesis and they all give about the same 10 % result as our reported one, I guess most researchers would count the hypothesis as even more disconfirmed.
Most importantly — we should never forget that the underlying parameters we use when performing significance tests are model constructions — our p-values mean nothing if the model is wrong!