Scientists love statistical significance. It offers a way to test hypotheses. It’s a ticket to publishing, to media coverage, to tenure.
It’s also a crock — statistically speaking, anyway.
You know the idea. When scientists perform an experiment and their data suggest an important result — say, that watching TV causes influenza — there’s always the nagging concern that the finding was a fluke. Maybe some of the college sophomores selected for the study had been recently exposed to the flu via some other medium. By dividing the students into two groups at random, though — one to watch TV and the other not — scientists try to make such preexposure equally likely in either group. Of course, there’s still a chance that the luck of the draw put more flu-prone people in the TV group. Tests of statistical significance offer a way to calculate just how likely such a fluke should be.
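For readers who want to see the arithmetic behind that "how likely" question, here is a minimal sketch in Python. The counts are entirely made up for the hypothetical TV-and-flu experiment; the point is only to show where a p-value comes from.

```python
# A minimal sketch of a significance test, using invented counts
# for the hypothetical TV-watching experiment described above.
from scipy.stats import fisher_exact

# Hypothetical data: each list is (got the flu, stayed healthy).
tv_group = [18, 82]   # 18 of 100 TV watchers caught the flu
control  = [9, 91]    #  9 of 100 non-watchers caught the flu

# Fisher's exact test asks: if TV really had no effect, how likely is a
# split at least this lopsided, purely from the luck of the draw?
odds_ratio, p_value = fisher_exact([tv_group, control])

print(f"p-value: {p_value:.3f}")
# A small p-value (conventionally below 0.05) is what gets labeled
# "statistically significant" -- it does not, by itself, prove causation.
```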
Even when such tests are performed correctly, it’s a challenge to draw sensible conclusions. And analyzing statistical data presents many opportunities for making logical errors.
One such analytical error is often committed by researchers comparing multiple groups. Suppose, for purposes of illustration, that you tested two strategies for improving memory: drinking orange juice and injecting a drug into the brain. People who drank the orange juice showed slight memory improvement, but not quite enough to be statistically significant. People taking the drug showed better memory improvement, just enough to be statistically significant.
For many scientists, the knee-jerk conclusion is that the drug works but orange juice doesn’t. But for those who prefer thinking with brain rather than knee, further review is required. In fact, when one test shows statistical significance and the other doesn’t, the difference between the two tests may not itself be statistically significant. Two groups may fall just barely on opposite sides of the imaginary statistical significance threshold. Concluding that one strategy works but the other doesn’t is statistically stupid.
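A short numerical sketch makes the trap concrete. The effect sizes and standard errors below are invented for the orange-juice-versus-drug story, and the calculation assumes, for simplicity, normally distributed estimates:

```python
# A numerical sketch of the "difference in significance" trap, using
# invented effect estimates for the orange-juice and drug groups above.
import math
from scipy.stats import norm

def p_value(effect, se):
    """Two-sided p-value for an effect estimate with a given standard error."""
    z = effect / se
    return 2 * norm.sf(abs(z))

# Hypothetical results: improvement in memory-test score vs. control.
oj_effect, oj_se = 2.0, 1.2       # orange juice: p ~ 0.10, "not significant"
drug_effect, drug_se = 2.5, 1.2   # drug:         p ~ 0.04, "significant"

print(f"orange juice vs. control: p = {p_value(oj_effect, oj_se):.2f}")
print(f"drug vs. control:         p = {p_value(drug_effect, drug_se):.2f}")

# The comparison that actually matters: drug vs. orange juice.
diff = drug_effect - oj_effect
diff_se = math.sqrt(oj_se**2 + drug_se**2)  # standard errors add in quadrature
print(f"drug vs. orange juice:    p = {p_value(diff, diff_se):.2f}")
# One p-value lands just under 0.05 and the other just over, yet the
# direct comparison between the two strategies is nowhere near significant.
```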
Surely, you would surmise, sophisticated scientific researchers would realize the error of such thinking and never perform an analysis in that way. And of course, some are well aware of the issue. In fact, statisticians Andrew Gelman of Columbia University and Hal Stern of the University of California, Irvine explicitly pointed out this mistake in a 2006 paper titled “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant.”
“As teachers of statistics, we might think that ‘everybody knows’ that comparing significance levels is inappropriate, but we have seen this mistake all the time in practice,” Gelman and Stern wrote in the American Statistician. But apparently a lot of scientists don’t read that journal. A report by three Dutch psychologists in a recent issue of Nature Neuroscience points out that this shaky statistical reasoning remains fairly common.
“It is interesting that this statistical error occurs so often, even in journals of the highest standard,” write Sander Nieuwenhuis of Leiden University and Birte U. Forstmann and Eric-Jan Wagenmakers of the University of Amsterdam.
Nieuwenhuis, Forstmann and Wagenmakers culled prestigious journals (Science, Nature, Nature Neuroscience, Neuron and the Journal of Neuroscience) for neuroscience studies that provided researchers an opportunity to make this error. Of 157 such papers, 79 — more than half — chose the incorrect approach.
“Our impression was that this error of comparing significance levels is widespread in the neuroscience literature, but until now there were no aggregate data to support this impression,” the psychologists wrote. In many cases, they note, a proper calculation might still have upheld the main conclusion. But not in all of them.
Further analyses suggested that the significance-comparison error is even more common in other sorts of neuroscience papers, such as those involving molecular mechanisms.
Complaints about such statistical sloppiness are not mere quibbles over a minor technical point, but rather illuminate a deep flaw at science’s core. Improper comparisons afflict the statistical reasoning scientists routinely use to draw conclusions that affect all sorts of things, ranging from new drug approvals and what treatments insurance companies will pay for to what research projects get funded in the first place. Nor is this particular problem the only statistical weak link in the scientific method’s chain of reasoning.
In fact, this comparison error illustrates how meaningless a finding of “statistical significance” can sometimes be: If adding a second group reveals that a significance finding is suspect, you can’t be sure that a test of any one group against a control reveals anything. In other words, the scientific enterprise has a serious problem. And its name is statistics.