There’s something suspicious about using statistics to test statistics

Second of two parts (read part 1)

There’s something slightly oxymoronic about the phrase “false positive.” When you’re positive, you’re supposed to be sure that what you’re saying is true. As when you have the support of solid scientific evidence. Sadly, though, scientific evidence is frequently not so solid.

In fact, as one analysis has suggested, more than half of the papers published in scientific journals report false conclusions. Some of those erroneous results can be attributed to bias, incompetence and even outright fraud. But mostly the errors arise from properly conducted science using standard statistical methods to test hypotheses. Such methods are guaranteed to sometimes suggest the existence of a true (or “positive”) effect that doesn’t actually exist — hence the label “false positive.”

Supposedly (but erroneously), you can estimate how likely it is that you’ve got a false positive by using the methods of statistics to calculate a P value. P stands for the probability of seeing your results (or more extreme results) if there were no actual effect. (This no-effect assumption is called the null hypothesis.) In other words, sometimes the results look like there’s a real effect just by chance — a fluke.
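To make that definition concrete, here is a minimal sketch (not drawn from any of the studies discussed here, and using made-up numbers) of one way a P value can be computed: pretend the null hypothesis is true, shuffle the data accordingly, and count how often chance alone produces a difference at least as big as the one observed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed data: two groups of 30 measurements each.
group_a = rng.normal(loc=0.0, scale=1.0, size=30)
group_b = rng.normal(loc=0.4, scale=1.0, size=30)
observed_diff = abs(group_b.mean() - group_a.mean())

# Permutation test: if the null hypothesis were true (no real effect),
# the group labels would be arbitrary, so shuffle them many times and
# see how often chance alone gives a difference at least this large.
pooled = np.concatenate([group_a, group_b])
count = 0
n_shuffles = 10_000
for _ in range(n_shuffles):
    rng.shuffle(pooled)
    diff = abs(pooled[:30].mean() - pooled[30:].mean())
    if diff >= observed_diff:
        count += 1

p_value = count / n_shuffles
print(f"P value ~ {p_value:.3f}")
```

The permutation approach shown here is just one of many ways to get a P value; the point is only that the whole calculation takes the no-effect assumption as its starting point.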

Conventional wisdom (another oxymoron) says that you should consider a result “statistically significant” if the chance of a fluke is less than 5 percent (corresponding to a P value of less than .05). You will then conclude that the null hypothesis is unlikely to be true and that there is a real effect. Supposedly you’d be wrong only 5 percent of the time. And many scientists believe that. But it’s baloney.

For one thing, a P value calculation is based on the assumption that there is no real effect. If there is a real effect, your calculation is null and void. And wrong. All sorts of other assumptions are built into P value calculations; some of those assumptions may be true in textbooks, but are unlikely to be accurate elsewhere (for example, in laboratories and hospitals).

In real life, conclusions are wrong far more than 5 percent of the time. No knowledgeable expert contends the false positive rate is actually as low as 5 percent, at least in fields like medical science.
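A quick back-of-the-envelope sketch shows why the 5 percent figure misleads. The specific numbers below (10 percent of tested hypotheses being real effects, studies with 50 percent power) are illustrative assumptions, not figures from any of the papers discussed here.

```python
# Back-of-the-envelope false discovery arithmetic (illustrative numbers).
prior_true = 0.10   # assumed fraction of tested hypotheses that are real effects
alpha = 0.05        # significance threshold: false positive rate when null is true
power = 0.50        # assumed chance of detecting a real effect when there is one

# Out of 1,000 hypotheses tested:
n = 1000
true_effects = n * prior_true            # 100 real effects
null_effects = n - true_effects          # 900 nulls

true_positives = true_effects * power    # 50 real effects detected
false_positives = null_effects * alpha   # 45 flukes that pass P < .05

false_discovery_rate = false_positives / (false_positives + true_positives)
print(f"Share of 'significant' findings that are wrong: {false_discovery_rate:.0%}")
```

Under those assumed numbers, nearly half of the “significant” findings are flukes, even though every single test respected the .05 threshold.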

Last year, for instance, Leah Jager of the U.S. Naval Academy and Jeffrey Leek of Johns Hopkins University made a small splash in the Twitter pool with a paper reporting a false positive rate in medical journals of 14 percent (ranging from 11 percent to 19 percent depending on the journal). Jager and Leek thought that wasn’t so bad, concluding that the medical research literature is therefore pretty darn reliable.

To the credit of the journal Biostatistics, where the paper was published, several commentary papers appeared in the same issue, most of them basically saying that the Jager and Leek result was bogus. It relied on an analysis of P values, using the assumptions used to calculate P values, to validate the reliability of medical studies based on P values. As statistician Andrew Gelman of Columbia University pointed out in a blog post, “they’re … basically assuming the model that is being questioned.”

A commentary in Biostatistics coauthored by Gelman and Keith O’Rourke congratulated Jager and Leek for their “boldness in advancing their approach” but concluded that “what Jager and Leek are trying to do is hopeless.”

“We admire the authors’ energy and creativity, but we find their claims unbelievable … and based on unreasonable assumptions.”

Jager and Leek’s method, Gelman noted, is based on an approach used in genetics to estimate false positive rates. When testing thousands of genes at once (to see, say, how their activity is related to a disease), many supposedly significant differences in gene activity will be false positives. But there are mathematical ways of calculating what the false positive rate will be. That approach makes some sense in genetics, Gelman observes, because all the P values are collected at one time, in one study by one set of methods. So assumptions about how the values of false positive P values will be distributed may not be too far off.
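To illustrate the kind of calculation Gelman is describing (this is a generic sketch of the genomics trick, not Jager and Leek’s actual method), note that P values from true nulls spread out uniformly between 0 and 1, while P values from real effects pile up near zero. Counting how many P values land above 0.5 therefore gives a rough estimate of the fraction of nulls, and from that a false discovery rate at any cutoff. All the numbers below are simulated assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Simulate one big genomics-style screen: 10,000 tests run the same way.
n_tests = 10_000
frac_null = 0.8                  # assumed: 80 percent of genes have no real effect
n_null = int(n_tests * frac_null)
n_real = n_tests - n_null

# Null P values are uniform on [0, 1]; P values for real effects pile up near 0
# (simulated here as one-sided tests of effects about 3 standard errors in size).
p_null = rng.uniform(0, 1, size=n_null)
p_real = 1 - norm.cdf(rng.normal(loc=3.0, scale=1.0, size=n_real))
p_all = np.concatenate([p_null, p_real])

# Storey-style estimate: true-null P values land above 0.5 about half the time,
# so doubling the share of P values above 0.5 estimates the null fraction.
est_frac_null = 2 * np.mean(p_all > 0.5)

# From that, estimate the false discovery rate among results with P < .05.
cutoff = 0.05
n_significant = np.sum(p_all < cutoff)
est_fdr = est_frac_null * n_tests * cutoff / n_significant
print(f"Estimated null fraction: {est_frac_null:.2f} (true value: {frac_null})")
print(f"Estimated false discovery rate at P < .05: {est_fdr:.2f}")
```

The estimate works tolerably well here precisely because every simulated P value comes from the same kind of test, run the same way, which is the condition Gelman says the medical literature does not meet.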

But in analyzing the medical literature (and in Jager and Leek’s case, only the abstracts from papers in only five journals), the P values have emerged from many different kinds of experiments, and have been processed and filtered in so many ways, that there is no way that the underlying assumptions about their statistical properties could hold up.

“We just do not think it is possible to analyze a collection of published P values and, from that alone, infer anything interesting about the distribution of true effects,” Gelman and O’Rourke wrote in Biostatistics. “The approach is just too driven by assumptions that are not even close to plausible.”

In his blog, Gelman suggested that Jager and Leek’s analysis illustrates problems with the whole business of statistical null hypothesis testing and the concept of false positives. Very rarely, if ever, is there really absolutely no effect — that is, exactly zero difference between two groups being tested with anything. But standard statistical analysis admits only two types of error: a false positive, concluding there is an effect when there isn’t one, and a false negative, concluding there is no effect when there really is one. In real life, Gelman points out, there is also a “Type S” error — concluding an effect is positive when it’s actually negative — and a “Type M” error — concluding an effect is large when it is really small. Or vice versa.
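A small simulation (my own illustration, with made-up numbers) shows how Type S and Type M errors arise when a true effect is small relative to the noise: the estimates that happen to clear the significance bar exaggerate the effect on average, and a noticeable share of them point in the wrong direction.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assume a small true effect measured with a lot of noise (a low-power study).
true_effect = 0.2
standard_error = 1.0
n_studies = 100_000

estimates = rng.normal(loc=true_effect, scale=standard_error, size=n_studies)

# Keep only the "statistically significant" studies (|estimate| > 1.96 SE).
significant = estimates[np.abs(estimates) > 1.96 * standard_error]

type_s = np.mean(significant < 0)                       # wrong sign
type_m = np.mean(np.abs(significant)) / true_effect     # exaggeration factor

print(f"Studies reaching significance: {len(significant) / n_studies:.1%}")
print(f"Type S rate (wrong sign): {type_s:.1%}")
print(f"Type M (average exaggeration): {type_m:.1f}x the true effect")
```

With these made-up numbers, the estimates that reach significance exaggerate the true effect more than tenfold on average, and nearly 30 percent of them have the wrong sign, mistakes the false-positive framework never counts.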

If you analyze the medical literature solely through the prism of false positives or negatives, you miss a lot of erroneous results. “I see this whole false-positive, true-positive framework as a dead end,” Gelman wrote.

In other words, the whole question about P values and false positives may be completely misposed. Although it has been embedded in the scientific publication process for decades, the validity of P value tests of null hypotheses rests on a shaky foundation. While many scientists believe a small P value shows that the data are incompatible with the null hypothesis, among statisticians that view “has not gained acceptance and in fact now seems untenable in generality,” the esteemed statistician David Cox of the University of Oxford wrote in a Biostatistics commentary.

And so perhaps the P value method of assessing false positives is not oxymoronic, but merely moronic.

Follow me on Twitter: @tom_siegfried

Tom Siegfried is a contributing correspondent. He was editor in chief of Science News from 2007 to 2012 and managing editor from 2014 to 2017.