To make science better, watch out for statistical flaws

First of two parts

As Winston Churchill once said about democracy, it’s the worst form of government, except for all the others. Science is like that. As commonly practiced today, science is a terrible way of gathering knowledge about nature, especially in messy realms like medicine. But it would be very unwise to vote science out of office, because all the other methods are so much worse.

Still, science has room for improvement, as its many critics are constantly pointing out. Some of those critics are, of course, lunatics who simply prefer not to believe solid scientific evidence if they dislike its implications. But many critics of science have the goal of making the scientific enterprise better, stronger and more reliable. They are justified in pointing out that scientific methodology — in particular, statistical techniques for testing hypotheses — has more flaws than Facebook’s privacy policies. One especially damning analysis, published in 2005, claimed to have proved that more than half of published scientific conclusions were actually false.

A few months ago, though, some defenders of the scientific faith produced a new study claiming otherwise. Their survey of five major medical journals indicated a false discovery rate among published papers of only 14 percent. “Our analysis suggests that the medical literature remains a reliable record of scientific progress,” Leah Jager of the U.S. Naval Academy and Jeffrey Leek of Johns Hopkins University wrote in the journal Biostatistics.

Their finding is based on an examination of P values, the probability of getting a result at least as extreme as the one observed if there is no real effect (an assumption called the null hypothesis). By convention, if the results you get (or more extreme results) would occur less than 5 percent of the time by chance (P value less than .05), then your finding is “statistically significant.” You can then reject the assumption that there was no effect, conclude you have found a true effect and get your paper published.
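
For readers who want to see that convention in action, here is a minimal simulation, not taken from the Jager and Leek paper: when there is truly no effect, roughly 5 percent of studies still cross the P < .05 bar by chance. The group sizes and the use of a t test are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_studies = 10_000   # simulated studies in which the null hypothesis is true
n_per_group = 30     # subjects per group in each study (an arbitrary choice)

false_positives = 0
for _ in range(n_studies):
    # Two groups drawn from the SAME distribution: no real effect exists.
    a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    p = stats.ttest_ind(a, b).pvalue
    if p < 0.05:
        false_positives += 1

# With no real effect anywhere, about 5 percent of studies still come out
# "statistically significant" purely by chance.
print(false_positives / n_studies)   # ~0.05
```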

As Jager and Leek acknowledge, though, this method has well-documented flaws. “There are serious problems with interpreting individual P values as evidence for the truth of the null hypothesis,” they wrote.

For one thing, a 5 percent significance level isn’t a very stringent test. At that rate you could expect one wrong result for every 20 studies of nonexistent effects, and with thousands of scientific studies going on, that adds up to a lot. But it’s even worse than that. If there actually is no real effect in most experiments, far more than 5 percent of your positive findings will be wrong. Suppose you test 100 drugs for a given disease, when only one actually works. Using a P value threshold of .05, those 100 tests could give you six positive results — the one correct drug and five flukes. More than 80 percent of your supposed discoveries would be false.
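
The arithmetic behind that 100-drug example is easy to check. The sketch below assumes, as the example implicitly does, that the test always catches the one drug that truly works (that is, perfect statistical power).

```python
# Back-of-the-envelope version of the 100-drug example above.
# Assumption (implicit in the example): the one truly effective drug
# is always detected, i.e. statistical power is 100 percent.

n_drugs = 100
n_truly_effective = 1
alpha = 0.05                                              # P < .05 threshold

true_positives = n_truly_effective                        # 1 real discovery
false_positives = alpha * (n_drugs - n_truly_effective)   # ~5 flukes

total_positives = true_positives + false_positives        # ~6 "discoveries"
false_discovery_rate = false_positives / total_positives

print(round(total_positives, 1))        # ~6.0 positive results
print(round(false_discovery_rate, 2))   # ~0.83, i.e. more than 80 percent false
```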

But while a P value in any given paper may be unreliable, analyzing aggregates of P values for thousands of papers can give a fair assessment of how many conclusions of significance are likely to be bogus, Jager and Leek contend. “There are well established and statistically sound methods for estimating the rate of false discoveries among an aggregated set of tested hypotheses using P values.”

It’s a sophisticated methodology. It takes into account the fact that some studies report very strong statistical significance, with P values much smaller than .05. So the 1-in-20 fluke argument doesn’t necessarily apply. Yes, a P value of .05 means there’s a 1-in-20 chance that your results (or even more extreme results) would show up even if there were no effect. But that doesn’t mean 1 in 20 (or 5 percent) of all studies are wrong, because many studies report P values well below the .05 significance level.

So Jager and Leek recorded actual P values reported in more than 5,000 medical papers published from 2000 to 2010 in journals such as the Lancet, the Journal of the American Medical Association and the New England Journal of Medicine. An algorithm developed to calculate the false discovery rate found a range of 11 percent to 19 percent for the various journals.     
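
Jager and Leek’s actual algorithm models the significant P values reported in abstracts as a mixture of true and false discoveries. The sketch below is not that algorithm; it is a simpler, Storey-style illustration of the same general idea, estimating a false discovery rate from an aggregated collection of P values, run on simulated studies rather than real abstracts. The number of tests, the share of real effects, the effect size and the cutoff lambda are all illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated literature: most tested hypotheses are null, some are real effects.
# These proportions and effect sizes are made up purely for illustration.
n_tests = 20_000
prop_real = 0.2
is_real = rng.random(n_tests) < prop_real

# One-sided z-tests: null studies have zero effect, real ones a modest effect.
effect = np.where(is_real, 3.0, 0.0)
z = rng.normal(loc=effect, scale=1.0)
p = stats.norm.sf(z)              # one-sided P values

# Storey-style estimate of the fraction of null hypotheses: P values above a
# cutoff lambda come almost entirely from nulls, which are uniform on (0, 1).
lam = 0.5
pi0_hat = np.mean(p > lam) / (1 - lam)

# Estimated false discovery rate among the "significant" (P < .05) results.
alpha = 0.05
significant = p < alpha
fdr_hat = pi0_hat * alpha * n_tests / max(significant.sum(), 1)

# Ground truth in the simulation, for comparison.
fdr_true = np.mean(~is_real[significant])

print(round(fdr_hat, 3), round(fdr_true, 3))   # the two should land close together
```

In this toy setup the aggregate estimate lands close to the simulated truth; real abstracts, with selective reporting and rounded P values, are exactly where the critics quoted below say the trouble starts.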

“Our results suggest that while there is an inflation of false discovery results above the nominal 5 percent level … the relatively minor inflation in error rates does not merit the claim that most published research is false,” Jager and Leek concluded.

Well, maybe.

But John Ioannidis, author of the 2005 study claiming most results are wrong, was not impressed. In fact, he considers Jager and Leek’s paper to fall into the “false results” category. “Their approach is flawed in sampling, calculations and conclusions,” Ioannidis wrote in a commentary also appearing in Biostatistics.

For one thing, Jager and Leek selected only five very highly regarded journals, a small sample, not randomly selected from the thousands of medical journals published these days. And out of more than 77,000 papers published over the study period, the automated procedure for identifying P values in the abstracts found only 5,322 usable for the study’s purposes. More than half of those papers reported randomized controlled trials or were systematic reviews — the types of papers least likely to be in error. Those types account for less than 5 percent of all published papers. Furthermore, recording only those P values given in abstracts further compounds the sampling bias, as abstracts are typically selective in reporting only the most dramatic results from a study.

Of course, Ioannidis is not exactly an unbiased observer, as it was his paper the new study was attempting to refute. Some other commentators were not quite as harsh. But they nevertheless identified shortcomings. Steven Goodman of Stanford University pointed out some of the same weaknesses that Ioannidis cited.

“Jager and Leek’s attempt to bring a torrent of empirical data and rigorous statistical analyses to bear on this important question is a major step forward,” Goodman wrote in Biostatistics. “Its weaknesses are less important than its positive contributions.” Still, Goodman suggested that the true rate of false positives is higher than Jager and Leek found, while less than what Ioannidis claimed.

Two other statisticians, also commenting in Biostatistics, reached similar conclusions. Problems with the Jager and Leek study could push the false discovery rate from 14 percent to 30 percent or higher, wrote Yoav Benjamini and Yotam Hechtlinger of Tel Aviv University.

Even one slight adjustment to Jager and Leek’s analysis (including “less than or equal to .05” instead of just “equal to .05”) raised the false discovery rate from 14 percent to 20 percent, Benjamini and Hechtlinger pointed out. Other factors, such as those identified by Ioannidis and Goodman, would drive the false discovery rate even higher, perhaps as high as 50 percent. So maybe Ioannidis was right, after all.

Of course, that’s not really the point. Whether more or less than half of all medical studies are wrong is not exactly the key issue here. It’s not a presidential election. What matters here is the fact that medical science is so unsure of its facts. Knowing that a lot of studies are wrong is not very comforting, especially when you don’t know which ones are the wrong ones.

“We think that the study of Jager and Leek is enough to point at the serious problem we face,” Benjamini and Hechtlinger note. “Even though most findings may be true, whether the science-wise false discovery rate is at the more realistic 30 percent or higher, or even at the optimistic 20 percent, it is certainly too high.”

But there’s another issue, too. As Goodman notes, claiming that more than half of medical research is false can produce “an unwarranted degree of skepticism, hopefully not cynicism, about truth claims in medical science.” If people stop trusting medical science, they may turn to even worse sources of knowledge, with serious consequences (such as children not getting proper vaccinations).

Part of the resolution of this conundrum is the realization that individual studies do not establish medical knowledge. Replication of results, convergence of conclusions from different kinds of studies, real-world experience in clinics, judgments by knowledgeable practitioners aware of all the relevant considerations, and various other elements of evidence all accrue to create a sound body of knowledge for medical practice. It’s just not as sound as it needs to be. Criticizing the flaws in current scientific practice, and developing methods to address and correct those flaws, is an important enterprise that shouldn’t be dismissed on the basis of any one study, especially when it’s based on P values.

Follow me on Twitter: @tom_siegfried

Tom Siegfried is a contributing correspondent. He was editor in chief of Science News from 2007 to 2012 and managing editor from 2014 to 2017.