Psychology results evaporate upon further review

Surprising reports, findings with marginal statistical significance least likely to be reproduced, study concludes

VANISHING EFFECTS An international project finds that statistically significant findings reported in papers published in three major psychology journals often disappear when independent teams redo the original studies.

By Bruce Bower

August 27, 2015 at 2:00 pm

Psychologists have recently bemoaned a trend for provocative and sometimes highly publicized findings that vanish in repeat experiments. A large, collaborative project has now put an unsettling, and contested, number on the extent of that problem.

Only 35 of 97 reports of statistically significant results published in three major psychology journals in 2008 could be replicated, a group led by psychologist Brian Nosek of the University of Virginia in Charlottesville reports in the Aug. 28 Science. Nosek is executive director of the Center for Open Science, which coordinated 270 researchers involved in the replication project.

“There is a lot of room to improve reproducibility in psychology,” Nosek says.

He and his colleagues can’t say whether nonreproduced results represented illusory effects in the original studies that needed debunking or genuine effects that were missed in replications. It’s also possible that unnoticed differences between original and repeat studies led to failed replications.

Replication teams selected suitable studies from Psychological Science, the Journal of Personality and Social Psychology, or the Journal of Experimental Psychology: Learning, Memory and Cognition. Teams repeated the last experiment reported in each article. Across the three journals, 14 of 55 social psychology findings, or 25 percent, were replicated. Among cognitive psychology results, 21 of 42, or 50 percent, were replicated.
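The subfield tallies add up to the headline figure of 35 replications out of 97 attempts. A minimal Python sketch of that arithmetic follows; the counts come from the report as described here, while the names `counts` and `rate` are chosen only for illustration.

```python
# Replication counts as reported above: (replicated, attempted).
counts = {
    "social": (14, 55),
    "cognitive": (21, 42),
}


def rate(replicated, attempted):
    """Return the replication rate as a percentage."""
    return 100.0 * replicated / attempted


total_replicated = sum(r for r, _ in counts.values())
total_attempted = sum(n for _, n in counts.values())

for field, (r, n) in counts.items():
    print(f"{field}: {r}/{n} = {rate(r, n):.0f}%")
# social: 14/55 = 25%
# cognitive: 21/42 = 50%

overall = rate(total_replicated, total_attempted)
print(f"overall: {total_replicated}/{total_attempted} = {overall:.0f}%")
# overall: 35/97 = 36%
```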

Surprising findings and results that barely achieved statistical significance were least likely to be reproduced. That raises concerns about the common practice of publishing attention-grabbing results and studies reporting effects that barely pass statistical muster, Nosek says.

All original and repeat studies employed a statistical method that estimates the likelihood of obtaining the observed results if an apparent experimental effect is a fluke. Successful replications in Nosek’s project had to find that an original result would have been a fluke no more than one out of 20 times. The acceptable calculated likelihood of a fluke finding, or P value, may have to be tightened to one out of 100 times, or stricter still, to deter the publication of results that have marginal statistical support and therefore a good chance of never being replicated, comments psychologist Hal Pashler of the University of California, San Diego.
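What the two cutoffs mean in practice can be sketched with a small simulation. The Python code below is an illustration only, not the analysis used in the replication project: it assumes a simple z test with a known standard deviation, and the study count, sample size and function names are hypothetical. It generates studies in which no real effect exists and counts how many flukes still clear each cutoff.

```python
import math
import random


def two_sided_p(z):
    """Two-sided P value for a z score under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_null_p_values(n_studies, n_per_study, seed=0):
    """Simulate studies in which there is no real effect at all."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_studies):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_study)]
        mean = sum(sample) / n_per_study
        # Assumption: the standard deviation is known to be 1 (z test).
        z = mean * math.sqrt(n_per_study)
        p_values.append(two_sided_p(z))
    return p_values


p_values = simulate_null_p_values(n_studies=20000, n_per_study=20)
frac_05 = sum(p < 0.05 for p in p_values) / len(p_values)
frac_01 = sum(p < 0.01 for p in p_values) / len(p_values)
print(f"flukes passing P < 0.05: {frac_05:.3f}")
print(f"flukes passing P < 0.01: {frac_01:.3f}")
```

Under these assumptions, roughly one in 20 no-effect studies passes the looser cutoff and roughly one in 100 passes the stricter one, which is the trade-off Pashler describes.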

Considering the limitations of trying to repeat experiments in different countries, with different populations and at different times, “these results show the psychology glass as half full,” says Stanford University evolutionary biologist Daniele Fanelli, a past critic of behavioral research practices (SN: 10/5/13, p. 10). Still, the new study confirms P values as “the least informative measure of all,” Fanelli holds. No one knows if the newly replicated studies have uncovered any true psychological phenomena, he says, since low P values indicate only that measured observations in an experiment are unlikely if any apparent relationship is due to chance. That leaves unexplained what, if anything, is actually going on.

Although a study of his was replicated in Nosek’s project, psychologist Klaus Fiedler of the University of Heidelberg in Germany regards the new findings as too flawed to draw conclusions about the state of psychology. Replication attempts failed to account for the tendency of an extreme initial result to fade toward an average result in a second go-round, Fiedler says. That effect happens even when a hypothesis is correct. Among other problems, critical checks on whether experimental manipulations in repeat studies produced the same effects as in original studies were not conducted, Fiedler adds.

Consider a study that finds that an unobtrusive mood, measured as feeling happier on sunny days than on cloudy days, leads people to feel better about their lives. A replication attempt must first establish that volunteers actually feel happier on sunny days, as in the original experiment. This type of check on experimental manipulations was not typically included in the new set of replications, Fiedler says.
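Fiedler’s earlier point, that an extreme initial result tends to fade toward an average result even when a hypothesis is correct, can also be illustrated with a simulation. The Python sketch below is a toy model, not a reanalysis of the project’s data. The effect size, sample size and publication rule are all hypothetical: every study measures the same real effect, but only originals that reach significance in the predicted direction are assumed to be published and then repeated.

```python
import math
import random

TRUE_EFFECT = 0.3   # hypothetical real effect, identical in every study
N_PER_STUDY = 30    # hypothetical sample size
Z_CUTOFF = 1.96     # two-sided one-in-20 cutoff, predicted direction only
STANDARD_ERROR = 1.0 / math.sqrt(N_PER_STUDY)


def observed_effect(rng):
    """One study's estimate: the true effect plus sampling noise."""
    return rng.gauss(TRUE_EFFECT, STANDARD_ERROR)


def is_significant(effect):
    """True if the estimate clears the cutoff in the predicted direction."""
    return effect / STANDARD_ERROR > Z_CUTOFF


rng = random.Random(1)
originals, replications = [], []
for _ in range(20000):
    original = observed_effect(rng)
    if not is_significant(original):
        continue  # assumption: nonsignificant originals go unpublished
    originals.append(original)
    replications.append(observed_effect(rng))  # independent repeat

mean_original = sum(originals) / len(originals)
mean_replication = sum(replications) / len(replications)
replicated = sum(is_significant(e) for e in replications) / len(replications)

print(f"mean published original effect: {mean_original:.2f}")
print(f"mean replication effect:        {mean_replication:.2f}")
print(f"share of replications significant: {replicated:.2f}")
```

In this toy model the published originals overstate the effect, the repeats land near the true value, and many repeats miss significance even though the effect is real by construction.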