Taking a lice-grade comb to press coverage of Hillary Clinton during the 2016 presidential campaign can feel a little like relitigating, but in light of recent news about President Donald Trump, consider this article: “It Really Doesn’t Matter if Hillary Clinton Is Dishonest.” Published in the Washington Post just before the Iowa caucuses, it was one of many stories that took as stipulated the idea that voters saw Clinton as untrustworthy.
In hindsight, the press had the wrong candidate’s honesty under the heat lamps. This WaPo story, though, goes even further, suggesting that perhaps presidents don’t need to be super-honest. Honesty might be an obstacle to effectiveness, a couple of experts tell the writer. One of them, a psychologist named David Rand, then at Yale, hearkens to his own team’s research showing that people see emotional, impulsive people as inherently more honest.
And what’s funny about that—not funny like “ha-ha” but more funny like “sob, oh God, another round here please”—is that the Rand study, an important piece of the last decade’s understanding of social science, seems to not be … right? No, that’s not accurate. What is accurate is that its results did not replicate. Along with a half-dozen other major social science papers reworked in a study publishing today in the journal Nature Human Behaviour, that study apparently fails a key test of scientific validity, which is the following: If you do it again, you’re supposed to get the same results.
That doesn’t mean those papers were wrong. Except it kind of does. That tension is at the core of what researchers sometimes call the “reproducibility crisis,” the revelation that wide swaths of published science are not meeting a basic standard of the scientific method. Other researchers, using the same methods, should get the same results. They often don’t. The problem is particularly pernicious in the social sciences (psychology, economics, sociology), but even the so-called hard sciences, like biology and medicine, have had reproducibility problems.
The new Nature Human Behaviour paper comes from a group out of the Center for Open Science, which has been at the forefront of exposing and dealing with the problem. They looked at 21 papers published in the premier journals Nature and Science between 2010 and 2015. To check the results of the original papers, the new teams, five of them at universities around the world, tested much larger groups of people and ran several kinds of statistical analyses. The original authors gave feedback on the protocols and provided the data, software, and coding they had used. It was a massive effort.
“If we’re going to study reproducibility, we need that investment,” says Brian Nosek, head of the Center for Open Science and a psychologist at the University of Virginia. The question wasn’t just whether the original claims were replicable. It was whether would-be replicators could rule out some of the excuses for why they weren’t. “All of that extra work beyond the normal was because those explanations for failure to replicate are boring. We wanted to eliminate as much of that as we can and see, still, is the credibility of the published literature a little bit lower than what we’d expect?”
It was. Of 21 social and behavioral science papers in Science and Nature that met the study criteria between 2010 and 2015, the replicators found that just 13 had a statistically significant effect in the same direction as in the original. And even in those 13, the effect was generally about half as big as the original paper showed. The other eight papers showed, essentially, zero effect.
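For readers who want the tally spelled out, here is the arithmetic behind those numbers as a minimal Python sketch. The two counts come from the paragraph above; the variable names are only illustrative.

```python
# Counts reported in the article; variable names are illustrative.
total_studies = 21   # papers that met the study criteria
replicated = 13      # significant effect in the same direction as the original

failed = total_studies - replicated
replication_rate = replicated / total_studies

print(f"{replicated} of {total_studies} replicated ({replication_rate:.0%}); {failed} did not")
# prints "13 of 21 replicated (62%); 8 did not"
```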
That’s nothing to shrug at. Nature and Science are major journals; articles in both not only further scientific careers, but also, through emails to journalists in advance of publication, help dictate science coverage in the popular media. (Yes, I get those emails, and yes, this Nosek paper was in one.) Research propagates. Flashy, interesting research gets embedded in popular culture, sometimes regardless of whether it can be reproduced.
Thanks to Google Scholar and a scoring system called Altmetrics, it’s possible to get a sense of the outward ripples of any published scientific article. The honesty study I mentioned has been cited more than 800 times in books, journals, and other sources, including by its own authors. News outlets like Scientific American and Slate did stories referring to it. It got a lot of play, conceivably even having an effect on the 2016 presidential election.
Now, look, just because the paper didn’t replicate doesn’t actually mean its conclusions were false. Experiments fail to replicate for lots of reasons. In comments to Nosek’s group, David Rand, one of the original study’s authors, suggested that the problem might be a methodological one. Both the original study and the replication recruited subjects via Amazon’s Mechanical Turk system, but today, eight years later, Turk-ers have been the subjects of so many behavioral economics studies that they know the drill and aren’t as easily primed or studied. (Rand also pointed out that he was an author on three studies in the Nosek paper, and two of them replicated.)
For all the work Nosek’s group did, some questions about reproducibility still boil down to resource constraints and methodological slap-fights between scientists. Rand makes a good point about Mechanical Turk—and time. “The heterogeneity of social life and the variability of people across space and time make it harder for us to get the same result when we do the same thing,” says Matt Salganik, a computational social scientist at Princeton who has been involved in replicability research, but wasn’t involved with this new work. “That doesn’t mean that the original result never happened, or that the follow-up result never happened.”
One of Salganik’s big papers, a 2006 look at how social media functions, revolved around the construction of a website on which subjects could download music. As he says, how would you replicate that today? Would you build a 2006-era website? Would you use the same songs, or contemporary ones? Who even downloads music anymore? “There are a lot of these decisions that are not obvious,” Salganik adds.
In other cases, though, they are. One of the studies that didn’t replicate, “Analytic Thinking Promotes Religious Disbelief,” from 2012 asserted that the more analytical a person was, the less likely they were to believe in God. To test this idea, researchers showed 26 Canadian undergraduates a picture of Auguste Rodin’s sculpture The Thinker (analytical) and 31 Canadian undergraduates a picture of Myron’s sculpture Discobolus (neutral). Thus primed, the undergraduates rated their belief in God; the ones who saw The Thinker said they were less godly. The paper has been cited more than 360 times in books and journal articles, and 12 news outlets mentioned it, including a Mother Jones story called “Why Obamacare Could Produce More Atheists.”
So, yeah … no. Will Gervais, a psychologist at the University of Kentucky, was one of the original paper’s authors, and participated in a teleconference for the press about the new reproducibility paper. “Our study was, in retrospect, outright silly. It was a really tiny sample size and barely [statistically] significant,” Gervais says. “I like to think that it wouldn’t get published today.”
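To see why a tiny sample is a problem, consider a rough simulation of statistical power, meaning how often a study of a given size would detect a real effect. This is a sketch under stated assumptions, not the analysis either team ran: the group sizes of 26 and 31 come from the study described above, but the true effect size of 0.3 standard deviations, the normally distributed ratings, and the cutoff of 2.0 on Welch's t statistic are all hypothetical choices made for illustration.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def sample_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    standard_error = (sample_variance(a) / len(a) + sample_variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / standard_error


def estimated_power(n_a, n_b, effect_size, trials=2000, seed=0):
    """Fraction of simulated studies that clear a rough significance cutoff.

    Assumption: ratings are normal with unit variance, and the two groups
    differ by `effect_size` standard deviations (a hypothetical value).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(effect_size, 1.0) for _ in range(n_a)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_b)]
        if abs(welch_t(a, b)) > 2.0:  # approximate two-sided 5% threshold
            hits += 1
    return hits / trials


# Group sizes from the article, then the same design with ten times the subjects.
small_power = estimated_power(26, 31, effect_size=0.3)
large_power = estimated_power(260, 310, effect_size=0.3)
print(f"n=26/31: power ~{small_power:.2f}; n=260/310: power ~{large_power:.2f}")
```

Under these assumptions, the small design misses a real effect most of the time, while the larger one usually finds it. A small study that does reach significance is therefore more likely to be reporting a fluke or an inflated effect, which is consistent with Gervais’ description of the result as barely significant.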
That goes to the heart of large-scale replication studies like this one. They aren’t about science-shaming, or calling the field to action. Thousands of researchers now preregister their methodology and hypothesis before publication, to head off concerns that they’ll massage data after the fact. Journals commonly require researchers to submit their entire datasets and analytical code. Even Nature and Science have changed their rules since the Nosek paper’s 2010-15 time frame. “The underlying motivation is a genuine one. They are in it to get it right, not to be right, even though the culture incentivizes sexy findings,” UVA’s Nosek says. “The competing values of transparency, of rigor, of showing all your work, those are still deeply held in the community. So the change is coming with people who are willing to confront the cultural incentives and practice in new ways.”
Large-scale reproducibility efforts on every paper from three centuries of scientific journals would be prohibitively expensive. But one of the backstopping efforts in the Nosek paper does point to a creative way forward. In addition to rerunning the experiments, the group also asked a separate set of 400 researchers to form a “prediction market,” trading tokens and betting on which of the 21 studies would or wouldn’t reproduce. Their guesses lined up with the results almost perfectly.
No one really knows how prediction markets make their decisions, and the so-called wisdom of a crowd can be biased by all sorts of pernicious stuff. Still, though, “maybe we don’t need all of this effort on a whole host of different studies. Maybe we can take seriously what the community says is likely to be true,” Nosek says. So, like, before the National Science Foundation drops tens of millions of dollars on a new research effort, a market could form on the basic science, and if the result is skeptical, a small-scale replication study could get out ahead of the large-scale initiative. “You save a lot of money or you go into that investment with a lot more confidence.”
The solution to the reproducibility crisis isn’t necessarily more reproducibility studies. It’s better training, better statistics, and better institutional practices that’ll stop these kinds of problems in research before they ever make it to the pages of a journal—or even a place like WIRED.