In 2005, epidemiologist John Ioannidis of Stanford University in California suggested that most published findings are false; since then, a string of high-profile replication problems has forced scientists to rethink how they evaluate results.
When UK statistician Ronald Fisher introduced the P value in the 1920s, he intended it simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look. The idea was to run an experiment, then see if the results were consistent with what random chance might produce. Researchers would first set up a 'null hypothesis' that they wanted to disprove. Next, they would assume that this null hypothesis was in fact true and calculate the chances of getting results at least as extreme as what was actually observed. This probability was the P value. The smaller it was, the greater the likelihood that the straw-man null hypothesis was false.
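To see the procedure in action, here is a minimal simulation sketch in Python. The coin-flip scenario and all of its numbers are invented for illustration; they are not from the article.

```python
# A minimal sketch of the procedure described above: assume the null
# hypothesis is true, then ask how often chance alone produces a result
# at least as extreme as the one observed. The scenario (a possibly
# biased coin, 60 heads in 100 flips) is hypothetical.
import random

random.seed(42)

n_flips = 100          # flips per simulated experiment
observed_heads = 60    # the (hypothetical) result actually observed
n_simulations = 20_000

# Null hypothesis: the coin is fair, so P(heads) = 0.5.
extreme = 0
for _ in range(n_simulations):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    # Two-sided test: count results at least as far from 50 as 60 is.
    if abs(heads - 50) >= abs(observed_heads - 50):
        extreme += 1

p_value = extreme / n_simulations
print(f"P value is roughly {p_value:.3f}")   # about 0.06 for this setup
```

A small P value here would say only that 60 heads is surprising if the coin is fair; as the next paragraphs explain, it says nothing by itself about how likely the coin is to be biased.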
For all the P value's apparent precision, Fisher intended it to be just one part of a non-numerical process that blended data and background knowledge to lead to scientific conclusions. But it soon got swept into a movement to make evidence-based decision-making as rigorous and objective as possible. This movement was spearheaded by Fisher's bitter rivals, Polish mathematician Jerzy Neyman and UK statistician Egon Pearson, who introduced an alternative framework for data analysis that included statistical power, false positives, false negatives and many other concepts. They pointedly left out the P value.
While the rivals feuded, other researchers began to write statistics manuals for working scientists. And because many of the authors were non-statisticians without a thorough understanding of either approach, they created a hybrid system that crammed Fisher's easy-to-calculate P value into Neyman and Pearson's reassuringly rigorous rule-based system.
One result is an abundance of confusion about what the P value means. All the P value can do is summarize the data assuming a specific null hypothesis. It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place. The more implausible the hypothesis, the greater the chance that an exciting finding is a false alarm.
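The arithmetic behind this point can be sketched in a few lines. The numbers below (80% statistical power, a 0.05 significance cut-off and three prior probabilities) are illustrative assumptions, not figures from the article.

```python
# Illustration: the chance that a 'significant' result is a false alarm
# depends on how plausible the hypothesis was before the experiment.
# Power and alpha values are assumed for the example.
def false_alarm_probability(prior, power=0.8, alpha=0.05):
    """Fraction of significant results that are false alarms, given the
    prior probability that a real effect exists."""
    true_positives = prior * power          # real effects detected
    false_positives = (1 - prior) * alpha   # null effects flagged anyway
    return false_positives / (true_positives + false_positives)

for prior in (0.5, 0.1, 0.01):
    print(f"prior {prior:4.2f} -> false-alarm chance "
          f"{false_alarm_probability(prior):.0%}")
# A toss-up hypothesis (prior 0.5) gives ~6% false alarms among 'hits';
# a long-shot hypothesis (prior 0.01) gives ~86%.
```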
Critics also bemoan the way that P values can encourage muddled thinking. A prime example is their tendency to deflect attention from the actual size of an effect. “We should be asking, 'How much of an effect is there?', not 'Is there an effect?'”, says Geoff Cumming, an emeritus psychologist at La Trobe University in Melbourne, Australia.
Perhaps the worst fallacy is the kind of self-deception known as P-hacking. “P-hacking,” says psychologist Uri Simonsohn of the University of Pennsylvania, “is trying multiple things until you get the desired result”, even unconsciously. P-hacking is especially likely, he says, in today's environment of studies that chase small effects hidden in noisy data. In an analysis, Simonsohn found evidence that many published psychology papers report P values that cluster suspiciously around 0.05, just as would be expected if researchers fished for significant P values until they found one.
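A short simulation suggests how quickly this inflates false positives. It assumes, purely for illustration, that each analysis choice yields an independent P value that is uniform when there is no real effect; real analyses are usually correlated, so this is a sketch of the mechanism rather than a model of any actual study.

```python
# Illustration of P-hacking: run several null 'analyses', keep the
# smallest P value, and watch the false-positive rate climb past 5%.
import random

random.seed(0)

def hacked_study(n_tries):
    """Try n_tries analyses on pure noise; report the best P value."""
    return min(random.random() for _ in range(n_tries))

n_studies = 100_000
for n_tries in (1, 3, 5, 10):
    hits = sum(hacked_study(n_tries) < 0.05 for _ in range(n_studies))
    print(f"{n_tries:2d} analyses -> {hits / n_studies:.1%} 'significant'")
# One honest try stays near 5%; ten tries pushes past 40%,
# matching 1 - 0.95**10 ~ 0.40.
```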
Despite the criticisms, reform has been slow. “The basic framework of statistics has been virtually unchanged since Fisher, Neyman and Pearson introduced it,” says Steven Goodman, a physician and statistician at Stanford University. John Campbell, a psychologist now at the University of Minnesota in Minneapolis, bemoaned the issue in 1982, when he was editor of the Journal of Applied Psychology: “It is almost impossible to drag authors away from their p-values, and the more zeroes after the decimal point, the harder people cling to them.”
Nuzzo, R. Scientific method: statistical errors. Nature 506, 150–152 (13 February 2014), doi:10.1038/506150a
B. TRUE/FALSE: say whether the following statements are TRUE or FALSE.
a. There have been few instances of replication problems since 2005.
b. Fisher wanted the P value to be a way of assessing informally if a study’s conclusions were scientifically solid or not.
c. The P value was not supposed to be objective.
d. One of the problems with the way the P value is used is that some people mistakenly think that P is a measure of the reality of an effect.
e. A very small P value indicates that a result is real.
f. P values tend to focus people’s attention on statistical significance rather than on the magnitude of an effect.
g. P-hacking is defined as a fraudulent practice in which researchers make their results appear statistically significant by changing things until they are satisfied with the P value.
h. The reason why Simonsohn is suspicious of many psychology papers is that the researchers obtain P values that are close to the threshold for statistical significance.
Choose the correct answer.
1. Which of the following is NOT true about the null hypothesis?
2. Which of the following is true about Jerzy Neyman and Egon Pearson’s system?
C. VOCABULARY: in the text, find equivalents for the following words.
| Synonym/equivalent | Word from the text |
|---|---|
| a. to mix, to combine | |
| b. to lead an organized effort, to take a leading role in an undertaking | |
| c. a set of principles, ideas etc. that you use when you are forming your decisions and judgments | |
| d. to be involved in an angry disagreement that continues for a long time | |
| e. complete, comprehensive | |
| f. to fit people or things into a space that is too small or unsuitable | |
| g. fundamental, basic, essential | |
| h. to complain about something | |
| i. confused, not clear | |
| j. a mistake in an argument or idea that makes it false | |