Statistics and the replication crisis

If probabilism were just a mistaken philosophical theory, it wouldn’t matter. Philosophy has a million silly theories.1 Most are harmless, because no one takes them seriously.2

Probabilism being wrong matters because science and engineering and education and medicine and finance and government matter to everyone’s lives, and statistical methods are widely used in all those fields. When probabilism—misplaced faith in probabilistic methods—leads you to ignore nebulosity, catastrophes result.

I’ll use the science replication crisis as an example, although parallel analyses apply to financial crises and policy disasters—as well as less dramatic everyday business and government failures.

The implicit argument goes: science works; it’s based on probability; therefore probability works; therefore any objections to probabilism must be arcane philosophical nit-picking, which we can ignore in practice.

But much of the time, science doesn’t work. The replication crisis consists of the realization that, in many sciences, most of what had been believed on the basis of statistical analyses was actually false.3 Fortunately, some scientists have taken this seriously and formed a replication movement, or credibility revolution, to address the problem.4

The deep causes of the crisis are bad incentives: institutions reward activities that lead to false scientific conclusions, and do not reward—or even actively punish—activities that can correct them. However, the substance of the crisis is largely “doing statistics wrong.”

You can “do statistics wrong” at three levels:

  1. Making errors in calculations within a formal system
  2. Misunderstanding what could be concluded within the system if your small-world idealization held
  3. Not realizing you have made a small-world idealization, and taking it as Truth.

Technical errors are not the issue

Statistics is taught as a collection of complex, difficult calculation methods. If you have struggled through a stats course, then when you hear “doing statistics wrong,” the natural assumption is that scientists made sloppy mistakes: misplaced minus signs, data entered in the wrong column. The replication movement has found that these level-one errors are, indeed, far too common.

If that were the whole problem, fixing it by requiring more checking would be straightforward.5 Unfortunately, the other two levels, and the necessary fixes, are more subtle.

No solution to the problem of induction

Second-level mistakes are misunderstandings of what statistical methods can do. What scientists want is a mathematically guaranteed general solution to the problem of induction. That would let you gain knowledge through a mindless mechanical procedure, without necessarily understanding the domain. You could feed a hypothesis and a bunch of measurements into a black box, and it would tell you how much you should believe the hypothesis. It would be objective, avoiding fallible human judgement. Also, conveniently, you wouldn’t have to think. Especially, you wouldn’t have to understand statistics, a boring and difficult field that just gets in the way of doing interesting lab work.6

In fact, no magic box can relieve you of the necessity of figuring out for yourself what (if anything) your data are telling you. But for half a century, many scientists assumed there was one, which is a main reason so much science is wrong.

The famous example is the P<0.05 criterion in null hypothesis significance testing. This has been the main statistical tool in many sciences for many decades. Software packages make it easy to compute; they do not make it understandable or meaningful. Scientists generally treat the analysis as an inscrutable oracle that tells you whether you should believe a theory—and whether or not you can publish.

What you would like P<0.05 to mean is that your theory has less than a 5% chance of being false: that you can be 95% confident it is correct. Unfortunately, it does not mean that. The P value doesn’t tell you anything about how confident you should be. In common cases, your P<0.05 theory is more likely false than true.7
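
To see concretely why, here is a rough simulation with made-up numbers rather than anything from the cited studies: suppose only 10% of the hypotheses a field tests are true, and experiments have about 80% power (two-group comparisons with sixteen subjects per group and a one-standard-deviation effect). Even in that fairly generous scenario, a large fraction of p < 0.05 “discoveries” are false, and among results just barely under 0.05, most are.

```python
# A minimal sketch, assuming a 10% prior, n = 16 per group, and a
# one-standard-deviation true effect (all illustrative choices, not the
# author's). It estimates how often "significant" results are false.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_experiments = 100_000
prior_true = 0.10          # assumed fraction of tested hypotheses that are true
n_per_group = 16
effect_size = 1.0          # standardized effect; roughly 80% power at this n

# Decide which experiments test a real effect, then simulate their data.
real = rng.random(n_experiments) < prior_true
control = rng.normal(0.0, 1.0, size=(n_experiments, n_per_group))
treated = rng.normal(0.0, 1.0, size=(n_experiments, n_per_group))
treated[real] += effect_size

# One two-sample t-test per experiment (one per row).
_, p = stats.ttest_ind(treated, control, axis=1)

significant = p < 0.05
marginal = (p > 0.04) & (p < 0.05)     # "just barely significant"

print(f"False among all p < 0.05 results:    {np.mean(~real[significant]):.0%}")
print(f"False among 0.04 < p < 0.05 results: {np.mean(~real[marginal]):.0%}")
# With these assumptions, roughly a third of all "significant" findings are
# false, and roughly three quarters of the barely-significant ones -- in line
# with the Colquhoun figures quoted in footnote 7.
```

The exact percentages are artifacts of the assumed prior and power; the point is that nothing in the P value itself can tell you what they are.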

Few scientists understand P values. A recent paper explains eighteen different wrong ideas about what P<0.05 means, each common in peer-reviewed scientific papers.8 The confusion is understandable, because what the P value does tell you is both quite difficult to understand and something you almost certainly don’t care about. Nor is it altogether the fault of individual scientists: the explanations of P values in statistics courses are often subtly wrong. And it’s reasonable to assume that P values must tell you something useful (or else why would your professor have taught them to you?), and therefore that they tell you the thing you want, which sounds pretty much like the explanation you’ve read.

It’s tempting to think these misconceptions can and should be fixed with better statistical education. But emphasizing correct understanding of P values, including that they are usually irrelevant to scientific questions, would raise the question of what scientists should do instead.

Those who recognize this often assume that we somehow just chose the wrong black box. If P values are the wrong method, what should we use? If probabilism has the solution to the problem of induction—what is it?

Some reformers have advocated particular alternatives: confidence intervals or Bayes factors, for example. Unfortunately, each of these has its own problems. None of them can, by itself, tell you what you should believe. Any of them—including P values—can be a valuable tool in certain cases.

Why can no method tell you how confident you should be in a belief? One reason is that extraordinary claims require extraordinary evidence. (No amount of data out of a spectrometer should convince you that the moon is made of green cheese.) A numerical estimate of how likely you are to be wrong requires a numerical estimate of how extraordinary your claim is. But often you can’t meaningfully quantify that. Science explores areas where no one knows what’s going on. Good scientists have different hunches, and reasonably disagree about what’s likely.
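
The dependence on the prior is easy to see in Bayes’ rule written in odds form: posterior odds equal the Bayes factor times the prior odds. The toy calculation below uses a made-up Bayes factor of 20 (fairly strong evidence) and invented priors. The same evidence leaves a mundane claim probably true and a green-cheese claim almost certainly false, and in genuinely new scientific territory there is no defensible number to plug in for the prior.

```python
# A toy calculation with invented numbers: Bayes' rule in odds form,
#   posterior odds = Bayes factor * prior odds.
bayes_factor = 20.0                     # assumed strength of the evidence

for prior_prob in (0.5, 0.1, 1e-6):     # from "plausible" to "green cheese"
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = bayes_factor * prior_odds
    posterior_prob = posterior_odds / (1.0 + posterior_odds)
    print(f"prior P = {prior_prob:g}  ->  posterior P = {posterior_prob:.2g}")

# Even evidence 20 times more likely under the claim than under its denial
# moves a one-in-a-million prior only to about a 2-in-100,000 posterior.
```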

Statistics cannot do your thinking for you

Unfortunately, avoiding first and second level errors does not mean you will get correct answers about the real world. It only guarantees that your answers are correct about your formal small-world idealization.

Good statisticians understand the third level error: confusing formal inference with real-world truth. Conversations like this are increasingly common:

Statisticians: Your statistical reasoning is wrong. Your science is broken.

Scientists: Oh. Well, stats is confusing and not our field, so we just run the program we were taught to use in the intro course, and publish when it says it’s OK. So, what statistical test should we run instead?

Statisticians: There is no “correct” statistical test!9

Scientists: Well, tell us what arcane ritual you want us to perform to keep publishing our stuff. And can you put that in a user-friendly program, please?

Statisticians: No, you have to actually do science if you want to figure out what is going on.10

Scientists: “Do science”? What do you mean? Look, we’re scientists. For a shot at tenure, we have to get several papers published every year. How are we supposed to get our data OK’d for that now?

Your statistics package can’t do your thinking for you.11 There can’t be a general theory of induction, uncertainty, epistemology, or the rest. Too bad!

Meta-rational statistical practice

In poorly-understood domains, science requires a meta-rational approach to induction: in this situation, what method will give a meaningful answer? Can we apply statistics here at all? Why or why not? If we can, what specific method would give a meaningful answer, and what do we need to do to ensure that it does?

Analogously, in recent economic crises caused by misuse of statistics, the problem was not that financial economists were using statistics wrong at levels one and two (although that was also true). It was that they were using statistics at all in a domain where it often doesn’t apply, because usualness conditions hold only temporarily; and that they failed to monitor the adequacy of their idealization. In the run-up to the 2008 crisis, many financial models assumed that American housing prices could never go down country-wide—because they never had. That usualness condition was catastrophically violated.

The real-world applicability of a statistical approach is nebulous, because the real world is nebulous. Choosing a statistical method and building a statistical model always involves meta-rational judgements, based on a preliminary understanding of how the idealization relates to reality. There are no right or wrong answers (so long as you stay within the formal domain of applicability). Rather, choices can be more and less predictive, productive, or meaningful.12

There’s no substitute for obstinate curiosity, for actually figuring out what is going on; and no fixed method for that. Science can’t be reduced to any fixed method, nor evaluated by any fixed criterion. It uses methods and criteria; it is not defined or limited by them.13

Part Five of The Eggplant explains the emerging meta-rational approach to statistical practice.

  1. Peter Unger, “Why There Are No People,” Midwest Studies in Philosophy 4 (1979), pp. 177-222.
  2. Peter Unger, “I Do Not Exist,” in G.F. Macdonald (ed.), Perception and Identity, 1979.
  3. John Ioannidis’ “Why Most Published Research Findings Are False” (PLOS Medicine, 30 August 2005) lit the fuse on the replication movement. Since then, many large studies, in several different sciences, have verified that a majority of what had been believed from statistical results was false. For example, when repeating 53 “landmark” studies in cancer research, a team at Amgen was able to replicate the positive results of only six (11%). (C. Glenn Begley and Lee M. Ellis, “Raise standards for preclinical cancer research,” Nature, 28 March 2012.) A team led by Brian Nosek repeated a hundred psychology experiments published in prestigious journals, and found a statistically significant result in less than half. (“Estimating the reproducibility of psychological science,” Science, 28 August 2015, 349:6251.) A large-scale automated statistical reanalysis of papers in cognitive neuroscience, by Szucs and Ioannidis, found that more than half are likely to be false (even if experimenters did everything else right). (“Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature,” PLOS Biology, 2 March 2017.)
  4. The “replication crisis” is also called the “reproducibility,” “repeatability,” or “credibility” crisis.
  5. Measures to fix scientific sloppiness are reasonably straightforward, but not necessarily easy, due to institutional inertia.
  6. Statistician Steven Goodman writes that “many scientists want only enough knowledge to run the statistical software that allows them to get their papers out quickly, and looking like all the others in their field.” In “Five ways to fix statistics,” Nature, 28 November 2017.
  7. The theory-confirmation false positive risk (FPR) “depends strongly on the plausibility of the hypothesis before an experiment is done—the prior probability of there being a real effect. If this prior probability were low, say 10%, then a P value close to 0.05 would carry an FPR of 76%. To lower that risk to 5% (which is what many people still believe P < 0.05 means), the P value would need to be 0.00045.” David Colquhoun in “Five ways to fix statistics,” Nature, 28 November 2017. Even if the prior probability is 50%, a P value just under 0.05 gives false positives 26% of the time (Colquhoun, “An investigation of the false discovery rate and the misinterpretation of p-values,” Royal Society Open Science, 19 November 2014). These rates are actually best-case scenarios that assume everything else has gone as right as possible. In several large-scale replication efforts, the false positive rate was found to be greater than 50%.
  8. Greenland, S., Senn, S.J., Rothman, K.J., et al., “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations,” European Journal of Epidemiology 31 (2016): 337.
  9. For a technical explanation, see David Colquhoun’s “The reproducibility of research and the misinterpretation of p-values,” Royal Society Open Science, 6 December 2017. “Rigorous induction is impossible,” he says, because in science you can’t make meaningful estimates of prior probabilities, which dramatically affect the expected false positive rate. (As we saw earlier, this is one of the many problems fatal for probabilism.) For a popular summary, see his “The problem with p-values,” Aeon, 11 October 2016.
  10. The official statement of the American Statistical Association: “Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.” Ronald L. Wasserstein & Nicole A. Lazar (2016), “The ASA’s Statement on p-Values: Context, Process, and Purpose,” The American Statistician, 70:2, 129-133.
  11. McShane et al. explain why no statistical test can replace a mechanistic understanding of the domain and data generating process, in “Abandon Statistical Significance,” forthcoming in The American Statistician.
  12. See Gigerenzer and Marewski’s “Surrogate Science: The Idol of a Universal Method for Scientific Inference,” Journal of Management 41:2, 2015, pp. 421–440.
  13. As we will see later, science is done “by any means necessary.” See also my “Upgrade your cargo cult for the win.”