Statistics and the replication crisis

If probabilism were just a mistaken philosophical theory, it wouldn’t matter. Philosophy has a million silly theories.1 Most are harmless, because no one takes them seriously.2

Probabilism being wrong matters because science and engineering and education and medicine and finance and government matter to everyone’s lives, and statistical methods are widely used in all those fields. When probabilism—misplaced faith in probabilistic methods—leads you to ignore nebulosity, catastrophes result.

I’ll use the science replication crisis as an example, although parallel analyses apply to financial crises and policy disasters—as well as less dramatic everyday business and government failures.

The probabilist’s implicit defense runs: science works; science is based on probability; therefore probability works; therefore any objections to probabilism must be arcane philosophical nit-picking, which we can safely ignore in practice.

But much of science doesn’t work, much of the time. The replication crisis is the realization that, in many sciences, most of what had been believed on the basis of statistical analyses was actually false.3 Fortunately, some scientists have taken this seriously and formed a replication movement, or credibility revolution, to address the problem.4

The deep causes of the crisis are bad incentives: institutions reward activities that lead to false scientific conclusions, and do not reward, or even actively punish, activities that can correct them. However, the substance of the crisis is largely “doing statistics wrong.”

You can “do statistics wrong” at three levels:

  1. Making errors in calculations within a formal system
  2. Misunderstanding what could be concluded within the system if your small-world idealization held
  3. Not realizing you have made a small-world idealization, and taking it as Truth

Statistics is taught as a collection of complex, difficult calculation methods. If you have struggled through a stats course, the natural assumption when you hear “doing statistics wrong” is that scientists have made sloppy mistakes, like misplaced minus signs or putting data in the wrong column. The replication movement has found that these level-one errors are, indeed, far too common.

If that were the whole problem, fixing it by requiring more checking would be straightforward.5 Unfortunately, the other two levels, and the necessary fixes, are more subtle.

Second-level mistakes are misunderstandings of what statistical methods can do. What scientists want is a mathematically guaranteed general solution to the problem of induction. That would let you gain knowledge through a mindless mechanical procedure, without necessarily understanding the domain. You could feed a hypothesis and a bunch of measurements into a black box, and it would tell you how much you should believe the hypothesis. It would be objective, avoiding fallible human judgement. Also, conveniently, you wouldn’t have to think. In particular, you wouldn’t have to understand statistics, a boring and difficult field that just gets in the way of doing interesting lab work.6

In fact, no magic box can relieve you of the necessity of figuring out for yourself what (if anything) your data are telling you. But for half a century, many scientists assumed there was one, which is a main reason so much science is wrong.

The famous example is the P<0.05 criterion in null hypothesis significance testing. This has been the main statistical tool in many sciences for many decades. Software packages make it easy to compute; they do not make it understandable or meaningful. Scientists generally treat the analysis as an inscrutable oracle that tells you whether you should believe a theory—and whether or not you can publish.

What you would like P<0.05 to mean is that your theory has less than a 5% chance of being false; you can have a 0.95 confidence that it is correct. Unfortunately, it does not mean that. The P value doesn’t tell you anything about how confident you should be. In common cases your P<0.05 theory is more likely false than true.7
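The arithmetic behind this claim (the false positive risk discussed in footnote 7) can be checked with a simple Monte Carlo sketch. The numbers below are illustrative assumptions, not facts about any particular field: a 10% prior probability that a tested hypothesis is true, a modest real effect size, and small samples, giving statistical power of roughly 30%.

```python
import numpy as np

# Monte Carlo sketch of the false positive risk for P < 0.05.
# All parameters are hypothetical, chosen for illustration only.
rng = np.random.default_rng(0)
n_experiments = 20_000
prior = 0.10   # assumed fraction of tested hypotheses that are actually true
effect = 0.5   # assumed standardized effect size when the effect is real
n = 16         # assumed observations per group

is_real = rng.random(n_experiments) < prior

# Two-group comparison with known unit variance, so a simple z-test applies.
group_a = rng.normal(0.0, 1.0, (n_experiments, n))
group_b = rng.normal(np.where(is_real, effect, 0.0)[:, None], 1.0,
                     (n_experiments, n))
z = (group_b.mean(axis=1) - group_a.mean(axis=1)) / np.sqrt(2.0 / n)
significant = np.abs(z) > 1.96   # two-sided P < 0.05

# Among the "significant" results, what fraction are false positives?
false_positive_risk = (~is_real[significant]).mean()
print(f"Experiments with P < 0.05: {significant.sum()}")
print(f"Fraction of those that are false positives: {false_positive_risk:.0%}")
```

Under these assumptions, well over half of the results that pass the P<0.05 test are false positives, even though every individual test was computed correctly.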

Few scientists understand P values. A recent paper explains eighteen different wrong ideas about what P<0.05 means, each common in peer-reviewed scientific papers.8 Confusion is understandable, because what the P value does tell you is both quite difficult to understand and something you almost certainly don’t care about. Confusion is not altogether the fault of individual scientists: the explanations of P values in statistics courses are often subtly wrong. It’s also reasonable to assume P values must tell you something useful (or else why would your professor have taught them to you?). It’s therefore reasonable to assume that they tell you the thing that you want and that sounds pretty much like the explanation you’ve read.

It’s tempting to think these misconceptions can and should be fixed with better statistical education. But emphasizing correct understanding of P values, including the fact that they are usually irrelevant to scientific questions, would raise the question of what scientists should do instead.

Those who recognize this often assume that we somehow just chose the wrong black box. If P values are the wrong method, what should we use? If probabilism has the solution to the problem of induction—what is it?

Some reformers have advocated particular alternatives: confidence intervals or Bayes factors, for example. Unfortunately, each of these has its own problems. None of them can, by itself, tell you what you should believe. Any of them—including P values—can be a valuable tool in certain cases.

Why can no method tell you how confident you should be in a belief? One reason is that extraordinary claims require extraordinary evidence. (No amount of data out of a spectrometer should convince you that the moon is made of green cheese.) A numerical estimate of how likely you are to be wrong requires a numerical estimate of how extraordinary your claim is. But often you can’t meaningfully quantify that. Science explores areas where no one knows what’s going on. Good scientists have different hunches, and reasonably disagree about what’s likely.
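The dependence on prior plausibility can be made concrete with the odds form of Bayes’ rule. In this sketch, the likelihood ratio of 20 is an invented stand-in for “a reasonably strong experimental result”; the three priors are likewise invented, standing for a coin-flip hunch, a long-shot hypothesis, and a green-cheese-grade claim.

```python
from fractions import Fraction

def posterior(prior, likelihood_ratio):
    """Posterior probability via the odds form of Bayes' rule:
    posterior odds = prior odds * likelihood ratio."""
    prior_odds = Fraction(prior) / (1 - Fraction(prior))
    post_odds = prior_odds * likelihood_ratio
    return float(post_odds / (1 + post_odds))

evidence_strength = 20  # hypothetical: P(data | claim true) / P(data | claim false)

for prior in ["1/2", "1/10", "1/1000"]:
    print(prior, "->", round(posterior(Fraction(prior), evidence_strength), 3))
```

The same evidence leaves the even-odds hypothesis near certainty, the long shot merely probable, and the extraordinary claim still very unlikely. Without a meaningful number for the prior, no calculation of this kind can get started.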

Unfortunately, avoiding first and second level errors does not mean you will get correct answers about the real world. It only guarantees that your answers are correct about your formal small-world idealization.

Good statisticians understand the third level error: confusing formal inference with real-world truth. Conversations like this are increasingly common:

Statisticians: Your statistical reasoning is wrong. Your science is broken.

Scientists: Oh. Well, stats is confusing and not our field, so we just run the program we were taught to use in the intro course, and publish when it says it’s OK. So, what statistical test should we run instead?

Statisticians: There is no “correct” statistical test!9

Scientists: Well, tell us what arcane ritual you want us to perform to keep publishing our stuff. And can you put that in a user-friendly program, please?

Statisticians: No, you have to actually do science if you want to figure out what is going on.10

Scientists: “Do science”? What do you mean? Look, we’re scientists. For a shot at tenure, we have to get several papers published every year. How are we supposed to get our data OK’d for that now?

Your statistics package can’t do your thinking for you.11 There can’t be a general theory of induction, uncertainty, epistemology, or the rest. Too bad!

In poorly-understood domains, science requires a meta-rational approach to induction: in this situation, what method will give a meaningful answer? Can we apply statistics here at all? Why or why not? If we can, what specific method would give a meaningful answer, and what do we need to do to assure that it does?

Analogously, in recent economic crises caused by misuse of statistics, the problem was not merely that financial economists were doing statistics wrong at levels one and two (although they were). It’s that they were using statistics at all in a domain where it often doesn’t apply, because usualness conditions hold only temporarily; and they failed to monitor the adequacy of their idealization. In the run-up to the 2008 crisis, many financial models assumed that American housing prices could never go down country-wide, because they never had. That usualness condition was catastrophically violated.
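A toy illustration of a violated usualness condition: a model that estimates event probabilities purely from past frequencies assigns probability zero to anything that has never happened, and is then infinitely surprised when it does. The price history below is invented for illustration.

```python
import math

# Hypothetical history in which nationwide prices never declined.
history = ["up", "up", "flat", "up", "up", "flat", "up"]

def empirical_prob(event, data):
    """Naive frequency estimate: unseen events get probability zero."""
    return data.count(event) / len(data)

p_down = empirical_prob("down", history)
print(p_down)  # 0.0: the model says a country-wide decline is impossible

# Surprisal -log(p) is infinite for a zero-probability event: when "down"
# finally happens, the idealization doesn't degrade gracefully; it breaks.
surprisal = -math.log(p_down) if p_down > 0 else math.inf
print(surprisal)  # inf
```

No amount of level-one care in computing the frequencies helps here; the failure is in the idealization that the observed regime would persist.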

The real-world applicability of a statistical approach is nebulous, because the real world is nebulous. Choosing a statistical method and building a statistical model always involves meta-rational judgements, based on a preliminary understanding of how the idealization relates to reality. There are no right or wrong answers (so long as you stay within the formal domain of applicability). Rather, choices can be more and less predictive, productive, or meaningful.12

There’s no substitute for obstinate curiosity, for actually figuring out what is going on; and no fixed method for that. Science can’t be reduced to any fixed method, nor evaluated by any fixed criterion. It uses methods and criteria; it is not defined or limited by them.13

Part Five of The Eggplant explains the emerging meta-rational approach to statistical practice.

  • 1. Peter Unger, “Why There Are No People,” Midwest Studies in Philosophy 4 (1979), pp. 177-222.
  • 2. Peter Unger, “I Do Not Exist,” in G.F. Macdonald (eds) Perception and Identity, 1979.
  • 3. John Ioannidis’ “Why Most Published Research Findings Are False” (PLOS Medicine 30 August 2005) lit the fuse on the replication movement. Since then, many large studies, in several different sciences, have verified that a majority of what had been believed from statistical results was false. For example, when repeating 53 “landmark” studies in cancer research, a team at Amgen was able to replicate the positive results of only six (11%). (C. Glenn Begley and Lee M. Ellis, “Raise standards for preclinical cancer research,” Nature, 28 March 2012.) A team led by Brian Nosek repeated a hundred psychology experiments published in prestigious journals, and found a statistically significant result in less than half. (“Estimating the reproducibility of psychological science,” Science, 28 August 2015, 349:6251.) A large-scale automated statistical reanalysis of papers in cognitive neuroscience, by Szucs and Ioannidis, found that more than half are likely to be false (even if experimenters did everything else right). (“Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature,” PLOS Biology, March 2, 2017.)
  • 4. The “replication crisis” is also called the “reproducibility” or “repeatability” or “credibility” crisis.
  • 5. Measures to fix scientific sloppiness are reasonably straightforward, but not necessarily easy, due to institutional inertia.
  • 6. Statistician Steven Goodman writes that “many scientists want only enough knowledge to run the statistical software that allows them to get their papers out quickly, and looking like all the others in their field.” In “Five ways to fix statistics,” Nature 28 November 2017.
  • 7. The theory-confirmation false positive risk (FPR) “depends strongly on the plausibility of the hypothesis before an experiment is done—the prior probability of there being a real effect. If this prior probability were low, say 10%, then a P value close to 0.05 would carry an FPR of 76%. To lower that risk to 5% (which is what many people still believe P < 0.05 means), the P value would need to be 0.00045.” David Colquhoun in “Five ways to fix statistics,” Nature 28 November 2017. Even if the prior probability is 50%, a P value just under 0.05 gives false positives 26% of the time (Colquhoun, “An investigation of the false discovery rate and the misinterpretation of p-values,” Royal Society Open Science, 19 November 2014). These rates are actually best-case scenarios that assume everything else has gone as right as possible. In several large-scale replication efforts, the false positive rate was found to be greater than 50%.
  • 8. Greenland, S., Senn, S.J., Rothman, K.J. et al., “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations,” European Journal of Epidemiology (2016) 31: 337.
  • 9. For a technical explanation, see David Colquhoun’s “The reproducibility of research and the misinterpretation of p-values,” Royal Society Open Science, 6 December 2017. “Rigorous induction is impossible,” he says, because in science you can’t make meaningful estimates of prior probabilities, which dramatically affect the expected false positive rate. (As we saw earlier, this is one of the many problems fatal for probabilism.) For a popular summary, see his “The problem with p-values,” Aeon, 11th October 2016.
  • 10. The official statement of the American Statistical Association: “Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.” Ronald L. Wasserstein & Nicole A. Lazar (2016), “The ASA’s Statement on p-Values: Context, Process, and Purpose,” The American Statistician, 70:2, 129-133.
  • 11. McShane et al. explain why no statistical test can replace a mechanistic understanding of the domain and data generating process, in “Abandon Statistical Significance,” forthcoming in The American Statistician.
  • 12. See Gigerenzer and Marewski’s “Surrogate Science: The Idol of a Universal Method for Scientific Inference,” Journal of Management 41:2, 2015, pp. 421–440.
  • 13. As we will see later, science is done “by any means necessary.” See also my “Upgrade your cargo cult for the win.”

Navigation

This page is in the section Part One: Taking rationalism seriously, which is in In the Cells of the Eggplant.

The next page in this section is Acting on the truth.

The previous page is The probability of green cheese.
