Tuesday, 26 March 2019

The Anatidae Principle

If it looks like a duck, and quacks like a duck, we have at least to consider the possibility that we have a small aquatic bird of the family Anatidae on our hands.
- Douglas Adams
I like to teach my students how they can apply what I call the Anatidae Principle (or the Principle of the Duck) to their data analysis. (The name is obviously inspired by the above quote from Douglas Adams's Dirk Gently's Holistic Detective Agency.)

For the purpose of data analysis, the Anatidae Principle simply boils down to the following: if it looks like you found a relation, difference, or effect in your sample, you should at least consider the possibility that there indeed is a relation, difference, or effect. That is, look at your data, summarize, make figures, and think (hard) about what your data potentially mean for the answer to your research question, hypotheses, hunches, whatever you like. Do this before you start calculating p-values, confidence intervals, Bayes factors, posterior distributions, and so on.
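
As a minimal sketch of what this looks like in practice, the Python snippet below summarizes and plots the data before any test is run. The group labels and scores are invented purely for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# hypothetical scores for two groups; the numbers are made up for illustration
df = pd.DataFrame({
    "group": ["A"] * 40 + ["B"] * 40,
    "score": np.concatenate([rng.normal(5.3, 1.0, 40),
                             rng.normal(4.8, 1.0, 40)]),
})

# describe before you test: group means, SDs, quartiles
print(df.groupby("group")["score"].describe())

# and look at the data: a simple boxplot per group
df.boxplot(column="score", by="group")
plt.ylabel("score")
plt.show()
```

Only after this descriptive step do p-values, confidence intervals, or Bayes factors enter the picture, if they enter it at all.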

In my experience, researchers violate the Anatidae Principle all too often: they calculate a p-value, and if it is not significant they simply ignore their sample results. Never mind that, as they predicted, group A outperforms group B: if the difference is not significant, they will claim they found no effect, and, worse still, believe it.

Kline (2013, p. 117) gives solid advice:

"Null hypothesis rejections do not imply substantive significance, so researchers need other frames of reference to explain to their audiences why the results are interesting or important. A start is to learn to describe your results without mention of statistical significance at all. In its place, refer to descriptive statistics and effect sizes and explain why those effect sizes matter in a particular context. Doing so may seem odd at first, but you should understand that statistical tests are not generally necessary to detect meaningful or noteworthy effects, which should be obvious to visual inspection of relatively simple kinds of graphical displays (Cohen, 1994). The description of results at a level closer to the data may also help researchers to develop better communication skills."

Friday, 21 April 2017

What is NHST, anyway?

I am not a fan of NHST (Null Hypothesis Significance Testing). Or maybe I should say, I am no longer a fan. I used to believe that rejecting null hypotheses of zero differences based on the p-value was the proper way of gathering evidence for my substantive hypotheses. And the evidential nature of the p-value seemed so obvious to me that I frequently got angry when encountering what I believed were incorrect p-values, reasoning that if the p-value is incorrect, so must be the evidence in support of the substantive hypothesis.

For this reason, I refused to use the significance tests that were most frequently used in my field, i.e. performing a by-subjects analysis and a by-item analysis and concluding that an effect exists if both are significant, because the by-subjects analysis in particular regularly leads to p-values that are too low, which in turn leads to believing you have evidence when you really don't. And so, coming from almost no statistical background (I had followed no more than a few introductory statistics courses), I spent a huge amount of time mastering mixed model ANOVA and hierarchical linear modelling (up to a reasonable degree, i.e. being able to get p-values for several experimental designs), because these techniques, so I believed, gave me correct p-values. Now, this all seems rather silly to me.
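
For readers unfamiliar with that procedure, the sketch below shows roughly what the by-subjects (often called F1) and by-items (F2) analyses amount to. The file name, column names, and condition labels are assumptions made up for illustration, not an actual dataset.

```python
import pandas as pd
from scipy import stats

# hypothetical long-format data: one row per subject x item observation,
# with columns subject, item, condition, rt (the file is an assumption)
df = pd.read_csv("reaction_times.csv")

# by-subjects (F1): average over items within each subject x condition cell
f1 = df.groupby(["subject", "condition"])["rt"].mean().unstack()
t1, p1 = stats.ttest_rel(f1["related"], f1["unrelated"])

# by-items (F2): average over subjects within each item x condition cell
f2 = df.groupby(["item", "condition"])["rt"].mean().unstack()
t2, p2 = stats.ttest_rel(f2["related"], f2["unrelated"])

print(f"F1 (by subjects): t = {t1:.2f}, p = {p1:.3f}")
print(f"F2 (by items):    t = {t2:.2f}, p = {p2:.3f}")
```

A mixed or hierarchical model, by contrast, can handle subjects and items in a single analysis instead of two separate aggregated tests.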

I still have some NHST unlearning to do. For example, I frequently catch myself looking at a 95% confidence interval to see whether zero is inside or outside the interval, and actually feeling happy when zero lies outside it (which is when the result is statistically significant). Apparently, traces of NHST are strongly embedded in my thinking. I still have to tell myself not to be silly, so to speak.
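
The habit is easy to demonstrate, which is perhaps why it is so hard to shake. In the sketch below (with invented difference scores), the 95% interval excludes zero exactly when the two-sided p-value drops below .05, so checking the interval for zero is the significance test in disguise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
diff = rng.normal(loc=0.4, scale=1.0, size=30)  # invented difference scores

# two-sided one-sample t-test against zero
t_stat, p_value = stats.ttest_1samp(diff, popmean=0.0)

# 95% confidence interval for the mean, built from the same t distribution
m, se = diff.mean(), stats.sem(diff)
crit = stats.t.ppf(0.975, df=len(diff) - 1)
lower, upper = m - crit * se, m + crit * se

print(f"p = {p_value:.3f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
# zero lies outside the interval exactly when p < .05 (two-sided)
```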

One reason for writing this blog is to sharpen my thinking about NHST and to try to figure out new and comprehensible ways of explaining to students and researchers why they should be very careful in considering NHST as the sine qua non of research. Of course, if you really want to make your reasoning clear, one of the first things you should do is define the concepts you're reasoning about. The purpose of this post is therefore to make clear what my "definition" of NHST is.

My view of NHST is very much based on how Gigerenzer et al. (1989) describe it:

"Fisher's theory of significance testing, which was historically first, was merged with concepts from the Neyman-Pearson theory and taught as "statistics" per se. We call this compromise the "hybrid theory" of statistical inference, and it goes without saying the neither Fisher nor Neyman and Pearson would have looked with favor on this offspring of their forced marriage." (p. 123, italics in original). 

Actually, Fisher's significance testing and Neyman-Pearson's hypothesis testing are fundamentally incompatible (I will come back to this later), but almost no texts explaining statistics to psychologists "presented Neyman and Pearson's theory as an alternative to Fisher's, still less as a competing theory. The great mass of texts tried to fuse the controversial ideas into some hybrid statistical theory, as described in section 3.4. Of course, this meant doing the impossible." (p. 219, italics in original). 

So, NHST is an impossible, as in logically incoherent, "statistical theory", because it (con)fuses concepts from incompatible statistical theories. If this is true, which I think it is, doing science with a small s, which involves logical thinking, disqualifies NHST as a main means of statistical inference. But let me write a little bit more about Fisher's ideas and those of Neyman and Pearson, to explain the illogic of NHST. 

Friday, 20 January 2017

Scientific with a small s

My inspiration for this blog's motto comes from Ziliak & McCloskey (2004). They quote from Bob Solow's Nobel Prize acceptance speech, after which they write:

"Solow recommends we "try very hard to be scientific with a small s"; but the authors we have surveyed in the AER [American Economic Review, GM], by contrast, are trying to be scientific with a small t." (p. 544).

Their "small t" refers to the t statistic on the basis of which researchers determine the p-values they use to assess the statistical significance of their findings. A small p (smaller than .05) is usually taken to mean that the test result is statistically significant.

There are a lot of reasons to believe that null-hypothesis significance testing (NHST) is basically unscientific. These reasons have convinced me that you cannot do science with a small p (significance testing). I hope that after reading the blog posts yet to come, you will be convinced as well. (If you can't wait: Kline (2013) (see below) is a good place to start.)

What does it mean to be scientific with a small s? To Solow (as cited in Ziliak & McCloskey, 2004) it simply means thinking logically and respecting the facts. To my mind, thinking logically as a prerequisite of being scientific (with a small s) includes thinking logically about the results of statistical analyses. For instance, you should not mistakenly believe that a small p-value means that it is unlikely that a result is due to chance, nor that the long-term behavior of a decision procedure tells you anything about the evidence in your actual data (the facts).

Ziliak & McCloskey (2004) write about economic research, but significance testing is of course not limited to economics. Kline (2013, pp. 118-119) concludes in his chapter about cognitive distortions in significance testing (and he is putting it mildly):

"Significance testing has been like a collective Rorschach inkblot test for the behavioral sciences: What we see in it has more to do with wish fulfillment than reality. This magical thinking has impeded the development of psychology and other disciplines as cumulative sciences. [...] the gap between what is required for significance tests to be accurate and characteristics of real world studies is just too great."

So, this blog is about being scientific with a small s, with a main focus on the logic and illogic of NHST, because you simply cannot do science with only a small p.

References
Kline, R. B. (2013). Beyond significance testing: Statistics reform in the behavioral sciences (2nd ed.). Washington, DC: APA.
Ziliak, S. T., & McCloskey, D. N. (2004). Size matters: The standard error of regressions in the American Economic Review. Journal of Socio-Economics, 33, 527-547.