Statistical inference through estimation: recommendations from the International Society of Physiotherapy Journal Editors

Abstract

Null hypothesis statistical tests are often conducted in healthcare research [[1]], including in the physiotherapy field [[2]]. Despite their widespread use, null hypothesis statistical tests have important limitations. This co-published editorial explains statistical inference using null hypothesis statistical tests and the problems inherent to this approach; examines an alternative approach for statistical inference (known as estimation); and encourages readers of physiotherapy research to become familiar with estimation methods and how the results are interpreted. It also advises researchers that some physiotherapy journals that are members of the International Society of Physiotherapy Journal Editors (ISPJE) will be expecting manuscripts to use estimation methods instead of null hypothesis statistical tests.

What is statistical inference?

Statistical inference is the process of making inferences about populations using data from samples [[1]]. Imagine, for example, that some researchers want to investigate something (perhaps the effect of an intervention, the prevalence of a comorbidity or the usefulness of a prognostic model) in people after stroke. It is unfeasible for the researchers to test all stroke survivors in the world; instead, the researchers can only recruit a sample of stroke survivors and conduct their study with that sample. Typically, such a sample makes up a minuscule fraction of the population, so the result from the sample is likely to differ from the result in the population [[3]]. Researchers must therefore use their statistical analysis of the data from the sample to infer what the result is likely to be in the population.

What are null hypothesis statistical tests?

Traditionally, statistical inference has relied on null hypothesis statistical tests. Such tests involve positing a null hypothesis (e.g., that there is no effect of an intervention on an outcome, that there is no effect of exposure on risk or that there is no relationship between two variables). Such tests also involve calculating a P-value, which quantifies the probability (if the study were to be repeated many times) of observing an effect or relationship at least as large as the one that was observed in the study sample, if the null hypothesis is true. Note that the null hypothesis refers to the population, not the study sample.
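
To make the definition of a P-value concrete, the following is a minimal sketch in Python of one way to approximate it by simulation: a permutation test. It is an illustration only, not a method prescribed by this editorial. The function name, the number of permutations and the outcome scores are all hypothetical. The idea is that, if the null hypothesis is true, the group labels are interchangeable, so repeatedly shuffling them imitates repeating the study when there is no effect.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Approximate a two-sided P-value for a difference in means.

    Hypothetical helper for illustration: shuffling the pooled data
    mimics repeats of the study in which the null hypothesis is true.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    pooled = list(group_a) + list(group_b)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            at_least_as_large += 1

    # Count the observed arrangement itself so the estimate is never zero.
    return (at_least_as_large + 1) / (n_permutations + 1)


# Invented outcome scores for two small groups (not real data).
intervention = [12.1, 14.3, 11.8, 15.0, 13.6, 14.9, 12.7, 13.9]
control = [11.0, 12.2, 10.9, 12.8, 11.5, 13.1, 10.4, 12.0]
print(permutation_p_value(intervention, control))
```

The returned value estimates the proportion of imagined repeats, under the null hypothesis, in which the difference between groups is at least as large as the one observed. It says nothing directly about how large the effect is in the population.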

Because the reasoning behind these tests is linked to imagined repetition of the study, they are said to be conducted within a ‘frequentist’ framework. In this framework, the focus is on how much a statistical result (e.g., a mean difference, a proportion or a correlation) would vary among the repeats of the study. If the data obtained from the study sample indicate that the result is likely to be similar among the imagined repeats of the study, this is interpreted as an indication that the result is in some way more credible.
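
The imagined repetition of a study can be mimicked by simulation. The sketch below, in Python, assumes a hypothetical population in which an outcome is normally distributed with a mean of 50 and a standard deviation of 10; these values, the sample sizes and the function name are invented for illustration. It draws many samples and records how much the sample mean varies among the repeats.

```python
import random
from statistics import mean, stdev


def simulate_repeated_studies(pop_mean=50.0, pop_sd=10.0,
                              sample_size=30, n_repeats=1000, seed=1):
    """Return the sample mean from each of many imagined repeats of a study.

    Hypothetical helper: every repeat draws a fresh sample from the same
    assumed normal population.
    """
    rng = random.Random(seed)
    sample_means = []
    for _ in range(n_repeats):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(sample_size)]
        sample_means.append(mean(sample))
    return sample_means


small_studies = simulate_repeated_studies(sample_size=10)
large_studies = simulate_repeated_studies(sample_size=200)

# The spread of results among the repeats shrinks as the sample size grows.
print(f"Spread of sample means, n = 10:  {stdev(small_studies):.2f}")
print(f"Spread of sample means, n = 200: {stdev(large_studies):.2f}")
```

In practice a study is run only once, so the variation among repeats has to be estimated from the single sample that was collected; the simulation simply makes visible what that estimate is trying to describe.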

One type of null hypothesis statistical test is significance testing, developed by Fisher [[4], [5], [6]]. In significance testing, if a result at least as large as the result observed in the study would be unlikely to occur in the imagined repeats of the study if the null hypothesis is true (as reflected by P < 0.05), then this is interpreted as evidence that the null hypothesis is false. Another type of null hypothesis statistical test is hypothesis testing, developed by Neyman and Pearson [[4], [5], [6]]. Here, two hypotheses are posited: the null hypothesis (i.e., that there is no difference in the population) and the alternative hypothesis (i.e., that there is a difference in the population). The P-value tells the researchers which hypothesis to accept: if P ≥ 0.05, retain the null hypothesis; if P < 0.05, reject the null hypothesis and accept the alternative. Although these two approaches are mathematically similar, they differ substantially in how they should be interpreted and reported. Despite this, many researchers do not recognise the distinction and analyse their data using an unreasoned hybrid of the two methods.

Problems with null hypothesis statistical tests

Regardless of whether significance testing or hypothesis testing (or a hybrid) is considered, null hypothesis statistical tests have numerous problems [[4], [5], [7]]. Five crucial problems are explained in Box 1. Each of these problems is fundamental enough to make null hypothesis statistical tests unfit for use in research. This may surprise many readers, given how widely such tests are used in published research [[1], [2]].