Wednesday, January 11, 2012

Controversies Surrounding Classical Null Hypothesis Testing (NHT)

Classical Null Hypothesis Testing (NHT) is the dominant paradigm for conducting scientific investigations. The focus in NHT is on stating and testing hypotheses prior to conducting research. This singular focus is both a strength and a weakness. In the first part of this post, I'll describe the steps of NHT. In the second part of the post I'll catalog the long list of problems that have emerged over time with the application of NHT. The bottom line from experience with NHT would seem to be that if you have a precise hypothesis and if you meet all the requirements of the NHT approach, it is a powerful technique for advancing science. Problems arise when you don't have a narrow hypothesis (for example, when you are testing a more complicated structural equation model) or when you don't meet all the requirements of the approach (a very common occurrence).

NHT was developed in the 1930s by Ronald Fisher, Jerzy Neyman and Egon Pearson (it is often called Neyman-Pearson hypothesis testing) in response to the needs of agricultural experimentation. Over time, NHT was generalized to cover any well-formulated experimental design. The steps in NHT are:
  1. State The Research Hypothesis A scientific hypothesis is often defined as a proposed explanation for some phenomenon that can be tested. A hypothesis is not the same as a scientific theory, which is viewed as an internally consistent collection of hypotheses. An example of a widely discussed scientific hypothesis is the Global Warming hypothesis that anthropogenic (human-induced) CO2 emissions will lead to an increase in global temperature. The GW hypothesis is interesting because it is not inherently testable by experimental manipulation and is controversial. A simpler hypothesis that could potentially be tested by a scientific experiment is that Yoga is superior to other forms of exercise and stretching for relieving lower back pain (the hypothesis is the subject of a current study here).
  2. Restate The Hypothesis in Null Form In NHT decision making, the research hypothesis is not directly tested. Instead, the researcher attempts to state and reject a null hypothesis, for example, that CO2 emissions do not cause global warming or that Yoga has no effect on reducing lower back pain. The null hypothesis restatement brings statistical analysis squarely into the research process. It provokes the question of how much of an increase in global temperature as a result of CO2 emissions would be necessary to reject the null hypothesis, or how much reduction in lower back pain as a result of Yoga would be necessary to reject the null hypothesis. The restatement is based on the assumption that it is easier to reject an assertion than it is to accept an assertion as true. For example, the assertion that all swans are white can never be proven by observing more white swans, but it is refuted by the first black swan that is observed (see Black Swan Theory and knowledge of how financial crises develop).
  3. Identify Statistical Tests, Distributions and Assumptions The next step involves deciding how to test the null hypothesis. It is a large step and could potentially involve a wide range of considerations. Typically, NHT chooses from a restricted range of predefined test statistics based on normal distribution theory. The idea that a statistic computed from a small sample with an unknown distribution can be referred to a known reference distribution rests on the central limit theorem (CLT). Under the CLT, it can be both proven and demonstrated (here) that the mean of a relatively large sample (say over 100) of weirdly distributed observations will be approximately normally distributed, no matter how the individual observations are distributed. Therefore, test statistics built from sample means can be compared to normal-theory distributions even when the raw data are far from normal (the first sketch after this list illustrates the idea by simulation). The CLT breaks down if the random variables are correlated or if the sample was not drawn by a random process (a random sample). For example, did we take all our global temperature measurements in urban heat islands or did we randomly sample the entire world? Did we assign all the people with untreatable spinal conditions to the control group rather than the experimental group that attended Yoga classes?
  4. Compute Test Statistic(s) Now we are at the point of collecting data and computing test statistics. This brings up the issue of how many observations to collect. For example, critics of the GW hypothesis argue that we need thousands of years of data, not just data from the beginning of the Industrial Revolution, for determining the earth's natural climate cycles. It turns out that the power of a statistical test can be pushed arbitrarily close to one by choosing a large enough sample size. In other words, if you want to prove that Yoga is better than control, just choose a large enough sample size and you are almost certain to find a statistically significant difference, however small the true effect (the second sketch after this list demonstrates the point).
  5. Test Assumptions Once you have gathered the data and computed your test statistic(s), the issue of whether you have actually met the assumptions of your test statistic remains. Tests of assumptions cover the normality assumption using, for example, the Shapiro-Wilk test, and the assumption of homogeneity of variance using, for example, Levene's test. If you fail to meet the assumptions, you can switch to nonparametric statistical tests based either on rank transformations or on resampling techniques such as the bootstrap (the third sketch after this list shows what these checks might look like). The nonparametric techniques do not necessarily generalize to some idealized larger population distribution.
  6. Reject or Fail To Reject Null Hypothesis Whether you have met all the assumptions or switched to nonparametric tests, you are now at the point of either rejecting or failing to reject the null hypothesis. This is done by choosing some extreme point in the null distribution (typically a point beyond which only 5% or 1% of samples would lie) and concluding that a result this far out in the critical region of the null hypothesis distribution is unlikely to have resulted from chance alone (the fourth sketch below computes the conventional cutoffs).
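To make step 3 concrete, here is a first, minimal sketch in Python (my own illustration, not drawn from any cited study) of the CLT at work: the simulated raw observations come from a strongly skewed exponential distribution, yet the means of repeated samples of size 100 are nearly normally distributed.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

# 10,000 independent samples, each of size n = 100, drawn from a heavily
# skewed (exponential) distribution with true mean 1.0.
n, reps = 100, 10_000
raw = rng.exponential(scale=1.0, size=(reps, n))
sample_means = raw.mean(axis=1)

# The raw observations are strongly right-skewed, but the distribution of
# the sample means is close to normal: centered on 1.0 with standard error
# about 1/sqrt(n) = 0.1 and almost no skew.
print("skewness of raw observations:", round(skew(raw.ravel()), 2))   # ~2.0
print("skewness of sample means:    ", round(skew(sample_means), 2))  # ~0.2
print("mean, SE of sample means:    ",
      round(sample_means.mean(), 3), round(sample_means.std(ddof=1), 3))
```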
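The sample-size point in step 4 appears in a second sketch. The "Yoga" and control groups and the tiny 0.05 standard deviation effect below are entirely hypothetical; the only point is that the same negligible true effect moves from nonsignificant to "highly significant" as the groups grow.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A tiny, practically negligible true difference: the hypothetical "yoga"
# group improves pain scores by 0.05 standard deviations more than control.
true_effect = 0.05

for n in (50, 500, 50_000):
    control = rng.normal(loc=0.0, scale=1.0, size=n)
    yoga = rng.normal(loc=true_effect, scale=1.0, size=n)
    t, p = stats.ttest_ind(yoga, control)
    print(f"n per group = {n:6d}   t = {t:5.2f}   p = {p:.4g}")

# With small groups the difference is typically nowhere near significant;
# with 50,000 per group the same negligible effect is "highly significant".
```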
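A third sketch shows what the assumption checks in step 5 might look like with SciPy, again on made-up, deliberately skewed data: Shapiro-Wilk for normality, Levene's test for homogeneity of variance, and a rank-based Mann-Whitney U test as one possible nonparametric fallback.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical pain-reduction scores for a yoga and a control group;
# the data are deliberately skewed so that the normality assumption fails.
yoga = rng.exponential(scale=2.0, size=40)
control = rng.exponential(scale=1.5, size=40)

# Shapiro-Wilk tests normality within each group; Levene's test checks
# homogeneity of variance across groups.
_, p_norm_yoga = stats.shapiro(yoga)
_, p_norm_ctrl = stats.shapiro(control)
_, p_var = stats.levene(yoga, control)
print(f"Shapiro-Wilk p (yoga, control): {p_norm_yoga:.3f}, {p_norm_ctrl:.3f}")
print(f"Levene p (equal variances):     {p_var:.3f}")

# If the assumptions fail, one option is a rank-based nonparametric test
# such as Mann-Whitney U instead of the two-sample t-test.
_, p_mw = stats.mannwhitneyu(yoga, control, alternative="two-sided")
print(f"Mann-Whitney U p:               {p_mw:.3f}")
```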
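Finally, a fourth sketch computes the conventional two-sided cutoffs for step 6, for a normal (z) test statistic at the 5% and 1% levels.

```python
from scipy import stats

# Critical values for a two-sided z-test: the null hypothesis is rejected
# only if the test statistic falls beyond these points in the tails.
for alpha in (0.05, 0.01):
    z_crit = stats.norm.ppf(1 - alpha / 2)
    print(f"alpha = {alpha}: reject H0 if |z| > {z_crit:.2f}")
# alpha = 0.05: reject H0 if |z| > 1.96
# alpha = 0.01: reject H0 if |z| > 2.58
```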
Because NHT is the dominant paradigm in scientific investigation, experience with it has demonstrated both its usefulness and a wide range of practical and theoretical problems in application. I will start with what I think are the most important criticisms and by no means exhaust the entire list (more here and here):
  • Data Snooping and Cherry Picking A lot of the success of NHT depends on unverifiable assumptions about the experimenter and how either the experiment was conducted or the data were generated. When NHT was developed in the 1930s, it was not possible to accumulate large databases and mine the data for significant results. It was more realistic to simply construct a small experiment and follow the NHT rules. With the wide availability of computers and associated databases in the 1960s, data snooping became possible. Data dredging is a problem because it invalidates the NHT assumptions: a result may appear significant in a non-randomly collected sample and yet not hold in the population (the first sketch after this list shows how easily pure noise produces "significant" results). Cherry picking is the (unintentional?) process of selecting cases that prove your point, again through non-random sampling. Researchers testing the GW hypothesis have specifically been criticized for allegedly cherry picking results and using biased statistical techniques (the hockey stick controversy).
  • P-value Fixation A minimum criterion for publication in scientific journals is commonly taken as a p-value of either p ≤ 0.05 or p ≤ 0.01, weak and strong evidence respectively. Unfortunately, the p-value says nothing about the size of the observed effect. In an excellent article by Steve Goodman (here), it is pointed out that "A small effect in a study with large sample size can have the same P value as a large effect in a small study" (the second sketch after this list checks this directly). This problem leads into the difficult and often contradictory area of statistical power analysis, which I will take up in a future post.
  • Introducing Bias Into The Scientific Literature Because there are many ways to game the NHT approach, because academic journals rarely publish null results and because scientific careers are built on publishing, the scientific literature itself is subject to publication bias. Since readers can never know precisely how researchers generated their data, there is always a question about individual findings. The solution to the problem is replication, that is, another independent researcher confirming a set of findings. Unfortunately, journals seldom publish replication studies and some studies are so expensive to conduct that replication is infeasible.
  • Theories and Hypotheses vs. Models The idea that theories are collections of hypotheses waiting to be tested is a very Victorian view of science (think Sherlock Holmes) and may not really be appropriate to current scientific activity. Rather than focusing on verbal constructs (theories and hypotheses), current scientific activity is directed more toward developing causal models of the real world. For example, rather than narrowly testing the GW hypothesis, scientists are more concerned with developing global climate models, sometimes called General Circulation Models (GCMs). Science evolves from simple, easily understood models to more complicated representations based on evolving knowledge and on the ability of new models to outperform old models (the development of weather forecasting models, which have obviously improved our ability to predict local weather, is a good example). Model development has its own series of problems, and complex models are not always better than simple models (see, for example, the experience with econometric models in the late-2000s financial crisis). However, as in the GW controversy, the onus now is on critics to propose better models rather than focus narrowly on individual (possibly cherry picked) hypothesis tests.
  • Theoretical Deficiencies Critics have argued that NHT is theoretically incoherent. For example, from the perspective of Bayesian inference, we never start from the blank slate implied by NHT. We have some reason for performing the study, and the reason is usually that we believe in the research hypothesis, not the null hypothesis. Bayesian inference, which I will discuss in a later post, specifically takes into account prior expectations about the research hypothesis and also lends itself to the comparison of multiple models (multi-model selection) rather than a single null hypothesis test.
  • Artificial Limitations on the Advance of Knowledge Under NHT, scientific knowledge advances through experimentation. Yet much useful scientific knowledge has been developed without experimentation, often in areas where experimentation is impossible. Scientific knowledge of the global climate system has developed rapidly in the last few decades without the possibility of experimental manipulation, random assignment or random sampling. Advances have resulted from focusing on model building rather than hypothesis testing.
  • Failures in Practice NHT involves many complicated steps that are often ignored in practice. Typically, statistical power is not controlled and assumptions are seldom tested. What effect these omissions might have on the quality of scientific results is generally unknown and possibly unknowable.
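The first sketch below makes the data-dredging criticism concrete. The variables are entirely artificial: many pure-noise predictors are tested against a pure-noise outcome, and even with no real relationships, roughly 5% of the tests come out "significant" at the 0.05 level. Reporting only those "discoveries" is data snooping.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# 200 noise predictors tested against a noise outcome: with no real
# relationships, about 5% of the correlations will be "significant" at 0.05.
n_obs, n_vars = 100, 200
outcome = rng.normal(size=n_obs)
false_positives = 0
for _ in range(n_vars):
    predictor = rng.normal(size=n_obs)
    r, p = stats.pearsonr(predictor, outcome)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_vars} pure-noise variables are "
      f"'significant' at p < 0.05")
```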
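The second sketch checks Goodman's point about p-values and effect sizes directly. It assumes a two-sample t-test in which the observed standardized difference (Cohen's d) exactly equals the stated effect size; the particular effect sizes and group sizes are chosen only for illustration.

```python
import numpy as np
from scipy import stats

def p_value(effect_size, n_per_group):
    """Two-sided p-value for a two-sample t-test, assuming the observed
    standardized difference (Cohen's d) equals effect_size."""
    t = effect_size * np.sqrt(n_per_group / 2)
    df = 2 * n_per_group - 2
    return 2 * stats.t.sf(t, df)

# A large effect in a small study...
print("d = 0.8, n = 25 per group:   p =", round(p_value(0.8, 25), 4))
# ...and a small effect in a large study yield essentially the same p-value.
print("d = 0.1, n = 1600 per group: p =", round(p_value(0.1, 1600), 4))
```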
The lessons from experience with NHT should be pretty clear: if you have a simple hypothesis that can be subjected to experimental manipulation and if you follow all the NHT rules, it can be an effective approach to advancing scientific knowledge. If you don't, there are other approaches. In future posts I will discuss statistical power analysis, Bayesian inference and multi-model selection.
