Saturday, January 28, 2012

How Complicated Models Broke the NHT Paradigm

In a prior post (here) I pointed out the weaknesses of the Null Hypothesis Testing (NHT) paradigm using a simple data set on automobile mileage (here). What I demonstrated was that if you arrive at the end of the NHT process (described here) with an insignificant result, you can always try to increase the sample size (if that is possible) until even the small difference you observed becomes statistically significant. The ability to manufacture significant results by manipulating effect size and sample size should make everyone a little uneasy about the body of scientific work generated under the NHT paradigm. In this post, I will return to the beginning of the NHT process and ask what types of theories generate the research hypotheses on which NHT is based. The conclusion is that if the theories always generate "injection models," then all the weaknesses of NHT apply. However, if the theories generate multiple models, rather than research and null hypotheses, there is a way out of the NHT dead end.

At first, it might seem that no uniform class of theories could possibly generate research hypotheses, given the breadth and depth of scientific theories in current use. If we classify the theories most used in classical NHT by the types of models involved, however, a general class of models called "injection models" turns out to be well suited to the NHT paradigm.

By a model I mean a structural equation model and its associated directed graph. For example, the model underlying the research hypothesis about automobile miles per gallon (MPG) is displayed in the path diagram above. The independent variable we focused on was the type of transmission, but obviously there are many other factors that go into the determination of MPG. Models of this type can always be written as a general linear model and can often be tested by simple linear regression. They are all "injection models" because the effects of the independent variables are "injected" into the dependent variable. The null hypothesis is always that there is no injection.
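As a minimal sketch (not the exact specification in the diagram), an injection model for MPG can be written as a single linear regression in R using the mtcars data from the earlier post; the particular covariates (weight and horsepower) are chosen here only for illustration:

```r
# Hedged sketch of an injection model: transmission type (am), weight (wt) and
# horsepower (hp) are all "injected" into the dependent variable mpg.
# The covariate set is illustrative, not the exact model from the diagram.
data(mtcars)
injection_fit <- lm(mpg ~ am + wt + hp, data = mtcars)
summary(injection_fit)  # the null hypothesis is that no coefficient differs from zero
```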


There is a problem, however, with the causal implications of regression models expressed in mathematical terms. The first equation above is the standard regression equation. If y = MPG and x ∈ {0, 1} depending on whether or not the car has an automatic transmission, then we have an injection model for automobile mileage. Mathematical objects, however, are symmetric, and nothing prevents us from re-expressing the model to predict the type of transmission from MPG. While predicting the type of transmission from MPG might be an interesting exercise, we know a priori that mileage doesn't "cause" a car to have a particular type of transmission. We need something beyond mathematics to describe causal relationships.
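In symbols, the forward regression and its reversed counterpart look something like this (a sketch of the equations referred to above; the notation in the original may differ, and the gamma coefficients are just labels for the reversed model):

$$y = \beta_0 + \beta_1 x + \varepsilon \qquad\text{vs.}\qquad x = \gamma_0 + \gamma_1 y + u$$

The algebra is perfectly happy either way; only our causal knowledge rules out the second reading.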

Causal diagrams make the direction of causality clear from the direction of the arrow connecting two variables. The importance of path diagrams for encoding causal information was discovered by the geneticist Sewall Wright in the 1920's. Wright retired to the University of Wisconsin in 1955, where he had an influence on economists and sociologists.
With path diagrams it became relatively easy to develop complicated causal models that do not lend themselves to restatement as a simple null hypothesis. For example, in the path diagram above (from Pedhazur and Kerlinger, 1982 and Karl Wuensch, here), socioeconomic status (SES) is taken as an exogenous variable that influences IQ, need for achievement (nAch) and, ultimately, grade point average (GPA). The variables at the receiving end of causal arrows (IQ, nAch and GPA) are considered endogenous. It is important to this theory of grade point determination that IQ and nAch mediate the influence of SES on GPA. That is, you will find lower-SES students with high GPAs as a result of their IQ and their need for achievement. You will also find some high-SES students with neither high IQ nor high need for achievement, but they will probably still have better GPAs than low-SES students.
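As a hedged sketch of how such a path model can be estimated today, here is one way to fit it in R with the lavaan package using simulated data. The arrow set is my reading of the diagram described above (it may differ from Pedhazur and Kerlinger's specification), and the coefficients used to generate the data are arbitrary:

```r
# Simulate data consistent with the causal ordering described above,
# then fit the path model with lavaan. Not the authors' data or model.
library(lavaan)

set.seed(1)
n    <- 200
SES  <- rnorm(n)                          # exogenous
IQ   <- 0.4 * SES + rnorm(n)              # endogenous, influenced by SES
nAch <- 0.3 * SES + 0.4 * IQ + rnorm(n)   # endogenous, influenced by SES and IQ
GPA  <- 0.1 * SES + 0.4 * IQ + 0.4 * nAch + rnorm(n)
d    <- data.frame(SES, IQ, nAch, GPA)

model <- '
  IQ   ~ SES
  nAch ~ SES + IQ
  GPA  ~ SES + IQ + nAch
'
fit <- sem(model, data = d)
summary(fit, standardized = TRUE)
```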

In the NHT paradigm, path models have to be converted into injection models, as in the graphic above. In the "testable" NHT model, SES, IQ and nAch are all treated as exogenous variables determining grade point. The simple null hypothesis would be that none of these factors has a statistically significant impact on GPA. The idea of mediating or intervening variables (IQ and nAch mediating the influence of SES on grade point) is dropped. The reformulated model above, however, is still not enough for hard-core experimentalists.
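Continuing with the simulated data frame d from the sketch above, the NHT reformulation collapses to a single regression (again a sketch, not the authors' analysis):

```r
# The injection-model version: SES, IQ and nAch all treated as exogenous,
# the mediation structure discarded. Uses the simulated data frame d above.
summary(lm(GPA ~ SES + IQ + nAch, data = d))
```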

For the hard-core experimentalists, Pedhazur and Kerlinger have to give up any idea of drawing causal links from SES and IQ to GPA because these variables cannot be experimentally manipulated. It would not be possible to conduct an Eliza Doolittle experiment in which subjects were randomly assigned to have their SES manipulated so that changes in GPA could be observed. It would also not be practical to manipulate IQ. Since it might be possible to manipulate nAch (written do(nAch) in Judea Pearl's notation), what the experimentalists require is that all extraneous factors be eliminated through random assignment. The fact that we are no longer testing Pedhazur and Kerlinger's theory is not an issue because the theory is not really testable within the NHT paradigm.

Unfortunately, there are any number of important research questions for which experimentation is neither possible nor desirable, global climate change being just one example (we simply cannot manipulate the world system). Instead, we are forced to study intact populations or intact systems using models. Path diagrams provide a useful and general (that is, nonparametric) way of describing causality based on an understanding of the system being studied. Contending models always exist in various stages of testing. NHT is well suited to the critical experiment where a hypothesis can be directly tested, but even there we never really test the research hypothesis, only the null alternative. The focus of the NHT statistician is on hypotheses, while the focus of scientists is on models.

Returning to Pedhazur and Kerlinger's path model, not only did the absence of experimental design generate howls of protest from the experimental and statistical scolds (here and here; basically, "correlation cannot prove causation"), but the approach to estimating path diagrams was also criticized because technical statistical requirements (the assumptions of classical normal distribution theory) were being violated. I'll get into these issues in a future post. For now, what is important to understand is that multiple, competing causal models (rather than hypotheses or statistical distributions) are the important parts of science, and they always exist in various states of confirmation and acceptance. Statistical technique needs to be able to test models in addition to merely testing hypotheses. Multi-model development, testing and selection provides one way out of the NHT dead end.

Sunday, January 15, 2012

What Happens When You Fail To Reject?


In a prior post (here), I summarized the approach of classical Null Hypothesis Testing (NHT). Rather than starting at the beginning of the process in the graph above (click to enlarge), I want to start, for this post, at the end and ask "What if you complete a study and fail to reject the null hypothesis?" Posing this question brings up the complicated and controversial problem of statistical power analysis and will eventually take us back to the beginning of the NHT process with a better understanding of how to plan experiments, studies and their appropriate statistical analysis.

First, a point about hypothesis testing has to be clarified because it is so often misunderstood. The p-value behind a failure to reject answers the question "Given that the null hypothesis, H0, is true, what is the probability of these or more extreme results?" It does not answer "Given these data, what is the probability that H0 is true?" Treating the two as interchangeable amounts to accepting the null hypothesis, an issue I will defer to later. For now, these and other confusions are part of the well-known problems with NHT.

Now, a naive reading of the NHT process might suggest that if you fail to reject the null hypothesis (assuming you are not confused about whether you are accepting H0 or not), that's the end of the game. In reality, such a result is a disaster for the researcher. Time, energy and funding have been expended on the study. Unless the study can somehow be saved, months, maybe years, of work have been lost; promotion, renewed funding and even tenure might be on the line. As a statistical consultant, I have had many clients in my office over the years desperate for me to rescue a failed study. Assuming that other issues in the "Testing Assumptions" loop (above) have been dealt with, there are only two alternatives to failure: (1) accept the null hypothesis or (2) increase the sample size. Both strategies turn on statistical power analysis, a complicated area whose subtleties and controversies I will begin to unpack here.

To be concrete, imagine that you and your significant other want to purchase a new automobile. You have both taken an introductory course in statistics, and you both used a 2010 textbook by Alan Bluman (here). The big issue between you and your significant other is whether to get an automatic transmission or a stick shift. You are a bad driver and would like the automatic. Your significant other would like the stick shift and has been arguing that it gets better gas mileage. You have found a built-in data frame in the R statistical language called mtcars (here). You think it is a random sample of automobiles collected by Motor Trend magazine (it isn't), so you proceed by taking a look at the data to see whether it is really what you want.
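A quick look at the data might go something like this (a sketch; any of the usual inspection functions would do):

```r
# Load the built-in Motor Trend data and take a first look.
data(mtcars)
str(mtcars)           # 32 cars, 11 variables from the 1973-74 model years
head(mtcars)          # includes mpg, am (0 = automatic, 1 = manual), hp, wt
table(mtcars$am)      # how many automatics vs. manuals are in the sample
```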

First, there's a question about whether "data peeking" is allowed by the NHT framework. You've heard about Exploratory Data Analysis (EDA) and like the idea of peeking around, even though you feel a little guilty about it. What you find in the data is somewhat surprising. There are all sorts of high-powered sports cars included in the sample. Since there is no chance you and your significant other will be buying a high-powered sports car, you decide that something must be done. You've heard about rejection sampling and decide to resample the data, rejecting automobiles that you have no interest in purchasing.
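The exact rejection rule isn't recorded here, but the idea can be sketched as simple subsetting in R (the horsepower cutoff is purely illustrative):

```r
# "Rejection sampling" in its simplest form: throw out cars you would never buy.
# The 150 horsepower cutoff is an arbitrary illustration, not the rule that
# produced the nine-car sample discussed below.
affordable <- subset(mtcars, hp < 150)
nrow(affordable)
table(affordable$am)   # check how many automatics survive the cut
```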

Before you get on to the rejection sampling, you remember from page 7-10 of your textbook that there is a way to calculate the desired sample size to detect a given difference (the effect size) between manual and automatic transmissions using the following formula:

$$\bar{X} - E < \mu < \bar{X} + E, \qquad E = z_{\alpha/2}\,\frac{s}{\sqrt{n}}, \qquad n = \left(\frac{z_{\alpha/2}\, s}{E}\right)^{2}$$

where X̄ is the observed mean MPG, the Greek letter μ is the population mean MPG, E is the maximum error of the estimate from the confidence interval that brackets the population value, z is the percentage point from the quantile function of the normal distribution for a two-tailed test at α = 0.05 (that is, the 0.05/2 tail), s is the sample standard deviation, and dividing s by the square root of n turns the standard deviation into a standard error.

Your first step in using the formula is to think a bit about miles per gallon (MPG). For automobiles, it probably ranges from about 5 to a maximum of 40 MPG, at least for the early 1970's when your sample was taken. You would like to be able to detect a 5 MPG difference, so you estimate the standard deviation as approximately s = (40-5)/6 = 5.8 and choose a conventional 0.05 probability of making an error, with E = 5. The calculation tells you that you'll need 8 observations to construct a confidence interval around the estimated effect of automatic vs. manual transmission on MPG (if you verify the calculation by hand with z = 1.96 you get n of about 5; because the sample size is small, n < 30, percentage points were actually taken from a t-distribution to arrive at n = 8). The sample size seems somewhat small, but you proceed with the rejection sampling in any event.
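The arithmetic can be checked in a few lines of R (a sketch; the rounding convention and the switch from z to t percentage points account for the gap between roughly 5 and 8):

```r
# Rough sample-size calculation for detecting a 5 MPG difference.
s <- (40 - 5) / 6          # crude standard deviation estimate from the range
E <- 5                     # maximum error we are willing to tolerate
z <- qnorm(1 - 0.05 / 2)   # two-tailed normal percentage point, about 1.96
(z * s / E)^2              # about 5.2 observations using z

# Using t percentage points instead (which depend on n) requires iterating:
n <- 2
while (n < (qt(1 - 0.05 / 2, df = n - 1) * s / E)^2) n <- n + 1
n                          # settles at 8, consistent with the figure quoted above
```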

The new sample contains nine automobiles. The overall F-test for the regression model of MPG ~ am (in R syntax, where "am" is automatic-or-manual) is not significant (p-value = 0.4332), and you wonder whether you have enough evidence to get your automatic transmission.

When you present the result to your significant other, it is met with a lot of skepticism. On a closer look at the rejection sample, you find that you have only one automatic-transmission automobile in it. Your significant other asks you to get a few more automatic-transmission automobiles into the sample, so you add a Hudson Hornet 4 Drive, a Hornet Sportabout and a Plymouth Valiant (really!). You rerun the analysis and this time get a significant difference, p-value = 0.04918, with the stick shift getting about 6 MPG more than the automatic transmission (point estimate 6.35 MPG, 95% confidence interval 0.17 to 12.5), which meets your criterion for an important difference in mileage.
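The analysis itself is a one-line regression. The exact twelve-car sample isn't recorded here, so the numbers from an illustrative subset like the one below will not match those quoted above:

```r
# Illustrative only: an arbitrary small subset standing in for the rejection
# sample described above (the actual cars used are not recorded), plus the
# three automatics named in the text.
small <- rbind(subset(mtcars, hp < 100),
               mtcars[c("Hornet 4 Drive", "Hornet Sportabout", "Valiant"), ])
fit <- lm(mpg ~ am, data = small)
summary(fit)     # overall F-test and the am coefficient (manual minus automatic)
confint(fit)     # confidence interval for the MPG difference
```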

What does it all mean? From my experience as a consulting statistician, this is pretty much how data analysis is conducted in the personal-computer era; the example may even be a step ahead of typical practice in remembering to conduct an initial power analysis to determine sample size. The main lesson of this fabricated example is that an insignificant result does not mean, within the NHT paradigm, that you can accept the null hypothesis. There is always some sample size for which you will be able to find a difference, because as sample size increases the standard error decreases and your ability to detect a difference increases. If you had a firm idea of what constitutes an important difference, you would know when to stop. But you seldom have a firm position on effect size (maybe a 2 MPG difference would be important over the life of the automobile if gasoline prices went up to $5 per gallon).

This well-known result (that increasing the sample size increases the chance of finding significant differences), coupled with the unwillingness of journals to publish negative results, should make everyone a little queasy. Had you not found a significant result with a sample of 11, you could have gone out and obtained more data until a significant difference was found. Any idea of accepting the null is always subverted by increasing the sample size. On the other hand, your significant other can take little joy in a significant result obtained under small-sample conditions, since the result may well have been a fluke (especially given the type of automobiles added to the sample).
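The point can be illustrated with a small simulation (a sketch with made-up numbers: a true difference of 2 MPG and a standard deviation of 6):

```r
# Power of a two-sample t-test for a fixed, modest true difference (2 MPG,
# sd = 6, both invented for illustration) as the per-group sample size grows.
# Any nonzero difference eventually becomes "significant" if n is large enough.
sizes <- c(10, 25, 50, 100, 250, 500)
sapply(sizes, function(n)
  power.t.test(n = n, delta = 2, sd = 6, sig.level = 0.05)$power)
```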

In a previous post on classical hypothesis testing (here) I mentioned the issue of statistical power analysis. Power is the ability of a study to detect a difference that is really there. Think of it as analogous to the power of a microscope. If the microscope is underpowered, you won't see any microbes. If it is overpowered, you won't see any microbes either, because you've descended to the atomic level.

The person who arguably did the most to bring power analysis to the attention of researchers was Jacob Cohen (1923-1998). In a 1992 piece titled "A Power Primer," Cohen stated the problem very clearly: the power of typical small-sample studies was so low that the "...chance of obtaining a significant result was about that of tossing a head with a fair coin." This state of affairs threatens the entire edifice of classical null hypothesis testing (NHT).

In the 1992 piece, Cohen suggests that the reason might be that power calculations are too difficult. That surely can't be true in the 21st century, since there is a full range of univariate power procedures in the R pwr package (here). There are deeper reasons, which I will discuss in future posts on confidence intervals, Bayesian inference and multi-model inference. In the alternative approaches to NHT, power issues seem less important but are lurking in the background in any event. Steven Goodman, in a series of papers (here), has probably done the most to bring NHT problems to the attention of the research community. He suggests the use of Bayes factors.
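For what it's worth, a basic power calculation now takes one line (a sketch using the pwr package; the effect size d = 0.5 is Cohen's conventional "medium" effect):

```r
# How many subjects per group are needed to detect a "medium" effect
# (Cohen's d = 0.5) with 80% power at the conventional 0.05 level?
library(pwr)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80, type = "two.sample")
```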

In future posts, I'll go back to the beginning of the NHT diagram above and work through the steps a little more slowly now that we are sensitized to the problems in getting to the final destination.

NOTE: I learned statistical power analysis from G. William Walster when he was at the University of Wisconsin in the 1970's. Neither Bill nor I had either a Hudson Hornet or a Plymouth Valiant at the time. He and Marietta Tretter did the initial work on computing exact noncentral multivariate distributions (here). The difficulties in doing these calculations on computers involved finite word length and the accumulation of round-off error. These numerical problems led Bill to work on interval arithmetic for Sun Microsystems (rather than a single rounded-off number, you would know the interval in which your result had to lie; see more discussion here). Multivariate power calculations still present challenging numerical problems.

Wednesday, January 11, 2012

Controversies Surrounding Classical Null Hypothesis Testing (NHT)

Classical Null Hypothesis Testing (NHT) is the dominant paradigm for conducting scientific investigations. The focus in NHT is on stating and testing hypotheses prior to conducting research. This singular focus is both a strength and a weakness. In the first part of this post, I'll describe NHT using the diagram above (click to enlarge). In the second part, I'll catalog the long list of problems that have emerged over time with the application of NHT. The bottom line from experience with NHT would seem to be that if you have a precise hypothesis and if you meet all the requirements of the NHT approach, it is a powerful technique for advancing science. Problems enter when you don't have a narrow hypothesis (for example, when you are testing a more complicated structural equation model) and when you don't meet all the requirements of the approach (a very common occurrence).

NHT was developed in the 1930's by Ronald Fisher, Jerzy Neyman and Egon Pearson (it is often called Neyman-Pearson hypothesis testing) in response to the needs of agricultural experimentation. Over time, NHT was generalized to cover any well-formulated experimental design. The steps in NHT are:
  1. State The Research Hypothesis A scientific hypothesis is often defined as a proposed explanation for some phenomenon that is testable. A hypothesis is not the same as a scientific theory, which is viewed as an internally consistent collection of hypotheses. An example of a scientific hypothesis being widely discussed is the Global Warming (GW) hypothesis that anthropogenic (human-induced) CO2 emissions will lead to an increase in global temperature. The GW hypothesis is interesting because it is not inherently testable by experimental manipulation and is controversial. A simpler hypothesis that could potentially be tested by a scientific experiment is that Yoga is superior to other forms of exercise and stretching for relieving lower back pain (this hypothesis is the subject of a current study here).
  2. Restate The Hypothesis in Null Form In NHT decision making, the research hypothesis is not directly tested. Instead, the researcher attempts to state and reject a null hypothesis, for example, that CO2 emissions do not create global warming or that Yoga has no effect on lower back pain. The null-hypothesis restatement brings statistical analysis squarely into the research process. It provokes the question of how much of an increase in global temperature as a result of CO2 emissions, or how much reduction in lower back pain as a result of Yoga, would be necessary to reject the null hypothesis. The restatement is based on the assumption that it is easier to reject an assertion than to confirm one. For example, the assertion that all swans are white can never be confirmed by counting white swans, but it is refuted by the first black swan that is observed (see Black Swan Theory and knowledge of how financial crises develop).
  3. Identify Statistical Tests, Distributions and Assumptions The next step involves deciding how to test the null hypothesis. It is a large step and could potentially involve a wide range of considerations. Typically, NHT chooses from a restricted range of predefined test statistics based on normal distribution theory. The idea that we can connect an unknown small sample to a known population distribution rests on the central limit theorem (CLT). Under the CLT, it can be both proven and demonstrated (here) that the mean of any relatively large (say over 100) collection of independent, weirdly distributed observations will be approximately normally distributed. Therefore, test statistics built from sample means will be approximately normal even when the underlying data are not (see the simulation sketch after this list). The CLT breaks down if the random variables are correlated or if the sample was not drawn by a random process (a random sample). For example, did we take all our global temperature measurements in urban heat islands, or did we randomly sample the entire world? Did we assign all the people with untreatable spinal conditions to the control group rather than to the experimental group that attended Yoga classes?
  4. Compute Test Statistic(s) Now we are at the point of collecting data and computing test statistics. This brings up the issue of how many observations to collect. For example, critics of the GW hypothesis argue that we need thousands of years of data, not just data from the beginning of the Industrial Revolution, to determine the earth's natural climate cycles. It turns out that the power of a statistical test can be driven as high as you like with a large enough sample size. In other words, if you want to prove that Yoga is better than the control, just choose a large enough sample size and you are very likely to find a statistically significant difference.
  5. Test Assumptions Once you have gathered the data and computed your test statistic(s), the issue remains of whether you have actually met the assumptions behind the test. Tests of assumptions cover the normality assumption (using, for example, the Shapiro-Wilk test) and the assumption of homogeneity of variance (using, for example, Levene's test); both are sketched after this list. If you fail to meet the assumptions, you can switch to nonparametric statistical tests based either on rank transformations or on Monte Carlo techniques such as the bootstrap. The nonparametric techniques do not necessarily generalize to some idealized larger population distribution.
  6. Reject or Fail To Reject the Null Hypothesis Whether you have met all the assumptions or switched to nonparametric tests, you are now at the point of either rejecting or failing to reject the null hypothesis. This is done by choosing some extreme point in the null distribution (typically the point beyond which 5% or 1% of the sampling distribution would lie) and concluding that a result this far out in the critical region could not plausibly have resulted from chance alone.
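Here is the simulation sketch promised in steps 3 and 5 (all numbers are invented for illustration): even badly skewed data produce approximately normal sample means, which is what licenses the usual test statistics, and the standard assumption checks are one-liners.

```r
# Step 3: the central limit theorem in action. Individual observations are
# drawn from a badly skewed exponential distribution, but the means of
# repeated samples of size 100 are approximately normal.
set.seed(1)
skewed_sample <- rexp(100, rate = 1)                  # one weirdly distributed sample
many_means    <- replicate(5000, mean(rexp(100, rate = 1)))
hist(skewed_sample, main = "One skewed sample")
hist(many_means,    main = "Distribution of 5000 sample means")

# Step 5: standard assumption checks on the raw sample.
shapiro.test(skewed_sample)   # Shapiro-Wilk normality test (will reject here)
# Levene's test for equal variances needs grouped data and lives in the car package:
# car::leveneTest(y ~ group, data = some_data_frame)  # sketch only; hypothetical data
```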
As the dominant paradigm in scientific investigation, NHT has demonstrated its usefulness but has also revealed a wide range of practical and theoretical problems in application. I will start with what I think are the most important criticisms; the list is by no means exhaustive (more here and here):
  • Data Snooping and Cherry Picking A lot of the success of NHT depends on unknown assumptions about the experimenter and about how the experiment was conducted or the data were generated. When NHT was developed in the 1930s, it was not possible to accumulate large databases and mine the data for significant results. It was more realistic to simply construct a small experiment and follow the NHT rules. With the wide availability of computers and associated databases starting in the 1960s, data snooping became possible. Data dredging is a problem because it invalidates the NHT assumptions: a result may appear significant in a non-randomly collected sample yet not hold in the population. Cherry picking is the (sometimes unintentional) process of selecting cases that prove your point, again through non-random sampling. Researchers testing the GW hypothesis have been specifically criticized for allegedly cherry-picking results and using biased statistical techniques (the hockey stick controversy).
  • P-value Fixation A minimum criterion for publication in scientific journals is commonly taken to be a p-value of either p ≤ 0.05 or p ≤ 0.01, weak and strong evidence respectively. Unfortunately, the p-value says nothing about the size of the observed effect. In an excellent article by Steve Goodman (here), it is pointed out that "A small effect in a study with large sample size can have the same P value as a large effect in a small study." This problem leads into the difficult and contentious area of statistical power analysis, which I will take up in a future post.
  • Introducing Bias Into The Scientific Literature Because there are many ways to game the NHT approach, because academic journals will not publish null results and because scientific careers are based on publishing, the scientific literature itself is subject to publication bias. Since readers can never know precisely how researchers generated their data, there is always a question about individual findings. The solution to the problem involves replication, that is, another independent researcher confirming a set of findings. Unfortunately, journals seldom publish replication studies and some studies are so expensive to conduct that replication is infeasible.
  • Theories and Hypotheses vs. Models The idea that theories are collections of hypotheses waiting to be tested is a very Victorian view of science (think Sherlock Holmes) and may not really be appropriate to current scientific activity. Rather than focusing on verbal constructs (theories and hypotheses), current scientific activity is directed more toward developing causal models of the real world. For example, rather than narrowly testing the GW hypothesis, scientists are more concerned with developing global climate models, sometimes called General Circulation Models (GCMs). Science evolves from simple, easily understood models to more complicated representations based on evolving knowledge and the ability of new models to outperform old models (the development of weather forecasting models, which have obviously improved our ability to predict local weather, is a good example). Model development has its own problems and doesn't guarantee that complex models are always better than simple models (see, for example, the experience with econometric models in the late-2000s financial crisis). However, as in the GW controversy, the onus is now on critics to propose better models rather than focus narrowly on individual (possibly cherry-picked) hypothesis tests.
  • Theoretical Deficiencies Critics have argued that NHT is theoretically incoherent. For example, from the perspective of Bayesian inference, we never start from the blank slate implied by NHT. We have some reason for performing the study, and the reason is usually that we believe in the research hypothesis, not the null hypothesis. Bayesian inference, which I will discuss in a later post, specifically takes into account prior expectations about the research hypothesis and also lends itself to the comparison of multiple models (multi-model selection) rather than NHT.
  • Artificial Limitations on the Advance of Knowledge Under NHT, scientific knowledge advances through experimentation. Unfortunately, much useful scientific knowledge has been developed without experimentation and where experimentation is impossible. Scientific knowledge of the global climate system has developed rapidly in the last few decades without the possibility of experimental manipulation, random assignment or random sampling. Advances have resulted from focusing on model building rather than hypothesis testing.
  • Failures in Practice NHT involves many essentially complicated steps that are often ignored in practice. Typically, statistical power is not controlled and assumptions are seldom tested. What effect these omissions might have on the quality of scientific results is generally unknown and possibly unknowable.
The lessons from experience with NHT should be pretty clear: if you have a simple hypothesis that can be subjected to experimental manipulation and if you follow all the NHT rules, NHT can be an effective approach to advancing scientific knowledge. If you don't, there are other approaches. In future posts I will discuss statistical power analysis, Bayesian inference and multi-model selection.