Sunday, January 15, 2012

What Happens When You Fail To Reject?


In a prior post (here), I summarized the approach of classical Null Hypothesis Testing (NHT). Rather than starting at the beginning of the process in the graph above (click to enlarge), I want to start, for this post, at the end and ask "What if you complete a study and fail to reject the null hypothesis?" Posing this question brings up the complicated and controversial problem of statistical power analysis and will eventually take us back to the beginning of the NHT process with a better understanding of how to plan for experiments, studies, and their appropriate statistical analysis.

First, a point about hypothesis testing has to be clarified because it is so often misunderstood. Failing to reject the null hypothesis is an answer to the question "Given that the null hypothesis, H0, is true, what is the probability of these or more extreme results?" It is not an answer to the question "Given these data, what is the probability that H0 is true?" The latter question is the one involved in accepting the null hypothesis, an issue I will defer until later. For now, these and other confusions are part of the well-known problems with NHT.

Now, a naive reading of the NHT process might suggest that if you fail to reject the null hypothesis (assuming you are not confused about whether you are accepting H0 or not), that's the end of the game. In reality, such a result is a disaster for the researcher. Time, energy and funding have been expended on the study. Unless the study can somehow be saved, months, maybe years, of work have been lost; promotion, renewed funding and even tenure might be on the line. As a statistical consultant, I have had many clients in my office over the years desperate for me to rescue a failed study. Assuming that other issues in the "Testing Assumptions" loop (above) have been dealt with, there are only two alternatives to failure: (1) accept the null hypothesis or (2) increase the sample size. Both of these strategies have to do with statistical power analysis. To explain the subtleties and controversies involved, I will have to start working through this complicated area.

To be concrete, imagine that you and your significant other want to purchase a new automobile. You have both taken an introductory course in statistics and you both used a 2010 textbook by Alan Bluman (here). The big issue between you and your significant other is whether to get an automatic transmission or a stick shift. You are a bad driver and would like the automatic. Your significant other would like the stick shift and has been arguing that it gets better gas mileage. You have found a built-in data frame in the R statistical language called mtcars (here). You think this is a random sample of automobiles collected by Motor Trend magazine (it isn't), so you proceed with your analysis by taking a look at the data in case it's not quite what you want.
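In R, that first look might go something like the sketch below (the column names come straight from the mtcars documentation; am is coded 0 = automatic, 1 = manual):

```r
# Load the built-in Motor Trend data and take a first look
data(mtcars)
head(mtcars)        # first few rows: mpg, cyl, disp, hp, ..., am, gear, carb
str(mtcars)         # 32 observations on 11 numeric variables
table(mtcars$am)    # transmission type: 19 automatics (0) and 13 manuals (1)
```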

First, there's a question about whether "data peeking" is allowed by the NHT framework. You've heard about Exploratory Data Analysis (EDA) and like the idea of peeking around even though you feel a little guilty. What you find in the data is somewhat surprising. There are all sorts of high-powered sports cars included in the sample. Since there is no chance you and your significant other will be buying a high-powered sports car, you decide that something must be done. You've heard about rejection sampling and decide to resample the data, rejecting automobiles that you have no interest in purchasing.
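One quick way to peek at the surprise (the 200-horsepower cutoff below is my own illustrative threshold, not one taken from the analysis):

```r
# A little EDA: how much horsepower is in this "sample" of family cars?
summary(mtcars$hp)
mtcars[mtcars$hp > 200, c("mpg", "hp", "am")]   # the sports cars and big luxury cars show up here
```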

Before you get on to the rejection sampling, you remember from page 7-10 of your textbook that there is a way to calculate the sample size needed to detect a given difference (the effect size) between manual and automatic transmissions. It starts from the confidence interval

$$\bar{X} - z_{\alpha/2}\,\frac{s}{\sqrt{n}} \;<\; \mu \;<\; \bar{X} + z_{\alpha/2}\,\frac{s}{\sqrt{n}},$$

whose maximum error of estimate is $E = z_{\alpha/2}\, s/\sqrt{n}$. Solving for the sample size gives

$$n = \left(\frac{z_{\alpha/2}\, s}{E}\right)^{2},$$

where $\bar{X}$ is the observed (sample mean) MPG, the Greek letter $\mu$ is the population value of MPG, $E$ is the maximum error of estimate from the confidence interval that brackets the population value, $z_{\alpha/2}$ is the percentage point from the quantile function of the normal distribution for a two-tailed test (here $\alpha = 0.05$, so $\alpha/2 = 0.025$), $s$ is an estimate of the population standard deviation, and dividing $s$ by the square root of $n$ turns that standard deviation into a standard error.

Your first step in using the formula is to think a bit about miles per gallon (MPG). For automobiles, it probably ranges from about 5 to a max of 40 miles per gallon, at least for the time your sample was taken in the early 1970's. You would like to be able to detect a 5 MPG difference, so you estimate the standard deviation to be approximately s = (40 - 5)/6 = 5.8 and choose a conventional 0.05 probability of making a Type I error, with E = 5. The calculation tells you that you'll need 8 observations to construct a confidence interval around the estimated effect of automatic vs. manual transmission on MPG (if you verify the calculations by hand with z = 1.96 you get n = 5; because of the small sample size, n < 30, percentage points were actually taken from a t-distribution to arrive at n = 8). The sample size seems somewhat small, but you proceed with the rejection sampling in any event.
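Plugged into R, the calculation might look like this minimal sketch (the degrees of freedom for the t quantile are my assumption; the post does not say which value was used):

```r
s     <- (40 - 5) / 6    # rough standard deviation from the plausible range of MPG
E     <- 5               # maximum error of estimate, in MPG
alpha <- 0.05

(qnorm(1 - alpha / 2) * s / E)^2         # about 5.2, the "n = 5" figure from the hand calculation
(qt(1 - alpha / 2, df = 7) * s / E)^2    # about 7.6, which rounds up to the n = 8 used in the text
```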

The new sample contains nine automobiles. The overall F-test for the regression model mpg ~ am (in R syntax, where "am" is automatic-or-manual) is not significant (p-value = 0.4332), and you wonder whether you have enough evidence to get your automatic transmission.
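A sketch of that fit follows. The post does not list the nine rejection-sampled cars, so the cars9 data frame below is a stand-in built from an illustrative low-horsepower cutoff; it will not reproduce the p-value exactly:

```r
# Stand-in for the nine rejection-sampled automobiles (illustrative subset only)
cars9 <- mtcars[mtcars$hp < 100, ]

fit9 <- lm(mpg ~ am, data = cars9)
summary(fit9)    # the overall F-test and its p-value appear at the bottom of the summary
```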

When you present the result to your significant other, it is met with a lot of skepticism. Taking a closer look at the rejection sample, you find that you have only one automatic-transmission automobile in it. Your significant other asks you to get a few more automatic-transmission automobiles into the sample, so you add a Hudson Hornet 4 Drive, a Hornet Sportabout, and a Plymouth Valiant (really!) to the sample. You rerun the analysis and this time get a significant difference, p-value = 0.04918, with the stick shift getting about 6 MPG more than the automatic transmission (estimated difference 6.35 MPG, confidence interval 0.17 to 12.5), which satisfies your criterion for an important difference in mileage.
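Continuing the sketch (the mtcars row names for those three automatics are "Hornet 4 Drive", "Hornet Sportabout", and "Valiant"; cars9 is the same stand-in data frame as above):

```r
# Add the three automatic-transmission cars named in the text and refit
extras <- mtcars[c("Hornet 4 Drive", "Hornet Sportabout", "Valiant"), ]
cars12 <- rbind(cars9, extras)

fit12 <- lm(mpg ~ am, data = cars12)
summary(fit12)    # p-value for the transmission effect
confint(fit12)    # the "am" row is the interval for the manual vs. automatic MPG difference
```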

What does it all mean? From my experience as a consulting statistician, this is pretty much how data analysis is conducted in the personal computer era. It might even go a step further by remembering to conduct an initial power analysis to determine sample size. The only lesson in this fabricated example is that if you get an insignificant result, it does not mean, within the NHT paradigm, that you can accept the null hypothesis. There is always some sample size for which you will be able to find a difference, because as sample size increases the standard error decreases and your ability to detect a difference increases. If you had a firm idea of what constitutes an important difference, you would know when to stop. But you seldom have a firm position on effect size (maybe a 2 MPG difference would be important over the life of the automobile if gasoline prices went up to $5 per gallon).
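To make the sample-size effect concrete, here is a short sketch with base R's power.t.test, using the 2 MPG difference and the s = 5.8 standard deviation from the discussion above (the particular sample sizes are just illustrative):

```r
# Power to detect a 2 MPG difference between transmissions at various per-group sample sizes
sapply(c(10, 25, 50, 100, 250),
       function(n) power.t.test(n = n, delta = 2, sd = 5.8, sig.level = 0.05)$power)
# power climbs from roughly 0.1 toward 1 as n grows, so a large enough study
# will eventually declare almost any nonzero difference "significant"
```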

This well-known result (that increasing sample size increases the chance of finding significant differences), coupled with the unwillingness of journals to publish negative results, should make everyone a little queasy. Had you not found a significant result with a sample of 11, you could have gone out and obtained more data until a significant difference was found. Your idea about accepting the null is always subverted by increasing sample size. On the other hand, your significant other can take little joy in a significant result obtained under small-sample conditions, since the result may well have been a fluke (especially given the type of automobiles added to the sample).

In a previous post on classical hypothesis testing (here) I mentioned the issue of statistical power analysis. Power is the probability that a study will detect a difference that is actually there. Think of it as analogous to the power of a microscope. If the microscope is underpowered, you won't see any microbes. If the microscope is overpowered, you won't see any microbes either, because you've descended to the atomic level.

The person who arguably did the most to bring power analysis to the attention of researchers was Jacob Cohen (1923-1998). In a 1992 piece titled A Power Primer, Cohen stated the problem very clearly: the power of typical small-sample studies was so low that the "...chance of obtaining a significant result was about that of tossing a head with a fair coin." This state of affairs threatens the entire edifice of classical null hypothesis testing (NHT).

In his 1992 piece, Cohen suggests that the reason might be that power calculations are too difficult. That surely can't be true in the 21st century, since there is a full range of univariate power procedures in the R pwr package (here). There are deeper reasons, which I will discuss in future posts about confidence intervals, Bayesian inference and multimodel inference. In the alternative approaches to NHT, power issues seem less important but are lurking in the background in any event. Steven Goodman, in a series of papers (here), has probably done the most to bring NHT problems to the attention of the research community. He suggests the use of Bayes factors.
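As a minimal example of how easy the calculations have become, here is a sketch with the pwr package, using Cohen's conventional "medium" effect size of d = 0.5 and the customary 0.80 power target (both are general conventions, not numbers from this post):

```r
# install.packages("pwr")   # if the package is not already installed
library(pwr)

# Per-group sample size for a two-sample t-test to detect a medium effect (d = 0.5)
# with 80% power at the 0.05 significance level
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80, type = "two.sample")
# reports n of roughly 64 per group
```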

In future posts, I'll go back to the beginning of the NHT diagram above and work through the steps a little more slowly now that we are sensitized to the problems in getting to the final destination.

NOTE: I learned statistical power analysis from G. William Walster when he was at the University of Wisconsin in the 1970's. Neither Bill nor I had either a Hudson Hornet or a Plymouth Valiant at the time. He and Marietta Tretter did the initial work on computing exact noncentral multivariate distributions (here). The difficulties in doing these calculations on computers involved finite word length and error accumulation (getting rounded off). These numerical problems led Bill to do work on interval arithmetic for Sun Microsystems (rather than being rounded off, you would know the interval in which your results had to be, see more discussion here). Multivariate power calculations still present challenging numerical problems.
