Showing posts with label Model Selection. Show all posts

Saturday, January 28, 2012

How Complicated Models Broke the NHT Paradigm

In a prior post (here) I pointed out the weaknesses of the Null Hypothesis Testing (NHT) paradigm using a simple data set on automobile mileage (here). What I demonstrated was that if you arrive at the end of the NHT process (described here) with an insignificant result, you can always try to increase the sample size (if that is possible) until even the small difference you observed becomes statistically significant. The ability to produce significant results by manipulating effect size and sample size should make everyone a little uneasy about the body of scientific work that has been generated under the NHT paradigm. In this post, I will return to the beginning of the NHT process and ask what types of theories generate the research hypotheses on which NHT is based. The conclusion is that if theories always generate "injection models," then all the weaknesses of NHT apply. However, if theories generate multiple competing models, rather than research and null hypotheses, there is a way out of the NHT dead end.

At first, it might seem that there could not possibly be a uniform class of theories generating research hypotheses, given the breadth and depth of scientific theories in current use. If we classify the theories most used in classical NHT by the types of models involved, however, we find a general class of "injection models" that is well suited to the NHT paradigm.

By a model I mean a structural equation model and its associated directed graph. For example, the model underlying the research hypothesis about automobile miles per gallon (MPG) is displayed in the path diagram above. The independent variable we focused on was the type of transmission, but obviously there are many other factors that go into the determination of MPG. Models of this type are easily described by a general linear model and can often be tested by simple linear regression. They are all "injection models" because the effects of the independent variables are "injected" into the dependent variable. The null hypothesis is always that there is no injection.
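As a concrete sketch of an injection model, the following simulation generates MPG from a transmission dummy and recovers the injected effect by ordinary least squares. The coefficients are made up for illustration, not estimated from the actual mileage data set:

```python
import random
import statistics

random.seed(1)

# Hypothetical "true" values: manual cars average 24 MPG and an
# automatic transmission "injects" a -7 MPG effect. These numbers
# are invented for illustration only.
b0, b1 = 24.0, -7.0

x = [i % 2 for i in range(200)]  # 0 = manual, 1 = automatic
y = [b0 + b1 * xi + random.gauss(0, 2) for xi in x]

# Ordinary least squares via the closed-form simple-regression formulas.
xbar, ybar = statistics.mean(x), statistics.mean(y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
b1_hat = sxy / sxx             # estimated injection effect, close to -7
b0_hat = ybar - b1_hat * xbar  # estimated baseline, close to 24
```

The null hypothesis of "no injection" is simply b1 = 0; the entire NHT apparatus exists to decide whether b1_hat is far enough from zero.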


There is a problem, however, with the causal implications of regression models expressed in mathematical terms. The first equation above is the standard regression equation. If y = MPG and x = 0 or 1 depending on whether the car has an automatic transmission, then we have an injection model for automobile mileage. Mathematical objects, however, are symmetric, and nothing prevents us from re-expressing the model to predict the type of transmission from MPG. While predicting the type of transmission from MPG might be an interesting exercise, we know a priori that mileage doesn't "cause" a car to have a particular type of transmission. We need something beyond mathematics to describe causal relationships.
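The symmetry can be made concrete: the least-squares slope of y on x and the slope of x on y are both perfectly legitimate mathematical objects, and their product is just the squared correlation. A short simulation (with an arbitrary, made-up linear dependence) shows that nothing in the algebra distinguishes the two directions:

```python
import random

random.seed(2)

# Arbitrary made-up linear dependence between x and y.
x = [random.gauss(0, 1) for _ in range(1000)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)
syy = sum((b - ybar) ** 2 for b in y)

slope_y_on_x = sxy / sxx  # the "y caused by x" direction
slope_x_on_y = sxy / syy  # the reversed regression, equally valid algebraically
r_squared = sxy ** 2 / (sxx * syy)
# The two slopes multiply to exactly r^2: the algebra is direction-blind,
# so causal direction must come from outside the mathematics.
```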

Causal diagrams make the direction of causality clear from the direction of the arrow connecting two variables. The importance of path diagrams for encoding causal information was discovered by the geneticist Sewall Wright in the 1920s. Wright retired to the University of Wisconsin in 1955, where he had an influence on economists and sociologists.
With path diagrams it was relatively easy to develop complicated causal models that do not lend themselves to restatement as a simple null hypothesis. For example, in the path diagram above (from Pedhazur and Kerlinger, 1982 and Karl Wuensch, here), socioeconomic status (SES) is taken as an exogenous variable that influences IQ, need achievement (nAch) and the resulting grade point average (GPA). The variables at the receiving end of causal arrows (IQ, nAch and GPA) are considered endogenous. It is important to this theory of grade point determination that IQ and nAch mediate the influence of SES on GPA. That is, you will find lower-SES students with high GPAs as a result of their IQ and their need for achievement. You will also find some high-SES students with neither high IQ nor high need achievement, but they will probably still have better GPAs than low-SES students.
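Wright's method amounts to running one regression per endogenous variable. The sketch below simulates data from a hypothetical version of the SES/IQ/nAch/GPA diagram (the path coefficients are invented for illustration, not Pedhazur and Kerlinger's estimates) and recovers them:

```python
import random

random.seed(3)

def ols(rows, y):
    """Least squares via the normal equations (Gaussian elimination)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    for i in range(k):                      # forward elimination
        for j in range(i + 1, k):
            f = xtx[j][i] / xtx[i][i]
            xtx[j] = [a - f * b for a, b in zip(xtx[j], xtx[i])]
            xty[j] -= f * xty[i]
    beta = [0.0] * k                        # back substitution
    for i in range(k - 1, -1, -1):
        beta[i] = (xty[i] - sum(xtx[i][j] * beta[j]
                                for j in range(i + 1, k))) / xtx[i][i]
    return beta

# Hypothetical path coefficients: SES -> IQ (0.4), SES -> nAch (0.3),
# IQ -> nAch (0.2), and SES, IQ, nAch -> GPA (0.1, 0.5, 0.4).
n = 5000
ses  = [random.gauss(0, 1) for _ in range(n)]
iq   = [0.4 * s + random.gauss(0, 1) for s in ses]
nach = [0.3 * s + 0.2 * q + random.gauss(0, 1) for s, q in zip(ses, iq)]
gpa  = [0.1 * s + 0.5 * q + 0.4 * a + random.gauss(0, 1)
        for s, q, a in zip(ses, iq, nach)]

# One regression per endogenous variable recovers the path coefficients.
p_iq   = ols([[s] for s in ses], iq)
p_nach = ols([[s, q] for s, q in zip(ses, iq)], nach)
p_gpa  = ols([[s, q, a] for s, q, a in zip(ses, iq, nach)], gpa)
```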

In the NHT paradigm, path models have to be converted into injection models as in the graphic above. In the "testable" NHT model, SES, IQ and nAch are all considered exogenous variables determining grade point. The simple null hypothesis would be that none of these factors have a statistically significant impact on GPA. The idea of mediating or intervening variables (IQ and nAch mediating the influence of SES on grade point) is dropped. The reformulated model above, however, is still not enough for hard-core experimentalists.

For the hard-core experimentalists, Pedhazur and Kerlinger have to give up any idea of drawing causal links from SES and IQ to GPA because these variables cannot be experimentally manipulated. It would not be possible to conduct an Eliza Doolittle experiment in which subjects were randomly assigned to have their SES manipulated so changes in GPA could be observed. It would also not be practical to manipulate IQ. Since it might be possible to manipulate nAch (written do_nAch in Judea Pearl's terminology), what the experimentalists require is that all extraneous factors be eliminated through random assignment. The fact that we are no longer testing Pedhazur and Kerlinger's theory is not an issue, because the theory is not really testable within the NHT paradigm.
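The difference between observing nAch and doing nAch can be illustrated by simulation (all coefficients here are hypothetical). When nAch is generated by SES, the regression slope of GPA on nAch picks up the SES back-door path; when nAch is set by random assignment, the slope recovers the assumed causal effect:

```python
import random
import statistics

random.seed(4)

def slope(x, y):
    """Simple least-squares regression slope of y on x."""
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    return (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
            / sum((a - xbar) ** 2 for a in x))

n = 20000
true_effect = 0.4  # hypothetical causal effect of nAch on GPA

# Observational world: SES drives both nAch and GPA, confounding the slope.
ses = [random.gauss(0, 1) for _ in range(n)]
nach_obs = [0.8 * s + random.gauss(0, 1) for s in ses]
gpa_obs = [0.5 * s + true_effect * a + random.gauss(0, 1)
           for s, a in zip(ses, nach_obs)]

# Experimental world: do(nAch) severs the SES -> nAch arrow via randomization.
nach_do = [random.gauss(0, 1) for _ in range(n)]
gpa_do = [0.5 * s + true_effect * a + random.gauss(0, 1)
          for s, a in zip(ses, nach_do)]

biased = slope(nach_obs, gpa_obs)    # inflated by the SES back-door path
unbiased = slope(nach_do, gpa_do)    # close to the assumed causal effect
```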

Unfortunately, there is an essentially unlimited number of important research questions for which experimentation is neither possible nor desirable, global climate change being just one example (we simply cannot manipulate the world system). Instead, we are forced to study intact populations or intact systems using models. Path diagrams provide a useful and general (that is, nonparametric) way of describing causality based on an understanding of the system being studied. Contending models always exist in various stages of testing. NHT is well suited to the critical experiment where a hypothesis can be directly tested, but even here we never really test the research hypothesis, only the null alternative. The focus of the NHT statistician is on hypotheses, while the focus of scientists is on models.

Returning to Pedhazur and Kerlinger's path model, not only did the absence of experimental design generate howls of protest from the experimental and statistical scolds (here and here, basically "correlation cannot prove causation") but the approach to estimating path diagrams was also held up for criticism because technical statistical requirements (the assumptions of classical normal distribution theory) were being violated. I'll get into these issues in a future post. For now, what is important to understand is that multiple, competing causal models (rather than hypotheses or statistical distributions) are the important parts of science that always exist in various states of confirmation and acceptance. Statistical technique needs to be able to test models in addition to merely testing hypotheses. Multi-model development, testing and selection provides one way out of the NHT dead end.
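One concrete form of multi-model selection is comparison by an information criterion such as AIC, which rewards fit but penalizes extra parameters. The sketch below compares an intercept-only model with a straight-line model on made-up data (the models and coefficients are invented for illustration):

```python
import math
import random
import statistics

random.seed(5)

# Made-up data with a genuine linear relationship.
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]

def aic(rss, n, k):
    """Gaussian AIC up to an additive constant: n*ln(RSS/n) + 2k."""
    return n * math.log(rss / n) + 2 * k

# Model 1: intercept only (1 parameter).
ybar = statistics.mean(y)
rss1 = sum((yi - ybar) ** 2 for yi in y)

# Model 2: straight line (2 parameters).
xbar = statistics.mean(x)
b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
      / sum((a - xbar) ** 2 for a in x))
b0 = ybar - b1 * xbar
rss2 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

aic1, aic2 = aic(rss1, n, 1), aic(rss2, n, 2)
# The model with the lower AIC is preferred; here the line wins easily,
# and a third or fourth candidate model could be ranked the same way.
```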

Wednesday, January 11, 2012

Controversies Surrounding Classical Null Hypothesis Testing (NHT)

Classical null hypothesis testing (NHT) is the dominant paradigm for conducting scientific investigations. The focus in NHT is on stating and testing hypotheses prior to conducting research. This singular focus is both a strength and a weakness. In the first part of this post, I'll describe NHT using the diagram above (click to enlarge). In the second part of the post I'll catalog the long list of problems that have emerged over time with the application of NHT. The bottom line from experience with NHT would seem to be that if you have a precise hypothesis and if you meet all the requirements of the NHT approach, it is a powerful technique for advancing science. Problems enter when you don't have a narrow hypothesis (for example, you have a more complicated structural equation model you are testing) and when you don't meet all the requirements of the approach (a very common occurrence).

NHT was developed in the 1920s and 1930s by Ronald Fisher, Jerzy Neyman and Egon Pearson (the formal decision-theoretic version is often called Neyman-Pearson hypothesis testing) in response to the needs of agricultural experimentation. Over time, NHT was generalized to cover any well-formulated experimental design. The steps in NHT are:
  1. State The Research Hypothesis A scientific hypothesis is often defined as a proposed explanation for some phenomenon that can be tested. A hypothesis is not the same as a scientific theory, which is viewed as an internally consistent collection of hypotheses. An example of a scientific hypothesis being widely discussed is the Global Warming hypothesis that anthropogenic (human-induced) CO2 emissions will lead to an increase in global temperature. The GW hypothesis is interesting because it is not inherently testable by experimental manipulation and is controversial. A simpler hypothesis that could potentially be settled by a scientific experiment is that Yoga is superior to other forms of exercise and stretching for relieving lower back pain (this hypothesis is the subject of a current study here).
  2. Restate The Hypothesis in Null Form In NHT decision making, the research hypothesis is not directly tested. Instead, the researcher attempts to state and reject a null hypothesis. For example: CO2 emissions do not create global warming, or Yoga has no effect on reducing lower back pain. The null hypothesis restatement brings statistical analysis squarely into the research process. It provokes the question of how much of an increase in global temperature as a result of CO2 emissions would be necessary to reject the null hypothesis, or how much reduction in lower back pain as a result of Yoga would be necessary to reject the null hypothesis. The restatement is based on the assumption that it is easier to reject an assertion than to establish one as true. For example, the assertion that all swans are white can never be conclusively verified (you would have to observe every swan), but it would be refuted by the first black swan observed (see Black Swan Theory and knowledge of how financial crises develop).
  3. Identify Statistical Tests, Distributions and Assumptions The next step involves deciding how to test the null hypothesis. It is a large step and can involve a wide range of considerations. Typically, NHT chooses from a restricted range of predefined test statistics based on normal distribution theory. The idea that we can treat a statistic computed from a small sample as if it came from a known distribution rests on the central limit theorem (CLT). Under the CLT, it can be both proven and demonstrated (here) that the mean of any relatively large sample (say over 100 observations) of weirdly distributed values will be approximately normally distributed, and many common test statistics inherit this approximate normality. The CLT breaks down if the observations are correlated or if the sample was not drawn by a random process (a random sample). For example, did we take all our global temperature measurements in urban heat islands or did we randomly sample the entire world? Did we assign all the people with untreatable spinal conditions to the control group rather than the experimental group that attended Yoga classes?
  4. Compute Test Statistic(s) Now we are at the point of collecting data and computing test statistics. This brings up the issue of how many observations to collect. For example, critics of the GW hypothesis argue that we need thousands of years of data, not just data from the beginning of the Industrial Revolution, to determine the earth's natural climate cycles. It turns out that the power of a statistical test can be pushed as high as you like by choosing a large enough sample size. In other words, if you want to prove that Yoga is better than the control condition, just choose a large enough sample size and you are almost certain to find a statistically significant difference.
  5. Test Assumptions Once you have gathered the data and computed your test statistic(s), the issue remains of whether you have actually met the assumptions behind the test. Tests of assumptions cover the normality assumption (using, for example, the Shapiro-Wilk test) and the assumption of homogeneity of variance (using, for example, Levene's test). If you fail to meet the assumptions, you can switch to nonparametric statistical tests based either on rank transformations or on Monte Carlo techniques such as the bootstrap. The nonparametric techniques do not necessarily generalize to some idealized larger population distribution.
  6. Reject or Fail To Reject Null Hypothesis Whether you have met all the assumptions or switched to nonparametric tests, you are now at the point of either rejecting or failing to reject the null hypothesis. This is done by choosing some extreme cutoff in the null distribution (typically the point beyond which 5% or 1% of the distribution lies) and concluding that a result falling this far out in the critical region could not plausibly have resulted from chance alone.
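The central limit theorem invoked in step 3 is easy to demonstrate by simulation: sample means computed from a strongly skewed distribution are nevertheless approximately normal, with the spread shrinking as the square root of the sample size:

```python
import random
import statistics

random.seed(6)

# Draw 2000 samples of size 100 from a skewed (exponential) distribution
# with mean 1 and variance 1, and examine the distribution of sample means.
sample_means = [statistics.mean(random.expovariate(1.0) for _ in range(100))
                for _ in range(2000)]

m = statistics.mean(sample_means)
s = statistics.stdev(sample_means)

# CLT prediction: mean near 1, standard deviation near 1/sqrt(100) = 0.1,
# and roughly 95% of the sample means within two standard errors of 1.
inside = sum(abs(mm - 1.0) < 2 * 0.1 for mm in sample_means) / len(sample_means)
```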
Experience with NHT as the dominant paradigm in scientific investigation has demonstrated both its usefulness and a wide range of practical and theoretical problems in application. I will start with what I think are the most important criticisms and by no means exhaust the entire list (more here and here):
  • Data Snooping and Cherry Picking Much of the success of NHT depends on unstated assumptions about the experimenter and about how the experiment was conducted or the data were generated. When NHT was developed in the 1930s, it was not possible to accumulate large databases and mine the data for significant results. It was more realistic to construct a small experiment and follow the NHT rules. With the wide availability of computers and associated databases starting in the 1960s, data snooping became possible. Data dredging is a problem because it invalidates the NHT assumptions: a result may appear significant in a non-randomly collected sample yet not hold in the population. Cherry picking is the (sometimes unintentional) process of selecting cases that prove your point, again through non-random sampling. Researchers testing the GW hypothesis have specifically been criticized for allegedly cherry picking results and using biased statistical techniques (the hockey stick controversy).
  • P-value Fixation A minimum criterion for publication in scientific journals is commonly taken as a p-value of either p ≤ 0.05 or p ≤ 0.01, weak and strong evidence respectively. Unfortunately, the p-value says nothing about the size of the observed effect. In an excellent article by Steve Goodman (here), it is pointed out that "A small effect in a study with large sample size can have the same P value as a large effect in a small study." This problem leads into the difficult and contradictory area of statistical power analysis which I will take up in a future post.
  • Introducing Bias Into The Scientific Literature Because there are many ways to game the NHT approach, because academic journals will not publish null results and because scientific careers are based on publishing, the scientific literature itself is subject to publication bias. Since readers can never know precisely how researchers generated their data, there is always a question about individual findings. The solution to the problem involves replication, that is, another independent researcher confirming a set of findings. Unfortunately, journals seldom publish replication studies and some studies are so expensive to conduct that replication is infeasible.
  • Theories and Hypotheses vs. Models The idea that theories are collections of hypotheses waiting to be tested is a very Victorian view of science (think Sherlock Holmes) and may not really be appropriate to current scientific activity. Rather than focusing on verbal constructs (theories and hypotheses), current scientific activity is directed more to developing causal models of the real world. For example, rather than narrowly testing the GW hypothesis, scientists are more concerned with developing global climate models sometimes called General Circulation Models (GCMs). Science evolves from simple, easily understood models to more complicated representations based on evolving knowledge and the ability of new models to outperform old models (the development of weather forecasting models, which have obviously improved our ability to predict local weather, is a good example). Model development has its own series of problems and doesn't guarantee that complex models are always better than simple models (see for example the experience with econometric models in the late-2000s financial crisis). However, as in the GW controversy, the onus now is on critics to propose better models rather than focus narrowly on individual (possibly cherry picked) hypothesis tests.
  • Theoretical Deficiencies Critics have argued that NHT is incoherent theoretically. For example, from the perspective of Bayesian inference, we are never starting from the blank slate implied by NHT. We have some reason for performing the study, and the reason is usually that we believe in the research hypothesis, not the null hypothesis. Bayesian inference, which I will discuss in a later post, explicitly takes prior expectations about the research hypothesis into account and also lends itself to the comparison of multiple models (multi-model selection) rather than NHT.
  • Artificial Limitations on the Advance of Knowledge Under NHT, scientific knowledge advances through experimentation. Unfortunately, much useful scientific knowledge has been developed without experimentation and where experimentation is impossible. Scientific knowledge of the global climate system has developed rapidly in the last few decades without the possibility of experimental manipulation, random assignment or random sampling. Advances have resulted from focusing on model building rather than hypothesis testing.
  • Failures in Practice NHT involves many essentially complicated steps that are often ignored in practice. Typically, statistical power is not controlled and assumptions are seldom tested. What effect these omissions might have on the quality of scientific results is generally unknown and possibly unknowable.
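Goodman's point about P values and sample size (and the sample-size manipulation described in step 4 above) can be checked directly: the same small fixed effect yields an unimpressive P value at a small sample size and a vanishingly small one at a large sample size. A sketch using a simple one-sample z test on simulated data:

```python
import math
import random
import statistics

random.seed(7)

def two_sided_p(z):
    """Two-sided P value under a standard normal, via the error function."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def p_value(n, effect=0.1):
    """P value of a one-sample z test when the true effect is small but real."""
    data = [random.gauss(effect, 1.0) for _ in range(n)]
    z = statistics.mean(data) / (1.0 / math.sqrt(n))
    return two_sided_p(z)

p_small = p_value(50)     # tiny effect, small n: usually "not significant"
p_large = p_value(20000)  # same tiny effect, huge n: overwhelmingly "significant"
```

The effect size (0.1 standard deviations) never changes; only the sample size does, which is exactly why a P value alone says nothing about how large or important an effect is.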
The lessons from experience with NHT should be pretty clear: If you have a simple hypothesis that can be subjected to experimental manipulation and if you follow all the NHT rules, NHT can be an effective approach to advancing scientific knowledge. If you don't, there are other approaches. In future posts I will discuss statistical power analysis, Bayesian inference and multi-model selection.