Saturday, March 10, 2012

Six Useful Classes of Linear Models

Statistics (or Sadistics, as my students like to call it) is a difficult subject to teach, with many seemingly unrelated and obscure concepts, many of which are listed under "Background Links" in the right-hand column. Compare Statistics to Calculus, another difficult topic for students. In Calculus there are really only three concepts: the differential, the integral, and the limit. Statisticians could only dream of such simplicity and elegance (have you finished counting all the concepts listed on the right?).

One way that has helped me bring some order to this chaos is to concentrate on the classes of models typically used by statisticians. The six major classes are displayed above in matrix notation (if you need a review of matrix algebra, try the one developed by George Bebis here). Clearly understanding each of these common linear models will also help in classifying Hierarchical Linear Models.
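(If the graphic does not display for you, here is a plausible LaTeX reconstruction of the six models, based on the descriptions that follow; the notation in the original image may differ slightly.)

```latex
\begin{align*}
Y_i &= E_i                      && \text{(1) pure distribution model} \\
Y_i &= X_i B + E_i              && \text{(2) classical linear model} \\
Y_i &= X_i B + Z_i A + E_i      && \text{(3) random effects model} \\
y_i &= Y_i B + X_i A + e_i      && \text{(4) two-stage least squares} \\
Y_i &= Y_i B + X_i A + E_i      && \text{(5) path analytic model} \\
Y_t &= Y_{t-1} B + X_t A + E_t  && \text{(6) time series model}
\end{align*}
```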

Equation (1) states simply that some matrix of dependent variables, Y, is generated by some multivariate distribution, E. The letter E is chosen because this distribution will eventually become an error distribution. The subscript i denotes that there can be a separate model for each of several different populations, a really important distinction.
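As a quick illustration of model (1), here is a minimal Python sketch (mine, not from the original graphic) that draws Y directly from a multivariate normal distribution; all the numerical values are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Model (1): Y = E, i.e., the dependent variables are nothing more than
# draws from some multivariate distribution. Here E is taken to be
# multivariate normal; the mean vector and covariance matrix are
# illustrative assumptions (an IQ-like and a GPA-like variable).
mean = np.array([100.0, 3.0])
cov = np.array([[225.0, 4.5],
                [4.5, 0.25]])

n = 500                                   # observations in one population i
Y = rng.multivariate_normal(mean, cov, size=n)

print(Y.mean(axis=0))                     # should be near [100, 3]
print(np.cov(Y, rowvar=False))            # should be near cov
```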

Before leaving model (1) it is important to emphasize that we believe our dependent variables have some distribution. Consider the well-known distribution of Intelligence (IQ) displayed above (the "career potential" line in the graphic is a little hard to take; I know many police officers with higher intelligence than lawyers, but at least the graphic didn't use the "moron," "imbecile," and "idiot" classifications). It just so happens that the normal distribution that describes IQ is the same distribution that describes E in many cases. The point is that most dependent variables have a unimodal distribution that looks roughly bell-shaped. And, there are some interesting ways to determine a reasonable distribution for your dependent variable that I will cover in future posts.

Equation (2) is the classical linear model where X is a matrix of fixed-effect, independent variables and B is a matrix of unknown coefficients that must be estimated. There are many ways to estimate B, adding further complexity to statistics, but least squares is the standard approach (I'll cover the different approaches in future posts). Notice also that once B is estimated, E = Y - XB by simple subtraction. The important point to emphasize here, a point that is often ignored in practice, is that X is fixed either by the researcher or is fixed under sampling. For example, if we think that IQ might affect college grade point average (GPA), as in the model introduced here, then we have to sample equal numbers of observations across the entire range of IQ, from below 70 to above 130. Of course, that doesn't make sense since people with low IQ do not usually get into college. The result of practical sampling restrictions, however, might be to change the shape of the IQ distribution, creating another set of problems. Another way that the independent variables can be fixed is through assignment. When patients are assigned to experimental and control groups in clinical trials, for example, each group is given a [0,1] dummy variable classification (one if you are in the group, zero if not). Basically, the columns of X must each contain either continuous variables such as IQ or dummy variables for group assignments.
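Here is a minimal Python sketch of model (2), assuming simulated data and illustrative coefficient values, that estimates B by least squares and recovers E by subtraction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate model (2): Y = X B + E, with X fixed by the researcher.
n = 200
X = np.column_stack([np.ones(n),                     # intercept column
                     rng.uniform(70, 130, size=n)])  # IQ fixed under sampling
B_true = np.array([1.0, 0.02])                       # assumed coefficients
E = rng.normal(0.0, 0.3, size=n)                     # normal error distribution
Y = X @ B_true + E

# Least squares estimate of B, then E recovered by simple subtraction.
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
E_hat = Y - X @ B_hat
print(B_hat, E_hat.std())
```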

Equations (3-6) start lifting the restriction on fixed independent variables. In Equation (3), the random (non-fixed) variables are contained in the matrix Z with its own set of coefficients, A, that have to be estimated. Along with the coefficients, random variables have variances that must also be estimated; thus, model (3) is called the random effects model. Typically, hierarchical linear models are lumped in with random effects models, but at the end of this post and in future posts I will argue that the two types of models should be thought of separately.
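To make the structure of model (3) concrete, the following Python sketch simulates data with random group-level coefficients in A; the group structure, coefficients, and variances are all assumptions chosen for illustration (an actual analysis would hand the data to a mixed-model routine):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate model (3): Y = X B + Z A + E, where the coefficients in A are
# random draws whose variance must be estimated along with B.
n, groups = 300, 6
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed-effect design
B = np.array([2.0, 0.5])                                # assumed fixed coefficients

g = rng.integers(0, groups, size=n)                     # group membership
Z = (g[:, None] == np.arange(groups)).astype(float)     # random-effect dummies
sigma_a = 1.5                                           # assumed random-effect std dev
A = rng.normal(0.0, sigma_a, size=groups)               # random coefficients

Y = X @ B + Z @ A + rng.normal(0.0, 0.3, size=n)
# A mixed-model routine (e.g., statsmodels' MixedLM) would estimate B,
# sigma_a, and the error variance from (Y, X, g).
print(Y[:5])
```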

Equation (4) is the two-stage least squares model often used for simultaneous equation estimation in economics. Here, one of the dependent variables, y, is singled out as being of primary interest even though there are other endogenous variables, Y, that are simultaneously related to y. Model (4) describes structural equation models (SEMs, introduced in a previous post here) as typically used in economics. The variables in X are considered exogenous or predetermined variables, meaning that any predictions from model (4) are generally restricted to the available values of X. Again, there are a number of specialized estimation techniques for the two-stage least squares model.
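The two-stage procedure itself is simple enough to sketch in a few lines of Python; the simulated data below are assumptions constructed so that y's regressor is endogenous:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two-stage least squares, model (4): y depends on an endogenous
# regressor Y2 that shares a shock u with the error; X is exogenous.
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # exogenous variables
u = rng.normal(size=n)                                  # common shock -> endogeneity
Y2 = X @ np.array([1.0, 2.0]) + u + rng.normal(size=n)  # endogenous variable
y = 0.5 * Y2 + 1.0 + u                                  # structural equation

# Stage 1: regress the endogenous variable on the exogenous X.
g1, *_ = np.linalg.lstsq(X, Y2, rcond=None)
Y2_hat = X @ g1

# Stage 2: replace Y2 with its fitted values and run least squares again.
W = np.column_stack([np.ones(n), Y2_hat])
b2sls, *_ = np.linalg.lstsq(W, y, rcond=None)
print(b2sls)   # second entry should be near the true 0.5
```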

Equation (5) is the path analytic model used in both Sociology and Economics. In the path analytic SEM, all the Y variables are considered endogenous. Both model (4) and model (5) bring up the problem of identification, which I will cover in a future post. Again, there are separate techniques for the estimation of parameters in model (5). There is also a whole range of controversies surrounding SEMs and causality that I will discuss in future posts. For the time being, it is important to point out that model (5) starts to treat the unit of analysis as a system that can be described with linear equations.
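One step worth making explicit is how a model with Y on both sides can be solved at all. Using the notation of the reconstruction above (and assuming (I - B) is invertible, which is part of what the identification problem is about), the reduced form is:

```latex
Y = Y B + X A + E
\quad\Longrightarrow\quad
Y (I - B) = X A + E
\quad\Longrightarrow\quad
Y = (X A + E)(I - B)^{-1}
```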

Finally, Equation (6) describes the basic time series model as an n-th order difference equation (only the first lag, t-1, is displayed above). The Y matrix, in this case, has a time ordering (t and t-1), where the Y's are considered output variables and the X's are considered input variables. Time series models introduce a specific type of causality called Granger causality, named for the economist Clive Granger. In model (6) the Y(t-1) variables are considered fixed since values of variables in the past have already been determined. However, the errors in E can be correlated over time, introducing the concept of autocorrelation. Also, since time series variables tend to move together over time (think of population growth and the growth of Gross National Product), time series causality can be confounded by cointegration, a problem that was also studied extensively by Clive Granger.
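To see autocorrelation at work, here is a small Python sketch of a first-order version of model (6); the parameter values are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate a first-order version of model (6): Y_t = Y_{t-1} b + X_t a + E_t,
# with autocorrelated errors E_t = rho * E_{t-1} + noise.
T, b, a, rho = 200, 0.8, 0.5, 0.6        # assumed parameter values
x = rng.normal(size=T)                    # input (exogenous) variable
Y = np.zeros(T)
E = np.zeros(T)
for t in range(1, T):
    E[t] = rho * E[t - 1] + rng.normal(scale=0.3)
    Y[t] = b * Y[t - 1] + a * x[t] + E[t]

# Lag-1 autocorrelation of the errors: the quantity that autocorrelation
# diagnostics (e.g., Durbin-Watson) are designed to detect.
print(np.corrcoef(E[1:], E[:-1])[0, 1])
```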

Models (1-6) don't exhaust all the possible models used in Statistics, but they do help classify the vast majority of problems currently being studied, fix ideas, and suggest topics for future posts. In the next post, my intention is to start with model (1) and describe the different types of distributions one typically finds in E, along with some interesting approaches to determining the probable distribution of your data. These approaches differ from the typical null-hypothesis testing (NHT) approach (here), where you are required to test whether your data are normally distributed (Step 3, Testing Assumptions) before proceeding with conventional statistical tests (a topic I will also cover in a future post).

Finally, returning to the problem of hierarchical linear models (HLMs), the subscript i was introduced in models (1-6) because there is a hierarchical form of each model, not just of model (3). The essential idea in HLMs (see my earlier post here) is that the unit of analysis is a hierarchical system with separate subpopulations nested within larger units (e.g., students within classrooms, patients within hospitals, regions within countries). Aspects of the subpopulations can be described by any of the models above. The area of multi-model inference (MMI) provides a more flexible approach to HLMs and solves the problems introduced by Simpson's paradox (described in an earlier post here).

MMI can be used successfully to determine an appropriate distribution for your data and to help avoid the dead end of NHT assumption testing. I'll cover the MMI approach to Equation (1) in the next post. The important takeaway from this post is that determining the class or classes of models appropriate to your research question will help determine the appropriate type of analysis and the types of assumptions that have to be made to complete the analysis.