
Sunday, September 16, 2012

The 'hlmmc' package

From July 1 through August 15, I taught Sociology S534 Advanced Sociological Analysis: Hierarchical Linear Models at the University of Tennessee in Knoxville. A syllabus for the course is available here. Over the next few months I will serialize the lectures and a user guide for the software package ('hlmmc') used in the course. 'hlmmc' is a software package written in the R programming language for studying Hierarchical Linear Models (HLMs) using Monte Carlo simulation. A manual for the software package is available here. It explains how to download the software and how each function in the package can be used (also, see NOTE below).

The approach I used in the course was to write out the equations for each linear model as a function in the R programming language. The functions were designed so that data for the study could be simulated for each observation in the sample. This is a useful exercise because it forces you to confront a number of difficult issues. What statistical distributions will be used to simulate not only the error terms but also the independent variables? Typically, a normal distribution is chosen for the error terms and the distributions of the independent variables are ignored in planning a study. Monte Carlo simulation forces the researcher to confront these issues in addition to explicitly stating the functional form of the model, the actual parameters in the model, and a likely range of parameter values that might be encountered in practice.
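To make the idea concrete, here is a minimal sketch (my own illustration, not code from the 'hlmmc' package) of writing a linear model's equation as an R function that simulates one data set per call. Notice that distributions must be chosen for both the independent variable and the error term, exactly the decisions described above; all parameter values are illustrative.

```r
# Write the model equation out as a simulation function.
# Distributions for X and E must be stated explicitly.
simulate.linear <- function(n, b0 = 1, b1 = 0.5, sd.e = 1) {
  x <- rnorm(n, mean = 100, sd = 15)   # assumed distribution for X
  e <- rnorm(n, mean = 0, sd = sd.e)   # normal errors, by convention
  y <- b0 + b1 * x + e                 # the model equation, written out
  data.frame(x = x, y = y)
}

set.seed(123)
d <- simulate.linear(1000)
b.hat <- coef(lm(y ~ x, data = d))   # estimates should be near b0 = 1, b1 = 0.5
b.hat
```

Because the simulation frees us from any particular sample, the function can be called repeatedly to study how well the estimators recover the known parameters.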

In the next posting I'll start with a very simple hierarchical data set and develop an R function for simulating data from a model that could be used to generate similar data. The Monte Carlo function frees us from the sample data and lets us explore more general issues raised by the model. The pitch to my students was that the Monte Carlo approach would provide a deeper understanding of the issues raised by hierarchical models.
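As a preview of that next posting, a hedged sketch of simulating two-level hierarchical data follows: level-1 units (say, students) nested in level-2 groups (say, classrooms), where each classroom receives its own random intercept. The function and parameter names are illustrative, not taken from the course materials.

```r
# Two-level hierarchical simulation: J groups, n.per observations each.
simulate.hlm <- function(J = 20, n.per = 30,
                         gamma0 = 50, gamma1 = 2,
                         sd.u = 5, sd.e = 10) {
  group <- rep(1:J, each = n.per)
  u <- rnorm(J, 0, sd.u)                 # group-level random intercepts
  x <- rnorm(J * n.per)                  # individual-level predictor
  e <- rnorm(J * n.per, 0, sd.e)         # individual-level errors
  y <- gamma0 + u[group] + gamma1 * x + e
  data.frame(group = group, x = x, y = y)
}

set.seed(42)
d <- simulate.hlm()
# Between-group variation should be visible in the group means:
group.means <- tapply(d$y, d$group, mean)
```

Varying sd.u relative to sd.e lets you explore how much of the total variation lives at each level of the hierarchy.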

The basic feedback from students was that, in the short time available for the class, they were able to get a deeper understanding of HLMs than if the class had been structured as a conventional statistics class (math lectures and exercises?). One student commented in the course evaluations that s/he "...wished all stat classes were taught this way." For better or worse, my classes always have been.

If you are interested in the topic and want to follow the serialization in future postings, I will try to follow my lectures, which were live programming demonstrations, as closely as possible, keeping in mind some of the questions and problems my students had when confronting the material.

NOTE: Assuming you have R installed on your machine and you have downloaded the files LibraryLoad.R and HLMprocedures.R to some folder on your local machine, you can load the libraries and procedures using the following commands in R:

> setwd(W)
> source(file="LibraryLoad.R")
> source(file="HLMprocedures.R")

where > is the R prompt and W is the absolute path (in quotes) to the working directory where you downloaded the files LibraryLoad.R and HLMprocedures.R. The Macintosh and Windows versions of R have a 'Source' selection in the File menu that will also allow you to navigate to the folder where these two files have been downloaded. When you source the LibraryLoad.R file, error messages will be produced if you have not installed the supporting libraries listed in the manual (here).

Saturday, March 10, 2012

Six Useful Classes of Linear Models

Statistics (or Sadistics, as my students like to call it) is a difficult subject to teach, with many seemingly unrelated and obscure concepts, many of which are listed under "Background Links" in the right-hand column. Compare Statistics to Calculus, another difficult topic for students. In Calculus there are only three concepts: the differential, the integral, and the limit. Statisticians could only dream of such simplicity and elegance (have you finished counting all the concepts listed on the right?).

One way that has helped me bring some order to this chaos is to concentrate on the classes of models typically used by statisticians. The six major classes are displayed above in matrix notation (if you need a review on matrix algebra, try the one developed by George Bebis here). Clearly understanding each of the common linear models will also help classify Hierarchical Linear Models.
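The six equations appeared in a graphic that does not survive here. Based on the descriptions that follow, a plausible rendering in matrix notation (my reconstruction, not the original image; the coefficient matrix G for the endogenous variables in models 4 and 5 is notation I have introduced) is:

```latex
\begin{align}
Y_i &= E_i                              && \text{(1) distribution only} \\
Y_i &= X_i B_i + E_i                    && \text{(2) classical linear model} \\
Y_i &= X_i B_i + Z_i A_i + E_i          && \text{(3) random effects} \\
y_i &= Y_i G_i + X_i B_i + E_i          && \text{(4) two-stage least squares} \\
Y_i &= Y_i G_i + X_i B_i + E_i          && \text{(5) path analytic} \\
Y_t &= Y_{t-1} A_i + X_t B_i + E_t      && \text{(6) time series}
\end{align}
```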

Equation (1) states simply that some matrix of dependent variables, Y, is the result of some multivariate distribution, E. The letter E is chosen since this distribution will eventually become an error distribution. The subscript, i, denotes that there can be multiple models for different populations--a really important distinction.

Before leaving model (1) it is important to emphasize that we believe our dependent variables have some distribution. Consider the well-known distribution of Intelligence (IQ) displayed above (the "career potential" line in the graphic is a little hard to take; I know many police officers with higher intelligence than lawyers, but at least the graphic didn't use the "moron," "imbecile," and "idiot" classifications). It just so happens that the normal distribution that describes IQ is the same distribution that describes E in many cases. The point is that most dependent variables have a unimodal distribution that looks roughly bell-shaped. And, there are some interesting ways to determine a reasonable distribution for your dependent variable that I will cover in future posts.

Equation (2) is the classical linear model where X is a matrix of fixed-effect, independent variables and B is a matrix of unknown coefficients that must be estimated. There are many ways to estimate B, adding further complexities to statistics, but least squares is the standard approach (I'll cover all the different approaches in future posts). Notice also that once B is estimated, E = Y - XB by simple subtraction. The important point to emphasize here, a point that is often ignored in practice, is that X is fixed either by the researcher or is fixed under sampling. For example, if we think that IQ might affect college grade point average (GPA), as in the model introduced here, then we have to sample equal numbers of observations across the entire range of IQ, from below 70 to above 130. Of course, that doesn't make sense since people with low IQ do not usually get into college. The result of practical sampling restrictions, however, might change the shape of the IQ distribution, creating another set of problems. Another way that the independent variables can be fixed is through assignment. When patients are assigned to experimental and control groups in clinical trials, for example, each group is given a [0,1] dummy variable classification (one if you are in the group, zero if not). Basically, the columns of X must each contain continuous variables such as IQ or dummy variables for group assignments.
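A brief sketch of the fixed-X idea (my illustration; all parameter values are made up): X contains a continuous variable fixed by sampling equal numbers across its range, and a 0/1 dummy fixed by assignment. Once B is estimated by least squares, E follows by subtraction.

```r
# Fixed X: IQ sampled evenly across its range, plus a 0/1 assignment dummy.
set.seed(1)
iq    <- rep(seq(70, 130, by = 10), times = 10)  # equal numbers across the IQ range
treat <- rep(c(0, 1), length.out = length(iq))   # dummy for group assignment
gpa   <- 1.0 + 0.02 * iq + 0.3 * treat + rnorm(length(iq), sd = 0.3)

fit   <- lm(gpa ~ iq + treat)   # least squares estimate of B
b.hat <- coef(fit)
e.hat <- gpa - fitted(fit)      # E = Y - XB by simple subtraction
```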

Equations (3-6) start lifting the restriction on fixed independent variables. In Equation (3), the random (non-fixed) variables are contained in the matrix Z with its own set of coefficients, A, that have to be estimated. Along with the coefficients, random variables have variances that also must be estimated; thus, model (3) is called the random effects model. Typically, hierarchical linear models are lumped in with random effects models, but at the end of this post and in future posts I will argue that the two types of models should be thought of separately.
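The extra estimation burden in model (3) can be seen in a small sketch (again my illustration, with made-up values): besides coefficients, the variance of each random effect must be estimated. Here a one-way random-effects layout is simulated and the two variance components are recovered with the classical ANOVA (method-of-moments) estimators.

```r
# One-way random effects: y = u[group] + e, with Var(u) = 9 and Var(e) = 4.
set.seed(7)
J <- 50; n <- 20
u <- rnorm(J, 0, 3)                          # random effects, variance 9
y <- rep(u, each = n) + rnorm(J * n, 0, 2)   # within-group errors, variance 4
g <- factor(rep(1:J, each = n))

msb <- n * var(tapply(y, g, mean))           # between-group mean square
msw <- mean(tapply(y, g, var))               # within-group mean square
est <- c(sigma2.u = (msb - msw) / n,         # should be near 9
         sigma2.e = msw)                     # should be near 4
```

Dedicated packages estimate these components by maximum likelihood, but the moment estimators make the logic transparent.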

Equation (4) is the two-stage least squares model often used for simultaneous equation estimation in economics. Here, one of the dependent variables, y, is singled out as of primary interest even though there are other endogenous variables, Y, that are simultaneously related to y. Model (4) describes structural equation models (SEMs, introduced in a previous post here) as typically used in economics. The variables in X are considered exogenous variables or predetermined variables meaning that any predictions from model (4) are generally restricted to the available values of X. Again, there are a number of specialized estimation techniques for the two-stage least squares model.

Equation (5) is the path analytic model used in both Sociology and Economics. In the path analytic SEM, all the Y variables are considered endogenous. Both model (4) and model (5) bring up the problem of identification which I will cover in a future post. Again, there are separate techniques for the estimation of parameters in model (5). There are also a whole range of controversies surrounding SEMs and causality that I will also discuss in future posts. For the time being, it is important to point out that model (5) starts to look at the unit of analysis as a system that can be described with linear equations.

Finally, Equation (6) describes the basic time series model as an n-th order difference equation. The Y matrix, in this case, has a time ordering (t and t-1) where the Y's are considered output variables and the X's are considered input variables. Time series models introduce a specific type of causality called Granger causality, named for the economist Clive Granger. In model (6) the Y(t-1) variables are considered fixed since values of variables in the past have already been determined. However, the errors in E can be correlated over time, introducing the concept of autocorrelation. Also, since time series variables tend to move together over time (think of population growth and the growth of Gross National Product), time series causality can be confounded by cointegration, a problem that was also studied extensively by Clive Granger.
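A first-order version of model (6) is easy to simulate (my sketch, with illustrative parameter values): each output depends on its own past value, which is treated as fixed because it has already been determined, plus a current input. A lagged regression then recovers the dynamics.

```r
# First-order difference equation: y[t] = a*y[t-1] + b*x[t] + e[t].
set.seed(99)
nT <- 500
a <- 0.7; b <- 1.5
x <- rnorm(nT)
y <- numeric(nT)
for (t in 2:nT) y[t] <- a * y[t - 1] + b * x[t] + rnorm(1)

# Regress y on its own lag and the current input:
est <- coef(lm(y[2:nT] ~ y[1:(nT - 1)] + x[2:nT]))
```

With independent errors this works well; with autocorrelated errors the lagged regression becomes biased, which is one reason autocorrelation gets so much attention in time series work.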

Models (1-6) don't exhaust all the possible models used in Statistics, but they do help classify the vast majority of problems currently being studied and help us fix ideas and suggest topics for future posts. In the next post, my intention is to start with model (1) and describe the different types of distributions one typically finds in E and introduce some interesting approaches to determining the probable distribution of your data. These approaches differ from the typical null-hypothesis testing (NHT) approach (here), where you are required to test whether your data are normally distributed (Step 3, Testing Assumptions) before proceeding with conventional statistical tests (a topic I will also cover in a future post).

Finally, returning to the problem of hierarchical linear models (HLMs), the subscript i was introduced in models (1-6) because there is a hierarchical form of each model, not just model (3). The essential idea in HLMs (see my earlier post here) is that the unit of analysis is a hierarchical system with separate subpopulations (e.g., students, classrooms, hospitals, patients, countries, regions, etc.). Aspects of the subpopulations could be described by any of the models above. The area of multi-model inference (MMI) provides a more flexible approach to HLMs and solves the problems introduced by Simpson's paradox (described in an earlier post here).

MMI can be used successfully to determine an appropriate distribution for your data and help avoid the dead end of NHT assumption testing. I'll cover the MMI approach to Equation (1) above in the next post. The important takeaway from this post is that determining the class or classes of models that are appropriate to your research question will help determine the appropriate type of analysis and the types of assumptions that have to be made to complete the analysis.