Monday, September 24, 2012

The Dyestuff Model

In a prior post (here), I introduced the Dyestuff data set. At first, one might be tempted to run an analysis of variance (ANOVA) on this data since the stated intention is to analyze variation from batch to batch. The ANOVA model is equivalent to the standard regression model.


Y = X beta + e

Here, all the terms represent matrices: Y is an (n x 1) matrix of dependent variables, Yield in this case. X is an (n x m) design matrix (explained in the NOTE below) coding the Batch. The beta matrix is an (m x 1) matrix of unknown coefficients. And the e matrix is an (n x 1) matrix of unknown errors (departures from perfect fit), distributed normally with mean zero and a common variance, sigma^2. If your matrix algebra is weak, R provides a great environment for learning linear algebra (see, for example, here).
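If the matrix form looks abstract, it helps to compute the least-squares coefficients directly from the normal equations, beta-hat = (X'X)^(-1) X'y, and check them against lm(). The sketch below uses simulated data (the variable names are mine, not part of the Dyestuff example):

```r
# Simulate a small regression problem: y = X beta + e
set.seed(42)
n  <- 50
x1 <- rnorm(n)
X  <- cbind(1, x1)                        # (n x 2) design matrix: intercept column plus x1
beta <- c(2, 3)                           # "true" coefficients
y  <- as.vector(X %*% beta) + rnorm(n)    # response with N(0, 1) errors

# Least-squares estimate from the normal equations: (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# lm() produces the same estimates
fit <- lm(y ~ x1)
all.equal(as.vector(beta_hat), unname(coef(fit)))   # TRUE
```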

The regression model is completely equivalent to ANOVA. You can convince yourself of this by running the two analyses in R.


> m1 <- lm(Yield ~ Batch,Dyestuff,x=TRUE)
> summary(m1)

Call:
lm(formula = Yield ~ Batch, data = Dyestuff, x = TRUE)

Residuals:
   Min     1Q Median     3Q    Max 
-85.00 -33.00   3.00  31.75  97.00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1505.00      22.14  67.972  < 2e-16 ***
BatchB         23.00      31.31   0.735  0.46975    
BatchC         59.00      31.31   1.884  0.07171 .  
BatchD         -7.00      31.31  -0.224  0.82500    
BatchE         95.00      31.31   3.034  0.00572 ** 
BatchF        -35.00      31.31  -1.118  0.27474    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 49.51 on 24 degrees of freedom
Multiple R-squared: 0.4893, Adjusted R-squared: 0.3829 
F-statistic: 4.598 on 5 and 24 DF,  p-value: 0.004398 

> anova(m1)
Analysis of Variance Table

Response: Yield
          Df Sum Sq Mean Sq F value   Pr(>F)   
Batch      5  56358 11271.5  4.5983 0.004398 **
Residuals 24  58830  2451.2                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 


In R, the anova(m1) command summarizes the same fitted model as summary(m1); the results are simply presented as the standard ANOVA table.

The naive regression result finds that the mean of Batch E is significantly different from that of the reference batch, Batch A (Batch C is marginal at p = 0.07). For the initial question, the result says that each batch mean is expressed as a deviation from the Batch A mean of 1505, each difference carrying a standard error of 31.31, and some batches sit significantly outside that range. But is this an analysis of means or an analysis of variances? The history and development of terminology (analysis-of-variance, analysis of means, random error terms, random components, random effects, etc.) makes interesting reading that I will explore in a future post. To maintain sanity, it's important to keep focused on the model rather than the terminology.

If we write out another model that might have generated this data, it becomes easier to clarify terminology and to understand why ANOVA (analysis of means) might not be the right approach for analyzing the variation from batch to batch.

beta_0j = lambda_00 + u_0j,   u_0j ~ N(0, sigma_u0^2)    (1)
Yield_ij = beta_0j + e_ij                                (2)
e_ij ~ N(0, sigma_e^2)                                   (3)
Yield_ij = lambda_00 + u_0j + e_ij                       (4)



Since we have no control over the sampling of batches, we expect that the Yield coefficient will have a random source of variation in addition to the true population yield (see the first equation above). Under normal experimental conditions we might try to randomize away this source of variation by sampling batches from a wide range of suppliers. But since we are not actually using random assignment, we have a unique source of variation associated with the yield of each batch. As can be seen from the first equation above, this is where the term random coefficients comes from. Had we used random sampling, the effect of Batch on Yield would be a fixed effect, that is, fixed under sampling. In this model, the fixed effect is lambda_00 rather than the usual beta.

The equation that determines Yield (the second and third equations above) thus has a random coefficient. If we substitute Equation 1 into Equation 2, we get Equation 4. Notice that we now have a larger error term: one error associated with Batch and another associated with the model. The purpose of hierarchical modeling is to account for the source of error associated with how the Batches are generated. The "analysis of variance" here involves the two sigma parameters, sigma_u0 (the standard deviation of the yield coefficient) and sigma_e (the standard error of the regression).
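Equation 4 implies that the total variance of a single observation decomposes as Var(Yield) = sigma_u0^2 + sigma_e^2. A quick simulation check in R (the parameter values below are made up for illustration, not estimated from Dyestuff):

```r
# Check the variance decomposition implied by Equation 4:
# Var(Yield) = sigma_u0^2 + sigma_e^2 for a single observation.
set.seed(1)
lambda_00 <- 1500    # fixed effect (illustrative value)
sigma_u0  <- 40      # sd of the batch-level random effect (illustrative)
sigma_e   <- 50      # sd of the within-batch error (illustrative)

n_batches <- 2000
n_per     <- 5
u <- rnorm(n_batches, 0, sigma_u0)           # one random effect per batch
y <- lambda_00 + rep(u, each = n_per) +      # Equation 4: lambda_00 + u_0j + e_ij
     rnorm(n_batches * n_per, 0, sigma_e)

var(y)    # close to sigma_u0^2 + sigma_e^2 = 1600 + 2500 = 4100
```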


What you might be asking yourself right now is "If this is an HLM, where's the hierarchy?" Let's assume that we were receiving batches from more than one manufacturer. Then we would have batches nested within suppliers, but still no random sampling. To account for the effect of supplier, we need another second-level variable (second level in the hierarchy), Z, an indicator variable identifying the manufacturer.


[Organizational chart: preparations nested within batches, batches nested within suppliers]

The organizational chart above makes the hierarchy explicit. If we take the Z out of the model, however, we are still left with a random coefficients model. Since most human activity is embedded within some social hierarchy, it is hard to find a unit of analysis that is not hierarchical: (1) students within classrooms, classrooms within schools, schools within school systems; (2) countries nested within regions; and so on. When these implicit hierarchies are not modeled, a researcher can only hope that the second-level random effects are small.

Returning to terminology for a moment: if we were just interested in the analysis of means, then the standard ANOVA model (sometimes called Model I ANOVA, the fixed-effects model) gives us a fairly good analysis of differences in Yield. We might have more error associated with the non-random Batches in the error term, but we still have significant results (some Batches have significantly different means in spite of the unanalyzed sources of Batch variance). However, if we are interested in a true analysis of variance (sometimes called Model II ANOVA, the random-effects model), and if we expect that there is some unique source of variation associated with Batches that we want to analyze, we need a technique that will produce estimates of both sigma_u0 (the standard deviation of the yield coefficient) and sigma_e (the standard error of the regression).
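For a balanced design like Dyestuff (5 preparations per batch), both components can in fact be recovered from the ANOVA table above by the classical expected-mean-squares argument: E[MS_residual] = sigma_e^2 and E[MS_batch] = sigma_e^2 + 5 * sigma_u0^2. A quick check in R, using the mean squares printed in the table:

```r
# Mean squares from the Dyestuff ANOVA table
ms_batch    <- 11271.5    # Batch mean square (5 df)
ms_residual <- 2451.2     # Residual mean square (24 df)
n_per_batch <- 5          # preparations per batch (balanced design)

# Method-of-moments (expected mean squares) estimates
sigma_e  <- sqrt(ms_residual)                             # residual sd
sigma_u0 <- sqrt((ms_batch - ms_residual) / n_per_batch)  # batch-effect sd

round(c(sigma_e = sigma_e, sigma_u0 = sigma_u0), 2)   # about 49.51 and 42.00
```

For balanced data like this, the moment estimates coincide with the REML estimates a mixed-model fit reports; likelihood-based estimation, the subject of the next post, generalizes the idea to unbalanced designs.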

There are multiple techniques for performing this kind of random coefficients analysis. In the next post I'll discuss the most popular approach, Maximum Likelihood Estimation (MLE).

NOTE: The equivalence of ANOVA and regression becomes a little clearer by looking at the design matrix for the Dyestuff model. Most students are familiar with the standard regression model where the independent variable, X, is an (n x 1) matrix containing values for a continuous variable. This is how the difference between ANOVA and regression is typically explained in introductory textbooks. However, if the X matrix contains dummy [0,1] variables coding each Batch, the regression model is equivalent to ANOVA. Here is the design matrix, as the ANOVA X matrix is often called, for the Dyestuff model:


> m1$x
   (Intercept) BatchB BatchC BatchD BatchE BatchF
1            1      0      0      0      0      0
2            1      0      0      0      0      0
3            1      0      0      0      0      0
4            1      0      0      0      0      0
5            1      0      0      0      0      0
6            1      1      0      0      0      0
7            1      1      0      0      0      0
8            1      1      0      0      0      0
9            1      1      0      0      0      0
10           1      1      0      0      0      0
11           1      0      1      0      0      0
12           1      0      1      0      0      0
13           1      0      1      0      0      0
14           1      0      1      0      0      0
15           1      0      1      0      0      0
16           1      0      0      1      0      0
17           1      0      0      1      0      0
18           1      0      0      1      0      0
19           1      0      0      1      0      0
20           1      0      0      1      0      0
21           1      0      0      0      1      0
22           1      0      0      0      1      0
23           1      0      0      0      1      0
24           1      0      0      0      1      0
25           1      0      0      0      1      0
26           1      0      0      0      0      1
27           1      0      0      0      0      1
28           1      0      0      0      0      1
29           1      0      0      0      0      1
30           1      0      0      0      0      1
attr(,"assign")
[1] 0 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$Batch
[1] "contr.treatment"


Notice that there is no BatchA column: Batch A plays the role of the reference class to which the other batches are compared. That is, the intercept (1505) is the mean of Batch A, and the coefficients attached to BatchB through BatchF are expressed as deviations from that reference mean, not from the grand mean. This construction is necessary because there are only 6 batch means to estimate; a dummy column for every batch plus an intercept would give 7 parameters, so one batch must serve as the reference.
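You can verify this behavior of treatment contrasts on any small example: the intercept equals the mean of the reference level, and each dummy coefficient equals that group's mean minus the reference mean. Here is a self-contained illustration with made-up numbers (not the Dyestuff values):

```r
# Three groups with treatment (dummy) coding, R's default for factors
d <- data.frame(
  g = factor(rep(c("A", "B", "C"), each = 4)),
  y = c(10, 12, 11, 9,  20, 22, 21, 19,  30, 28, 31, 29)
)
m <- lm(y ~ g, data = d)
group_means <- tapply(d$y, d$g, mean)    # A = 10.5, B = 20.5, C = 29.5

# Intercept is the mean of the reference level A;
# gB and gC are deviations from that reference mean
all.equal(unname(coef(m)["(Intercept)"]), unname(group_means["A"]))            # TRUE
all.equal(unname(coef(m)["gB"]), unname(group_means["B"] - group_means["A"]))  # TRUE
```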

Saturday, September 22, 2012

A Simple HLM: The Dyestuff Example

Hierarchical Linear Models (HLMs) have their own terminology that challenges a lot of what is taught as basic statistics. We have random effects, random coefficients, and mixed models added on to what we already know about random error terms and fixed effects. It will help to start with a simple data set so that we can introduce new terms one at a time.

The lme4 package in the R programming language has a simple data set that provides a good starting point. If you have R running on your machine and the lme4 package installed, typing help(Dyestuff) produces the following description of the data set:

The Dyestuff data frame provides the yield of dyestuff (Naphthalene Black 12B) from 5 different preparations from each of 6 different batches of an intermediate product (H-acid). 

The Dyestuff data are described in Davies and Goldsmith (1972) as coming from “an investigation to find out how much the variation from batch to batch in the quality of an intermediate product (H-acid) contributes to the variation in the yield of the dyestuff (Naphthalene Black 12B) made from it. In the experiment six samples of the intermediate, representing different batches of works manufacture, were obtained, and five preparations of the dyestuff were made in the laboratory from each sample. The equivalent yield of each preparation as grams of standard colour was determined by dye-trial.”

Notice that the batches are not randomly chosen out of some universe of manufactured batches. They are whatever we were able to "obtain". Just to make things concrete, here are the first few lines of the file printed by the R head() command (the > before the head() command is the R prompt):


> head(Dyestuff)
  Batch Yield
1     A  1545
2     A  1440
3     A  1440
4     A  1520
5     A  1580
6     B  1540


At first, one might be tempted to run an analysis of variance (ANOVA) on this data since the stated intention is to analyze variation from batch to batch. However, there are issues presented by this simple data set that help us understand the added complexity and potential benefits of hierarchical modeling.

In general, the standard approach is to move directly from a data set to some kind of estimator. In the next few posts, I will argue that the standard approach misses a number of important opportunities.

Sunday, September 16, 2012

The 'hlmmc' package

From July 1 through August 15, I taught Sociology S534 Advanced Sociological Analysis: Hierarchical Linear Models at the University of Tennessee in Knoxville. A syllabus for the course is available here. Over the next few months I will serialize the lectures and a user guide for the software package ('hlmmc') used in the course. 'hlmmc' is a software package written in the R programming language for studying Hierarchical Linear Models (HLMs) using Monte Carlo simulation. A manual for the software package is available here. It explains how to download the software and how each function in the package can be used (also, see NOTE below).

The approach I used in the course was to write out the equations for each linear model as a function in the R programming language. The functions were designed so that data for the study could be simulated for each observation in the sample. This is a useful exercise because it forces you to confront a number of difficult issues. What statistical distributions will be used to simulate not only the error terms but also the independent variables? Typically, a normal distribution is chosen for the error terms, and the distributions of the independent variables are ignored in planning a study. Monte Carlo simulation forces the researcher to confront these issues, in addition to explicitly stating the functional form of the model, the actual parameters in the model, and a likely range of parameter values that might be encountered in practice.

In the next posting I'll start with a very simple hierarchical data set and develop an R function for simulating data from a model that could be used to generate similar data. The Monte Carlo function frees us from the sample data and lets us explore more general issues raised by the model. The pitch to my students was that the Monte Carlo approach would provide a deeper understanding of the issues raised by hierarchical models.
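To make the idea concrete, here is a minimal sketch of what such a data-generating function looks like for the Dyestuff-type model; the function and argument names are my own invention, not necessarily those used in the 'hlmmc' package:

```r
# Generate data from a one-way random-effects model:
# Yield_ij = lambda_00 + u_0j + e_ij,
# with u_0j ~ N(0, sigma_u0^2) and e_ij ~ N(0, sigma_e^2)
simulate_batches <- function(n_batches = 6, n_per_batch = 5,
                             lambda_00 = 1500, sigma_u0 = 40, sigma_e = 50) {
  u <- rnorm(n_batches, 0, sigma_u0)     # one random effect per batch
  data.frame(
    Batch = factor(rep(LETTERS[seq_len(n_batches)], each = n_per_batch)),
    Yield = lambda_00 + rep(u, each = n_per_batch) +
            rnorm(n_batches * n_per_batch, 0, sigma_e)
  )
}

sim <- simulate_batches()
str(sim)   # 30 rows: Batch (factor with 6 levels) and Yield (numeric)
```

Because the true lambda_00, sigma_u0, and sigma_e are known to the simulator, we can generate many such data sets and see how well any estimator recovers them, which is the point of the Monte Carlo approach.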

The basic feedback from students was that, in the short time available for the class, they were able to get a deeper understanding of HLMs than if the class had been structured as a conventional statistics class (math lectures and exercises?). One student commented in the course evaluations that s/he "...wished all stat classes were taught this way." For better or worse, my classes always have been.

If you are interested in the topic and want to follow the serialization in future postings, I will try to follow my lectures, which were live programming demonstrations, as closely as possible, keeping in mind some of the questions and problems my students had when confronting the material.

NOTE: Assuming you have R installed on your machine and you have downloaded the files LibraryLoad.R and HLMprocedures.R to some folder on your local machine, you can load the libraries and procedures using the following commands in R:

> setwd(W)
> source(file="LibraryLoad.R")
> source(file="HLMprocedures.R")

where > is the R prompt and W is the absolute path to the working directory where you downloaded the files LibraryLoad.R and HLMprocedures.R. The Macintosh and Windows versions of R have a 'Source' selection in the File menu that will also allow you to navigate to the folder where these two files have been downloaded. When you source the LibraryLoad.R file, error messages will be produced if you have not installed the supporting libraries listed in the manual (here).