Monday, September 24, 2012

The Dyestuff Model

In a prior post (here), I introduced the Dyestuff data set. At first, one might be tempted to run an analysis of variance (ANOVA) on this data since the stated intention is to analyze variation from batch to batch. The ANOVA model is equivalent to the standard regression model.

Y = X beta + e

Here, all the terms are matrices: Y is an (n x 1) matrix of the dependent variable, Yield in this case. X is an (n x m) design matrix (explained in the NOTE below) coding the Batch. The beta matrix is an (m x 1) matrix of unknown coefficients, and e is an (n x 1) matrix of unknown errors (departures from perfect fit) distributed normally with mean zero and a single variance, sigma^2. If your matrix algebra is weak, R provides a great environment for learning linear algebra (see for example here).

The regression model is completely equivalent to ANOVA. You can convince yourself of this by running the two analyses in R.


> m1 <- lm(Yield ~ Batch,Dyestuff,x=TRUE)
> summary(m1)

Call:
lm(formula = Yield ~ Batch, data = Dyestuff, x = TRUE)

Residuals:
   Min     1Q Median     3Q    Max 
-85.00 -33.00   3.00  31.75  97.00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1505.00      22.14  67.972  < 2e-16 ***
BatchB         23.00      31.31   0.735  0.46975    
BatchC         59.00      31.31   1.884  0.07171 .  
BatchD         -7.00      31.31  -0.224  0.82500    
BatchE         95.00      31.31   3.034  0.00572 ** 
BatchF        -35.00      31.31  -1.118  0.27474    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 49.51 on 24 degrees of freedom
Multiple R-squared: 0.4893, Adjusted R-squared: 0.3829 
F-statistic: 4.598 on 5 and 24 DF,  p-value: 0.004398 

> anova(m1)
Analysis of Variance Table

Response: Yield
          Df Sum Sq Mean Sq F value   Pr(>F)   
Batch      5  56358 11271.5  4.5983 0.004398 **
Residuals 24  58830  2451.2                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 


In R, the anova(m1) command summarizes the same fitted model as summary(m1); the results are simply presented as a standard ANOVA table (note the identical F statistic and p-value).

The naive regression result finds that the mean of Batch E is significantly different from the reference batch, Batch A (and Batch C marginally so). For the initial question, the result says that batch means differ from the Batch A mean of 1505 by amounts on the order of the 31.31-unit standard error of the batch coefficients, and some batches differ by significantly more. But is this an analysis of means or an analysis of variances? The history and development of the terminology (analysis of variance, analysis of means, random error terms, random components, random effects, etc.) makes interesting reading that I will explore in a future post. To maintain sanity, it's important to stay focused on the model rather than the terminology.
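
As a quick check on how the coefficients map onto batch means (assuming the Dyestuff data frame has been loaded from the lme4 package, as in the prior post), the cell means can be computed directly; each batch mean is the intercept (the Batch A mean) plus that batch's coefficient.

# Batch means of Yield: each equals the intercept (Batch A mean, 1505)
# plus the corresponding coefficient from summary(m1)
with(Dyestuff, tapply(Yield, Batch, mean))

# The same means reconstructed from the fitted model
coef(m1)[1] + c(BatchA = 0, coef(m1)[-1])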

If we write out another model that might have generated these data, it becomes easier to clarify the terminology and to understand why ANOVA (an analysis of means) might not be the right approach for analyzing the variation from batch to batch:

Eq. 1:  beta_0j = lambda_00 + u_0j,  with u_0j ~ N(0, sigma_u0^2)
Eq. 2:  Yield_ij = beta_0j + e_ij
Eq. 3:  e_ij ~ N(0, sigma_e^2)
Eq. 4:  Yield_ij = lambda_00 + u_0j + e_ij

Since we have no control over the sampling of batches, we expect that the Yield coefficient will have a random source of variation in addition to the true population yield (see Equation 1 above). Under normal experimental conditions we might try to randomize away this source of variation by sampling batches from a wide range of suppliers. But since we are not actually using random assignment, we have a unique source of variation associated with the yield of each batch. As can be seen from Equation 1, this is where the term random coefficients comes from. If we had random sampling, the effect of Batch on Yield would be a fixed effect, that is, fixed under sampling. In this model, the fixed effect is lambda_00 rather than the usual beta.

The equation that determines Yield (Equations 2 and 3 above) thus has a random coefficient. If we substitute Equation 1 into Equation 2, we get Equation 4. Notice that we now have a composite error term: one component (u_0j) associated with Batch and another (e_ij) associated with the model. The purpose of hierarchical modeling is to account for the source of error associated with how the Batches are generated. The "analysis of variance" here involves the two sigma parameters, sigma_u0 (the standard deviation of the random yield coefficient, that is, the between-batch variation) and sigma_e (the residual standard deviation of the regression).
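
To make Equation 4 concrete, here is a minimal simulation sketch of the generative model; the values chosen for lambda_00, sigma_u0, and sigma_e are purely illustrative and are not estimates from the Dyestuff data.

# Simulate 6 batches of 5 observations each from Equation 4:
#   Yield_ij = lambda_00 + u_0j + e_ij
set.seed(123)
lambda_00 <- 1527                            # true population yield (illustrative)
sigma_u0  <- 40                              # between-batch standard deviation (illustrative)
sigma_e   <- 50                              # within-batch standard deviation (illustrative)
u_0j <- rnorm(6, mean = 0, sd = sigma_u0)    # one random coefficient per batch
e_ij <- rnorm(30, mean = 0, sd = sigma_e)    # one error per observation
sim <- data.frame(Batch = gl(6, 5, labels = LETTERS[1:6]),
                  Yield = lambda_00 + rep(u_0j, each = 5) + e_ij)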


What you might be asking yourself right now is "If this is an HLM, where's the hierarchy?" Let's assume that we were receiving batches from more than one supplier. Then we would have batches nested within suppliers, but still no random sampling. To account for the effect of supplier, we need another second-level variable (second level in the hierarchy), Z, an indicator variable identifying the supplier.

[Organizational chart: suppliers at the second level of the hierarchy, with batches nested within each supplier.]

The organizational chart above makes the hierarchy explicit. If we take the Z out of the model, however, we are still left with a random coefficients model. Since most human activity is embedded within some social hierarchy, it is hard to find a unit of analysis that is not hierarchical: students within classrooms, classrooms within schools, schools within school systems; countries nested within regions; and so on. When these implicit hierarchies are not modeled, a researcher can only hope that the second-level random effects are small.

Returning to terminology for a moment, if we were just interested in the analysis of means, then the standard fixed-effects ANOVA model (sometimes called Model I ANOVA) gives us a fairly good analysis of differences in Yield. We might have more error associated with the non-random Batches in the error term, but we still have significant results (some Batches have significantly different means in spite of the unanalyzed sources of Batch variance). However, if we are interested in a true analysis of variance (sometimes called Model II, or random-effects, ANOVA) and if we expect that there is some unique source of variation associated with Batches that we want to analyze, we need a technique that will produce estimates of both sigma_u0 (the between-batch standard deviation) and sigma_e (the residual standard deviation).

There are multiple techniques that will perform a Model II ANOVA or random coefficients analysis. In the next post I'll discuss the most popular approach, Maximum Likelihood Estimation (MLE).
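
As a preview, and only as a sketch since estimation is the subject of the next post, the lme4 package (the same package that supplies the Dyestuff data) can fit this random coefficients model and report both variance components:

library(lme4)
# Random coefficients (random intercept) model of Equation 4
m2 <- lmer(Yield ~ 1 + (1 | Batch), data = Dyestuff, REML = FALSE)  # REML = FALSE requests ML estimates
VarCorr(m2)   # reports the Batch and Residual standard deviations (sigma_u0 and sigma_e)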

NOTE: The equivalence of ANOVA and regression becomes a little clearer by looking at the design matrix for the Dyestuff model. Most students are familiar with the standard regression model in which the independent variable, X, is an (n x 1) matrix containing values of a continuous variable. This is how the difference between ANOVA and regression is typically explained in introductory textbooks. However, if the X matrix contains dummy [0,1] variables coding each Batch, the regression model is equivalent to ANOVA. Here is the design matrix, as the ANOVA X matrix is often called, for the Dyestuff model:


> m1$x
   (Intercept) BatchB BatchC BatchD BatchE BatchF
1            1      0      0      0      0      0
2            1      0      0      0      0      0
3            1      0      0      0      0      0
4            1      0      0      0      0      0
5            1      0      0      0      0      0
6            1      1      0      0      0      0
7            1      1      0      0      0      0
8            1      1      0      0      0      0
9            1      1      0      0      0      0
10           1      1      0      0      0      0
11           1      0      1      0      0      0
12           1      0      1      0      0      0
13           1      0      1      0      0      0
14           1      0      1      0      0      0
15           1      0      1      0      0      0
16           1      0      0      1      0      0
17           1      0      0      1      0      0
18           1      0      0      1      0      0
19           1      0      0      1      0      0
20           1      0      0      1      0      0
21           1      0      0      0      1      0
22           1      0      0      0      1      0
23           1      0      0      0      1      0
24           1      0      0      0      1      0
25           1      0      0      0      1      0
26           1      0      0      0      0      1
27           1      0      0      0      0      1
28           1      0      0      0      0      1
29           1      0      0      0      0      1
30           1      0      0      0      0      1
attr(,"assign")
[1] 0 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$Batch
[1] "contr.treatment"


Notice that there is no BatchA column: Batch A plays the role of the reference class to which the other batches are compared (this is what the "contr.treatment" coding shown above does). The intercept, 1505, is the Batch A mean, and the coefficients attached to BatchB through BatchF are expressed as deviations from that reference mean rather than from the grand mean. This construction is necessary because there are only 6 batch means (6 degrees of freedom) to estimate; an intercept plus six dummy variables would be one parameter too many, so one class is absorbed into the intercept as the reference.
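
As a check on the matrix formulation at the top of the post, the coefficients reported by summary(m1) can be reproduced by hand from this design matrix using the least-squares formula beta_hat = (X'X)^(-1) X'Y (a minimal sketch; it assumes m1 was fit with x = TRUE as above):

X <- m1$x                                      # the design matrix shown above
Y <- Dyestuff$Yield                            # the Yield responses (Y in the equation above)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y   # (X'X)^(-1) X'Y
cbind(beta_hat, coef(m1))                      # the two columns match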
