Friday, December 7, 2012

How to Write Hierarchical Model Equations

In the last few posts (here and here) I've written out systems of equations for hierarchical models. So far, I've written out equations for the Dyestuff and the Black Friday Sales models. Hopefully, it's easy to follow the equations once they are written out. Starting from scratch and writing your own equations might be another matter. In this post I will work through some examples that should give some idea about how to start.

First, I will review the examples I have already used and in future posts introduce a few more. When I develop equations, there is an interaction between how I write the equations and how I know I will have to write simulation code to generate data for the model. Until I actually show you how to write that R code in a future post, I'm going to use pseudo-code, that is, an English language (rather than machine readable) description of the algorithm necessary to generate the data. If you have not written pseudo-code before, Wikipedia provides a nice description with a number of examples of "mathematical style pseudo-code" (here).

For the Dyestuff example (here and here) we were "provided" six "samples" representing different "batches of works manufacturer." The subtle point here is that we are not told that the batches were randomly sampled from the universe of all batches we might receive at the laboratory (probably an unrealistic and impossible sampling plan). So, we have to deal with each batch as a unit without population characteristics. So, I can start with the following pseudo-code:

For (Every Batch in the Sample)
     For (Every Observation in the Batch)
        Generate a Yield coefficient from a batch distribution.
        Generate a Yield from a sample distribution.
     End (Batch)
End (Sample)
        
If we had random sampling, I would have been able to simply generate a Yield from a sample Yield Coefficient and a sample Yield distribution (the normal regression model). What seems difficult for students is that many introductory regression texts are a little unclear about how the data were generated. On careful examination, examples turn out to not have been randomly sampled from some large population. Hierarchical models, and the attempt to simulate the underlying data generation process, sensitize us to the need for a more complicated representation. So, instead of the single regression equation we get a system of equations:



where lambda_00 is the yield parameter, mu_0j is the batch error term, beta_0j is the "random" yield coefficient, X is the batch (coded as a dummy variable matrix displayed here), epsilon_ij is the sample error and Y_ij is the observed yield. With random sampling, beta_0j would be "fixed under sampling" at the sample level. Without random sampling, it is a "random coefficient". Here, the batch and the sample distributions are both assumed to be normal with mean 0 and standard errors sigma_u0 and sigma_e, respectively. I'll write the actual R code to simulate this data set in the next post.

The second model I introduced was the Black Friday Sales model (here). I assumed that we have yearly weekly sales data and Black Friday week sales generated at the store level. I also assume that we have retail stores of different sizes. In the real world, not only do stores of different sizes have different yearly sales totals but they probably have somewhat different success with Black Friday Sales events (I always seem to see crazed shoppers crushing into a Walmart store at the stroke of midnight, for example here, rather than pushing their way into a small boutique store). For the time being, I'll assume that all stores have the same basic response to Black Friday Sales just different levels of sales. In pseudo-code:

For (Each Replication)
     For (Each Size Store)
         For (Each Store)
              Generate a random intercept term for store
              Generate Black Friday Sales from some distribution
              Generate Yearly Sales using sample distribution
         End (Store)
     End (Store Size)
End (Replication)             

and in equations

where the terms have meanings that are similar to the Dyestuff example. 

I showed the actual R code for generating a Black Friday Sales data set in the last post (here). The two important equations to notice within the outer loops are

b0 <- b[1] - store + rnorm(1,sd=s[1])

yr.sales <- b0 + b[2]*bf.sales + rnorm(1,sd=s[2])

The first gets the intercept term using a normal random number generator and the second forecasts the actual yearly sales using a second normal random number generator. The parameters to the function rnorm(1,sd=s[1],mean=0) tell the random number generator to generate one number with mean zero (the default) and standard deviation given s[1]. For more information on the random number generators in R type

help(rnorm)

after the R prompt (>).  In a later post I will describe how to generate random numbers from any arbitrary or empirically generated distribution using the hlmmc package. For now, standard random numbers will be just fine.  

In the next post I'll describe in more detail how to generate the Dyestuff data.

No comments:

Post a Comment