
Friday, December 7, 2012

How to Write Hierarchical Model Equations

In the last few posts (here and here) I've written out systems of equations for hierarchical models. So far, I've written out equations for the Dyestuff and the Black Friday Sales models. Hopefully, it's easy to follow the equations once they are written out. Starting from scratch and writing your own equations might be another matter. In this post I will work through some examples that should give some idea about how to start.

First, I will review the examples I have already used and in future posts introduce a few more. When I develop equations, there is an interaction between how I write the equations and how I know I will have to write simulation code to generate data for the model. Until I actually show you how to write that R code in a future post, I'm going to use pseudo-code, that is, an English-language (rather than machine-readable) description of the algorithm necessary to generate the data. If you have not written pseudo-code before, Wikipedia provides a nice description with a number of examples of "mathematical style pseudo-code" (here).

For the Dyestuff example (here and here) we were "provided" six "samples" representing different "batches of works manufacture." The subtle point here is that we are not told that the batches were randomly sampled from the universe of all batches we might receive at the laboratory (probably an unrealistic and impossible sampling plan). We therefore have to treat each batch as a unit without population characteristics. I can start with the following pseudo-code:

For (Every Batch in the Sample)
     Generate a Yield coefficient from the batch distribution.
     For (Every Observation in the Batch)
        Generate a Yield from the sample distribution.
     End (Observation)
End (Batch)
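Although the actual R code comes in a later post, a minimal sketch of this pseudo-code might look like the following. The parameter values and the 6-batch, 5-observation layout are illustrative assumptions, not the hlmmc implementation:

```r
# Illustrative sketch of the Dyestuff pseudo-code (assumed parameter values)
set.seed(1)
lambda00 <- 1500    # overall yield parameter (assumption)
sigma.u0 <- 40      # batch-level standard deviation (assumption)
sigma.e  <- 50      # sample-level standard deviation (assumption)
out <- NULL
for (batch in 1:6) {                          # every batch in the sample
  b0 <- lambda00 + rnorm(1, sd = sigma.u0)    # random yield coefficient
  for (obs in 1:5) {                          # every observation in the batch
    yield <- b0 + rnorm(1, sd = sigma.e)      # observed yield
    out <- rbind(out, cbind(batch, obs, yield))
  }
}
dyestuff <- as.data.frame(out)
```

Each batch gets its own random intercept; the 30 observations then vary around that batch-level value.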
        
If we had random sampling, I would have been able to simply generate a Yield from a sample Yield Coefficient and a sample Yield distribution (the normal regression model). What seems difficult for students is that many introductory regression texts are a little unclear about how the data were generated. On careful examination, examples turn out to not have been randomly sampled from some large population. Hierarchical models, and the attempt to simulate the underlying data generation process, sensitize us to the need for a more complicated representation. So, instead of the single regression equation we get a system of equations:


beta_0j = lambda_00 + mu_0j,          mu_0j ~ N(0, sigma_u0)
Y_ij = X_ij beta_0j + epsilon_ij,     epsilon_ij ~ N(0, sigma_e)

where lambda_00 is the yield parameter, mu_0j is the batch error term, beta_0j is the "random" yield coefficient, X is the batch (coded as a dummy variable matrix displayed here), epsilon_ij is the sample error and Y_ij is the observed yield. With random sampling, beta_0j would be "fixed under sampling" at the sample level. Without random sampling, it is a "random coefficient". Here, the batch and the sample distributions are both assumed to be normal with mean 0 and standard deviations sigma_u0 and sigma_e, respectively. I'll write the actual R code to simulate this data set in the next post.

The second model I introduced was the Black Friday Sales model (here). I assumed that we have yearly sales data and Black Friday week sales, both generated at the store level. I also assume that we have retail stores of different sizes. In the real world, not only do stores of different sizes have different yearly sales totals but they probably have somewhat different success with Black Friday Sales events (I always seem to see crazed shoppers crushing into a Walmart store at the stroke of midnight, for example here, rather than pushing their way into a small boutique store). For the time being, I'll assume that all stores have the same basic response to Black Friday Sales, just different levels of sales. In pseudo-code:

For (Each Replication)
     For (Each Store Size)
          For (Each Store)
               Generate a random intercept term for the store
               Generate Black Friday Sales from some distribution
               Generate Yearly Sales using the sample distribution
          End (Store)
      End (Store Size)
End (Replication)

and in equations

beta_0j = lambda_00 + lambda_01 (Store)_j + mu_0j
beta_1j = lambda_10
Y_ij = beta_0j + beta_1j X_ij + epsilon_ij

where the terms have meanings that are similar to the Dyestuff example. 

I showed the actual R code for generating a Black Friday Sales data set in the last post (here). The two important equations to notice within the outer loops are

b0 <- b[1] - store + rnorm(1,sd=s[1])

yr.sales <- b0 + b[2]*bf.sales + rnorm(1,sd=s[2])

The first gets the intercept term using a normal random number generator and the second forecasts the actual yearly sales using a second normal random number generator. The parameters in the call rnorm(1,sd=s[1]) tell the random number generator to generate one number with mean zero (the default) and standard deviation given by s[1]. For more information on the random number generators in R, type

help(rnorm)

after the R prompt (>). In a later post I will describe how to generate random numbers from any arbitrary or empirically generated distribution using the hlmmc package. For now, normal random numbers will be just fine.
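As a preview of that idea, one simple base-R way to draw from an empirical distribution (not necessarily how hlmmc does it) is to resample the observed values with sample():

```r
# Hypothetical empirical sample; resampling it with replacement
# approximates draws from its empirical distribution
emp <- c(0.2, 0.5, 0.5, 0.9, 1.4)
draws <- sample(emp, size = 10, replace = TRUE)  # 10 empirical draws
```

Every draw is one of the originally observed values, so the simulated data inherit whatever shape the empirical distribution has.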

In the next post I'll describe in more detail how to generate the Dyestuff data.

Thursday, December 6, 2012

Generating Hierarchical Black Friday Data

Following up on my last post (here), I'll show how to generate hierarchical Black Friday data. The reason for starting from scratch and trying to simulate your own data is that it helps clarify the data generating process. In the real world, we really don't know how data are generated. In a simulated world, we know exactly how the data are generated. We can then explore different estimation algorithms to see which one gives the best answer.

My guess is that Paul Dales of Capital Economics chose regression analysis because he was testing, in a straightforward way, whether Black Friday store sales could be used to predict yearly store sales. The problem is that, even though this can be stated as a regression equation, it does not describe how the process actually works. Black Friday sales are generated at the store level, as are all the other weekly sales for the year. I'll show how that data-generating model should be written below.

Since this posting is meant to show you how to generate your own Black Friday data set, load the hlmmc libraries and procedures as usual in the R console window (using the hlmmc package is described here and the R programming language, a free, multi-platform, open-source program, is available here):

> setwd(W <- "absolute path to your working directory")
> source(file="LibraryLoad.R")
> source(file="HLMprocedures.R")

The hlmmc package contains a function, BlackFriday(), to generate data that looks similar to that presented in the last post. It generates data at the store level using the following set of equations:

beta_0j = lambda_00 + lambda_01 (Store)_j + mu_0j
beta_1j = lambda_10
Y_ij = beta_0 + beta_1 X_ij + epsilon_ij                          (naive regression)
Y_ij = lambda_00 + lambda_01 (Store)_j + lambda_10 X_ij + mu_0j + epsilon_ij

To understand these equations, refer back to the prior post (here) where I drew red lines on Paul Dales' graph indicating fictitious store data (here's the graph again):


To make things simple to start (I'll explain how to make it more complicated in a future post), assume that the red store lines are all parallel. In terms of a regression equation, the only difference between these lines then is in the constant term.

The first equation above models that term. (Store) is an indicator variable which is zero for the largest store (imagine it's Target). lambda_00 plus a normally distributed error term, mu_0j, determines the intercept value for that store. To create data that looked like Paul Dales', I chose a value of 1.3 for lambda_00. The smallest store (imagine a small, hip boutique shop) was coded 3 since it was the fourth type of store by size; its intercept is 3 less than Target's, plus some error term. Stores of other sizes fall in between. The regression lines all have the same slope, set at beta_1j = 0.6 in the second equation.

I chose the Black Friday Sales value (the X in the second equation) by generating a uniform random number between -2 and 0 (the approximate values for the biggest store, say Target) and then shifted the value upward based on the store number (the details of this are presented in the code below). The modeling assumes that, as part of our sampling plan, we deliberately chose stores of different sizes (again, I have no idea how the original stores were sampled by Mr. Dales but the simulation exercise is designed to make me think about the issue).

The third equation above is the naive regression equation used in Paul Dales' analysis. By substituting in our more realistic data-generating model in the fourth equation, we see that we actually have two error terms, mu_0j and epsilon_ij. This result should sensitize us to the possibility that we might not find significant results in a naive regression model if the data were really generated hierarchically, with error at both the store and aggregate levels.

Hopefully, the data generating model is now clear and we can start creating some simulated data. In the R console window you can type the following (after the prompt >):

> data <- BlackFriday(5,b=c(1.3,.6),s=c(.1,.1))
> print(data)
   rep store    yr.sales    bf.sales
1    1     0  0.60498656 -1.28033788
2    1     1  0.42855766  0.55819436
3    1     2  0.44069794  1.75098032
4    1     3 -1.13941544  1.15851739
5    2     0  0.67690293 -0.76659779
6    2     1 -0.02461608 -0.70585148
7    2     2 -0.29758061  0.94070016
8    2     3 -0.81785998  1.30860288
9    3     0  0.73137890 -0.76108426
10   3     1  0.10962475  0.03706152
11   3     2 -0.25702733  0.78483895
12   3     3 -0.26299819  2.27981412
13   4     0  0.18242785 -1.83775363
14   4     1  0.58905336  0.42490799
15   4     2 -0.10427473  1.13952646
16   4     3 -0.59097226  2.01763782
17   5     0  1.00266302 -0.51554774
18   5     1  0.29517583  0.17561673
19   5     2  0.07723300  1.06761291
20   5     3 -0.20515208  2.67945502
> naive.model <- lm(yr.sales ~ bf.sales,data)
> lmplot(naive.model,data,xlab="Black Friday Store Sales",ylab="Yearly Store Sales")
> htsplot(yr.sales ~ bf.sales | store,data,xlab="Black Friday Store Sales",ylab="Yearly Store Sales")

For this example, we are going to generate 5 samples (the first parameter in the call to the BlackFriday() function) from four stores of different sizes, labelled 0-3. The second command prints the data we've generated (if you entered the BlackFriday command again, you would get a different data set; if you changed the parameters, for example changing 5 to 6, you would get six replications).

The next two commands estimate and display (in the graph above) a naive regression model fit to the data. The red regression line slopes downward, as reported by Paul Dales.

The next command computes and plots (above) a separate regression line for each store. These lines all have a positive slope, each one somewhat different due to random error. Thus, at the store level, Black Friday sales are a good predictor of yearly sales while at the aggregate level, the relationship is negative or even non-existent, as Mr. Dales reports. Which answer is right?

The answer to that question depends on your point of view. From the standpoint of the US retail sector, maybe Black Friday is 'a bunch of meaningless hype'. From each store's perspective, however, it is a very important sales day and explains why a lot of effort is put into the promotion. The reason the aggregate regression line is negative is simply that there are stores of different sizes in the aggregate sample. Total yearly sales are smaller in smaller stores.

Remember, I have no idea how Paul Dales generated his data, what the sampling plan was or where the data came from (stores or a government agency). It could just as easily be the case that the individual Black Friday store sales are negatively related to yearly sales, contradicting Economics 101. Before we can make this spectacular assertion, however, the data has to be analyzed at the store level with a hierarchical model. 

I'll describe hierarchical model estimation in future posts. The models essentially estimate regression equations at the lowest (e.g., store) level and then average the coefficients to determine the aggregate relationship.
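As a rough sketch of that two-step idea, the block below simulates data with the same equations used in BlackFriday() (so it stands alone), fits one regression per store, and averages the slopes. The actual hlmmc estimators may work differently; this is only meant to illustrate the logic:

```r
# Two-step sketch: per-store regressions, then average the slope estimates.
# Data simulated with the same equations as BlackFriday() (b = c(1.3, .6)).
set.seed(42)
out <- NULL
for (rep in 1:20) {
  for (store in 0:3) {
    b0 <- 1.3 - store + rnorm(1, sd = .1)            # random store intercept
    bf.sales <- runif(1, min = -2, max = 0) + store  # Black Friday sales
    yr.sales <- b0 + 0.6 * bf.sales + rnorm(1, sd = .1)
    out <- rbind(out, cbind(rep, store, yr.sales, bf.sales))
  }
}
data <- as.data.frame(out)
# Step 1: one regression per store; Step 2: average the slope coefficients
slopes <- sapply(split(data, data$store),
                 function(d) coef(lm(yr.sales ~ bf.sales, data = d))[["bf.sales"]])
mean(slopes)  # recovers a value near the true store-level slope of 0.6
```

Averaging the four store-level slopes gives a positive aggregate estimate, even though a single pooled regression on the same data would slope downward.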


TECH NOTE: If you'd like to look a little more closely at the R code in the BlackFriday() function, just type BlackFriday after the prompt and the code will be displayed. In a future post, I'll describe how to write your own code to implement hierarchical models.

> BlackFriday

function(n, b, s, factor = FALSE, chop = FALSE, ...) {
  out <- NULL
  for (rep in 1:n) {            # replications
    for (store in 0:3) {        # four store sizes, 0 = largest
      # random intercept: b[1] shifted down by store size, plus error
      b0 <- b[1] - store + rnorm(1, sd = s[1])
      if (chop) {
        # truncate the generated values to two decimal places
        bf.sales <- trunc((runif(1, min = -2, max = 0) + store) * 100) / 100
        yr.sales <- trunc((b0 + b[2] * bf.sales + rnorm(1, sd = s[2])) * 100) / 100
      } else {
        bf.sales <- runif(1, min = -2, max = 0) + store
        yr.sales <- b0 + b[2] * bf.sales + rnorm(1, sd = s[2])
      }
      out <- rbind(out, cbind(rep, store, yr.sales, bf.sales))
    }
  }
  data <- as.data.frame(out)
  if (factor) return(within(data, store <- as.factor(store)))
  else return(data)
}
> 

I came up with this code by looking at the data, making some guesses and playing around a little until the graphs looked right. I'll explain this code more fully in a future post.