Thursday, December 6, 2012

Generating Hierarchical Black Friday Data

Following up on my last post (here), I'll show how to generate hierarchical Black Friday data. The reason for starting from scratch and simulating your own data is that it helps clarify the data-generating process. In the real world, we really don't know how data are generated. In a simulated world, we know exactly how the data are generated. We can then explore different estimation algorithms to see which one gives the best answer.

My guess is that Paul Dales of Capital Economics chose regression analysis because he was testing, in a straightforward way, whether Black Friday store sales could be used to predict yearly store sales. The problem is that, even though this can be stated as a regression equation, it does not describe how the process actually works. Black Friday sales are generated at the store level, as are all the other weekly sales for the year. I'll show how that data-generating model should be written below.

Since this posting is meant to show you how to generate your own Black Friday data set, load the hlmmc libraries and procedures as usual in the R console window (using the hlmmc package is described here; the R programming language, a free, open-source, multi-platform program, is available here):

> setwd(W <- "absolute path to your working directory")
> source(file="LibraryLoad.R")
> source(file="HLMprocedures.R")

The hlmmc package contains a function, BlackFriday(), to generate data similar to those presented in the last post. It generates data at the store level using the following set of equations:

\beta_{0j} = \lambda_{00} - \mathrm{Store}_j + \mu_{0j}    (1)
Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + \epsilon_{ij}    (2)
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i    (3)
Y_{ij} = \lambda_{00} - \mathrm{Store}_j + \beta_{1j} X_{ij} + \mu_{0j} + \epsilon_{ij}    (4)
To understand these equations, refer back to the prior post (here) where I drew red lines on Paul Dales' graph indicating fictitious store data (here's the graph again):

[Graph from the prior post: yearly store sales plotted against Black Friday store sales, with red lines marking the fictitious store-level regressions]
To keep things simple to start (I'll explain how to make the model more complicated in a future post), assume that the red store lines are all parallel. In terms of a regression equation, the only difference between the lines is then the constant term.

The first equation above models that term. Store is a size code that equals zero for the largest store (imagine it's Target); for that store, lambda_00 plus a normally distributed error term, mu_0j, determines the intercept. To create data that looked like Paul Dales', I chose a value of 1.3 for lambda_00. The smallest store (imagine a small, hip boutique shop) was coded 3, since it was the fourth type of store by size, and has an intercept that is 3 less than Target's plus some error term. Stores of other sizes fall in between. The regression lines all share the same slope, chosen as beta_1j = 0.6 in the second equation.

I chose the Black Friday Sales value (the X in the second equation) by generating a uniform random number between -2 and 0 (the approximate values for the biggest store, say Target) and then shifting the value upward based on the store number (the details of this are presented in the code below). The modeling assumes that, as part of our sampling plan, we deliberately chose stores of different sizes (again, I have no idea how the original stores were sampled by Mr. Dales, but the simulation exercise is designed to make me think about the issue).
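To make this concrete, here is a minimal sketch (mine, not part of the hlmmc package) of the generating step for a single store; it mirrors the BlackFriday() code listed in the tech note below:

store <- 3                                          # store-size code, 0 = largest
b0 <- 1.3 - store + rnorm(1,sd=.1)                  # intercept: lambda_00 - store + mu_0j
bf.sales <- runif(1,min=-2,max=0) + store           # X, shifted upward by store code
yr.sales <- b0 + .6*bf.sales + rnorm(1,sd=.1)       # second equation plus epsilon_ij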

The third equation above is the naive regression equation used in Paul Dales' analysis. Substituting our more realistic data-generating model into it yields the fourth equation, which shows that we actually have two error terms, mu_0j and epsilon_ij. This result should sensitize us to the possibility that we might not find significant results in a naive regression model if the data were really generated hierarchically, with error at both the store and aggregate levels.
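One quick way to see the two error terms at work (my own check, not from the original analysis): within a single store, the composite error mu_0j + epsilon_ij has standard deviation sqrt(.1^2 + .1^2), roughly 0.14, and a within-store regression on a large simulated sample recovers about that value:

> one.store <- subset(BlackFriday(500,b=c(1.3,.6),s=c(.1,.1)), store == 0)
> summary(lm(yr.sales ~ bf.sales, one.store))$sigma   # residual sd, roughly 0.14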

Hopefully, the data-generating model is now clear and we can start creating some simulated data. In the R console window, type the following (after the prompt >):

> data <- BlackFriday(5,b=c(1.3,.6),s=c(.1,.1))
> print(data)
   rep store    yr.sales    bf.sales
1    1     0  0.60498656 -1.28033788
2    1     1  0.42855766  0.55819436
3    1     2  0.44069794  1.75098032
4    1     3 -1.13941544  1.15851739
5    2     0  0.67690293 -0.76659779
6    2     1 -0.02461608 -0.70585148
7    2     2 -0.29758061  0.94070016
8    2     3 -0.81785998  1.30860288
9    3     0  0.73137890 -0.76108426
10   3     1  0.10962475  0.03706152
11   3     2 -0.25702733  0.78483895
12   3     3 -0.26299819  2.27981412
13   4     0  0.18242785 -1.83775363
14   4     1  0.58905336  0.42490799
15   4     2 -0.10427473  1.13952646
16   4     3 -0.59097226  2.01763782
17   5     0  1.00266302 -0.51554774
18   5     1  0.29517583  0.17561673
19   5     2  0.07723300  1.06761291
20   5     3 -0.20515208  2.67945502
> naive.model <- lm(yr.sales ~ bf.sales,data)
> lmplot(naive.model,data,xlab="Black Friday Store Sales",ylab="Yearly Store Sales")
> htsplot(yr.sales ~ bf.sales | store,data,xlab="Black Friday Store Sales",ylab="Yearly Store Sales")

For this example, we generate 5 replications (the first parameter in the call to the BlackFriday() function) from four stores of different sizes, labelled 0-3. The second command prints the data we've generated (if you entered the BlackFriday() command again, you would get a different data set; if you changed the first parameter from 5 to 6, you would get six replications).
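Because BlackFriday() draws fresh random numbers on every call, your numbers will differ from those printed above. If you want a reproducible data set, fix the random seed first (this uses base R's set.seed(); the seed value here is arbitrary):

> set.seed(2012)
> data <- BlackFriday(5,b=c(1.3,.6),s=c(.1,.1))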

The next two commands estimate and display (in the graph above) a naive regression model fit to the data. The red regression line slopes downward, as reported by Paul Dales.

The next command computes and plots (above) a separate regression line for each store. These lines all have a positive slope, each one somewhat different due to random error. Thus, at the store level, Black Friday sales are a good predictor of yearly sales, while at the aggregate level the relationship is negative or even non-existent, as Mr. Dales reports. Which answer is right?
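One quick check, which is my own illustration rather than part of the original post: add store dummies to the naive regression. Once store size is controlled for, the bf.sales coefficient comes back positive, near the 0.6 used to generate the data:

> adjusted.model <- lm(yr.sales ~ bf.sales + factor(store), data)
> coef(adjusted.model)   # bf.sales coefficient is now near +0.6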

The answer to that question depends on your point of view. From the standpoint of the US retail sector, maybe Black Friday is 'a bunch of meaningless hype'. From each store's perspective, however, it is a very important sales day, which explains why so much effort is put into the promotion. The reason the aggregate regression line is negative is simply that there are stores of different sizes in the aggregate sample: total yearly sales are smaller in smaller stores.

Remember, I have no idea how Paul Dales generated his data, what the sampling plan was, or where the data came from (stores or a government agency). It could just as easily be the case that individual Black Friday store sales are negatively related to yearly sales, contradicting Economics 101. Before we can make this spectacular assertion, however, the data have to be analyzed at the store level with a hierarchical model.

I'll describe hierarchical model estimation in future posts. The models essentially estimate regression equations at the lowest (e.g., store) level and then average the coefficients to determine the aggregate relationship.
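As a rough preview, here is a minimal sketch of that idea in plain R (assumptions mine; the hlmmc estimators are more sophisticated): fit a separate regression for each store, then average the coefficients across stores.

> per.store <- lapply(split(data, data$store), function(d) coef(lm(yr.sales ~ bf.sales, d)))
> colMeans(do.call(rbind, per.store))   # average intercept and slope across stores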


TECH NOTE: If you'd like to look a little more closely at the R code in the BlackFriday() function, just type BlackFriday after the prompt and the code will be displayed. In a future post, I'll describe how to write your own code to implement hierarchical models.

> BlackFriday

function(n,b,s,factor=FALSE,chop=FALSE,...) {
  out <- NULL
  for (rep in 1:n) {                # n replications
    for (store in 0:3) {            # four store sizes, 0 = largest
      # intercept: lambda_00 - store code + normally distributed store-level error
      b0 <- b[1] - store + rnorm(1,sd=s[1])
      if (chop) {
        # same calculations, truncated to two decimal places
        bf.sales <- trunc((runif(1,min=-2,max=0) + store)*100)/100
        yr.sales <- trunc((b0 + b[2]*bf.sales + rnorm(1,sd=s[2]))*100)/100
      } else {
        # Black Friday sales: uniform on (-2,0), shifted upward by store code
        bf.sales <- runif(1,min=-2,max=0) + store
        # yearly sales: intercept + slope*X + observation-level error
        yr.sales <- b0 + b[2]*bf.sales + rnorm(1,sd=s[2])
      }
      out <- rbind(out,cbind(rep,store,yr.sales,bf.sales))
    }
  }
  data <- as.data.frame(out)
  if (factor) return(within(data,store <- as.factor(store)))
  else return(data)
}
> 

I came up with this code by looking at the data, making some guesses and playing around a little until the graphs looked right. I'll explain this code more fully in a future post.
