Two earlier posts (here and here) described a simple hierarchical linear model (HLM) for Black Friday Retail Sales using data from an article in the Washington Post (here). The pedagogical purpose of the HLM exercise was to display one answer to Simpson's Paradox. It wasn't meant to be a general recommendation for the analysis of retail sales.
The analysis of aggregate retail sales is essentially a time series problem in that we are analyzing sales over time for, ideally, multiple years. HLMs can be written to handle time series problems, a topic that I will return to in a later post. The analysis of aggregate US retail sales, however, can be analyzed with a straight-forward state space time series model. If you are interested in the pure time series approach, I have done that in another post (here).
The insight from the time series analysis is that US Retail Sales are being driven by the World economy which makes sense in a world of globalized retail trade. The idea that Black Friday Sales might be a good predictor of aggregate retail sales is, based on purely theoretical considerations, not a very appealing hypothesis.
"Things go wrong. The odds catch up. Probability is like gravity: you cannot negotiate with gravity." James 'Sonny' Crockett, Miami Vice The Movie.
Friday, December 28, 2012
Friday, December 7, 2012
How to Write Hierarchical Model Equations
In the last few posts (here and here) I've written out systems of equations for hierarchical models. So far, I've written out equations for the Dyestuff and the Black Friday Sales models. Hopefully, it's easy to follow the equations once they are written out. Starting from scratch and writing your own equations might be another matter. In this post I will work through some examples that should give some idea about how to start.
First, I will review the examples I have already used and in future posts introduce a few more. When I develop equations, there is an interaction between how I write the equations and how I know I will have to write simulation code to generate data for the model. Until I actually show you how to write that R code in a future post, I'm going to use pseudo-code, that is, an English language (rather than machine readable) description of the algorithm necessary to generate the data. If you have not written pseudo-code before, Wikipedia provides a nice description with a number of examples of "mathematical style pseudo-code" (here).
For the Dyestuff example (here and here) we were "provided" six "samples" representing different "batches of works manufacturer." The subtle point here is that we are not told that the batches were randomly sampled from the universe of all batches we might receive at the laboratory (probably an unrealistic and impossible sampling plan). So, we have to deal with each batch as a unit without population characteristics. So, I can start with the following pseudo-code:
For (Every Batch in the Sample)
For (Every Observation in the Batch)
Generate a Yield coefficient from a batch distribution.
Generate a Yield from a sample distribution.
End (Batch)
End (Sample)
If we had random sampling, I would have been able to simply generate a Yield from a sample Yield Coefficient and a sample Yield distribution (the normal regression model). What seems difficult for students is that many introductory regression texts are a little unclear about how the data were generated. On careful examination, examples turn out to not have been randomly sampled from some large population. Hierarchical models, and the attempt to simulate the underlying data generation process, sensitize us to the need for a more complicated representation. So, instead of the single regression equation we get a system of equations:
where lambda_00 is the yield parameter, mu_0j is the batch error term, beta_0j is the "random" yield coefficient, X is the batch (coded as a dummy variable matrix displayed here), epsilon_ij is the sample error and Y_ij is the observed yield. With random sampling, beta_0j would be "fixed under sampling" at the sample level. Without random sampling, it is a "random coefficient". Here, the batch and the sample distributions are both assumed to be normal with mean 0 and standard errors sigma_u0 and sigma_e, respectively. I'll write the actual R code to simulate this data set in the next post.
The second model I introduced was the Black Friday Sales model (here). I assumed that we have yearly weekly sales data and Black Friday week sales generated at the store level. I also assume that we have retail stores of different sizes. In the real world, not only do stores of different sizes have different yearly sales totals but they probably have somewhat different success with Black Friday Sales events (I always seem to see crazed shoppers crushing into a Walmart store at the stroke of midnight, for example here, rather than pushing their way into a small boutique store). For the time being, I'll assume that all stores have the same basic response to Black Friday Sales just different levels of sales. In pseudo-code:
For (Each Replication)
For (Each Size Store)
For (Each Store)
Generate a random intercept term for store
Generate Black Friday Sales from some distribution
Generate Yearly Sales using sample distribution
End (Store)
End (Store Size)
End (Replication)
and in equations
where the terms have meanings that are similar to the Dyestuff example.
I showed the actual R code for generating a Black Friday Sales data set in the last post (here). The two important equations to notice within the outer loops are
b0 <- b[1] - store + rnorm(1,sd=s[1])
yr.sales <- b0 + b[2]*bf.sales + rnorm(1,sd=s[2])
The first gets the intercept term using a normal random number generator and the second forecasts the actual yearly sales using a second normal random number generator. The parameters to the function rnorm(1,sd=s[1],mean=0) tell the random number generator to generate one number with mean zero (the default) and standard deviation given s[1]. For more information on the random number generators in R type
> help(rnorm)
after the R prompt (>). In a later post I will describe how to generate random numbers from any arbitrary or empirically generated distribution using the hlmmc package. For now, standard random numbers will be just fine.
In the next post I'll describe in more detail how to generate the Dyestuff data.
First, I will review the examples I have already used and in future posts introduce a few more. When I develop equations, there is an interaction between how I write the equations and how I know I will have to write simulation code to generate data for the model. Until I actually show you how to write that R code in a future post, I'm going to use pseudo-code, that is, an English language (rather than machine readable) description of the algorithm necessary to generate the data. If you have not written pseudo-code before, Wikipedia provides a nice description with a number of examples of "mathematical style pseudo-code" (here).
For the Dyestuff example (here and here) we were "provided" six "samples" representing different "batches of works manufacturer." The subtle point here is that we are not told that the batches were randomly sampled from the universe of all batches we might receive at the laboratory (probably an unrealistic and impossible sampling plan). So, we have to deal with each batch as a unit without population characteristics. So, I can start with the following pseudo-code:
For (Every Batch in the Sample)
For (Every Observation in the Batch)
Generate a Yield coefficient from a batch distribution.
Generate a Yield from a sample distribution.
End (Batch)
End (Sample)
If we had random sampling, I would have been able to simply generate a Yield from a sample Yield Coefficient and a sample Yield distribution (the normal regression model). What seems difficult for students is that many introductory regression texts are a little unclear about how the data were generated. On careful examination, examples turn out to not have been randomly sampled from some large population. Hierarchical models, and the attempt to simulate the underlying data generation process, sensitize us to the need for a more complicated representation. So, instead of the single regression equation we get a system of equations:
The second model I introduced was the Black Friday Sales model (here). I assumed that we have yearly weekly sales data and Black Friday week sales generated at the store level. I also assume that we have retail stores of different sizes. In the real world, not only do stores of different sizes have different yearly sales totals but they probably have somewhat different success with Black Friday Sales events (I always seem to see crazed shoppers crushing into a Walmart store at the stroke of midnight, for example here, rather than pushing their way into a small boutique store). For the time being, I'll assume that all stores have the same basic response to Black Friday Sales just different levels of sales. In pseudo-code:
For (Each Replication)
For (Each Size Store)
For (Each Store)
Generate a random intercept term for store
Generate Black Friday Sales from some distribution
Generate Yearly Sales using sample distribution
End (Store)
End (Store Size)
End (Replication)
and in equations
where the terms have meanings that are similar to the Dyestuff example.
I showed the actual R code for generating a Black Friday Sales data set in the last post (here). The two important equations to notice within the outer loops are
b0 <- b[1] - store + rnorm(1,sd=s[1])
yr.sales <- b0 + b[2]*bf.sales + rnorm(1,sd=s[2])
The first gets the intercept term using a normal random number generator and the second forecasts the actual yearly sales using a second normal random number generator. The parameters to the function rnorm(1,sd=s[1],mean=0) tell the random number generator to generate one number with mean zero (the default) and standard deviation given s[1]. For more information on the random number generators in R type
> help(rnorm)
after the R prompt (>). In a later post I will describe how to generate random numbers from any arbitrary or empirically generated distribution using the hlmmc package. For now, standard random numbers will be just fine.
In the next post I'll describe in more detail how to generate the Dyestuff data.
Thursday, December 6, 2012
Generating Hierarchical Black Friday Data
Following up on my last post (here), I'll show how to generate hierarchical Black Friday data. The reason for starting from scratch and trying to simulate your own data is that it helps clarify the data generating process. In the real world, we really don't know how data are generated. In a simulated world, we know exactly how the data are generated. We can then explore different estimation algorithms to see which one gives the best answer.
My guess is that Paul Dales of Capital Economics chose regression analysis because he was testing, in a straight-forward way, whether Black Friday store sales could be used to predict yearly store sales. The problem is that, even though this can be stated as a regression equation, it does not describe how the process actually works. Black Friday sales are generated at the store level as are all the other weekly sales for the year. I'll show how that data-generating model should be written below.
Since this posting is meant to show you how to generate your own Black Friday data set, load the hlmmc libraries and procedures as usual in the R console window (using the hlmmc package is described here and the R programming language, a free multi-platform public-domain program, is available here):
> setwd(W <- "absolute path to your working directory")
> source(file="LibraryLoad.R")
> source(file="HLMprocedures.R")
To make things simple to start (I'll explain how to make it more complicated in a future post), assume that the red store lines are all parallel. In terms of a regression equation, the only difference between these lines then is in the constant term.
The first equation above models that term. (Store) is an indicator variable which is zero for the largest store (imagine it's Target). Lambda_00 plus some normally distributed error term, mu_0j, determines the intercept value for that store. To create data that looked like Paul Dales', I chose a value of 1.3 for lambda_00. The smallest store (imagine a small, hip, boutique shop) was coded 3 since it was the fourth type of store by size and would have an intercept that was 3 less than Target plus some error term. Stores of other sizes would fall in between. The regression lines would all be the same with a value chosen at Beta_1j = 0.6 in the second equation.
I chose the Black Friday Sales value (the X in the second equation) by generating a uniform random number between -2 and 0 (the approximate values for the biggest store, say Target) and then shifted the value upward based on the store number (the details of this are presented in the code below). The modeling assumes that, as part of our sampling plan, we deliberately chose stores of different sizes (again, I have no idea how the original stores were sampled by Mr. Dales but the simulation exercise is designed to make me think about the issue).
The third equation above is the naive regression equation used in Paul Dales' analysis. By substituting in our more realistic data generating model in the fourth equation we see that we actually have two error terms, mu_0j and epsilon_ij. This result should sensitize us to the possibility that we might not find significant results in a naive regression model if the data were really generated hierarchically with error both at the store- and aggregate-levels .
Hopefully, the data generating model is now clear and we can start creating some simulated data. In the R console window you can type the following (after the prompt >):
> data <- BlackFriday(5,b=c(1.3,.6),s=c(.1,.1))
> print(data)
rep store yr.sales bf.sales
1 1 0 0.60498656 -1.28033788
2 1 1 0.42855766 0.55819436
3 1 2 0.44069794 1.75098032
4 1 3 -1.13941544 1.15851739
5 2 0 0.67690293 -0.76659779
6 2 1 -0.02461608 -0.70585148
7 2 2 -0.29758061 0.94070016
8 2 3 -0.81785998 1.30860288
9 3 0 0.73137890 -0.76108426
10 3 1 0.10962475 0.03706152
11 3 2 -0.25702733 0.78483895
12 3 3 -0.26299819 2.27981412
13 4 0 0.18242785 -1.83775363
14 4 1 0.58905336 0.42490799
15 4 2 -0.10427473 1.13952646
16 4 3 -0.59097226 2.01763782
17 5 0 1.00266302 -0.51554774
18 5 1 0.29517583 0.17561673
19 5 2 0.07723300 1.06761291
20 5 3 -0.20515208 2.67945502
> naive.model <- lm(yr.sales ~ bf.sales,data)
> lmplot(naive.model,data,xlab="Black Friday Store Sales",ylab="Yearly Store Sales")
> htsplot(yr.sales ~ bf.sales | store,data,xlab="Black Friday Store Sales",ylab="Yearly Store Sales")
>
The next command computes and plots (above) a separate regression line for each store. These lines all have a positive slope, each one somewhat different due to random error. Thus, at the store level, Black Friday sales are a good predictor of yearly sales while at the aggregate level, the relationship is negative or even non-existant as Mr. Dales reports. Which answer is right?
The answer to that question depends on your point of view. From the standpoint of the US retail sector, maybe Black Friday is 'a bunch of meaningless hype'. From each store's perspective, however, it is a very important sales day and explains why a lot of effort is put into the promotion. The reason the aggregate regression line is negative is simply that there are stores of different sizes in the aggregate sample. Total yearly sales are smaller in smaller stores.
Remember, I have no idea how Paul Dales generated his data, what the sampling plan was or where the data came from (stores or a government agency). It could just as easily be the case that the individual Black Friday store sales are negatively related to yearly sales, contradicting Economics 101. Before we can make this spectacular assertion, however, the data has to be analyzed at the store level with a hierarchical model.
I'll describe hierarchical model estimation in future posts. The models essentially estimate regression equations at the lowest (e.g., store) level and then average the coefficients to determine the aggregate relationship.
TECH NOTE: If you'd like to look a little more closely at the R code in the BlackFriday() function, just type BlackFriday after the prompt and the code will be displayed. In a future post, I'll describe how to write your own code to implement hierarchical models.
> BlackFriday
function(n,b,s,factor=FALSE,chop=FALSE,...) {
out <- NULL
for (rep in 1:n) {
for (store in 0:3) {
b0 <- b[1] - store + rnorm(1,sd=s[1])
if (chop) {
bf.sales <- trunc((runif(1,min=-2,max=0) + store)*100)/100
yr.sales <- trunc((b0 + b[2]*bf.sales + rnorm(1,sd=s[2]))*100)/100
} else {
bf.sales <- runif(1,min=-2,max=0) + store
yr.sales <- b0 + b[2]*bf.sales + rnorm(1,sd=s[2])
}
out <- rbind(out,cbind(rep,store,yr.sales,bf.sales))
}
}
data <- as.data.frame(out)
if (factor) return(within(data,store <- as.factor(store)))
else return(data)
}
>
I came up with this code by looking at the data, making some guesses and playing around a little until the graphs looked right. I'll explain this code more fully in a future post.
Tuesday, December 4, 2012
Black Friday Sales and Store Size
The conventional wisdom about Black Friday (the day after Thanksgiving) Shopping in the US is that it is the primary driver of retail sales for the year! Economic reasoning would tell us that retailers would not be opening earlier and earlier, creating labor problems, and giving deeper and deeper loss-leader discounts if something economically important wasn't happening. So when Paul Dales of Capital Economics published the chart above, which contradicts the convenitonal wisdom, the story was picked up by many news outlets (for example, in the Washington Post, Black Friday is a bunch of meaningless hype, in one chart).
What the chart shows is that when percentage change in sales during Thanksgiving week (the x-axis) is use to predict overall retail sales (the y-axis), the relationship is very weak or nonexistent (the regression line is almost flat). Readers familiar with the "wrong-level problem" (see my earlier post here) should immediately be skeptical and suspect we are seeing another example of Simpson's Paradox.
Simpson's Paradox results when data from different groups are combined and correlations between independent and dependent variables are reversed. My guess (and I'm only guessing here) is that data displayed above are from stores of different sizes. Assume for the moment that Big Box stores (Walmart and Target) have smaller percentage increases than boutique stores. Assume that the red lines imposed on Chart 2 above connect stores of similar size. Each store can have an increase in Black Friday sales while the overall macro-relationship is negative.
In order to resolve this paradox, yearly sales in each store have to be analyzed as a function of Thanksgiving week sales and their same-store sales coefficients analyzed to determine the effect of Black Friday sales. It's hard to believe that individual stores aren't doing this kind of analysis and would continue with Black Friday Blowouts if their results were not positive. If Mr. Dales had access to store-level data, he could have done the analysis with the hlmmc package.
Subscribe to:
Posts (Atom)