Tuesday, June 17, 2014

Aesop's Fables and Regression Analysis

Who knew that Aesop was a source on the challenges of causal inference? I certainly didn't before reading the tale of THE WOMAN AND THE HEN to my daughters, History and Mystery, earlier this evening. There are many translations of the tale, but the Barnes & Noble edition that we own has an interesting moral for students of statistics and its usefulness in studying observational data. Before I reveal that moral, let me give you a quick paraphrase of the tale.

The Woman and the Hen
I once heard of a woman who owned a Hen that laid a single egg every day. These eggs were of the highest quality, and the woman always received a high price for them when she sold them at the weekly farmers' market. One day, the woman wondered if she might be able to get the Hen to lay two eggs a day instead of one, and so she collected some observational data on the Hen's egg-laying habits. She found that when she fed the Hen one ounce of feed a day, the Hen would lay one egg. She hypothesized that feeding the Hen two ounces of feed a day would cause the Hen to lay two eggs a day. She tested this hypothesis by feeding the Hen two ounces of feed a day for two weeks. In the course of the experiment, the Hen became overweight and stopped laying eggs altogether.
I modified the story the way I did because of the particular translated moral of the Barnes & Noble edition:

Figures are not always facts.

The moral is usually presented as some modification of "don't be greedy," but the formulation in my edition got me thinking about observational data and how collinearity can lead to bad inferences.

In the case of our woman, she sets out to rigorously measure her Hen's laying habits and collects the following information every day:

Hen = 1
Food = 1
Egg = 1

From this information, she clearly imagines that the relationship is something like the following.

Egg = B0 + B1(Food)

Instead of immediately seeing the collinearity problem in her data (she didn't have Stata, R, or SPSS, after all), she assumes that her intercept B0 is 0 and that the only independent variable is Food. If she were a little craftier, she might have decided that the relationship is something akin to:

Egg = B0 + B1(Food) + B2(Chicken)

This is probably a better hypothesis as far as an overall model goes, but it still suffers from collinearity. She should have noticed the problem almost immediately, since there was zero variation in her data. That brings us to the central problem here: the N of her observation pool. One cannot make generalizable statements from a single case. This is often what people mean when they say pithily that "data isn't the plural of anecdote." As smart as that sounds, it can actually be quite wrong. In opinion research that seeks to understand what people believe, using a sufficiently large and randomly selected sample, the plural of anecdote is in fact data.
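The zero-variation problem in the woman's notebook is easy to make concrete. Here is a minimal Python sketch, assuming a hypothetical notebook of 30 identical days (the tale gives no count, so that number is purely illustrative). It shows that when food never varies, very different lines fit the data equally well, so the data cannot tell "more food means more eggs" apart from "food doesn't matter."

```python
# Hypothetical notebook: 30 identical days. The count of 30 is an assumption
# for illustration; the tale does not say how long she observed the Hen.
food = [1] * 30
eggs = [1] * 30

# Variation in the predictor (sum of squared deviations from its mean).
mean_food = sum(food) / len(food)
sxx = sum((f - mean_food) ** 2 for f in food)
print("variation in food:", sxx)

def sse(intercept, slope):
    """Sum of squared errors for the line eggs = intercept + slope * food."""
    return sum((e - (intercept + slope * f)) ** 2 for f, e in zip(food, eggs))

# Three very different stories about the Hen, as (intercept, slope) pairs:
# the woman's guess, "food is irrelevant", and something in between.
candidates = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
for intercept, slope in candidates:
    print((intercept, slope), "fits with squared error", sse(intercept, slope))
```

Every candidate fits with zero error, because each line passes through the single point (food = 1, eggs = 1) that she ever observed. Her guess of "intercept 0, slope 1" is no better supported than any of the others.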

For a detailed discussion of Verbal Reports as Data, you can read Ericsson's paper. Suffice it to say that anecdotes can be data if they are gathered in a properly structured and controlled experiment or with a rigorously designed survey instrument.

All of which is kind of beside the point of our tale, as the woman's problem is with pure observational data from a single subject under very controlled conditions. The problems here are collinearity, lack of variation, and a small N. Even with the small N, if we had some variation in egg production and in the food given, we could form a sound hypothesis for this one Hen. Let's create a Stata do file that gives us this variation.

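clear
set obs 100
* Baseline: one hen, one egg, and one ounce of feed a day
gen chicken = 1
gen eggs = 1
gen food = 1
* Vary egg production and feed across blocks of days
replace eggs = 2 if _n > 10 & _n < 35
replace food = 2 if _n >= 20
replace eggs = 0 if _n > 55
replace food = 0 if _n > 74
* chicken is constant, so Stata will omit it as collinear with the constant term
regress eggs food chicken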

Running this regression, Stata removes Chicken due to collinearity: it equals 1 in every observation, so it cannot be told apart from the constant term.
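For readers without Stata, the same fit can be checked by hand. What follows is a minimal Python sketch rather than a translation of Stata's own routine: it rebuilds the 100 simulated days from the do file and computes the one-predictor least-squares line directly. The helper names (eggs_on, food_on, ols_fit) are made up for this illustration.

```python
# Rebuild the 100 simulated days from the do file.
# Day numbers start at 1, matching Stata's _n.
days = range(1, 101)

def eggs_on(day):
    # Mirrors the `replace eggs` lines of the do file.
    if day > 55:
        return 0
    if 10 < day < 35:
        return 2
    return 1

def food_on(day):
    # Mirrors the `replace food` lines; the later rule (day > 74) wins.
    if day > 74:
        return 0
    if day >= 20:
        return 2
    return 1

eggs = [eggs_on(d) for d in days]
food = [food_on(d) for d in days]
chicken = [1 for _ in days]

def ols_fit(x, y):
    """One-predictor least squares; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("predictor has no variation; slope is not identified")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = ols_fit(food, eggs)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")

# chicken never varies, so it duplicates the constant term.
print("distinct chicken values:", sorted(set(chicken)))
```

This prints an intercept of 0.29 and a slope of 0.39, and it shows that chicken only ever takes the value 1, which is why it cannot be estimated separately from the constant.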

Here we can see that the amount of food does have an impact on egg production, but this conclusion rests on observational data that the woman did not have for her Hen. The estimated intercept also tells us that if we don't feed the Hen at all, she will still produce .29 of an egg a day. All of which leads us to the same conclusion as Aesop: observational data can be misleading, even when the results are statistically significant and a relationship appears to be visible.

This was probably the nerdiest post I've ever written, and it has some holes, as both the model and the do file could be refined considerably, but it all points to the same moral.

Be careful to have a good theory when designing statistical models. Excluded variables can matter, as can a lack of variation in the data. The woman would have been better served getting two Hens, or by varying the amount of food in a way that was rigorously experimental and not based on a single treatment. So my additional Aesop moral is:

Replication is Vital.
