Charles Wheelan’s Naked Statistics is an enjoyable and informative read. He does a very good job of simplifying statistics, explaining both what statistical methods can do and the problems people get into when using them. Here I’ll focus on his explanation of omitted variable bias, which he tackles very clearly.
Explaining Omitted Variable Bias
Omitted variable bias sounds like an intimidating idea but isn’t really. The bias comes from the fact that in any data there will be lots of things happening. Some of these things are associated both with the outcome we care about and with the factor we are testing, but they don’t make it into our model. Basically, there are important things we have left out.
Wheelan illustrates this with the effect of golf on heart disease. The obvious thing to examine is whether golfers have more heart disease than non-golfers. Wheelan tells us what he’d expect.
I would not be surprised if Golfers have a higher incidence of [heart disease] than nongolfers.
Wheelan, 2013, page 217
If true, this seems pretty damning evidence of golf’s danger. Maybe we should ban golfing on the grounds of public health.
Golfing Is Not (Very) Dangerous
Of course, this is where omitted variable bias comes in. Golf might not really have anything to do with heart disease. There may be another variable that is associated with both heart disease and golfing. The problem is pretty obvious if you think about it.
In general, people play more golf as they get older… Golf isn’t killing people, old age is killing people, and they happen to be playing golf while it does so.
Wheelan, 2013, page 217
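To make the golf example concrete, here is a minimal simulation sketch. It is my own illustration, not something from the book, and every number in it is made up: I assume ages between 20 and 79, a chance of playing golf that rises with age, and a chance of heart disease that depends only on age. Golf has no effect at all in this toy world. The function names (simulate, raw_gap, age_adjusted_gap) are hypothetical helpers.

```python
import random


def simulate(n=100_000, seed=0):
    """Simulate people whose golf habit and heart disease both depend on age only."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        age = rng.randint(20, 79)
        t = (age - 20) / 60  # 0 for the youngest, close to 1 for the oldest
        plays_golf = rng.random() < 0.05 + 0.5 * t      # assumed: older people golf more
        heart_disease = rng.random() < 0.02 + 0.3 * t   # assumed: risk depends on age, not golf
        people.append((age, plays_golf, heart_disease))
    return people


def rate(rows):
    """Share of the given people who have heart disease."""
    return sum(hd for _, _, hd in rows) / len(rows)


def raw_gap(people):
    """Heart disease rate of golfers minus non-golfers, ignoring age."""
    golfers = [p for p in people if p[1]]
    others = [p for p in people if not p[1]]
    return rate(golfers) - rate(others)


def age_adjusted_gap(people, band=10):
    """Same comparison made within age bands, then averaged by band size."""
    bands = {}
    for p in people:
        bands.setdefault(p[0] // band, []).append(p)
    total, weighted = 0, 0.0
    for rows in bands.values():
        golfers = [p for p in rows if p[1]]
        others = [p for p in rows if not p[1]]
        if golfers and others:
            weighted += len(rows) * (rate(golfers) - rate(others))
            total += len(rows)
    return weighted / total


people = simulate()
print(f"raw gap: {raw_gap(people):.3f}")
print(f"age-adjusted gap: {age_adjusted_gap(people):.3f}")
```

Under these assumptions the raw comparison should make golfers look noticeably less healthy, while the comparison within age bands should show a gap close to zero. Comparing like with like by age is a crude way of putting the omitted variable back into the analysis.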
Omitted Variable Bias Causes Confusion
Omitted variable bias can explain a lot of confusion in the real world. My advice is not to be convinced by any result unless there is a plausible theory for why it happens. Of course, plausible-sounding theories are relatively easy to find, and that is the point: if someone can’t give a reasonably convincing explanation for an effect they have extracted from data, you might want to start worrying about omitted variable bias. This problem is only going to get bigger in a world of big data.
In big data there are any number of relationships that one can test. Many will show up as significant, and many of those will be spurious correlations. If you don’t know why a relationship is significant, it might be that you have left the real cause out of your analysis.
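A quick sketch shows how easily "significant" relationships turn up when you test enough of them. This is my own illustration with made-up settings: 100 observations, 200 candidate variables, and an outcome that is pure random noise, so none of the relationships are real. I flag a correlation as significant when its absolute value exceeds roughly 0.197, which is approximately the two-sided 5% cutoff for 100 observations. The function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_obs=100, n_vars=200, cutoff=0.197, seed=1):
    """Count noise variables that look 'significantly' related to a noise outcome."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_vars):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]  # unrelated by construction
        if abs(pearson(candidate, outcome)) > cutoff:
            hits += 1
    return hits


print(count_spurious(), "of 200 unrelated variables look significant")
```

With a 5% threshold you would expect something like 10 of the 200 to pass by chance alone, even though nothing here causes anything. Without a plausible theory there is no way to tell such results apart from real ones.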
Read: Charles Wheelan, 2013, Naked Statistics: Stripping the Dread from the Data, Norton