Charles Wheelan’s Naked Statistics is an enjoyable and informative read. He does a very good job of simplifying statistics. He explains what statistical methods can do, but also the problems people get into when using them. Here I’ll focus on the problem of omitted variable bias, which Wheelan explains very clearly.
Omitted variable bias sounds like an intimidating idea, but it isn’t really. The bias arises because in any data set there will be lots of things going on that are associated with what we are testing but don’t make it into our model. Wheelan uses the example of trying to explain the effect of golf on heart disease. The obvious thing to examine is whether golfers have more heart disease than non-golfers. Wheelan tells us what he’d expect: “I would not be surprised if Golfers have a higher incidence of [heart disease] than nongolfers.” (Wheelan, 2013, page 217). If true, this seems pretty damning evidence of golf’s danger. Maybe we should ban golfing on public health grounds.
Of course, this is where omitted variable bias comes in. Golf might not really have anything to do with heart disease if there is another variable that is associated with both heart disease and golfing. In this case that variable is age: “In general, people play more golf as they get older… Golf isn’t killing people, old age is killing people, and they happen to be playing golf while it does.” (Wheelan, 2013, page 217).
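To make the golf example concrete, here is a small simulation sketch. Everything in it is made up for illustration: the age range, the probabilities, and the names `simulate_person` and `disease_rate` are my own assumptions, not figures from the book. In this toy world heart disease depends only on age, never on golf, yet golfers still look less healthy until we compare people of similar ages.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def simulate_person():
    """Return (age, golfs, heart_disease) for one invented person."""
    age = random.uniform(20, 80)
    older = (age - 20) / 60  # 0 for a 20-year-old, 1 for an 80-year-old
    # Assumption: older people are more likely to play golf.
    golfs = random.random() < 0.10 + 0.50 * older
    # Assumption: heart disease depends on age ONLY; golf has no effect.
    heart_disease = random.random() < 0.02 + 0.30 * older
    return age, golfs, heart_disease


def disease_rate(group):
    """Share of a group that has heart disease."""
    group = list(group)
    return sum(hd for _, _, hd in group) / len(group)


people = [simulate_person() for _ in range(200_000)]

# Naive comparison: golfers versus non-golfers, ignoring age.
naive_gap = (disease_rate(p for p in people if p[1])
             - disease_rate(p for p in people if not p[1]))

# Same comparison, but only among people in the same decade of life.
gaps_by_decade = {}
for start in range(20, 80, 10):
    band = [p for p in people if start <= p[0] < start + 10]
    gaps_by_decade[start] = (disease_rate(p for p in band if p[1])
                             - disease_rate(p for p in band if not p[1]))

print(f"Naive gap (golfers minus non-golfers): {naive_gap:.3f}")
for start, gap in gaps_by_decade.items():
    print(f"Ages {start}-{start + 9}: gap {gap:.3f}")
```

The naive gap is clearly positive even though golf does nothing in this simulation, while the gaps within each decade shrink toward zero. A little gap can remain inside each band because age still varies by up to ten years there, which is the same bias on a smaller scale.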
Omitted variable bias can explain a lot of confusion in the real world. My advice is that we shouldn’t be convinced by any result unless we have a plausible theory about why it happens. Plausible-sounding theories are relatively easy to find, so if someone can’t give a reasonably convincing explanation for an effect they have extracted from data, you might want to start worrying about omitted variable bias. This problem is only going to get bigger in a world of big data. In big data there are any number of relationships that one can test, and many will show up as significant. If you don’t know why a relationship is significant, it might be that you have left the real cause out of your analysis.
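The claim that many relationships will show up as significant can be checked with a rough sketch. Here I assume an outcome and 1,000 candidate predictors that are all pure random noise, so no predictor has any real connection to the outcome. The sample size, the number of predictors, and the cutoff `R_CRITICAL` (approximately the two-sided 5% threshold for a correlation based on 100 observations) are my own illustrative choices.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible

N_ROWS = 100          # observations per variable (illustrative)
N_PREDICTORS = 1000   # unrelated candidate predictors (illustrative)
R_CRITICAL = 0.197    # approx. two-sided 5% cutoff for |r| when n = 100


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# The outcome is noise, and so is every predictor.
outcome = [random.gauss(0, 1) for _ in range(N_ROWS)]

significant = 0
for _ in range(N_PREDICTORS):
    predictor = [random.gauss(0, 1) for _ in range(N_ROWS)]
    if abs(pearson_r(predictor, outcome)) > R_CRITICAL:
        significant += 1

print(f"{significant} of {N_PREDICTORS} noise predictors look 'significant'")
```

Roughly one in twenty of these meaningless predictors should clear the 5% bar by chance alone. Test enough relationships and some will always look significant, which is why a plausible explanation matters before believing any one of them.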
Read: Charles Wheelan, 2013, Naked Statistics: Stripping the Dread from the Data, Norton