Simpson’s Paradox: Data can be very confusing

One of the strangest things in statistics is Simpson’s paradox. The paradox occurs when each of two (or more) sets of data shows the same result, but combining them into a single data set gives the opposite result.

Smith explains this using a click data example. In his data, a two-click format is more profitable in aggregate than a one-click format, so one might conclude that two-click is the better format. “This conclusion might be an expensive mistake” (Smith, 2014, page 112).

The problem is that when you dig into the data there are two groups, U.S. and international customers, and the one-click format is actually more profitable for both groups. What is going on?

The explanation is that in the example U.S. customers make up a larger share of the two-click customers than of the one-click customers, and U.S. customers are much more profitable. Having relatively more of the profitable type of customer makes the two-click format look more profitable, but the format itself is not the cause. It simply happens to have more of the profitable customers. If you compare like with like, one-click is better.
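The reversal is easy to reproduce with a few lines of arithmetic. The sketch below uses invented numbers, not Smith's actual figures: the customer counts, the average profits, and the `average_profit` helper are all assumptions chosen only to show the effect.

```python
# Hypothetical data: (customers, average profit per customer in dollars).
# These figures are invented for illustration; they are not Smith's numbers.
data = {
    ("U.S.", "one-click"): (100, 10.0),
    ("U.S.", "two-click"): (900, 9.0),
    ("International", "one-click"): (900, 2.0),
    ("International", "two-click"): (100, 1.0),
}


def average_profit(cells):
    """Customer-weighted average profit over (customers, avg_profit) cells."""
    customers = sum(n for n, _ in cells)
    profit = sum(n * avg for n, avg in cells)
    return profit / customers


formats = ("one-click", "two-click")
groups = ("U.S.", "International")

# Within each customer group, compare the two formats like with like.
within_group = {
    group: {fmt: average_profit([data[(group, fmt)]]) for fmt in formats}
    for group in groups
}

# Pool both customer groups together for each format.
aggregate = {
    fmt: average_profit([data[(group, fmt)] for group in groups])
    for fmt in formats
}

for group in groups:
    print(group, within_group[group])
print("Aggregate", aggregate)
```

With these made-up numbers, one-click wins inside each group ($10 vs. $9 for U.S. customers, $2 vs. $1 for international customers), yet the pooled averages are about $2.80 for one-click and $8.20 for two-click. The only reason is the mix: 90% of the two-click customers are from the more profitable U.S. group, against 10% of the one-click customers.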

As Smith says: “The Key to being alert to a possible Simpson’s Paradox is to think about whether a confounding factor has been ignored” (Smith, 2014, page 112).

Data can be strange but often very interesting.

Read: Gary Smith, 2014, Standard Deviations: Flawed Assumptions, Tortured Data and Other Ways to Lie With Statistics, The Overlook Press.