Spurious Correlations: A Big Problem With Big Data?

I’m impressed with Tyler Vigen for his work popularizing Spurious Correlations. He has found an effective way to convey an important message: correlation does not equal causation. Lots of things are correlated, but that doesn’t mean they have anything to do with each other.

To create his graphs Vigen indulges in “Data Dredging… a technique used to find something that correlates with one variable by comparing it to hundreds of other variables” (Vigen, 2015, page xii). To illustrate, he provides numerous silly connections, such as Margarine Consumption versus the Divorce Rate in Maine. Of course one can always come up with a story to connect them, but the stories will be a stretch. His book contains lots of similar illustrations; for example, Natalie Portman Movies and Christmas Tree Sales have a pretty high correlation. His website, which inspired the book, has many of the same pictures.

Sometimes the underlying cause is obvious: as populations grow, the number of knitting shops and the number of hospitals in town both increase, but the two aren’t necessarily directly connected. Often there isn’t even an obvious underlying connection. If you compare enough things, then by complete chance some of them are going to increase or decrease at the same time.
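The chance effect is easy to see in a small simulation. The sketch below is only an illustration under stated assumptions: every series is pure random noise, the series length of 10 points is arbitrary, and the function names (`pearson`, `random_series`, `best_spurious_match`) are made up for this example rather than taken from Vigen's work. It compares one target series against a growing pool of unrelated candidates and reports the strongest correlation it finds.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def random_series(rng, length):
    """Independent random noise: by construction, no series is related to any other."""
    return [rng.gauss(0, 1) for _ in range(length)]


def best_spurious_match(n_candidates, length=10, seed=0):
    """Dredge n_candidates unrelated series and return the strongest correlation found."""
    rng = random.Random(seed)
    target = random_series(rng, length)
    best = 0.0
    for _ in range(n_candidates):
        r = pearson(target, random_series(rng, length))
        if abs(r) > abs(best):
            best = r
    return best


if __name__ == "__main__":
    # The more candidates we dredge through, the stronger the best "match" looks.
    for n in (1, 10, 100, 1000):
        print(n, round(best_spurious_match(n), 2))
```

Because the seed is fixed, each larger pool contains the smaller ones, so the strongest correlation can only stay the same or grow as more candidates are added. None of these series has anything to do with the target, which is the point: a high correlation found by searching is weak evidence on its own.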

There are many Hollywood actors and loads of sales data. This means some actor’s career is likely to correlate with some sales figures if you look at enough actors and enough sales data. This is one reason why theory shouldn’t be seen as a dirty word, even when we are trying to teach practical subjects. Don’t believe correlations have meaning unless you have a theory to explain how Natalie Portman impacts tree sales or vice versa.

Spurious correlation is an especially big problem in a world of big data, because big data encourages data dredging. Sometimes you can find something meaningful that you would never have thought of, but many of the correlations will be nonsense. If you don’t act on spurious correlations they are just a bit of fun, but sometimes the nonsense sounds plausible, and a lot of bad ideas get perpetuated this way. We should always remember that some connections in the data just don’t mean anything at all.

Read: Tyler Vigen, Spurious Correlations: Correlation Does Not Equal Causation, Hachette Books, New York, NY, 2015.