The Danger of Data Mining

In prior posts I have discussed whether data analysis is associated with bigotry. It is a sensitive subject. Data enthusiasts (and I’d probably be in this camp) hope analysis can get rid of silly ideas. After all, when we get better information we will be able to combat old prejudices. I am genuinely optimistic.

That said, it is important to understand that analysis is, to a certain extent, always clouded by prior beliefs. If you come at a problem with a strongly held but incorrect perspective, you are likely to find an answer that suits your prior prejudice. If there is enough data, you will look and look until you find what you want to find.

This is the reason why data mining can be such a problem. If you look at enough data you will find a vast array of patterns. Many of these will be innocuous and some will be obviously absurd to all. Unfortunately, some patterns will appeal to some people and fit with their worldview. The problem is no different in scholarship. Too much marketing scholarship seems to aim simply to find relationships in data. If you don’t think deeply about what you are doing, you end up with nonsense, which at its worst can be plain offensive.
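To put a rough number on "look at enough data and you will find patterns", here is a minimal sketch of the standard multiple-testing arithmetic. It assumes each test is independent and uses the conventional 5% significance threshold; both are simplifying assumptions of mine, and the function name is purely illustrative.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent tests looks
    'significant' at level alpha when no real relationship exists."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # The more relationships you check, the more likely one appears by luck.
    for k in (1, 10, 50, 100):
        print(k, round(prob_at_least_one_false_positive(k), 3))
```

Under these assumptions, the chance of turning up at least one spurious "finding" climbs quickly as the number of relationships examined grows, which is why an undisciplined search tends to reward whoever keeps looking.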

This is the case with Gerry Tellis and his colleagues’ look at how products take off. The paper is not too old (2003) but reads like something from a bygone age. They focus on how new products are received in European countries and give an explanation that, to my mind, should have been cast in the dustbin generations ago.

When doing a regression with secondary data there is often much hay to be made by throwing in “cultural” factors. Throw in enough of them and some will “explain” what is happening. These cultural factors are often poorly specified. It is not that differences between countries don’t exist; it is that the differences used are often extremely poorly substantiated and explained. A vast range of post hoc explanations can be fitted to the data. Given the vast amount of data associated with each country, culture, region, or whatever is being discussed, any result can be explained by any number of dubious stories, some of which are pretty offensive.
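The point about throwing in enough factors can be illustrated with a small simulation. This is only a sketch under stated assumptions: the 16 "countries", the 200 candidate "cultural" factors, and the outcome are all pure random noise that I have made up, and the cutoff of 0.497 is the approximate two-tailed 5% critical value for a correlation with 16 observations. None of this reproduces the analysis in the paper.

```python
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def count_spurious_hits(n_countries=16, n_factors=200,
                        r_critical=0.497, seed=0):
    """Count noise-only 'factors' that correlate 'significantly'
    with a noise-only outcome (all values are hypothetical)."""
    rng = random.Random(seed)
    # Outcome per country, e.g. a made-up "time to takeoff": pure noise.
    outcome = [rng.gauss(0, 1) for _ in range(n_countries)]
    hits = 0
    for _ in range(n_factors):
        # Each candidate factor is generated independently of the outcome.
        factor = [rng.gauss(0, 1) for _ in range(n_countries)]
        if abs(pearson_r(factor, outcome)) > r_critical:
            hits += 1
    return hits


if __name__ == "__main__":
    print(count_spurious_hits(), "of 200 noise factors passed the cutoff")
```

By construction none of these factors has anything to do with the outcome, yet about 5% of them should be expected to clear the cutoff on average. Each one that does is available for a post hoc story.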

To explain new product adoption, Tellis and colleagues turn to the Protestant work ethic, a trope often used to explain patterns in data when authors don’t want to think any more deeply. Of course there are differences between European countries, but throwing the percentage of Protestants into a regression and stating that it is a “reason” for the difference is not substantially different to throwing in skin color.

“The major religious difference among nations in Europe is the ratio of Protestants to Catholics. There is strong evidence in sociology that Protestant religions are more supportive of a high need for achievement than is the Catholic faith (McClelland 1961, Weber 1958). Therefore, we will operationalize need for achievement by the percentage of Protestants (see Parker 1997).” (Tellis, Stremersch, and Yin, 2003, page 198)

If you aren’t convinced that this is an example of a problematic lack of thought about potential prejudice, note that they say: “We use climate as a proxy for industriousness (reverse scaled).” (Tellis, Stremersch, and Yin, 2003, page 198). Researchers need to be more careful with their “theory”; they don’t want their data mining to be associated with prejudice.

Read: Gerry Tellis, Stefan Stremersch, and Eden Yin (2003) The International Takeoff of New Products: The Role of Economics, Culture, and Country Innovativeness, Marketing Science, 22 (2), 188-208