Data Without Theory

My final note on Gary Smith’s impressive Standard Deviations book concerns an important point that statistically inclined people often seem to miss. He is keen to note that data isn’t enough on its own. “Data without theory… is treacherous” (Smith, 2014, page 233).

Smith describes a case where a cholera outbreak was statistically associated with people not leaving their villages a few days before. Taken at face value, this suggests tracking movement between villages and predicting cholera whenever movement drops. That is a lot of work when a simple piece of “theory” (by which I mean thought about causes) tells us what is happening without massive amounts of number crunching. The explanation is pretty simple. Floods come and people stop leaving their villages; then cholera arrives, borne by the flood water. We can predict cholera from easily observed floods; we don’t need to capture movement data.
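To make the flood story concrete, here is a small simulation sketch. Every number in it is invented for illustration and none comes from Smith's book: I simply assume floods happen in 20% of weeks, that floods make people far more likely to stay home, and that floods make cholera far more likely. Staying home has no effect on cholera at all in this toy world, yet it still looks like a strong predictor until floods are taken into account.

```python
import random


def simulate(n_weeks=20000, seed=0):
    """Generate (flood, stayed_home, cholera) for each simulated week.

    All probabilities are made-up assumptions, chosen only to illustrate
    a common cause driving two otherwise unrelated observations.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_weeks):
        flood = rng.random() < 0.2
        # Floods keep people in their villages...
        stayed_home = rng.random() < (0.9 if flood else 0.1)
        # ...and floods, not staying home, bring cholera.
        cholera = rng.random() < (0.5 if flood else 0.02)
        rows.append((flood, stayed_home, cholera))
    return rows


def outbreak_rate(rows, keep):
    """Share of the selected weeks in which cholera occurred."""
    selected = [r for r in rows if keep(r)]
    return sum(r[2] for r in selected) / len(selected)


rows = simulate()

# Ignoring floods: staying home appears to predict cholera.
rate_home = outbreak_rate(rows, lambda r: r[1])
rate_moving = outbreak_rate(rows, lambda r: not r[1])

# Looking only at flood weeks: movement makes little difference.
rate_flood_home = outbreak_rate(rows, lambda r: r[0] and r[1])
rate_flood_moving = outbreak_rate(rows, lambda r: r[0] and not r[1])

print(f"cholera rate, stayed home:        {rate_home:.3f}")
print(f"cholera rate, moved about:        {rate_moving:.3f}")
print(f"flood weeks only, stayed home:    {rate_flood_home:.3f}")
print(f"flood weeks only, moved about:    {rate_flood_moving:.3f}")
```

The first two rates differ sharply, while the two flood-only rates are close to each other. The data alone would happily support the movement story; it is the causal theory that tells us to condition on floods.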

Thinking, in the sense of developing a causal theory, allows us to use the data much more effectively. Theory without data is a problem that can afflict academics: we can become divorced from reality.

Data without theory is also a problem: you can end up believing and doing some pretty silly things. As Smith says, it can fuel bubbles: we don’t know why the price is so high, but it keeps going up, so we assume it will continue to do so. Always aim to come up with an at least plausible theory of what is causing whatever you observe in the data before putting too much faith in the result.

Read: Gary Smith, 2014, Standard Deviations: Flawed Assumptions, Tortured Data and Other Ways to Lie With Statistics, The Overlook Press.