Big Data and Spurious Correlations
Suppose you read the following headline in a major newspaper. What would you think?
Student Test Scores Tied to Number of Bathrooms in their Homes
Let us say the article is also accompanied by a chart showing this relationship.
Look at those near-perfect correlations. Should we start adding more bathrooms to help our children?
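Except there is no such study, though the real one comes very close. The x-axis is actually the family's income level. While we see a nice positive correlation between income and test scores, Harvard economics professor Greg Mankiw warns us about spurious correlation by using the number of bathrooms as a stand-in variable. Wealthier families tend to have both more bathrooms and higher test scores, so bathrooms can track scores without causing them.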
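To make the stand-in variable idea concrete, here is a minimal sketch in Python using made-up data. Everything in it is an assumption for illustration: the numbers are simulated, not taken from the study Mankiw discusses, and the names income, bathrooms and scores are hypothetical. Income drives both of the other variables, and bathrooms have no effect on scores at all.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


# Simulated, standardized data (an assumption, not real survey numbers).
rng = random.Random(0)
income = [rng.gauss(0, 1) for _ in range(5000)]
# Both variables depend on income plus independent noise.
bathrooms = [i + rng.gauss(0, 0.5) for i in income]
scores = [i + rng.gauss(0, 0.5) for i in income]

# Looks impressive: bathrooms appear to "predict" test scores.
raw = pearson(bathrooms, scores)

# Control for income and the relationship should mostly disappear.
controlled = pearson(residuals(bathrooms, income), residuals(scores, income))

print(f"bathrooms vs scores:               {raw:.2f}")
print(f"same, after controlling for income: {controlled:.2f}")
```

The first number comes out strongly positive even though bathrooms never enter the formula for scores. Once income is accounted for, the second number falls close to zero, which is the signature of a stand-in variable.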
The problem with Big Data, including petabyte-scale data from many disparate sources, is that decision makers may end up finding spurious correlations like this one that are not relevant to their business or, worse, that lead them to invest in the wrong option.
Forbes columnist Douglas Merrill writes:
But who cares how much data you have? With too little data, you won’t be able to make any conclusions that you trust. With loads of data you will find relationships that aren’t real.
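Merrill's second point is easy to check with a small simulation. The sketch below is an illustration on pure random noise, not an analysis of any real dataset, and the helper name strongest_chance_correlation is made up for this example. It generates a target column and a number of unrelated columns, then reports the strongest correlation it can find by searching through them.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def strongest_chance_correlation(n_rows, n_cols, seed=0):
    """Largest |correlation| between a random target and n_cols random columns.

    Every column is independent noise, so any correlation found is spurious.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_cols):
        column = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(target, column)))
    return best


# Same 30 observations of the target, but far more variables to search through.
few = strongest_chance_correlation(n_rows=30, n_cols=5)
many = strongest_chance_correlation(n_rows=30, n_cols=2000)

print(f"best of 5 random variables:    {few:.2f}")
print(f"best of 2000 random variables: {many:.2f}")
```

None of these variables has anything to do with the target, yet the best match among 2000 of them will typically look like a respectable finding. The more variables you collect and search, the more certain you are to find a relationship that is not real.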
What we see with the Big Data hype is the spread of the idea that the growing volume and variety of data obliges an enterprise to collect all of it and analyze all of it. This pushes decision makers to collect any and all data just because technologies like Hadoop now make it possible to store it. Gil Press, who runs the What is the Big Data blog, writes:
I’m not sure how much this misguided excitement around big data is a clear and present danger to science right now. But the threat to sound business decisions is quite evident
It is important for decision makers to look beyond the Big Data hype and not collect every bit of data just because it is there and they can. Start Big Data projects with a key decision in hand and an informed hypothesis about it. Enterprises already collect plenty of data whose value they have not yet unlocked.
Start with what you already have to test your hypothesis, instead of storing petabytes more data that brings its own spurious correlations.
See here for how you can make the most of your existing data with Symantec Enterprise Solution for Hadoop.