Pitfalls of Observational Big Data Analysis
The siren song of Big Data analysis is,
"Don't filter data before you collect, don't try to decide whether or not certain data is relevant, collect everything. Analysis of such large volumes of data is bound to find something interesting".
Let us look at a nice, simple study reported recently about cyclists wearing helmets. This comes to us from an article in The Wall Street Journal. The main finding is,
"Bike helmets make men ride faster".
The question we need to ask about such a causation claim is how the study was conducted. It falls in the category of Big Data analysis conducted on large volumes of unrelated data, just because the data is available.
Data was collected daily at seven locations, each equipped with two cameras programmed to detect moving objects, isolate cyclists and calculate their speed. Cyclists were photographed from above and behind. Cycling speed of helmeted men averaged 11.9 miles an hour compared with 10.4 miles an hour for unhelmeted men.
As you can see, this is a lot of data collection and can easily fall in the realm of Big Data. The problem with the claim on helmets and speed is that it draws a plausible-sounding causal conclusion from observational data. We do not know whether the difference is statistically significant and, even if it is, whether there are other reasons that explain it. For instance, do those who tend to ride faster also tend to wear helmets?
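To make the alternative explanation concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration, not data from the study: the split into "sporty" and "casual" riders, the helmet-wearing rates, and the speed distributions. In this toy world a helmet has no effect on speed at all, yet the raw comparison still shows helmeted riders going faster, because the same kind of rider tends to do both.

```python
import random
from statistics import mean

def simulate_riders(n, seed=0):
    """Generate hypothetical riders as (sporty, helmet, speed_mph) tuples.

    Assumed for illustration only: half of riders are "sporty"; sporty
    riders wear helmets 80% of the time, casual riders 30% of the time.
    Speed depends on rider type alone, never on the helmet.
    """
    rng = random.Random(seed)
    riders = []
    for _ in range(n):
        sporty = rng.random() < 0.5
        helmet = rng.random() < (0.8 if sporty else 0.3)
        speed = rng.gauss(13.0 if sporty else 10.0, 1.5)  # helmet plays no role
        riders.append((sporty, helmet, speed))
    return riders

def mean_speed(riders, helmet, sporty=None):
    """Average speed for a helmet status, optionally within one rider type."""
    return mean(
        speed
        for is_sporty, has_helmet, speed in riders
        if has_helmet == helmet and (sporty is None or is_sporty == sporty)
    )

riders = simulate_riders(20_000)

# Raw comparison, the kind the cameras could produce.
raw_gap = mean_speed(riders, True) - mean_speed(riders, False)
print(f"helmeted minus unhelmeted, all riders: {raw_gap:.2f} mph")

# Same comparison within each rider type, which the cameras cannot see.
gaps_by_type = {}
for sporty in (True, False):
    gaps_by_type[sporty] = (
        mean_speed(riders, True, sporty) - mean_speed(riders, False, sporty)
    )
    label = "sporty" if sporty else "casual"
    print(f"helmeted minus unhelmeted, {label} riders: {gaps_by_type[sporty]:.2f} mph")
```

The gap shows up in the pooled numbers and largely disappears once riders are compared within their own type. The cameras in the study had no way to record anything like rider type, which is why the observational result cannot settle the causal question.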
Take this in the context of business data analysis, where you study customer behavior to improve marketing and product strategy. Like the researchers who photographed cyclists, enterprises have access to lots of data about their customers. Petabytes of data may surface interesting correlations, but they do not guarantee relevant business insights. With a large enough data set, even small differences will appear statistically significant. Such an analysis may point to a completely different product strategy than what your customers are asking for.
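The point about sample size can be checked with a few lines of arithmetic. The sketch below uses a simple two-sample z-test and assumes equal group sizes and a known, shared standard deviation; the order values (50.10 versus 50.00, standard deviation 20) are invented for illustration. The difference between the groups never changes, only the amount of data does.

```python
import math

def two_sample_z_p_value(mean_a, mean_b, sd, n_per_group):
    """Two-sided p-value for a difference in means.

    Assumes both groups have the same known standard deviation `sd`
    and the same size `n_per_group` (a simplification for illustration).
    """
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = (mean_a - mean_b) / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical average order values for two customer segments.
for n in (100, 10_000, 1_000_000, 100_000_000):
    p = two_sample_z_p_value(50.10, 50.00, 20.0, n)
    print(f"n per group = {n:>11,d}   p-value = {p:.4g}")
```

A ten-cent difference on a fifty-dollar order is unlikely to matter to the business at any sample size, but with enough rows the test will flag it as significant anyway. Statistical significance measures how sure we are that a difference is not zero, not whether it is large enough to act on.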
The cautionary note here is that driving your business decisions with data does not mean collecting petabytes of it. It means applying your domain knowledge to form key hypotheses and testing them against data you already have.