The biggest driver of Hadoop adoption is its promise of unlocking value from an enterprise’s vast data store. Use cases that show incremental revenue from data analysis are well publicized, and every organization wants to use data analytics in the same way to drive its own revenue. Promises aside, Hadoop storage has severe issues that call into question its place in the enterprise datacenter.
- Increased storage and server sprawl – A Hadoop cluster is built from numerous commodity hosts, each with its own direct-attached storage. Just when datacenter architects have spent considerable time and resources consolidating their datacenters and reducing footprint through server consolidation, virtualization and private cloud, Hadoop requires them to build out a massively parallel system with hundreds or even thousands of compute nodes. Managing these nodes and keeping them up to date with the right software is a resource-intensive task.
- Make that three times as much sprawl – A commodity-everything approach trades away resiliency. To mitigate that, Hadoop by default stores three copies of everything it is asked to store, so raw storage is three times the size of the data. Because compute and storage scale together, node by node, adding nodes for storage capacity also adds compute that may not be needed, resulting in poor CPU utilization.
- Big Data, Big Moves – The first step in making this massively parallel architecture work is to feed it data. Big Data means big moves, from wherever the data is stored to the Hadoop cluster for analysis. In fact there are too many moves, because Hadoop only supports batch processing, and the moves are complex because standard tools are not supported. Combined with three-way replication, that is a lot of data movement, which adds cost and complexity.
- Not highly available – Data is distributed across multiple nodes, but only one node in the cluster, the NameNode (Hadoop’s metadata server), knows how it is distributed. All applications must go through this single NameNode to access data, which makes the NameNode both a performance bottleneck and a single point of failure. Who is there to restart the NameNode when it fails in the middle of the night?
- No support for backup – Recommendations for mitigating the cost and complexity of data moves include using the Hadoop cluster as the primary store for the data. The problem? No reliable backup solution for a Hadoop cluster exists. Hadoop’s practice of storing three copies of data is not the same as backup: it does not provide archiving or point-in-time recovery.
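The cost of three-way replication can be put in rough numbers. The sketch below is only a back-of-the-envelope estimate: the 500 TB data set and the 24 TB of disk per node are hypothetical figures chosen for illustration, and the function names are not part of any Hadoop API. It also ignores real-world overheads such as temporary space, operating system space and compression.

```python
import math


def raw_capacity_tb(data_tb: float, replication: int = 3) -> float:
    """Raw disk needed to hold data_tb of data at the given replication factor."""
    return data_tb * replication


def nodes_needed(data_tb: float, disk_per_node_tb: float, replication: int = 3) -> int:
    """Whole number of nodes needed to supply the raw capacity.

    Each node brings its own CPUs along with its disks, so this is also
    the amount of compute that gets provisioned, whether it is used or not.
    """
    return math.ceil(raw_capacity_tb(data_tb, replication) / disk_per_node_tb)


if __name__ == "__main__":
    data_tb = 500           # hypothetical size of the data set
    disk_per_node_tb = 24   # hypothetical usable disk per commodity node

    single_copy = nodes_needed(data_tb, disk_per_node_tb, replication=1)
    three_copies = nodes_needed(data_tb, disk_per_node_tb, replication=3)

    print(f"Raw capacity at 3 copies: {raw_capacity_tb(data_tb):.0f} TB")
    print(f"Nodes for 1 copy:  {single_copy}")
    print(f"Nodes for 3 copies: {three_copies}")
```

Under these assumed figures, storing three copies roughly triples the node count, and every one of those extra nodes adds CPUs that the analytics workload may never use.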
These issues make it unattractive to have this elephant in an enterprise’s datacenter.
- Wouldn’t it be better to leverage existing infrastructure without adding to the datacenter sprawl?
- Wouldn’t it be better to avoid over-provisioning both storage and compute capacity?
- Wouldn’t it be better to leave the data where it is and run analytics on it, avoiding expensive data moves?
- Wouldn’t it be better to make Hadoop highly available, with no single point of failure?
- Wouldn’t it be better to use standard tools to back up data and support archiving and point-in-time recovery?