Storage and Availability Management

Enterprise Hadoop - Five issues with Hadoop that need addressing

Created: 25 May 2012 • Updated: 28 May 2012 • 1 comment
By Rags Srinivasan

The biggest driver for adoption of Hadoop is its promise of unlocking value from an enterprise’s vast data store. Use cases showing incremental revenue from data analysis are well publicized, and every organization wants to leverage the power of data analytics to drive its revenues. Promises aside, Hadoop storage has severe issues that call into question its place in the enterprise datacenter.

  1. Increased storage and server sprawl – A Hadoop cluster is built from numerous commodity hosts, each with its own direct-attached storage. Just when datacenter architects have spent considerable time and resources consolidating their datacenters and reducing footprint through server consolidation, virtualization and private cloud, Hadoop requires them to build out a massively parallel system with hundreds or even thousands of compute nodes. Managing these numerous nodes and keeping them up to date with the right software is a resource-intensive task.
  2. Make that three times as much sprawl – A commodity-everything approach trades away resiliency. To mitigate that, Hadoop stores three copies of everything it is asked to store – three times as much storage as data. And since compute and storage scale together, there is a risk of over-provisioning compute capacity, resulting in poor CPU utilization.
  3. Big Data, Big Moves – The first step in making this massively parallel architecture work is to feed it data. Big Data means big moves, from wherever data is stored to the Hadoop cluster for analysis. In fact there are too many moves, because Hadoop only supports batch processing, and the moves are complex because of the lack of support for standard tools. Combine this with three-times replication of the data, and that is a lot of data movement adding to cost and complexity.
  4. Not highly available – Data is distributed across multiple nodes, but there is only one NameNode in the cluster – Hadoop’s metadata server – that knows where the data lives. All applications must go through this single NameNode to access data, making it both a performance bottleneck and a single point of failure. Who is there to restart the NameNode when it fails in the middle of the night?
  5. No support for backup – Recommendations for mitigating the cost and complexity of data moves include using the Hadoop cluster as the primary store for the data. The problem? No reliable backup solution for a Hadoop cluster exists. Hadoop’s practice of storing three copies of data is not the same as backup: it provides neither archiving nor point-in-time recovery.
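The three-times replication in issue 2 translates directly into raw-capacity math. A minimal sketch of that arithmetic (the replication factor of 3 is HDFS's default; the scratch-space fraction for intermediate job output is an illustrative assumption, not a figure from this article):

```python
def raw_storage_needed(dataset_tb, replication=3, scratch_fraction=0.25):
    """Raw disk capacity (TB) needed to hold a dataset in HDFS.

    replication: HDFS default is 3 copies of every block.
    scratch_fraction: assumed headroom for MapReduce intermediates.
    """
    replicated = dataset_tb * replication      # 3 copies of everything
    scratch = dataset_tb * scratch_fraction    # temp space for job output
    return replicated + scratch

# Storing 100 TB of data with the default 3x replication:
print(raw_storage_needed(100))  # 325.0 TB of raw disk
```

In other words, before a single query runs, a 100 TB data set already demands over 300 TB of spindles.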

These issues make it unattractive to have this elephant in an enterprise’s data center.  

  • Wouldn’t it be better to leverage existing infrastructure without adding to the datacenter sprawl?
  • Wouldn’t it be better to avoid over-provisioning both storage and compute capacity?
  • Wouldn’t it be better to leave the data where it is and run analytics on it, avoiding expensive data moves?
  • Wouldn’t it be better not to sacrifice high availability, making Hadoop highly available with no single point of failure?
  • Wouldn’t it be better to use standard tools to back up data and support archiving and point-in-time recovery?

Comments (1)

tleffing1:

All very relevant points that I agree with totally. Let's add some perspective on how this goes "backwards" in terms of the savings that have been achieved in reducing floorspace, power, and cooling demands with server virtualization. Kiss goodbye any tax breaks that local power companies may have extended for being "green" within the data center.

Cost savings in one area quite often have a way of being offset in another area – I refer to this as "pouring water on an anthill", and I've also heard it called "squeezing a balloon". Bottom line, the ROI needs to be looked at objectively and holistically.

For example, a 42U rack theoretically holds 21 servers at 2U each. This is the "normal" config we are seeing our customers use for building HDFS clusters. Of course, depending on whether in-rack switches are used, and on customer preferences for rack PDUs, this number could be slightly lower. We are seeing customers with clusters already in the multiple hundreds of nodes and still growing!

Now take into account that each of those "off the shelf" servers is fully populated with internal disks (perhaps 16 or 18, depending on the server vendor, with Gig or 10 Gig Ethernet), and you're looking at racks and racks of "storage servers" – each generating more heat, and requiring more cooling, than before, because the vast number of internal disks demands more power and cooling than when these were just virtualized "application servers". Factor in the 3x storage requirement and you're looking at a huge number of servers in a Hadoop architecture, and each rack is going to generate a lot of heat that has to be dissipated – can your data center handle this?
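The back-of-envelope math in this comment can be sketched as follows. The 21-servers-per-42U-rack figure and the 3x replication are from the discussion above; the disk count and disk size are illustrative assumptions:

```python
RACK_UNITS = 42    # standard rack height
SERVER_UNITS = 2   # 2U servers, per the comment above

def cluster_footprint(usable_tb, disks_per_server=16, disk_tb=2.0,
                      replication=3):
    """Servers and racks needed for a given usable HDFS capacity.

    disks_per_server and disk_tb are assumed values for illustration.
    """
    raw_needed = usable_tb * replication       # 3 copies of everything
    per_server = disks_per_server * disk_tb    # raw TB per node
    servers = -(-raw_needed // per_server)     # ceiling division
    racks = -(-servers // (RACK_UNITS // SERVER_UNITS))
    return int(servers), int(racks)

# 1 PB usable at 3x replication, 16 x 2 TB disks per node:
servers, racks = cluster_footprint(1000)
print(servers, racks)  # 94 servers across 5 racks
```

Five full racks of spinning disks for one petabyte of usable data is exactly the power-and-cooling picture the comment describes.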

Bottom line, are all the variables that drive cost being considered as customers look at this architecture?  Whether it's power and cooling, operational efficiency, minimizing risk and maximizing application availability, customers just need to look at this from all the angles.
