Options for Hadoop Storage - Looking Beyond Direct Attached Storage
In my previous article I wrote about the five factors that limit Hadoop's role in the enterprise datacenter. To recap, the limitations are:
- Increased storage sprawl
- Three times as much storage
- Costly data moves
- Single point of failure
- No support for backup
The first three issues stem directly from the architecture recommended for Hadoop clusters: many compute nodes, each with its own embedded Direct Attached Storage (DAS). Enterprises choosing Hadoop are forced to accept these limitations in exchange for the power of Hadoop analytics. Is DAS the only choice despite its limitations?
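To put the "three times as much storage" limitation in concrete terms, here is a back-of-the-envelope sketch. It assumes HDFS's default replication factor of 3 and an illustrative 25% allowance for MapReduce scratch space (the exact overhead varies by workload):

```python
def raw_capacity_tb(data_tb, replication=3, temp_overhead=0.25):
    """Estimate raw disk needed for a given amount of HDFS data.

    replication: HDFS's default dfs.replication is 3.
    temp_overhead: assumed scratch space for MapReduce intermediates.
    """
    return data_tb * replication * (1 + temp_overhead)

# 100 TB of data ends up consuming 375 TB of raw disk
print(raw_capacity_tb(100))
```

With DAS, all of that raw capacity has to be provisioned inside the compute nodes themselves, which is where the storage sprawl comes from.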
Webster's research argues that Hadoop storage has issues, and that these issues can be addressed by using more robust and scalable storage platforms to support Hadoop clusters. Based on that research, Webster proposes a three-stage approach to Hadoop storage:
Stage 1: External high-performance storage arrays that still function as DAS
Stage 2: Address Hadoop's three-copies problem by using an external storage array for the primary copy, thereby limiting the size of DAS in the cluster
Stage 3: Use SAN or NAS storage instead of DAS
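The Stage 2 idea can be sketched in HDFS configuration terms: when an external array already protects the primary copy with its own redundancy, the cluster-wide replication default in `hdfs-site.xml` can be lowered. The value shown is illustrative; the right setting depends on the protection the array itself provides:

```xml
<!-- hdfs-site.xml (sketch): reduce HDFS-level copies when the
     external storage array provides its own redundancy -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```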
This is a very good migration path for an organization that has already gone down the road of building a distributed cluster of compute nodes with DAS. The template provides a way to fix the storage issues and regain the benefits that were forgone with the adoption of DAS.
Looking at this staged approach, it is clear that an enterprise does not have to start at Stage 1 and move sequentially. Many enterprises already have SAN storage and the experience of managing it. In other words, they are already starting at Stage 3. Why move from there to DAS, incurring additional investment and making costly trade-offs, only to return later to where they started?
This does not mean simply running HDFS as it exists today on a file system that supports SAN storage. What it means is that enterprises need a way to run Hadoop that utilizes SAN storage and delivers all of Hadoop's benefits without the trade-offs. This is exactly the solution we are building at Symantec: enabling enterprises to run Hadoop on their existing infrastructure.
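One way to picture the general idea (a sketch of Hadoop's stock capability, not the Symantec implementation) is that Hadoop can run MapReduce directly against any POSIX file system: if every node mounts the same shared file system at the same path, `fs.defaultFS` in `core-site.xml` can point at that mount instead of HDFS. The mount point shown is hypothetical:

```xml
<!-- core-site.xml (illustrative): point Hadoop at a shared
     cluster-file-system mount instead of HDFS -->
<property>
  <name>fs.defaultFS</name>
  <value>file:///mnt/cfs</value>
</property>
```

On its own this loses HDFS features such as data locality scheduling, which is why the real solution has to do more than swap the file system URI.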
The solution is built on Cluster File System, our high-performance file system for fast failover of applications and databases. Running Hadoop on Cluster File System enables:
- High Availability by providing access to metadata from all the nodes in the cluster, eliminating the single point of failure
- Workload distribution by providing high-performance access to data from all the nodes in the cluster, preserving the task distribution of MapReduce
- Efficient storage utilization through de-duplication, compression and, most importantly, avoiding the three-copies problem
- Storage hardware flexibility through support for most storage arrays on the market
- Simpler backups by providing snapshots
Our position is that the Cluster File System-based Hadoop solution enables enterprises to take advantage of Big Data analytics without the trade-offs.
See here to learn more about the solution and sign up for early access.