Big Data Storage - A Conversation with the CTO of Symantec's Storage and Availability Management Group
What does Big Data mean to enterprises?
How does this change their storage architecture?
Does taking advantage of Big Data analytics require a cluster of thousands of compute nodes?
The answers often create more confusion than clarity, and they depend on who you ask. In an attempt to bring clarity to the conversation, I sat down with V.R. Satish, CTO of the Storage and Availability Management Group at Symantec, to get his thoughts on how enterprises should approach the Big Data storage problem.
Here are some excerpts from my conversation.
Rags: Satish, we keep hearing about the data deluge and about data growth and variety. How do you think enterprises should approach Big Data?
Satish: As with any business decision, start with the problem you are trying to solve, and focus on the value to be realized and on how it will help you better serve your internal and external customers. How you approach the problem and which technologies you employ are secondary. Unfortunately, what we see in blogs and the media is a technology-centric conversation that has crowded out the business needs.
When you think about the business needs, I hear two key ones from enterprises: one, how do I manage Big Data, and two, how do I apply analytics to it to drive my top line? You mentioned the data deluge; the first and foremost concern of datacenter architects should be not letting the Big Data problem become a Big Management problem. We all love technologies and gadgets, but let us not put them before business needs.
Rags: Now I am going to commit the same mistake of focusing on technology – Hadoop …
Satish: Let me stop you there and address the elephant in the room (pun intended), as it appears most people, like yourself, are fixated on it. Hadoop is not Big Data. It is a combination of a) MapReduce, a compute paradigm, a design pattern if you will, and b) HDFS, a storage architecture. These are two separate components with a well-defined interface for how to get data into and out of the storage infrastructure. Unfortunately, the two seem to be inextricably intertwined in people's minds, and that is where the confusion starts.
With parallel and distributed computing design patterns, you take advantage of multiple compute nodes working on the same problem with different parts of the data. Why should that be commingled with how the data is stored?
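[Aside for readers: the separation Satish describes is easy to see in code. Below is a minimal sketch of the MapReduce pattern itself, written in plain Python for illustration; the map and reduce steps are ordinary functions over key-value pairs, and the storage choice (HDFS, a shared cluster file system, or a local file) lives entirely behind how the input lines are obtained. The file name input.txt is a hypothetical stand-in, not anything from the conversation.]

    # A minimal sketch of the MapReduce pattern, independent of storage.
    # The input can come from HDFS, a shared cluster file system, or a
    # plain local file; the compute pattern below does not change.
    from itertools import groupby
    from operator import itemgetter

    def map_phase(lines):
        # Map: emit (key, value) pairs; here (word, 1) for a word count.
        for line in lines:
            for word in line.split():
                yield (word.lower(), 1)

    def reduce_phase(pairs):
        # Shuffle/sort: group the pairs by key, then reduce each group.
        ordered = sorted(pairs, key=itemgetter(0))
        for word, group in groupby(ordered, key=itemgetter(0)):
            yield (word, sum(count for _, count in group))

    if __name__ == "__main__":
        # The storage choice lives entirely behind this open() call;
        # "input.txt" is a hypothetical local file used for illustration.
        with open("input.txt") as lines:
            for word, total in reduce_phase(map_phase(lines)):
                print(word, total)

In a real Hadoop deployment the framework, not this script, handles the distribution, shuffle and fault tolerance, but the point stands: nothing in the pattern dictates where the bytes live.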
Rags: … but isn’t a shared-nothing storage architecture what facilitates, or as some say is essential to, parallel computing?
Satish: Absolutely not. Those who say that are confusing the need to deliver data efficiently to the compute nodes with a particular technology choice. Shared or not shared is not the issue, as long as you are not compromising on I/O throughput and latency. I/O is also a physics problem, and it is already being addressed by disruptions like SSDs and faster interconnects like InfiniBand. Achieve parallel computing by adding compute heads, and achieve higher I/O throughput through a combination of hardware and software.
I am not saying a shared-nothing storage architecture has no role. Shared and shared-nothing are two different storage architectures with different applications based on needs. A social media business may well manage everything with a shared-nothing architecture and build its own IT team and in-house storage expertise to manage its thousands of hosts. For enterprises, well, let us just say enterprises do not run social media.
Enterprises have more stringent needs: they need to adhere to standards, they cannot afford to run multiple different architectures, and they need to ensure the data is always available, protected and backed up. Big Data does not make these needs go away. Shared storage, with the intelligence in software like our Cluster File System, continues to be the best bet for enterprise Big Data needs.
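[Aside for readers: one concrete way to see why shared storage can slot in under Hadoop is that Hadoop's storage backend is pluggable behind its FileSystem interface, selected in core-site.xml. The sketch below is a hedged illustration, not Symantec's documented integration: pointing fs.defaultFS (fs.default.name in older releases) at file:/// makes jobs run against whatever file system is mounted locally, such as a hypothetical shared cluster file system mount at /mnt/cfs.]

    <configuration>
      <!-- A sketch: the storage backend is chosen here, not baked into
           MapReduce. file:/// selects the locally mounted file system. -->
      <property>
        <name>fs.defaultFS</name>
        <value>file:///</value>
      </property>
    </configuration>

Jobs would then read and write paths such as /mnt/cfs/input, with availability, protection and backup coming from the underlying shared storage rather than from HDFS replication.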
Rags: And your thoughts on the cost argument?
Satish: Then solve the cost problem rather than running away from it and creating a completely new set of problems.
Rags: So we have established that enterprises should start with the Big Data business needs and choose the best storage solution from there, not the other way around. Any final thoughts for datacenter architects?
Satish: Challenge the notion that Big Data means thousands of nodes. What enterprise workloads would really need that much compute capacity? And if one did, what kind of management problems would such an architecture pose? Can you imagine keeping all of those nodes up to date with the right OS patches and software versions, replacing failed drives, and so on? I will end with what I said before: don’t let the Big Data problem become a Big Management problem.