Netting Out NetBackup

Global Deduplication Myths

Created: 25 Feb 2013 • 7 comments
By Alex Sakaguchi

 

Don’t customers hate being misled?

I know I do.  Sometimes it can be innocent…you know, like maybe the sales person wasn’t as knowledgeable as he/she could’ve been.  Or perhaps they were new.  In any case, it behooves the customer to do some homework to make sure that they are not being misled, innocently or otherwise.

Your homework is done.

I came across a situation recently where a customer said a vendor had told them their solution could do global deduplication the same as Symantec, but cheaper.  My first thought was, wow, that's a big deal.  As you may know, the Symantec deduplication capabilities built into NetBackup and Backup Exec give customers the flexibility to dedupe at the client, server, or target, and can dedupe seamlessly across workloads like physical and virtual infrastructures (see the V-Ray video here for more info).  On top of that, if your dedupe storage capacity on a single device is maxed out, Symantec can add another device to increase capacity and compute resources, and to the customer it still appears as a single dedupe storage pool – global deduplication.
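To make that idea concrete, here is a minimal Python sketch of the general concept, not NetBackup's actual implementation; the class and the placement scheme are hypothetical.  Every unique block fingerprint is owned by exactly one node, so adding a node grows capacity while the whole thing still behaves like a single deduplicated pool.

```python
# Minimal sketch of the idea behind a single "global" dedupe pool spanning
# multiple storage nodes. Illustrative only -- not NetBackup internals.
import hashlib


class GlobalDedupePool:
    def __init__(self, node_count):
        # each "node" is just a dict of fingerprint -> block in this sketch
        self.nodes = [{} for _ in range(node_count)]

    def _owner(self, fingerprint):
        # deterministic placement: a given fingerprint always maps to the
        # same node, so no unique block is ever stored on two nodes
        return self.nodes[int(fingerprint, 16) % len(self.nodes)]

    def store(self, block):
        fp = hashlib.sha256(block).hexdigest()
        node = self._owner(fp)
        if fp not in node:            # write only if globally unseen
            node[fp] = block
        return fp


pool = GlobalDedupePool(node_count=2)
pool.store(b"the same block of data")
pool.store(b"the same block of data")   # deduped: still stored exactly once
print(sum(len(n) for n in pool.nodes))  # 1
```

A real pool would use consistent hashing or a shared fingerprint index so that adding a node does not force existing blocks to move, but the point is the same: one logical pool, and no unique block stored twice.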

Anyhow, the customer asked if this was true.  Quite honestly, I too needed to do some homework to answer that question…what I found out was pretty disturbing.

First off, the vendor in question was not using the term “global deduplication” correctly; what they were actually describing was plain old deduplication, not even bordering on global, which I’ll get to in a minute.

According to the vendor’s documentation, a customer would need to manually set a dedupe block size for all data sources in order to employ “global deduplication”.  Furthermore, the default and recommended size was 128KB.  For the record, global deduplication refers to the ability to deduplicate across dedupe storage devices so that no two devices contain the same blocks of data.  Here’s a generic definition from TechTarget:

“Global data deduplication is a method of preventing redundant data when backing up data to multiple deduplication devices.”

What the vendor is saying is that you can have multiple data sources (like VMware data, file system data, databases, etc.) feeding into a single dedupe storage pool, where the dedupe block size is set to 128KB, and those multiple data sources will dedupe against one another.  But that’s NOT global deduplication, that’s regular deduplication. 

Global deduplication in this example comes into play when the storage capacity of our 128KB chunk-sized pool is reached and we need to stand up another.  Can the customer see both of those storage devices as a single pool, without any data redundancies across them, or not?  If the answer is no, then the vendor cannot provide global dedupe capabilities.  And unfortunately, such was the case with our vendor in question.
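Put another way: if you could export the list of block fingerprints held by each device, true global dedupe means those lists never overlap.  A tiny hypothetical check (the fingerprint export itself is assumed, and the function name is mine):

```python
# Hypothetical test for the property described above: with global dedupe,
# no block fingerprint should appear on more than one device in the pool.
def is_globally_deduplicated(fingerprints_per_device):
    seen = set()
    for device_fps in fingerprints_per_device:
        if seen & set(device_fps):    # same block stored on two devices
            return False
        seen |= set(device_fps)
    return True

print(is_globally_deduplicated([{"fp1", "fp2"}, {"fp3", "fp4"}]))  # True
print(is_globally_deduplicated([{"fp1", "fp2"}, {"fp2", "fp3"}]))  # False
```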

The interesting thing was that even though this inquiry started as a result of a question on comparative global dedupe capabilities, I uncovered some other points of information that may cause you to think twice when purchasing from this vendor.

I’ve organized these into the chart below for ease of understanding:

Data Source/Workload              Recommended Block Size
File systems                      128KB
Databases (smaller than 1TB)      128KB
VMware data                       32KB
Databases (1-5TB in size)         256KB
Databases (larger than 5TB)       512KB

As you can see above, the vendor is recommending those specific dedupe block sizes to maintain an optimal dedupe efficiency level for each data source.  What this means is that:

  1. If you want dedupe efficiency within data sources, you have to manually configure and manage multiple dedupe storage pools (that’s a lot of management overhead, by the way), and
  2. You’ll likely have duplicate data stored, because your VMware data at 32KB is not going to dedupe with your file system data at 128KB (as the short sketch after this list illustrates), and lastly
  3. If you go ahead and use the same block size everywhere (the 128KB the vendor recommends for their “global dedupe”), your dedupe efficiency is lost, because 128KB is only optimal for file systems and databases smaller than 1TB, not for anything else.
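Point 2 is easy to demonstrate with a generic fixed-block chunker (an illustration of the concept, not the vendor’s code): the very same bytes chunked at 32KB and at 128KB produce completely different chunk fingerprints, so the two pools have nothing to dedupe against each other.

```python
# Illustration of point 2: identical data chunked at two different fixed
# block sizes yields disjoint sets of chunk fingerprints, so a 32KB pool
# and a 128KB pool cannot dedupe against one another.
import hashlib
import os

def fixed_block_fingerprints(data, block_size):
    return {hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)}

data = os.urandom(1024 * 1024)                       # 1MB of sample "backup" data
fp_32k = fixed_block_fingerprints(data, 32 * 1024)
fp_128k = fixed_block_fingerprints(data, 128 * 1024)
print(len(fp_32k & fp_128k))                         # 0 -- no shared blocks
```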

This is the problem with “content-aligned” deduplication.  Because this particular vendor cannot instead be “content-aware” and efficiently deduplicate source data without manual configuration of block sizes, there is certainly no hope for the vendor to claim global deduplication capabilities…unless the attempt is made to redefine the term.

A better way

With Symantec, the customer would not have to worry about this scenario at all.  It doesn’t matter if the data source is coming from a physical machine or virtual.  It doesn’t matter if the database is large or small, or if it’s just file system data.  Symantec is able to look deep into the backup stream and identify the blocks for which a copy is already stored, and store only the ones that are unique.  No block size limitations or inefficiencies between policies.   This means that you get the best in dedupe storage efficiency with the lowest management overhead.

Symantec calls its approach end-to-end, intelligent deduplication because we can deliver data reduction capabilities at the source, media server, or even on target storage (via our OpenStorage API).  We gain intelligence from content-awareness of the data stream for backup efficiency.  And of course, we deliver global deduplication capabilities.
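For readers who want a feel for the difference between “content-aligned” and “content-aware”, here is a generic sketch of content-defined chunking, a common industry technique in which chunk boundaries come from the data itself rather than from a fixed block size.  It is a conceptual illustration only, not a description of Symantec’s V-Ray implementation, and the function and parameters are my own.

```python
# Generic content-defined chunking sketch: boundaries are chosen where a
# rolling hash of recent bytes hits a target pattern, so identical content
# still lines up (and dedupes) even when it shifts position in the stream.
# Conceptual illustration only -- not Symantec's V-Ray implementation.
import hashlib
import os

def content_defined_fingerprints(data, mask=0x0FFF, min_chunk=64):
    fingerprints, start, rolling = set(), 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF   # toy rolling hash
        if i - start >= min_chunk and (rolling & mask) == 0:
            fingerprints.add(hashlib.sha256(data[start:i]).hexdigest())
            start, rolling = i, 0
    fingerprints.add(hashlib.sha256(data[start:]).hexdigest())
    return fingerprints

base = os.urandom(256 * 1024)
shifted = os.urandom(100) + base        # same content, pushed 100 bytes along
a = content_defined_fingerprints(base)
b = content_defined_fingerprints(shifted)
print(f"{len(a & b)} of {len(a)} chunks still match despite the shift")
```

With fixed 32KB or 128KB blocks, that 100-byte shift would break every chunk boundary and nothing would dedupe; with boundaries driven by the content, most chunks still match.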

More resources:

Symantec Deduplication

NetBackup Platform

Backup Exec Family

NetBackup Appliances

Comments (7)

jadlip wrote:

Hi Alex,

Could you please help me?  I have one new appliance.  I don't want to configure it as a separate disk pool; I want to extend my existing 5220 pool with the new 5220.

Is it possible, and if yes, how?

Thx

Thank you,

Pradeep Jadli

StefanosM wrote:

EDIT: Ok I have to read more carefully.

Alex,

For VMware backups the block size must be configured on the backup proxy host.  Is there any global configuration file that I can set on the storage server to override the clients' configuration?

I will install a test MSDP server to test VMs with 32K, but I do not want to change the proxy host at this time.

Thanks

Andrew Madsen wrote:

Jadlip,

The 5220 devices cannot be combined into a single pool. If you want a larger single pool (in the 5220 it is called an MSDP or Media Server Deduplication Pool) then you need to add a 24TB or 36TB shelf to the existing 5220.

What Alex is referring to is the 5020, or PDDO pool.  It can expand (today) to 192TB.  

The above comments are not to be construed as an official stance of the company I work for; hell half the time they are not even an official stance for me.

jadlip wrote:

Thanks Alex :)

Thank you,

Pradeep Jadli

Alex Sakaguchi wrote:

Wow, thanks for all the comments.  And thanks to those that have answered questions too.  Sorry I'm checking back so long after those questions were posted.  Please feel free to message me directly if you still have questions.

Thanks,

Alex Sakaguchi

Product Marketing Manager, +NetBackup

Daniel Banche wrote:

Hi Alex.

If I were to pool together multiple 64TB MSDPs behind one node, the deduplication would be globally calculated across all the MSDPs, wouldn't it?  There wouldn't be duplicate copies of data residing on multiple MSDPs.

I guess the question is, how many 64TB MSDPs can be aggregated together with true global deduplication?

A very interesting blog.
