Backup and Recovery Community Blog

Global Deduplication Myths

Created: 21 Feb 2013 • Updated: 22 Feb 2013 • 2 comments
Alex Sakaguchi

Don’t customers hate being misled?

I know I do.  Sometimes it can be innocent…you know, like maybe the sales person wasn’t as knowledgeable as he/she could’ve been.  Or perhaps they were new.  In any case, it behooves the customer to do some homework to make sure that they are not being misled, innocently or otherwise.

Your homework is done.

I came across a situation recently where a customer said a vendor told them their solution could do global deduplication the same as Symantec, but cheaper.  My first thought was: wow, that’s a big deal.  As you may know, the Symantec deduplication capabilities built into NetBackup and Backup Exec give customers the flexibility to dedupe at the client, server, or target, and can dedupe seamlessly and efficiently across workloads like physical and virtual infrastructures (see the V-Ray video here for more info).  On top of that, if your dedupe storage capacity on a single device is maxed out, Symantec can add another device to increase capacity and compute resources, yet to the customer it still appears as a single dedupe storage pool – global deduplication.

Anyhow, the customer asked if this was true.  Quite honestly, I too needed to do some homework to answer that question…what I found out was pretty disturbing.

First off, the vendor in question was not using the term “global deduplication” correctly; what they were actually describing was plain old deduplication, not even bordering on global, which I’ll get to in a minute.

According to the vendor’s documentation, a customer would need to manually set a dedupe block size for all data sources in order to employ “global deduplication”.  Furthermore, the default and recommended size was 128KB.  For the record, global deduplication refers to the ability to deduplicate across dedupe storage devices so that no two devices contain the same blocks of data.  Here’s a generic definition from TechTarget:

“Global data deduplication is a method of preventing redundant data when backing up data to multiple deduplication devices.”

What the vendor is saying is that you can have multiple data sources (like VMware data, file system data, databases, etc.) feeding into a single dedupe storage pool where the dedupe block size is set to 128KB, and those multiple data sources will dedupe against one another.  But that’s NOT global deduplication, that’s regular deduplication.
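
To make that distinction concrete, here is a minimal Python sketch of plain fixed-block deduplication, assuming a single pool, a 128KB block size, and a simple fingerprint index.  The DedupePool class and its store method are illustrative names of my own, not any vendor’s actual API.

```python
import hashlib

# Illustrative constant: the vendor's default/recommended dedupe block size.
BLOCK_SIZE = 128 * 1024  # 128KB


class DedupePool:
    """One dedupe storage pool: a fingerprint index plus the unique blocks."""

    def __init__(self):
        self.index = {}  # fingerprint -> stored block

    def store(self, data: bytes) -> int:
        """Chunk a backup stream into fixed blocks and keep only unseen ones.

        Returns the number of new (unique) blocks actually stored.
        """
        new_blocks = 0
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            if fp not in self.index:    # duplicates are skipped, regardless
                self.index[fp] = block  # of which data source sent them
                new_blocks += 1
        return new_blocks


# Two "data sources" feeding the same pool dedupe against one another.
pool = DedupePool()
vmware_backup = b"A" * (512 * 1024)
file_backup = b"A" * (512 * 1024)   # identical content from a second source
print(pool.store(vmware_backup))    # 1 -- one unique 128KB block stored
print(pool.store(file_backup))      # 0 -- everything deduped against the pool
```

Any data source feeding this one pool dedupes against every other source in the same pool, which is exactly what the vendor is describing; nothing about it spans multiple devices.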

Global deduplication in this example comes into play when the storage capacity of our 128KB chunk-sized pool is reached and we need to stand up another device.  Can the customer see both of those storage devices as a single pool, without any data redundancies across them, or not?  If the answer is no, then the vendor cannot provide global dedupe capabilities.  And unfortunately, such was the case with the vendor in question.
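
For contrast, here is a hedged sketch of what that behavior implies: several devices presented as one logical pool, with every fingerprint checked against the whole set so that no block is ever stored on two devices.  The GlobalPool class and its naive placement scheme are assumptions made for illustration only, not how Symantec or any other vendor actually implements this.

```python
import hashlib

BLOCK_SIZE = 128 * 1024  # same illustrative 128KB chunk size as before


class GlobalPool:
    """Several dedupe devices exposed as one logical pool (conceptual only)."""

    def __init__(self, device_count: int = 1):
        # In this sketch a "device" is nothing more than its own index.
        self.devices = [dict() for _ in range(device_count)]

    def add_device(self):
        """Grow capacity; the pool still looks like a single dedupe target."""
        self.devices.append(dict())

    def _is_duplicate(self, fp: str) -> bool:
        # Global lookup: a block is a duplicate if ANY device already holds it.
        return any(fp in device for device in self.devices)

    def store(self, data: bytes) -> int:
        new_blocks = 0
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            if not self._is_duplicate(fp):
                # Place each new block on exactly one device (naive placement),
                # so no two devices ever hold the same block.
                self.devices[int(fp, 16) % len(self.devices)][fp] = block
                new_blocks += 1
        return new_blocks


pool = GlobalPool(device_count=1)
print(pool.store(b"B" * (256 * 1024)))  # 1 -- one unique block, on device 0
pool.add_device()                       # grow the pool with a second device
print(pool.store(b"B" * (256 * 1024)))  # 0 -- still deduped across the pool
```

The point of the sketch is the lookup: a block counts as a duplicate if any device in the pool already holds it, and adding a device grows capacity without ever creating a second copy.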

The interesting thing was that even though this inquiry started as a result of a question on comparative global dedupe capabilities, I uncovered some other points of information that may cause you to think twice when purchasing from this vendor.

I’ve organized these into the chart below for ease of understanding:

Data Source/Workload          | Recommended Block Size
File systems                  | 128KB
Databases (smaller than 1TB)  | 128KB
VMware data                   | 32KB
Databases (1-5TB in size)     | 256KB
Databases (larger than 5TB)   | 512KB

As you can see above, the vendor recommends those specific dedupe block sizes to maintain optimal dedupe efficiency for each data source.  What this means is that:

  1. If you want dedupe efficiency within each data source, you have to manually configure and manage multiple dedupe storage pools (that’s a lot of management overhead, by the way), and
  2. You’ll likely end up storing duplicate data, because your VMware data at 32KB is not going to dedupe with your file system data at 128KB (see the sketch after this list), and lastly
  3. If you go ahead and use the single block size the vendor recommends for their “global dedupe” (128KB), your dedupe efficiency is lost, because 128KB is only optimal for file systems and databases smaller than 1TB, not for anything else.
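
Point 2 is easy to demonstrate.  In the sketch below, the very same backup data is fingerprinted once at 32KB and once at 128KB, the sizes the vendor recommends for VMware and file system data; the two pools end up with no chunk fingerprints in common, so every duplicate block gets stored twice.  The data and helper function are made up purely for illustration.

```python
import hashlib
import os


def fingerprints(data: bytes, block_size: int) -> set:
    """Return the set of chunk fingerprints for a given fixed block size."""
    return {
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    }


data = os.urandom(1024 * 1024)                    # the same 1MB of backup data
vmware_pool = fingerprints(data, 32 * 1024)       # chunked at 32KB
filesystem_pool = fingerprints(data, 128 * 1024)  # chunked at 128KB

# Identical source data, yet not a single fingerprint in common, so the two
# pools cannot dedupe against each other:
print(len(vmware_pool & filesystem_pool))         # 0
```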

This problem is what is meant by “content-aligned” deduplication.  Given that this particular vendor is unable to be “content-aware” instead, and to efficiently deduplicate source data without manual configuration of block sizes, there is certainly no hope for the vendor to claim global deduplication capabilities…unless they attempt to redefine the term.
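
For readers curious what moving past fixed, manually tuned block sizes can look like in general, below is a toy Python sketch of content-defined chunking, a common technique for letting the data itself choose chunk boundaries via a rolling signature.  This is a generic illustration with arbitrary constants, not Symantec’s actual algorithm; true content-awareness of the backup stream format goes further than a toy like this.

```python
import hashlib
import os

# Arbitrary toy parameters, not tuned for anything real.
MIN_CHUNK = 16 * 1024    # never cut a chunk shorter than this
MAX_CHUNK = 256 * 1024   # force a cut if no boundary has appeared by here
MASK = 0x1FFF            # a boundary fires roughly every 8KB of typical data


def content_defined_chunks(data: bytes):
    """Yield variable-size chunks whose boundaries are chosen by the content.

    A simple polynomial rolling signature over recent bytes decides where to
    cut, so boundaries move with the data rather than with any manually
    configured block size.
    """
    start = 0
    rolling = 0
    for pos, byte in enumerate(data, 1):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = pos - start
        if length >= MIN_CHUNK and ((rolling & MASK) == 0 or length >= MAX_CHUNK):
            yield data[start:pos]
            start = pos
            rolling = 0
    if start < len(data):
        yield data[start:]


data = os.urandom(512 * 1024)
shifted = b"X" + data  # one byte inserted at the front of the stream
before = {hashlib.sha256(c).hexdigest() for c in content_defined_chunks(data)}
after = {hashlib.sha256(c).hexdigest() for c in content_defined_chunks(shifted)}

# Typically all but the first chunk still match; with fixed-size blocks the
# same one-byte insert would misalign every block and nothing would dedupe.
print(len(before & after), "of", len(before), "chunks still dedupe")
```

Because the boundaries follow the content, an insert or shift early in the stream only disturbs the chunk it lands in, rather than throwing off every block behind it.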

A better way

With Symantec, the customer would not have to worry about this scenario at all.  It doesn’t matter whether the data source is a physical machine or a virtual one, whether the database is large or small, or whether it’s just file system data.  Symantec is able to look deep into the backup stream, identify the blocks for which a copy is already stored, and store only the ones that are unique.  No block size limitations or inefficiencies between policies.  This means you get the best dedupe storage efficiency with the lowest management overhead.

Symantec calls its approach end-to-end, intelligent deduplication because we can deliver data reduction capabilities at the source, media server, or even on target storage (via our OpenStorage API).  We gain intelligence from content-awareness of the data stream for backup efficiency.  And of course, we deliver global deduplication capabilities.

More resources:

Symantec Deduplication

NetBackup Platform

Backup Exec Family

NetBackup Appliances

Comments (2)

asg2ki:

Hi Alex,

This is a very nice overview of the "global deduplication" situation amid the tons of misleading information out there. However, when it comes to the "content-awareness" question, it would be nice if you could extend your thoughts a bit more with specific examples.

I've been in several situations, specifically around NDMP-based traffic coming from various vendor implementations, where the results were not always as good as one would expect, but that's typically the result of lacking "content-awareness" of the NDMP stream (e.g. ZFS in combination with any backup software vs. NetApp or EMC in combination with the same backup implementation). I know this last sentence is pushing the discussion toward a more technical matter, but the sad fact is that none of the vendors will shed light on these limitations, whether we are speaking of "local" or "global" deduplication. As a matter of fact, your second diagram also doesn't fully cover the "global deduplication" term as such; at least as per my understanding, "global" should be treated as "global across multiple dedupe pools" rather than "global within a single dedupe appliance".

With the latter, block size does matter, especially if you are dealing with extremely large file systems containing millions of small files, or with extremely large databases. I've seen horrible deduplication results with various vendor implementations (including NetBackup) when it comes to protecting "non-regular" datasets (my example with ZFS applies here), and so far the only solution that has worked very well for me was NetBackup in combination with DataDomain appliances via the OST plugin. While Symantec's deduplication is a clear winner if you use it in combination with the various optimization plugins (especially when dedupe operations are handled directly at the source), the DataDomain engine is still far superior and handles deduplication much more efficiently due to its internal mechanism of chunking the source blocks. Yet neither Symantec nor DataDomain is dealing with true "global deduplication", which ideally should be handled with a single focal point in terms of distributed data hashing, but that's a completely different aspect.

Regards

xp123321:

"Global across multiple dedupe pools" rather than "global within a single dedupe appliance" is exactly what had me confused. I agree with asg2ki's comment.
