
Much ado about source-side deduplication

Updated: 21 May 2009
Peter_E's picture

George Crump recently wrote a post on Byte & Switch on source-side deduplication.  I thought an expansion on the topic would be useful given that George did not cover all the facts in his post.  Yes, I’m writing from a Symantec perspective, but my intention is to present a balanced and factual argument.

Do you have to switch backup applications for integrated deduplication?  NetBackup integrated its PureDisk deduplication engine into the NetBackup server to give customers the choice of how and where they perform deduplication.  As a result, there’s no need to replace the client on your existing machines if you’re already using NetBackup.

Does switching to source-side deduplication mean you should replace your existing backup application, as he suggests?  That depends on the vendor you choose and where you want to use deduplication.  We have always advocated source-side deduplication for remote offices and virtual environments.  Symantec offers NetBackup PureDisk, part of the NetBackup Platform, as a way to accomplish source-side deduplication.  It can be deployed alongside any existing backup application with minimal effort.  A web-based GUI and template-driven policies make managing your client-side dedupe process fairly easy, and integration with NetBackup simplifies tasks such as export to tape.  We have non-NetBackup customers using PureDisk alongside other backup applications, and existing NetBackup customers using PureDisk clients for remote offices and virtual machines.

The need for source-side deduplication in the data center is another question.  I’ve found that our customers want to protect data in different ways, from client-side dedupe to snapshots to CDP, based on the type of data and their RTO/RPO.  We integrated our PureDisk deduplication engine into the NetBackup server to increase flexibility for customers, not reduce it.  That gives them deduplication from within the backup server.  Integration into one platform is core to our strategy.  Take another technology, continuous data protection: we let customers manage this method of data protection through NetBackup using NetBackup RealTime.  I’ll get off my soapbox, but hopefully you begin to see why client-side deduplication should not replace every backup approach in the data center.

Does client-side deduplication present a heavy CPU load on the host?  Should it be a surprise to learn that the load on the host might depend on factors such as the type of CPU and memory available, the speed of the host’s disks, and the type and volume of data?  I’m sure George meant to address this, but could not due to the brevity of his article.  In fact, the load of a dedupe client is typically less than that of a traditional client because the overall backup time is shorter.

Client-side deduplication creates roughly the same amount of CPU load as a traditional backup, perhaps a little more, but over a much shorter time period.  On average, we say it uses about 20-25% of the CPU cycles of an idle host, with some higher peaks.  The first backup will be higher because there’s more new data.  So yes, if the change rate of your data is high (above 15%), you might see a higher CPU load.  While the size of the file system has some effect, the type of data and the average file size are equally important.  If you use variable-chunk deduplication at 5 KB per chunk, you’re going to have a lot more CPU activity.  By contrast, if you set a larger segment size, say 128 KB, you’ll use fewer CPU cycles.  So here is the recommendation, or rather the questions to ask your client-side dedupe vendor:

Can you control the dedupe segment size (what are the max and min)?

Do you support multi-streaming backups from a single client?

 

For NetBackup PureDisk, the answer is YES.  Here are a few numbers:

Single Client / Single Stream: 35 MB/sec for backup

Single Client / Multi-Stream: 40-50 MB/sec for backup
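Returning to the segment-size question for a moment, a rough way to see why small segments cost more is simply to count them. The sketch below assumes fixed-size segments and a hypothetical 100 GB client; real variable-chunk engines behave differently, and segment count is only a proxy for per-segment work such as fingerprinting and index lookups, not a measurement of PureDisk itself.

```python
def segment_count(data_bytes: int, segment_bytes: int) -> int:
    """Number of segments needed to cover data_bytes (ceiling division)."""
    return -(-data_bytes // segment_bytes)


KB = 1024
GB = 1024 ** 3

data = 100 * GB  # hypothetical amount of data on one client

small = segment_count(data, 5 * KB)    # 5 KB segments
large = segment_count(data, 128 * KB)  # 128 KB segments

print(f"5 KB segments:   {small:,}")
print(f"128 KB segments: {large:,}")
print(f"ratio:           {small / large:.1f}x")
```

Under these assumptions, the 5 KB setting produces about 25 times as many segments to fingerprint and look up as the 128 KB setting, which is the intuition behind asking a vendor what range of segment sizes you can choose.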

 

Can we go faster?  We find that it’s host DISK speed – not CPU – that causes slow-downs. So let’s look at restore speed on a host with FAST disk.

Single Client / Multi-Stream with FAST DISK: 76 MB/sec for restores
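To put those rates in perspective, here is a small back-of-the-envelope calculation. It assumes a hypothetical 500 GB client, treats each quoted figure as a sustained rate over the whole job, and uses 1 GB = 1024 MB; actual jobs will vary with data type and disk speed.

```python
def hours_to_transfer(data_gb: float, mb_per_sec: float) -> float:
    """Hours needed to move data_gb gigabytes at a sustained mb_per_sec."""
    seconds = data_gb * 1024 / mb_per_sec
    return seconds / 3600


DATA_GB = 500  # hypothetical amount of data on the client

# Rates quoted in this post, in MB/sec
rates = {
    "backup, single stream": 35,
    "backup, multi-stream (low end)": 40,
    "backup, multi-stream (high end)": 50,
    "restore, multi-stream, fast disk": 76,
}

for label, rate in rates.items():
    print(f"{label}: {hours_to_transfer(DATA_GB, rate):.1f} h")
```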

 

So much for the client-side dedupe process being slow.  George’s claim that restore speeds are typically 3-5 MB/s is not representative of all solutions.  One other point George forgot to mention is how some source-side dedupe agents, at least NetBackup PureDisk, perform restores.

 

If you perform a restore to the same location, the NetBackup PureDisk source client will check whether any of the files you want to replace already exist.  So, if only 50% of the files in a directory that you wish to recover have changed, PureDisk will only send 50% of the data across the wire for the restore.  The CPU impact on restore is also fairly minimal because the client is merely decrypting and decompressing the data.
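The general idea of skipping files that are already in place can be sketched in a few lines. This is an illustration only, not PureDisk's actual implementation: the `catalog` mapping, the `files_to_restore` helper, and the use of SHA-256 fingerprints are all assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(data: bytes) -> str:
    """Content fingerprint used to compare a local file with its backed-up copy."""
    return hashlib.sha256(data).hexdigest()


def files_to_restore(catalog: dict[str, str], target_dir: Path) -> list[str]:
    """Return relative paths whose local copy is missing or differs from the backup.

    catalog maps relative path -> fingerprint recorded at backup time (hypothetical).
    """
    needed = []
    for rel_path, backed_up_hash in sorted(catalog.items()):
        local = target_dir / rel_path
        if not local.is_file() or fingerprint(local.read_bytes()) != backed_up_hash:
            needed.append(rel_path)
    return needed


# Demo: four files were backed up; two are unchanged on the client.
backup = {"a.txt": b"alpha", "b.txt": b"bravo", "c.txt": b"charlie", "d.txt": b"delta"}
catalog = {name: fingerprint(data) for name, data in backup.items()}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp)
    (target / "a.txt").write_bytes(b"alpha")    # unchanged
    (target / "b.txt").write_bytes(b"bravo")    # unchanged
    (target / "c.txt").write_bytes(b"changed")  # modified since backup
    # d.txt was deleted locally
    needed = files_to_restore(catalog, target)

print(needed)  # only these files would need to cross the wire
```

In this toy case only the modified and the missing file are selected, so half of the files never need to be sent.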

 

Where do I agree with George Crump?  Client-side deduplication is great for virtual environments, especially guest backups.  And much to the chagrin of disk vendors, a lot of customers still want a robust tape-export capability from their deduplication system.  The NetBackup Platform delivers this without re-inventing the wheel.