How deduplication can contribute to the restore process…
A couple of days back, while giving a PureDisk lecture, I got into an interesting discussion about how deduplication technologies pop up at different levels: deduplication at the source, at the target, at a gateway level, or built into the application. You name it; they are all available and focused on the incredible numbers pushed by dedup enthusiasts.
In the case of backup solutions, the focus seems to be primarily on dedup savings during the backup process, where higher deduplication rates can deliver higher backup throughput, reduce backup time, and support much longer retention on disk. All of that is correct and very efficient, don't get me wrong, as I'm one of those enthusiasts, but what about the restore process? That is where the discussion touched on an interesting subject...
Can deduplication be used to improve the restore speed? The answer is yes!
Take, for example, the remote office scenario where local tape solutions are replaced with a disk-based deduplication solution. By efficiently moving data over a WAN connection to a larger site or datacenter, where it is then stored on disk, a deduplication solution such as PureDisk takes away the burden of transporting and managing tapes between the remote sites and the central datacenter. In this scenario, you can use an intelligent agent that identifies duplicate files and blocks at the source, before any data is transferred over the network. NetBackup PureDisk delivers exactly such a client. The NetBackup PureDisk client globally deduplicates all data at the source, where deduplication rates of 99% or more have been measured for unstructured data.
Now on the restore side, however, you have to analyze your local restore requirements. Is it possible to meet the restore SLA by restoring data directly over the WAN connection, or is a smaller, local backup server still required to achieve faster restore speeds? With NetBackup PureDisk, you can install a deduplication agent on the client. You can send the client’s backups either directly to a data center or to a local deduplication server, from which they are then replicated to a deduplication server in the data center.
In remote offices, where the restore happens over the WAN, the restore speed correlates directly with the available bandwidth. To estimate the restore speed that you can attain with a deduplication agent, you need to know that file reconstruction occurs on the agent or client itself and not on the backup server. This means that the restore set can be transferred in an optimized format between the backup server and the client, and as such, faster restores can be achieved than with a traditional approach.
The interesting part is that deduplication can be used here, to some extent, on the restore side. This is achieved by applying the hashing algorithms that the deduplication process uses to verify the uniqueness of the data during the backup, now also on the restore side. Before it transfers a file, the NetBackup PureDisk agent checks whether the file it needs to restore is still available on the client. If so, a quick check can verify whether that content is still exactly the same as the content it is about to restore. If the content is identical, then instead of restoring the whole file, PureDisk only restores the attributes or metadata (e.g., ACL settings …) needed to reset the file to its previous state. It does not restore the file content.
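To make the idea concrete, here is a minimal sketch of that restore-side check in Python. It is not PureDisk's implementation: the SHA-256 fingerprint, the function names `file_fingerprint` and `plan_restore`, and the two result labels are all assumptions chosen for illustration, since the hashing algorithm the product uses is not specified here.

```python
import hashlib
import os
import tempfile


def file_fingerprint(path, chunk_size=1 << 20):
    """Return a SHA-256 hex digest of the file content, read in chunks.

    SHA-256 is an assumption; it stands in for whatever fingerprint
    the backup recorded for this file.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_restore(path, backed_up_fingerprint):
    """Decide whether file content must travel over the network."""
    if not os.path.isfile(path):
        return "transfer_content"  # file no longer exists on the client
    if file_fingerprint(path) != backed_up_fingerprint:
        return "transfer_content"  # content changed since the backup
    return "metadata_only"  # content identical: only reset attributes


def demo():
    """Exercise the three cases: unchanged, modified, and missing file."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "report.txt")
        with open(path, "wb") as handle:
            handle.write(b"quarterly numbers")
        recorded = file_fingerprint(path)  # stands in for the backup catalog
        unchanged = plan_restore(path, recorded)
        with open(path, "wb") as handle:
            handle.write(b"quarterly numbers, edited")
        changed = plan_restore(path, recorded)
        missing = plan_restore(os.path.join(workdir, "gone.txt"), recorded)
        return unchanged, changed, missing
```

Only files that come back as `transfer_content` cost WAN bandwidth; for everything else the agent would just reapply the stored attributes.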
This optimization is very useful when you restore a directory, a drive, or even a full client and want to reset the files to a previous point in time. Take, for example, a standard Windows 2003 server (basic install). Tests have shown that for a typical one-day version restore, only 60MB of data was transferred with the NetBackup PureDisk client, while the original restore set was 3GB. In this test, only 2% of the full data set was actually transferred. For all other files, only the metadata was reset, and as such, a much faster restore was achieved compared to a traditional restore. The deduplicated restore uses less bandwidth, less I/O on the client, and less memory on the client. This could be compared to a sort of highly efficient global snapshot technology for disaster recovery, where only changed files are restored.
As an extra default optimization, NetBackup PureDisk also filters the restore job down to unique files for files smaller than 5MB. This means that when you restore multiple copies of a single file, the file is transferred to the client only once instead of once per copy. For the client, PureDisk’s small-file multiple-copy restore method reduces network load, I/O, CPU utilization, memory utilization, and overall recovery time.
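A rough way to picture this small-file optimization is to group the restore set by content fingerprint and fetch each small file only once. The sketch below is illustrative only: `plan_transfers`, the tuple layout of the entries, and the idea of filling in the extra copies from a local copy are assumptions, not a description of how PureDisk does it internally. The 5MB threshold is the one quoted above.

```python
SMALL_FILE_LIMIT = 5 * 1024 * 1024  # 5MB threshold quoted in the text


def plan_transfers(entries, small_file_limit=SMALL_FILE_LIMIT):
    """Split restore entries into network transfers and local copies.

    entries: iterable of (path, fingerprint, size_in_bytes) tuples.
    Returns (transfers, local_copies): transfers is the list of paths
    fetched over the network, local_copies maps a destination path to
    the already-transferred path with the same content.
    """
    first_seen = {}  # fingerprint -> first small file transferred
    transfers = []
    local_copies = {}
    for path, fingerprint, size in entries:
        if size < small_file_limit and fingerprint in first_seen:
            # Same content already fetched in this job: no new transfer.
            local_copies[path] = first_seen[fingerprint]
            continue
        transfers.append(path)
        if size < small_file_limit:
            first_seen[fingerprint] = path
    return transfers, local_copies


def demo():
    entries = [
        ("/home/a/logo.png", "f1", 40_000),
        ("/home/b/logo.png", "f1", 40_000),  # duplicate small file
        ("/home/c/logo.png", "f1", 40_000),  # duplicate small file
        ("/data/big.iso", "f2", 700 * 1024 * 1024),
        ("/data/big-copy.iso", "f2", 700 * 1024 * 1024),  # above the limit
    ]
    return plan_transfers(entries)
```

In this example the three copies of the logo cost one transfer, while the two large files are each transferred because they are above the threshold.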
NetBackup PureDisk also supports compression, which can reduce the transferred data by another 50% on average, depending of course on the compressibility of the data. If the bottleneck for recovery is restricted bandwidth, which is commonly the case for remote office restores, then these savings can provide dramatic improvements to recovery times.
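To get a feel for what these savings mean on a slow link, here is a back-of-the-envelope estimate in Python. It reuses the figures from the test above (a 3GB restore set of which 2% is transferred) and the average 50% compression saving, and assumes a T1 line at its nominal 1.544 Mbit/s. The function name and the model are assumptions: it ignores latency, protocol overhead, and client-side processing, so treat the results as rough orders of magnitude rather than measurements.

```python
T1_BITS_PER_SECOND = 1_544_000  # nominal T1 line rate


def estimate_transfer_seconds(restore_set_bytes, transferred_fraction=1.0,
                              compression_saving=0.0,
                              link_bits_per_second=T1_BITS_PER_SECOND):
    """Estimate the wire time for a restore on a bandwidth-bound link.

    transferred_fraction: share of the restore set that really has to be
    sent (1.0 for a traditional restore, 0.02 in the test above).
    compression_saving: share of the sent data removed by compression
    (0.5 for the average figure quoted above).
    """
    bytes_on_wire = (restore_set_bytes * transferred_fraction
                     * (1.0 - compression_saving))
    return bytes_on_wire * 8 / link_bits_per_second


RESTORE_SET = 3_000_000_000  # 3GB, using decimal units

traditional = estimate_transfer_seconds(RESTORE_SET)
deduplicated = estimate_transfer_seconds(RESTORE_SET, transferred_fraction=0.02)
dedup_and_compressed = estimate_transfer_seconds(
    RESTORE_SET, transferred_fraction=0.02, compression_saving=0.5)

print(f"traditional:          {traditional / 3600:.1f} hours")
print(f"deduplicated:         {deduplicated / 60:.1f} minutes")
print(f"dedup + compression:  {dedup_and_compressed / 60:.1f} minutes")
```

Under these assumptions the traditional restore needs a bit over four hours of wire time, the deduplicated restore about five minutes, and the deduplicated and compressed restore under three minutes.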
Next to these three improvements, PureDisk also supports automatic multistreaming, another great feature that increases the restore speed. When restoring or backing up data over WAN connections, high latency can hurt performance drastically. To overcome the time lost to the extra delays, the multistreaming feature automatically load-balances the restore set over multiple streams (the number is configurable). By using multiple streams, one can easily saturate slower WAN connections (e.g., T1). For scenarios in which the client, not the network bandwidth, is the limiting factor, one can in most cases saturate the write speed on the client by increasing the number of restore streams.
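How the restore set is balanced over the streams is not described here, so the following Python sketch only illustrates the general idea with a common greedy heuristic: hand the largest remaining file to whichever stream currently has the least data queued. The function name `assign_streams` and the heuristic itself are assumptions, not PureDisk's actual algorithm.

```python
import heapq


def assign_streams(file_sizes, stream_count):
    """Distribute files over restore streams, largest file first.

    file_sizes: dict mapping file name to size in bytes.
    Returns a list with one list of file names per stream.
    """
    if stream_count < 1:
        raise ValueError("stream_count must be at least 1")
    # Min-heap of (bytes queued so far, stream index).
    heap = [(0, index) for index in range(stream_count)]
    heapq.heapify(heap)
    streams = [[] for _ in range(stream_count)]
    by_size = sorted(file_sizes.items(), key=lambda item: item[1],
                     reverse=True)
    for name, size in by_size:
        load, index = heapq.heappop(heap)  # least-loaded stream
        streams[index].append(name)
        heapq.heappush(heap, (load + size, index))
    return streams


def demo():
    sizes = {"a": 700, "b": 400, "c": 300, "d": 200}
    return assign_streams(sizes, stream_count=2)
```

With two streams, the example ends up with 900 and 700 bytes queued per stream, so both streams finish at roughly the same time instead of one stream carrying most of the data.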
Other use cases such as virtual environments, in which not only bandwidth but also I/O, memory, and CPU resources are shared or scarce, also benefit hugely from these improvements. No wonder PureDisk is such a big success in virtual environments! PureDisk technologies such as deduplication, compression, and multistreaming can easily contribute to a faster restore and better utilization of the available resources.
The moral of the story (or in this case, my discussion) is that for all these particular use cases, we should clearly focus a little more on the restore aspect instead of just promoting the great backup speeds and dedup numbers!