In my current role at Symantec, I spend a lot of time talking to customers about their data protection strategies. It is interesting to note how much misinformation some of our competitors continue to give customers about Symantec’s deduplication technology. They scour older product manuals for information that is no longer accurate and use it against Symantec to create FUD in the minds of customers. It has gotten so bad that I am going to recommend that one of our competitors change their tag line from “Where information lives” to “Where MIS-information lives”. OK, jokes aside, I thought it would be worthwhile to blog about exactly how Symantec approaches deduplication so we can put an end to all this misinformation.
Deduplication has clearly come a long way, and most leading deduplication vendors now offer a fairly sophisticated approach that checks for duplicate data at the sub-file level. The deduplication algorithms can be broadly classified as Fixed Block or Variable Block. One could argue that single instance storage is also a form of deduplication, albeit a less sophisticated one, because it deduplicates data at the file level rather than at a more granular sub-file level.
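To make the fixed block idea concrete, here is a minimal sketch in Python. It is an illustration only, not any vendor's implementation: the function names, the 4 KB block size, and the use of SHA-256 as the fingerprint are all my assumptions.

```python
import hashlib

def dedup_fixed(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep one copy of each unique block.

    The block size and the SHA-256 fingerprint are illustrative assumptions.
    """
    store = {}    # fingerprint -> block bytes (each unique block stored once)
    recipe = []   # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).hexdigest()
        store.setdefault(fingerprint, block)  # only the first copy is kept
        recipe.append(fingerprint)
    return store, recipe

def restore(store, recipe) -> bytes:
    """Rebuild the original data from the block store and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)
```

Single instance storage is the degenerate case of the same idea, where the "block" is the whole file.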
While variable block deduplication may yield slightly better deduplication rates than the fixed block approach, it comes at a price: the CPU cycles that must be spent determining the block boundaries. The variable block approach requires more processing than fixed block because the whole file must be scanned, one byte at a time, to identify block boundaries. If file data is randomized, further CPU cycles are needed to scan for block boundaries.
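The byte-at-a-time scan described above can be sketched as follows. This is a generic content-defined chunker using a simple "gear" style rolling hash, which is one common way to pick variable block boundaries; it is not a description of any particular vendor's algorithm. The lookup table, the seed, and the boundary mask are assumptions chosen for illustration.

```python
import random

# Assumed lookup table: one pseudo-random 32-bit value per possible byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]

# Assumed boundary rule: cut when the top 8 bits of the hash are all zero,
# which happens roughly once every 256 bytes on random input.
BOUNDARY_MASK = 0xFF000000

def chunk_variable(data: bytes) -> list[bytes]:
    """Split data into variable-size blocks at content-defined boundaries."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):          # every byte must be examined
        # Rolling hash: only the most recent 32 bytes influence h.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h & BOUNDARY_MASK) == 0:         # boundary found, close the block
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):                    # trailing bytes form the last block
        chunks.append(data[start:])
    return chunks

if __name__ == "__main__":
    sample = bytes(random.Random(1).getrandbits(8) for _ in range(20000))
    before = chunk_variable(sample)
    after = chunk_variable(b"X" + sample)    # one byte inserted at the front
    shared = set(before) & set(after)
    print(len(before), "blocks before,", len(shared), "unique blocks still shared after the insert")
```

The payoff is that an insertion near the start of a file only disturbs the blocks around it, whereas fixed blocks would all shift. The cost is the per-byte hashing in the loop, which is exactly the CPU price discussed above.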
This performance problem is magnified because the variable block deduplication method does a poor job of controlling block sizes. While average block sizes can be defined, the distribution of block sizes is very broad, and can range from one byte to the length of the entire file. Very small blocks are impractical from a deduplication standpoint because the cost of maintaining references to them outweighs the benefit of omitting the duplicate data. Very large blocks are also impractical because their size limits their usefulness for deduplicating similar files, approaching the inefficiency of the Single Instance Storage approach. As a result, special accommodations are needed to deliver useful results. This increases complexity and also slows performance. The poor block size control also means no application awareness.
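A quick back-of-the-envelope calculation shows why very small blocks do not pay off. The 40-byte reference size below is purely an assumption for illustration (a 32-byte fingerprint plus an 8-byte location pointer); real products will differ.

```python
# Assumed cost of one block reference: 32-byte fingerprint + 8-byte pointer.
REF_BYTES = 40

def net_saving(block_size: int, ref_bytes: int = REF_BYTES) -> int:
    """Bytes saved when one duplicate block is replaced by a reference.

    A negative result means the reference costs more than the block itself.
    """
    return block_size - ref_bytes

if __name__ == "__main__":
    for size in (1, 16, 64, 4096):
        print(f"block of {size} bytes: net saving {net_saving(size)} bytes")
```

Under this assumption, any duplicate block smaller than the reference itself is a net loss, which is why variable block schemes typically need a minimum block size.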
Of course, I don’t expect the vendors who use variable block deduplication to tell you all this, because they are busy selling you the promise of 50X or higher storage reduction.
So exactly how is Symantec approaching deduplication? Let me make it clear: Symantec did PREVIOUSLY use a fixed block approach for deduplication. That was five years ago, when Symantec first launched the technology in the market via the PureDisk product. However, over time, as the deduplication technology was integrated into NetBackup, it allowed us to take a more intelligent approach to deduplication than just blindly chopping the data to look for duplicate blocks.
The “intelligence” in Symantec’s approach comes from a fundamental understanding of the data being backed up. When NetBackup starts a backup of a data set, the metadata for the incoming backup stream is examined to understand what kind of data is being backed up. Based on that data type, an optimized block size is assigned to that particular backup. This approach works really well in terms of compressing the data because different file types are compressed to different degrees with different block sizes. And since the block size is already optimized for that particular data type, CPU cycles don’t need to be wasted determining block boundaries. This approach provides a good balance between performance and resource utilization.
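As a rough illustration of the idea (and only that), the sketch below picks a block size from the data type and then deduplicates with that size. The data type names, the block sizes in the table, and the function names are all hypothetical; they are not NetBackup's actual values or interfaces.

```python
import hashlib

# Hypothetical table for illustration only; NOT NetBackup's real settings.
BLOCK_SIZE_BY_TYPE = {
    "database": 8 * 1024,
    "email": 16 * 1024,
    "vm_image": 64 * 1024,
}
DEFAULT_BLOCK_SIZE = 128 * 1024  # assumed fallback for unrecognized data

def pick_block_size(data_type: str) -> int:
    """Choose a block size from stream metadata (here, just a type label)."""
    return BLOCK_SIZE_BY_TYPE.get(data_type, DEFAULT_BLOCK_SIZE)

def dedup_stream(data: bytes, data_type: str, store: dict) -> list:
    """Deduplicate one backup stream using the block size chosen for its type."""
    block_size = pick_block_size(data_type)   # decided once, up front
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).hexdigest()
        store.setdefault(fingerprint, block)  # keep only the first copy
        recipe.append(fingerprint)
    return recipe
```

Note that the block size is chosen once per stream from metadata, so there is no per-byte boundary search inside the loop.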
In summary, Symantec does not use the traditional fixed or variable block methods of deduplicating data. The deduplication process is called “intelligent deduplication”: intelligent because it is content aware, and because the block sizes can be varied based on the type of data being backed up. This helps in efficiently deduplicating data without taking a huge toll on resources.