Deduplication, or dedup for short, is very common nowadays. BE has had a dedup option since BE 2010, Microsoft now offers dedup'ed volumes with Server 2012 R2, and there are a lot of dedup appliances from various vendors. How does dedup actually work? In this article, I will use some simple examples to give you an idea of how dedup works under the covers.
The purpose of dedup is to save space by not saving the same bit of data multiple times. There is no free lunch, so what is the trade-off? In order to check whether the data is already saved, a lot more processing power (CPU) and memory is required. A backup job involving dedup may take longer than a normal backup job if there is insufficient CPU or memory. Also, a restore will take longer because the data needs to be re-hydrated. I will explain this later.
Suppose Directory A has these files with these contents
FileA : abcde1234567890
FileB : ABCDE1234567890
These files are backed up with a full backup. When these files are dedup'ed, the dedup process will chop them up into smaller chunks, compare these chunks with what has already been stored, and decide whether each chunk needs to be stored. Let's assume that the chunks are 5 characters long. So when FileA is dedup'ed, it is chopped up into
abcde 12345 67890
Since there is nothing in the dedup store, all 3 chunks would be stored.
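To make the chopping concrete, here is a minimal sketch of fixed-size chunking in Python (the function name and the 5-character chunk size are just for this example; BE's real engine uses its own chunk sizes):

```python
def chunk(data: str, size: int = 5):
    """Chop a file's contents into fixed-size chunks (5 characters, as in the example)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

print(chunk("abcde1234567890"))  # ['abcde', '12345', '67890']
```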
When FileB is stored, it is chopped up into
ABCDE 12345 67890
ABCDE is stored, but the other two chunks are not, because they were already put in the dedup store when FileA was stored. The dedup engine just records a pointer to say that these chunks are also needed by FileB.
In the dedup store after storing FileA and FileB, we have
abcde = FileA-FB1
ABCDE = FileB-FB1
12345 = FileA-FB1, FileB-FB1
67890 = FileA-FB1, FileB-FB1
where FB1 = Full Backup 1 and the "=" means "is needed by". We have thus saved 10 bytes of storage (30 bytes backed up, 20 bytes stored).
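A toy version of such a dedup store can be sketched as a dictionary mapping each chunk to the list of owners that need it (all names here are my own, not BE's):

```python
store = {}  # chunk -> list of "File-Backup" owners that need it

def dedup_in(name: str, backup: str, data: str, size: int = 5) -> int:
    """Store only chunks not seen before; always record a pointer.
    Returns the number of bytes actually stored."""
    stored = 0
    for i in range(0, len(data), size):
        c = data[i:i + size]
        if c not in store:            # new chunk: the data itself goes into the store
            store[c] = []
            stored += len(c)
        store[c].append(f"{name}-{backup}")  # pointer: this owner needs the chunk
    return stored

print(dedup_in("FileA", "FB1", "abcde1234567890"))  # 15 -- all 3 chunks are new
print(dedup_in("FileB", "FB1", "ABCDE1234567890"))  # 5  -- only ABCDE is new
print(store["12345"])  # ['FileA-FB1', 'FileB-FB1']
```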
A user modifies FileA by adding an "A" to the front
FileA : Aabcde1234567890
and creates FileC with
FileC : abcde1234567890A
If we do an incremental backup IB1, then these two files would be chopped up as follows
FileA : Aabcd e1234 56789 0
FileC : abcde 12345 67890 A
The dedup store will now look like this
abcde = FileA-FB1, FileC-IB1
ABCDE = FileB-FB1
12345 = FileA-FB1, FileB-FB1, FileC-IB1
67890 = FileA-FB1, FileB-FB1, FileC-IB1
A = FileC-IB1
Aabcd = FileA-IB1
e1234 = FileA-IB1
56789 = FileA-IB1
0 = FileA-IB1
As you can see from FileA, making a small modification to a file does not mean that it will dedup well. The modified FileA does not dedup at all.
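This boundary-shift effect is easy to reproduce with the same fixed-size chunking: prepending a single byte shifts every chunk boundary, while appending one leaves all the existing chunks intact. (Some dedup engines mitigate this with variable-size, content-defined chunking, but that is beyond this example.)

```python
def chunk(data: str, size: int = 5):
    return [data[i:i + size] for i in range(0, len(data), size)]

original  = set(chunk("abcde1234567890"))        # FileA as first backed up
prepended = set(chunk("A" + "abcde1234567890"))  # the modified FileA
appended  = set(chunk("abcde1234567890" + "A"))  # FileC

print(sorted(original & prepended))  # [] -- no chunk survives the shift
print(sorted(original & appended))   # ['12345', '67890', 'abcde'] -- all dedup
```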
A user now adds FileD with
FileD : qwertyuiop
and we do a second full backup, FB2. FileD would be chopped up into
qwert yuiop
and the dedup store would look like
abcde = FileA-FB1, FileC-IB1, FileC-FB2
ABCDE = FileB-FB1, FileB-FB2
12345 = FileA-FB1, FileB-FB1, FileC-IB1, FileB-FB2, FileC-FB2
67890 = FileA-FB1, FileB-FB1, FileC-IB1, FileB-FB2, FileC-FB2
A = FileC-IB1, FileC-FB2
Aabcd = FileA-IB1, FileA-FB2
e1234 = FileA-IB1, FileA-FB2
56789 = FileA-IB1, FileA-FB2
0 = FileA-IB1, FileA-FB2
qwert = FileD-FB2
yuiop = FileD-FB2
At this stage, 119 bytes have been backed up (30 in FB1, 32 in IB1 and 57 in FB2), but only 47 bytes are actually stored in the dedup folder, so the dedup ratio is 119:47, or about 2.5:1. This is the space saving offered by dedup. We did not count the overheads, like pointers, etc., so the actual dedup ratio would be a bit lower.
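The totals can be checked mechanically by replaying the three backup jobs over the example files (file contents as given above; the helper names are mine):

```python
files = {
    "FileA":  "abcde1234567890",
    "FileA'": "Aabcde1234567890",  # FileA after the modification
    "FileB":  "ABCDE1234567890",
    "FileC":  "abcde1234567890A",
    "FileD":  "qwertyuiop",
}
jobs = [
    ["FileA", "FileB"],                      # FB1
    ["FileA'", "FileC"],                     # IB1 (changed/new files only)
    ["FileA'", "FileB", "FileC", "FileD"],   # FB2 (everything)
]

def chunk(data, size=5):
    return [data[i:i + size] for i in range(0, len(data), size)]

backed_up = sum(len(files[f]) for job in jobs for f in job)
unique = {c for job in jobs for f in job for c in chunk(files[f])}
stored = sum(len(c) for c in unique)
print(backed_up, stored, round(backed_up / stored, 1))  # 119 47 2.5
```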
Some points are worth noting.
1) Go back and take a look at the modifications to FileA and FileC. Both files have only one extra byte added, but the result, in dedup terms, is very different. The modified FileA hardly dedup'ed at all, while FileC (the original FileA with one byte appended) dedup'ed just fine. This illustrates that small modifications do not mean that the modified file will still dedup well.
2) Other than the two new files, the second full backup FB2 does not store any more data into the dedup folder. From the standpoint of the dedup folder, it is like an "incremental" backup of the first full backup, FB1. You might then say, why not just do full backups? When you do full backups, all the files need to be sent to the media server (assuming server-side dedup) and each file would have to be processed by the dedup engine to determine which, if any, of its chunks need to be stored. So there is still a place for incremental and differential backups.
3) FileA and FileB have some common chunks of data, so there is already some dedup in the initial backup. More often than not, there is very poor dedup in the first few backups because it is unlikely that a similar chunk of data already exists in the dedup folder. The dedup ratio normally increases as you store more data into the dedup folder, because it becomes more likely that a matching chunk is already there.
4) When storing data into the dedup folder, it does not matter where the data comes from. It could be from another server, an Exchange backup or some other resource. As long as it matches a chunk already in the dedup folder, which may initially have been stored by, say, a SQL backup of another server, it will not be stored again.
5) Some data, like compressed or encrypted data, dedups badly because the compression and encryption processes randomise the data, and it is less likely that randomised data will find matching chunks in the dedup folder. If you want to compress or encrypt your data, do not turn on compression or encryption in your job. Instead, turn on compression or encryption in the dedup engine. See these documents on how to do so.
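You can see the effect with a quick experiment: two nearly identical plain texts share most of their chunks, but their zlib-compressed forms share almost none (zlib stands in here for any compression; the exact overlap of the compressed streams depends on the compressor):

```python
import zlib

def chunk(data: bytes, size: int = 5):
    return [data[i:i + size] for i in range(0, len(data), size)]

a = ("the quick brown fox jumps over the lazy dog. " * 40).encode()
b = a + b"one more sentence appended at the end."

plain_shared  = set(chunk(a)) & set(chunk(b))
zipped_shared = set(chunk(zlib.compress(a))) & set(chunk(zlib.compress(b)))

# The plain files share every chunk of `a`; the compressed streams share almost nothing.
print(len(plain_shared), len(zipped_shared))
```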
For best practices for the BE dedup option, take a look at this
Client-side dedup vs. Server-side dedup
By default, BE uses server-side dedup. For server-side dedup, the file to be backed up is sent to the media server and processed there. It is chopped up into chunks, a checksum of each chunk is calculated and then compared to the checksums of the chunks already in the dedup folder. If there is a match, a pointer is created to the existing chunk and the chunk from the file is not stored. If the checksum does not match, the chunk from the file is stored. As you can imagine, this uses a lot of processing power and memory, which is why the best practices include recommendations for minimum memory.
To spread the load, you can use client-side dedup. In this case, the file to be backed up is chopped up at the remote server and the checksums are calculated there. The checksums are then sent to the media server, which compares them to the checksums of the existing chunks in the dedup folder. If a checksum matches, the chunk is not sent to the media server. If it does not match, the chunk is sent to the media server for storage in the dedup folder. This reduces the load on both the media server and the network.
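Here is a sketch of that exchange, using SHA-256 as a stand-in for whatever checksum the dedup engine really uses (all names are hypothetical, and the 5-byte chunk size is kept from the earlier example):

```python
import hashlib

CHUNK = 5
folder = {}  # media server's dedup folder: checksum -> stored chunk

def server_missing(checksums):
    """Media server: given a list of checksums, reply with the ones it lacks."""
    return {h for h in checksums if h not in folder}

def client_backup(data: bytes) -> int:
    """Remote server: chop, checksum, ask, then send only the missing chunks.
    Returns the number of chunk bytes that crossed the wire."""
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    fps = [(hashlib.sha256(c).hexdigest(), c) for c in chunks]
    wanted = server_missing([h for h, _ in fps])  # only checksums sent so far
    sent = 0
    for h, c in fps:
        if h in wanted:
            folder[h] = c        # the chunk itself is transmitted and stored
            sent += len(c)
            wanted.discard(h)    # a duplicate within the same file goes only once
    return sent

print(client_backup(b"abcde1234567890"))  # 15 -- empty folder, all chunks sent
print(client_backup(b"ABCDE1234567890"))  # 5  -- only ABCDE crosses the network
```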
You cannot do client-side dedup when you are backing up a VMware host because there is no remote agent in the VMware host.
OST or Open Storage Technology is a standard created by Symantec for dedup appliances. Think of these appliances as NAS/SAN storage with built-in dedup capabilities. If these are OST-compliant, then they have an OST plug-in which you install on the media server so that BE knows that these appliances are capable of dedup. You would also need to have the dedup option on the media server.
There are a couple of advantages to using an OST appliance.
1) The dedup processing is now done by the appliance and the load on the media server is reduced.
2) You can use multiple OST appliances with a single media server. When you are using the BE dedup folder, you are restricted to only one dedup folder.
3) If you connect the OST appliance to the remote server, you can use client-side dedup to back up from the remote server straight to the appliance, bypassing the media server.
In spite of these advantages, you might be tempted to save on the dedup licence and just use these appliances like ordinary disks, since anything that you throw at the appliance will be dedup'ed anyway. However, you would not get the full advantage of these appliances. Going back to the example of the first full backup, FB1: when you just back up to the appliance as if it were an ordinary disk, BE will create a .bkf file on the appliance. Internally, there would be other data put in by BE for its own purposes, e.g. the checksums for verification. The .bkf file containing FileA and FileB may look like this
%#abcde1234567890$%^&ABCDE1234567890(*&^
where the special characters denote the other data put into the .bkf file by BE. When this .bkf file is chopped up, it will look like this
%#abc de123 45678 90$%^ &ABCD E1234 56789 0(*&^
so you can see there is no dedup at all.
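The same toy chunking shows why: wrapping the files in BE's own metadata shifts every chunk boundary, so none of the per-file chunks reappear (the wrapper characters below are the made-up ones from the example, not real .bkf bytes):

```python
def chunk(data: str, size: int = 5):
    return [data[i:i + size] for i in range(0, len(data), size)]

file_a, file_b = "abcde1234567890", "ABCDE1234567890"
bkf = "%#" + file_a + "$%^&" + file_b + "(*&^"  # files wrapped in BE metadata

per_file = set(chunk(file_a)) | set(chunk(file_b))
print(sorted(set(chunk(bkf)) & per_file))  # [] -- not one chunk matches
```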
If you were to use the appliance as an OST appliance, then there would be dedup as shown earlier because each file is dedup'ed separately.
If you use CASO, which for BE 2012 and BE 2014 is now part of ESO, you can share dedup folders/OST appliances between media servers. If two media servers with dedup folders/OST appliances duplicate backup sets from their dedup folder/OST appliance to the one owned by the other media server, only the additional data chunks are sent from one media server to the other. This minimises the bandwidth requirement for the link between the two media servers and is the recommended way of transferring backup sets from one location to another. You can further throttle the bandwidth used by optimised duplication if you want. See this document on how to do so.
Before you do your first optimised duplication, you might want to seed the other dedup folder so that, right off the bat, there is less data to transfer over the WAN link. See this document
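Optimised duplication, and the benefit of seeding, can be sketched the same way: only the chunks the target dedup folder does not already hold cross the link (chunk sets stand in for the two dedup folders here; this is an illustration, not BE's wire protocol):

```python
def chunk(data: str, size: int = 5):
    return {data[i:i + size] for i in range(0, len(data), size)}

source = chunk("abcde1234567890") | chunk("ABCDE1234567890")  # FB1 at site 1
target = chunk("abcde1234567890")            # site 2, already seeded with FileA

missing = source - target   # the only chunks that cross the WAN link
target |= missing
print(sorted(missing))  # ['ABCDE'] -- 5 bytes sent instead of 20
```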
In Part 2, I will discuss restores from dedup'ed data and what happens when dedup'ed backup sets are deleted.