Two Ways to Deduplicate PST Data for Enterprise Vault
When you're doing a PST migration, one of the things that you must take into consideration is the duplication of data. A large percentage of users are likely to have copies of PST files with just small deltas between them. For example, let's say we do the following:
- Create a PST called Archive2013.pst in January 2013.
- During January we add data to the PST.
- During February we add data to the PST.
- At the end of February, with Outlook closed, we take a copy of the PST and put it in a folder called 'PST Backup'.
- During March we add data to the PST.
- During April we add data to the PST, but we also delete some data that we've decided belonged to a long-since-finished project.
- During May we add data to the PST.
- During June we add data to the PST, and we delete a bit more data, this time for a different project that we no longer need.
- At the end of June, with Outlook closed, we take another copy of the PST and put it in the 'PST Backup' folder. We don't overwrite the first copy; we keep it and add the new copy alongside it. So now we have:
- Archive2013.pst < working file
- Archive2013-1.pst < copy created at the end of February
- Archive2013-2.pst < copy created at the end of June
- In July, August and the start of September we continue to add data to the Archive2013 PST file.
Now let's say in mid-September we start PST migration for this user.
Almost all PST migration products and tools, and Enterprise Vault itself, will find *three* PST files: the current active one and the two backup copies. This is great - consistency across migration approaches!
The problem here is that across the three PST files the majority of the data is the same.
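To get a feel for how much overlap that is, here is a rough model of the timeline above. The numbers are made up purely for illustration: 100 items added per month (September included, for simplicity), 20 items deleted in April and 10 in June. Nothing here reads a real PST file; the item identifiers are just labels.

```python
# Rough model of the Archive2013.pst timeline. All counts are hypothetical.

def month_items(month, count=100):
    """Return a set of made-up item identifiers for one month of mail."""
    return {f"{month}-{n:03d}" for n in range(count)}

working = set()

# January and February: data is added, then the first backup copy is taken.
for month in ("jan", "feb"):
    working |= month_items(month)
copy_feb = set(working)            # Archive2013-1.pst

# March and April: more data, plus the April clean-up of an old project.
for month in ("mar", "apr"):
    working |= month_items(month)
working -= {f"jan-{n:03d}" for n in range(20)}

# May and June: more data, plus the June clean-up, then the second copy.
for month in ("may", "jun"):
    working |= month_items(month)
working -= {f"mar-{n:03d}" for n in range(10)}
copy_jun = set(working)            # Archive2013-2.pst

# July to the start of September: more data in the working file only.
for month in ("jul", "aug", "sep"):
    working |= month_items(month)

total = len(working) + len(copy_feb) + len(copy_jun)
unique = len(working | copy_feb | copy_jun)
print(f"items found across three PSTs: {total}")
print(f"distinct items:                {unique}")
print(f"duplicates:                    {total - unique}")
```

With these invented counts, a migration would discover 1,640 items across the three files, of which only 890 are distinct. Your own ratio will differ, but the shape of the problem is the same.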
Solving the Problem - Natively
Enterprise Vault has for a long time had a fantastic single instance storage model. This means that every item that is added to Enterprise Vault has a hash calculation performed on it, and if that hash already exists on an item in Enterprise Vault then the new item is not stored again; instead, it shares the already existing item. This reduces storage needs. Think of all those situations where a meeting request is sent to 30 people with a document attached to it. That document doesn't need to be stored full-fidelity 30 times - it will be stored just once.
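The general idea of hash-based single instancing can be sketched in a few lines. This is only an illustration of the concept, not how Enterprise Vault implements it: the class name, the use of SHA-256 and the in-memory dictionaries are all assumptions made for the example.

```python
import hashlib

class SingleInstanceStore:
    """Toy content-addressed store: identical content is kept only once."""

    def __init__(self):
        self.blobs = {}        # hash -> stored content
        self.references = {}   # hash -> list of owners sharing that content

    def add(self, owner, content):
        """Store content for an owner, sharing it if it already exists."""
        digest = hashlib.sha256(content).hexdigest()
        if digest not in self.blobs:
            # First time this content is seen, so it is actually stored.
            self.blobs[digest] = content
        # Every owner gets a reference, whether or not a new copy was stored.
        self.references.setdefault(digest, []).append(owner)
        return digest

# The meeting request example: one attachment sent to 30 people.
store = SingleInstanceStore()
attachment = b"contents of the attached document"
for n in range(30):
    digest = store.add(f"user{n:02d}", attachment)

print("stored copies:", len(store.blobs))
print("mailboxes sharing it:", len(store.references[digest]))
```

The attachment is stored once and referenced 30 times, which is the saving described above.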
In Enterprise Vault 8 this was further enhanced and the model was called Optimized Single Instance Storage. Amongst the changes introduced was the ability to single instance across Vault Stores in a Vault Store Group, and, rather than having the shared information stored in files on disk, it's now done by database references.
So one way that you can resolve the issue of duplicate data in PST files is to simply let Enterprise Vault do its thing, and let it do the de-duplication for you. That's good in many ways, but it does place an additional load on the Enterprise Vault servers, which might be busy with other activities like archiving, retrieval and building Vault Cache data for users. Depending on the migration settings it can also lead to duplicate shortcuts in the end-user's mailbox after the migration, which of course might confuse the users working in that mailbox. The duplication goes a bit further too, in that the duplicate items will be downloaded as Vault Cache data if that is configured for end-users.
Solving the Problem - Third Party
Another way to resolve this type of issue is to use a third-party migration product. PST FlightDeck from QUADROtech has a number of modules that process data before the data is sent on to Enterprise Vault for long-term storage. One of those modules is a de-duplication module, which will process PST files for a user, item by item, finding and filtering out those duplicate items we've been talking about. The result is that a purer stream of PST data is sent to Enterprise Vault for long-term storage. In addition, the per-item checking and filtering means that corrupt items and corrupt PST files can be more readily identified, and worked on, again before sending them onwards to Enterprise Vault.
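As a rough illustration of item-by-item filtering before ingestion, here is a minimal sketch. It is not PST FlightDeck's algorithm: the `Item` fields, the choice of which fields make up the fingerprint, and the sample data are all hypothetical, and a real tool would read the items out of the PST files rather than from hand-built lists.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    """Hypothetical, simplified view of one message in a PST file."""
    subject: str
    sender: str
    sent: str
    body: str

def fingerprint(item):
    """Hash the fields that (we assume) identify an item uniquely."""
    joined = "\x1f".join((item.subject, item.sender, item.sent, item.body))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def filter_duplicates(pst_files):
    """Walk each PST in order, keeping only the first copy of every item."""
    seen = set()
    unique, dropped = [], 0
    for name, items in pst_files.items():
        for item in items:
            key = fingerprint(item)
            if key in seen:
                dropped += 1          # already queued from an earlier PST
                continue
            seen.add(key)
            unique.append((name, item))
    return unique, dropped

jan = Item("Kickoff", "alice@example.com", "2013-01-10T09:00", "Agenda attached")
mar = Item("Status", "bob@example.com", "2013-03-04T14:30", "All green")
jul = Item("Review", "carol@example.com", "2013-07-22T11:15", "Notes inside")

# The working file is listed first so that its copies are the ones kept.
pst_files = {
    "Archive2013.pst": [jan, mar, jul],
    "Archive2013-1.pst": [jan],
    "Archive2013-2.pst": [jan, mar],
}

unique, dropped = filter_duplicates(pst_files)
print(f"items to send on: {len(unique)}, duplicates filtered: {dropped}")
```

Processing the working file first is a design choice in this sketch: it means the copy that survives is the one in the file the user is actively using.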
De-duplicating the data before it is sent to Enterprise Vault means that there won't be any duplicate shortcuts when the data is archived by Enterprise Vault (if the migration policy creates shortcuts) and Vault Cache data won't need to be built for these 'extra' items. It also means that the data is more readily accessible, searchable, and usable by end-users going forwards.
Whether or not a third-party solution like PST FlightDeck is for you, and is suitable for your migration (and don't get me wrong, we think it is!), one of the things that you have to consider is the issue of duplicate data across PST files. Almost no migration will be free of duplicated data. Do you want to fire all this data into Enterprise Vault and let it handle the de-duplication and single instancing? Or do you want to pre-process it and have only 'clean' data going into the long-term archive of Enterprise Vault?
Let me know in the comments how you've tackled this sort of problem in the past.