How collections and sparse collections work

Created: 30 Jan 2010 • Updated: 03 Feb 2010 | 12 comments
By JesusWept3

One feature of Enterprise Vault is Collections, where Enterprise Vault gathers multiple archived items into a single collection file, which is a Microsoft Cabinet (CAB) file. The main reason for doing this is to reduce backup times.

For instance, a single 10MB CAB file backs up more quickly than 1,000 separate 10KB files. One thing to note, however, is that the CAB files are not compressed, so extracting a 10MB CAB file yields 10MB of DVS files. The reason is that DVS files are already highly compressed, and compressing data that is already compressed saves little or nothing and can even produce a slightly larger file.
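The point about recompression can be checked with a small sketch. It does not read real DVS files: pseudo-random bytes stand in for an already-compressed saveset (an assumption made for illustration), and zlib stands in for whatever compressor a CAB tool might apply.

```python
import random
import zlib

# Stand-in for an already-compressed saveset: pseudo-random bytes leave no
# redundancy for a compressor to exploit (illustrative assumption only).
rng = random.Random(0)
already_compressed = bytes(rng.getrandbits(8) for _ in range(100_000))

# Try to compress the data again.
recompressed = zlib.compress(already_compressed)

# The second pass cannot shrink the data; it only adds container overhead.
print("before:", len(already_compressed), "after:", len(recompressed))
```

The second figure should come out no smaller than the first, which is why storing the DVS files uncompressed inside the CAB loses nothing.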

Enterprise Vault Collections are configured on the Collections tab, where you can set when the collections run, how large the CAB files can be, and how old items must be before they can be collected into a CAB file.

Note that once you enable Collections, you cannot disable them. The best you can do is either set the age of files to collect so high that nothing would ever be collected, or limit the amount of time that the collections process can run (e.g. by setting the start AND end time to 11:00AM).

A word of caution on the second method, though: when an item is retrieved from a CAB file, it is put back in its original location with an ARCHDVS extension (or ARCHDVSSP, ARCHDVSCC etc. on EV8). Those files are not automatically deleted after the user has finished reading the email.

Instead, it is the Collections process itself that comes along afterwards and deletes the ARCHDVSxx files after a certain period of time. If the collections window is set too short or has 0 seconds to run, the ARCHDVS files cannot be cleaned up and you will end up using disk space twice unnecessarily.
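To see how much space extracted savesets are currently taking, something like the following read-only sketch could be run against a partition folder. The function name is hypothetical, and it assumes that every extracted file has an extension beginning with ARCHDVS, as described above. It only reports; it deletes nothing.

```python
from pathlib import Path


def extracted_saveset_usage(partition_root):
    """Return (file_count, total_bytes) for ARCHDVS* files under a folder."""
    count = 0
    total = 0
    for path in Path(partition_root).rglob("*"):
        # Matches .ARCHDVS, .ARCHDVSSP, .ARCHDVSCC and so on, in any case.
        if path.is_file() and path.suffix.upper().startswith(".ARCHDVS"):
            count += 1
            total += path.stat().st_size
    return count, total
```

For example, extracted_saveset_usage(r"E:\Enterprise Vault Stores\Journal Vault\ptn1") would walk the whole partition, so expect it to take a while on a large store.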

Where are CAB Files stored?
Where the collection files are stored depends on your version of Enterprise Vault.

Enterprise Vault 2007 and below:
The following folder structure is used to store DVS files, and the Collections are placed in the “Day” folder.

Files are stored in a yyyy\mm\dd\hh format. For example:
E:\Enterprise Vault Stores\Journal Vault\ptn1\2010\01\30\17\<saveset>.dvs

The above would represent an item archived at 5pm on 30th January 2010.
The CAB files are stored in the \dd\ folder, so a collection may look like:
E:\Enterprise Vault Stores\Journal Vault\ptn1\2010\01\30\Collection12345.cab
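The EV2007 layout can be sketched as a pair of small helpers. The function names are hypothetical, and the timestamp passed in is whichever date EV used for the folder (one of the comments below notes that before EV 8 this was the date on the item itself).

```python
from datetime import datetime
from pathlib import PureWindowsPath


def ev2007_saveset_dir(partition_root, stamp):
    """Folder holding the DVS file: <root>\\yyyy\\mm\\dd\\hh."""
    root = PureWindowsPath(partition_root)
    return root / f"{stamp:%Y}" / f"{stamp:%m}" / f"{stamp:%d}" / f"{stamp:%H}"


def ev2007_collection_dir(partition_root, stamp):
    """Folder holding the CAB file: the day folder, one level above the hour."""
    return ev2007_saveset_dir(partition_root, stamp).parent


root = r"E:\Enterprise Vault Stores\Journal Vault\ptn1"
print(ev2007_saveset_dir(root, datetime(2010, 1, 30, 17)))
print(ev2007_collection_dir(root, datetime(2010, 1, 30, 17)))
```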

In Enterprise Vault 8, however, the layout is slightly different.
Items are stored under \yyyy\mm-dd\LETTER

Example:
E:\Enterprise Vault Stores\Journal Vault\2010\01-11\A\074\<saveset>.dvs

The above would indicate an item archived on 11th January 2010.
However, rather than using an additional hour folder as EV2007 did, it now uses parts of the DVS file name.

In this example the file is named A07465CEEC2320A040210B08E3549781.DVS. The name is based on the Transaction ID assigned to the item. EV takes the first character of the Transaction ID (A) as one folder, and then creates a subfolder named for the next three characters (074).

As another example, if an item is called 107DC3824ADB33CDABCE5C15B7B46BD1.DVS and was archived on January 11th 2010, it would be stored in the following location:
E:\Enterprise Vault Stores\Journal Vault\2010\01-11\1\07D

On Enterprise Vault 8, collection files are stored in the folder named for the first character of the Transaction ID.
For instance, the collection file may be stored here:
E:\Enterprise Vault Stores\Journal Vault\2010\01-11\1\Collection12345.cab
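The EV8 layout can be sketched the same way. The helper names are hypothetical; the logic simply follows the description above (archive date, then the first character of the Transaction ID, then the next three characters).

```python
from datetime import date
from pathlib import PureWindowsPath


def ev8_saveset_dir(partition_root, archived, transaction_id):
    """Folder holding the DVS file: <root>\\yyyy\\mm-dd\\<1st char>\\<next 3>."""
    day = PureWindowsPath(partition_root) / f"{archived:%Y}" / f"{archived:%m-%d}"
    return day / transaction_id[0] / transaction_id[1:4]


def ev8_collection_dir(partition_root, archived, transaction_id):
    """Folder holding the CAB file: the first-character folder."""
    return ev8_saveset_dir(partition_root, archived, transaction_id).parent


root = r"E:\Enterprise Vault Stores\Journal Vault"
tid = "107DC3824ADB33CDABCE5C15B7B46BD1"
print(ev8_saveset_dir(root, date(2010, 1, 11), tid))
print(ev8_collection_dir(root, date(2010, 1, 11), tid))
```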

What happens when I delete an item or run storage expiry?
When items are added to a CAB file, they remain there until a process called Sparse Collections runs. It extracts the savesets that are still valid and then deletes the CAB file; those savesets are re-collected at a later date.

When an item is deleted, Enterprise Vault cannot simply delete it from within a CAB file (this applies equally to other archive formats such as ZIP or RAR). You therefore get into a situation where items have been deleted from the databases and indexes but still remain in the CAB files.

So Enterprise Vault looks up all the items in a CAB file and determines which ones are still valid. If only a small percentage of the items still exist, EV extracts those items and deletes the CAB file.

So how does Enterprise Vault know which cab files to check?
When a collection file is created, two SQL columns are populated in the Collections table.
One is called RefCount and the other is called TotalCount.

When a Collection is first created, EV counts how many items it contains and sets RefCount and TotalCount to that number. So if 100 items are stored, both RefCount and TotalCount will be set to 100.

Then, whenever an item in that collection is deleted or expired, RefCount is reduced, but TotalCount stays the same.

So if 50 items that belong to that CAB file are deleted, RefCount will be 50 and TotalCount will remain at 100. When RefCount reaches 0, none of the DVS files within that CAB file exist in the database or the indexes any longer, so the CAB file and all its contents can be deleted.
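The bookkeeping can be modelled with a small in-memory sketch. This is not the real SQL schema or any Enterprise Vault API: the class and method names are hypothetical, and only the two counters described above are represented.

```python
from dataclasses import dataclass


@dataclass
class CollectionRow:
    """Toy model of one collection's counters (hypothetical, not EV's schema)."""
    total_count: int  # items placed in the CAB when it was created
    ref_count: int    # items still referenced by the database and indexes

    @classmethod
    def create(cls, item_count):
        # On creation both counters start at the number of collected items.
        return cls(total_count=item_count, ref_count=item_count)

    def remove_items(self, n):
        # Deleting or expiring items lowers ref_count only.
        if n < 0 or n > self.ref_count:
            raise ValueError("cannot remove more items than are referenced")
        self.ref_count -= n

    @property
    def cab_can_be_deleted(self):
        # Nothing in the CAB is referenced any more.
        return self.ref_count == 0


row = CollectionRow.create(100)
row.remove_items(50)
print(row.ref_count, row.total_count, row.cab_can_be_deleted)
```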

But what happens if you have a RefCount of 1 and a TotalCount of 100? The 1 item that still exists in EV is stopping the other 99 items from being removed from disk and freeing up storage. This is where the Sparse Collections process comes in.

The remaining items are extracted to their original location, RefCount is set to 0, and then EV deletes the CAB file. By default, Enterprise Vault initiates sparse collection when RefCount is 15% of the TotalCount or lower.

So if you have 100 items stored in a CAB file, as soon as RefCount drops to 15 items or fewer, EV will extract them and then delete the CAB file. So if you ever run storage expiry, make sure you run your collections process afterwards so that you can reclaim disk space immediately.
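The decision described above can be written as a short function. This is only a sketch of the rule as explained in this article, not Enterprise Vault's actual code; the function name and the returned labels are hypothetical, and 15 is the default percentage mentioned above.

```python
def collection_action(ref_count, total_count, sparse_percent=15):
    """Decide what the collections run would do with one CAB file."""
    if ref_count == 0:
        # No live items left: the CAB can simply be deleted.
        return "delete"
    # Integer arithmetic avoids floating-point rounding at the boundary.
    if ref_count * 100 <= total_count * sparse_percent:
        # Few live items left: extract them, then delete the CAB.
        return "sparse"
    return "keep"


for ref in (100, 16, 15, 1, 0):
    print(ref, collection_action(ref, 100))
```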

Comments (12)

Bruce Crankshaw

Interesting read, explains the process nicely, thx :)

chhabrak

Very informative. Thanks

Jayasimha

You mentioned that the directory structure where DVS files are placed represents when the item was archived. But in my environment I can see a directory structure starting with the year 1899. Why were these created when EV did not exist at that time?

Thanks
Jayasimha

TonySterling

Prior to EV 8 the directory structure date was based on the actual item. Rogue years like the 1899 you see come from malformed or spam mail items that have an erroneous date/time on them.

Jayasimha

Hi Tony,

In EV8, is the directory structure based on when the message is archived? Or does it depend on the "original time message is created" attribute of the mail?

Thanks
Jayasimha

TonySterling

It is based on when the message is archived.

Stumpy

Our CAB files, since upgrading to version 8, appear to be much smaller than before, probably 1/10th of the size, even though the CAB file size has been set to 20MB. This is having a large impact on our backup performance.

Is there any way of forcing a re-cab process where the smaller cabs can be extracted and re-built into larger cab files?

John Santana

Cool, many thanks for the great article Jesus!

Kind regards,

John Santana
IT Professional

--------------------------------------------------

Please be nice to me as I'm newbie in this forum.

AbdulKadir

Amazing article Jesus. When do the temporary files (ARCHDVSSP) get deleted? Is there any process to initiate it? I ask because I have loads of ARCHDVSSP temporary files eating up hard drive space.

Thank you in advance.

DeadEyedJacks

@AbdulKadir,

The same collection process removes aged ARCH files 24 hours after they were last accessed. 

Authorised Symantec Consultant on Archiving and eDiscovery ASC, STS, SCS, SSE+

Microsoft, NetApp and VMware certified professional MCTS, MCSE, MCSA, NCDA, NCIE-BR, VCP, VTSP

GertjanA

Hello all, and JW3 especially,

You write in this excellent article:

Quote: But what happens if you have a refcount of 1 and a totalcount of 100? This 1 item that still exists in EV is stopping the other 99 items from being removed from disk and freeing up storage. So what happens is the Sparse collections process.

The last items are extracted to their original location, the refcount is set to 0 and then EV deletes the CAB file. By default, Enterprise Vault will initiate the sparse collections when the refcount is 15% of the totalcount.

End quote.

Regarding the sentence "The last items are extracted to their original location....":

Do those items have the extension DVS or ARCHDVS? I assume (as they need to be 're-cabbed') it is DVS, but I just need to be sure.

Thank you, Gertjan, MCSE, MCITP,MCTS, SCS, STS
Company: www.t2.nl

www.quadrotech-it.com

www.symantec.com/vision

John Santana

OK, so if the EV vault store is converted into collections and the .CAB files are then written to tape using Symantec NetBackup, can users still retrieve the archived items after the tape is recalled?

The old data was archived while it was still on EV 9 and the new EV server running in production is EV 10.

Kind regards,

John Santana
IT Professional

--------------------------------------------------

Please be nice to me as I'm newbie in this forum.
