
Need validation on deduplication of email msgs in Clearwell

Created: 23 Sep 2013 • Updated: 25 Sep 2013 | 5 comments
This issue has been solved. See solution.

Our legal team is questioning whether documents are being deduplicated after several custodians' evault containers are collected and then processed in Clearwell.  For example, the corpus contains two separate documents even though they are the same email message.  One appears as a message from Custodian A's Sent Items, where Custodian A is the sender.  The second appears as a message in Custodian B's inbox, where Custodian B is the recipient.  My understanding is that these are not identical, and therefore not deduplicated, because one document comes from the sender and the other from the recipient.  Is this assumption correct?

This is further complicated by the sender's email address being identified as username@internal_domain instead of username@<mycompanyname>.com.  Does the way the email address is interpreted affect whether these messages are identified as unique and therefore not deduplicated?


Comments

Daly Whyte


The sender e-mail address is used as part of the hash calculation. This may be why they're not de-duplicated: the hashes will differ if the sender e-mail address is different.

For absolute certainty, it would be best to raise a case for someone to confirm.

SMAF

I have a document from the pre-Symantec days that thoroughly describes deduplication. It is most likely that this has changed since that time - version 5.x I think. I would request the "new" document from your sales rep or SE contact.

Essentially what it says is that the hash uses the sender address, the To/From/CC/BCC address lists in sorted order, the sent date/time (UTC), the subject, the full text of the content (alphanumeric characters only), the count of enclosed emails, and the attachment properties (name, size, MD5 hash).

The name of the document is:  CW_De-Duplication_Overview_101409.pdf
The metadata says it was created in 2009. It's old as the hills.

You should ask for the latest copy of this document and do not depend on my summary above for any purposes whatsoever. And before you ask, no, I can't forward the document to you. It has a do-not-redistribute footer.
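
To make the summary above concrete, here is a minimal sketch of a dedup key built from the fields listed. It is not Clearwell's implementation: the `Email` and `Attachment` types, the `dedup_key` function, the field separator, the choice of SHA-256, and the example addresses are all assumptions made for illustration. It only shows how a differing sender address, such as username@internal_domain versus a public address, would produce a different key even when everything else matches.

```python
import hashlib
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Attachment:
    # Hypothetical container for the attachment properties in the summary.
    name: str
    size: int
    md5: str


@dataclass(frozen=True)
class Email:
    # Hypothetical message record; field names are illustrative only.
    sender: str
    addresses: tuple          # combined To/From/CC/BCC addresses
    sent_utc: str             # sent date/time, already normalized to UTC
    subject: str
    body: str
    enclosed_emails: int = 0
    attachments: tuple = ()


def dedup_key(msg: Email) -> str:
    """Build a key from the fields named in the summary (assumed layout)."""
    parts = [
        msg.sender,
        ";".join(sorted(msg.addresses)),        # address lists in sorted order
        msg.sent_utc,
        msg.subject,
        re.sub(r"[^0-9A-Za-z]", "", msg.body),  # alphanumeric content only
        str(msg.enclosed_emails),
    ]
    parts += [f"{a.name}|{a.size}|{a.md5}" for a in msg.attachments]
    # SHA-256 and the separator are assumptions, not the product's choice.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


sent_copy = Email(
    sender="usera@internal_domain",
    addresses=("userb@example.com", "usera@internal_domain"),
    sent_utc="2013-09-20T14:05:00Z",
    subject="Contract draft",
    body="Please review the attached draft.",
)
# Same message with the address order and body whitespace changed.
reordered_copy = replace(
    sent_copy,
    addresses=tuple(reversed(sent_copy.addresses)),
    body="Please review\r\nthe attached draft.",
)
# Same message, but the sender is recorded under a different address.
inbox_copy = replace(sent_copy, sender="usera@example.com")

print(dedup_key(sent_copy) == dedup_key(reordered_copy))  # True
print(dedup_key(sent_copy) == dedup_key(inbox_copy))      # False
```

In this sketch, address order and whitespace do not change the key, while the sender address does. Whether the real product behaves the same way is something only the current documentation or a support case can confirm.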



SMAF

And...another thing....

It wasn't quite clear from your description if you were pulling from the Custodian's Personal Vaults or if you were pulling from the Journal Vault (my terms). They are two different things.

The configuration of your Journal process is important for you to know. Generally the Journal Vault is just one big bucket of mail. There's no concept of Custodian or this is the "Sent" email and this is the "Received" email. There can be de-duplication on intake based on the Exchange server or some other policy. If two people are on the same Exchange server only one copy of the email is journaled. If one person is on one Exchange server and the other one is on another one, two copies get journaled. So Clearwell de-duplication is one thing but the Journal process can also deduplicate. It can also be set not to Journal certain kinds of mail. To really understand what's happening end-to-end you need to understand your Journal process as well.
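
As a rough illustration of the same-server versus different-server point above, the sketch below counts journaled copies as the number of distinct Exchange servers hosting the participants. This is a deliberately simplified model: the `journaled_copies` function, the mailbox names, and the server names are hypothetical, and a real journaling policy may differ.

```python
def journaled_copies(mailbox_servers: dict) -> int:
    """Count copies under the simplified rule: one copy per distinct server.

    mailbox_servers maps each participant's mailbox to the (hypothetical)
    Exchange server that hosts it.
    """
    return len(set(mailbox_servers.values()))


same_server = {"custodian_a": "EXCH01", "custodian_b": "EXCH01"}
split_servers = {"custodian_a": "EXCH01", "custodian_b": "EXCH02"}

print(journaled_copies(same_server))    # 1
print(journaled_copies(split_servers))  # 2
```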

PS...I am not on the Enterprise Vault team at my company, but they are my best friends! I'm on the DA/Clearwell team.



AMEC-SLH

Hi Susan, these messages were collected from vaulted email, not Journal.  We don't do Journal vaulting in our company.  But thanks for the information.

In the meantime, I've spoken to Clearwell tech support, who confirmed that the two messages are in fact unique and therefore not deduplicated.  Good to get the validation.

Thanks for all the posts!