
Exact Data Match (EDM) Indexing Proximity Logic Demystified

Created: 05 Oct 2011 | Updated: 14 Oct 2011 | 2 comments
Keith Reynolds - ExchangeTek

There is always a lot of confusion about the proximity logic used when matching on multiple data elements with an Exact Data Match (EDM) rule. I hope this article demystifies how it works. I have always been a little unsure of it myself, so this post is backed by some basic testing that I performed to validate how token proximity actually works.

The current definition for the proximity setting on a detection server, EDM.SimpleTextProximityRadius, states:

Number of tokens to the left and to the right of the current token that are evaluated together when the proximity check is enabled.

The default value for this setting is 35.

A "token", as the term is used here, is not simply a character or a space, which seems to be the common misconception. Many people interpret the setting as "if data element 1 is within 35 characters of data element 2, a match will be detected, but if those elements are more than 35 characters apart, no match will be detected". This is not the case. A "token", in simple terms, can be thought of as a word or other string of characters separated by common delimiters such as a space or a tab. So what the definition is really saying is that, for each token being read, the engine evaluates up to 35 tokens to its left and up to 35 tokens to its right together; data elements that fall outside that window are not matched with the ones inside it.
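
To make the difference between counting characters and counting tokens concrete, here is a minimal Python sketch. It assumes tokens are produced by splitting on runs of whitespace, which is a simplification for illustration only; the detection engine's actual tokenizer is not described in this article and may treat punctuation and other delimiters differently. The helper name `tokenize` is hypothetical.

```python
def tokenize(text):
    """Split text on runs of whitespace (spaces, tabs, newlines).

    Assumption: this stands in for the real tokenizer, whose exact
    delimiter rules are not covered in this article.
    """
    return text.split()


# Two data elements separated by 50 spaces and a tab.
sample = "Smith" + " " * 50 + "\t" + "6011456734231982"

tokens = tokenize(sample)
# Characters between the end of "Smith" and the start of the card number.
char_gap = sample.index("6011456734231982") - len("Smith")
# Tokens strictly between the two elements.
token_gap = tokens.index("6011456734231982") - tokens.index("Smith") - 1

print(tokens)
print(char_gap)
print(token_gap)
```

Under this assumption the two elements are well over 35 characters apart, yet there are no tokens between them, so reading the setting as a character count would give the wrong answer.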

Consider an EDM profile that contains Credit Card Numbers and Last Name, with a rule that is looking for both of those elements.  If the data that is being evaluated looked like this:

"My last name is Smith and my current credit card number is 6011456734231982"

...then during evaluation the detection engine will eventually look at the token "credit" and find a Last Name 4 tokens to its left and a Card Number 4 tokens to its right. Both fall within the radius of 35, so it will detect a match.
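
The token counts in this example can be checked with a short Python sketch. It is only an illustration of the counting, not the detection engine's implementation: it assumes whitespace tokenization and assumes that an element exactly `radius` tokens away still counts as inside the window. The function names `offsets_from` and `in_window` are hypothetical.

```python
def offsets_from(tokens, current, targets):
    """Return the signed token offset of each target relative to `current`.

    Negative offsets are to the left, positive offsets to the right.
    """
    center = tokens.index(current)
    return {t: tokens.index(t) - center for t in targets}


def in_window(offsets, radius):
    """True if every offset lies within `radius` tokens of the current token.

    Assumption: the boundary is inclusive.
    """
    return all(abs(o) <= radius for o in offsets.values())


text = "My last name is Smith and my current credit card number is 6011456734231982"
tokens = text.split()  # assumption: whitespace tokenization

offsets = offsets_from(tokens, "credit", ["Smith", "6011456734231982"])
print(offsets)                 # {'Smith': -4, '6011456734231982': 4}
print(in_window(offsets, 35))  # True
```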

I find that a simpler way to think about this, rather than counting from the center token as illustrated above, is to multiply the EDM.SimpleTextProximityRadius by two and subtract one (to account for the current token being evaluated), and use the result as the maximum number of tokens allowed between matching elements.

Max Tokens Between Elements to Detect a Match = (EDM.SimpleTextProximityRadius * 2) - 1 = (35 * 2) - 1 = 69

So with the default setting of 35 for this parameter, if there are 70 tokens between Last Name and Card Number, a match will not be detected. If there are 69 tokens or fewer between these elements, a match will be detected.
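
The rule of thumb above can be written as a small Python sketch. It encodes only the formula from this article, not the behavior of the detection engine itself, and the names `max_tokens_between` and `within_proximity` are hypothetical.

```python
# Default value of EDM.SimpleTextProximityRadius, as stated in this article.
DEFAULT_RADIUS = 35


def max_tokens_between(radius=DEFAULT_RADIUS):
    """Maximum tokens between two elements that can still match.

    Doubling the radius covers both sides of the current token;
    subtracting one accounts for the current token itself.
    """
    return radius * 2 - 1


def within_proximity(tokens_between, radius=DEFAULT_RADIUS):
    """True if the number of tokens between two elements allows a match."""
    return tokens_between <= max_tokens_between(radius)


print(max_tokens_between())   # 69
print(within_proximity(69))   # True
print(within_proximity(70))   # False
```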

I hope you find this helpful in understanding this parameter better, and its effect on detection using Exact Data Match profiles.  I welcome any comments or feedback.

~Keith

Comments

Syed Hussain -Compliance Devil

Good one

 

Thanks,

-Syed Hussain

 

velvin

A question here...

Let's say our EDM rule is looking for 3 pieces of data. If we have EDM.SimpleTextProximityRadius set to the default of 35, does this mean all 3 pieces of data need to be within 35 tokens from the first to the last, or does the token count reset when it goes from the first piece of data to the second one it finds? Your x2-1 formula seems to point to 35 tokens from first to last, but I wanted more clarification.

The problem is that we have patient names as two separate fields in EDM (first name, last name), so that DLP picks up the name as qualifying data however it is written (Doe, John; John doe; etc.). If we were only using the name in EDM, we would obviously shorten the EDM.SimpleTextProximityRadius value to something like 3 tokens, but we also use EDM for medical record #, DOB, SS#, zip code, etc.

So as you can see, it is quite possible that the data we're looking for is spread out beyond 35 tokens. However, keeping the value at 35 or higher keeps producing false positives when a file containing many names and similar data is sent (e.g. a spreadsheet or contact list). With so many names, it is bound to pick up one person's first name and another person's last name and qualify the pair as a match on our EDM patient name fields.

We have other methods to exclude false positives, but it would be great if anyone could share some additional insight into using EDM and the EDM.SimpleTextProximityRadius setting.

 

 
