
keyword proximity matching

Created: 21 May 2012 | 3 comments

Hello,

 

I'm trying to tune some policies using keyword proximity matching as precisely as possible. When DLP raises an incident, it doesn't show the word distance, so I can't check how this distance is really computed. Does anybody know whether keyword proximity matching counts all types of words (for example one-character words or non-alphanumeric characters) or not?

 regards

 


xlloyd's picture

According to the admin guide, the distance is 10 words. Not sure if this is configurable or not. Here's an excerpt from the guide:

 

Note: The word distance (proximity value) is exclusive of detected keywords.
Thus, a word distance of 10 allows for a proximity window of 12 words.
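
As a rough illustration of that arithmetic, here is a small Python sketch. It is not Symantec's algorithm: the function names are made up for this example, and splitting on whitespace (so that one-character words and bare punctuation each count as a word) is only an assumption about how words might be counted.

```python
def proximity_window(word_distance: int) -> int:
    """Words spanned by a match: the words in between plus the two keywords."""
    return word_distance + 2


def words_between(text: str, first: str, second: str):
    """Smallest number of tokens lying between the two keywords, or None.

    Assumption: tokens are whitespace-separated, so a lone "-" or a
    one-letter word counts as one word. The real engine may count differently.
    """
    # Trim common punctuation stuck to a word; standalone symbols stay tokens.
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    pos_a = [i for i, t in enumerate(tokens) if t == first.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == second.lower()]
    gaps = [abs(i - j) - 1 for i in pos_a for j in pos_b if i != j]
    return min(gaps) if gaps else None


print(proximity_window(10))
print(words_between("Confidential - x information.", "confidential", "information"))
```

Under this assumption the keywords match a distance setting of N whenever `words_between` returns a value of N or less.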
Daniel K.'s picture

Keyphrase rules in Symantec DLP have two sections: one is a list of keywords and the other is proximity. Keyword lists are good for very distinctive keywords but should not be used for common words, as the results will unnecessarily inflate and obscure the match count. Use proximity for the highest recall and precision of keyphrases.

Proximity is defined by two sets of expression lists (A and B).  You can have as many combinations of expression lists as you like in a single ruleset. Also consider ANDing keyword expression lists for even greater precision.

I personally have tested distance at 50 words for high recall but usually rely on distance between 10 and 25. There are tools that can be used to identify keyword proximity combinations like nearest neighbor or n-gram recommendations. Any future release should consider recommending words and distances.

Max value is 99.  The higher the number the greater the recall.

30 is the default value for EDM.

Some proximity rulesets may require that you only look in one direction (forward/reverse). If this is the case they can be defined with a regular expression.
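
A one-direction check of this kind can be sketched with a regular expression. The example below uses Python's `re` module purely for illustration; the helper name is hypothetical and the regex syntax accepted by DLP may differ. Note that here a "word" is a run of `\w` characters, so punctuation acts as a separator rather than counting as a word.

```python
import re


def forward_proximity_pattern(first: str, second: str, max_words: int) -> re.Pattern:
    """Match `first` followed by `second` with at most `max_words` words between.

    Only the forward direction matches: `second` appearing before `first`
    is ignored. Swap the arguments to look in the reverse direction.
    """
    return re.compile(
        r"\b" + re.escape(first)
        + r"\W+(?:\w+\W+){0," + str(max_words) + r"}?"
        + re.escape(second) + r"\b",
        re.IGNORECASE,
    )


pattern = forward_proximity_pattern("confidential", "information", 3)
print(bool(pattern.search("Confidential client billing information")))
print(bool(pattern.search("information about confidential things")))
```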

I rely on proximity more than any other method in SDLP 11.x because it has the best combination of recall and precision in a single method, and it overcomes the significant drawback of plain keyword lists, which get counted for every single occurrence.

 

kishorilal1986's picture

 Hi Stephan,

You can use keyword proximity to exclude matching words within a specified distance by using the "Content Matches Keyword" rule as a detection exception. In this case, any occurrence of the words "confidential" and "information" within 10 words of each other is excepted from matching.

Note:
 The word distance (proximity value) is exclusive of detected keywords. Thus, a word distance of 10 allows for a proximity window of 12 words. 

The maximum distance between keywords is 999, as limited by the three-digit length of the "Word distance" field. The word distance is exclusive of detected keywords. For example, a word distance of 10 allows for a range of 12 words, including the two words comprising the keyword pair.

  • Repeat the process to add additional keyword pairs.
  • The system connects multiple keyword pair entries with the OR Boolean operator, meaning that the detection engine evaluates each keyword pair independently.
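
The OR behaviour across keyword pairs could be pictured roughly like this. It is a minimal Python sketch with hypothetical names, assuming whitespace tokenization and exact, case-insensitive word matches, not a description of how the detection engine is implemented.

```python
def pair_within_distance(tokens, first, second, distance):
    """True if the two keywords occur with at most `distance` words between them."""
    pos_a = [i for i, t in enumerate(tokens) if t == first]
    pos_b = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(i - j) - 1 <= distance
               for i in pos_a for j in pos_b if i != j)


def any_pair_matches(text, pairs, distance):
    """Each pair is evaluated on its own; a single matching pair is enough (OR)."""
    tokens = text.lower().split()  # assumption: words are whitespace-separated
    return any(pair_within_distance(tokens, a, b, distance) for a, b in pairs)


pairs = [("confidential", "information"), ("internal", "only")]
print(any_pair_matches("for internal use only", pairs, 10))
print(any_pair_matches("confidential report", pairs, 10))
```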