
Help with Regular Expressions

Created: 21 Mar 2014 | 2 comments
Ravish Shah

Hi Everyone,

We have a requirement to implement a DLP policy that monitors for, and triggers an incident on, "First/Last name + Phone Number + Email address + Postal code".

We have successfully created and tested regular expressions to detect this confidential information. However, during testing we are noticing a performance hit when messages are analyzed with the regex-based policy turned on. CPU usage jumps to about 70% or more when ingesting our test messages in batches of ~200 messages.

As I am not an expert in writing regular expressions, can someone please review these and let me know whether any improvements would help with the performance issue? Can these be optimized? If so, how?

1) Email address: 
(?i)\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b 

2) Phone number: 
\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b 

3) Postal code (Canadian):
\b[ABCEGHJKLMNPRSTVXY][0-9][A-Z] [0-9][A-Z][0-9]\b 

4) First and Last name: 
\b[A-Z][-'a-zA-Z]+,?([\s\|]|\s{2,})[A-Z][-'a-zA-Z]{0,19}\b 
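For anyone who wants to check what these four patterns actually match before loading them into a policy, here is a minimal sketch using Python's `re` module. Python is only a stand-in: the engine inside DLP is not documented in this thread and may behave differently. The `detect` helper and the sample string are made up for illustration.

```python
import re

# The four patterns from the post, unchanged. Using Python's `re` here is an
# assumption: the DLP engine may not treat every construct the same way.
PATTERNS = {
    "email": r"(?i)\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b",
    "phone": r"\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b",
    "postal_code": r"\b[ABCEGHJKLMNPRSTVXY][0-9][A-Z] [0-9][A-Z][0-9]\b",
    "name": r"\b[A-Z][-'a-zA-Z]+,?([\s\|]|\s{2,})[A-Z][-'a-zA-Z]{0,19}\b",
}

# Compile once and reuse, rather than recompiling for every message.
COMPILED = {label: re.compile(pattern) for label, pattern in PATTERNS.items()}


def detect(text):
    """Return every match of each pattern in `text`, keyed by pattern label."""
    return {
        label: [m.group(0) for m in rx.finditer(text)]
        for label, rx in COMPILED.items()
    }


# Hypothetical test message containing one value of each kind.
SAMPLE = "Smith, John <john.smith@example.com> 416-555-0123 M5V 3L9"

if __name__ == "__main__":
    for label, hits in detect(SAMPLE).items():
        print(label, hits)
```

Running the same helper over a batch of real test messages is a cheap way to see which pattern produces the most matches, which is usually where both the CPU time and the false positives come from.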

Also, if we create a Custom Data Identifier (using the exact same patterns as in the regexes), will it have any performance benefit over using regular expressions directly?

Thanks for your help in advance. 


Jsneed

Symantec isn't very forthcoming about the regex engine it uses, which makes this kind of optimization hard. We have resorted to trial and error to optimize our expressions. I do know that the \b directive seems to cause problems for us. Looking at your regexes, a normal space could probably be used on both sides of your expressions instead of \b.
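One caveat with swapping the word-boundary directive for literal spaces: the two are not equivalent, so the set of matches changes. The sketch below uses Python's `re` (an assumption, not the DLP engine) and a simplified version of the phone pattern to show what a space-delimited variant misses. It says nothing about speed inside DLP, which would still need to be measured by trial and error.

```python
import re

# Simplified phone pattern with word boundaries, as in the original post.
WITH_BOUNDARY = re.compile(r"\b[0-9]{3}[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b")

# Same pattern with a literal space on each side instead of \b.
WITH_SPACES = re.compile(r" [0-9]{3}[-. ]?[0-9]{3}[-. ]?[0-9]{4} ")


def compare(text):
    """Return (matched_with_boundary, matched_with_spaces) for `text`."""
    return (
        WITH_BOUNDARY.search(text) is not None,
        WITH_SPACES.search(text) is not None,
    )


if __name__ == "__main__":
    for text in (
        "Call 416-555-0123 now",   # surrounded by spaces
        "Call 416-555-0123.",      # followed by punctuation
        "416-555-0123 is my cell", # at the very start of the text
    ):
        print(repr(text), compare(text))
```

If the space-delimited form is adopted, numbers at the start or end of a message, or next to punctuation, need to be covered some other way.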

 

DLP Solutions

Ravish,

It looks like you are trying to match against a customer or employee list. Why not create an EDM (Exact Data Matching) profile? It will be more accurate and produce fewer false positives.
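To illustrate why matching against a known list tends to be more accurate, here is a conceptual sketch in Python. It is not how EDM is implemented in the product. The record list, the `records_exposed` helper, and the "at least two fields from the same record" rule are all assumptions made for the example.

```python
# Hypothetical customer list. A real profile would be built from an export
# of the actual data source.
KNOWN_RECORDS = [
    {"name": "John Smith", "email": "john.smith@example.com", "phone": "416-555-0123"},
    {"name": "Jane Doe", "email": "jane.doe@example.com", "phone": "604-555-0199"},
]


def records_exposed(text, records=KNOWN_RECORDS, min_fields=2):
    """Return names of records with at least `min_fields` values present in `text`."""
    haystack = text.lower()
    exposed = []
    for record in records:
        # Count how many of this record's field values appear in the text.
        found = sum(1 for value in record.values() if value.lower() in haystack)
        if found >= min_fields:
            exposed.append(record["name"])
    return exposed


if __name__ == "__main__":
    print(records_exposed("Please call John Smith at 416-555-0123."))
    print(records_exposed("Best Regards, Monday Morning team, 905-555-0100"))
```

Because a hit requires several values from the same known record, a random capitalized phrase or an unrelated phone number in a signature does not trigger anything.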

Also, the items you are trying to create regexes for are going to give you thousands of false positives:

  1. An email address is in EVERY single email, and you will trigger on a ton of them, since reply threads are included in the message and contain email addresses as well.
  2. A phone number is in the footer of almost every email.
  3. First name and last name is so arbitrary that it can trigger on just about anything. This is exactly why you should create an EDM profile.
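The third point is easy to reproduce. The sketch below runs the name pattern from the original post through Python's `re` (a stand-in for the DLP engine, so treat the exact behavior as an assumption) against an ordinary sentence that contains no personal names at all.

```python
import re

# Name pattern copied from the original post.
NAME = re.compile(r"\b[A-Z][-'a-zA-Z]+,?([\s\|]|\s{2,})[A-Z][-'a-zA-Z]{0,19}\b")


def name_hits(text):
    """Return every substring the name pattern matches in `text`."""
    return [m.group(0) for m in NAME.finditer(text)]


# Made-up business sentence with no person names in it.
SENTENCE = "Please see the Quarterly Report from the Finance Team before Monday."

if __name__ == "__main__":
    print(name_hits(SENTENCE))
    print(name_hits("Best Regards"))
```

Any two adjacent capitalized words qualify, so titles, department names, and sign-offs all count as "names" under this pattern.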

FYI: question marks (?) are not valid in DLP regex. Look at the online help. Also, there are multiple flavors of regex (Java-based and others), so you need to think about which one applies. This is probably why your CPU is running away.

The best way to learn how to use regex is to look at the existing Data Identifiers; they have a ton of information on how to create and use regex, especially the driver's license ones.

As far as performance goes, using a regex directly or using it as a Data Identifier will not make a difference. Creating a Data Identifier does give you better matching, though, because you have the option of doing a "unique match count" with a Data Identifier and NOT with a basic regex.
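The difference between counting every match and counting unique matches matters most on reply threads. The sketch below shows the idea in Python. It does not reproduce how a Data Identifier counts internally: the `match_counts` helper and the lowercase normalization are assumptions for illustration.

```python
import re

# Email pattern from the original post.
EMAIL = re.compile(r"(?i)\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b")


def match_counts(pattern, text):
    """Return (total_matches, unique_matches) for `pattern` in `text`."""
    # Lowercasing so that A@Example.com and a@example.com count as one value
    # is an assumption of this sketch.
    values = [m.group(0).lower() for m in pattern.finditer(text)]
    return len(values), len(set(values))


# Made-up reply thread: the same two addresses repeat in quoted headers.
THREAD = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "> From: a@example.com\n"
    "> To: b@example.com\n"
    ">> From: a@example.com\n"
)

if __name__ == "__main__":
    total, unique = match_counts(EMAIL, THREAD)
    print("total:", total, "unique:", unique)
```

With a threshold on total matches, a long thread between two people can trip the policy on its own. A threshold on unique matches only fires when many distinct values appear.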

So I would use a DI most of the time.

Hope this makes sense.

If this answers your questions, please mark it as solved.

Ronak
