How to implement Exact Data Matching

Created: 11 Mar 2014 • Updated: 11 Mar 2014

Exact Data Matching (EDM) detects content you want to protect that is stored in structured or tabular format. For example, you can use EDM to detect confidential customer information from a database. Or, you can use EDM to detect sensitive financial information from a spreadsheet.

To implement EDM, you identify the structured data you want to protect. You index the data source using the Enforce Server administration console. During the indexing process, the system fingerprints the data by accessing and extracting the text-based content, normalizing it, and securing it using a nonreversible hash. For precise continuous detection, you can schedule indexing on a regular basis so the data is always current. You configure the Content Matches Exact Data From policy detection rule to match individual pieces of the profiled data. For increased accuracy you can configure the rule to match combinations of data fields from a particular record. Once the data source is indexed and the policy is deployed, the detection engine can detect the data in structured or unstructured format.
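The fingerprinting step described above (extract, normalize, hash) can be illustrated with a minimal sketch. The normalization rule, the salt, and the choice of SHA-256 are assumptions made for illustration only; the source does not state which algorithm Symantec Data Loss Prevention actually uses.

```python
import hashlib

def normalize(value: str) -> str:
    """Lowercase the value and keep only letters and digits (an assumed rule)."""
    return "".join(ch for ch in value.lower() if ch.isalnum())

def fingerprint(value: str, salt: bytes = b"example-salt") -> str:
    """Return a one-way SHA-256 digest of the normalized value (hypothetical scheme)."""
    return hashlib.sha256(salt + normalize(value).encode("utf-8")).hexdigest()

def index_rows(rows):
    """Turn each row of field values into a tuple of fingerprints."""
    return [tuple(fingerprint(v) for v in row) for row in rows]

index = index_rows([["Bob", "Smith", "123-45-6789"]])

# The same value written with different separators yields the same fingerprint.
print(fingerprint("123 45 6789") == index[0][2])  # True
```

Because only digests are stored, the index can be compared against hashed input content without holding the original values.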

Consider the following example.

Your company maintains an employee database that contains five columns:

First Name

Last Name

SSN

Date of Hire

Salary

Each row in the database contains information for one employee. You export the records to a data source file. Each record is on a separate line. A comma, tab, or pipe character delimits each data item. For example, one row in the data source file contains Bob,Smith,123-45-6789,05/26/99,$42500. You index the data source file and create an Exact Data Profile. When you configure the profile, you map the data elements (columns) you want to protect. You then configure the EDM policy rule that references the Exact Data Profile. In this example, the rule matches if a message contains the First Name, Last Name, and SSN together. Suppose the profile also contains a record for Raj Malhotra with the SSN 333-65-7841. At runtime, the detection engine reports an incident if it detects "Raj, Malhotra, 333-65-7841" in any inbound message. A message containing "Sam, Malhotra, 333-65-7841" does not match because no record in the profile contains that combination of values. A message that contains "Raj, Malhotra, 625-111-7545" also does not match because the number is not the social security number in that record.
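The matching behavior in this example can be sketched as follows. This is a simplified illustration that works on plain text rather than hashes; the PROFILE contents, the tokenizer, and the function names are assumptions, and the Raj Malhotra record is included only so that the three example messages can be tried.

```python
import re

# Hypothetical profiled records; only the three mapped fields are shown.
PROFILE = [
    {"first": "Bob", "last": "Smith", "ssn": "123-45-6789"},
    {"first": "Raj", "last": "Malhotra", "ssn": "333-65-7841"},
]
REQUIRED = ("first", "last", "ssn")

def tokens(message: str) -> set:
    """Split a message into lowercase word-like tokens, keeping internal dashes."""
    return {t.lower() for t in re.findall(r"[\w-]+", message)}

def matches(message: str) -> bool:
    """True if all required fields of a single record appear in the message."""
    found = tokens(message)
    return any(all(rec[f].lower() in found for f in REQUIRED) for rec in PROFILE)

print(matches("Raj, Malhotra, 333-65-7841"))   # True
print(matches("Sam, Malhotra, 333-65-7841"))   # False
print(matches("Raj, Malhotra, 625-111-7545"))  # False
```

The key point is that all required fields must come from the same record; values from different records do not combine into a match.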

To detect data exactly, Symantec Data Loss Prevention requires a special indexed version of the data. An index is a secure file (or set of files). It contains hashes of the exact data values from each field in your data source, along with information about those data values. The index does not contain the data values themselves, so it is secure.

Indexes consist of one or more secure, binary .rdx files, each sized to fit into random access memory (RAM) on the detection server(s). For a large data source file, Symantec Data Loss Prevention may break the data into several .rdx files. In production, the system converts input content into hashed data values using the same algorithm it employs for indexes. It then compares data values from input content to those in the appropriate .rdx files, identifying matches.

By default, Symantec Data Loss Prevention stores index files in C:\Vontu\Protect\index (on Windows) or in /var/Vontu/index (on Linux) on the Enforce Server and on all detection servers. When the policy is active, Symantec Data Loss Prevention deploys the index to the detection server and the detection server loads the index into RAM.

Implementing Exact Data Matching :
 

To implement EDM, you create the Exact Data Profile, index the data source, and define one or more EDM detection rules to match the profiled data exactly.

Procedure Step 1 : Create the data source file:
 

A] Export the source data from the database (or other data repository) to a tabular text file.
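As a rough illustration of the export step, the following sketch writes query results to a pipe-delimited text file with one record per line. It uses an in-memory SQLite database as a stand-in for your real data repository; the table, column names, and the export_rows helper are hypothetical.

```python
import io
import sqlite3

def export_rows(conn, query, out, delimiter="|"):
    """Write one record per line, unquoted, and return the number of rows written."""
    count = 0
    for row in conn.execute(query):
        fields = ["" if v is None else str(v) for v in row]
        for field in fields:
            # A delimiter or newline inside a value would corrupt the tabular layout.
            if delimiter in field or "\n" in field:
                raise ValueError(f"field contains delimiter or newline: {field!r}")
        out.write(delimiter.join(fields) + "\n")
        count += 1
    return count

# Stand-in database with the example employee record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name, last_name, ssn, hire_date, salary)")
conn.execute(
    "INSERT INTO employees VALUES ('Bob', 'Smith', '123-45-6789', '05/26/99', '$42500')"
)

buffer = io.StringIO()
exported = export_rows(conn, "SELECT * FROM employees", buffer)
print(buffer.getvalue(), end="")
```

In practice you would pass an open file object instead of the StringIO buffer.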

About Data Owner Exception:

The data owner exception (DOE) feature enables data owners to send or receive their own data the system would otherwise prevent from delivery or receipt.

To implement the data owner exception feature, you must include either or both of the following fields in your data source file:

Email address

Domain address

Note: To implement DOE and except data owners from detection, you must explicitly include each user's email address or domain address in the Data Profile. Each expected domain (for example, symantec.com) must be explicitly added to the Data Profile. The system does not automatically match on subdomains (for example, fileconnect.symantec.com). Each subdomain must be explicitly added to the Data Profile.
 

Once you have configured the Exact Data Profile that includes either of these data elements, you can flag either field as the data owner. At runtime if the sender or recipient of the data is the owner, the condition does not trigger a match. The result is that the data is delivered or received.
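The data owner check can be pictured with the sketch below. It assumes exact, case-insensitive comparison of email addresses and domains, which reflects the note that subdomains are not matched automatically; the function names and sample values are hypothetical.

```python
def is_data_owner(address, owner_emails, owner_domains):
    """True if the address or its exact domain is flagged as a data owner."""
    address = address.strip().lower()
    domain = address.rpartition("@")[2]
    return address in owner_emails or domain in owner_domains

def condition_triggers(sender, recipient, owner_emails, owner_domains):
    """The condition does not trigger when the sender or recipient owns the data."""
    return not (
        is_data_owner(sender, owner_emails, owner_domains)
        or is_data_owner(recipient, owner_emails, owner_domains)
    )

owners = {"bob.smith@example.com"}
domains = {"symantec.com"}

print(is_data_owner("jane@symantec.com", owners, domains))              # True
print(is_data_owner("jane@fileconnect.symantec.com", owners, domains))  # False
```

The second call returns False because fileconnect.symantec.com was not added explicitly, even though symantec.com was.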

If you previously implemented DOE manually using configuration files, you must reconfigure these exceptions to run on the latest Enforce Server.

B] If you want to except data owners from matching, you need to include specific data items in the data source file.

About implementing profiled Directory Group Matching :

Symantec Data Loss Prevention lets you detect the exact identities of data users, message senders, and recipients based on a profiled directory server or database.

Symantec Data Loss Prevention provides two static Directory Group Matching methods. Both methods require the use of an Exact Data Profile with specific data fields.

i) Sender/User Matches Directory From Exact Data Profile :

Group-related attributes may include an IP address, email, Windows user name, business unit, department, manager, title, employment status. Other attributes may be whether that employee has provided consent to be monitored, or whether the employee has access to sensitive information.
 

ii) Recipient Matches Directory From Exact Data Profile :
 

You can index a list of recipients' email addresses and author policies based on this indexed data. For example, you can write a detection rule that requires the message sender to be from the customer service department to violate the policy. Or, you could write a detection exception that is not violated if the recipient of an email is on an approved list.

C] If you want to match identities for profiled Directory Group Matching (DGM), you need to include specific data items in the data source file.

Procedure Step 2 : Prepare the data source file for indexing :
 

Remove irregularities from the data source file.

Preparing the exact data source file for indexing:

Once you create the exact data source file, you must prepare it so that you can efficiently index the data you want to protect.

When you index an exact data profile, the Enforce Server keeps track of empty cells and any misplaced data which count as errors. For example, an error may be a name that appears in a column for phone numbers. Errors can constitute a certain percentage of the data in the profile (five percent, by default). If this default error threshold is met, Symantec Data Loss Prevention stops indexing. It then displays an error to warn you that your data may be unorganized or corrupt. Symantec Data Loss Prevention checks for errors only if the data source has at least a thousand rows.
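The error threshold logic can be sketched as follows. This is an approximation under stated assumptions: it counts a row as an error if it has an empty cell or the wrong number of columns, and it measures the percentage per row. The source does not say whether the real calculation is per row or per cell, and detecting misplaced data (such as a name in a phone number column) would require per-field validators that are not shown here.

```python
def check_error_threshold(rows, expected_columns, threshold=0.05, min_rows=1000):
    """Return (error_rate, should_stop) for a list of parsed rows."""
    total = len(rows)
    errors = sum(
        1
        for row in rows
        if len(row) != expected_columns or any(not cell.strip() for cell in row)
    )
    rate = errors / total if total else 0.0
    # Errors are only enforced when the data source has at least min_rows rows.
    should_stop = total >= min_rows and rate >= threshold
    return rate, should_stop

good = ["Bob", "Smith", "123-45-6789"]
bad = ["Bob", "", "123-45-6789"]

print(check_error_threshold([good] * 950 + [bad] * 50, expected_columns=3))
print(check_error_threshold([bad] * 999, expected_columns=3))
```

The first call reaches the five percent threshold, so indexing would stop. The second has fewer than a thousand rows, so errors are not enforced even though every row is bad.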

To prepare the exact data source for efficient EDM indexing:

A] Make sure that the data source file is formatted as follows:

i) If the data source has more than 200,000 rows, verify that it has at least two columns of data. One of the columns should contain reasonably distinct values, such as credit card numbers, driver's license numbers, or account numbers (as opposed to first and last names, which are relatively generic).

ii) Verify that you have delimited the data source using commas, tabs, or pipes ( | ). If the data source uses commas as delimiters, remove any commas that do not serve as delimiters. For example, if a value in the address column is 346 Guerrero St., Apt. 2, delete the comma after Guerrero St.

Note: The pound sign (#), equals sign (=), plus sign (+), semicolon (;) and colon (:) characters are also treated as separators.
 

iii) Verify that data values are not enclosed in quotes.

iv) Remove single-character and abbreviated data values from the data source. (For example, remove the column name and all values for a column in which the possible values are Y and N.) Optionally, remove any columns that contain numeric values with fewer than five digits, as these can cause false positives in production.

v) Verify that numbers, such as credit card or social security numbers, are delimited internally by dashes or spaces, or are not delimited at all. Make sure that you do not use a data-field delimiter (for example, a comma) as an internal delimiter in any such numbers; for example: 123-45-6789, or 123 45 6789, or 123456789, but not 123,45,6789.

vi) Eliminate duplicate records, which can cause duplicate matches in production.

vii) Eliminate spaces in data values by separating the data into two or more fields. For example, the name Joe Brown may appear in input content with a middle name or initial: Joe R Brown, Joe R. Brown, or Joe Robert Brown. If the value Joe Brown appears in a single field in your data source, Symantec Data Loss Prevention detects only the literal string Joe Brown. It does not detect other variants of the name. To ensure that the system detects name variants, divide the name into two fields: a first-name field and a last-name field. You may also want to remove any relatively unimportant text that is separated by a space. For example, for a data value of Mary Jo, you may want to remove Jo entirely. In addition, some data values with inherent spacing, such as San Francisco and New York, may not be critical to your matching criteria, and therefore can be left as they are.

viii) Do not index common values. EDM works best with values that are unique, so consider whether the data you want to index (and thus protect) is truly valuable. A common value is not useful as an EDM value. For example, suppose you want to look for U.S. states. Because there are only 50 states, an exact data profile with 300,000 rows contains many duplicates of these common values. Symantec Data Loss Prevention indexes all values in the exact data profile, regardless of whether the data is used in a policy. It is good practice to use values that are less common, and preferably unique, to get the best results with EDM.
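Several of the formatting checks in this list can be automated before you load the file. The following sketch flags rows with an unexpected column count (often caused by a stray delimiter inside a value), quoted values, and duplicate records. The prepare_rows helper and its messages are hypothetical; it is a starting point, not a complete validator.

```python
def prepare_rows(lines, delimiter=",", expected_columns=None):
    """Return (clean_rows, problems); problems holds (line_number, reason) pairs."""
    seen = set()
    clean, problems = [], []
    for number, line in enumerate(lines, start=1):
        fields = [f.strip() for f in line.rstrip("\n").split(delimiter)]
        # A stray delimiter inside a value shows up as an extra column.
        if expected_columns is not None and len(fields) != expected_columns:
            problems.append((number, "unexpected column count"))
            continue
        # Data values must not be enclosed in quotes.
        if any(len(f) >= 2 and f[0] in "\"'" and f[-1] in "\"'" for f in fields):
            problems.append((number, "quoted value"))
            continue
        key = tuple(fields)
        if key in seen:
            problems.append((number, "duplicate record"))
            continue
        seen.add(key)
        clean.append(fields)
    return clean, problems

sample = [
    "Bob,Smith,123-45-6789",
    "Bob,Smith,123-45-6789",
    'Ann,"Lee",987-65-4321',
    "Joe,Brown,123,45,6789",
]
clean, problems = prepare_rows(sample, expected_columns=3)
print(clean)
print(problems)
```

Only the first sample row survives; the others are reported as a duplicate, a quoted value, and a number that uses the field delimiter internally.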

B] Once you have prepared the exact data source file, proceed with the next step in the EDM process: load the exact data source file to the Enforce Server for profiling the data you want to protect.

Procedure Step 3 : Upload the data source file to the Enforce Server:
 

You can copy or upload the data source file to the Enforce Server, or access it remotely.

Uploading exact data source files to the Enforce Server :
After you have prepared the data source file for indexing, load it to the Enforce Server so the data source can be indexed.

Listed here are the three options you have for making the data source file available to the Enforce Server. Consult with your database administrator to determine the best method for your needs.

To make the data source available to the Enforce Server :

A] If you have a large data source file (over 50 MB), copy it to the "datafiles" directory on the host where Enforce is installed.
i) On Windows this directory is located at DLP_home\Protect\datafiles (for example, C:\Vontu\Protect\datafiles).

ii) On Linux this directory is located at /var/Vontu/datafiles.

This option is convenient because it makes the data file available by reference from a drop-down list during configuration of the Exact Data Profile. If it is a large file, use a third-party solution (such as Secure FTP) to transfer the data source file to the Enforce Server.

Note: Ensure that the Enforce user (usually called "protect") has modify permissions (on Windows) or rw permissions (on Linux) for all files in the "datafiles" directory.
 

B] If you have a smaller data source file (less than 50 MB), upload the data source file to the Enforce Server using the Enforce Server administration console (Web interface). When creating the Exact Data Profile, you can specify the file path or browse to the directory and upload the data source file.
Note: Due to browser capacity limits, the maximum file size that you can upload is 2 GB. However, uploading any file over 50 MB is not recommended since files over this size can take a long time to upload. If your data source file is over 50 MB, consider copying the data source file to the "datafiles" directory using the first option.
 

C] In some environments it may not be secure or feasible to copy or upload the data source file to the Enforce Server. In this situation you can index the data source remotely using the Remote EDM Indexer Utility.

The Remote EDM Indexer is a utility that converts a comma-separated value, or tab-delimited, data file to an Exact Data Matching index. The utility is similar to the local EDM Indexer used by the Enforce Server. However, the Remote EDM Indexer is designed for use on a computer that is not part of the Symantec Data Loss Prevention server configuration.

Using the Remote EDM Indexer to index a data source on a remote machine has the following advantages over using the EDM Indexer on the Enforce Server:

It enables the owner of the data, rather than the Symantec Data Loss Prevention administrator, to index the data.

It shifts the system load that is required for indexing onto another computer. The CPU and RAM on the Enforce Server are reserved for other tasks.

The SQL Preindexer is often used with the Remote EDM Indexer. The SQL Preindexer is used to run SQL queries against SQL databases and pass the resulting data to the Remote EDM Indexer.

This utility lets you index an exact data source on a computer other than the Enforce Server host. This feature is useful when you do not want to copy the data source file to the same machine as the Enforce Server. As an example, consider a situation where the originating department wants to avoid the security risk of copying the data to an extra-departmental host. In this case you can use the Remote EDM Indexer.

D] Proceed with the next step in the EDM process: configuring the Exact Data Profile and indexing the data source.
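The size guidance in options A and B can be expressed as a small helper, shown here as a sketch. It assumes that "MB" and "GB" mean binary units, and the function names and return strings are hypothetical. The source does not say how to treat a file of exactly 50 MB, so this sketch treats it as small.

```python
import os

MB = 1024 * 1024                   # assumption: binary megabytes
COPY_THRESHOLD = 50 * MB           # above this, copying to "datafiles" is recommended
BROWSER_LIMIT = 2 * 1024 * MB      # maximum size for a browser upload (2 GB)

def can_upload(size_bytes: int) -> bool:
    """True if the file is within the browser upload limit."""
    return size_bytes <= BROWSER_LIMIT

def recommend_transfer(size_bytes: int) -> str:
    """Suggest how to make a data source file available to the Enforce Server."""
    if size_bytes > COPY_THRESHOLD:
        return "copy to the datafiles directory"
    return "upload through the administration console"

def recommend_for_file(path: str) -> str:
    """Apply the recommendation to a file on disk."""
    return recommend_transfer(os.path.getsize(path))

print(recommend_transfer(10 * MB))
print(recommend_transfer(300 * MB))
```

If neither copying nor uploading is acceptable in your environment, the Remote EDM Indexer remains the third option.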

Procedure Step 4 : Create an Exact Data Profile:
 

The Exact Data Profile specifies the data source, the indexing parameters, and the indexing schedule.

The Manage > Data Profiles > Exact Data > Add Exact Data Profile screen is the home page for managing and adding Exact Data Profiles. An Exact Data Profile is required to implement an instance of the Content Matches Exact Data detection rule.

Once you have created the EDM profile, you index the data source and configure one or more detection rules to use the profile and detect exact content matches.

Procedure Step 5 : Map the data fields.
 

You map the source data fields to system or custom data types that the system validates. For example, a social security number data field needs to be nine digits.

Column headings in your data source are useful for visual reference. However, they do not tell Symantec Data Loss Prevention what kind of data the columns contain. You use the Field Mappings section of the Add Exact Data Profile screen to specify mappings between fields in your data source and fields that Symantec Data Loss Prevention recognizes in its policy templates. The Field Mappings section also gives you advanced options for specifying custom fields.

Consider the following example use of field mappings. Your company wants to protect employee data, including employee social security numbers. You create a policy based on the Employee Data Protection template. The policy requires an exact data index with fields for social security numbers and other employee data. Prepare your data source and then create an exact data profile. Specify that the social security number field in the data source maps to the "Social Security Number" system field of the policy template.

After you have added and configured the data source file and settings, the Manage > Data Profiles > Exact Data > Add Exact Data Profile screen lets you map the fields from the data source file to the Exact Data Profile you are configuring.

To enable error checking on a field in a data source or to use the index with a policy template that uses a system field, you must map the field in the data source to the system field. The Field Mappings section lets you map the columns in the original data source to system fields in the Exact Data Profile.
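To show what validating a mapped field can look like, here is a minimal sketch of a nine-digit check for a social security number column. The validator, the FIELD_VALIDATORS mapping, and the column positions are hypothetical; the rules that the product actually applies to its system fields are not described in this article beyond the nine-digit requirement.

```python
import re

def is_nine_digit_ssn(value: str) -> bool:
    """True if the value has exactly nine digits once dashes and spaces are removed."""
    digits = re.sub(r"[ -]", "", value)
    return re.fullmatch(r"[0-9]{9}", digits) is not None

# Hypothetical mapping of column position to the validator for its mapped type.
FIELD_VALIDATORS = {2: is_nine_digit_ssn}

def count_invalid(rows, validators):
    """Count rows in which any mapped column fails its validator."""
    return sum(
        1
        for row in rows
        if any(not check(row[col]) for col, check in validators.items())
    )

rows = [
    ["Bob", "Smith", "123-45-6789"],
    ["Raj", "Malhotra", "625-111-7545"],
]
print(count_invalid(rows, FIELD_VALIDATORS))  # 1
```

The second row fails because 625-111-7545 contains ten digits, so it cannot be a social security number.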

Procedure Step 6 : Index the data source, or schedule indexing:
 

When you configure an Exact Data Profile, you can set a schedule for indexing the data source.

Before you set up a schedule, consider the following:

If you update your data sources only occasionally (for example, less often than once a month), there is no need to create a schedule. Index the data each time you update the data source.

Schedule indexing for times of minimal system use. Indexing affects performance throughout the Symantec Data Loss Prevention system, and large data sources can take time to index.

Index a data source as soon as you add or modify the corresponding exact data profile, and re-index the data source whenever you update it. For example, consider a scenario whereby every Wednesday at 2:00 A.M. you update the data source. In this case you should schedule indexing every Wednesday at 3:00 A.M. Do not index data sources daily as this can degrade performance.

Monitor results and modify your indexing schedule accordingly. If performance is good and you want more timely updates, for example, schedule more frequent data updates and indexing.
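The weekly timing in the example above (data updated Wednesday at 2:00 A.M., indexing at 3:00 A.M.) can be computed with a short sketch like the one below. It only calculates the next run time; the actual schedule is set in the Exact Data Profile, and the next_weekly_run function is a hypothetical helper.

```python
from datetime import datetime, timedelta

def next_weekly_run(now: datetime, weekday: int = 2, hour: int = 3) -> datetime:
    """Return the next time strictly after `now` on `weekday` (Monday=0) at `hour`:00."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        # This week's slot has already passed, so move to next week.
        candidate += timedelta(days=7)
    return candidate

# Tuesday, 11 March 2014 at noon: the next Wednesday 3:00 A.M. slot is 12 March.
print(next_weekly_run(datetime(2014, 3, 11, 12, 0)))  # 2014-03-12 03:00:00
```

Leaving an hour between the data update and the indexing run gives the update time to finish before indexing starts.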

Scheduling Exact Data Profile indexing:

When you configure an Exact Data Profile, you can set a schedule for indexing the data source (Submit Indexing on Job Schedule).

The Indexing section lets you index the Exact Data Profile as soon as you save it (recommended) or on a regular schedule.

Procedure Step 7 : Configure and tune one or more EDM detection conditions :
 

Once you have defined the Exact Data Profile and indexed the data source, you configure one or more Content Matches Exact Data conditions in policy detection rules. The EDM condition is not available for policy exceptions.

By adjusting the EDM.MatchCountVariant setting for the detection server, you can configure how EDM matches are counted.

As an example, consider a database profile with the following three records:

Kathy, Stevens, 123-45-6789, 1111-1111-1111-1111

Kathy, Stevens, 123-45-6789, 2222-2222-2222-2222

Kathy, Stevens, 123-45-6789, 3333-3333-3333-3333

If the policy rule is set up to match any 3 of 4 and someone sends a message with the following line:

Kathy, Stevens, 123-45-6789

The matches are counted as follows:

EDM.MatchCountVariant=1: 3 (number of database profile records matched)

EDM.MatchCountVariant=2: 1 (number of unique token sets matched)

EDM.MatchCountVariant=3: 1 (number of inclusive token sets matched)

If someone sends a message with the following 2 lines:

Kathy, Stevens, 123-45-6789, 1111-1111-1111-1111

Kathy, Stevens, 123-45-6789

The matches will be counted as follows:

EDM.MatchCountVariant=1: 3 (number of database profile records matched)

EDM.MatchCountVariant=2: 2 (number of unique token sets matched)

EDM.MatchCountVariant=3: 1 (number of inclusive token sets matched, the first token set includes the second one).
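The three counting variants can be reproduced with the sketch below. It is a simplified model, not the detection server's implementation: it treats each message line as one candidate token set, compares plain text rather than hashes, and the count_matches function is hypothetical. Under those assumptions it returns the same counts as the two examples above.

```python
PROFILE = [
    "Kathy, Stevens, 123-45-6789, 1111-1111-1111-1111",
    "Kathy, Stevens, 123-45-6789, 2222-2222-2222-2222",
    "Kathy, Stevens, 123-45-6789, 3333-3333-3333-3333",
]

def count_matches(profile, message, min_fields=3):
    """Return the match count for each EDM.MatchCountVariant value (1, 2, 3)."""
    records = [frozenset(f.strip() for f in row.split(",")) for row in profile]
    matched_records = set()   # variant 1: profile records matched
    token_sets = set()        # variant 2: unique matched token sets
    for line in message.splitlines():
        tokens = frozenset(t.strip() for t in line.split(",") if t.strip())
        for i, record in enumerate(records):
            common = tokens & record
            if len(common) >= min_fields:
                matched_records.add(i)
                token_sets.add(common)
    # Variant 3: drop token sets that are fully contained in a larger matched set.
    inclusive = {s for s in token_sets if not any(s < other for other in token_sets)}
    return {1: len(matched_records), 2: len(token_sets), 3: len(inclusive)}

one_line = "Kathy, Stevens, 123-45-6789"
two_lines = "Kathy, Stevens, 123-45-6789, 1111-1111-1111-1111\nKathy, Stevens, 123-45-6789"

print(count_matches(PROFILE, one_line))   # {1: 3, 2: 1, 3: 1}
print(count_matches(PROFILE, two_lines))  # {1: 3, 2: 2, 3: 1}
```

Variant 1 counts generously because the three-field token set matches every record that shares those fields, while variant 3 collapses overlapping token sets into one.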