As a recent article on Data Loss Prevention observed, the next release of Symantec’s Data Loss Prevention suite will contain a major innovation called Vector Machine Learning, the market’s first machine learning technology.
Machine learning, an established branch of artificial intelligence that has been used in everything from anti-spam engines to Google’s algorithms for translating text, is now being applied to DLP for content analysis.
Continue reading to learn how Vector Machine Learning (VML) can help organizations protect their intellectual property amid steadily increasing amounts of unstructured data, such as documents, spreadsheets, emails, and product design files.
Historically, DLP has relied on two categories of detection technology to find unstructured data: Fingerprinting and Describing. While effective in protecting much of an organization’s sensitive information, both methods have limitations when addressing unstructured data and intellectual property such as product formulas, sales and marketing reports, and source code.
VML technology is designed to overcome the limitations of current detection technologies by learning to identify sensitive data in new or never-seen-before documents.
It begins with training. In the training stage, both positive and negative examples of sensitive data are provided to the VML software. Positive examples could be documents containing proprietary source code; negative examples could be an open source project downloaded from the Web. Both training sets are necessary to extract the key features that go into generating a statistical model, or VML profile, which will then be used during detection.
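To make the training stage concrete, here is a minimal sketch in Python. It is not Symantec’s implementation, whose feature extraction and statistical model are not described here. It assumes a simple bag-of-words approach in which each term is weighted by how much more often it appears in the positive examples than in the negative ones. The function names (`tokenize`, `centroid`, `train_profile`), the `max_features` parameter, and the sample documents are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into word-like tokens."""
    return re.findall(r"[a-z_]+", text.lower())


def centroid(documents):
    """Average the normalized term frequencies of a list of documents."""
    total = Counter()
    for doc in documents:
        counts = Counter(tokenize(doc))
        length = sum(counts.values()) or 1  # avoid dividing by zero on empty docs
        for term, count in counts.items():
            total[term] += count / length
    return {term: weight / len(documents) for term, weight in total.items()}


def train_profile(positive_docs, negative_docs, max_features=50):
    """Build a toy profile (assumption: difference of class centroids).

    Terms common in positive examples get positive weights; terms common
    in negative examples get negative weights.
    """
    pos = centroid(positive_docs)
    neg = centroid(negative_docs)
    vocabulary = set(pos) | set(neg)
    weights = {t: pos.get(t, 0.0) - neg.get(t, 0.0) for t in vocabulary}
    # Keep only the most discriminating terms (largest absolute weight).
    top = sorted(weights, key=lambda t: abs(weights[t]), reverse=True)
    return {t: weights[t] for t in top[:max_features]}


# Hypothetical training sets: proprietary code vs. generic open source code.
positive = [
    "def price_option(acme_curve, acme_vol): return acme_engine.run(acme_curve, acme_vol)",
    "class AcmeRiskModel: def calibrate(self, acme_curve): acme_engine.fit(acme_curve)",
]
negative = [
    "def read_file(path): return open(path).read()",
    "class HttpClient: def get(self, url): return self.session.get(url)",
]

profile = train_profile(positive, negative)
```

The sketch shows why both training sets matter: a term such as `def` appears in both sets and ends up with a weight near zero, while terms unique to the proprietary code receive clearly positive weights.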
For detection, the VML profile is used as part of a policy to classify any unknown document or message. If the data is similar to the positive example documents, then an “incident” is generated.
During detection, the VML profile assigns a “similarity score” to the unknown document or message as part of classification. A similarity score of 10 indicates that the examined data looks exactly like the example documents supplied in training. A score of 0 indicates that the examined data looks nothing like the example data from training. This information serves as feedback to help fine-tune the profile and improve its accuracy over time.
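The detection step and its 0-to-10 similarity score can be sketched in the same spirit. The example below is an illustration under stated assumptions, not the product’s scoring method: it uses cosine similarity between a document’s term counts and a profile of term weights, clamps negative values to 0, and scales the result to 10. The profile weights, the `threshold` parameter, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into word-like tokens."""
    return re.findall(r"[a-z_]+", text.lower())


def similarity_score(profile, text):
    """Return a score from 0 to 10 (assumption: scaled cosine similarity)."""
    counts = Counter(tokenize(text))
    dot = sum(counts[term] * weight for term, weight in profile.items())
    doc_norm = math.sqrt(sum(c * c for c in counts.values()))
    profile_norm = math.sqrt(sum(w * w for w in profile.values()))
    if doc_norm == 0 or profile_norm == 0:
        return 0.0  # empty document or empty profile: nothing to compare
    cosine = dot / (doc_norm * profile_norm)
    # Negative similarity means "looks like the negative examples"; clamp to 0.
    return round(10 * max(0.0, cosine), 1)


def detect(profile, text, threshold=5.0):
    """Classify a document and flag an incident at or above the threshold."""
    score = similarity_score(profile, text)
    return {"score": score, "incident": score >= threshold}


# Hypothetical profile: positive weights mark sensitive terms,
# negative weights mark terms typical of non-sensitive data.
profile = {
    "acme_curve": 0.9,
    "acme_engine": 0.8,
    "acme_vol": 0.5,
    "url": -0.6,
    "session": -0.4,
}

sensitive_result = detect(profile, "acme_engine.run(acme_curve, acme_vol)")
public_result = detect(profile, "self.session.get(url)")
```

In this sketch, a policy owner would tune `threshold` using the scores observed on real traffic: raising it reduces false positives, and lowering it catches documents that only partly resemble the training examples.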
Here’s an example of how VML can protect sensitive data: A company’s sales reports are likely to change frequently and exist in various formats, such as Excel, Word, or email documents. By collecting examples of these kinds of reports for training, VML can create a profile that would be able to identify and enforce protection policies for the distribution of new sales reports each week regardless of their format.
Data sets that are good candidates for protection using VML technology include:
- Source code. Proprietary source code for a product, trading models, or actuarial algorithms.
- Reports and forms. Monthly or weekly sales reports, loan applications, and resumes.
- Legal contracts. Licensing, partnerships, and sales agreements.
- HIPAA and HITECH. Protected Health Information in the form of insurance claims, billing and procedure codes, and emails to patients.
- ITAR (International Traffic in Arms Regulations). Intellectual property and unstructured data that may be restricted.
According to Forrester Research, proprietary knowledge and corporate secrets are now, on average, twice as valuable as customer data.¹ Is it any wonder, then, that thieves are increasingly targeting IP?
Vector Machine Learning from Symantec marks the introduction of a new model for DLP detection. VML is specifically designed to overcome the limitations of today’s detection technologies by identifying the subtle differences between sensitive and non-sensitive data. Ultimately, VML enables organizations to more easily define, detect, and protect their IP.
¹ “The Value of Corporate Secrets,” Forrester Research, March 2010