I have been studying how Vontu performs similarity matching of documents, but I still don't understand the underlying principle, so I am asking for help.
Typically one would create an IDM of documents that contain sensitive data. For example, create an IDM of a company's Merger & Acquisition documents. Then you can create a policy that looks for similar documents. Say these original M&A documents are 60% template (they all look the same) and the other 40% is the real meat of the document. You can tell the policy to look for documents that are X% similar to the originals.
The idea is that you aren't trying to find exact matches, but documents that are similar to the originals. CAD drawings are another good example.
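To make the "X% similar" idea concrete, here is a minimal sketch of one common way to score partial document matches: hash overlapping word sequences (shingles) from the original, then measure what share of those hashes turn up in a candidate document. This is only an illustration and is not Vontu's actual algorithm, which this thread does not describe. The names fingerprint and similarity_percent, the 5-word shingle size, and the choice to measure against the original rather than the candidate are all assumptions.

```python
import hashlib
import re


def fingerprint(text, k=5):
    """Return a set of hashes of overlapping k-word shingles (illustrative scheme only)."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < k:
        # Too short for a full shingle: hash the whole text as one unit.
        return {hashlib.sha1(" ".join(words).encode()).hexdigest()}
    return {
        hashlib.sha1(" ".join(words[i:i + k]).encode()).hexdigest()
        for i in range(len(words) - k + 1)
    }


def similarity_percent(original, candidate, k=5):
    """Percentage of the original's shingles that also occur in the candidate."""
    source = fingerprint(original, k)
    if not source:
        return 0.0
    return 100.0 * len(source & fingerprint(candidate, k)) / len(source)


original = "one two three four five six seven eight nine ten"
partial = "one two three four five six seven"
print(similarity_percent(original, original))  # 100.0
print(similarity_percent(original, partial))   # 50.0
```

With a scheme like this, shared template text counts toward the score just like the sensitive text does, which is why the threshold you pick in the policy matters so much.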
Hope that helps,
Thanks for your reply. It's very useful for me.
I recommend testing IDM on various datasets as the results may be quite surprising. You need to have some level of confidence to be able to measure the results and write effective policies accordingly.
Try fingerprinting the SDLP documents and generating a number of test files to help you identify desirable results.
Consider using the SDLP 11.1 Admin Guide, which is 1231 pages in length. Copy 500 pages to a new document. Copy 25 pages. Copy 300 pages and edit 4 or 5 paragraphs in the middle of the text. Export the PDF file to RTF, DOC, TXT, etc.
How many pages would have to be copied into a target document to meet or exceed 10% similarity?
What percentage similarity rating would be identified if you converted the PDF file to MSWord?
What sort of severity or threshold would you set for a 40% similar document? 90%?
What should an incident handler do with a document or documents that are similar but not exact?
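As a baseline to compare your test results against, here is a rough sketch of the page arithmetic. It assumes, purely for illustration, that similarity scales linearly with the number of pages copied and that every page carries the same amount of text. Real IDM results may well differ (that is the point of testing), and the function names are made up for this example.

```python
import math

TOTAL_PAGES = 1231  # length of the SDLP 11.1 Admin Guide, as stated above


def pages_for_similarity(percent, total_pages=TOTAL_PAGES):
    """Pages to copy to meet or exceed `percent`, under the naive linear assumption."""
    return math.ceil(total_pages * percent / 100)


def naive_similarity(pages_copied, total_pages=TOTAL_PAGES):
    """Expected similarity (in percent) if every page contributed equally."""
    return 100.0 * pages_copied / total_pages


print(pages_for_similarity(10))           # 124
print(round(naive_similarity(500), 1))    # 40.6
print(round(naive_similarity(25), 1))     # 2.0
```

If the percentages you measure are far from these naive figures, that gap tells you how much formatting, conversion and edits affect the match, and therefore where to set your thresholds.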
I like IDM and I think it is a valuable detection method, but by no means should it be relied on as the sole or primary method for data loss protection or control. There are unfortunate limitations (it cannot be used on the endpoint). Once you answer the questions above, you will be in a much better place to create policies and controls.
If you have access to a large number of documents that are very similar in content, you may want to consider using VML. The use of VML is worthy of another post/forum.