EMEA Data Loss Prevention User Group

  • 1.  Separating detection of text-based and image-based PDFs in DLP

    Posted Dec 07, 2016 06:29 AM

    Hi

    I've run into an issue with separating detection of PDFs that contain machine-readable text from PDFs that contain only an image.

    I've examined the metadata and the only obvious difference is text-based PDFs have a 'font' section, and the code of the PDF itself contains a string TEXTON or TEXTOFF depending on the type.
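
    For anyone wanting to reproduce that observation outside DLP, here is a minimal Python sketch that scans the raw bytes of a file for the /Font name. The function names are hypothetical and this is not something DLP exposes. It assumes the page resource dictionaries are stored uncompressed; in PDFs that use compressed object streams (PDF 1.5 and later) the marker can be hidden, so a miss is not proof of an image-only file.

```python
from pathlib import Path


def has_font_marker(raw: bytes) -> bool:
    """Return True if the raw PDF bytes contain a /Font name.

    Assumption: resource dictionaries are stored uncompressed, so the
    literal bytes "/Font" are visible without decoding any streams.
    """
    return b"/Font" in raw


def file_has_font_marker(path: str) -> bool:
    """Hypothetical helper: apply the same check to a file on disk."""
    return has_font_marker(Path(path).read_bytes())
```

    A True result only says a font resource is declared somewhere in the file; it does not say how much readable text the PDF actually holds.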

    I've set a test policy to pick up references to font, TEXTON or TEXTOFF, but DLP doesn't see them, although it can see the standard metadata such as Author:, Creator: etc. Has anyone been able to solve this?

    Thanks in advance!

    Tim



  • 2.  RE: Separating detection of text-based and image-based PDFs in DLP

    Trusted Advisor
    Posted Dec 07, 2016 07:56 AM

    hi tim,

    I used another method to detect "analyzable" documents. I created a compound rule:

    - detect PDF

    +

    - a simple regexp (like \w+) or a very common keyword (not included in the stopwords list); this will not match if there are only images in the PDF document

    That way I could add other criteria to the rule.
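
    The logic of that compound rule can be sketched in Python as below. This is only an illustration of the AND condition, not how DLP evaluates policies: the function names are hypothetical, the PDF check is a plain header test, and extracted_text stands for whatever text the content extractor hands back.

```python
import re

# Any run of word characters, as in the \w+ suggestion above.
WORD = re.compile(r"\w+")


def is_pdf(raw: bytes) -> bool:
    """File-type condition: a PDF file starts with the %PDF- header."""
    return raw.startswith(b"%PDF-")


def compound_match(raw: bytes, extracted_text: str) -> bool:
    """Both conditions must hold: the file is a PDF AND some text was extracted."""
    return is_pdf(raw) and WORD.search(extracted_text) is not None
```

    Note that any extracted word satisfies the second condition, so the result depends entirely on what ends up in extracted_text.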

     Regards



  • 3.  RE: Separating detection of text-based and image-based PDFs in DLP

    Posted Dec 07, 2016 11:21 AM

    Thanks Stephane - I tried the regexp \w+, but it picks up the metadata headers even though the PDF is image-only, e.g. 'Keywords :'. Any thoughts?



  • 4.  RE: Separating detection of text-based and image-based PDFs in DLP

    Broadcom Employee
    Posted Dec 07, 2016 11:57 AM

    Likely the way to take care of this is to create custom file identifiers, one for PDFs with text and one for those without. Before going through that, I would ask more about the use case. Is it that you are worried about things that need OCR done on them? In that case it might be a use case for our Form Matching technology, if the form is standardized and could therefore be indexed.



  • 5.  RE: Separating detection of text-based and image-based PDFs in DLP

    Posted Dec 08, 2016 06:48 AM

    Thanks John - we have some printer/scanners that produce an OCR-enabled PDF, and some that don't. I need to detect them separately, and I already have metadata enabled on the detection servers, so I can see the standard metadata headers in the scanned PDF.



  • 6.  RE: Separating detection of text-based and image-based PDFs in DLP

    Posted Dec 10, 2016 10:37 AM

    I guess turning metadata detection off is one solution (along with the regex for a word character), but I'm sure you have it enabled for a reason. Alternatively, you could use a regex to look for characters that indicate typical sentence/paragraph formatting. For example:

    # Word ending with a full stop, exclamation point or question mark, a space (or new line), then another word beginning with a capital letter

    \s\w+[.?!]\s[A-Z]\w*\s

    You'd assume this would match multiple times in a document, so you could set the match threshold to 2 or 3+ to make sure you don't get a red herring from the metadata. If you do, and it's isolated to a particular property, just add an exception for that particular pattern or keyword (e.g. exclude "Keywords:" or "\w+:\s").
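
    A rough Python sketch of the threshold-plus-exception idea, for testing the pattern against sample extracted text before building the policy. The function names and the default threshold of 3 are assumptions for illustration. The pattern here uses [A-Z]\w* after the punctuation so that an ordinary capitalised word such as "The" matches, and the exclusion allows an optional space before the colon so that headers like 'Keywords :' are covered as well.

```python
import re

# Sentence boundary: word + terminal punctuation, whitespace, capitalised word, whitespace.
SENTENCE_BREAK = re.compile(r"\s\w+[.?!]\s[A-Z]\w*\s")

# Exception: lines that look like metadata headers, e.g. "Author: ..." or "Keywords : ...".
METADATA_LINE = re.compile(r"^\s*\w+\s?:\s")


def count_sentence_breaks(extracted_text: str) -> int:
    """Count sentence boundaries, ignoring lines that look like metadata headers."""
    body_lines = [
        line for line in extracted_text.splitlines()
        if not METADATA_LINE.match(line)
    ]
    return len(SENTENCE_BREAK.findall("\n".join(body_lines)))


def looks_text_based(extracted_text: str, threshold: int = 3) -> bool:
    """Treat the document as text-based once enough sentence boundaries are found."""
    return count_sentence_breaks(extracted_text) >= threshold
```

    Dropping whole metadata lines is the blunt version of the exception; if real body text in your documents can start with "Word: ", a narrower exclusion list of known property names would be safer.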

    Dean

     

     



  • 7.  RE: Separating detection of text-based and image-based PDFs in DLP

    Posted Dec 10, 2016 10:41 AM

    Something that came to me after I submitted the above reply, though I haven't tested it: if you use Vector Machine Learning (VML) and give it positive examples of text-based PDFs and negative examples of image-based PDFs, it may also be a very good option.