EMEA Data Loss Prevention User Group

Separating detection of text-based and image-based PDFs in DLP

1. Separating detection of text-based and image-based PDFs in DLP

Tim Harrison
Posted Dec 07, 2016 06:29 AM

Hi

I've run into an issue separating detection of PDFs that contain machine-readable text from PDFs that contain only an image.

I've examined the metadata and the only obvious difference is that text-based PDFs have a 'font' section, and the code of the PDF itself contains a string TEXTON or TEXTOFF depending on the type.

I've set up a test policy to pick up references to font, TEXTON or TEXTOFF, but DLP doesn't see them, although it can see the standard metadata such as Author:, Creator: etc. Has anyone been able to solve this?
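
For triaging sample files outside DLP, the 'font' difference described above can be checked with a small script. This is only a rough sketch under stated assumptions: the function name `has_font_resource` is made up for illustration, it is not part of any DLP product, and it relies on the `/Font` name appearing uncompressed in the file. PDFs that keep their dictionaries in compressed object streams would be missed, so treat a negative result as inconclusive.

```python
import re

# "/Font" is the name PDF uses for font resources and font objects.
# The \b stops the pattern from matching longer names such as /FontDescriptor.
FONT_RESOURCE = re.compile(rb"/Font\b")


def has_font_resource(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw PDF bytes mention a /Font entry.

    Assumption: the relevant dictionaries are stored uncompressed.
    """
    return FONT_RESOURCE.search(pdf_bytes) is not None


if __name__ == "__main__":
    import sys

    # Usage: python check_fonts.py scan1.pdf scan2.pdf
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            label = "font found" if has_font_resource(handle.read()) else "no font seen"
        print(f"{path}: {label}")
```

Running it over a handful of files from each scanner would at least confirm whether the 'font' marker separates the two device groups reliably before building a policy around it.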

Thanks in advance!

Tim
2. RE: Separating detection of text-based and image-based PDFs in DLP

Trusted Advisor

stephane.fichet
Posted Dec 07, 2016 07:56 AM

Hi Tim,

I used another method to detect "analyzable" documents. I created a compound rule:

- detect PDF

+

- a simple regexp (like \w+) or a very common keyword (not included in the stopwords list); this will not match if the PDF document contains only images

That way I could add other criteria to the rule.
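
The logic of the compound rule above can be sketched in Python for experimenting on sample files. This is not DLP policy syntax: `is_analyzable_pdf` and its `extracted_text` parameter are hypothetical names, and the sketch assumes some other tool has already extracted the document's text.

```python
import re

PDF_MAGIC = b"%PDF-"               # signature at the start of a PDF file
WORD_PATTERN = re.compile(r"\w+")  # the simple "any word" regexp


def is_analyzable_pdf(file_bytes: bytes, extracted_text: str) -> bool:
    """Both conditions must hold, mirroring a compound (AND) rule."""
    is_pdf = file_bytes.startswith(PDF_MAGIC)
    has_text = WORD_PATTERN.search(extracted_text) is not None
    return is_pdf and has_text
```

Whatever ends up in `extracted_text` counts as text here, so the result depends entirely on what the extraction step returns.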

Regards
3. RE: Separating detection of text-based and image-based PDFs in DLP

Tim Harrison
Posted Dec 07, 2016 11:21 AM

Thanks Stephane. I tried the regexp \w+, but it picks up the metadata headers (e.g. 'Keywords :') even when the PDF is image-only. Any thoughts?
4. RE: Separating detection of text-based and image-based PDFs in DLP

Broadcom Employee

John Gruhn
Posted Dec 07, 2016 11:57 AM

Likely the way to take care of this is to create one custom file identifier for PDFs with text and another for those without. Before going through that, I would ask more about the use case. Is it that you are worried about documents that need OCR done on them? In that case it might be a use case for our Form Matching technology, if the form involved is standardized and could therefore be indexed.
5. RE: Separating detection of text-based and image-based PDFs in DLP

Tim Harrison
Posted Dec 08, 2016 06:48 AM

Thanks John - we have some printer/scanners that produce an OCR enabled PDF, and some printer/scanners that don't. I need to detect them separately, and I already have metadata enabled in the detection servers, so I can see the standard metadata headers in the scanned PDF.
6. RE: Separating detection of text-based and image-based PDFs in DLP

Dean_Thomson
Posted Dec 10, 2016 10:37 AM

I guess turning metadata detection off is one solution (along with the regex for a word character), but I'm sure you have it enabled for a reason. Alternatively, you could use a regex to look for characters that indicate typical sentence/paragraph formatting. For example:

#Word ending with a full stop, exclamation point or question mark, a space (or new line) then another word (beginning with a capital letter)

\s\w+[.?!]\s[A-Z]\w*\s

You'd assume this would match multiple times in a document with real body text, so you could set the match threshold to 2 or 3+ to make sure a stray hit in the metadata doesn't trigger the rule. If you still get a false match and it's isolated to a particular property, just add an exception for that particular pattern or keyword (e.g. exclude "Keywords:" or "\w+:\s").
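
The threshold-plus-exception idea above can be tried out on extracted text with a short Python sketch. The names `looks_like_prose`, `SENTENCE_BOUNDARY` and `METADATA_LINE` are illustrative only, and none of this is DLP rule syntax. The sketch uses `[A-Z]\w*` for the second word so that any capitalised word qualifies, which is my reading of the description rather than a tested DLP pattern, and it assumes each metadata property sits on its own line in the form `Name: value`.

```python
import re

# Word + sentence terminator, whitespace, then a capitalised word.
SENTENCE_BOUNDARY = re.compile(r"\s\w+[.?!]\s[A-Z]\w*\s")
# Lines shaped like "Keywords: ..." or "Author : ..." are treated as metadata.
METADATA_LINE = re.compile(r"^\w+\s?:\s")


def looks_like_prose(extracted_text: str, threshold: int = 3) -> bool:
    """True if at least `threshold` sentence boundaries occur outside metadata lines."""
    body_lines = [
        line for line in extracted_text.splitlines()
        if not METADATA_LINE.match(line)  # the "exception" for metadata properties
    ]
    body = "\n".join(body_lines)
    return len(SENTENCE_BOUNDARY.findall(body)) >= threshold
```

Raising `threshold` makes the check stricter, which trades missed short documents for fewer false hits on metadata.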

Dean
7. RE: Separating detection of text-based and image-based PDFs in DLP

Dean_Thomson
Posted Dec 10, 2016 10:41 AM

Something that came to me after I submitted the reply above, though I haven't tested it: if you use Vector Machine Learning (VML) and give it positive examples of text-based PDFs and negative examples of image-based PDFs, it may also be a very good option.
