Data Loss Prevention


DLP 15.1 Network Discover and Prevent - .PDF data extraction without OCR server?

  • 1.  DLP 15.1 Network Discover and Prevent - .PDF data extraction without OCR server?

    Posted Feb 16, 2019 04:40 PM

    Does anyone know whether Symantec DLP 15.1 Network Discover scans can extract data from .pdf files without an OCR server? We often get "Text extraction failed" for PDFs during data-at-rest scans (CIFS) and are wondering whether an OCR server is needed to process PDFs during those scans. Below is an example of one of the errors we get in "ContentExtractionHost_FileReader.log".

     

    | WARN  | cehost | Service [9084] | [5500] | Text extraction failed: type = 'pdf', container='0', encrypted='0', Exception thrown from : TextExtractionRequestExecutor.cpp(146) | CEService.cpp (190)
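To see how many of these warnings are PDFs versus other file types, the log can be tallied with a short script. This is only a minimal sketch: the pattern is based on the single log line quoted above, and the function name `tally_extraction_failures` is made up for illustration and is not part of DLP.

```python
import re
from collections import Counter

# Pattern derived from the one warning line quoted above (an assumption:
# other builds or log levels may format the line differently).
FAILURE_RE = re.compile(
    r"Text extraction failed:\s*type\s*=\s*'(?P<type>[^']*)',\s*"
    r"container\s*=\s*'(?P<container>[^']*)',\s*"
    r"encrypted\s*=\s*'(?P<encrypted>[^']*)'"
)


def tally_extraction_failures(lines):
    """Count 'Text extraction failed' warnings by (file type, encrypted flag)."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[(match.group("type"), match.group("encrypted"))] += 1
    return counts


def tally_log_file(path):
    """Convenience wrapper: read a log file and tally its failure lines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return tally_extraction_failures(handle)
```

Calling `tally_log_file` on a copy of ContentExtractionHost_FileReader.log returns a `Counter` keyed by type and encrypted flag, which makes it easier to see whether the failures are limited to unencrypted PDFs.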
     



  • 2.  RE: DLP 15.1 Network Discover and Prevent - .PDF data extraction without OCR server?

    Posted Feb 18, 2019 04:07 AM

    Hi Neil,

     

    Whether OCR is required depends on how the PDF was created. There are two possible types. The first is a PDF converted from a Word document, or a native PDF, which contains actual text that DLP can read and use to identify any policy violations. This type does not need OCR.

     

    The second type, and the reason OCR exists, is the scanned document. For example, if you scan a paper document to email, it arrives as a .pdf, but in reality it's purely an image of the physical document and contains no identifiable text, so OCR is needed to read it.
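A quick way to tell which kind of PDF you have, outside of DLP, is to check whether any text can be extracted from it at all. The following is a rough sketch in Python, assuming the third-party `pypdf` package is installed. The function names and the `min_chars` threshold are arbitrary choices for illustration, and the result says nothing about how DLP's own extractor behaves.

```python
def classify_pages(page_texts, min_chars=20):
    """Classify a document from the text extracted from each of its pages.

    Returns 'text', 'image-only', 'mixed', or 'empty' (no pages).
    min_chars is an arbitrary threshold to ignore stray characters.
    """
    has_text = [len((text or "").strip()) >= min_chars for text in page_texts]
    if not has_text:
        return "empty"
    if all(has_text):
        return "text"
    if not any(has_text):
        return "image-only"
    return "mixed"


def classify_pdf(path):
    """Extract per-page text with pypdf and classify the file."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    return classify_pages(page.extract_text() for page in reader.pages)
```

A file that comes back as "image-only" is most likely a scan with no text layer, which is the case where OCR would be needed. A file that comes back as "text" but still fails extraction in DLP points to a different cause.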

     

    I hope this helps,