Data Loss Prevention

  • 1.  CEH warning - Text extraction failed

    Posted Jun 06, 2013 03:13 AM

    Hello,

    I run a Network Monitor server (version 11.6.2) on Red Hat Linux. In the log file ContentExtractionHost_FileReader.log I see many identical warnings: < Text extraction failed: type = 'pdf' >.

    For example:

    06/06/13 10:34:12 | WARN  | cehost | Service [5755] | [1113405760] | Text extraction failed: type = 'pdf', container='0', encrypted='0', Exception thrown from : TextExtractionRequestExecutor.cpp(124) | CEService.cpp (162)
    06/06/13 10:34:12 | WARN  | cehost | Service [5755] | [1218304320] | Text extraction failed: type = 'pdf', container='0', encrypted='0', Exception thrown from : TextExtractionRequestExecutor.cpp(124) | CEService.cpp (162)
    06/06/13 10:34:13 | WARN  | cehost | Service [5755] | [1249773888] | Text extraction failed: type = 'pdf', container='0', encrypted='0', Exception thrown from : TextExtractionRequestExecutor.cpp(124) | CEService.cpp (162)

    What could be the cause of that?

     

    ---
    Best regards, Artem.

     



  • 2.  RE: CEH warning - Text extraction failed

    Posted Jun 11, 2013 11:03 AM

    Artem,

    ContentExtractor is responsible for taking files that are not text-based and extracting out the content in a format the detection engine can read. PDF files are a bit interesting since there are SO MANY different applications and ways to generate them and they are prone to being formatted incorrectly. More than likely CE is failing because the PDF is malformed.
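As a rough illustration of what "malformed" can mean here, the sketch below does a quick structural check on a PDF's raw bytes. It is only a triage aid under stated assumptions, not what ContentExtractor does internally: the function name `quick_pdf_check` and the 1024-byte windows are hypothetical choices, and a file that passes can still fail extraction.

```python
def quick_pdf_check(data: bytes) -> dict:
    """Rough structural triage of a PDF; not a full parser or validator."""
    head = data[:1024]   # the %PDF- header normally sits at the very start
    tail = data[-1024:]  # the trailer markers normally sit at the very end
    return {
        "has_header": b"%PDF-" in head,
        "has_startxref": b"startxref" in tail,
        "has_eof_marker": b"%%EOF" in tail,
        # Only a hint: an /Encrypt key usually means the document is encrypted.
        "mentions_encrypt": b"/Encrypt" in data,
    }
```

A file that is missing the header or the `%%EOF` marker is a strong candidate for being truncated or corrupted in transit, which would fit the warnings above.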

    Tracking it further than that gets fairly complex. You would need to capture some of the traffic coming into DLP, turn FileReader logging up high enough to see file names, correlate one of those errors with a FileReader event, and then extract that file from your traffic capture. From there you could do a simple inspection of the file (on multiple occasions I've seen these files fail to open even in Adobe Reader) and could also run it through Filter manually to see what the result is.
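The correlation step can be partly automated. The following is a minimal sketch that assumes the log line layout seen in this thread (timestamp, level, component, module with PID, thread ID, message, source location, separated by `|`, with MM/DD/YY dates). The names `parse_ceh_line` and `failed_subfiles` are hypothetical helpers, not part of the product.

```python
import re
from collections import Counter
from datetime import datetime

# Layout assumed from the sample lines in this thread.
LINE_RE = re.compile(
    r"^\s*(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \| "
    r"(?P<level>\w+)\s*\| (?P<component>\w+) \| "
    r"(?P<module>\w+) \[(?P<pid>\d+)\] \| \[(?P<thread>\d+)\] \| "
    r"(?P<message>.*) \| (?P<source>[^|]+)$"
)
SUBFILE_RE = re.compile(
    r"Failed to extract subfile '(?P<name>.*)', "
    r"Index: (?P<index>\d+), Error code: (?P<code>\d+)"
)

def parse_ceh_line(line):
    """Split one log line into fields; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["timestamp"] = datetime.strptime(rec.pop("ts"), "%m/%d/%y %H:%M:%S")
    return rec

def failed_subfiles(lines):
    """Count 'Failed to extract subfile' warnings per reported file name."""
    counts = Counter()
    for line in lines:
        rec = parse_ceh_line(line)
        if rec is None:
            continue  # truncated or unrelated lines are skipped
        m = SUBFILE_RE.search(rec["message"])
        if m:
            counts[m.group("name")] += 1
    return counts
```

The parsed timestamp and thread ID are the fields you would match against other log entries or against the time of a captured transfer.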

    Hopefully that's helpful - if you want to launch a full investigation and have any questions about the steps I outlined, let me know!

    Tim Deese



  • 3.  RE: CEH warning - Text extraction failed

    Posted Jun 17, 2013 02:06 AM

    Hello Tim,

    I changed log level.
    In the file /opt/Vontu/Protect/config/FileReaderLogging.properties I set:

    java.util.logging.FileHandler.level = FINE
    ...
    com.vontu.monitor.level = FINE


    In the file /opt/Vontu/Protect/config/log4cxx_config_filereader.xml I set:

            <category name="cehost" >
                    <priority value ="fine" />
                    <appender-ref ref="cehostAppender"/>
            </category>


    I don't know whether this is enough or not, but in the file /var/log/Vontu/debug/ContentExtractionHost_FileReader.log I now see filenames. For example:

    06/17/13 08:59:58 | WARN  | cehost | Verity [31717] | [1105328448] | Failed to extract subfile '??????-?????? ?.?????, ??. ??????????, ?. 78.pdf', Index: 0, Error code: 4 | src/VerityImplInternal.c (497)
    06/17/13 09:01:02 | WARN  | cehost | Verity [32058] | [1087748416] | Failed to extract subfile '??????-?????? ?.?????, ??. ??????????, ?. 78.pdf', Index: 0, Error code: 4 | src/VerityImplInternal.c (497)
    06/17/13 09:06:19 | WARN  | cehost | Verity [32058] | [1251891520] | Failed to extract subfile 'ÐС-75179,Ð
    06/17/13 09:06:19 | WARN  | cehost | Verity [32058] | [1251891520] | Failed to extract subfile '74173 Ð
    06/17/13 09:31:19 | WARN  | cehost | Verity [2229] | [1243666752] | Failed to extract subfile '??????-?????? ?.?????, ??. ??????????, ?. 78.pdf', Index: 0, Error code: 4 | src/VerityImplInternal.c (497)
    06/17/13 09:45:44 | WARN  | cehost | Verity [3806] | [1126074688] | Failed to extract subfile 'IL681 ????? ??????????.pdf', Index: 0, Error code: 4 | src/VerityImplInternal.c (497)

    But it isn't easy to find the real filenames. I think the filenames are in a Cyrillic encoding, so I can't see the original names.
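A possible explanation for the 'Ð' characters is UTF-8 text being displayed as Latin-1. The sketch below shows that effect and a way to reverse it, assuming that is really what happened here. The helper `repair_mojibake` and the sample name are made up for illustration. Names that already contain literal '?' characters in the file have lost the original letters and cannot be recovered this way.

```python
def repair_mojibake(text: str) -> str:
    """Undo 'UTF-8 bytes shown as Latin-1'; return the input unchanged on failure."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# Hypothetical Cyrillic file name, used only to demonstrate the effect.
original = "Отчет 78.pdf"
garbled = original.encode("utf-8").decode("latin-1")  # begins with 'Ð'
restored = repair_mojibake(garbled)                   # equals original
```

If the viewer is the problem rather than the log itself, opening the file in a terminal or editor set to UTF-8 may already show the names correctly.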

     

    ---
    Best regards, Artem.



  • 4.  RE: CEH warning - Text extraction failed

    Posted Jun 17, 2013 09:18 AM

    It might be trying to extract encrypted content, which would make the extraction fail.