Symantec Intelligence

The PDF Exploit: Same Crime, Different Face

Created: 06 Apr 2011 • Updated: 06 Apr 2011
Paul Wood

Posted on behalf of Jason Zhang and Joseph Rabaiotti, Malware Research Analysts, Symantec.cloud

 

Portable Document Format (PDF) is one of the most commonly used file formats for exchanging electronic documents across platforms and applications. Because of its popularity, it has been heavily used in both targeted and non-targeted attacks, as reported in the MessageLabs Intelligence Monthly Report (PDF) in February 2011 and in a blog post in January 2011. According to the report, PDFs now account for a larger proportion of document-based targeted attacks: in 2009 approximately 52.6% of targeted attacks used PDF exploits, compared with 65.0% in 2010.

In 2011, we have seen no sign of this trend slowing down. More recently, the attacks have widened to include sophisticated non-targeted malware. Figure 1, below, shows the proportion of all email-based malware that comprised a PDF attachment, for both targeted and non-targeted attacks.

 

Figure 1

Figure 2 below shows some examples of recent non-targeted PDF-based attacks that were stopped only by Symantec’s Skeptic™ technology. As we can see, the emails used a variety of social engineering techniques to entice recipients into opening the attached PDF files.

 

Figure 2

One particular example of a more advanced attack surfaced on 12 December 2010 in relatively small numbers, but intensified with two to three waves of attacks each month between January and March 2011. In this blog post, we’ll take a closer look at how this attack was constructed and how it was designed to evade detection by antivirus technology.

 

Hidden Obfuscated JavaScript Hunting

The property overview for a typical variant of the exploited PDF files recently blocked by Skeptic™ is shown in figure 3, below. It shows that the PDF version to which the file conforms is 1.3, and that the file comprises 95 objects and other dictionary entries.

It is worth noting that there is no JavaScript in the raw PDF file (0 occurrences of the entries ‘/JS’ and ‘/JavaScript’, as shown in the figure), which implies that the JavaScript may be hidden in some other object(s). We pay particular attention to JavaScript because it is the method most commonly used by malicious files to exploit PDF vulnerabilities or to inject malicious code into memory.

 

Figure 3
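As a rough illustration of the kind of keyword count summarised in figure 3, the following Python sketch counts dictionary names in the unparsed bytes of a file. The function name and the default list of names are our own choices for illustration, not the tool used to produce the figure. A zero count is only a hint: names can be written with hex escapes, and script can sit inside compressed streams, which is exactly the case examined in this post.

```python
import re

# A PDF name ends at whitespace or a delimiter, so "/JS" must not match "/JSON".
_NAME_END = rb"(?=[\s/<>\[\]()%{}]|$)"


def count_pdf_names(raw, names=(b"/JS", b"/JavaScript", b"/AcroForm", b"/XFA")):
    """Count occurrences of each dictionary name in the raw (unparsed) file bytes."""
    counts = {}
    for name in names:
        pattern = re.escape(name) + _NAME_END
        counts[name.decode("ascii")] = len(re.findall(pattern, raw))
    return counts


if __name__ == "__main__":
    # Made-up fragment for demonstration; not taken from the analysed sample.
    sample = b"<< /Type /Catalog /AcroForm 5 0 R >>\n<< /XFA [(template) 90 0 R] >>"
    print(count_pdf_names(sample))
```

A file that reports no ‘/JS’ or ‘/JavaScript’ entries but does report ‘/AcroForm’ and ‘/XFA’ is a candidate for the deeper inspection described next.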

 We also note that the PDF file contains an interactive form, as defined by the dictionary entry /AcroForm (highlighted in the figure above) with an XML Forms Architecture (XFA) resource array as shown in figure 4, below.

 

Figure 4

The XFA array comprises a number of objects. A closer look at the objects shows that the template data packet (referenced by object ‘90 0’ as highlighted in the figure above) has a large portion of binary data, compared with the other, small text objects. This suggests that the template data packet might be suspicious, so the rest of our analysis focuses on this object, as illustrated in figure 5, below.

 

Figure 5
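To make the structure of the XFA array concrete, here is a minimal Python sketch that maps each packet name in such an array to the object it references. It assumes the array body has already been extracted as bytes and that it uses the usual alternating "(name) N G R" layout. Apart from the template reference ‘90 0’ mentioned above, the object numbers in the example are hypothetical.

```python
import re


def parse_xfa_array(array_body):
    """Map each XFA packet name to its (object number, generation) reference."""
    pairs = re.findall(rb"\(([^)]*)\)\s*(\d+)\s+(\d+)\s+R", array_body)
    return {name.decode("latin-1"): (int(obj), int(gen)) for name, obj, gen in pairs}


if __name__ == "__main__":
    # Only the template reference (90 0) comes from the post; the rest is made up.
    body = b"[(preamble) 88 0 R (template) 90 0 R (postamble) 93 0 R]"
    for packet, ref in parse_xfa_array(body).items():
        print(packet, ref)
```

With the references in hand, an analyst can compare the stream lengths of the referenced objects and look first at any packet that is far larger than its neighbours.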

As we can see in figure 5, above, the object has an entry FlateDecode in the /Filter field and /Predictor 12 in the /DecodeParms field. This implies that the data has been encoded in two cascading stages: it is first encoded with PNG "Up" prediction (type 12), then compressed using the Flate encoding technique.

It is not surprising that malware writers try to leverage obscure combinations of filters to hide malicious data from antivirus detection. But PNG “Up” encoding is designed for image files and hence is not usually found in non-image content; this is probably why a large number of antivirus engines appear to have trouble decoding the malicious object.
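The two decoding stages can be sketched in a few lines of Python. This is a minimal illustration rather than the decoder we used: it assumes one colour component of 8 bits per column, takes the row width from the /Columns entry of /DecodeParms, and only handles the "None" and "Up" row types. The function names are our own.

```python
import zlib


def undo_png_prediction(data, columns):
    """Reverse PNG row prediction: each row is one tag byte plus `columns` data bytes."""
    row_len = columns + 1
    if len(data) % row_len != 0:
        raise ValueError("data length is not a whole number of rows")
    prev = bytearray(columns)  # the row above the first row is treated as zeros
    out = bytearray()
    for start in range(0, len(data), row_len):
        tag = data[start]
        row = bytearray(data[start + 1:start + row_len])
        if tag == 2:  # "Up": each byte was stored as a difference from the byte above
            for i in range(columns):
                row[i] = (row[i] + prev[i]) & 0xFF
        elif tag != 0:  # 0 is "None"; other row types are not handled in this sketch
            raise ValueError("unsupported PNG row type: %d" % tag)
        out += row
        prev = row
    return bytes(out)


def decode_stream(raw, columns):
    """Inflate a FlateDecode stream, then undo the PNG prediction."""
    return undo_png_prediction(zlib.decompress(raw), columns)
```

Note the order: decoding reverses the encoding, so the stream is inflated first and the prediction is undone second. A scanner that inflates the stream but skips the second step sees only row differences, not the script text.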

A sample from a recent attack was scanned by av-test.org, which utilizes many of the major antivirus scanners. The results showed that only 3 (out of 36) engines were able to detect the exploit. This test was carried out on March 18th, 2011.

The next step involved reverse engineering of the encoded data, through which we successfully revealed the hidden and highly obfuscated JavaScript which was embedded in the template object, as shown in figure 6, below.

 

Figure 6

Further analysis shows that the obfuscated JavaScript actually exploits a known vulnerability (CVE-2006-3459) where an invalid ‘DotRange’ value in a tagged image file format (TIFF) image generated by the JavaScript will corrupt the TIFF parser in certain unpatched versions of a popular PDF reading application.

Next we will illustrate how the JavaScript sprays shellcode into heap memory, and how that shellcode is reached when the generated TIFF image is parsed.

 

What the JavaScript Does

The purpose of the JavaScript was revealed via a combination of running it through Mozilla’s SpiderMonkey and a process of manual de-obfuscation. Here are the functions performed by the JavaScript:

  1. Determine the current version of the PDF application by calling ‘app.getViewerVersion()’ and construct the correct exploited TIFF file and shellcode
  2. Spray the shellcode into memory
  3. Send the exploited TIFF file back to the XFA object for display

The TIFF file is constructed from base64-encoded strings concatenated together. The bulk of the TIFF file is the same, but the short shellcode section at the end is version-specific. Decoding of the base64 is not done by the JavaScript, but happens implicitly when the data is fed back to the XFA object.
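For analysis purposes, the same reassembly can be reproduced offline. The Python sketch below joins the base64 pieces as text before decoding them, which avoids problems when a piece boundary does not fall on a four-character base64 group, and then checks for a TIFF header. The function names are hypothetical, and the chunk values would come from the de-obfuscated script.

```python
import base64

# Little-endian ("II") and big-endian ("MM") TIFF headers.
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")


def rebuild_tiff(common_chunks, tail_chunk):
    """Join the shared base64 pieces with one version-specific tail, then decode."""
    joined = "".join(common_chunks) + tail_chunk
    return base64.b64decode(joined, validate=True)


def looks_like_tiff(data):
    """Cheap sanity check that the decoded bytes start with a TIFF header."""
    return data[:4] in TIFF_MAGICS
```

Running `rebuild_tiff` once per version-specific tail gives one candidate TIFF file per targeted reader version, which can then be compared byte by byte.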

 

Shellcode

The JavaScript contains two encoded shellcode variants, one for each targeted version of Adobe Reader.  Portions of these encoded strings are shown in figure 7, below.

 

Figure 7

When the JavaScript is manually de-obfuscated, the encoding algorithm used for the shellcode strings (and other data) becomes clear and this may be used to decode the shellcode and the other portions of the exploit.

Each shellcode variant, when decoded, consists of a 792-byte block comprising 68 bytes of ROP (Return-Oriented Programming) shellcode followed by a further 724 bytes of executable shellcode. The ROP section is necessary because the application has taken precautions to prevent execution of code in regions reserved for data; because ROP makes use of already-existing code, it is not affected by this and may be used to prepare an executable region for more conventional shellcode. The ROP shellcode seen here is very similar to previously seen examples, and probably represents an evolution from them.
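The layout of the decoded block (68 + 724 = 792 bytes) can be captured in a small helper for splitting a decoded variant into its two parts. This is only a sketch: the constants come from the sizes quoted above, and the function name is our own.

```python
ROP_LEN = 68    # ROP section at the start of the block
CODE_LEN = 724  # conventional executable shellcode that follows it


def split_shellcode_block(block):
    """Split a decoded 792-byte block into (rop_section, executable_section)."""
    expected = ROP_LEN + CODE_LEN
    if len(block) != expected:
        raise ValueError("expected %d bytes, got %d" % (expected, len(block)))
    return block[:ROP_LEN], block[ROP_LEN:]
```

Rejecting blocks of the wrong size is a useful early check that the decoding of the strings in figure 7 has been done correctly.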

 

The ROP Shellcode

The ROP chain actually begins in the TIFF file, which contains around 48 bytes of ROP “gadgets” and associated data; the 68 bytes at the beginning of the shellcode block are a continuation of this. Figure 8, below, shows the ROP section at the end of the generated TIFF file.

 

Figure 8

ROP shellcode can be thought of as a “replacement stack” containing “gadget” addresses and function arguments. A gadget is a fragment of machine code from a currently-loaded DLL, which happens to end in a ‘RET’ instruction. ROP works by chaining together such gadgets to achieve the desired objective, so by looking up the code of each gadget in the DLL, it is possible to determine what the shellcode does.
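A first pass over such a "replacement stack" can be automated. The Python sketch below assumes a 32-bit little-endian layout and simply labels each 4-byte word as a possible gadget address if it falls inside the address range of the DLL in question, and as data otherwise. The module base and size must be supplied by the analyst; the values in the example are placeholders, not the real load address of any DLL.

```python
import struct


def rop_words(rop):
    """Interpret the ROP section as a list of 32-bit little-endian words."""
    if len(rop) % 4 != 0:
        raise ValueError("ROP section is not a whole number of 32-bit words")
    return list(struct.unpack("<%dI" % (len(rop) // 4), rop))


def label_words(words, module_base, module_size):
    """Tag each word as a possible gadget address or as plain data/argument."""
    labelled = []
    for word in words:
        inside = module_base <= word < module_base + module_size
        labelled.append((word, "gadget?" if inside else "data"))
    return labelled


if __name__ == "__main__":
    # Placeholder values for demonstration only.
    demo = struct.pack("<3I", 0x10001000, 0x00000040, 0x1000FFFF)
    for word, kind in label_words(rop_words(demo), 0x10000000, 0x10000):
        print("%08X %s" % (word, kind))
```

Words labelled as possible gadgets still have to be looked up in the DLL by hand, since a value inside the module range may also be a pointer to data rather than code.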

In this case, the ROP shellcode circumvents data execution prevention measures by using the API function ‘CreateFileMapping()’. The normal use of this function is to map part of an existing file into memory. But it can also be used to create a “fake” mapped file that refers to part of the paging file and has no physical existence in the file system. This is what the ROP shellcode does, after which it copies the executable portion of the shellcode to this location.  Because the mapped region has ‘execute’ permissions, there is now nothing to prevent the rest of the shellcode from executing.

Figure 9, below, shows how the variant employing ROP gadgets from icucnv34.dll works (the one for icucnv36.dll is identical apart from the gadget addresses).

 

Figure 9

 

The Executable Shellcode

The executable shellcode begins with a standard XOR decryption loop which decrypts the subsequent code, as shown in figure 10.

 

Figure 10

If this process is carried out in a hex editor, the embedded URL may be clearly seen at the end of the shellcode block, such as in figure 11, below. Each sample seemed to use a unique subdomain, possibly generated at random. This may have been to allow the authors to track infections, or perhaps to make the requests harder for simple filtering to block. 

 

Figure 11
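The same decryption can be scripted instead of being done by hand in a hex editor. The Python sketch below assumes a single-byte XOR key, which is an assumption on our part for illustration, tries every possible key, and reports the first one under which a URL becomes visible. The blob in the example is synthetic and the host name is invented.

```python
import re

URL_PATTERN = re.compile(rb"https?://[\x21-\x7e]+")


def xor_bytes(data, key):
    """XOR every byte of `data` with a single-byte key."""
    return bytes(b ^ key for b in data)


def find_xor_url(blob):
    """Try all non-zero single-byte keys; return (key, url) for the first hit, else None."""
    for key in range(1, 256):
        match = URL_PATTERN.search(xor_bytes(blob, key))
        if match:
            return key, match.group(0).decode("ascii")
    return None


if __name__ == "__main__":
    # Synthetic example; the host name is made up.
    plain = b"\x90\x90http://a1b2c3.example.test/x.exe\x00"
    print(find_xor_url(xor_bytes(plain, 0x5A)))
```

Collecting the recovered host names across samples is one way to confirm the per-sample subdomain pattern noted above.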

Once the decryption has finished, the shellcode uses a fairly obscure API function, ‘URLDownloadToCacheFileA()’, to download a file from the URL. As in most shellcode, API function addresses are looked up via hash values, thus avoiding the need to store function names in the shellcode. In this shellcode, the function addresses are looked up as they are needed, as opposed to looking them all up at once and storing them in a table, as can be seen in figure 12.

 

Figure 12
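When analysing hash-based lookups, it helps to map the hash constants found in the shellcode back to function names. The Python sketch below uses the rotate-right-by-13 additive hash that is common in Windows shellcode. Whether this particular sample uses that exact algorithm is not established here, so treat the hash function as an assumption and swap in the one recovered from the disassembly.

```python
def ror32(value, count):
    """Rotate a 32-bit value right by `count` bits."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF


def ror13_hash(name):
    """Assumed hash: rotate the running value right by 13, then add the next character."""
    value = 0
    for ch in name.encode("ascii"):
        value = (ror32(value, 13) + ch) & 0xFFFFFFFF
    return value


def resolve_hash(target, candidates):
    """Return the candidate export name whose hash equals `target`, or None."""
    for name in candidates:
        if ror13_hash(name) == target:
            return name
    return None


if __name__ == "__main__":
    exports = ["LoadLibraryA", "URLDownloadToCacheFileA", "ShellExecuteA"]
    wanted = ror13_hash("URLDownloadToCacheFileA")
    print("%08X -> %s" % (wanted, resolve_hash(wanted, exports)))
```

In practice the candidate list would be the full export table of each DLL the shellcode walks, so that every constant in the code can be matched to a name.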

Somewhat mysteriously, the executable shellcode has 96 bytes from the beginning of an executable MZ header embedded in it, as highlighted in figure 13.

 

Figure 13

It is possible that the payload file as downloaded is missing this header, and that the shellcode is supposed to restore the missing part before executing the payload. However, the shellcode as examined does not appear to implement this.

There is a loop which looks as if it is supposed to do the copying, but it just calls FlushViewOfFile() multiple times. FlushViewOfFile() is not a data copying function, and, to rule out some clever use of an unknown side-effect, tests with actual files have shown that no data actually gets copied at this point. So this is either badly-written shellcode or a placeholder for subsequent functionality.

Once the payload executable has been downloaded to the local user’s Temporary Internet Files directory, the ShellExecute() API function is called to execute it, as highlighted in figure 14.

 

Figure 14

This completes the execution of the shellcode, which now terminates the thread it has hijacked (but, it seems, not the entire process used by the application).

At the time of writing, the payload executables were no longer available from the URLs contained in the shellcode. However, during analysis, we reverse-engineered all aspects of the exploited PDF to the point where we were able to create our own versions for test purposes. Thus, we were able to test the functionality of the PDF in a safe manner by changing the shellcode’s URL so that it pointed to an executable file (the Windows calculator) on a small web server running locally.

Launching the PDF file containing the modified shellcode caused the executable to run, thus confirming the data obtained via static analysis, as shown in figure 15, below.

 

Figure 15

It is notable that, unlike previous PDF exploits, there is no payload embedded in the PDF (because it is downloaded instead). This makes the shellcode effectively “self-contained” and harder to detect, particularly as all of the malicious content is inside one XFA object.

 

Symantec.cloud Protects You From The Storm

Symantec.cloud has protected our customers from all such attacks. MessageLabs Intelligence analysis reveals that Skeptic™ has successfully blocked tens of thousands of such PDF-based attacks in the last few months. Figure 16 shows the proportion of PDF-based attacks that were blocked by traditional antivirus technology, compared with the proportion blocked by Skeptic™ in the cloud.

 

Figure 16

Traditional antivirus technology has had fairly limited success in blocking this new, more sophisticated form of PDF attack: as shown in figure 17, it has been effective at blocking only 14.7% of such attacks during the past six months. One clear message to take from all of this is that a defence-in-depth strategy utilizing cloud-based technology such as Skeptic™ is perhaps the safest approach.

 

 

Month            Traditional Technology    Skeptic™ Only
October 2010     83.2%                     16.8%
November 2010    99.4%                     0.6%
December 2010    28.7%                     71.3%
January 2011     17.4%                     82.6%
February 2011    10.3%                     89.7%
March 2011       3.4%                      96.6%
Average          14.7%                     85.3%

Figure 17