OpenAV: Developing Open Source AntiVirus Engines
by Costin Raiu
According to its Web site, the OpenAntivirus Project is “a platform for people seriously interested in antivirus research, network security and computer security to communicate with each other, to develop solutions for various security problems, and to develop new security technologies.” Among these technologies are ScannerDaemon, VirusHammer and PatternFinder, which are “a first implementation of a GPLed virus scanner written in Java.”
This article will take a look at the OpenAntivirus AV engine, assess its progress so far, and offer some suggestions of how the developers can continue to develop it. While some of the commentary in the following sections may be fairly critical, the purpose of this paper is not to flame the OpenAV project or its developers but, on the contrary, to salute their efforts. Hopefully, this article and the comments herein will make a significant contribution to the development of a viable, working open source antivirus product.
Antivirus Development Through the Ages
Many years ago, antivirus development seemed open to just about anyone. Leading researchers working at many large AV companies today started out designing their own small antivirus engines. In fact, this was so popular (and in some cases, even profitable) that eight years ago, almost every programmer with some assembler skills had written at least one virus detection routine and, in some cases, even a disinfection routine and a full report on the inner workings of the virus. But as the antivirus world evolved from a loose-knit community into an established, competitive industry, the gateway grew smaller, and fewer and fewer people managed to enter that realm.
There were a number of reasons for this, not the least of which was the huge growth in the number and types of viruses. In November 2002, I attended the Antivirus Asia Researchers Conference in Seoul. Among the many presentations on the status of antivirus protection around the world, I clearly remember one that placed the approximate number of known viruses in existence at about 70,000. Coming back from the conference, I asked two colleagues working for two different AV companies how many viruses and trojans they thought were currently in existence. The first estimate was 100,000, the second was 61,000. Personally, I'd say 70,000 is a good estimate. Compare this to the number of viruses and trojans believed to have existed in 1994, between 4500 and 7000, and it is safe to assume that over the past eight years the number of known viruses and trojans has increased at least tenfold.
With that in mind, it's no surprise that creating a detection engine able to find an acceptable minimum of all known viruses (for the sake of argument, let’s say 90%) is at least ten times more complicated than it used to be in the good old days of one-man antivirus products. Now, let's say that in 1994, it would have taken about twelve months to create an antivirus engine from scratch and make it capable of detecting 90% of the approximately 7000 known viruses. Multiplying that by the increase in the number of known viruses, one can conclude that anyone trying to develop a new antivirus product based on a brand new engine will probably have assured themselves a nice way to spend the next ten years. Of course, this doesn’t even take into account the fact that by the time those ten years have passed, there will be even more new viruses to deal with. Any project attempting to develop an antivirus engine from scratch today will most likely be doomed to fail, unless of course it is financed by a very rich, possibly eccentric, millionaire willing to pay for ten years of development.
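The back-of-the-envelope estimate above can be written out explicitly. The short Python sketch below uses only the figures quoted in this article; the assumption that effort grows linearly with the number of known viruses is a deliberate simplification for the sake of argument, not a measured result.

```python
# Figures quoted in the article. Linear scaling of effort with the
# number of known viruses is a simplifying assumption.
VIRUSES_1994 = 7_000      # upper estimate for 1994
VIRUSES_2002 = 70_000     # estimate for late 2002
MONTHS_IN_1994 = 12       # assumed time to reach 90% detection in 1994

growth_factor = VIRUSES_2002 / VIRUSES_1994
months_today = MONTHS_IN_1994 * growth_factor
years_today = months_today / 12

print(f"growth factor: {growth_factor:.0f}x")
print(f"estimated effort today: {years_today:.0f} years")
```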
On the other hand, you can try making it open source.
As most readers undoubtedly know, open source refers to any program whose source code is made available for viewing or modification by users or other members of the community. Thus, open source software is developed as a public collaboration and made freely available. It is also always evolving in ways that are suggested by its users. The best-known example of open source software is Linux.
Without a doubt, a great deal of the success Linux is experiencing comes from its open source structure. The idea that you can download the operating system from the Internet, install it, and recode it to modify features that do not work well is impossible to resist. Users can even share the changed code with the original developers and, except for the cases when the suggested changes are not much of an improvement, the entire community of users can benefit from the changes.
This brings us to the idea of the OpenAntivirus Project. The idea of an open source antivirus project is not new at all; actually, four years ago I myself discussed the idea of making public the source code of the antivirus engine I was working on. However, what is new here is the implementation. While in my case, it remained just an idea, the people from OpenAntivirus are making history by making the idea a reality.
However, while the OpenAV group may have a good idea at hand, they have some way to go before their product is viable, as the developers freely admit. This is entirely understandable; after all, new antivirus engines appear every day. But how many of them are really good, and how many are just file sweepers detecting only a handful of viruses? The sad truth is that none of them, or close to none, are really good, which is why I usually test every new engine to see where it stands and how close it is to being a reliable, operational AV engine.
To assess the detection rate of an AV engine, one can check it against viruses that are currently circulating in the wild or against the larger set of so-called "zoo" viruses. Of course, not everyone can have clean virus collections for testing AV products; fortunately, there are other ways of checking how well a new product fares in terms of detection. For instance, the Virus Test Center at the University of Hamburg is still (if not very often) performing its usual tests with antivirus products on most popular operating platforms. Their tests include both “in the wild” (ItW) and "zoo" viruses, as well as tests for detecting malware inside archives or packed files, which are increasingly common forms of distribution.
Detecting Macro Viruses
In October 2002, the Virus Test Center released the results of a test (test 2002-10) that included results for OpenAntivirus on a Linux platform, although, due to its portability, OpenAV could easily have been tested on other platforms as well. Unfortunately, as the youngest product in the test, OpenAV doesn't score very well compared to its more established “competitors”. It scored perfect zeroes in all the tests with packed objects. This didn’t come as a surprise: from prior investigation of OpenAV, I knew it doesn't have any unpacking plugins. Moreover, in the VTC test, OpenAV missed all the ItW macro viruses and detected only five of the 7000+ files in the "zoo" macro collection. On a brighter note, it managed to detect 24.6% of the files from the ItW script test set, although the rest of the products tested all scored a perfect 100%.
The first conclusion we can draw from these figures is that, as stated by the project developers in the quote that prefaced this article, OpenAV is still at its very beginnings. Low detection rates in the macro test set could be fixed by implementing a proper OLE2 parser, then implementing parsing and handling of the Word 6/7, 97 and 2000 formats, as well as of the corresponding Excel, PowerPoint and Access versions. (I'm sure many of you who are reading this article are smiling now, because the above operations occupied a good many years of your life.) Fortunately, macro viruses are a dying class anyway, so they are becoming less and less common. The same is true for script viruses, and even though implementing proper detection for those is much easier than for macros, by the time such detection is ready, there may be no more script viruses ItW to handle.
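To give a sense of where such an OLE2 parser begins, here is a minimal Python sketch that validates the compound file header and reads a few basic fields. It covers only the first step; walking the FAT, the directory and the macro storages (the part that takes years) is not attempted. The function names `parse_ole2_header` and `make_demo_header` are hypothetical and have nothing to do with the OpenAV code base.

```python
import struct

# Eight-byte signature at the start of every OLE2 compound file.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

def parse_ole2_header(data: bytes):
    """Return basic header fields, or None if data is not an OLE2 file."""
    if len(data) < 512 or data[:8] != OLE2_MAGIC:
        return None
    # Offsets 24..33: minor version, major version, byte order mark,
    # sector shift and mini sector shift (all little-endian 16-bit).
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<HHHHH", data, 24
    )
    if byte_order != 0xFFFE:
        return None
    return {
        "major_version": major,
        "minor_version": minor,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
    }

def make_demo_header() -> bytes:
    """Build a synthetic 512-byte header for demonstration only."""
    header = bytearray(512)
    header[:8] = OLE2_MAGIC
    struct.pack_into("<HHHHH", header, 24, 0x003E, 3, 0xFFFE, 9, 6)
    return bytes(header)

print(parse_ole2_header(make_demo_header()))
```

A real parser would go on to read the sector allocation tables and the directory stream before it could even locate a macro module.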
Detecting DOS and Win Binary Viruses
Before proceeding further, I should point out that the aVTC test 2002-10 did not include any DOS, Windows, Linux or other forms of executable viruses. Since OpenAV doesn't seem to excel in detecting macro and script viruses, we can assume its main strength lies in the detection of DOS and Win binary viruses. In order to assess how OpenAV detects these viruses, I decided to conduct my own test.
For the test sample I selected some viruses from the Virus Top Twenty for November 2002, as released by Kaspersky Labs (the company for whom the author works). The full Virus Top Twenty is as follows:
Next, by removing all script and macro viruses, we obtain the following shorter list. Since exact versions are not specified for every virus, in the cases of missing variants I took the first known member of that family, the ".A" variant. The following list therefore reflects the final test set:
Since these represent some of the most widespread viruses currently in circulation, users should never rely on an antivirus product that doesn't detect them all. At the very least, this relatively short list of 11 samples should be handled by any product that claims to provide some sort of protection against today's threats.
For testing, I used the latest available signatures from the OpenAntivirus Web site, which were dated October 29, 2002 (09:36:10 AM), and the VirusHammer 0.1.1 interface, which was the latest available version at the time of writing.
A test with the popular "EICAR test file" showed that the product was working, and scanning a larger collection of new and older ItW viruses proved that some viruses, such as versions of "VBS/LoveLetter", "VBS/Fireburn" or "Win32/Nimda@mm", were clearly detected. Unfortunately, in the tests I ran with my test set, none of the samples were detected or flagged by OpenAV in any way.
What Can Be Done?
There is no doubt that OpenAV is, as the authors like to say, “in a very early, mostly pre-alpha state”. The core set of technologies behind the detection built into OpenAV is also fairly primitive and, unfortunately, it is not built on a modular architecture. So what can be done to make this AV engine more effective?
The first serious step in future development would be to separate detection for the various types of malware into modules, and to implement better handling than the brute force string scanning approach that is currently used. By creating a module for scanning PEs, one for DOS executables, one for DOS batch viruses, one for Visual Basic scripts, and so on, the project can be split into parts that can be easily shared among the project members. This might also help a bit with the scan speed.
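As a rough illustration of this split, the Python sketch below identifies the object type from a few cheap header checks and hands the buffer to a matching module instead of string-scanning everything. The type names, the script and batch checks, and the `scan_*` functions are placeholders invented for this example; real modules would parse each format properly.

```python
import struct

def identify(data: bytes) -> str:
    """Classify a buffer with cheap header checks (hypothetical type names)."""
    if data[:2] == b"MZ":
        if len(data) >= 0x40:
            # The DOS header field at 0x3C points to the PE header, if any.
            (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
            if data[pe_offset:pe_offset + 4] == b"PE\0\0":
                return "pe"
        return "dos_exe"
    head = data[:256].lower()
    if head.lstrip().startswith(b"@echo off"):
        return "batch"
    if b"wscript" in head or b"createobject" in head:
        return "vbs"
    return "unknown"

# Placeholder modules, one per object type.
def scan_pe(data): return "scanned by PE module"
def scan_dos(data): return "scanned by DOS executable module"
def scan_batch(data): return "scanned by DOS batch module"
def scan_vbs(data): return "scanned by Visual Basic script module"

MODULES = {"pe": scan_pe, "dos_exe": scan_dos, "batch": scan_batch, "vbs": scan_vbs}

def scan(data: bytes) -> str:
    module = MODULES.get(identify(data))
    return module(data) if module else "no module for this type"

def make_fake_pe() -> bytes:
    """Synthetic buffer with just enough structure to look like a PE file."""
    buf = bytearray(0x80)
    buf[:2] = b"MZ"
    struct.pack_into("<I", buf, 0x3C, 0x40)
    buf[0x40:0x44] = b"PE\0\0"
    return bytes(buf)

print(scan(make_fake_pe()))
print(scan(b"@echo off\r\ncopy %0 c:\\x.bat\r\n"))
```

Because each module only sees the objects it understands, it can skip signatures that do not apply, which is where the modest speed gain would come from.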
In addition to these improvements, a virtual file system should be implemented in order to allow a common method of handling archives, packed files, and other types of native file systems such as raw FAT disks or Linux Ext2 partitions. All IO functions have to use the virtual file system methods, which will go through the associated unpacking plugin depending on the case. For instance, Adrian Marinescu's VB 1999 paper "Open Architectures in today's AV products" would be a good starting point in designing a more flexible architecture for the product.
Next, it is obvious that even with the current architecture, OpenAV suffers from a lack of virus definitions or signatures. This is hardly surprising, as the main source of virus signatures would be a good collection of viruses, which may be very hard for individuals with such limited antivirus experience or resources (i.e., individuals who do not work for AV companies) to obtain. That said, it's worth noting that the OpenAV Project includes a tool designed to brute force patch virus samples in order to find the exact signature a commercial antivirus product uses to detect it. More precisely, the "PatternFinder" tool from the OpenAV package will patch bytes in a given virus sample and try to guess the signature used by various AV products to detect the virus.
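The general idea behind this kind of byte patching can be shown with a toy Python sketch. It is not PatternFinder's actual code: the "detector" here is a stand-in function that flags any buffer containing a made-up four-byte pattern, and `find_signature_span` is a hypothetical name chosen for this example.

```python
def find_signature_span(sample: bytes, is_detected):
    """Return (start, end) of the bytes whose modification stops detection."""
    if not is_detected(sample):
        return None
    sensitive = []
    for i in range(len(sample)):
        patched = bytearray(sample)
        patched[i] ^= 0xFF  # flip every bit of a single byte
        if not is_detected(bytes(patched)):
            sensitive.append(i)
    if not sensitive:
        return None
    return sensitive[0], sensitive[-1] + 1

# Toy stand-in for a scanner: a made-up pattern, not a real signature.
TOY_PATTERN = b"\xde\xad\xbe\xef"
toy_detector = lambda buf: TOY_PATTERN in buf

toy_sample = b"A" * 10 + TOY_PATTERN + b"B" * 6
span = find_signature_span(toy_sample, toy_detector)
print(span, toy_sample[span[0]:span[1]].hex())
```

Even in this idealized setting, the result only describes one detector's view of one sample, and the sketch assumes the detector relies on a single fixed byte string.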
Without commenting on the legal or moral implications of using the work of other researchers, for which the producing company no doubt invested many resources and time, I believe that such methods are neither accurate nor reliable. Furthermore, this approach will obviously not work with encrypted and polymorphic viruses; in fact, it would only be adding obsolete information to the database due to extraction of "signatures" that can only be found in a singular sample of the virus.
Finally, an essential (sine qua non) component for OpenAV would be a heuristic module, at least for the most widespread type of malware at any given moment. I almost dare not mention the possibility of developing a code emulator, which would be a good starting point in the detection of complex polymorphic viruses for which signatures are useless. However, such a task would probably be beyond the capabilities of an open source project. Nevertheless, the challenge is out!
Anyway, let's take a look at all these suggested improvements one by one.
Implementing a Modular Engine
As mentioned in the previous section, separating the various engine components according to their function is the main step in the creation of any modern antivirus engine. Let's take a look at the following generic antivirus engine template, which is implemented in one form or another in most of today's products:
Figure 1: Modular Antivirus Engine Template
The main component of the engine, the AV kernel, is responsible for most of the low-level operations such as loading and initializing plugins, directing the scan logic depending on the various settings (such as unpack archives, heuristics on/off, etc.), talking to the upper-level user interface, providing low-level functions for all the plugins, and maintaining a plugin stack for the currently scanned object.
The Virtual File System Interface (VFSI) handles all IO functions performed on a certain object, which can be a native file on disk, a file inside an archive, or something such as the memory image of a running process. One module inside the VFSI deserves special attention: the module that implements all OS native file IO operations. Whenever the UI asks the kernel to scan an OS native file system, the kernel will first push this module onto the VFS stack and use it for all subsequent operations with files or directories. Whenever something like an archive is encountered, the associated unpacking plugin is initialized and pushed onto the VFS stack as well, so that all file operations with the virtual objects inside the archive will first go through the proper unpack routines. In theory, this cycle can repeat ad infinitum, but most products limit the number of nested archives to avoid exhausting the stack or memory.
The third important module in our template is the virus detection plugins interface, which manages all registered scan plugins. For every known malware type, an associated detection plugin will handle all specific operations such as parsing the various container fields, viral body identification, and, eventually, disinfection. Two different detection plugins may share a large amount of common code (for example, searching a buffer for a signature), in which case the common code can be stored in a third module and called from there, or the two modules can be merged into a single larger scope plugin. Personally, I prefer the first method, because it makes debugging easier and because an update to the shared code, such as the string searching algorithm, will not require recompilation of all the scan plugins that depend on it.
In general, the structure discussed above can be used to properly design and develop any competitive product and, to my knowledge, it has already been used many times. Adapting OpenAV to this modus operandi would no doubt require lots of work and many sleep-deprived nights, but that is true for any of the top class AV products available today.
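The layering described above can be sketched in a few lines of Python. In this simplified example, a tuple of names plays the role of the VFS stack, Python's standard `zipfile` module stands in for an unpacking plugin, and `MAX_DEPTH` is an assumed nesting limit; `walk` and `make_zip` are hypothetical names, and a real VFSI would expose seek/read style IO rather than whole buffers.

```python
import io
import zipfile

MAX_DEPTH = 3  # assumed limit on the number of nested archive layers

def walk(name, data, stack=()):
    """Yield (virtual_path, bytes) for every leaf object reachable from data."""
    path = stack + (name,)
    if zipfile.is_zipfile(io.BytesIO(data)):
        if len(stack) >= MAX_DEPTH:
            yield "/".join(path) + " [nesting limit reached]", b""
            return
        # "Push" the archive layer: members are read through the unpacker.
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for member in archive.namelist():
                yield from walk(member, archive.read(member), path)
    else:
        yield "/".join(path), data

def make_zip(members):
    """Build an in-memory ZIP from a {name: bytes} mapping (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        for member_name, payload in members.items():
            archive.writestr(member_name, payload)
    return buf.getvalue()

inner = make_zip({"a.txt": b"hello"})
outer = make_zip({"inner.zip": inner, "b.txt": b"world"})
for virtual_path, payload in walk("outer.zip", outer):
    print(virtual_path, len(payload))
```

The detection code never needs to know whether an object came from disk or from three archives deep; it only sees the bytes and a virtual path.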
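A minimal Python sketch of such an interface follows, with the signature search kept in one shared helper rather than duplicated in each plugin. All class, function and signature names here (`PluginInterface`, `DetectionPlugin`, `find_signature`, "Demo.Script.A" and so on) are invented for illustration, and the byte patterns are harmless made-up strings.

```python
from typing import Callable, Dict, List, Optional

def find_signature(buffer: bytes, signatures: Dict[str, bytes]) -> Optional[str]:
    """Shared helper: name of the first signature found in buffer, else None."""
    for name, pattern in signatures.items():
        if pattern in buffer:
            return name
    return None

class DetectionPlugin:
    """One plugin per malware type; 'handles' decides whether it applies."""
    def __init__(self, name: str, handles: Callable[[bytes], bool],
                 signatures: Dict[str, bytes]):
        self.name = name
        self.handles = handles
        self.signatures = signatures

    def scan(self, data: bytes) -> Optional[str]:
        # Every plugin calls the same shared search routine.
        return find_signature(data, self.signatures)

class PluginInterface:
    """Keeps the list of registered plugins and routes objects to them."""
    def __init__(self) -> None:
        self._plugins: List[DetectionPlugin] = []

    def register(self, plugin: DetectionPlugin) -> None:
        self._plugins.append(plugin)

    def scan(self, data: bytes) -> Optional[str]:
        for plugin in self._plugins:
            if plugin.handles(data):
                verdict = plugin.scan(data)
                if verdict:
                    return f"{plugin.name}: {verdict}"
        return None

interface = PluginInterface()
interface.register(DetectionPlugin(
    "script", lambda d: b"createobject" in d.lower(),
    {"Demo.Script.A": b"demo-script-marker"}))
interface.register(DetectionPlugin(
    "dos_exe", lambda d: d[:2] == b"MZ",
    {"Demo.Dos.A": b"demo-dos-marker"}))

print(interface.scan(b"MZ....demo-dos-marker...."))
```

Swapping `find_signature` for a faster multi-pattern algorithm would then be a change in one place, leaving the plugins untouched.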
Increasing the Number of Known Viruses
The starting point towards obtaining a better detection rate in any AV product is the creation of a properly organized virus collection. I will not cover this subject in full here because there are already many excellent papers available on it, notably Vesselin Bontchev's article Analysis and Maintenance of a Clean Virus Library, which is generally recognized as the de facto collection management guide. Once this goal has been reached, access to the virus collection will have to be restricted to a core of OpenAV researchers, who will deal with the database maintenance and the sharing of information with the other players in the business.
Among the many advantages of putting some effort into producing such a well-sorted virus collection is that detection quality checks will become much easier, especially after a major code change to any of the critical points of the product, such as the code emulator.
Besides that, it is also important to set up a good handling flow for new or unknown samples. Such a flow will allow new ItW viruses to be handled in a better way, minimizing the time required to produce an update in case of emergency and avoiding situations where people in different corners of the world work on the same sample redundantly.
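Such a quality check can be as simple as running the engine over the collection before and after a change and comparing the sets of detected samples. The Python sketch below assumes the collection is a mapping from sample name to contents and that the engine is a function returning a verdict or `None`; the two toy "engine versions" and the sample names are made up for the example.

```python
def detection_report(collection, scan):
    """Return (set of detected sample names, detection rate)."""
    detected = {name for name, data in collection.items() if scan(data)}
    rate = len(detected) / len(collection) if collection else 0.0
    return detected, rate

def regressions(before, after):
    """Samples detected by the old build but missed by the new one."""
    return sorted(before - after)

# Toy collection and two toy engine versions (placeholders, not real data).
collection = {
    "sample-001": b"xx marker-one xx",
    "sample-002": b"xx marker-two xx",
    "sample-003": b"xx marker-three xx",
}
old_engine = lambda d: "hit" if b"marker-one" in d or b"marker-two" in d else None
new_engine = lambda d: "hit" if b"marker-two" in d or b"marker-three" in d else None

old_detected, old_rate = detection_report(collection, old_engine)
new_detected, new_rate = detection_report(collection, new_engine)
print(f"old: {old_rate:.1%}, new: {new_rate:.1%}")
print("lost detections:", regressions(old_detected, new_detected))
```

Note that the overall rate is identical for both builds in this example, which is exactly why the per-sample comparison matters: an unchanged percentage can hide lost detections.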
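One small building block of such a flow is making sure that the same sample is never analyzed twice. A minimal Python sketch, assuming incoming samples are identified by their SHA-256 digest and assigned to the first analyst who submits them (the `SampleQueue` class and the analyst names are hypothetical):

```python
import hashlib

class SampleQueue:
    """Tracks which analyst owns each incoming sample, keyed by SHA-256."""
    def __init__(self):
        self._owners = {}

    def submit(self, data: bytes, analyst: str):
        """Register a sample; return (digest, owner, is_new)."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self._owners
        owner = self._owners.setdefault(digest, analyst)
        return digest, owner, is_new

queue = SampleQueue()
print(queue.submit(b"new sample bytes", "alice")[1:])
print(queue.submit(b"new sample bytes", "bob")[1:])
```

A hash only catches byte-identical duplicates, so a real flow would still need family-level grouping for variants of the same virus.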
The last point in my suggested list of improvements (besides the code emulator) is the addition of some form of heuristics, even in a very primitive form. Since more and more of the new viruses reported to be ItW at the time of writing are coded either in Visual Basic or Visual C, the project team should focus on these two platforms at first. Then, after some basic VB/VC heuristic module is in place, it can switch to scripts (VBS/JS), Office macros and, depending on the status of infection reports at that time, DOS heuristics or more exotic forms of malware such as .NET viruses.
As I was saying, all of the modules mentioned above will no doubt benefit from a good code emulator, but that may also be developed “on the fly” to suit the needs of the heuristic analyzer and, later, to permit detection of more complex polymorphic and encrypted viruses.
The popularity of the OpenAV project has no doubt been on the increase since it was initiated two years ago. Interestingly, there are already other independent products that depend on the OpenAV signature database (such as ClamAV) and that benefit from the support of many other popular Unix security software developers. Therefore, I suspect we are likely to see even more projects based on the OpenAV core in the near future.
Finally, as a general conclusion, there is no doubt that with a little more effort, this could turn into a major player in the security business. Maybe not today or tomorrow or next month, but slowly... in time.
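To show what "a very primitive form" might look like, here is a Python sketch of a weighted indicator score. It uses script text only because that is the shortest thing to demonstrate; the indicator strings, weights and threshold are illustrative guesses, not tuned values, and a heuristic for VB or VC binaries would look at imports and code structure instead.

```python
# Illustrative indicators and weights; none of these values are tuned.
INDICATORS = {
    b"outlook.application": 3,         # scripted access to the mail client
    b"wscript.scriptfullname": 2,      # script locating its own file
    b"regwrite": 2,                    # registry modification
    b"scripting.filesystemobject": 1,  # file access, common in benign scripts
}
THRESHOLD = 5  # assumed cut-off for reporting an object as suspicious

def heuristic_score(data: bytes) -> int:
    """Sum the weights of all indicators present in the (lowercased) data."""
    lowered = data.lower()
    return sum(weight for pattern, weight in INDICATORS.items() if pattern in lowered)

def is_suspicious(data: bytes) -> bool:
    return heuristic_score(data) >= THRESHOLD

benign = b'Set fso = CreateObject("Scripting.FileSystemObject")'
odd = (b'Set o = CreateObject("Outlook.Application")\r\n'
       b'me = WScript.ScriptFullName\r\n'
       b'sh.RegWrite "HKCU\\Demo", me\r\n')

print(heuristic_score(benign), is_suspicious(benign))
print(heuristic_score(odd), is_suspicious(odd))
```

The point of weighting is that no single indicator is damning on its own; only the combination crosses the threshold, which keeps false positives on ordinary administrative scripts down.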
2. Marinescu, Adrian: "Open Architectures in today's AV products"
6. Bontchev, V.: "Analysis and Maintenance of a Clean Virus Library"
Virus Test Centre Assessment of OpenAV
This article originally appeared on SecurityFocus.com -- reproduction in whole or in part is not allowed without expressed written consent.