ABOUT SYMANTEC

Academic Papers

Attack Attribution

  • A Strategic Analysis of Spam Botnets Operations
    We present in this paper a strategic analysis of spam botnets operations, i.e., we study the inter-relationships among botnets through their spam campaigns, and we focus on identifying similarities or differences in their modus operandi. The contributions of this paper are threefold. First, we provide an in-depth analysis which, in contrast with previous studies on spamming bots, focuses on the long-term, strategic behavior of spam botnets as observed through their aggregate spam campaigns. To that end, we have analyzed over one million spam records collected by Symantec.cloud (formerly Message Labs) through worldwide distributed spamtraps. Secondly, we demonstrate the usefulness of emerging attack attribution methodologies to extract intelligence from large spam data sets, and to correlate spam campaigns according to various combinations of different features. By leveraging these techniques relying on data fusion and multi-criteria decision analysis, we show that some tight relationships exist among different botnet families (like Rustock/Grum or Lethic/Maazben), but we also underline some profound differences in spam campaigns performed by other bots, such as Rustock versus Lethic, Bagle or Xarvester. Finally, we use the very same attribution methodology to analyze the recent Rustock take-down, which took place on March 17, 2011. As opposed to previous claims, our experimental results show that Bagle has probably not taken over Rustock’s role, but instead, we found some substantial evidence indicating that part of Rustock activity may have been offloaded to Grum.
    By: Olivier Thonnard* and Marc Dacier* (Symantec Research Labs)

    Published in: Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference (CEAS 2011), Perth, Australia, 1-2 September 2011

    Please obtain a copy of this paper from the ACM.
    O. Thonnard and M. Dacier, 2011.
  • An Analysis of Rogue AV Campaigns
    Rogue antivirus software has recently received extensive attention, justified by the diffusion and efficacy of its propagation. We present a longitudinal analysis of the rogue antivirus threat ecosystem, focusing on the structure and dynamics of this threat and its economics. To that end, we compiled and mined a large dataset of characteristics of rogue antivirus domains and of the servers that host them.

    The contributions of this paper are threefold. Firstly, we offer the first, to our knowledge, broad analysis of the infrastructure underpinning the distribution of rogue security software by tracking 6,500 malicious domains. Secondly, we show how to apply attack attribution methodologies to correlate campaigns likely to be associated to the same individuals or groups. By using these techniques, we identify 127 rogue security software campaigns comprising 4,549 domains. Finally, we contextualize our findings by comparing them to a different threat ecosystem, that of browser exploits. We underline the profound difference in the structure of the two threats, and we investigate the root causes of this difference by analyzing the economic balance of the rogue antivirus ecosystem. We track 372,096 victims over a period of 2 months and we take advantage of this information to retrieve monetization insights. While applied to a specific threat type, the methodology and the lessons learned from this work are of general applicability to develop a better understanding of the threat economies.
    By: Marco Cova (University of California, Santa Barbara), Corrado Leita (Symantec Research Labs), Olivier Thonnard (Royal Military Academy, Belgium), Angelos Keromytis (Columbia University), and Marc Dacier (Symantec Research Labs)

    Published in: Proceedings of the 13th Symposium on Recent Advances in Intrusion Detection (RAID 2010), Ottawa, Canada, September 2010

    Please obtain a copy of this paper from Springer Verlag LNCS.
    M. Cova, C. Leita, O. Thonnard, A. Keromytis, and M. Dacier, 2010.
  • Honeypot traces forensics: the observation view point matters
    In this paper, we propose a method to identify and group together traces left on low interaction honeypots by machines belonging to the same botnet(s) without having any a priori information at our disposal regarding these botnets. In other terms, we offer a solution to detect new botnets thanks to very cheap and easily deployable solutions. The approach is validated thanks to several months of data collected with the worldwide distributed Leurr´e.com system. To distinguish the relevant traces from the other ones, we group them according to either the platforms, i.e. targets hit or the countries of origin of the attackers. We show that the choice of one of these two observations viewpoints dramatically influences the results obtained. Each one reveals unique botnets. We explain why. Last but not least, we show that these botnets remain active during very long periods of times, up to 700 days, even if the traces they left are only visible from time to time.
    By: Van-Hau Pham (Institute Eurecom) and Marc Dacier (Symantec Research Labs)

    Published in: Proceedings of the 3rd International Conference on Network and System Security, Gold Coast, Australia, October 2009

    Please obtain a copy of this paper from the publisher listed above.
    V. Pham and M. Dacier, 2009.
  • Addressing the attack attribution problem using knowledge discovery and multi-criteria fuzzy decision-making
    In network traffic monitoring, and more particularly in the realm of threat intelligence, the problem of attack attribution refers to the process of effectively attributing new attack events to (un)-known phenomena, based on some evidence or traces left on one or several monitoring platforms. Real-world attack phenomena are often largely distributed on the Internet, or can sometimes evolve quite rapidly. This makes them inherently complex and thus difficult to analyze. In general, an analyst must consider many different attack features (or criteria) in order to decide about the plausible root cause of a given attack, or to attribute it to some given phenomenon. In this paper, we introduce a global analysis method to address this problem in a systematic way. Our approach is based on a novel combination of a knowledge discovery technique with a fuzzy inference system, which somehow mimics the reasoning of an expert by implementing a multi-criteria decision-making process built on top of the previously extracted knowledge. By applying this method on attack traces, we are able to identify large scale attack phenomena with a high degree of confidence. In most cases, the observed phenomena can be attributed to so-called zombie armies - or botnets, i.e. groups of compromised machines controlled remotely by a same entity. By means of experiments with real-world attack traces, we show how this method can effectively help us to perform a behavioral analysis of those zombie armies from a long-term, strategic viewpoint.
    By: Olivier Thonnard, Wim Mees (Royal Military Academy, Belgium); and Marc Dacier (Symantec Research Labs)

    Published in: Proceedings of KDD’09, 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Workshop on CyberSecurity and Intelligence Informatics, Paris, France, June 2009

    Conference Best Paper Award

    Please obtain a copy of this paper from the publisher listed above.
    O. Thonnard, W. Mees, and M. Dacier, 2009.
  • Behavioral Analysis of Zombie Armies
    Zombie armies - or botnets, i.e., large groups of compromised machines controlled remotely by a same entity - pose today a significant threat to national security. Recent cyber-conflicts have indeed demonstrated that botnets can be easily turned into digital weapons, which can be used by cybercriminals to attack the network resources of a country by performing simple Distributed Denial-of Service (DDoS) attacks against critical web services. A deep understanding of the long term behavior of botnet armies, and their strategic evolution, is thus a vital requirement to combat effectively those latent threats. In this paper, we show how to enable such a long-term, strategic analysis, and how to study the dynamic behaviors and the global characteristics of these complex, large-scale phenomena by applying different techniques from the area of knowledge discovery on attack traces collected on the Internet. We illustrate our method with some experimental results obtained from a set of worldwide distributed server honeypots, which have monitored attack activity in 18 different IP subnets for more than 640 days. Our preliminary results highlight several interesting findings, such as i) the strong resilience of zombie armies on the Internet, with survival times going up to several months; ii) the high degree of coordination among zombies; iii) the highly uneven spatial distribution of bots in a limited number of "unclean networks”, and iv) the large proportion of home users' machines with high-speed Internet connections among the bot population.
    By: Olivier Thonnard (Royal Military Academy, Belgium), Wim Mees (Eurecom), and Marc Dacier (Symantec Research Labs)

    Published in: Proceedings of the Cyber Warfare Conference (CWCon), Cooperative Cyber Defense Center Of Excellence (CCD-COE), Tallinn, Estonia, June 2009

    Please obtain a copy of this paper from the publisher listed above.
    O. Thonnard, W. Mees, and M. Dacier, 2009.
  • Actionable Knowledge Discovery for Threats Intelligence Support Using a Multi-dimensional Data Mining Methodology
    This paper describes a multi-dimensional knowledge discovery and data mining (KDD) methodology that aims at discovering actionable knowledge related to Internet threats, taking into account domain expert guidance and the integration of domain-specific intelligence during the data mining process. The objectives are twofold: i) to develop global indicators for assessing the prevalence of certain malicious activities on the Internet, and ii) to get insights into the modus operandi of new emerging attack phenomena, so as to improve our understanding of threats. In this paper, we first present the generic aspects of a domain-driven graph-based KDD methodology, which is based on two main components: a clique-based clustering technique and a concepts synthesis process using cliques' intersections. Then, to evaluate the applicability of this approach to our application domain, we use a large dataset of real-world attack traces collected since 2003. Our experimental results show that significant insights can be obtained into the domain of threat intelligence by using this multi-dimensional knowledge discovery method.
    By: Olivier Thonnard (Royal Military Academy, Belgium) and Marc Dacier (Symantec Research Labs)

    Published in: Proceedings of the IEEE Data Mining Workshops, 2008, (ICDMW 2008), December 2008

    Please obtain a copy of this paper from the publisher listed above.
    O. Thonnard and M. Dacier, 2008.

Cryptography

  • Data Loss Prevention Using an Ephemeral Key
    With the advent of cloud storage, smartphones, MP3 music players, and removable flash devices, data is more mobile than ever before. However, with this newfound mobility come the issues of how to determine whether data may be too sensitive to leave a user's device, and, if it is appropriate to save the data to remote storage, how best to secure it for the long term. Data loss prevention applications perform this job, typically by redirecting potentially sensitive saved files to a secure local storage quarantine, scanning them, and then doing a final copy to remote storage if the scan passes policy. The problem with local storage quarantine is the additional overhead required to essentially serially write the file twice---once to local storage and finally once to the remote storage destination. This paper presents an alternate method for doing data loss prevention using an ephemeral cryptographic key. By using an ephemeral key, encrypted data can be safely scanned in situ on the remote storage destination and securely removed if inappropriate. This direct technique results in better efficiency and lower latency than a circuitous local storage quarantine. An added benefit of using an ephemeral key for data loss prevention is that the encrypted file can be secured afterward to the persistent keys of multiple recipients with a minimum of additional post-processing.
    By: William J. Blanke (Enterprise Security Group, Symantec Corporation)

    Published in: Proceedings of the International Workshop on Security and Performance in Cloud Computing (SPCLOUD 2011), Istanbul, Turkey, July 2011

    Please obtain a copy of this paper from the IEEE.
    W. Blanke, 2011.

Data Mining

  • GRAPHITE: A Visual Query System for Large Graphs
    We present Graphite, a system that allows the user to visually construct a query pattern, finds both its exact and approximate matching subgraphs in large attributed graphs, and visualizes the matches. For example, in a social network where a person's occupation is an attribute, the user can draw a ‘star' query for "finding a CEO who has interacted with a Secretary, a Manager, and an Accountant, or a structure very similar to this.” Graphite uses the G-Ray algorithm to run the query against a user-chosen data graph, gaining all of its benefits, namely its high speed, scalability, and its ability to find both exact and near matches. Therefore, for the example above, Graphite tolerates indirect paths between, say, the CEO and the Accountant, when no direct path exists. Graphite uses fast algorithms to estimate node proximities when finding matches, enabling it to scale well with the graph database size. We demonstrate Graphite's usage and benefits using the DBLP author-publication graph, which consists of 356K nodes and 1.9M edges. A demo video of Graphite can be downloaded at <a href="http://www.cs.cmu.edu/~dchau/graphite/graphite.mov" target="_blank">http://www.cs.cmu.edu/~dchau/graphite/graphite.mov</a>.
    By: Duen Horng Chau**, Christos Faloutsos, Hanghang Tong, Jason I. Hong (Carnegie Mellon University); Brian Gallagher and Tina Eliassi-Rad (Lawrence Livermore National Laboratory)

    Published in: Proceedings of the ICDM 2008, Pisa, Italy, December 2008

    Please obtain a copy of this paper from the publisher listed above.
    D. Chau, C. Faloutsos, H. Tong, J. Hong, B. Gallagher, and T. Eliassi-Rad, 2008.
  • What to Do When Search Fails: Finding Information by Association
    Sometimes people cannot remember the names or locations of things on their computer, but they can remember what other things are associated with them. We created Feldspar, the first system that fully supports this associative retrieval of personal information on the computer. Feldspar's contributions include (1) an intuitive user interface that allows users to find information by interactively and incrementally specifying multiple levels of associations as retrieval queries, such as: "find the file from the person who I met at an event in May”; and (2) algorithms for collecting the association information and for providing answers to associative queries in real-time. A user study showed that Feldspar is easy to use, and suggested that it might be faster than conventional browsing and searching for these kinds of retrieval tasks. Feldspar could be an important addition to search and browsing tools.
    By: Duen Horng Chau**, Brad Myers, and Andrew Faulring (Carnegie Mellon University)

    Published in: Proceedings of the CHI'2008: Human Factors in Computing Systems, Florence, Italy, April 2008

    Please obtain a copy of this paper from the publisher listed above.
    D. Chau, B. Myers, and A. Faulring, 2008.
  • Feldspar: A System for Finding Information by Association
    We present Feldspar, the first system that allows people to find personal information on the computer by specifying chains of other information that it is associated with, emulating the information retrieval process of human associative memory. Feldspar's contributions include (1) a user interface for constructing retrieval queries that consist of multiple levels of associations, such as "find the folder containing the email attachment from the person I met at an event" and (2) algorithms for collecting the association information and for providing answers to associative queries in real-time. A user study showed that Feldspar is easy to use, and is superior to conventional browsing and searching for these kinds of retrieval tasks. We have reported Feldspar's implementation and evaluation in our long paper at CHI 2008. Here, our discussion focuses on the design and implementation ideas that we have tried, or are currently investigating. We hope this will stimulate further discussions and help inspire more design ideas.
    By: Duen Horng Chau**; Brad Myers, and Andrew Faulring (Carnegie Mellon University)

    Published in: Proceedings of the CHI 2008 Workshop on Personal Information Management: PIM 2008, Florence, Italy, April 2008

    Please obtain a copy of this paper from the publisher listed above.
    D. Chau, B. Myers, and A. Faulring, 2008.

High-Availability Systems

  • Guaranteeing Eventual Coherency across Data Copies, in a Highly Available Peer-to-Peer Distributed File System
    Peer-to-peer systems use redundant data replicas to maintain high availability and improve on user response times. The consistency and coherency of these data replicas need to be maintained over time in face of various system changes like node joins, failures and network outages. In this paper we present a robust approach to guarantee eventual coherency of replicas in a multi-location large scale peer-to-peer distributed file system. We use a combination of data pull and push mechanisms, and a last coherent time stamp on each replica. These mechanisms ensure that no user read operation ever retrieves data that is older than a configurable upper bound in time.
    By: Bijayalaxmi Nanda*, Anindya Banerjee* (Symantec Research Labs), and Navin Kabra (PuneTech.com)

    Published in: Proceedings of the 10th International Conference on Distributed Computing and Networking (ICDCN 2009), Hyderabad, India, 3-6 January, 2009

    Please obtain a copy of this paper from Springer LNCS.
    Bijayalaxmi Nanda, Anindya Banerjee, and Navin Kabra, 2009.
  • Fast Memory State Synchronization for Virtualization-based Fault Tolerance
    Virtualization provides the possibility of whole machine migration and thus enables a new form of fault tolerance that is completely transparent to applications and operating systems. While initial prototypes show promise, virtualization-based fault-tolerant architecture still experiences substantial performance overhead especially for data-intensive workloads. The main performance challenge of virtualization-based fault tolerance is how to synchronize the memory states of the Master and Slave in a way that minimizes the end-to-end impact on the application performance. This paper describes three optimization techniques for memory state synchronization: fine-grained dirty region identification, speculative state transfer, and synchronization traffic reduction using active slave, and presents a comprehensive performance study of these techniques under three realistic workloads, the TPC-E benchmark, the SPECsfs 2008 CIFS benchmark, and a Microsoft Exchange workload. We show that these three techniques can each reduce the amount of end-of-epoch synchronization traffic by a factor of up to 7, 15 and 5, respectively.
    By: Maohua Lu** (Stony Brook University) and Tzi-cker Chiueh (Symantec Research Labs)

    Published in: Proceedings of the DSN 2009 – The 39th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Estoril, Portugal, June 2009

    Please obtain a copy of this paper from the publisher listed above.
    M. Lu and T. Chiueh, 2009.
  • Accurate and Efficient Inter-Transaction Dependency Tracking
    A reparable database management system has the ability to automatically undo the set of transactions that are corrupted by a human error or malicious attack. The key technical challenge to building repairable database management systems is how to accurately and efficiently keep track of inter-transaction dependencies due to data sharing through a database or through an application. In this paper, we present the design, implementation and evaluation of the inter-transaction dependency tracking mechanisms used in a repairable database management system called Blastema, which adds fast repairability in a portable way to a commercial DBMS, Oracle 9.2.0. Compared with other repairable DBMSs, Blastema eliminates false positive dependencies using fine-grained inter-transaction dependency tracking, and is able to successfully recognize two major sources of false negative dependencies, phantom dependencies and dependencies among transactions due to application data sharing. With these advanced inter-transaction dependency tracking mechanisms, Blastema significantly improves the availability of modern DBMSs by facilitating and sometimes even automating the damage repair process after a human error or a malicious attack. Performance measurements on a fully operational Blastema prototype run under the TPC-C benchmark show that the average run-time throughput penalty of the proposed inter-transaction dependency tracking mechanisms is less than 18%.
    By: Tzi-cker Chiueh* and Shweta Bajpai (Stony Brook University)

    Published in: Proceedings of the 24th International Conference on Data Engineering (ICDE 2008), Cancun, Mexico , April 2008

    Best Industrial paper

    Please obtain a copy of this paper from the publisher listed above.
    T. Chiueh and S. Bajpai, 2008.

Human Computer Interaction and Phishing

  • PhorceField: A Phish-Proof Password Ceremony
    Many widely deployed phishing defense schemes, such as SiteKey, use client-side secrets to help users confirm that they are visiting the correct website before entering their passwords. Unfortunately, studies have demonstrated that up to 92% of users can be convinced to ignore missing client-side secrets and enter their passwords into phishing pages. However, since client-side secrets have already achieved industry acceptance, they are an attractive building block for creating better phishing defenses. We present PhorceField, a phishing resistant password ceremony that combines client-side secrets and graphical passwords in a novel way that provides phishing resistance that neither achieves on its own. PhorceField enables users to login easily, but forces phishers to present victims with a fundamentally unfamiliar and onerous user interface. Victims that try to use the phisher’s interface to enter their password find the task so difficult that they give up without revealing their password. We have evaluated PhorceField’s phishing resistance in a user study in which 21 participants used PhorceField for a week and were then subjected to a simulated phishing attack. On average, participants were only able to reveal 20% of the entropy in their password, and none of them revealed their entire password. This is a substantial improvement over previous research that demonstrated that 92% of users would reveal their entire password to a phisher, even if important security indicators were missing.

    PhorceField is easy to deploy in sites that already use client-side secrets for phishing defense—it requires no client-side software and can be implemented entirely in javascript. Banks and other high value websites could therefore deploy it as a drop-in replacement for existing defenses, or deploy it on an “opt-in” basis, as Google has done with its phone-based “2-step verification” system.
    By: Michael Hart*, Claude Castille, Manoj Harpalani, Jonathan Toohill, and Rob Johnson (Stony Brook University)

    Published in: Proceedings of the 27th Annual Computer Security Applications Conference (ACSAC 2011), Orlando, Florida, December, 2011

    Please obtain a copy of this paper from the ACM.
    M. Hart*, C. Castille, M. Harpalani, J. Toohill, and R. Johnson, 2011.

Malware Analysis

  • Field Data Available at Symantec Research Labs: The Worldwide Intelligence Network Environment (WINE)
    The data sets available today are often insufficient for conducting representative experiments or rigorous empirical research. The Worldwide Intelligence Network Environment (WINE) aims to fill this gap by providing access to sampled data feeds, which are used internally at Symantec Research Labs, and by promoting rigorous experimental methods. WINE allows researchers to define reference data sets, for validating new techniques or for conducting empirical studies, and provides the metadata needed for understanding the results. WINE archives these reference data sets in order to facilitate repeatable experiments and to enable meaningful comparisons against the prior art. Moreover, the field data included in WINE will likely provide key insights across a broad spectrum of disciplines, such as software reliability, computer security, machine learning, networking, economics, or visual analytics.
    By: Tudor Dumitras (Symantec Research Labs)

    Published in: online proceedings of the Second Exascale Evaluation and Research Techniques Workshop (ASPLOS-EXERT 2011), Newport Beach, CA, March 2011

    Invited Talk

    Please obtain a copy of this paper online or from the ACM.
    T. Dumitras, 2011.
  • Exploiting diverse observation perspectives to get insights on the malware landscape
    We are witnessing an increasing complexity in the malware analysis scenario. The usage of polymorphic techniques generates a new challenge: it is often difficult to discern the instance of a known polymorphic malware from that of a newly encountered malware family, and to evaluate the impact of patching and code sharing among malware writers in order to prioritize analysis efforts.

    This paper offers an empirical study on the value of exploiting the complementarity of different information sources in studying malware relationships. By leveraging real-world data generated by a distributed honeypot deployment, we combine clustering techniques based on static and behavioral characteristics of the samples, and we show how this combination helps in detecting clustering anomalies. We also show how the different characteristics of the approaches can help, once combined, to underline relationships among different code variants. Finally, we highlight the importance of contextual information on malware propagation for getting a deeper understanding of the evolution and the "economy" of the different threats.
    By: Corrado Leita (Symantec Research Labs), Ulrich Bayer (Technical University Vienna), and Engin Kirda (Institute Eurecom)

    Published in: Proceedings of the 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2010), Chicago, Illinois, USA, June 2010

    Please obtain a copy of this paper from the IEEE.
    C. Leita, U. Bayer, and E. Kirda, 2010.
  • Automatic Generation of String Signatures for Malware Detection
    Scanning files for signatures is a proven technology, but exponential growth in unique malware programs has caused an explosion in signature database sizes. One solution to this problem is to use string signatures, each of which is a contiguous byte sequence that potentially can match many variants of a malware family. However, it is not clear how to automatically generate these string signatures with a sufficiently low false positive rate. Hancock is the first string signature generation system that takes on this challenge on a large scale. To minimize the false positive rate, Hancock features a scalable model that estimates the occurrence probability of arbitrary byte sequences in goodware programs, a set of library code identification techniques, and diversity-based heuristics that ensure the contexts in which a signature is embedded in containing malware files are similar to one another. With these techniques combined, Hancock is able to automatically generate string signatures with a false positive rate below 0.1%.
    By: Kent Griffin*, Scott Schneider*, Xin Hu*, and Tzi-cker Chiueh* (Symantec Research Labs)

    Published in: Proceedings of the 12th International Symposium on Recent Advances in Intrusion Detection (RAID 2009), Saint-Malo, France, September 23-25, 2009, pp. 101-120, Springer LNCS 5758.

    Please obtain a copy of this paper from Springer.
    K. Griffin, S. Schneider, X. Hu, and T. Chiueh, 2009.
  • An Experimental Study of Diversity with Off-The-Shelf AntiVirus Engines
    Fault tolerance in the form of diverse redundancy is well known to improve the detection rates for both malicious and non-malicious failures. What is of interest to designers of security protection systems are the actual gains in detection rates that they may give. In this paper we provide exploratory analysis of the potential gains in detection capability from using diverse AntiVirus products for the detection of self-propagating malware. The analysis is based on 1599 malware samples collected by the operation of a distributed honeypot deployment over a period of 178 days. We sent these samples to the signature engines of 32 different AntiVirus products taking advantage of the VirusTotal service. The resulting dataset allowed us to perform analysis of the effects of diversity on the detection capability of these components as well as how their detection capability evolves in time.
    By: Ilir Gashi, Vladimir Stankovic (City University, London); Corrado Leita*, and Olivier Thonnard (Royal Military Academy, Belgium)

    Published in: Proceedings of the 8th IEEE Symposium on Network Computing and Applications, (NCA 2009), Cambridge, MA, July 9-11, 2009

    Conference Best Paper Award

    Please obtain a copy of this paper from the publisher listed above.
    I. Gashi, V. Stankovic, C. Leita, and O. Thonnard, 2009.
  • Execution Trace-Driven Automated Attack Signature Generation
    In its most general form, an attack signature is a program that can correctly determine if an input network packet sequence can successfully attack a protected network application. Filter rules used in firewall and network intrusion prevention systems (NIPS) are an abstract form of attack signature. This paper presents the design, implementation, and evaluation of an automated attack signature generation system called Trag, that automatically generates an executable attack signature program from a victim program’s source and a given attack input. Trag leverages dynamic data and control dependencies to extract relevant code in the victim program, accurately identifies variable initialization statements that are not executed in the given attack, is able to generate attack signatures for multi-process network applications, and reduces the size of attack signatures by exploiting responses from victim programs. Experiments with a fully working Trag prototype show that Trag’s signatures can indeed prevent attacks against multiple production-grade vulnerable server/web applications, such as apache, wu-ftpd and MyBullentinBoard, with up to 65% reduction in size when compared with the victim program. In terms of performance overhead, the additional latency as observed from the client-side is no more than 25 msec for multi-process web applications, while the overall throughput remains unaffected.
    By: Susanta Nanda and Tzi-cker Chiueh (Symantec Research Labs)

    Published in: Proceedings of the 24th Annual Computer Security Applications Conference (ACSAC 2008), Anaheim, CA, December 2008

    Please obtain a copy of this paper from the publisher listed above.
    S. Nanda and T. Chiueh, 2008.
  • Large scale malware collection: lessons learned
    In order to assure accuracy and realism of resilience assessment methods and tools, it is essential to have access to field data that are unbiased and representative. Several initiatives are taking place that offer access to malware samples for research purposes. Papers are published where techniques have been assessed thanks to these samples. Definition of benchmarking datasets is the next step ahead. In this paper, we report on the lessons learned while collecting and analyzing malware samples in a large scale collaborative effort. Three different environments are described and their integration used to highlight the open issues that remain with such data collection. Three main lessons are offered to the reader. First, creation of representative malware samples datasets is probably an impossible task. Second, false negative alerts are not what we think they are. Third, false positive alerts exist where we were not used to see them. These three lessons have to be taken into account by those who want to assess the resilience of techniques with respect to malicious faults.
    By: Julio Canto (Hispasec Sistemas, Spain); Marc Dacier (Symantec Research Labs); Engin Kirda, and Corrado Leita* (Eurecom)

    Published in: Proceedings of the IEEE SRDS 2008, Workshop on Sharing Field Data and Experiment Measurements on Resilience of Distributed Computing Systems, Napoli, Italy, October 2008

    Please obtain a copy of this paper from the publisher listed above.
    J. Canto, M. Dacier, E. Kirda, and C. Leita, 2008.
  • A Study of the Packer Problem and Its Solutions
    An increasing percentage of malware programs distributed in the wild are packed by packers, which are programs that transform an input binary’s appearance without affecting its execution semantics, to create new malware variants that can evade signature-based malware detection tools. This paper reports the results of a comprehensive study of the extent of the packer problem based on data collected at Symantec and the effectiveness of existing solutions to this problem. Then the paper presents a generic unpacking solution called Justin (Just-In-Time AV scanning), which is designed to detect the end of unpacking of a packed binary’s run and invoke AV scanning against the process image at that time. For accurate end-to-unpacking detection, Justin incorporates the following heuristics: Dirty Page Execution, Unpacker Memory Avoidance, Stack Pointer Check and Command-Line Argument Access. Empirical testing shows that when compared with SymPack, which contains a set of manually created unpackers for a collection of selective packers, Justin’s effectiveness is comparable to SymPack for those binaries packed by these supported packers, and is much better than SymPack for binaries packed by those that SymPack does not support.
    By: Fanglu Guo, Peter Ferrie, and Tzi-cker Chiueh (Symantec Research Labs)

    Published in: Proceedings of the 11th International Symposium on Recent Advances in Intrusion Detection (RAID 2008), Boston, MA, September 2008

    Please obtain a copy of this paper from the publisher listed above.
    F. Guo, P. Ferrie, and T. Chiueh, 2008.
  • Detecting Known and New Salting Tricks in Unwanted Emails
    Spam and phishing emails are not only annoying to users, but are a real threat to internet communication and web economy. The fight against unwanted emails has become a cat-and-mouse game between criminals and people trying to develop techniques for detecting such unwanted emails. Criminals are constantly developing new tricks and adopt the ones that make emails pass spam filters. We have developed a systems that enables the detection of certain common salting tricks that are employed by criminals. Salting is the intentional addition or distortion of content. In this paper we describe a framework to identify email messages that might contain new, previously unseen tricks. To this end, we compare the simulated perceived email message text generated by our hidden salting simulation system to the OCRed text we obtain from the rendered email message. We present robust text comparison techniques and train a classifier based on the differences of these two texts. In simulations we show that we can detect suspicious emails with a high level of accuracy.
    By: Andre Bergholz, Gerhard Paass, Frank Reichartz, Siehyun Strobel (Fraunhofer IAIS, Germany); Marie-Francine Moens (Katholieke Universitiet, Belgium); and Brian Witten (Symantec Research Labs)

    Published in: Proceedings of International Conference on Email and Anti-Spam (CEAS), Mountain View, CA, August 2008

    Please obtain a copy of this paper from the publisher listed above.
    A. Bergholz, G. Paass, F. Reichartz, S. Strobel, M. Moens, and B. Witten, 2008.
  • A Forced Sampled Execution Approach to Kernel Rootkit Identification
    Kernel rootkits are considered one of the most dangerous forms of malware because they reside inside the kernel and can perform the most privileged operations on the compromised machine. Most existing kernel rootkit detection techniques attempt to detect the existence of kernel rootkits, but cannot do much about removing them, other than booting the victim machine from a clean operating system image and configuration. This paper describes the design, implementation and evaluation of a kernel rootkit identification system for the Windows platform called Limbo, which prevents kernel rootkits from entering the kernel by checking the legitimacy of every kernel driver before it is loaded into the operating system. Limbo determines whether a kernel driver is a kernel rootkit based on its binary contents and run-time behavior. To expose the execution behavior of a kernel driver under test, Limbo features a forced sampled execution approach to traverse the driver’s control flow graph. Through a comprehensive characterization study of current kernel rootkits, we derive a set of run-time features that can best distinguish between legitimate and malicious kernel drivers. Applying a Naive Bayes classification algorithm on this chosen feature set, the first Limbo prototype is able to achieve 96.2% accuracy for a test set of 754 kernel drivers, 311 of which are kernel rootkits.
    By: Jeffrey Wilhelm and Tzi-cker Chiueh (Symantec Research Labs)

    Published in: Proceedings of the 10th International Symposium on Recent Advances in Intrusion Detection (RAID 2007), Queensland, Australia, September 2007

    Please obtain a copy of this paper from the publisher listed above.
    J. Wilhelm and T. Chiueh, 2007.
  • Malware Evolution: A Snapshot of Threats and Countermeasures in 2005
    Speed, stealth, and purpose of malware threats and countermeasures are evolving quickly. This chapter describes these three facets of current malware threats, and describes a few countermeasures emerging to better address such threats.
    By: Brian Witten and Carey Nachenberg (Symantec Corporation)

    Published in: Malware Detection (Advances in Information Security), Mihai Christodorescu, Somesh Jha, Douglas Maughan, Dawn Song, Cliff Wang (eds.), Springer, 2006

    Please obtain a copy of this paper from the publisher listed above.
    B. Witten and C. Nachenberg, 2006.

Memory Management

  • Conservative Garbage Collection for General Memory Allocators
    This paper explains a technique that integrates conservative garbage collection on top of general memory allocators. This is possible by using two data structures named malloc-tables and jump-tables that are computed at garbage collection time to map pointers to beginning of objects and their sizes. This paper describes malloc-tables and jump-tables, an implementation of a malloc/jump-table based conservative garbage collector for Doug Lea's memory allocator, and experimental results that compare this implementation with Boehm-Demers-Weiser GC, a state-of-the-art conservative garbage collector.
    By: Gustavo Rodriguez-Rivera, Michael Spertus,* and Charles Fiterman (Geodesic Systems, now Symantec)

    Published in: Proceedings of the 2nd International Symposium on Memory Management (ISMM '00), Minneapolis, Minnesota, United States, October 15-16, 2000. pp. 71-79.

    Please obtain a copy of this paper from the ACM.
    G. Rodriguez-Rivera, M. Spertus, and C. Fiterman, 2000.
  • A Non-Fragmenting Non-Moving, Garbage Collector
    One of the biggest disadvantages of non-moving collectors compared to moving collectors has been their limited ability to deal with memory fragmentation. In this paper, we describe two techniques to reduce fragmentation without the need for moving live data. The first technique reduces internal fragmentation in BiBoP (Big-Bag-of-Pages) like allocators. The second technique reduces external fragmentation using virtual memory culls available in most modern operating systems. It can also reduce the size of the heap after periods of great activity in long lived applications. These techniques have been successfully used in Geodesic Systems' Great Circle, a commercially-available conservative garbage collector. This paper describes these techniques, their implementation, and some experimental results.
    By: Gustavo Rodriguez-Rivera, Michael Spertus,* and Charles Fiterman (Geodesic Systems, now Symantec)

    Published in: Proceedings of the 1st International Symposium on Memory Management (ISMM '98), Vancouver, British Columbia, Canada, October 17-19, 1998, pp. 79-85.

    Please obtain a copy of this paper from the ACM.
    G. Rodriguez-Rivera, M. Spertus, and C. Fiterman, 1998.

Natural Language Processing (NLP)

  • Evaluation of MT systems to translate user generated content
    This paper presents the evaluation results of a study conducted to determine the ability of various Machine Translation systems in translating User-Generated Content, particularly online forum content. Four systems are compared in this paper, focusing on the English > German and English > French language pairs, including a system called VICTOR, which is based on the Moses and IRSTLM toolkits. After describing some of the characteristics of these systems, the methodological framework used during a medium scale evaluation campaign is described. A careful analysis of both human and automated scores show that one system is overall significantly better than the other three systems for the English > German language pair, but that very little difference exists for specific post types (such as questions and solutions). The results are also much more balanced for the English > French language pair, suggesting that all systems could be useful in a multi-system deployment scenario. Our results also show that human scores and automated scores do not consistently correlate, penalizing certain systems more than others. Finally, we also show that the quality and coverage of the source posts impacts systems and language pairs differently.
    By J. Roturier and A. Bensadoun

    Published in: Proceedings of MT Summit XIII: the Thirteenth Machine Translation Summit, 19-23 September 2011, Xiamen, China; pp.244-251.

    Download paper
    J. Roturier and A. Bensadoun, 2011
  • Domain adaptation in statistical machine translation of user-forum data using component-level mixture modeling
    This paper reports experiments on adapting components of a Statistical Machine Translation (SMT) system for the task of translating online user-generated forum data from Symantec. Such data is monolingual, and differs from available bitext MT training resources in a number of important respects. For this reason, adaptation techniques are important to achieve optimal results. We investigate the use of mixture modelling to adapt our models for this specific task. Individual models, created from different in-domain and out-of-domain data sources, are combined using linear and log-linear weighting methods for the different components of an SMT system. The results show a more profound effect of language model adaptation over translation model adaptation with respect to translation quality. Surprisingly, linear combination outperforms log-linear combination of the models. The best adapted systems provide a statistically significant improvement of 1.78 absolute BLEU points (6.85% relative) and 2.73 absolute BLEU points (8.05% relative) over the baseline system for English–German and English–French, respectively.
    By: P. Banerjee, S. Kumar Naskar, J. Roturier, A. Way, & J. van Genabith

    Published in: Proceedings of MT Summit XIII: the Thirteenth Machine Translation Summit, 19-23 September 2011, Xiamen, China; pp.285-292.

    Download paper
    P. Banerjee, S. Kumar Naskar, J. Roturier, A. Way, & J. van Genabith, 2011
  • Qualitative analysis of post-editing for high quality machine translation
    In the context of massive adoption of Machine Translation (MT) by human localization services in Post-Editing (PE) workflows, we analyze the activity of post-editing high quality translations through a novel PE analysis methodology. We define and introduce a new unit for evaluating post-editing effort based on Post-Editing Action (PEA) - for which we provide human evaluation guidelines and propose a process to automatically evaluate these PEAs. We applied this methodology on data sets from two technologically different MT systems. In that context, we could show that more than 35% of the remaining effort can be saved by introducing of global PEA and edit propagation.
    By: F. Blain, J. Senellart, H. Schwenk, M. Plitt, & J. Roturier

    Published in: Proceedings of MT Summit XIII: the Thirteenth Machine Translation Summit, 19-23 September 2011, Xiamen, China; pp.164-171.

    Download paper
    F. Blain, J. Senellart, H. Schwenk, M. Plitt, & J. Roturier, 2011
  • Improving the Post-Editing Experience Using Translation Recommendation: A User Study
    We report findings from a user study with professional post-editors using a translation recommendation framework (He et al., 2010) to integrate Statistical Machine Translation (SMT) output with Translation Memory (TM) systems. The framework recommends SMT outputs to a TM user when it predicts that SMT outputs are more suitable for post-editing than the hits provided by the TM.
    By: Y. He, Y. Ma, J. Roturier, A. Way, and J. van Genabith

    Published in: Proceedings of AMTA 2010: The Ninth Conference of the Association for Machine Translation in the Americas, Denver, Colorado, October 31 – November 4, 2010. 10pp.

    Download paper
    Y. He, Y. Ma, J. Roturier, A. Way, and J. van Genabith, 2010
  • Source Text Characteristics and Technical and Temporal Post-Editing Effort: What is Their Relationship?
    This paper focuses on the relationship between source text characteristics (ambiguity, complexity and style compliance) and machine-translation post-editing effort (both temporal and technical). Post-editing data is collected in a traditional translation environment and subsequently plotted against textual scored produced by a range of systems. Our findings show some strong correlation between ambiguity and complexity scores and technical post-editing effort, as well as moderate correlation between one of the style guide compliance scores and temporal post-editing effort.
    By: M. Tatsumi and J. Roturier

    Published in: JEC 2010: Second joint EM+/CNGL Workshop, “Bringing MT to the user: Research on integrating MT in the translation industry”, AMTA 2010, Denver, Colorado, November 4, 2010. pp.43-51.

    Download paper
    M. Tatsumi and J. Roturier, 2010
  • TMX Markup: A Challenge When Adapting SMT to the Localisation Environment
    Translation memory (TM) plays an important role in localisation workflows and is used as an efficient and fundamental tool to carry out translation. In recent years, statistical machine translation (SMT) techniques have been rapidly developed, and the translation quality and speed have been significantly improved as well. However, when applying SMT technique to facilitate post-editing in the localisation industry, we need to adapt SMT to the TM data which is formatted with special mark-up. In this paper, we explore some issues when adapting SMT to Symantec formatted TM data. Three different methods are proposed to handle the Translation Memory eXchange (TMX) markup and a comparative study is carried out between them. Furthermore, we also compare the TMX-based SMT systems with a customized SYSTRAN system through human evaluation and automatic evaluation metrics. The experimental results conducted on the French and English language pair show that the SMT can perform well using TMX as input format either during training or at runtime.
    By: J. Du, J. Roturier, and A. Way

    Published in: Proceedings of the 14th Annual Meeting of the European Association for Machine Translation, St. Raphael, France. 8pp.

    Download paper
    J. Du, J. Roturier, and A. Way, 2010
  • A Novel Statistical Pre-Processing Model for Rule-Based Machine Translation System
    This paper introduces a new statistical preprocessing model for Rule-Based Machine Translation (RBMT) systems. We train a Statistical Machine Translation (SMT) system using monolingual corpora. This model can transform a source input to an RBMT system into a more target-language friendly or RBMTsystem friendly “pivot” language. We apply this proposed model to translation from English to Chinese in a pilot project. Automatic evaluation scores (BLEU, TER and GTM) show that this pre-processing model can increase the quality of the output of the RBMT system, especially with an increase in the size of the training corpus. This model is applicable to language pairs which differ in grammar and language structures.
    By: Y. Sun, S. O’Brien, M. O’Hagan, and F. Hollowood

    Published in: Proceedings of the 14th Annual conference of the European Association for Machine Translation, 27-28 May 2010, Saint-Raphaël, France. ed.Viggo Hansen and François Yvon; 8pp.

    Download paper
    Y. Sun, S. O’Brien, M. O’Hagan, and F. Hollowood, 2010
  • Mining the Correlation Between Human and Automatic Evaluation at Sentence Level
    Automatic evaluation metrics are fast and cost-effective measurements of the quality of a Machine Translation (MT) system. However, as humans are the end-user of MT output, human judgment is the benchmark to assess the usefulness of automatic evaluation metrics. While most studies report the correlation between human evaluation and automatic evaluation at corpus level, our study examines their correlation at sentence level. In addition to the statistical correlation scores, such as Spearman's rank-order correlation coefficient, a finer-grained and detailed examination of the sensitivity of automatic metrics compared to human evaluation is also reported in this study. The results show that the threshold for human evaluators to agree with the judgments of automatic metrics varies with the automatic metrics at sentence level. While the automatic scores for two translations are greatly different, human evaluators may consider the translations to be qualitatively similar and vice versa. The detailed analysis of the correlation between automatic and human evaluation allows us determine with increased confidence whether an increase in the automatic scores will be agreed by human evaluators or not.
    By: Y. Sun

    Published in: LREC 2010: proceedings of the seventh international conference on Language Resources and Evaluation, 17-23 May 2010, Valletta, Malta; pp.1726-1730.

    Download paper
    Y. Sun, 2010
  • How to Treat GUI Options in IT Technical Texts for Authoring and Machine Translation
    This paper focuses on one aspect of controlled authoring in a localization and Machine-Translation context: the treatment of GUI options, which abound in the procedural sections of IT technical documentation. GUI options are technical terms that refer to the Software User Interface. The length and complexity of GUI options is a major problem for numerous NLP task, including MT. GUI options which have not been identified by NLP applications typically lead to erroneous analyses of sentences. However, few authors have focused on the identification and tagging of GUI options in IT documentation. This paper delineates an approach based on a controlled language checker that benefits both the human authoring process and Machine Translation.
    By: J. Roturier and S. Lehmann

    Published in: JIAL Volume 1. pp. 40-59.

    Download paper
    J. Roturier and S. Lehmann, 2009
  • Deploying Novel MT Technology to Raise the Bar for Quality: Key Advantages and Challenges
    This paper is a case study of the deployment of new MT technology aiming at improving the overall Post-Editing (PE) experience. One of the main challenges in having MY output post-edited within a localization workflow is being able to meet ever increasing quality expectations, especially from a terminology perspective. In Symantec’s localization process we constantly seek to overcome this obstacle by refining our MT output in a number of ways. Several layers of customization have therefore been added over the years to our commercial MT system (such as User Dictionaries, Translation Stylesheets and Automated Post-Editing). Despite obtaining substantial quality gains with these techniques, improvements are still sought to minimize PE effort. The deployment of a novel technology, based on SYSTRAN’s new hybrid approach, is presented in this paper. This paper focuses on the technical and linguistic challenges associated with the integration of this new technology into the existing MY workflow.
    By: J. Roturier

    Published in: Proceedings of MT Summit XII, August 26-30, 2009, Ottawa, Ontario, Canada. pp. 1-8.

    Download paper
    J. Roturier, 2009
  • Correlation Between Automatic Evaluation Metric Scores, Post-Editing Speed, and Some Other Factors
    This paper summarizes the results of a pilot project conducted to investigate the correlation between automatic evaluation metric scores and post-editing speed on a segment by segment basis. Firstly, the results from the comparison of various automatic metrics and post-editing speed will be reported. Secondly, further analysis is carried out by taking into consideration other relevant variables, such as text length and structures, and by means of multiple regression. It has been found that different automatic metrics achieve different levels and types of correlation with post-editing speed. We suggest that some of the source text characteristics and machine translation errors may be able to account for the gap between the automatic metric scores and post-editing speed, and may also help with understanding human post-editing process.
    By: M. Tatsumi

    Published in: Proceedings of MT Summit XII, pp. 332-339.

    Download paper
    M. Tatsumi, 2009
  • Comparison of Alternatives to Strict Source Control: A Case Study with -ing Words
    This paper presents the results of a case study focusing on the design and development of specific Controlled Language rules, which address readability and translatability issues generated by –ing words. This paper reviews several development and deployment iterations, whereby users’ feedback was taken into account for future development phases. We show that two complementary techniques may be used: the generation flags for rule violations (and subsequent human validation) and the fully automatic introduction of additional elements in input text (pre-processing). We compare and contrast these two techniques in this paper, showing that the precision of both our machine-oriented rules and human-oriented rules reaches at least 80%.
    By: N. Aranberri and J. Roturier

    Published in: Proceedings of MT Summit XII, pp. 332-339.

    Download paper
    N. Aranberri and J. Roturier, 2009

Networking

  • Predictive Methods for Improved Vehicular WiFi Access
    With the proliferation of WiFi technology, many WiFi networks are accessible from vehicles on the road making vehicular WiFi access realistic. However, several challenges exist: long latency to establish connection to a WiFi access point (AP), lossy link performance, and frequent disconnections due to mobility. We argue that people drive on familiar routes frequently, and thus the mobility and connectivity related information along their drives can be predicted with good accuracy using historical information - such as GPS tracks with timestamps, RF fingerprints, and link and network-layer addresses of visible APs. We exploit such information to develop new handoff and data transfer strategies. The handoff strategy reduces the connection establishment latency and also uses pre-scripted handoffs triggered by change in vehicle location. The data transfer strategy speeds up download performance by using prefetching on the APs yet to be encountered. Experimental performance evaluation reveals that the predictability of mobility and connectivity is high enough to be useful in such protocols. In our experiments with a vehicular client accessing road-side APs, the handoff strategy improves download performance by roughly a factor of 2 relative to the state-of-the-art. The data transfer strategy further improves this performance by another factor of 2.5.
    By: Pralhad Deshpande (Stony Brook University), Anand Kashyap* (Symantec Research Labs), Chul Sung, and Samir R. Das (Stony Brook University)

    Published in: Proceedings of the 7th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys '09), Kraków, Poland, June 22-25, 2009, pp. 263-276.

    Please obtain a copy of this paper from the ACM.
    P. Deshpande, A. Kashyap, C. Sung, and S. Das, 2009.
  • The eBay graph: How do online auction users interact?
    Online auctions have revolutionized the ability of people to buy and sell items without middlemen, and sales reaching more than $57 billion every year on eBay alone. The user interactions at online auctions form a network of interactions akin to a social network. Unlike other online social networks, online auction networks have not been studied so far. In this paper, we model and characterize the structure and evolution of the network of user interactions on eBay. Specifically, we use over 54 million eBay transaction feedback that users leave for each other. A distinguishing feature of the graph is the rich meaning that nodes and edges can have (for example, an edge could be positive or negative, posted by a buyer or a seller) in contrast to other graphs such as the topology of the Internet. First, we provide the vital statistics of the emerging graph, which we call eBay graph. We observe that the graph exhibits both significant differences and similarities to commonly studied graphs. Second, we study the feedback behavior of users: feedback is not always reciprocal, and negative feedback is scarce (less than 1%), but few negative feedbacks could discourage new users. Finally, we develop an intuitive model that captures key properties of the graph in a visual and memorable way. Our work can be seen as a first step in understanding online auction graphs, which could enable us to detect trends, abnormalities and ultimately fraudulent behavior.
    By: Yordanos Beyene, Michalis Faloutso (University of California, Riverside), Duen Horng Chau**, and Christos Faloutsos (Carnegie Mellon University)

    Published in: IEEE Global Internet Symposium 2008, copublished in proceedings of the 27th Conference on Computer Communication (INFOCOM 2008), Phoenix AZ, April 2008

    Please obtain a copy of this paper from the publisher listed above.
    Y. Beyene, M. Faloutso, D. Horng Chau, and C. Faloutsos, 2008.
  • RapidUpdate: Peer-Assisted Distribution of Security Content
    We describe RapidUpdate, a peer-assisted system tailored to the specific needs of distributing security content. Its unique features include being able to distribute small files while still offloading a vast majority of the distribution bandwidth, using central planning in order to maximize efficiency and meet distribution deadlines, and allowing peers to participate fully in the system even if behind a firewall or NAT device. We first describe the protocol and server scheduling algorithms, then utilize a full implementation of the RapidUpdate client and topology server to show the system meets its goals. As security software vendors face the burden of rapidly increasing content distribution load, we believe that this peer-assisted system has the potential to lower cost while increasing quality of service.
    By: Denis Serenyi and Brian Witten (Symantec Research Labs)

    Published in: Proceedings of the 7th International Workshop on Peer-to-Peer Systems (IPTPS 2008), Tampa Bay, FL, February 2008

    Please obtain a copy of this paper from the publisher listed above.
    D. Serenyi and B. Witten, 2008.
  • Comparison of QoS Guarantee Techniques for VoIP over IEEE802.11
    Wireless LAN
    An emerging killer application for enterprise wireless LANs (WLANs) is voice over IP (VoIP) telephony, which promises to greatly improve the reachability and mobility of enterprise telephony service at low cost. None of the commercial IEEE 802.11 WLAN-based VoIP products can support more than ten G.729-quality voice conversations over a single IEEE 802.11b channel on real-world WLANs, even though the physical transmission rate is more than two orders of magnitude higher than an individual VoIP connection’s bandwidth requirement. There are two main reasons why these VoIP systems’ effective throughput is significantly lower than expected: VoIP’s stringent latency requirement and substantial per-WLAN-packet overhead. Time-Division Multiple Access (TDMA) is a well-known technique that provides per-connection QoS guarantee as well as improves the radio channel utilization efficiency. This paper compares the effective throughput of IEEE 802.11, IEEE 802.11e and a software-based TDMA (STDMA) protocol that is specifically designed to support WLAN-based VoIP applications, on the same commodity IEEE 802.11 WLAN hardware. Empirical measurements from a VOIP over WLAN testbed show that the numbers of G.729-quality voice conversations that IEEE 802.11, IEEE 802.11e and STDMA can support over a single IEEE 802.11b channel are 18, 22 and 50, respectively.
    By: Fanglu Guo* and Tzi-cker Chiueh* (Stony Brook University)

    Published in: Proceedings of the 15th Annual Multimedia Computing and Networking Conference (MMCN 2008), San Jose, CA, January 2008

    Please obtain a copy of this paper from the publisher listed above.
    F. Guo, and T. Chiueh, 2008.

Program Analysis

  • A System for Generating Static Analyzers for Machine Instructions
    This paper describes the design and implementation of a language for specifying the semantics of an instruction set, along with a run-time system to support the static analysis of executables written in that instruction set. The work advances the state of the art by creating multiple analysis phases from a specification of the concrete operational semantics of the language to be analyzed.
    By: Junghee Lim* (University of Wisconsin-Madison) and Thomas Reps (GammaTech, Inc., NY)

    Published in: Proceedings of the International Conference on Compiler Construction (CC), Budapest, Hungary, March 2008

    Won Best Paper Award at ETAPS, 2008

    Please obtain a copy of this paper from the publisher listed above.
    J. Lim and T. Reps, 2008.
  • Fast Bounds Checking Using Debug Register
    The ability to check memory references against their associated array/buffer bounds helps programmers to detect programming errors involving address overruns early on and thus avoid many difficult bugs down the line. This paper proposes a novel approach called Boud to the array bounds checking problem that exploits the debug register hardware in modern CPUs. Boud allocates a debug register to monitor accesses to an array or buffer within a loop so that accesses stepping outside the array’s or buffer’s bound will trigger a breakpoint exception. Because the number of debug registers is typically small, in cases when hardware bounds checking is not possible, Boud falls back to software bounds checking. Although Boud can effectively eliminate per-array-reference software checking overhead in most cases, it still incurs a fixed set-up overhead for each use of an array within a loop. This paper presents the detailed design and implementation of the Boud compiler, and a comprehensive evaluation of various performance tradeoffs associated with the proposed array bounds checking technique. For the set of real-world network applications we tested, including Apache, Sendmail, Bind, etc., the latency penalty of Boud’s bounds checking mechanism is between 2.2% to 8.8%, respectively, when compared with the vanilla GCC compiler, which does not perform any bounds checking.
    By: Tzi-cker Chiueh* (Stony Brook University)

    Published in: Proceedings of the 3rd International Conference on High Performance and Embedded Architectures and Compilers (HiPEAC 2008),Gothenberg, Sweden, January 2008

    Please obtain a copy of this paper from the publisher listed above.
    T. Chiueh, 2008.

Programming Languages

  • Garbage Collection in the Next C++ Standard
    C++ has traditionally relied on manual memory management. Sometimes this has been augmented by limited reference counting, implemented in libraries, and requiring use of separate pointer types. In spite of the fact that conservative garbage collectors have been used with C for decades, and with C++ for almost as long, they have not been well-supported by language standards. This in turn has limited their use.

    We have led an effort to change this by supporting optional "transparent" garbage collection in the next C++ standard. This is designed to either garbage collect or detect leaks in code using normal unadorned C++ pointers. We initially describe an ambitious effort that would have allowed programmers to explicitly request garbage collection. It faced a number of challenges, primarily in correct interaction with existing libraries relying on explicit destructor invocation. This effort was eventually postponed to the next round of standardization. This initial effort was then temporarily replaced by minimal support in the language that officially allows garbage collected implementations. Such minimal support is included in the current committee draft for the next C++ standard. It imposes an additional language restriction that makes it safe to garbage collect C++ programs. Stating this restriction proved subtle.

    We also provide narrow interfaces that make it easy to both correct code violating this new restriction, and to supply hints to a conservative garbage collector to improve its performance. These provide interesting implementation challenges. We discuss partial solutions.
    By: Mike Spertus* (Symantec Research Labs) and Hans-J. Boehm (HP Laboratories)

    Published in: Proceedings of the 2009 International Symposium on Memory Management (ISMM '09), Dublin, Ireland, June 19-20, 2009, pp. 30-38.

    Please obtain a copy of this paper from the ACM.
    M. Spertus and H. Boehm, 2009.

Security

  • Toward a Standard Benchmark for Computer Security Research: The Worldwide Intelligence Network Environment (WINE)
    Unlike benchmarks that focus on performance or reliability evaluations, a benchmark for computer security must necessarily include sensitive code and data. Because these artifacts could damage systems or reveal personally identifiable information about the users affected by cyber attacks, publicly disseminating such a benchmark raises several scientific, ethical and legal challenges. We propose the Worldwide Intelligence Network Environment (WINE), a security-benchmarking approach based on rigorous experimental methods. WINE includes representative field data, collected worldwide from 240,000 sensors, for new empirical studies, and it will enable the validation of research on all the phases in the lifecycle of security threats. We tackle the key challenges for security benchmarking by designing a platform for repeatable experimentation on the WINE data sets and by collecting the metadata required for understanding the results. In this paper, we review the unique characteristics of the WINE data, we discuss why rigorous benchmarking will provide fresh insights on the security arms race and we propose a research agenda for this area.
    By: Tudor Dumitras and Darren Shou (Symantec Research Labs)

    Published in: Proceedings of the First EuroSys Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (EuroSys BADGERS 2011), Salzburg, Austria, April 2011

    Please obtain a copy of this paper from the ACM.
    T. Dumitras and D. Shou, 2011.
  • HARMUR: Storing and Analyzing Historic Data on Malicious Domains
    A large amount of work has been done to develop tools and techniques to detect and study the presence of threats on the web. This includes, for instance, the development of a variety of different client honeypot techniques for the detection and study of drive-by downloads, as well as the creation of blacklists to prevent users from visiting malicious web pages. Due to the extent of the web and the scale of the problem, existing work typically focuses on the collection of information on the current state of web pages and does not take into account the temporal dimension of the problem.

    In this paper we describe HARMUR, a security dataset developed in the context of the WOMBAT project that aims at exploring the dynamics of the security and contextual information associated to malicious domains. We detail the design decisions that have led to the creation of an easily extensible architecture, and describe the characteristics of the underlying dataset. Finally, we demonstrate through examples the value of the collected information, and the importance of tracking the evolution of the state of malicious domains to gather a more complete picture on the threat landscape.
    By: Corrado Leita (Symantec Research Labs) and Marco Cova** (University of Birmingham)

    Published in: Proceedings of the First EuroSys Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (EuroSys BADGERS 2011), Salzburg, Austria, April 2011

    Please obtain a copy of this paper from the ACM.
    C. Leita and M. Cova, 2011.
  • An Attack Surface Metric
    Measurement of software security is a long standing challenge to the research community. At the same time, practical security metrics and measurements are essential for secure software development. Hence the need for metrics is more pressing now due to a growing demand for secure software. In this paper, we propose to use a software system's attack surface measurement as an indicator of the system's security. We formalize the notion of a system's attack surface and introduce an attack surface metric to measure the attack surface in a systematic manner. Our measurement method is agnostic to a software system's implementation language and is applicable to systems of all sizes; we demonstrate our method by measuring the attack surfaces of small desktop applications and large enterprise systems implemented in C and Java. We conducted three exploratory empirical studies to validate our method. Software developers can mitigate their software's security risk by measuring and reducing their software's attack surfaces. Our attack surface reduction approach complements software industry's traditional code quality improvement approach for security risk mitigation and is useful in multiple phases of the software development lifecycle. Our collaboration with SAP demonstrates the use of our metric in the software development process.
    By: Pratyusa K. Manadhata* (Symantec Research Labs) and Jeannette M. Wing (Carnegie Mellon University)

    Published in: IEEE Transactions on Software Engineering, accepted for future publication and available online from the IEEE as a preprint, June 2010

    Please obtain a copy of this paper from the IEEE at http://www.computer.org/portal/web/csdl/doi/10.1109/TSE.2010.60
    P. Manadhata and J. Wing, 2010.
  • Responsibility for the Harm and Risk of Software Security Flaws
    Software vulnerabilities are a vexing problem for the state of information assurance and security. Who is responsible for the risk and harm of software security is controversial. Deliberation of the responsibility for harm and risk due to software security flaws requires considering how incentives (and disincentives) and network effects shape the practices of vendors and adopters, and the consequent effects on the state of software security. This chapter looks at these factors in more detail in the context of private markets and public welfare.
    By: Cassio Goldschmidt* (Symantec Research Labs); Melissa Jane Dark, and Hina Chaudhry (Purdue University)

    Published in: Information Assurance and Security Ethics in Complex Systems: Interdisciplinary Perspectives, by Melissa Jane Dark (Purdue University), August 31, 2010, IGI Global, August 31, 2010, pp. 104-131.

    Please obtain a copy of this chapter from the publisher listed above.
    C. Goldschmidt, M. Dark, and H. Chaudhry, 2010.
  • Advances in Topological Vulnerability Analysis
    Currently, network administrators must rely on labor-intensive processes for tracking network configurations and vulnerabilities, which requires a great deal of expertise and is error prone. The organization of networks and the inter dependencies of vulnerabilities are so complex as to make traditional vulnerability analysis inadequate. We describe a Topological Vulnerability Analysis (TVA) approach that analyzes vulnerability dependencies and shows all possible attack paths into a network. From models of the network vulnerabilities and potential attacker exploits, we discover attack paths (organized as graphs) that convey the impact of individual and combined vulnerabilities on overall security. We provide sophisticated attack graph visualizations, with high-level overviews and detail drill down. Decision support capabilities let analysts make optimal tradeoffs between safety and availability, and show how to best apply limited security resources. We employ efficient algorithms that scale well to larger networks.
    By: Steven Noel (George Mason University); Matthew Elder (Symantec Research Labs); Sushil Jajodia, Pramod Kalapa (George Mason University); Scott O’Hare, and Kenneth Prole (Secure Decisions, Division of Applied Visions Inc.)

    Published in: Proceedings of the Cybersecurity Applications & Technology Conference For Homeland Security (CATCH 2009), Washington, DC, March 2009

    Please obtain a copy of this paper from the publisher listed above.
    S. Noel, M. Elder, S. Jajodia, P. Kalapa, S. O’Hare, and K. Prole, 2009
  • Reducing E-Discovery Cost by Filtering Included Emails
    As businesses become more reliance on information technology, electronic information is often produced as vital evidence during civil litigation. The process of discovering electronic information as evidence is getting increasingly expensive as the volume of data explodes. This surging demand calls for a solution to reduce the cost associated with the discovery process. In this paper, we propose filtering included emails as a means to reduce review data volume, and describe efficient algorithms to identify those emails. Experiments show that this can reduce the number of emails to be reviewed by 20% in corporate email corpse.
    By: Tsuen-Wan “Johnny” Ngan (Symantec Research Labs)

    Published in: Proceedings of the Fifth Conference on Emails and Anti-Spam (CEAS 2008), August 2008

    Please obtain a copy of this paper from the publisher listed above.
    T. Ngan, 2008.
  • Data Space Randomization
    Over the past several years, US-CERT advisories, as well as most critical updates from software vendors, have been due to memory corruption vulnerabilities such as buffer overflows, heap overflows, etc. Several techniques have been developed to defend against the exploitation of these vulnerabilities, with the most promising defenses being based on randomization. Two randomization techniques have been explored so far: address space randomization (ASR) that randomizes the location of objects in virtual memory, and instruction set randomization (ISR) that randomizes the representation of code. We explore a third form of randomization called data space randomization (DSR) that randomizes the representation of data stored in program memory. Unlike ISR, DSR is effective against non-control data attacks as well as code injection attacks. Unlike ASR, it can protect against corruption of non-pointer data as well as pointer-valued data. Moreover, DSR provides a much higher range of randomization (typically 232 for 32-bit data) as compared to ASR. Other interesting aspects of DSR include (a) it does not share a weakness common to randomization-based defenses, namely, susceptibility to information leakage attacks, and (b) it is capable of detecting some exploits that are missed by full bounds-checking techniques, e.g., some of the overflows from one field of a structure to the next field. Our implementation results show that with appropriate design choices, DSR can achieve a performance overhead in the range of 5% to 30% for a range of programs.
    By: Sandeep Bhatkar (Symantec Research Labs) and R. Sekar (Stony Brook University)

    Publication: Proceedings of the 5th International Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA 2008), July 2008

    Please obtain a copy of this paper from the publisher listed above.
    S. Bhatkar and R. Sekar, 2008.
  • Engineering Sufficiently Secure Computing
    We propose an architecture of four complimentary technologies increasingly relevant to a growing number of home users and organizations: cryptography, separation kernels, formal verification, and rapidly improving techniques relevant to software defect density estimation. Cryptographic separation protects information in transmission and storage. Formally proven properties of separation kernel based secure virtualization can bound risk for information in processing. Then, within each strongly separated domain, risk can be measured as a function of people and technology within that domain. Where hardware, software, and their interactions are proven to behave as and only as desired under all circumstances, such hardware and software can be considered to not substantially increase risk. Where the size or complexity of software is beyond such formal proofs, we discuss estimating risk related to software defect densities, and emerging work related to binary analysis with potential for improving software defect density estimation.
    By: Brian Witten (Symantec Research Labs)

    Published in: Proceedings of the 22nd Annual Computer Security Applications Conferences (ACSAC 2006), Miami Beach, FL, December 2006

    Please obtain a copy of this paper from the publisher listed above.
    B. Witten, 2006.

Social Networks

  • Measurement and Gender-Specific Analysis of User Publishing Characteristics on MySpace
    Online social networks have become popular platforms for people to make connections, share information, and interact with each other. In an online social network, user publishing activities such as sending messages and posting photos represent online interactions between friends that involve the use of network and system resources. As more and more businesses use online social networks as a means to promote their “brand names”, a good understanding of user publishing characteristics is important not only for capacity planning (server scaling estimation, network bandwidth provisioning and performance tuning), but also for marketing analysis and security measures of online social networks. Recently there have been many efforts to measure and analyze various characteristics of online social networks. Most of these studies have focused on the network graph properties such as node degree distribution, clustering coefficient, connected components, etc. In this work, we measure and analyze the gender-specific user publishing characteristics on MySpace. Our results show that there are recognizable patterns with respect to profile attributes, user interactions, temporal usages, and blog contents for users of different genders. In particular, gender neutral profiles (most of them are music bands, TV shows, and other commercial sites) differ significantly from normal male and female profiles for several publishing patterns.
    By: William Gauvin* (Symantec Research Laboratories and University of Massachusetts Lowell), Bruno Ribeiro, Benyuan Liu, Don Towsley, and Jie Wang (University of Massachusetts Lowell)

    Published in: IEEE Network, September-October 2010, vol. 24, no. 5, pp. 38-43.

    Please obtain a copy of this paper from the IEEE.
    W. Gauvin, B. Ribeiro, B. Liu, D. Towsley, and J. Wang, 2010.

Storage

  • Building a High-performance Deduplication System
    Modern deduplication has become quite effective at eliminating duplicates in data, thus multiplying the effective capacity of disk-based backup systems, and enabling them as realistic tape replacements. Despite these improvements, single-node raw capacity is still mostly limited to tens or a few hundreds of terabytes, forcing users to resort to complex and costly multi-node systems, which usually only allow them to scale to single-digit petabytes. As the opportunities for deduplication efficiency optimizations become scarce, we are challenged with the task of designing deduplication systems that will effectively address the capacity, throughput, management and energy requirements of the petascale age. In this paper we present our high-performance deduplication prototype, designed from the ground up to optimize overall single-node performance, by making the best possible use of a node's resources, and achieve three important goals: scale to large capacity, provide good deduplication efficiency, and near-raw-disk throughput. Instead of trying to improve duplicate detection algorithms, we focus on system design aspects and introduce novel mechanisms—that we combine with careful implementations of known system engineering techniques. In particular, we improve single-node scalability by introducing progressive sampled indexing and grouped mark-and-sweep, and also optimize throughput by utilizing an event-driven, multi-threaded client-server interaction model. Our prototype implementation is able to scale to billions of stored objects, with high throughput, and very little or no degradation of deduplication efficiency. View PDF
    By: Fanglu Guo* and Petros Efstathopoulos* (Symantec Research Labs)

    Published in: Proceedings of the 2011 USENIX Annual Technical Conference (USENIX ATC ’11), Portland, OR, USA, June 15–17, 2011, pp. 271-284.

    Conference Best Paper Award

    Please obtain a copy of this paper from The USENIX Association.
    F. Guo and P. Efstathopoulos, 2011.
  • Towards SIRF: Self-contained Information Retention Format
    Many organizations are now required to preserve and maintain access to large volumes of digital content for dozens of years. There is a need for preservation systems and processes to support such long-term retention requirements and enable the usability of those digital objects in the distant future, regardless of changes in technologies and designated communities. A key component in such preservation systems is the storage subsystem where the digital objects are located for most of their lifecycle. We describe SIRF (Self-contained Information Retention Format)—a logical storage container format specialized for long term retention. SIRF includes a set of digital preservation objects and a catalog with metadata related to the entire contents of the container as well as to the individual objects and their interrelationship. SIRF is being developed by the Storage Networking Industry Association (SNIA) with the intention of creating a standardized vendor-neutral storage format that will be interpretable by future preservation systems and that will simplify and reduce the costs of digital preservation. View PDF
    By: Simona Rabinovici-Cohen (IBM Haifa Labs), Mary G. Baker (HP Labs), Roger Cummings* (Symantec Research Labs), Samuel A. Fineberg (HP Software), and John Marberg (IBM Haifa labs)

    Published in: Proceedings of the 4th Annual International Systems and Storage Conference (SYSTOR 2011), Haifa, Israel, May 30-June 1, 2011

    Please obtain a copy of this paper from the ACM.
    S. Rabinovici-Cohen, M. Baker, R. Cummings, S. Fineberg, and J. Marberg, 2011.
  • U Can’t Touch This: Block-Level Protection for Portable Storage
    Advancements in portable storage have made it increasingly likely for users to carry large amounts of data with them, attaching their devices to multiple computers and transferring data between systems. This model poses new challenges for preserving data isolation policies, particularly when traditional information flow preserving operating systems no longer have control over the portable storage medium. Using secure disks and principles of label persistence from the Asbestos operating system, we propose mechanisms to address these concerns, by making the drive responsible for enforcing data isolation at the block level, and preventing block sharing between hosts that are not considered equally trusted. We describe models of partial protection and full disk labeling, and corresponding operations and preconditions necessary for the OS-disk interaction to occur. This model poses many new system design challenges and can lead to interesting new security mechanisms.
    By: Kevin R.B. Butler** (Pennsylvania State University) and Petros Efstathopoulos* (Symantec Research Labs)

    Published in: Proceedings of the 4th International Workshop on Software Support for Portable Storage (IWSSPS 2009), Grenoble, France, October 2009.

    Please obtain a copy of this paper from the ACM.
    K. Butler and P. Efstathopoulos, 2009.
  • Rethinking Deduplication Scalability
    Deduplication, a form of compression aiming to eliminate duplicates in data, has become an important feature of most commercial and research backup systems. Since the advent of deduplication, most research efforts have focused on maximizing deduplication efficiency---i.e., the offered compression ratio---and have achieved near-optimal usage of raw storage. However, the capacity goals of next-generation Petabyte systems requires a highly scalable design, able to overcome the current scalability limitations of deduplication. We advocate a shift towards scalability-centric design principles for deduplication systems, and present some of the mechanisms used in our prototype, aiming at high scalability, good deduplication efficiency, and high throughput.
    By: Petros Efstathopoulos* and Fanglu Guo* (Symantec Research Labs)

    Published in: Proceedings of the 2nd USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage'10), Boston, Massachusetts, June 2010.

    Please obtain a copy of this paper from USENIX.
    P. Efstathopoulos and F. Guo, 2008.
  • DAFT: Disk Geometry-Aware File System Traversal
    Bulk file access is a read access to a large number of files in a file system. Example applications that use bulk file access extensively are anti-virus (AV) scanner, file-level data back-up agent, file system defragmentation tool, etc. This paper describes the design, implementation, and evaluation of an optimization to modern file systems that is designed to improve the read efficiency of bulk file accesses. The resulting scheme, called DAFT (Disk geometry-Aware File system Traversal), provides a bulk file access application with individual files while fetching these files into memory in a way that respects the disk geometry and thus is as efficient as it can be. We have successfully implemented a fully operational DAFT prototype, and tested it with commercial AV scanners and data back-up agents. Empirical measurements on this prototype demonstrate that it can reduce the elapsed time of enumerating all files in a file system by a factor of 5 to 15 for both fragmented and non-fragmented file systems on fast and slow disks.
    By: Fanglu Guo* and Tzi-cker Chiueh* (Symantec Research Labs)

    Published in: Proceedings of the 17th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 2009), London, UK, September 2009, pp. 484-493.

    Please obtain a copy of this paper from the IEEE.
    F. Guo and T. Chiueh, 2009.
  • An Incremental File System Consistency Checker for Block-Level CDP Systems
    A block-level continuous data protection (CDP) system logs every disk block update from an application server (e.g., a file or DBMS server) to a storage system so that any disk updates within a time window are undoable, and thus is able to provide a more flexible and efficient data protection service than conventional periodic data backup systems. Unfortunately, no existing block-level CDP systems can support arbitrary point-in-time snapshots that are guaranteed to be consistent with respect to the metadata of the application server. This deficiency seriously limits the flexibility in recovery point objective (RTO) of block-level CDP systems from the standpoint of the application servers whose data they protect. This paper describes an incremental file system check mechanism (iFSCK) that is designed to address this deficiency for file servers, and exploits file system-specific knowledge to quickly fix an arbitrary point-in-time block-level snapshot so that it is consistent with respect to file system metadata. Performance measurements taken from a fully operational iFSCK prototype show that iFSCK can turn a 10GB point-in-time block-level snapshot to be file-system consistent in less than 1 second, and takes less than 25% of the time required by the Fsck utility for vanilla ext3 under relaxed metadata consistency requirements.
    By: Maohua Lu**, Tzi-cker Chiueh* (Stony Brook University); and Shibiao Lin (Google Inc.)

    Published in: Proceedings of the IEEE 27th International Symposium on Reliable Distributed Systems (SRDS 2008), Napoli, Italy, October 2008

    Please obtain a copy of this paper from the publisher listed above.
    M. Lu and T. Chiueh, 2008.
  • Availability and Fairness Support for Storage QoS Guarantee
    Multi-dimensional storage virtualization (MDSV) allows multiple virtual disks, each with a distinct combination of capacity, latency and bandwidth requirements, to be multiplexed on a physical disk storage system with performance isolation. This paper presents novel design and implementation techniques that solve the availability guarantee and fairness assurance problems in multi-dimensional storage virtualization. First, we show that a measurement-based admission control algorithm can reduce the effective resource requirement of a virtual disk with availability guarantee by accurately estimating its resource needs without prior knowledge of its input workload characteristics. Moreover, to accurately factor disk access overhead into real-time disk request scheduling algorithm, we propose a virtual disk switching overhead extraction and distribution algorithm that can derive the intrinsic disk access overhead associated with each virtual disk so as to achieve perfect performance isolation. Finally, we develop an adaptive server time leap-forward algorithm to effectively address the short-term unfairness problem of virtual clock-based disk scheduler, the only known proportional-share scheduler that is based on wall-clock time and thus enables disk utilization efficiency optimization while delivering disk QoS guarantees.
    By: Peng Gang and Tzi-cker Chiueh* (Stony Brook University)

    Published in: Proceedings of the 28th International Conference on Distributed Computing Systems (ICDCS 2008), Beijing, China, June 2008

    Please obtain a copy of this paper from the publisher listed above.
    P. Gang and T. Chiueh, 2008.

Theory

  • A Simple, Fast, and Compact Static Dictionary
    We present a new static dictionary that is very fast and compact, while also extremely easy to implement. A combination of properties make this algorithm very attractive for applications requiring large static dictionaries:

    1. High performance, with membership queries taking O(1)-time with a near-optimal constant.

    2. Continued high performance in external memory, with queries requiring only 1-2 disk seeks. If the dictionary has n items in {0, ..., m-1} and d is the number of bytes retrieved from disk on each read, then the average number of seeks is min(1.63, 1 + O((sqrt(n)log m)/d)).

    3. Efficient use of space, storing n items from a universe of size m in nlog m - 1/2n log n + O(n + log log m) bits. We prove this space bound with a novel application of the Kolmogorov-Smirnov distribution.

    4. Simplicity, with a 20-line pseudo-code construction algorithm and 4-line query algorithm.
    By: Scott Schneider* and Michael Spertus* (Symantec Research Labs)

    Published in: Proceedings of the 20th International Symposium on Algorithms and Computation (ISAAC 2009), Honolulu, Hawaii, USA, December 16-18, 2009, pp. 852-861, Springer LNCS 5878.

    Please obtain a copy of this paper from Springer.
    S. Schneider and M. Spertus, 2011.
  • Noncirculant Toeplitz Matrices All of Whose Powers are Toeplitz
    Let a, b and c be fixed complex numbers. Let Mn (a, b, c) be the n x n Toeplitz matrix all of whose entries above the diagonal are a, all of whose entries below the diagonal are b, and all of whose entries on the diagonal are c. For 1 < k < each k x k principal minor of Mn (a, b, c) has the same value. We find explicit and recursive formulae for the principal minors and characteristic polynomial of Mn (a, b, c). We also show that all complex polynomials in Mn (a, b, c) are Toeplitz matrices. In particular, the inverse of Mn (a, b, c) is a Toeplitz matrix when it exists.
    By: Kent Griffin*, Jeffrey L. Stuart (Pacific Lutheran University), and Michael J. Tsatsomeros (Washington State University)

    Published in: Czechoslovak Mathematical Journal (Mathematics and Statistics), Springer Netherlands, 2008

    Please obtain a copy of this paper from the publisher listed above.
    K. Griffin, J. Stuart, and M. Tsatsomeros, 2008.

Virtualization

  • Applications of Feather-Weight Virtual Machine
    A Feather-weight Virtual Machine (FVM) is an OS-level virtualization technology that enables multiple isolated execution environments to exist on a single Windows kernel. The key design goal of FVM is efficient resource sharing among VMs so as to minimize VM startup/shutdown cost and scale to a larger number of concurrent VM instances. As a result, FVM provides an effective platform for fault-tolerant and intrusion-tolerant applications that require frequent invocation and termination of dispensable VMs. This paper presents three complete applications of the FVM technology: scalable web site testing; shared binary service for application deployment and distributed Display-Only File Server (DOFS). To identify malicious web sites that exploit browser vulnerabilities, we use a web crawler to access untrusted sites, render their pages in multiple browsers each running in a separate VM, and monitor their execution behaviors. To allow Windows-based end user machines to share binaries that are stored, managed and patched on a central location, we run shared binaries in a special VM on the end user machine whose runtime environment is imported from the central binary server. To protect confidential files in a file server against information theft by insiders, we ensure that file viewing/editing programs run in a VM, which grants file content display but prevents file content from being saved on the host machine. In this paper, we show how to customize the generic FVM framework to accommodate the needs of the three applications, and present experimental results that demonstrate their performance and effectiveness.
    By: Yang Yu, Hariharan Kolam Govindarajan, Lap Chung-Lam and Tzi-cker Chiueh* (Stony Brook University)

    Published in: Proceedings of the 2008 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE08), Seattle, WA, March 2008

    Please obtain a copy of this paper from the publisher listed above.
    Y. Yu, H. Kolam Govindarajan, L. Chung-Lam, and T. Chiueh, 2008.
  • Graphics Engine Resource Management
    Modern consumer-grade 3D graphic cards boast a computation/memory resource that can easily rival or even exceed that of standard desktop PCs. Although these cards are mainly designed for 3D gaming applications, their enormous computational power has attracted developers to port an increasing number of scientific computation programs to these cards, including matrix computation, collision detection, cryptography, database sorting, etc. As more and more applications run on 3D graphic cards, there is a need to allocate the computation/memory resource on these cards among the sharing applications more fairly and efficiently. In this paper, we describe the design, implementation and evaluation of a Graphic Processing Unit (GPU) scheduler based on Deficit Round Robin scheduling that successfully allocates to every process an equal share of the GPU time regardless of their demand. This scheduler, called GERM, estimates the execution time of each GPU command group based on dynamically collected statistics, and controls each process’s GPU command production rate through its CPU scheduling priority. Measurements on the first GERM prototype show that this approach can keep the maximal GPU time consumption difference among concurrent GPU processes consistently below 5% for a variety of application mixes.
    By: Mikhail Bautin, Ashok Dwarakinath and Tzi-cker Chiueh* (Stony Brook University)

    Published in: Proceedings of the 15th Annual Multimedia Computing and Networking Conference (MMCN 2008), San Jose, CA, January 2008

    Please obtain a copy of this paper from the publisher listed above.
    M. Bautin, A. Dwarakinath, and T. Chiueh, 2008.